5 min read•august 14, 2024
Big data handling in R requires efficient tools. and are two powerful packages that excel at manipulating large datasets. They offer different approaches but can be combined for optimal performance and readability.
This section explores how data.table's speed and pair with dplyr's expressive syntax. You'll learn key functions and techniques for each package, and how to leverage their strengths together when working with massive datasets.
DT[i, j, by]
, where i
is the row selector, j
is the column selector, and by
is the grouping variable
DT[age > 18, .(mean_income = mean(income)), by = gender]
calculates the mean income for each gender group for individuals over 18 years oldDT[sales > 1000 & region == "North"]
quickly filters rows where sales exceed 1000 and the region is "North":=
operator, which performs in-place updates without copying the entire dataset
DT[, new_column := sales * 0.1]
creates a new column new_column
by multiplying the sales
column by 0.1DT[sales > 1000, .(total_sales = sum(sales)), by = region][order(-total_sales)]
calculates total sales by region for sales over 1000 and sorts the result in descending orderby
argument in data.table allows grouping data by one or more variables, facilitating aggregation operations on subsets of data
DT[, .(avg_price = mean(price)), by = category]
calculates the average price for each product categorysum()
, mean()
, min()
, max()
, and .N
(for counting rows), which can be applied efficiently to grouped data
DT[, .(total_sales = sum(sales), num_orders = .N), by = customer_id]
calculates the total sales and number of orders for each customer:=
operator in data.table allows the creation of new columns or modification of existing columns based on aggregation results
DT[, total_revenue := sum(sales * price), by = product]
calculates the total revenue for each product and assigns it to a new column total_revenue
dcast()
and melt()
functions, enabling easy transformation between wide and long formats
dcast(DT, customer_id ~ product, value.var = "quantity", sum)
reshapes the data from long to wide format, with customer_id as rows, products as columns, and the sum of quantities as valuessetkey(DT, customer_id)
sets the key column for efficient and subset operationsfilter()
, [select()](https://www.fiveableKeyTerm:select())
, [mutate()](https://www.fiveableKeyTerm:mutate())
, and summarise()
, allow for intuitive and readable data manipulation operations
filter(data, age > 18)
filters rows where age is greater than 18select(data, name, age, city)
selects specific columns (name, age, city) from the dataset%>%
) in dplyr enables chaining multiple operations together, improving code readability and reducing intermediate variables
data %>% filter(age > 18) %>% select(name, age, city)
filters rows where age is greater than 18 and then selects specific columnsdata %>% filter(age > 18) %>% mutate(age_squared = age^2)
delays the computation of age_squared
until the filtered dataset is actually needed[group_by()](https://www.fiveableKeyTerm:group_by())
function allows grouping data by one or more variables, facilitating aggregation and summary operations on subsets of data
data %>% group_by(city) %>% summarise(avg_age = mean(age))
calculates the average age for each citymutate()
function in dplyr enables the creation of new columns or modification of existing columns based on expressions or functions
data %>% mutate(age_category = ifelse(age < 18, "minor", "adult"))
creates a new column age_category
based on the value of age
library(data.table); DT <- data.table(data)
converts a data.frame to a data.table objectDT %>% filter(age > 18) %>% select(name, age, city)
applies dplyr functions on a data.table objectlibrary(dtplyr); lazy_dt(DT) %>% filter(age > 18) %>% select(name, age, city)
uses dplyr syntax on a data.table object with lazy evaluationDT[, .(avg_age = mean(age)), by = city]
and then use dplyr's arrange()
function to sort the result %>% arrange(desc(avg_age))
merge(DT1, DT2, by = "key")
and then use dplyr for subsequent data transformations %>% mutate(new_var = var1 + var2) %>% filter(new_var > 10)