4 min read • August 14, 2024
The dplyr package in R is a widely used toolkit for data manipulation. It offers a small set of functions that make it easy to select, filter, arrange, and summarize data, so you can quickly wrangle a data frame into the shape you need.
With dplyr, you can chain operations together using the pipe operator, creating efficient data pipelines. This approach streamlines your code, making it more readable and easier to maintain. By mastering dplyr, you'll be able to handle complex data tasks with ease.
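Use the [select()](https://www.fiveableKeyTerm:select()) function to choose columns from a data frame by name.
Rename a column while selecting it with the syntax `new_name = old_name`.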
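Use the [filter()](https://www.fiveableKeyTerm:filter()) function to keep only the rows that meet one or more logical conditions.
Combine conditions with the logical operators (`&`, `|`, `!`), for example `filter(df, age > 18 & city == "New York")`.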
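As a minimal sketch of column selection and row filtering, assuming dplyr is installed, the block below uses a small made-up data frame `df`; its column names and values are hypothetical and chosen only to match the examples in the text.

```r
library(dplyr)

# Hypothetical data; the columns are assumptions for illustration only
df <- data.frame(
  id   = 1:5,
  name = c("Ana", "Ben", "Cy", "Dee", "Eli"),
  age  = c(17, 25, 32, 18, 40),
  city = c("New York", "Boston", "New York", "New York", "Boston")
)

# Keep three columns and rename `name` to `full_name` while selecting
selected <- select(df, id, full_name = name, age)

# `&` keeps rows where both conditions hold
ny_adults <- filter(df, age > 18 & city == "New York")

# `|` keeps rows where at least one condition holds
either <- filter(df, age > 18 | city == "New York")

# `!` negates a condition
not_ny <- filter(df, !(city == "New York"))

print(selected)
print(ny_adults)
```

Note that `age > 18` excludes the 18-year-old row; use `>=` if the boundary value should be kept.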
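Use the [distinct()](https://www.fiveableKeyTerm:distinct()) function to remove duplicate rows, keeping one row per unique combination of the listed columns.
`distinct(df, id, name)` returns the unique combinations of `id` and `name`.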
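The following sketch shows how `distinct()` behaves with and without column arguments. The data frame is hypothetical, and the `.keep_all = TRUE` variant is included as an illustration that goes slightly beyond the text.

```r
library(dplyr)

# Hypothetical data with repeated id/name pairs
df <- data.frame(
  id    = c(1, 1, 2, 3, 3),
  name  = c("Ana", "Ana", "Ben", "Cy", "Cy"),
  score = c(90, 85, 70, 60, 60)
)

# Unique combinations of id and name; only those two columns are returned
unique_people <- distinct(df, id, name)

# With no columns listed, only fully identical rows are dropped
unique_rows <- distinct(df)

# .keep_all = TRUE keeps every column, using the first row of each combination
first_per_person <- distinct(df, id, name, .keep_all = TRUE)

print(unique_people)
print(first_per_person)
```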
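Use the [slice()](https://www.fiveableKeyTerm:slice()) function to select rows by position.
`slice(df, 1:10)` selects the first 10 rows.
Use the [arrange()](https://www.fiveableKeyTerm:arrange()) function to sort rows by one or more columns, in ascending order by default.
Wrap a column in [desc()](https://www.fiveableKeyTerm:desc()) to sort in descending order: `arrange(df, desc(age), name)` sorts by `age` from highest to lowest and breaks ties by `name`.
Combine `arrange()` with other dplyr functions for more complex sorting, for example `df %>% filter(city == "New York") %>% arrange(desc(salary))`, which chains the two steps with the [%>%](https://www.fiveableKeyTerm:%>%) pipe.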
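To make positional selection and sorting concrete, here is a short sketch on a hypothetical four-row data frame (the names, ages, and salaries are invented for illustration).

```r
library(dplyr)

# Hypothetical data; values are made up
df <- data.frame(
  name   = c("Ana", "Ben", "Cy", "Dee"),
  age    = c(30, 25, 30, 41),
  city   = c("New York", "Boston", "New York", "New York"),
  salary = c(70000, 52000, 88000, 61000)
)

# Rows 1 and 2, selected by position
first_two <- slice(df, 1:2)

# Oldest first; ties on age are broken alphabetically by name
by_age <- arrange(df, desc(age), name)

# Filter first, then sort the remaining rows by salary, highest first
ny_by_salary <- df %>%
  filter(city == "New York") %>%
  arrange(desc(salary))

print(by_age)
print(ny_by_salary)
```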
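Use the [mutate()](https://www.fiveableKeyTerm:mutate()) function to add new columns or modify existing ones while keeping all other columns.
`mutate(df, new_col = old_col * 2, is_adult = age >= 18)` adds a numeric column and a logical column in one call.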
Use the [transmute()](https://www.fiveableKeyTerm:transmute()) function to create new columns and drop all other columns.
It works like `mutate()` but keeps only the newly created or modified columns: `transmute(df, double_age = age * 2)` returns a data frame with the single column `double_age`.
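The difference between `mutate()` and `transmute()` is easiest to see by comparing the column names of their results. This sketch uses a hypothetical two-row data frame. The last call, `mutate(.keep = "none")`, is an alternative available in recent dplyr versions and is shown here as an aside, not as something the text prescribes.

```r
library(dplyr)

# Hypothetical data; `old_col` mirrors the placeholder name used in the text
df <- data.frame(
  name    = c("Ana", "Ben"),
  age     = c(17, 25),
  old_col = c(1.5, 4)
)

# mutate() keeps every existing column and appends the new ones
with_new <- mutate(df, new_col = old_col * 2, is_adult = age >= 18)

# transmute() returns only the columns it creates
only_new <- transmute(df, double_age = age * 2)

# Equivalent result on ungrouped data in recent dplyr versions
only_new_alt <- mutate(df, double_age = age * 2, .keep = "none")

print(names(with_new))
print(names(only_new))
```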
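Use the [across()](https://www.fiveableKeyTerm:across()) function within `mutate()` to apply the same transformation to several columns at once.
Choose the columns with selection helpers (`starts_with()`, `ends_with()`, `contains()`), for example `mutate(df, across(starts_with("score_"), ~ . / 100))`, which divides every column whose name begins with `score_` by 100.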
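A small sketch of `across()` inside `mutate()`, assuming a hypothetical data frame with two `score_` columns and one `_grade` column. It uses `.x` in the formula, which refers to the current column in the same way as the `.` in the text's example.

```r
library(dplyr)

# Hypothetical data; column names are assumptions chosen to match the helpers
df <- data.frame(
  id          = 1:2,
  score_math  = c(80, 95),
  score_art   = c(60, 70),
  final_grade = c(70.2, 82.6)
)

# Rescale every column whose name starts with "score_"
scaled <- mutate(df, across(starts_with("score_"), ~ .x / 100))

# Pass a function by name to round every column ending in "_grade"
rounded <- mutate(df, across(ends_with("_grade"), round))

print(scaled)
print(rounded)
```

Columns not matched by the helper (`id` here) pass through unchanged.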
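Use the [summarize()](https://www.fiveableKeyTerm:summarize()) function to collapse a data frame into summary statistics, one row per group (or a single row if the data is ungrouped).
`summarize(df, mean_age = mean(age), max_score = max(score))` returns the mean age and the highest score.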
Use the `across()` function within `summarize()` to apply functions to multiple columns.
`summarize(df, across(starts_with("score_"), mean))` computes the mean of every column whose name begins with `score_`.
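The two summarizing patterns above can be sketched on a hypothetical three-row data frame. Note that the plain `score` column is not matched by `starts_with("score_")` because its name lacks the trailing underscore.

```r
library(dplyr)

# Hypothetical data; all values are invented
df <- data.frame(
  age     = c(20, 30, 40),
  score   = c(55, 90, 75),
  score_a = c(1, 2, 3),
  score_b = c(10, 20, 30)
)

# Named summaries: the result is a single row because df is ungrouped
overall <- summarize(df, mean_age = mean(age), max_score = max(score))

# One mean per matching column; result columns keep the original names
score_means <- summarize(df, across(starts_with("score_"), mean))

print(overall)
print(score_means)
```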
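Use the [count()](https://www.fiveableKeyTerm:count()) function to count the rows for each unique value of one or more columns.
It is a shortcut for [group_by()](https://www.fiveableKeyTerm:group_by()) followed by `summarize()` with `n()`.
`count(df, city)` counts the number of rows for each unique city.
Use the `group_by()` function to group a data frame by one or more columns.
Subsequent operations (`summarize()`, `mutate()`) will be applied independently to each group, for example after `group_by(df, city, gender)`.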
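The sketch below compares `count()` with its longer `group_by()` plus `summarize()` form and then groups by two columns. The data frame is hypothetical, and `.groups = "drop"` is used so the final result comes back ungrouped.

```r
library(dplyr)

# Hypothetical data; values are made up
df <- data.frame(
  city   = c("Boston", "New York", "New York", "Boston", "New York"),
  gender = c("F", "M", "F", "F", "M"),
  age    = c(30, 40, 20, 50, 60)
)

# Rows per city; the count lands in a column named `n`
by_city <- count(df, city)

# The longer equivalent of count(df, city)
by_city_long <- df %>%
  group_by(city) %>%
  summarize(n = n())

# Group by two columns, then summarize each city/gender combination
grouped <- group_by(df, city, gender)
mean_ages <- summarize(grouped, mean_age = mean(age), .groups = "drop")

print(by_city)
print(mean_ages)
```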
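Use the [ungroup()](https://www.fiveableKeyTerm:ungroup()) function to remove grouping so that later operations apply to the whole data frame again.
`df %>% group_by(city) %>% summarize(mean_age = mean(age)) %>% ungroup()` returns one ungrouped row per city.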
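Forgetting to ungroup matters most after a grouped `mutate()`, which keeps the grouping on its result. The following sketch, on hypothetical data, shows the same `mean(age)` expression giving per-group values before `ungroup()` and a whole-table value after it.

```r
library(dplyr)

# Hypothetical data; values are invented
df <- data.frame(
  city = c("Boston", "Boston", "New York"),
  age  = c(30, 50, 10)
)

grouped <- group_by(df, city)

# While grouped, mean(age) is computed separately for each city
with_group_mean <- grouped %>%
  mutate(group_mean = mean(age)) %>%
  ungroup()

# After ungroup(), the same expression uses every row
with_overall_mean <- with_group_mean %>%
  mutate(overall_mean = mean(age))

print(is_grouped_df(grouped))          # grouping is present
print(is_grouped_df(with_group_mean))  # grouping has been removed
print(with_overall_mean)
```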
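Use the [n()](https://www.fiveableKeyTerm:n()) function within `summarize()` to count the rows in the current group.
`summarize(df, group_size = n())` returns the total number of rows when `df` is ungrouped, or one count per group after `group_by()`.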
Use the [n_distinct()](https://www.fiveableKeyTerm:n_distinct()) function within `summarize()` to count the unique values in a column.
`summarize(df, unique_cities = n_distinct(city))` returns the number of different cities.
Use [first()](https://www.fiveableKeyTerm:first()), [last()](https://www.fiveableKeyTerm:last()), or [nth()](https://www.fiveableKeyTerm:nth()) within `summarize()` to extract a value by its position within each group.
`summarize(df, first_name = first(name), last_score = last(score))` returns the first name and the last score.
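These helper functions can be combined in a single `summarize()` call. The sketch below runs them once on an ungrouped, hypothetical data frame and once per city. Positions refer to the current row order, so sort with `arrange()` first if "first" and "last" should mean something specific.

```r
library(dplyr)

# Hypothetical data; values are made up
df <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Dee"),
  city  = c("Boston", "New York", "Boston", "New York"),
  score = c(90, 70, 85, 60)
)

# Ungrouped: each helper sees all four rows
totals <- summarize(
  df,
  group_size    = n(),
  unique_cities = n_distinct(city),
  first_name    = first(name),
  second_name   = nth(name, 2),
  last_score    = last(score)
)

# Grouped: each helper sees only the rows of one city
per_city <- df %>%
  group_by(city) %>%
  summarize(group_size = n(), first_name = first(name), last_score = last(score))

print(totals)
print(per_city)
```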
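Use the pipe operator (`%>%`) from the magrittr package to chain multiple dplyr functions together; each step receives the result of the previous one as its first argument.
`df %>% filter(age > 18) %>% group_by(city) %>% summarize(mean_income = mean(income))` computes the mean income per city for people older than 18.
`df %>% select(id, name, age) %>% filter(age >= 18) %>% mutate(adult = TRUE)` keeps three columns, drops rows for people under 18, and flags the rest as adults.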
Pipes remove the need for intermediate variables: instead of `filtered_df <- filter(df, age > 18); summarized_df <- summarize(filtered_df, mean_age = mean(age))`, use `df %>% filter(age > 18) %>% summarize(mean_age = mean(age))`.
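As a quick check that the piped and step-by-step versions agree, this sketch runs both on a hypothetical data frame and also reproduces the per-city income pipeline from the text.

```r
library(dplyr)

# Hypothetical data; values are invented
df <- data.frame(
  age    = c(16, 20, 30, 40),
  city   = c("Boston", "Boston", "New York", "New York"),
  income = c(0, 30000, 50000, 70000)
)

# Step-by-step version with intermediate variables
filtered_df   <- filter(df, age > 18)
summarized_df <- summarize(filtered_df, mean_age = mean(age))

# Piped version: same steps, no intermediate names
piped <- df %>%
  filter(age > 18) %>%
  summarize(mean_age = mean(age))

# A longer pipeline: filter, group, then summarize each group
income_by_city <- df %>%
  filter(age > 18) %>%
  group_by(city) %>%
  summarize(mean_income = mean(income))

print(piped)
print(income_by_city)
```

Writing one step per line, with `%>%` at the end of each line, keeps long pipelines readable.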
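The order of the steps matters, because each function only sees the columns and rows that the previous step passed on.
`df %>% select(id, name) %>% group_by(id) %>% summarize(name_count = n())` works because `id` is selected before grouping.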
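To illustrate why step order matters, the sketch below runs the working pipeline on hypothetical data and then a variant that drops `id` before grouping. The failing variant is wrapped in `tryCatch()` so the script keeps running, and the exact error message is left unspecified since it depends on the dplyr version.

```r
library(dplyr)

# Hypothetical data; values are made up
df <- data.frame(
  id   = c(1, 1, 2),
  name = c("Ana", "Ana B.", "Ben"),
  age  = c(30, 30, 25)
)

# Works: `id` is still present when group_by() runs
name_counts <- df %>%
  select(id, name) %>%
  group_by(id) %>%
  summarize(name_count = n())

# Fails: select(name) has already dropped `id`, so there is nothing to group by
grouping_failed <- tryCatch(
  {
    df %>%
      select(name) %>%
      group_by(id) %>%
      summarize(name_count = n())
    FALSE
  },
  error = function(e) TRUE
)

print(name_counts)
print(grouping_failed)
```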