Modern Tidyverse Patterns
Best practices for modern tidyverse development with dplyr 1.1+ and R 4.3+
Core Principles
- Use modern tidyverse patterns - Prioritize dplyr 1.1+ features, native pipe, and current APIs
- Profile before optimizing - Use profvis and bench to identify real bottlenecks
- Write readable code first - Optimize only when necessary and after profiling
- Follow tidyverse style guide - Consistent naming, spacing, and structure
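Pipe Usage (|> not %>%)
- Always use the native pipe |> instead of magrittr %>%
- R 4.3+ provides all needed features: the native pipe itself (R 4.1), the _ placeholder for named arguments (R 4.2), and placeholder extraction such as _$col (R 4.3)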
Good - Modern native pipe
data |> filter(year >= 2020) |> summarise(mean_value = mean(value))
Avoid - Legacy magrittr pipe
data %>% filter(year >= 2020) %>% summarise(mean_value = mean(value))
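As a minimal sketch of the pipe features that the R 4.3+ recommendation relies on, the following uses only base R and the built-in mtcars data. The variable names (fit, coefs) are illustrative, and R's release notes describe the extraction form as experimental.

```r
# Native pipe with the `_` placeholder (R 4.2+).
# The placeholder must be passed to a named argument, here `data`.
fit <- mtcars |>
  subset(cyl == 4) |>
  lm(mpg ~ wt, data = _)

# Placeholder at the head of an extraction chain (R 4.3+).
coefs <- mtcars |>
  lm(mpg ~ wt, data = _) |>
  _$coefficients

print(names(coefs))
```

With magrittr, the same steps would use the `.` placeholder, which is one of the few places where a mechanical `%>%` to `|>` replacement needs a manual rewrite.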
Join Syntax (dplyr 1.1+)
- Use join_by() instead of character vectors for joins
- Support for inequality, rolling, and overlap joins
Good - Modern join syntax
transactions |> inner_join(companies, by = join_by(company == id))
Good - Inequality joins
transactions |> inner_join(companies, join_by(company == id, year >= since))
Good - Rolling joins (closest match)
transactions |> inner_join(companies, join_by(company == id, closest(year >= since)))
Avoid - Old character vector syntax
transactions |> inner_join(companies, by = c("company" = "id"))
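To make the difference between inequality and rolling joins concrete, here is a small sketch. The table names follow the snippets above, but the rows and the revenue and name columns are made up for illustration.

```r
library(dplyr)

# Hypothetical data: company "A" has two registration records.
transactions <- tibble(
  company = c("A", "A", "B"),
  year    = c(2019, 2021, 2020),
  revenue = c(10, 20, 30)
)
companies <- tibble(
  id    = c("A", "A", "B"),
  since = c(2018, 2020, 2019),
  name  = c("A-old", "A-new", "B-co")
)

# Inequality join: keeps every company record that started on or before the transaction year.
ineq <- transactions |>
  inner_join(companies, join_by(company == id, year >= since))

# Rolling join: closest() keeps only the most recent such record per transaction.
rolling <- transactions |>
  inner_join(companies, join_by(company == id, closest(year >= since)))

print(ineq)
print(rolling)
```

In this data the 2021 transaction for "A" matches both records in the inequality join but only "A-new" in the rolling join, which is usually what "the record in force at the time" means.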
Multiple Match Handling
- Use relationship, multiple, and unmatched arguments for quality control
Expect 1:1 matches, error otherwise (dplyr 1.1.1+ deprecates multiple = "error" in favor of relationship)
inner_join(x, y, by = join_by(id), relationship = "one-to-one")
Allow multiple matches explicitly
inner_join(x, y, by = join_by(id), multiple = "all")
Ensure all rows match
inner_join(x, y, by = join_by(id), unmatched = "error")
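The following sketch shows what these checks look like when they fire. The orders and customers tables are hypothetical, and the relationship argument assumes dplyr 1.1.1 or later.

```r
library(dplyr)

# Hypothetical tables: order 3 has no matching customer.
orders    <- tibble(id = c(1, 2, 3), amount = c(50, 75, 20))
customers <- tibble(id = c(1, 2), name = c("Ada", "Grace"))

# unmatched = "error" turns rows that would be silently dropped into a hard failure.
result <- tryCatch(
  inner_join(orders, customers, by = join_by(id), unmatched = "error"),
  error = function(e) "unmatched rows detected"
)

# Once both sides are fully covered, the same checks pass.
complete <- inner_join(
  filter(orders, id != 3), customers,
  by = join_by(id),
  unmatched = "error",
  relationship = "one-to-one"
)

print(result)
print(complete)
```

Treating the join as an assertion in this way tends to surface key problems at the join itself rather than several steps later as unexpected row counts.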
Data Masking and Tidy Selection
- Understand the difference between data masking and tidy selection
- Use {{}} (embrace) for function arguments
- Use .data[[]] for character vectors
Data masking functions: arrange(), filter(), mutate(), summarise()
Tidy selection functions: select(), relocate(), across()
Function arguments - embrace with {{}}
my_summary <- function(data, group_var, summary_var) {
  data |>
    group_by({{ group_var }}) |>
    summarise(mean_val = mean({{ summary_var }}))
}
Character vectors - use .data[[]]
for (var in names(mtcars)) {
  mtcars |> count(.data[[var]]) |> print()
}
Multiple columns - use across() (inside a function that takes summary_vars as an argument)
data |> summarise(across({{ summary_vars }}, ~ mean(.x, na.rm = TRUE)))
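A runnable version of these patterns on the built-in mtcars data might look like the sketch below. The helper names summarise_means and count_by_name are hypothetical.

```r
library(dplyr)

# {{ }} forwards unquoted column arguments into data-masking and tidy-select verbs.
summarise_means <- function(data, group_var, summary_vars) {
  data |>
    group_by({{ group_var }}) |>
    summarise(across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)))
}

# .data[[ ]] looks a column up by a name stored in a character vector.
count_by_name <- function(data, var) {
  data |> count(.data[[var]])
}

means  <- summarise_means(mtcars, cyl, c(mpg, hp))
counts <- count_by_name(mtcars, "gear")

print(means)
print(counts)
```

The caller passes bare column names to the first helper and a string to the second, which is the practical distinction between embracing and the .data pronoun.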
Modern Grouping and Column Operations
- Use .by for per-operation grouping (dplyr 1.1+)
- Use pick() for column selection inside data-masking functions
- Use across() for applying functions to multiple columns
- Use reframe() for multi-row summaries
Good - Per-operation grouping (always returns ungrouped)
data |> summarise(mean_value = mean(value), .by = category)
Good - Multiple grouping variables
data |> summarise(total = sum(revenue), .by = c(company, year))
Good - pick() for column selection
data |>
  summarise(
    n_x_cols = ncol(pick(starts_with("x"))),
    n_y_cols = ncol(pick(starts_with("y")))
  )
Good - across() for applying functions
data |> summarise(across(where(is.numeric), mean, .names = "mean_{.col}"), .by = group)
Good - reframe() for multi-row results
data |> reframe(quantiles = quantile(x, c(0.25, 0.5, 0.75)), .by = group)
Avoid - Old persistent grouping pattern
data |> group_by(category) |> summarise(mean_value = mean(value)) |> ungroup()
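The sketch below runs .by, pick(), and reframe() on a small made-up tibble so the shape of each result is visible. The column names group, x1, and x2 are assumptions for illustration.

```r
library(dplyr)

df <- tibble(
  group = c("a", "a", "b", "b"),
  x1    = c(1, 3, 10, 20),
  x2    = c(2, 4, 6, 8)
)

# .by groups for this one call only; the result comes back ungrouped.
means <- df |> summarise(mean_x1 = mean(x1), .by = group)

# pick() returns the selected columns as a data frame inside a data-masking verb.
totals <- df |> mutate(row_total = rowSums(pick(starts_with("x"))))

# reframe() allows any number of rows per group (here two quantiles each).
quants <- df |> reframe(q = quantile(x1, c(0.25, 0.75)), .by = group)

print(means)
print(totals)
print(quants)
```

Because .by never leaves grouping behind, there is no trailing ungroup() to forget.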
Modern purrr Patterns
- Use map() |> list_rbind() instead of superseded map_dfr()
- Use walk() for side effects (file writing, plotting)
- Use in_parallel() for scaling across cores
Modern data frame row binding (purrr 1.0+)
models <- data_splits |>
  map(\(split) train_model(split)) |>
  list_rbind()  # Replaces map_dfr()
Column binding
summaries <- data_list |>
  map(\(df) get_summary_stats(df)) |>
  list_cbind()  # Replaces map_dfc()
Side effects with walk()
# walk2() is called for its side effects and returns its first input invisibly
walk2(data_list, plot_names, \(df, name) {
  p <- ggplot(df, aes(x, y)) + geom_point()
  ggsave(name, p)
})
Parallel processing (purrr 1.1.0+)
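library(mirai)
daemons(4)  # start 4 background workers
# in_parallel() expects a function written inline; pass any helpers it needs by name
results <- large_datasets |>
  map(in_parallel(\(x) expensive_computation(x), expensive_computation = expensive_computation))
daemons(0)  # shut the workers down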
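For a self-contained illustration of the row-binding and side-effect patterns, the sketch below uses two tiny made-up data frames. The names splits, summaries, and paths are hypothetical, and names_to is an optional list_rbind() argument that records which list element each row came from.

```r
library(purrr)

# Hypothetical named list of data frames.
splits <- list(a = data.frame(x = 1:3), b = data.frame(x = 4:6))

# map() returns a list of one-row data frames; list_rbind() stacks them.
summaries <- splits |>
  map(\(df) data.frame(n = nrow(df), mean_x = mean(df$x))) |>
  list_rbind(names_to = "split")

# walk2() runs the function only for its side effect (writing files).
paths <- file.path(tempdir(), paste0(names(splits), ".csv"))
walk2(splits, paths, \(df, path) write.csv(df, path, row.names = FALSE))

print(summaries)
```

Separating the mapping step from the binding step also makes it easy to inspect the intermediate list when one element fails to bind.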
String Manipulation with stringr
- Use stringr over base R string functions
- Consistent str_ prefix and string-first argument order
- Pipe-friendly and vectorized by design
Good - stringr (consistent, pipe-friendly)
text |>
  str_to_lower() |>
  str_trim() |>
  str_replace_all("pattern", "replacement") |>
  str_extract("\\d+")
Common patterns
str_detect(text, "pattern")      # vs grepl("pattern", text)
str_extract(text, "pattern")     # vs complex regmatches()
str_replace_all(text, "a", "b")  # vs gsub("a", "b", text)
str_split(text, ",")             # vs strsplit(text, ",")
str_length(text)                 # vs nchar(text)
str_sub(text, 1, 5)              # vs substr(text, 1, 5)
String combination and formatting
str_c("a", "b", "c")         # vs paste0()
str_glue("Hello {name}!")    # templating
str_pad(text, 10, "left")    # padding
str_wrap(text, width = 80)   # text wrapping
Case conversion
str_to_lower(text)  # vs tolower()
str_to_upper(text)  # vs toupper()
str_to_title(text)  # vs tools::toTitleCase()
Pattern helpers for clarity
str_detect(text, fixed("$"))                # literal match
str_detect(text, regex("\\d+"))             # explicit regex
str_detect(text, coll("e", locale = "fr"))  # collation
Avoid - inconsistent base R functions
grepl("pattern", text)          # argument order varies
regmatches(text, regexpr(...))  # complex extraction
gsub("a", "b", text)            # different arg order
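A short runnable sketch of the stringr pipeline, using made-up input strings, shows how missing matches and literal patterns behave.

```r
library(stringr)

# Hypothetical messy input.
text <- c("  Order 66 SHIPPED ", "no digits here", "Invoice 2024-07")

cleaned <- text |>
  str_to_lower() |>
  str_trim()

# The backslash must be doubled inside an R string literal.
# Elements without a match come back as NA.
first_number <- str_extract(cleaned, "\\d+")

# fixed() matches "$" literally instead of as an end-of-string anchor.
has_dollar <- str_detect(c("cost: $5", "free"), fixed("$"))

print(cleaned)
print(first_number)
print(has_dollar)
```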
Vectorization and Performance
Good - vectorized operations
result <- x + y
Good - Type-stable purrr functions
map_dbl(data, mean)   # always returns double
map_chr(data, class)  # always returns character
Avoid - Type-unstable base functions
sapply(data, mean) # might return list or vector
Avoid - explicit loops for simple operations
result <- numeric(length(x))
for (i in seq_along(x)) {
  result[i] <- x[i] + y[i]
}
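The type-stability point is easiest to see on empty input, where sapply() changes its return type. The sketch below uses hypothetical inputs to contrast it with the typed purrr variants.

```r
library(purrr)

empty <- list()

# sapply() falls back to an empty list when there is nothing to simplify.
unstable <- sapply(empty, length)

# map_int() returns an integer vector regardless of input length.
stable <- map_int(empty, length)

# Typed variants also keep names and fail loudly on the wrong type.
cols  <- list(a = 1:3, b = c(2.5, 3.5))
means <- map_dbl(cols, mean)

print(class(unstable))
print(class(stable))
print(means)
```

Code that later does arithmetic on the result works with integer(0) but breaks on list(), which is why the failure tends to show up only on edge-case inputs.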
Common Anti-Patterns to Avoid
Legacy Patterns
Avoid - Old pipe
data %>% f()
Avoid - Old join syntax
inner_join(x, y, by = c("a" = "b"))
Avoid - Implicit type conversion
sapply() # Use map_*() instead
Avoid - String manipulation in data masking
mutate(data, !!paste0("new_", var) := value)
Use across() with .names, or glue-style names such as "new_{var}" := value, instead
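Two alternatives to pasting column names together with !! are sketched below on mtcars: glue-style name injection on the left of := and across() with .names. The helper add_doubled is hypothetical.

```r
library(dplyr)

# Glue-style name injection: {var} is interpolated into the new column name.
add_doubled <- function(data, var) {
  data |> mutate("new_{var}" := .data[[var]] * 2)
}
out_single <- add_doubled(mtcars, "mpg")

# across() with .names builds the new names for several columns at once.
out_many <- mtcars |>
  mutate(across(c(mpg, hp), \(x) x * 2, .names = "new_{.col}"))

print(names(out_single))
print(names(out_many))
```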
Performance Anti-Patterns
Avoid - Growing objects in loops
result <- c()
for (i in 1:n) {
  result <- c(result, compute(i))  # Slow!
}
Good - Pre-allocate
result <- vector("list", n)
for (i in 1:n) {
  result[[i]] <- compute(i)
}
Better - Use purrr
result <- map(1:n, compute)
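In keeping with the advice to profile before optimizing, the three approaches can be compared with bench::mark(), which by default also checks that all expressions return equal results. This is a sketch only: compute() is a hypothetical stand-in, the preallocated version uses numeric(n) rather than a list so the results are comparable, and timings will vary by machine.

```r
library(bench)
library(purrr)

compute <- function(i) i^2  # hypothetical stand-in for real work
n <- 1000

# Grows the vector on every iteration, copying it each time.
grow <- function() {
  result <- c()
  for (i in seq_len(n)) result <- c(result, compute(i))
  result
}

# Allocates once, then fills in place.
prealloc <- function() {
  result <- numeric(n)
  for (i in seq_len(n)) result[i] <- compute(i)
  result
}

timings <- bench::mark(
  grow(),
  prealloc(),
  map_dbl(seq_len(n), compute),
  iterations = 10
)

print(timings[, c("expression", "median", "mem_alloc")])
```

The mem_alloc column is often more telling than the timings here, since repeated copying is what makes the growing version slow.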
Migration from Old Patterns
From Base R to Modern Tidyverse
Data manipulation
subset(data, condition) -> filter(data, condition)
data[order(data$x), ] -> arrange(data, x)
aggregate(x ~ y, data, mean) -> summarise(data, mean(x), .by = y)
Functional programming
sapply(x, f) -> map_dbl(x, f)  # type-stable; pick the map_*() variant for the expected type
lapply(x, f) -> map(x, f)
String manipulation
grepl("pattern", text) -> str_detect(text, "pattern")
gsub("old", "new", text) -> str_replace_all(text, "old", "new")
substr(text, 1, 5) -> str_sub(text, 1, 5)
nchar(text) -> str_length(text)
strsplit(text, ",") -> str_split(text, ",")
paste0(a, b) -> str_c(a, b)
tolower(text) -> str_to_lower(text)
From Old to New Tidyverse Patterns
Pipes
data %>% f() -> data |> f()
Grouping (dplyr 1.1+)
group_by(data, x) |> summarise(mean(y)) |> ungroup() -> summarise(data, mean(y), .by = x)
Column selection
across(starts_with("x")) -> pick(starts_with("x")) # for selection only
Joins
by = c("a" = "b") -> by = join_by(a == b)
Multi-row summaries
summarise(data, x, .groups = "drop") -> reframe(data, x)
Data reshaping
gather()/spread() -> pivot_longer()/pivot_wider()
String separation (tidyr 1.3+)
separate(col, into = c("a", "b")) -> separate_wider_delim(col, delim = "_", names = c("a", "b"))
extract(col, into = "x", regex) -> separate_wider_regex(col, patterns = c(x = regex))
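As a concrete sketch of the tidyr 1.3+ replacements, the example below splits a made-up code column both ways. In separate_wider_regex(), unnamed pattern pieces are matched but not kept as columns.

```r
library(tidyr)

# Hypothetical input with a delimiter-separated code column.
df <- tibble::tibble(id = 1:2, code = c("a_1", "b_2"))

# Split on a fixed delimiter into named columns.
by_delim <- df |>
  separate_wider_delim(code, delim = "_", names = c("letter", "number"))

# Split with regex pieces; the unnamed "_" piece is matched and dropped.
by_regex <- df |>
  separate_wider_regex(code, patterns = c(letter = "[a-z]+", "_", number = "\\d+"))

print(by_delim)
print(by_regex)
```

Both functions return the new columns as character, so a numeric column such as number still needs an explicit conversion afterwards.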
Superseded purrr Functions (purrr 1.0+)
map_dfr(x, f) -> map(x, f) |> list_rbind()
map_dfc(x, f) -> map(x, f) |> list_cbind()
map2_dfr(x, y, f) -> map2(x, y, f) |> list_rbind()
pmap_dfr(list, f) -> pmap(list, f) |> list_rbind()
imap_dfr(x, f) -> imap(x, f) |> list_rbind()
For side effects
walk(x, write_file)            # instead of for loops
walk2(data, paths, write_csv)  # multiple arguments