Modern Tidyverse Patterns

Best practices for modern tidyverse development with dplyr 1.1+ and R 4.3+

Core Principles

Use modern tidyverse patterns - Prioritize dplyr 1.1+ features, native pipe, and current APIs
Profile before optimizing - Use profvis and bench to identify real bottlenecks
Write readable code first - Optimize only when necessary and after profiling
Follow tidyverse style guide - Consistent naming, spacing, and structure

Pipe Usage (|> not %>%)

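Use the native pipe |> instead of the magrittr pipe %>%
The native pipe arrived in R 4.1; R 4.2 added the _ placeholder and R 4.3 allows extraction from it, so R 4.3+ covers what most pipelines need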

Good - Modern native pipe

data |> filter(year >= 2020) |> summarise(mean_value = mean(value))

Avoid - Legacy magrittr pipe

data %>% filter(year >= 2020) %>% summarise(mean_value = mean(value))
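As a minimal sketch of the native pipe in practice, the example below uses the built-in mtcars data in place of the data object above. The lm() call and the `_` placeholder (R 4.2+) are illustrative additions, not part of the original pattern.

```r
library(dplyr)

# Native pipe: the left-hand side becomes the first argument of each call.
efficient <- mtcars |>
  filter(cyl == 4) |>
  summarise(mean_mpg = mean(mpg))

# The `_` placeholder (R 4.2+) passes the left-hand side to a named argument
# other than the first one.
fit <- mtcars |> lm(mpg ~ wt, data = _)
```

The placeholder must be supplied to a named argument, which is why data = _ is spelled out.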

Join Syntax (dplyr 1.1+)

Use join_by() instead of character vectors for joins
Support for inequality, rolling, and overlap joins

Good - Modern join syntax

transactions |> inner_join(companies, by = join_by(company == id))

Good - Inequality joins

transactions |> inner_join(companies, join_by(company == id, year >= since))

Good - Rolling joins (closest match)

transactions |> inner_join(companies, join_by(company == id, closest(year >= since)))

Avoid - Old character vector syntax

transactions |> inner_join(companies, by = c("company" = "id"))
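To make the difference between the inequality and rolling forms concrete, here is a small sketch. The transactions and companies tables and their values are made up for illustration; only the column names follow the examples above.

```r
library(dplyr)

# Hypothetical data: company "A" has two records with different start years.
companies <- tibble(
  id    = c("A", "A", "B"),
  since = c(2018, 2021, 2019),
  name  = c("A-old", "A-new", "B")
)
transactions <- tibble(
  company = c("A", "A", "B"),
  year    = c(2019, 2022, 2020)
)

# Inequality join: keeps every company record that started on or before
# the transaction year, so the 2022 transaction matches both "A" records.
inequality <- transactions |>
  inner_join(companies, by = join_by(company == id, year >= since))

# Rolling join: closest() keeps only the most recent qualifying record.
rolling <- transactions |>
  inner_join(companies, by = join_by(company == id, closest(year >= since)))
```

With this data the inequality join returns four rows and the rolling join returns one row per transaction.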

Multiple Match Handling

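Use the relationship, multiple, and unmatched arguments for quality control

Expect 1:1 matches, error otherwise (relationship requires dplyr 1.1.1+, where multiple = "error" is deprecated)

inner_join(x, y, by = join_by(id), relationship = "one-to-one")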

Allow multiple matches explicitly

inner_join(x, y, by = join_by(id), multiple = "all")

Ensure all rows match

inner_join(x, y, by = join_by(id), unmatched = "error")
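The checks can be exercised on a small pair of tables. This is a sketch with invented data; the fails() helper is hypothetical and exists only to capture errors. It uses the relationship argument, which is available from dplyr 1.1.1.

```r
library(dplyr)

# Hypothetical tables: id 2 appears twice in y, and id 3 has no match in y.
x <- tibble(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- tibble(id = c(1, 2, 2), y_val = c("p", "q", "r"))

# Helper (not part of dplyr): TRUE when evaluating the expression errors.
fails <- function(expr) inherits(try(expr, silent = TRUE), "try-error")

# Errors because one row of x matches two rows of y.
one_to_one_fails <- fails(
  inner_join(x, y, by = join_by(id), relationship = "one-to-one")
)

# Errors because the inner join would silently drop the row with id 3.
unmatched_fails <- fails(
  inner_join(x, y, by = join_by(id), unmatched = "error")
)

# Explicitly accepting multiple matches returns every matching pair.
all_matches <- inner_join(x, y, by = join_by(id), multiple = "all")
```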

Data Masking and Tidy Selection

Understand the difference between data masking and tidy selection
Use {{}} (embrace) for function arguments
Use .data[[]] for character vectors

Data masking functions: arrange(), filter(), mutate(), summarise()

Tidy selection functions: select(), relocate(), across()

Function arguments - embrace with {{}}

my_summary <- function(data, group_var, summary_var) {
  data |>
    group_by({{ group_var }}) |>
    summarise(mean_val = mean({{ summary_var }}))
}

Character vectors - use .data[[]]

for (var in names(mtcars)) { mtcars |> count(.data[[var]]) |> print() }

Multiple columns - use across()

data |> summarise(across({{ summary_vars }}, ~ mean(.x, na.rm = TRUE)))
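The three techniques above can be combined in ordinary wrapper functions. The sketch below is one way to do it; summarise_means() and mean_of() are hypothetical names, and mtcars stands in for real data.

```r
library(dplyr)

# Embrace ({{ }}) forwards unquoted column names supplied by the caller.
# across() accepts a tidy selection, so several columns can be passed at once.
summarise_means <- function(data, group_var, summary_vars) {
  data |>
    group_by({{ group_var }}) |>
    summarise(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      .groups = "drop"
    )
}

# .data[[ ]] looks a column up by name when the name arrives as a string.
mean_of <- function(data, var) {
  data |> summarise(mean = mean(.data[[var]]))
}

by_cyl   <- summarise_means(mtcars, cyl, c(mpg, hp))
mpg_mean <- mean_of(mtcars, "mpg")
```

Callers pass bare names to summarise_means() and a character string to mean_of(), which is the practical difference between the two approaches.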

Modern Grouping and Column Operations

Use .by for per-operation grouping (dplyr 1.1+)
Use pick() for column selection inside data-masking functions
Use across() for applying functions to multiple columns
Use reframe() for multi-row summaries

Good - Per-operation grouping (always returns ungrouped)

data |> summarise(mean_value = mean(value), .by = category)

Good - Multiple grouping variables

data |> summarise(total = sum(revenue), .by = c(company, year))

Good - pick() for column selection

data |>
  summarise(
    n_x_cols = ncol(pick(starts_with("x"))),
    n_y_cols = ncol(pick(starts_with("y")))
  )

Good - across() for applying functions

data |> summarise(across(where(is.numeric), mean, .names = "mean_{.col}"), .by = group)

Good - reframe() for multi-row results

data |> reframe(quantiles = quantile(x, c(0.25, 0.5, 0.75)), .by = group)

Avoid - Old persistent grouping pattern

data |> group_by(category) |> summarise(mean_value = mean(value)) |> ungroup()
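A short sketch comparing the two grouping styles on invented data. The sales table and the probs vector are assumptions made for illustration.

```r
library(dplyr)

sales <- tibble(
  category = c("a", "a", "b", "b"),
  value    = c(1, 3, 10, 20)
)

# Per-operation grouping: the result is never grouped.
means <- sales |>
  summarise(mean_value = mean(value), .by = category)

# Equivalent persistent-grouping version, which needs an explicit ungroup().
old_style <- sales |>
  group_by(category) |>
  summarise(mean_value = mean(value)) |>
  ungroup()

# reframe() may return any number of rows per group (three here).
probs <- c(0.25, 0.5, 0.75)
quartiles <- sales |>
  reframe(
    prob    = probs,
    value_q = quantile(value, probs, names = FALSE),
    .by     = category
  )
```

One difference worth knowing: .by keeps groups in order of first appearance, while group_by() sorts them. The two agree here only because the categories are already sorted.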

Modern purrr Patterns

Use map() |> list_rbind() instead of superseded map_dfr()
Use walk() for side effects (file writing, plotting)
Use in_parallel() for scaling across cores

Modern data frame row binding (purrr 1.0+)

models <- data_splits |> map(\(split) train_model(split)) |> list_rbind() # Replaces map_dfr()

Column binding

summaries <- data_list |> map(\(df) get_summary_stats(df)) |> list_cbind() # Replaces map_dfc()

Side effects with walk()

# walk2() returns its first input invisibly, so the result is not assigned
walk2(data_list, plot_names, \(df, name) {
  p <- ggplot(df, aes(x, y)) + geom_point()
  ggsave(name, p)
})

Parallel processing (purrr 1.1.0+)

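library(mirai)
daemons(4) # start four background workers
# Functions used inside in_parallel() must be passed in explicitly
results <- large_datasets |>
  map(in_parallel(\(x) expensive_computation(x), expensive_computation = expensive_computation))
daemons(0) # shut the workers down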
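The row-binding and side-effect patterns can be tried without any project-specific helpers. In this sketch, mtcars split by cylinder count plays the role of data_splits, and a linear model stands in for train_model(); both are assumptions made for illustration. The parallel variant is left out because it needs mirai workers.

```r
library(purrr)
library(dplyr)

groups <- split(mtcars, mtcars$cyl)

# map() returns a list of one-row tibbles; list_rbind() stacks them and
# names_to stores the list names in a column.
fits <- groups |>
  map(\(df) {
    fit <- lm(mpg ~ wt, data = df)
    tibble(slope = coef(fit)[["wt"]], n = nrow(df))
  }) |>
  list_rbind(names_to = "cyl")

# walk2() is called only for its side effect: one CSV file per group.
paths <- file.path(tempdir(), paste0("cyl_", names(groups), ".csv"))
walk2(groups, paths, \(df, path) write.csv(df, path, row.names = FALSE))
```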

String Manipulation with stringr

Use stringr over base R string functions
Consistent str_ prefix and string-first argument order
Pipe-friendly and vectorized by design

Good - stringr (consistent, pipe-friendly)

text |>
  str_to_lower() |>
  str_trim() |>
  str_replace_all("pattern", "replacement") |>
  str_extract("\\d+")

Common patterns

str_detect(text, "pattern") # vs grepl("pattern", text)
str_extract(text, "pattern") # vs complex regmatches()
str_replace_all(text, "a", "b") # vs gsub("a", "b", text)
str_split(text, ",") # vs strsplit(text, ",")
str_length(text) # vs nchar(text)
str_sub(text, 1, 5) # vs substr(text, 1, 5)

String combination and formatting

str_c("a", "b", "c") # vs paste0()
str_glue("Hello {name}!") # templating
str_pad(text, 10, "left") # padding
str_wrap(text, width = 80) # text wrapping

Case conversion

str_to_lower(text) # vs tolower()
str_to_upper(text) # vs toupper()
str_to_title(text) # vs tools::toTitleCase()

Pattern helpers for clarity

str_detect(text, fixed("$")) # literal match
str_detect(text, regex("\\d+")) # explicit regex
str_detect(text, coll("e", locale = "fr")) # collation

Avoid - inconsistent base R functions

grepl("pattern", text) # argument order varies
regmatches(text, regexpr(...)) # complex extraction
gsub("a", "b", text) # different arg order
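A small worked example of these patterns on made-up input. The raw vector is hypothetical. Note that a backslash in a regular expression has to be doubled inside an R string.

```r
library(stringr)

raw <- c("  Order 123 ", "ORDER 45", "no number")

# Normalise case and whitespace, then pull out the first run of digits.
clean <- raw |> str_to_lower() |> str_trim()
ids   <- clean |> str_extract("\\d+")

# fixed() treats "$" as a literal character rather than a regex anchor.
has_dollar <- str_detect(c("$5", "5"), fixed("$"))
```

Elements without a match come back as NA from str_extract(), so the output keeps the length of the input.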

Vectorization and Performance

Good - vectorized operations

result <- x + y

Good - Type-stable purrr functions

map_dbl(data, mean) # always returns double
map_chr(data, class) # always returns character

Avoid - Type-unstable base functions

sapply(data, mean) # might return list or vector

Avoid - explicit loops for simple operations

result <- numeric(length(x))
for (i in seq_along(x)) {
  result[i] <- x[i] + y[i]
}
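The type-stability point is easiest to see with empty input, where sapply() changes its return type. A minimal sketch using only base R and purrr:

```r
library(purrr)

empty <- list()

# sapply() falls back to an empty list when there is nothing to simplify.
sapply_result <- sapply(empty, length)

# map_int() returns an integer vector regardless of input length.
map_result <- map_int(empty, length)

# map_dbl() on a data frame gives one named double per column.
col_means <- map_dbl(mtcars, mean)
```

Code downstream of map_int() can rely on receiving an integer vector, while code downstream of sapply() has to handle both cases.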

Common Anti-Patterns to Avoid

Legacy Patterns

Avoid - Old pipe

data %>% function()

Avoid - Old join syntax

inner_join(x, y, by = c("a" = "b"))

Avoid - Implicit type conversion

sapply() # Use map_*() instead

Avoid - String manipulation in data masking

mutate(data, !!paste0("new_", var) := value)

Use across() with .names, or a glue-style name such as "new_{var}" := value, instead

Performance Anti-Patterns

Avoid - Growing objects in loops

result <- c()
for (i in 1:n) {
  result <- c(result, compute(i)) # Slow!
}

Good - Pre-allocate

result <- vector("list", n)
for (i in seq_len(n)) {
  result[[i]] <- compute(i)
}

Better - Use purrr

result <- map(1:n, compute)
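In keeping with the advice to profile before optimizing, the three approaches can be measured rather than assumed. This is a sketch only: compute() is a hypothetical stand-in for real work, the wrapper function names are invented, and the timing step runs only if the bench package is installed.

```r
library(purrr)

compute <- function(i) sqrt(i) # hypothetical stand-in for real work

# Grows the vector on every iteration, copying it each time.
grow <- function(n) {
  result <- c()
  for (i in seq_len(n)) result <- c(result, compute(i))
  result
}

# Allocates the full vector once and fills it in place.
preallocate <- function(n) {
  result <- numeric(n)
  for (i in seq_len(n)) result[i] <- compute(i)
  result
}

# Lets purrr handle allocation and checks the return type.
with_purrr <- function(n) map_dbl(seq_len(n), compute)

# bench::mark() also verifies that all expressions return equal results.
if (requireNamespace("bench", quietly = TRUE)) {
  timings <- bench::mark(grow(1e4), preallocate(1e4), with_purrr(1e4))
  print(timings)
}
```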

Migration from Old Patterns

From Base R to Modern Tidyverse

Data manipulation

subset(data, condition) -> filter(data, condition)
data[order(data$x), ] -> arrange(data, x)
aggregate(x ~ y, data, mean) -> summarise(data, mean(x), .by = y)

Functional programming

sapply(x, f) -> map(x, f) # always a list; use map_dbl(), map_chr(), etc. for atomic vectors
lapply(x, f) -> map(x, f)

String manipulation

grepl("pattern", text) -> str_detect(text, "pattern")
gsub("old", "new", text) -> str_replace_all(text, "old", "new")
substr(text, 1, 5) -> str_sub(text, 1, 5)
nchar(text) -> str_length(text)
strsplit(text, ",") -> str_split(text, ",")
paste0(a, b) -> str_c(a, b)
tolower(text) -> str_to_lower(text)

From Old to New Tidyverse Patterns

Pipes

data %>% function() -> data |> function()

Grouping (dplyr 1.1+)

group_by(data, x) |> summarise(mean(y)) |> ungroup() -> summarise(data, mean(y), .by = x)

Column selection

across(starts_with("x")) -> pick(starts_with("x")) # for selection only

Joins

by = c("a" = "b") -> by = join_by(a == b)

Multi-row summaries

summarise(data, x, .groups = "drop") -> reframe(data, x)

Data reshaping

gather()/spread() -> pivot_longer()/pivot_wider()
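A minimal round trip through the two pivot functions, using an invented two-row table. The column names key and value are arbitrary choices.

```r
library(tidyr)

wide <- tibble::tibble(id = 1:2, x = c(10, 20), y = c(1, 2))

# pivot_longer() replaces gather(): one row per id and former column.
long <- wide |>
  pivot_longer(c(x, y), names_to = "key", values_to = "value")

# pivot_wider() replaces spread(): back to one column per key.
back <- long |>
  pivot_wider(names_from = key, values_from = value)
```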

String separation (tidyr 1.3+)

separate(col, into = c("a", "b")) -> separate_wider_delim(col, delim = "_", names = c("a", "b"))
extract(col, into = "x", regex) -> separate_wider_regex(col, patterns = c(x = regex))
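The mappings above leave out the data argument for brevity. A sketch of the full calls follows, with a hypothetical code column and invented output names.

```r
library(tidyr)

df <- tibble::tibble(code = c("a_1", "b_2"))

# Split on a fixed delimiter into named columns.
by_delim <- df |>
  separate_wider_delim(code, delim = "_", names = c("letter", "number"))

# Split with regular expressions: named patterns become columns, and
# unnamed patterns (the "_" here) are matched but dropped.
by_regex <- df |>
  separate_wider_regex(
    code,
    patterns = c(letter = "[a-z]", "_", number = "\\d+")
  )
```

Both calls return character columns, so a numeric column still needs an explicit conversion afterwards.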

Superseded purrr Functions (purrr 1.0+)

For side effects

walk(x, write_file) # instead of for loops
walk2(data, paths, write_csv) # multiple arguments