Tidymodels Overview
The tidymodels ecosystem provides a consistent, modular framework for machine learning in R. Understanding how the packages fit together helps with any tidymodels pipeline, so review this overview before diving into package-specific details.
Core Principle: Recipes Are Plans, Not Actions
Critical: A recipe object is a specification of preprocessing steps. Adding steps like step_normalize() does not transform data immediately. Transformations execute only when:
- prep() estimates parameters from training data
- bake() applies the prepped recipe to new data
# This does NOT transform data - it creates a plan
rec <- recipe(outcome ~ ., data = train) |>
  step_normalize(all_numeric_predictors())
# This estimates parameters (means, sds) from training data
prepped <- prep(rec, training = train)
# This applies transformations to new data
processed <- bake(prepped, new_data = test)
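One way to see the plan-versus-action distinction is to inspect the step before and after prep(). The sketch below is illustrative only: it assumes the built-in mtcars data with mpg as the outcome, and an arbitrary 24/8 row split standing in for train and test.

```r
library(recipes)

# Assumption: mtcars stands in for the real data; the 24/8 split is arbitrary
train <- mtcars[1:24, ]
test <- mtcars[25:32, ]

rec <- recipe(mpg ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

# Before prep(): the step only records the selector; no statistics exist yet
untrained <- tidy(rec, number = 1)

# After prep(): means and standard deviations have been estimated from `train`
prepped <- prep(rec, training = train)
trained <- tidy(prepped, number = 1)

# bake() reuses the training statistics on new data
processed <- bake(prepped, new_data = test)
manual <- (test$disp - mean(train$disp)) / sd(train$disp)
all.equal(processed$disp, manual)
```

Comparing `untrained` and `trained` shows that the estimated values appear only after prep().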
The Tidymodels Workflow
Follow this standard workflow for modeling projects:
1. Data Splitting (rsample)
Split the data into training and test sets before any modeling, and create resamples from the training set to play the role of a validation set during development:
set.seed(123)
data_split <- initial_split(data, prop = 0.8, strata = outcome)
train_data <- training(data_split)
test_data <- testing(data_split)
# For iterative evaluation during development
resamples <- vfold_cv(train_data, v = 10)
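As a quick sanity check on a split, the following sketch assumes a simulated 200-row data frame (the names dat, x, and outcome are illustrative, not from a real project). It confirms that the training and test sets partition the rows and that each fold partitions the training set.

```r
library(rsample)

set.seed(123)
# Assumption: simulated data standing in for `data`
dat <- data.frame(x = rnorm(200), outcome = rnorm(200))

data_split <- initial_split(dat, prop = 0.8, strata = outcome)
train_data <- training(data_split)
test_data <- testing(data_split)

# Every row lands in exactly one of the two sets
nrow(train_data) + nrow(test_data) == nrow(dat)

# Resamples are drawn from the training set only; the test set stays untouched
resamples <- vfold_cv(train_data, v = 10)
first_fold <- resamples$splits[[1]]
nrow(analysis(first_fold)) + nrow(assessment(first_fold)) == nrow(train_data)
```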
2. Preprocessing (recipes)
Define feature engineering as a recipe specification:
rec_spec <- recipe(outcome ~ ., data = train_data) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_factor_predictors()) |>
  step_zv(all_predictors())
Use tidyselect helpers for column selection:
- all_predictors(), all_outcomes() - by role
- all_numeric_predictors(), all_nominal_predictors() - by type and role
- has_role(), has_type() - explicit queries
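To make the role and type distinction concrete, here is a small sketch that assumes the built-in iris data with Sepal.Length as the outcome (an illustrative choice). summary() on a recipe lists the type and role of each variable, which is the information the selectors query.

```r
library(recipes)

# Assumption: iris stands in for the training data
base_rec <- recipe(Sepal.Length ~ ., data = iris)

# One row per variable, with its type and role
var_info <- summary(base_rec)

prepped <- base_rec |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  prep()

baked <- bake(prepped, new_data = NULL)

# The numeric outcome is not a predictor, so it is left on its original scale;
# the factor predictor Species is replaced by indicator columns
names(baked)
```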
3. Model Specification (parsnip)
Define the model type, engine, and mode separately from fitting:
model_spec <- rand_forest(mtry = tune(), trees = 1000) |>
  set_engine("ranger") |>
  set_mode("regression")
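The separation between specifying and fitting can be illustrated with a simpler model. This sketch swaps in a linear regression with the "lm" engine (an assumption made so that no extra engine package is needed) and uses mtcars as stand-in data.

```r
library(parsnip)

# Assumption: linear regression on mtcars, chosen only to keep the example light
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# The specification holds no data; translate() shows the call template
# that parsnip would generate for the chosen engine
translate(lm_spec)

# Fitting is a separate, explicit step
lm_fit <- fit(lm_spec, mpg ~ wt + hp, data = mtcars)
preds <- predict(lm_fit, new_data = mtcars[1:3, ])
```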
4. Bundling (workflows)
Combine preprocessing and model into a single object:
wflow <- workflow() |>
  add_recipe(rec_spec) |>
  add_model(model_spec)
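The practical benefit of bundling is that fit() and predict() handle the prep() and bake() calls internally. A minimal sketch, assuming mtcars with an arbitrary 24/8 row split and a linear model in place of the random forest:

```r
library(recipes)
library(parsnip)
library(workflows)

# Assumption: mtcars and this row split are illustrative stand-ins
train <- mtcars[1:24, ]
test <- mtcars[25:32, ]

rec_spec <- recipe(mpg ~ ., data = train) |>
  step_normalize(all_numeric_predictors())

model_spec <- linear_reg() |>
  set_engine("lm")

wflow <- workflow() |>
  add_recipe(rec_spec) |>
  add_model(model_spec)

# fit() preps the recipe on `train` and then fits the model
wflow_fit <- fit(wflow, data = train)

# predict() bakes `test` with the training statistics before predicting
preds <- predict(wflow_fit, new_data = test)
```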
5. Evaluation (tune + yardstick)
Use resampling or validation sets to assess performance:
# Define metrics
metrics <- metric_set(rmse, rsq, mae)
# Tune hyperparameters
tuned <- tune_grid(
  wflow,
  resamples = resamples,
  grid = 10,
  metrics = metrics
)
# Select best parameters
best_params <- select_best(tuned, metric = "rmse")
6. Finalization
Finalize the workflow with the selected parameters. last_fit() then fits it on the full training set and evaluates it once on the test set:
final_wflow <- finalize_workflow(wflow, best_params)
final_fit <- last_fit(final_wflow, split = data_split)
# Extract test set metrics
collect_metrics(final_fit)
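The six steps can be run end to end on a small dataset. The sketch below is one possible arrangement under stated assumptions, not the only way to do it: it uses mtcars, replaces the random forest with a linear model so that no extra engine package is required, and tunes the number of PCA components in the recipe instead of mtry.

```r
library(tidymodels)

set.seed(123)
# 1. Split (assumption: mtcars with mpg as the outcome)
data_split <- initial_split(mtcars, prop = 0.8)
train_data <- training(data_split)
resamples <- vfold_cv(train_data, v = 5)

# 2. Preprocess: the number of PCA components is marked for tuning
rec_spec <- recipe(mpg ~ ., data = train_data) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = tune())

# 3. Model specification (linear model swapped in for the random forest)
model_spec <- linear_reg() |>
  set_engine("lm")

# 4. Bundle
wflow <- workflow() |>
  add_recipe(rec_spec) |>
  add_model(model_spec)

# 5. Evaluate candidate values on the resamples
tuned <- tune_grid(
  wflow,
  resamples = resamples,
  grid = 4,
  metrics = metric_set(rmse, rsq)
)
best_params <- select_best(tuned, metric = "rmse")

# 6. Finalize, refit on the training set, and score the test set once
final_wflow <- finalize_workflow(wflow, best_params)
final_fit <- last_fit(final_wflow, split = data_split)
test_metrics <- collect_metrics(final_fit)
```

The test set is touched only in the last step, through last_fit().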
Package Roles
| Package | Purpose | Key Functions |
|---|---|---|
| rsample | Data splitting and resampling | initial_split(), vfold_cv(), bootstraps() |
| recipes | Preprocessing specification | recipe(), step_*(), prep(), bake() |
| parsnip | Model specification | Model functions, set_engine(), set_mode() |
| workflows | Bundle recipe + model | workflow(), add_recipe(), add_model() |
| tune | Hyperparameter optimization | tune_grid(), tune_bayes(), select_best() |
| yardstick | Performance metrics | metric_set(), rmse(), accuracy() |
| workflowsets | Compare multiple pipelines | workflow_set(), workflow_map() |
| stacks | Model ensembling | stacks(), add_candidates(), blend_predictions() |
| hardhat | Internal infrastructure | mold(), forge(), blueprints |
Key Principles
Use Package Functions, Not Direct Access
Never directly modify tidymodels object internals. Always use provided functions:
# WRONG - directly modifying internals
recipe_obj$steps[[1]]$means <- new_means
# CORRECT - use proper functions
rec <- recipe(...) |>
  step_normalize(...) |>
  prep()
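The same principle applies to reading values out of fitted objects: accessor functions such as extract_recipe(), extract_fit_parsnip(), and tidy() avoid reaching into list internals. A minimal sketch, assuming mtcars and a linear model as stand-ins:

```r
library(tidymodels)

# Assumption: a small workflow on mtcars, used only for illustration
wflow_fit <- workflow() |>
  add_recipe(
    recipe(mpg ~ wt + hp, data = mtcars) |>
      step_normalize(all_numeric_predictors())
  ) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = mtcars)

# Read the estimated means and sds through tidy(), not through $steps
norm_stats <- wflow_fit |>
  extract_recipe() |>
  tidy(number = 1)

# Read the fitted coefficients through the parsnip fit
coefs <- wflow_fit |>
  extract_fit_parsnip() |>
  tidy()
```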
Use Selectors, Not String Matching
Avoid constructing variable lists manually. A hand-built list ignores roles, so it can include the outcome or ID columns:
# WRONG - building the column list by hand
numeric_cols <- names(data)[sapply(data, is.numeric)]
rec |> step_normalize(all_of(numeric_cols))
# CORRECT - use tidyselect helpers
rec |> step_normalize(all_numeric_predictors())
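The difference is easy to demonstrate. In this sketch, which assumes mtcars with mpg as the outcome, the hand-built list picks up the outcome along with the predictors, while the role-aware selector does not.

```r
library(tidymodels)

# Hand-built list: every numeric column, including the outcome `mpg`
numeric_cols <- names(mtcars)[sapply(mtcars, is.numeric)]
"mpg" %in% numeric_cols

by_hand <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_of(numeric_cols)) |>
  prep() |>
  bake(new_data = NULL)

by_role <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Compare the outcome column against the original values in each case
outcome_rescaled_by_hand <- !isTRUE(all.equal(by_hand$mpg, mtcars$mpg))
outcome_kept_by_role <- isTRUE(all.equal(by_role$mpg, mtcars$mpg))
```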
Understand Role Requirements
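Columns with custom (non-standard) roles must be present in new_data at bake() time by default. When step_rm() removes such columns, or they will be missing at prediction time, relax the requirement with update_role_requirements():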
rec <- recipe(...) |>
  update_role(id_column, new_role = "id") |>
  update_role_requirements("id", bake = FALSE) |>
  step_rm(has_role("id"))
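A small runnable version of this pattern, using toy data in which the column names id_column, x, and outcome are hypothetical, could look like the following. The new observations deliberately lack the ID column.

```r
library(recipes)

set.seed(1)
# Assumption: toy data; `id_column` is an identifier absent at prediction time
train <- data.frame(id_column = 1:10, x = rnorm(10), outcome = rnorm(10))
new_obs <- data.frame(x = rnorm(5))

rec <- recipe(outcome ~ ., data = train) |>
  update_role(id_column, new_role = "id") |>
  update_role_requirements("id", bake = FALSE) |>
  step_rm(has_role("id")) |>
  prep()

# bake() does not insist on `id_column` because the requirement was relaxed
baked <- bake(rec, new_data = new_obs)
names(baked)
```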
workflowsets Require Same Outcome
All workflows in a workflow_set must predict the same outcome variable. For different outcomes, create separate workflow sets.
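As an illustration, the sketch below builds a workflow set in which two preprocessors share the outcome mpg. The formulas, the IDs simple and full, and the use of mtcars are assumptions made for the example.

```r
library(tidymodels)
library(workflowsets)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

# Both preprocessors predict the same outcome, mpg
wf_set <- workflow_set(
  preproc = list(simple = mpg ~ wt, full = mpg ~ wt + hp + disp),
  models = list(lm = linear_reg() |> set_engine("lm"))
)

# Fit every workflow on the same resamples, then rank them
wf_results <- workflow_map(wf_set, fn = "fit_resamples", resamples = folds)
ranking <- rank_results(wf_results, rank_metric = "rmse")
```

Because every workflow is scored on the same outcome and the same folds, the ranked metrics are directly comparable.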
When to Use Each Package
- Simple model: recipes + parsnip + workflows
- Hyperparameter tuning: Add tune
- Model comparison: Add workflowsets
- Ensemble models: Add stacks (requires save_pred = TRUE, save_workflow = TRUE)
- Custom preprocessing interfaces: Use hardhat
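For the stacks requirement, the two options are set on the control object passed to the tuning function. A minimal sketch using control_grid() from tune follows; the commented call reuses the wflow and resamples names from the workflow section and is shown only as a usage pattern.

```r
library(tune)

# Keep the out-of-sample predictions and the workflow inside the tuning
# results so that they can be reused as ensemble candidates
ctrl <- control_grid(save_pred = TRUE, save_workflow = TRUE)

# Usage pattern (assumes `wflow` and `resamples` already exist):
# tuned <- tune_grid(wflow, resamples = resamples, grid = 10, control = ctrl)
```

The stacks package also provides control_stack_grid(), a helper that sets these options.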
Additional Resources
Reference Files
For detailed information, consult:
- references/packages.md - Detailed package documentation including object structures, creation processes, and deep knowledge links
- references/common-problems.md - Common pitfalls when working with tidymodels and how to avoid them
External Documentation
- tidymodels.org - Official documentation and tutorials
- recipes.tidymodels.org - Recipe step reference
- parsnip.tidymodels.org - Model specifications