Optimizing R
This skill covers profiling, benchmarking, parallelization, and performance best practices for R.
Core Principle
Profile before optimizing - Use profvis and bench to identify real bottlenecks. Write readable code first, and optimize only where measurements show it is necessary.
Profiling Tools Decision Matrix
| Tool | Use When | Don't Use When | What It Shows |
|---|---|---|---|
| profvis | Complex code, unknown bottlenecks | Simple functions, known issues | Time per line, call stack |
| bench::mark() | Comparing alternatives | Single approach | Relative performance, memory |
| system.time() | Quick checks | Detailed analysis | Total runtime only |
| Rprof() | Base R only environments | When profvis available | Raw profiling data |
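
As a rough illustration of the "quick check" and "comparing alternatives" rows, the sketch below times two implementations of the same task. The functions `sum_sq_loop()` and `sum_sq_vec()` are hypothetical stand-ins, and the bench part only runs if the bench package is installed.

```r
# Two hypothetical implementations of the same task: summing squares.
sum_sq_loop <- function(x) {
  total <- 0
  for (v in x) total <- total + v^2
  total
}
sum_sq_vec <- function(x) sum(x^2)

x <- runif(1e5)

# Quick check: total runtime only.
print(system.time(sum_sq_loop(x)))

# Comparing alternatives: timing and memory side by side (needs bench).
if (requireNamespace("bench", quietly = TRUE)) {
  res <- bench::mark(
    loop = sum_sq_loop(x),
    vectorized = sum_sq_vec(x)
  )
  print(res[, c("expression", "median", "mem_alloc")])
}
```

By default `bench::mark()` also checks that the expressions return equal results, which catches "optimizations" that quietly change the output.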
Performance Workflow
- Profile first - Find the actual bottlenecks
- Focus on the slowest parts - 80/20 rule
- Benchmark alternatives - For hot spots only
- Consider tool trade-offs - Based on bottleneck type
See profiling-workflow.md for the complete workflow.
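
The first two workflow steps might look like the following in a base R only environment. This is a minimal sketch: `simulate()`, `summarise_sims()`, and `pipeline()` are hypothetical functions invented for the example, and it assumes R was built with profiling support.

```r
# Hypothetical pipeline: one expensive step, one cheap step.
simulate <- function(reps, n) {
  vapply(seq_len(reps), function(i) median(sort(runif(n))), numeric(1))
}
summarise_sims <- function(x) c(mean = mean(x), sd = sd(x))
pipeline <- function() summarise_sims(simulate(reps = 100, n = 1e5))

# Step 1: profile the whole pipeline, not a guess about what is slow.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
result <- pipeline()
Rprof(NULL)

# Step 2: rank functions by total time and focus on the top entries.
prof <- summaryRprof(prof_file)
print(head(prof$by.total, 5))
```

Where profvis is available, wrapping the same call in `profvis::profvis()` gives an interactive, per-line view of the same data.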
When Each Tool Helps vs Hurts
Parallel Processing (in_parallel())
Helps when:
- CPU-intensive computations
- Embarrassingly parallel problems
- Large datasets with independent operations
- I/O bound operations (file reading, API calls)
Hurts when:
- Simple, fast operations (overhead > benefit)
- Memory-intensive operations (may cause thrashing)
- Operations requiring shared state
- Small datasets
See parallel-examples.md for decision points.
Data Backend Selection
| Backend | Use When |
|---|---|
| data.table | Very large datasets (>1GB), complex grouping, maximum performance critical |
| dplyr | Readability priority, complex joins/window functions, moderate data (<100MB) |
| base R | No dependencies allowed, simple operations, teaching/learning |
See backend-selection.md for guidance.
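
To make the trade-off concrete, here is the same grouped mean written for each backend. It is only a sketch on made-up data (`df` is hypothetical), and the dplyr and data.table parts run only if those packages are installed.

```r
set.seed(1)
df <- data.frame(
  group = sample(letters[1:3], 1000, replace = TRUE),
  value = rnorm(1000)
)

# base R: no dependencies.
base_res <- aggregate(value ~ group, data = df, FUN = mean)
print(base_res)

# dplyr: readable, pipe-friendly verbs.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_res <- df |>
    dplyr::group_by(group) |>
    dplyr::summarise(value = mean(value))
  print(dplyr_res)
}

# data.table: concise syntax aimed at large data.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  dt_res <- dt[, .(value = mean(value)), by = "group"]
  print(dt_res)
}
```

On data this small all three are effectively instant. Speed differences only become a reason to switch at the sizes listed in the table.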
Profiling Best Practices
- Profile realistic data sizes - Not toy examples
- Profile multiple runs - For stability
- Check memory usage too - Not just time
- Profile realistic usage patterns - Not isolated calls
See profiling-best-practices.md for examples.
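
One way to apply the first three points is sketched below. `summarise_values()` and `make_data()` are hypothetical, and the "realistic" size of one million rows is an assumption to replace with your own data volume.

```r
# Hypothetical function under test.
summarise_values <- function(x) tapply(x$value, x$group, mean)

make_data <- function(n) {
  data.frame(group = sample(letters, n, replace = TRUE), value = rnorm(n))
}

toy <- make_data(1e2)
realistic <- make_data(1e6)  # assumed realistic size; adjust to your data

time_once <- function(data) system.time(summarise_values(data))[["elapsed"]]

# Repeat each measurement: a single run is noisy.
toy_times <- replicate(5, time_once(toy))
real_times <- replicate(5, time_once(realistic))
print(c(toy = median(toy_times), realistic = median(real_times)))

# bench::mark() also reports memory allocated per run (needs bench).
if (requireNamespace("bench", quietly = TRUE)) {
  res <- bench::mark(summarise_values(realistic), min_iterations = 5)
  print(res[, c("median", "mem_alloc", "n_itr")])
}
```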
Performance Anti-Patterns to Avoid
- Don't optimize without measuring - Profile first
- Don't over-engineer - Avoid complex optimizations for 1% gains
- Don't assume - "for loops are always slow" is a myth
- Don't ignore readability costs - Prefer readable code with targeted optimizations
See performance-anti-patterns.md for examples.
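
The for loop myth is easy to test. In the sketch below (all three functions are hypothetical), the slow version is slow because it grows its result on every iteration, not because it uses a loop. Actual timings depend on your machine, so none are shown.

```r
n <- 2e4

# Slow: the result vector is copied on every iteration.
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Same loop, but the result is allocated once and filled in place.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Vectorized equivalent for reference.
vectorized <- function(n) seq_len(n)^2

print(system.time(grow(n)))
print(system.time(prealloc(n)))
print(system.time(vectorized(n)))
```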
Modern purrr Patterns
Data Frame Binding (purrr 1.0+)
| Superseded | Modern Replacement |
|---|---|
| map_dfr(x, f) | map(x, f) |> list_rbind() |
| map_dfc(x, f) | map(x, f) |> list_cbind() |
| map2_dfr(x, y, f) | map2(x, y, f) |> list_rbind() |
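
A small sketch of the row-wise and column-wise replacements, assuming purrr 1.0.0 or later is installed. `make_row()` and `make_col()` are hypothetical helpers written for this example.

```r
if (requireNamespace("purrr", quietly = TRUE) &&
    utils::packageVersion("purrr") >= "1.0.0") {
  # Hypothetical helper that returns one data frame per input.
  make_row <- function(i) data.frame(id = i, square = i^2)

  # Instead of map_dfr(1:3, make_row):
  rows <- purrr::map(1:3, make_row) |> purrr::list_rbind()
  print(rows)

  # Hypothetical helper that returns a one-column data frame named `nm`.
  make_col <- function(nm) stats::setNames(data.frame(1:2), nm)

  # Instead of map_dfc(c("a", "b"), make_col):
  cols <- purrr::map(c("a", "b"), make_col) |> purrr::list_cbind()
  print(cols)
}
```

Splitting the mapping step from the binding step makes it explicit which combination is happening, and it lets you inspect the intermediate list when something goes wrong.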
Side Effects with walk()
Use walk() and walk2() for side effects (file writing, plotting).
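
For example, writing several data frames to disk could be sketched as follows. The `tables` list and the temporary output directory are made up for the illustration, and the block only runs if purrr is installed.

```r
if (requireNamespace("purrr", quietly = TRUE)) {
  out_dir <- tempfile("walk-demo-")
  dir.create(out_dir)

  # Hypothetical inputs: two small data frames and their target paths.
  tables <- list(cars = head(mtcars), flowers = head(iris))
  paths <- file.path(out_dir, paste0(names(tables), ".csv"))

  # walk2() calls the function for its side effect and returns
  # its first input invisibly, so nothing is printed here.
  purrr::walk2(tables, paths, \(tbl, path) write.csv(tbl, path, row.names = FALSE))

  print(list.files(out_dir))
}
```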
Parallel Processing (purrr 1.1.0+)
Use in_parallel() with mirai for scaling across cores.
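
A minimal sketch of the pattern, assuming purrr 1.1.0 or later plus the mirai and carrier packages. `slow_square()` is a hypothetical stand-in for real work, and the choice of two daemons is arbitrary.

```r
if (requireNamespace("purrr", quietly = TRUE) &&
    utils::packageVersion("purrr") >= "1.1.0" &&
    requireNamespace("mirai", quietly = TRUE) &&
    requireNamespace("carrier", quietly = TRUE)) {

  mirai::daemons(2)  # start two background R processes

  # Hypothetical slow task; Sys.sleep() stands in for real work.
  slow_square <- function(x) {
    Sys.sleep(0.1)
    x^2
  }

  # The function runs in a separate process, so anything it needs
  # (here slow_square) is passed explicitly to in_parallel().
  squares <- purrr::map(
    1:4,
    purrr::in_parallel(\(x) slow_square(x), slow_square = slow_square)
  )

  mirai::daemons(0)  # shut the daemons down again
  print(unlist(squares))
}
```

For a task this small, starting the daemons likely costs more than the work itself, which is the overhead case listed under "Hurts when" above.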
See purrr-patterns.md for all patterns.
Backend Tools for Performance
When speed is critical, consider:
- vctrs - Type-stable vector operations
- rlang - Metaprogramming
- data.table - Large data operations
Profile to identify whether these tools will help your specific bottleneck.
source: Sarah Johnson's gist https://gist.github.com/sj-io/3828d64d0969f2a0f05297e59e6c15ad