rust-performance

Performance Optimization

Profiling First

cargo install flamegraph && cargo flamegraph --bin myapp # CPU cargo bench # Benchmarks (criterion) valgrind --tool=cachegrind ./target/release/myapp # Cache analysis

Rule Guideline

Use mimalloc as global allocator in apps M-MIMALLOC-APPS

Profile hot paths early with criterion/divan M-HOTPATH

Optimize items/CPU-cycle, avoid empty cycles M-THROUGHPUT

yield_now().await every 10-100us in CPU loops M-YIELD-POINTS

Allocation Avoidance

// Cow: avoid allocation when not needed use std::borrow::Cow; fn to_uppercase(s: &str) -> Cow<'_, str> { if s.chars().all(|c| c.is_uppercase()) { Cow::Borrowed(s) } else { Cow::Owned(s.to_uppercase()) } }

// Reuse buffers across iterations let mut buffer = Vec::new(); for item in items { buffer.clear(); process(&mut buffer, item); }

// Pre-allocate capacity let mut v = Vec::with_capacity(10000);

Collection Choice

Need Collection Notes

Sequential access Vec<T>

Best cache locality

Random access by key HashMap<K, V>

O(1) lookup

Ordered keys BTreeMap<K, V>

O(log n) lookup

Small sets (<20) Vec<T>

linear search Lower overhead

FIFO queue VecDeque<T>

O(1) push/pop both ends

Frequent membership check HashSet<T>

O(1) vs O(n) for Vec

Fixed-size [T; N] array No heap, known size

Iterator Best Practices

// BAD: bounds checking on each access for i in 0..vec.len() { process(vec[i]); }

// GOOD: no bounds checking, idiomatic for item in &vec { process(item); }

// BAD: unnecessary intermediate allocation let filtered: Vec<_> = items.iter().filter(|x| x.valid).collect(); for item in filtered { process(item); }

// GOOD: chain iterators — no allocation for item in items.iter().filter(|x| x.valid) { process(item); }

// Counting? Don't collect. let count = items.iter().filter(|x| x.valid).count();

String Optimization

// BAD: O(n^2) allocations in loop let mut result = String::new(); for s in strings { result = result + &s; }

// GOOD: pre-calculate capacity let total_len: usize = strings.iter().map(|s| s.len()).sum(); let mut result = String::with_capacity(total_len); for s in strings { result.push_str(&s); }

// BEST: use join for simple concatenation let result = strings.join("");

// Prefer bytes over chars when ASCII // s.bytes() is faster than s.chars()

Parallelism with Rayon

use rayon::prelude::*;

// Sequential let sum: i32 = (0..1_000_000).map(|x| x * x).sum();

// Parallel (automatic work stealing) let sum: i32 = (0..1_000_000).into_par_iter().map(|x| x * x).sum();

// Parallel with custom chunk size let results: Vec<_> = data .par_chunks(1000) .map(|chunk| process_chunk(chunk)) .collect();

Memory Layout

// BAD: 24 bytes due to padding struct Bad { a: u8, b: u64, c: u8 } // 1+7pad + 8 + 1+7pad

// GOOD: 16 bytes — order fields largest-first struct Good { b: u64, a: u8, c: u8 } // 8 + 1 + 1 + 6pad

// Box large enum variants enum Message { Quit, Data(Box<[u8; 10000]>), // pointer-sized, not 10KB }

// Compile-time size assertion const _: () = assert!(std::mem::size_of::<MyStruct>() <= 64);

Release Build Settings

[profile.release] lto = true # Link-time optimization codegen-units = 1 # Slower compile, faster code panic = "abort" # Smaller binary, no unwinding overhead strip = true # Strip debug symbols

Criterion Benchmarks

use criterion::{criterion_group, criterion_main, Criterion};

fn benchmark_parse(c: &mut Criterion) { let input = "test data".repeat(1000); c.bench_function("parse_v1", |b| b.iter(|| parse_v1(&input))); c.bench_function("parse_v2", |b| b.iter(|| parse_v2(&input))); }

criterion_group!(benches, benchmark_parse); criterion_main!(benches);

rust-performance

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

elementor-hooks

elementor-themes

elementor-controls