# Performance Optimization Skill
Apply these principles when optimizing code for performance. Focus on the critical 3% where performance truly matters: in established engineering disciplines, a 12% improvement, easily obtained, is never considered marginal.
## Core Philosophy

- **Measure First**: Never optimize without profiling data
- **Estimate Costs**: Do back-of-envelope calculations before implementing
- **Avoid Work**: The fastest code is code that doesn't run
- **Reduce Allocations**: Memory allocation is often the hidden bottleneck
- **Cache Locality**: Memory access patterns dominate modern performance
## Reference Latency Numbers

Use these for back-of-envelope calculations:

| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| Compress 1 KB (Snappy) | 3 us |
| SSD random read (4 KB) | 20 us |
| Round trip in datacenter | 50 us |
| Disk seek | 5 ms |
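For example, a quick sketch of how these numbers feed an estimate (latency values taken from the table above; function and variable names are illustrative):

```python
# Back-of-envelope: N dependent memory accesses that miss cache vs hit L1.
L1_NS = 0.5            # L1 cache reference
MAIN_MEMORY_NS = 100.0 # main memory reference

def estimate_ns(accesses: int, latency_ns: float) -> float:
    """Total latency in nanoseconds for `accesses` sequential lookups."""
    return accesses * latency_ns

# 1M pointer chases that all miss cache cost ~100 ms of pure latency;
# the same accesses served from L1 cost ~0.5 ms.
miss = estimate_ns(1_000_000, MAIN_MEMORY_NS)  # 100,000,000 ns = 100 ms
hit = estimate_ns(1_000_000, L1_NS)            # 500,000 ns = 0.5 ms
```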
## Optimization Techniques (Priority Order)

### 1. Algorithmic Improvements (Highest Impact)

Always check algorithm complexity first:

- O(N²) → O(N log N): ~50,000x faster for N = 1M
- O(N) → O(1): unbounded improvement
Common patterns:

- Replace nested loops with hash lookups
- Replace sorted-list intersection with hash table lookups
- Process in reverse post-order to eliminate per-element checks
- Replace interval trees (O(log N)) with hash maps (O(1)) when range queries aren't needed
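The first pattern can be sketched as follows; `common_slow` and `common_fast` are illustrative names:

```python
# O(N*M) nested-loop membership test vs O(N+M) with a hash set.
def common_slow(a, b):
    return [x for x in a if x in b]  # list `in` scans b each time: O(N*M)

def common_fast(a, b):
    lookup = set(b)                  # one O(M) pass builds the hash set
    return [x for x in a if x in lookup]  # each check is O(1): O(N+M)
```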
### 2. Reduce Memory Allocations

Allocation is expensive (~25-100 ns per allocation, plus GC pressure):

```python
# BAD: allocates a new list on every call
def process(items):
    result = []                    # new allocation each call
    for item in items:
        result.append(transform(item))
    return result

# GOOD: pre-allocate or reuse an output buffer
def process(items, out=None):
    if out is None:
        out = [None] * len(items)  # sized once, up front
    for i, item in enumerate(items):
        out[i] = transform(item)
    return out
```
Techniques:

- Pre-size containers with `reserve()` or a known capacity
- Hoist temporary containers outside loops
- Reuse buffers across iterations (clear instead of recreate)
- Move instead of copy large structures
- Use stack allocation for bounded-lifetime objects
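A minimal sketch of buffer reuse across iterations; the doubling transform is purely for illustration:

```python
# Reuse one scratch buffer across iterations instead of reallocating.
def batch_sums(batches):
    scratch = []                  # allocated once, reused for every batch
    sums = []
    for batch in batches:
        scratch.clear()           # clear() keeps the backing storage
        scratch.extend(x * 2 for x in batch)
        sums.append(sum(scratch))
    return sums
```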
### 3. Compact Data Structures

Minimize memory footprint and cache lines touched:

```rust
// BAD: 24 bytes due to padding (#[repr(C)] keeps fields in
// declaration order; without it Rust may reorder them itself)
#[repr(C)]
struct Item {
    flag: bool,  // 1 byte + 7 bytes padding
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes + 4 bytes padding
}

// GOOD: 16 bytes after reordering fields largest-first
#[repr(C)]
struct ItemPacked {
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes
    flag: bool,  // 1 byte + 3 bytes padding
}
```
Techniques:

- Reorder struct fields by size (largest first)
- Use smaller numeric types when the range permits (`u8` vs `i32`)
- Replace 64-bit pointers with 32-bit indices
- Keep hot fields together; move cold fields to a separate struct
- Use bit vectors for small-domain sets
- Flatten nested maps: `Map<A, Map<B, C>>` → `Map<(A, B), C>`
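The map-flattening technique, sketched in Python with made-up region/service data:

```python
# Flatten Map<A, Map<B, C>> into Map<(A, B), C>: one hash lookup instead
# of two, and no per-key inner-dict allocations.
nested = {"us": {"web": 3, "api": 5}, "eu": {"web": 2}}

flat = {(region, svc): n
        for region, inner in nested.items()
        for svc, n in inner.items()}

# A single tuple-keyed lookup replaces the two-level chain:
count = flat[("us", "api")]   # instead of nested["us"]["api"]
```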
### 4. Fast Paths for Common Cases

Optimize the common case without hurting rare cases:

```python
# BAD: always takes the slow generic path
def parse_varint(data):
    return generic_varint_parser(data)

# GOOD: fast path for the common single-byte case
def parse_varint(data):
    if data[0] < 128:                   # single byte: ~90% of cases
        return data[0], 1
    return generic_varint_parser(data)  # rare multi-byte case
```
Techniques:

- Handle common dimensions inline (1-D, 2-D tensors)
- Check for empty/trivial inputs early
- Specialize for common sizes (small strings, few elements)
- Process trailing elements separately to avoid slow generic code
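A minimal sketch of the early-check techniques (function name and cases are illustrative):

```python
# Fast paths: handle empty and single-element inputs before the
# general path runs.
def total(values):
    if not values:        # trivial input: no work at all
        return 0
    if len(values) == 1:  # common one-element case: no loop
        return values[0]
    return sum(values)    # general path
```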
### 5. Precompute Expensive Information

Trade memory for compute when beneficial:

```python
# BAD: recomputes the lowercase + membership test on every access
def is_vowel(char):
    return char.lower() in 'aeiou'

# GOOD: 256-entry lookup table built once at import time
VOWEL_TABLE = [chr(i).lower() in 'aeiou' for i in range(256)]

def is_vowel(char):
    return VOWEL_TABLE[ord(char)]
```
Techniques:

- Precompute flags/properties at construction time
- Build lookup tables for character classification
- Cache expensive computed properties
- Validate at boundaries once, not repeatedly inside
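Caching an expensive computed property can be sketched with the standard library's `functools.cached_property` (the `Document` class is illustrative):

```python
from functools import cached_property

class Document:
    def __init__(self, text: str):
        self.text = text

    @cached_property
    def word_count(self) -> int:
        # Computed on first access, then stored on the instance;
        # later accesses return the stored value.
        return len(self.text.split())
```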
### 6. Bulk/Batch APIs

Amortize fixed costs across multiple operations:

```python
# BAD: N round trips
for item in items:
    result = db.lookup(item)

# GOOD: 1 round trip
results = db.lookup_many(items)
```
Design APIs that support:

- Batch lookups instead of individual fetches
- Vectorized operations over element-wise loops
- Streaming interfaces for large datasets
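One way to sketch a batch lookup on top of a backend that accepts many keys per call; `fetch_batch` is a hypothetical callable, not a real API:

```python
# Amortize per-call overhead by chunking requests: one round trip per
# chunk instead of one per key.
def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def lookup_all(keys, fetch_batch, batch_size=100):
    results = {}
    for chunk in chunked(keys, batch_size):
        results.update(fetch_batch(chunk))  # one round trip per chunk
    return results
```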
### 7. Avoid Unnecessary Work

```python
# BAD: always computes the expensive value
def process(data, config):
    expensive = compute_expensive(data)  # runs unconditionally
    if config.needs_expensive:
        use(expensive)

# GOOD: defer until needed
def process(data, config):
    if config.needs_expensive:
        expensive = compute_expensive(data)  # only when needed
        use(expensive)
```
Techniques:

- Lazy evaluation for expensive operations
- Short-circuit evaluation in conditions
- Move loop-invariant code outside loops
- Specialize instead of using general-purpose libraries in hot paths
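Loop-invariant hoisting can be sketched as follows (illustrative function):

```python
# The mean does not change inside the loop, so compute it once
# instead of once per element.
def filter_above_mean(values):
    mean = sum(values) / len(values)  # loop-invariant: hoisted
    return [v for v in values if v > mean]
```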
### 8. Help the Compiler/Runtime

Lower-level optimizations, for when profiling shows the need:

```rust
// Avoid function call overhead in hot loops
#[inline(always)]
fn hot_function(x: i32) -> i32 {
    x * 2
}

// Copy to a local variable for better alias analysis
fn process(data: &mut [i32], factor: &i32) {
    let f = *factor;  // compiler knows this won't change in the loop
    for x in data {
        *x *= f;
    }
}
```
Techniques:

- Use raw pointers/indices instead of iterators in critical loops
- Hand-unroll very hot loops (4+ iterations)
- Move slow-path code to separate non-inlined functions
- Avoid abstractions that hide costs in hot paths
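The examples above are Rust-specific, but the same idea applies to interpreted runtimes. A CPython-flavored sketch that hoists an attribute lookup out of a hot loop (function name is illustrative):

```python
# Bind the bound method to a local name so the hot loop avoids a
# repeated attribute lookup on every iteration.
def double_all(items):
    out = []
    append = out.append  # local binding: looked up once, not per element
    for x in items:
        append(x * 2)
    return out
```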
### 9. Reduce Synchronization

Minimize lock contention and atomic operations:

- Default to thread-compatible (external synchronization), not thread-safe
- Sample statistics (e.g. 1-in-32) instead of tracking everything
- Batch updates to shared state
- Use thread-local storage for per-thread data
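A sketch combining thread-local storage with batched updates to shared state (class and parameter names are illustrative):

```python
import threading

# Per-thread counts flushed to a shared total in batches: each thread
# takes the lock once per `flush_every` increments, not once per increment.
class BatchedCounter:
    def __init__(self, flush_every=32):
        self.total = 0
        self._lock = threading.Lock()
        self._local = threading.local()   # per-thread pending count
        self._flush_every = flush_every

    def increment(self):
        pending = getattr(self._local, "pending", 0) + 1
        if pending >= self._flush_every:
            with self._lock:              # rare: one lock per batch
                self.total += pending
            pending = 0
        self._local.pending = pending

    def flush(self):
        pending = getattr(self._local, "pending", 0)
        if pending:
            with self._lock:
                self.total += pending
            self._local.pending = 0
```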
## Profiling Workflow

1. **Identify hotspots**: Use a profiler (pprof, perf, py-spy)
2. **Measure baseline**: Write a benchmark before optimizing
3. **Estimate improvement**: Calculate the expected gain
4. **Implement the change**: Focus on one optimization at a time
5. **Verify improvement**: Run the benchmark, confirm the gain
6. **Check for regressions**: Ensure no other code paths slowed down
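Steps 2 and 5 can be sketched with the standard library's `timeit`; the two join functions are illustrative:

```python
import timeit

def slow_join(n):
    s = ""
    for i in range(n):
        s += str(i)  # repeated concatenation; may re-copy the string
    return s

def fast_join(n):
    return "".join(str(i) for i in range(n))

# Benchmark both versions under identical conditions before claiming a win.
baseline = timeit.timeit(lambda: slow_join(1000), number=100)
optimized = timeit.timeit(lambda: fast_join(1000), number=100)
```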
## When the Profile Is Flat (No Clear Hotspots)

- Pursue many small 1-2% improvements; they compound
- Look for restructuring opportunities higher in the call stack
- Profile allocations specifically (often a hidden cost)
- Gather hardware performance counters (cache misses, branch mispredicts)
- Consider whether the code is already well-optimized
## Anti-Patterns to Avoid

- Premature optimization in non-critical code
- Optimizing without measurement: changes based on intuition
- Micro-optimizing while ignoring algorithmic complexity
- Copying when moving would suffice
- Growing containers one element at a time (quadratic overall)
- Allocating in loops when reuse is possible
- String formatting in hot paths (use pre-built templates)
- Regex when simple string matching suffices
## Estimation Template

Before optimizing, estimate:

```
Operation cost:      ___ ns/us/ms
Frequency:           ___ times per second/request
Total time:          cost × frequency = ___
Improvement target:  ___% reduction
Expected new time:   ___
Is this worth it?    [ ] Yes  [ ] No
```
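A worked instance of the template, with made-up numbers:

```python
# Hypothetical scenario: a 20 us SSD read performed 5,000 times/second.
cost_us = 20.0
per_second = 5_000
total_us = cost_us * per_second            # 100,000 us = 100 ms of every second
reduction = 0.80                           # target: cut 80% via caching
new_total_us = total_us * (1 - reduction)  # ~20,000 us = 20 ms
# 80 ms of each second reclaimed: clearly worth it.
```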