# Performance Optimization Skill
Apply these principles when optimizing code for performance. Focus on the critical 3% where performance truly matters: in established engineering disciplines, a 12% improvement, easily obtained, is never considered marginal.
## Core Philosophy

- **Measure First**: Never optimize without profiling data
- **Estimate Costs**: Do back-of-envelope calculations before implementing
- **Avoid Work**: The fastest code is code that doesn't run
- **Reduce Allocations**: Memory allocation is often the hidden bottleneck
- **Cache Locality**: Memory access patterns dominate modern performance
## Reference Latency Numbers

Use these for back-of-envelope calculations:

| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| Compress 1 KB (Snappy) | 3 us |
| SSD random read (4 KB) | 20 us |
| Round trip in datacenter | 50 us |
| Disk seek | 5 ms |
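For example, a quick sketch of how these numbers feed an estimate (latency values taken from the table above; function and variable names are illustrative):

```python
# Back-of-envelope: N dependent memory accesses that miss cache vs hit L1.
L1_NS = 0.5            # L1 cache reference
MAIN_MEMORY_NS = 100.0 # main memory reference

def estimate_ns(accesses: int, latency_ns: float) -> float:
    """Total latency in nanoseconds for `accesses` sequential lookups."""
    return accesses * latency_ns

# 1M pointer chases that all miss cache cost ~100 ms of pure latency;
# the same accesses served from L1 cost ~0.5 ms.
miss = estimate_ns(1_000_000, MAIN_MEMORY_NS)  # 100,000,000 ns = 100 ms
hit = estimate_ns(1_000_000, L1_NS)            # 500,000 ns = 0.5 ms
```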
## Optimization Techniques (Priority Order)

### 1. Algorithmic Improvements (Highest Impact)

Always check algorithm complexity first:

- O(N²) → O(N log N): ~50,000x faster for N = 1M
- O(N) → O(1): unbounded improvement
Common patterns:

- Replace nested loops with hash lookups
- Replace sorted-list intersection with hash table lookups
- Process in reverse post-order to eliminate per-element checks
- Replace interval trees (O(log N)) with hash maps (O(1)) when range queries aren't needed
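The first pattern can be sketched as follows; `common_slow` and `common_fast` are illustrative names:

```python
# O(N*M) nested-loop membership test vs O(N+M) with a hash set.
def common_slow(a, b):
    return [x for x in a if x in b]  # list `in` scans b each time: O(N*M)

def common_fast(a, b):
    lookup = set(b)                  # one O(M) pass builds the hash set
    return [x for x in a if x in lookup]  # each check is O(1): O(N+M)
```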
### 2. Reduce Memory Allocations

Allocation is expensive (~25-100 ns per allocation, plus GC pressure):

```python
# BAD: allocates a new list on every call
def process(items):
    result = []                    # new allocation each call
    for item in items:
        result.append(transform(item))
    return result

# GOOD: pre-allocate or reuse an output buffer
def process(items, out=None):
    if out is None:
        out = [None] * len(items)  # sized once, up front
    for i, item in enumerate(items):
        out[i] = transform(item)
    return out
```
Techniques:

- Pre-size containers with `reserve()` or a known capacity
- Hoist temporary containers outside loops
- Reuse buffers across iterations (clear instead of recreate)
- Move instead of copy large structures
- Use stack allocation for bounded-lifetime objects
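A minimal sketch of buffer reuse across iterations; the doubling transform is purely for illustration:

```python
# Reuse one scratch buffer across iterations instead of reallocating.
def batch_sums(batches):
    scratch = []                  # allocated once, reused for every batch
    sums = []
    for batch in batches:
        scratch.clear()           # clear() keeps the backing storage
        scratch.extend(x * 2 for x in batch)
        sums.append(sum(scratch))
    return sums
```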
### 3. Compact Data Structures

Minimize memory footprint and cache lines touched:

```rust
// BAD: 24 bytes due to padding (#[repr(C)] keeps fields in
// declaration order; without it Rust may reorder them itself)
#[repr(C)]
struct Item {
    flag: bool,  // 1 byte + 7 bytes padding
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes + 4 bytes padding
}

// GOOD: 16 bytes after reordering fields largest-first
#[repr(C)]
struct ItemPacked {
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes
    flag: bool,  // 1 byte + 3 bytes padding
}
```
Techniques:

- Reorder struct fields by size (largest first)
- Use smaller numeric types when the range permits (`u8` vs `i32`)
- Replace 64-bit pointers with 32-bit indices
- Keep hot fields together; move cold fields to a separate struct
- Use bit vectors for small-domain sets
- Flatten nested maps: `Map<A, Map<B, C>>` → `Map<(A, B), C>`
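The map-flattening technique, sketched in Python with made-up region/service data:

```python
# Flatten Map<A, Map<B, C>> into Map<(A, B), C>: one hash lookup instead
# of two, and no per-key inner-dict allocations.
nested = {"us": {"web": 3, "api": 5}, "eu": {"web": 2}}

flat = {(region, svc): n
        for region, inner in nested.items()
        for svc, n in inner.items()}

# A single tuple-keyed lookup replaces the two-level chain:
count = flat[("us", "api")]   # instead of nested["us"]["api"]
```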
### 4. Fast Paths for Common Cases

Optimize the common case without hurting rare cases:

```python
# BAD: always takes the slow generic path
def parse_varint(data):
    return generic_varint_parser(data)

# GOOD: fast path for the common single-byte case
def parse_varint(data):
    if data[0] < 128:                   # single byte: ~90% of cases
        return data[0], 1
    return generic_varint_parser(data)  # rare multi-byte case
```
Techniques:

- Handle common dimensions inline (1-D, 2-D tensors)
- Check for empty/trivial inputs early
- Specialize for common sizes (small strings, few elements)
- Process trailing elements separately to avoid slow generic code
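A minimal sketch of the early-check techniques (function name and cases are illustrative):

```python
# Fast paths: handle empty and single-element inputs before the
# general path runs.
def total(values):
    if not values:        # trivial input: no work at all
        return 0
    if len(values) == 1:  # common one-element case: no loop
        return values[0]
    return sum(values)    # general path
```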
### 5. Precompute Expensive Information

Trade memory for compute when beneficial:

```python
# BAD: recomputes the lowercase + membership test on every access
def is_vowel(char):
    return char.lower() in 'aeiou'

# GOOD: 256-entry lookup table built once at import time
VOWEL_TABLE = [chr(i).lower() in 'aeiou' for i in range(256)]

def is_vowel(char):
    return VOWEL_TABLE[ord(char)]
```
Techniques:

- Precompute flags/properties at construction time
- Build lookup tables for character classification
- Cache expensive computed properties
- Validate at boundaries once, not repeatedly inside
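Caching an expensive computed property can be sketched with the standard library's `functools.cached_property` (the `Document` class is illustrative):

```python
from functools import cached_property

class Document:
    def __init__(self, text: str):
        self.text = text

    @cached_property
    def word_count(self) -> int:
        # Computed on first access, then stored on the instance;
        # later accesses return the stored value.
        return len(self.text.split())
```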
### 6. Bulk/Batch APIs

Amortize fixed costs across multiple operations:

```python
# BAD: N round trips
for item in items:
    result = db.lookup(item)

# GOOD: 1 round trip
results = db.lookup_many(items)
```
Design APIs that support:

- Batch lookups instead of individual fetches
- Vectorized operations over element-wise loops
- Streaming interfaces for large datasets
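One way to sketch a batch lookup on top of a backend that accepts many keys per call; `fetch_batch` is a hypothetical callable, not a real API:

```python
# Amortize per-call overhead by chunking requests: one round trip per
# chunk instead of one per key.
def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def lookup_all(keys, fetch_batch, batch_size=100):
    results = {}
    for chunk in chunked(keys, batch_size):
        results.update(fetch_batch(chunk))  # one round trip per chunk
    return results
```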
### 7. Avoid Unnecessary Work

```python
# BAD: always computes the expensive value
def process(data, config):
    expensive = compute_expensive(data)  # runs unconditionally
    if config.needs_expensive:
        use(expensive)

# GOOD: defer until needed
def process(data, config):
    if config.needs_expensive:
        expensive = compute_expensive(data)  # only when needed
        use(expensive)
```
Techniques:

- Lazy evaluation for expensive operations
- Short-circuit evaluation in conditions
- Move loop-invariant code outside loops
- Specialize instead of using general-purpose libraries in hot paths
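Loop-invariant hoisting can be sketched as follows (illustrative function):

```python
# The mean does not change inside the loop, so compute it once
# instead of once per element.
def filter_above_mean(values):
    mean = sum(values) / len(values)  # loop-invariant: hoisted
    return [v for v in values if v > mean]
```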
### 8. Help the Compiler/Runtime

Lower-level optimizations, for when profiling shows the need:

```rust
// Avoid function call overhead in hot loops
#[inline(always)]
fn hot_function(x: i32) -> i32 {
    x * 2
}

// Copy to a local variable for better alias analysis
fn process(data: &mut [i32], factor: &i32) {
    let f = *factor;  // compiler knows this won't change in the loop
    for x in data {
        *x *= f;
    }
}
```
Techniques:

- Use raw pointers/indices instead of iterators in critical loops
- Hand-unroll very hot loops (4+ iterations)
- Move slow-path code to separate non-inlined functions
- Avoid abstractions that hide costs in hot paths
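The examples above are Rust-specific, but the same idea applies to interpreted runtimes. A CPython-flavored sketch that hoists an attribute lookup out of a hot loop (function name is illustrative):

```python
# Bind the bound method to a local name so the hot loop avoids a
# repeated attribute lookup on every iteration.
def double_all(items):
    out = []
    append = out.append  # local binding: looked up once, not per element
    for x in items:
        append(x * 2)
    return out
```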
### 9. Reduce Synchronization

Minimize lock contention and atomic operations:

- Default to thread-compatible (external synchronization), not thread-safe
- Sample statistics (e.g. 1-in-32) instead of tracking everything
- Batch updates to shared state
- Use thread-local storage for per-thread data
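A sketch combining thread-local storage with batched updates to shared state (class and parameter names are illustrative):

```python
import threading

# Per-thread counts flushed to a shared total in batches: each thread
# takes the lock once per `flush_every` increments, not once per increment.
class BatchedCounter:
    def __init__(self, flush_every=32):
        self.total = 0
        self._lock = threading.Lock()
        self._local = threading.local()   # per-thread pending count
        self._flush_every = flush_every

    def increment(self):
        pending = getattr(self._local, "pending", 0) + 1
        if pending >= self._flush_every:
            with self._lock:              # rare: one lock per batch
                self.total += pending
            pending = 0
        self._local.pending = pending

    def flush(self):
        pending = getattr(self._local, "pending", 0)
        if pending:
            with self._lock:
                self.total += pending
            self._local.pending = 0
```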
## Profiling Workflow

1. **Identify hotspots**: Use a profiler (pprof, perf, py-spy)
2. **Measure baseline**: Write a benchmark before optimizing
3. **Estimate improvement**: Calculate the expected gain
4. **Implement the change**: Focus on one optimization at a time
5. **Verify improvement**: Run the benchmark, confirm the gain
6. **Check for regressions**: Ensure no other code paths slowed down
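Steps 2 and 5 can be sketched with the standard library's `timeit`; the two join functions are illustrative:

```python
import timeit

def slow_join(n):
    s = ""
    for i in range(n):
        s += str(i)  # repeated concatenation; may re-copy the string
    return s

def fast_join(n):
    return "".join(str(i) for i in range(n))

# Benchmark both versions under identical conditions before claiming a win.
baseline = timeit.timeit(lambda: slow_join(1000), number=100)
optimized = timeit.timeit(lambda: fast_join(1000), number=100)
```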
## When the Profile Is Flat (No Clear Hotspots)

- Pursue many small 1-2% improvements; they compound
- Look for restructuring opportunities higher in the call stack
- Profile allocations specifically (often a hidden cost)
- Gather hardware performance counters (cache misses, branch mispredicts)
- Consider whether the code is already well-optimized
## Anti-Patterns to Avoid

- Premature optimization in non-critical code
- Optimizing without measurement: changes based on intuition
- Micro-optimizing while ignoring algorithmic complexity
- Copying when moving would suffice
- Growing containers one element at a time (quadratic overall)
- Allocating in loops when reuse is possible
- String formatting in hot paths (use pre-built templates)
- Regex when simple string matching suffices
## Estimation Template

Before optimizing, estimate:

```
Operation cost:      ___ ns/us/ms
Frequency:           ___ times per second/request
Total time:          cost × frequency = ___
Improvement target:  ___% reduction
Expected new time:   ___
Is this worth it?    [ ] Yes  [ ] No
```
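A worked instance of the template, with made-up numbers:

```python
# Hypothetical scenario: a 20 us SSD read performed 5,000 times/second.
cost_us = 20.0
per_second = 5_000
total_us = cost_us * per_second            # 100,000 us = 100 ms of every second
reduction = 0.80                           # target: cut 80% via caching
new_total_us = total_us * (1 - reduction)  # ~20,000 us = 20 ms
# 80 ms of each second reclaimed: clearly worth it.
```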