Python & Rust Performance Hints (Jeff Dean & Sanjay Ghemawat style)
This skill packages key ideas from Abseil's Performance Hints document, adapted for Python and Rust development.
Use it to:
- review Python/Rust code for performance risks
- propose high-impact optimizations with explicit tradeoffs
- design APIs/data structures that keep future optimizations possible
- write an experiment plan (profile + microbenchmark) to validate changes
Scope and guardrails
- Scope: single-process / single-binary performance (CPU, memory, allocations, cache behavior).
- Do not: change externally observable behavior unless the user explicitly agrees.
- Do not: introduce undefined behavior, data races, or brittle "clever" micro-opts without evidence.
- Default philosophy: choose the faster alternative when it doesn't materially harm readability or complexity; otherwise, measure first.
When to apply
Use this skill when the task involves any of:
- reducing latency or improving throughput
- cutting memory footprint or allocation rate
- improving cache locality / reducing cache misses
- designing performant APIs (bulk ops, view types, threading model)
- reviewing performance-sensitive Python or Rust changes
- interpreting a flat profile and finding next steps
What to ask for (minimum inputs)
If you don't have enough information, ask for the smallest set that changes your recommendation quality:
- Goal: latency vs throughput vs memory (and the SLO, if any)
- Where: hot path vs init vs test-only (and typical input sizes)
- Evidence: profile / flame graph / perf counters / allocation profile (if available)
- Constraints: correctness constraints, API constraints, thread-safety requirements
If no such evidence exists yet, proceed with static analysis plus a "what to measure first" plan.
Workflow: how an agent should use these hints
Step 1 — classify the code
- Test code: mostly care about asymptotic complexity and test runtime.
- Application code: separate init/cold paths from the hot path.
- Library code: prefer "safe, low-complexity" performance techniques because you can't predict callers.
Step 2 — do a back-of-the-envelope estimate
Before implementing changes, estimate what might dominate:
- Count expensive operations (seeks, round-trips, allocations, bytes touched, comparisons, etc.).
- Multiply by a rough per-operation cost (see the latency tables below).
- If latency matters and there is concurrency, consider overlap.
Reference latency table (rough order-of-magnitude)
| Operation | Approx time |
| --- | --- |
| L1 cache reference | 0.5 ns |
| L2 cache reference | 3 ns |
| Branch mispredict | 5 ns |
| Mutex lock/unlock (uncontended) | 15 ns |
| Main memory reference | 50 ns |
| Python function call overhead | 50-100 ns |
| Rust function call (non-inlined) | 1-10 ns |
| PyO3 GIL acquire/release | 100-500 ns |
| Compress 1K bytes with Snappy | 1,000 ns |
| Read 4KB from SSD | 20,000 ns |
| Round trip within same datacenter | 50,000 ns |
| Read 1MB sequentially from memory | 64,000 ns |
| Read 1MB over 100 Gbps network | 100,000 ns |
| Read 1MB from SSD | 1,000,000 ns (1 ms) |
| Disk seek | 5,000,000 ns (5 ms) |
| Read 1MB sequentially from disk | 10,000,000 ns (10 ms) |
| Send packet CA→Netherlands→CA | 150,000,000 ns (150 ms) |
Python-specific costs
| Operation | Approx time |
| --- | --- |
| dict lookup | 20-50 ns |
| list.append | 20-40 ns |
| getattr on object | 50-100 ns |
| isinstance check | 30-60 ns |
| JSON parse (stdlib), 1KB | 50-100 us |
| msgspec parse, 1KB | 5-15 us |
| Django ORM query (simple) | 1-10 ms |
| Django ORM async query | 0.5-5 ms |
Rust-specific costs
| Operation | Approx time |
| --- | --- |
| HashMap lookup | 10-30 ns |
| Vec push (no realloc) | 2-10 ns |
| String allocation (small) | 20-50 ns |
| Arc clone | 10-20 ns |
| tokio task spawn | 200-500 ns |
| async channel send | 50-200 ns |
Estimation examples (templates)
Web request through PyO3 bridge:
- Rust HTTP parsing: ~1us
- GIL acquisition: ~200ns
- Python handler execution: ~50us (simple) to ~5ms (with ORM)
- Response serialization (msgspec): ~10us
- GIL release + Rust response: ~200ns
- Total: ~50us to ~5ms depending on handler complexity
Batch processing 10K items:
- Per-item Python function call: 10K × 100ns = 1ms (overhead alone)
- With msgspec struct validation: 10K × 500ns = 5ms
- With dict allocation per item: 10K × 50ns = 0.5ms (plus GC pressure)
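These per-item multiplications are easy to script so the estimate can be rerun as assumptions change. A minimal sketch, using illustrative cost constants taken from the tables above (order-of-magnitude assumptions, not measurements):

```python
# Back-of-the-envelope estimate for a 10K-item batch.
# Per-item costs (ns) are rough order-of-magnitude assumptions from the tables above.
N_ITEMS = 10_000

PER_ITEM_COSTS_NS = {
    "python_function_call": 100,
    "msgspec_struct_validation": 500,
    "dict_allocation_per_item": 50,
}

def estimate_ms(per_item_ns: float, n_items: int = N_ITEMS) -> float:
    """Total time in milliseconds for n_items at the given per-item cost."""
    return per_item_ns * n_items / 1_000_000

for name, cost_ns in PER_ITEM_COSTS_NS.items():
    print(f"{name}: ~{estimate_ms(cost_ns):.1f} ms for {N_ITEMS:,} items")
```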
Step 3 — measure before paying complexity
When you can, measure to validate impact:
Python profiling:
- py-spy as a sampling profiler (low overhead, production-safe)
- cProfile for deterministic profiling (high overhead)
- memray or memory_profiler for allocation profiling
- scalene for CPU + memory + GPU profiling
Rust profiling:
- perf as the Linux sampling profiler
- the flamegraph crate for generating flame graphs
- criterion for microbenchmarks
- dhat for allocation profiling
Watch for GIL contention in Python: contention can lower CPU usage and hide the "real" bottleneck.
Watch for async runtime overhead in Rust: too many small tasks can hurt more than help.
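When a change is local enough, a quick microbenchmark settles the question faster than a full profile. A minimal sketch using the stdlib timeit module; transform_old and transform_new are hypothetical stand-ins for the before/after implementations:

```python
import timeit

# Hypothetical before/after implementations; substitute the real candidates.
def transform_old(items):
    out = []
    for item in items:
        out.append(item * 2)
    return out

def transform_new(items):
    return [item * 2 for item in items]

items = list(range(10_000))

for name, fn in [("old", transform_old), ("new", transform_new)]:
    # Best of 5 runs of 100 calls each; taking the minimum reduces scheduling noise.
    best = min(timeit.repeat(lambda: fn(items), repeat=5, number=100))
    print(f"{name}: {best / 100 * 1e6:.1f} us per call")
```

For Rust, criterion plays the same role with statistically sounder defaults.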
Step 4 — pick the biggest lever first
Prioritize in this order unless evidence suggests otherwise:
1. Algorithmic complexity wins (O(N²) → O(N log N) or O(N))
2. Data structure choice / memory layout (cache locality; fewer cache lines)
3. Allocation reduction (fewer allocs, better reuse)
4. Avoid unnecessary work (fast paths, precompute, defer)
5. Language boundary optimization (minimize Python↔Rust crossings)
6. Compiler/interpreter friendliness (simplify hot loops, reduce abstraction overhead)
Step 5 — produce an LLM-friendly output
When you respond to the user, use this structure:
- Hot path hypothesis (what you think dominates, and why)
- Top issues (ranked): issue → evidence/estimate → proposed fix → expected impact
- Patch sketch (minimal code changes or pseudocode)
- Tradeoffs & risks (correctness, memory, API, complexity)
- Measurement plan (what to profile/benchmark and success criteria)
Techniques and patterns
- API design for performance
Use bulk APIs to amortize overhead
When: callers do N similar operations (lookups, deletes, updates, decoding, locking).
Why: reduce boundary crossings and repeated fixed costs (locks, dispatch, decoding, syscalls).
Python patterns:
```python
# BAD: N database round trips
for user_id in user_ids:
    user = await User.objects.aget(id=user_id)

# GOOD: 1 database round trip
users = await sync_to_async(list)(User.objects.filter(id__in=user_ids))
users_by_id = {u.id: u for u in users}
```
Rust patterns:
```rust
// BAD: N individual operations
for id in ids {
    let result = cache.get(&id)?;
    // ...
}

// GOOD: Batch lookup
let results = cache.get_many(&ids)?;
```
Prefer view types for function arguments
Python:
- Use Sequence[T] or Iterable[T] instead of list[T] when you don't mutate
- Accept bytes or memoryview instead of copying to bytearray
- Use msgspec.Struct for zero-copy deserialization
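A minimal Python sketch of these signatures (the function names are illustrative):

```python
from collections.abc import Sequence

# Accepts any sequence (list, tuple, ...) and promises not to mutate it.
def total_length(names: Sequence[str]) -> int:
    return sum(len(name) for name in names)

# Accepts bytes, bytearray, or memoryview; slicing the view does not copy.
def header_checksum(data: bytes | bytearray | memoryview) -> int:
    view = memoryview(data)[:64]
    return sum(view) & 0xFF
```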
Rust:
- Use &[T] or impl AsRef<[T]> instead of Vec<T> when you don't need ownership
- Use &str instead of String for read-only string access
- Use Cow<'_, T> when you might need to own or borrow
Thread-compatible vs thread-safe types
Python:
- Default to thread-compatible (synchronization left to the GIL or to explicit caller-side locks)
- Use threading.local() for per-thread state
- Prefer asyncio over threads for I/O-bound work
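A minimal sketch of the per-thread state pattern with threading.local (expensive_setup is a placeholder for whatever is costly to build and unsafe to share):

```python
import threading

_local = threading.local()

def expensive_setup():
    # Placeholder: build a parser, client, reusable buffer, etc.
    return {"calls": 0}

def get_state():
    # Each thread lazily creates and then reuses its own state; no locking needed.
    if not hasattr(_local, "state"):
        _local.state = expensive_setup()
    return _local.state

def handle(payload: bytes) -> int:
    state = get_state()
    state["calls"] += 1
    return len(payload)
```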
Rust:
- Default to Send + Sync for shared state
- Use Arc<RwLock<T>> only when needed; prefer message passing
- Consider dashmap or sharded maps for high-contention scenarios
- Algorithmic improvements
The rare-but-massive wins.
Reduce complexity class
Common transformations:
- O(N²) → O(N log N) or O(N)
- O(N log N) sorted-list intersection → O(N) using a hash set
- O(log N) tree lookup → O(1) using a hash lookup
Python-specific patterns
```python
# BAD: O(N²) - checking membership in a list
for item in items:
    if item in seen_list:  # O(N) lookup
        continue
    seen_list.append(item)

# GOOD: O(N) - using a set
seen = set()
for item in items:
    if item in seen:  # O(1) lookup
        continue
    seen.add(item)

# BETTER: O(N) - using dict.fromkeys for order-preserving deduplication
unique_items = list(dict.fromkeys(items))
```
Rust-specific patterns
```rust
// BAD: O(N²) - nested iteration
for a in &items_a {
    for b in &items_b {
        if a.key == b.key {
            // ...
        }
    }
}

// GOOD: O(N) - hash lookup
let b_map: HashMap<_, _> = items_b.iter().map(|b| (&b.key, b)).collect();
for a in &items_a {
    if let Some(b) = b_map.get(&a.key) {
        // ...
    }
}
```
- Better memory representation and cache locality
Python: Prefer __slots__ (and slotted dataclasses)

```python
# BAD: Regular class with a per-instance __dict__
class Item:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# GOOD: __slots__-based class (smaller memory, faster attribute access)
class Item:
    __slots__ = ('x', 'y', 'z')

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# BETTER: msgspec.Struct (even more compact, fast serialization)
class Item(msgspec.Struct):
    x: int
    y: int
    z: int
```
Python: Prefer numpy/array for numeric data
```python
# BAD: List of floats (each element is a full Python object)
values = [1.0, 2.0, 3.0, ...]  # ~28 bytes per float

# GOOD: numpy array (contiguous, cache-friendly)
values = np.array([1.0, 2.0, 3.0, ...])  # 8 bytes per float

# GOOD: array module for simpler cases
from array import array
values = array('d', [1.0, 2.0, 3.0, ...])
```
Rust: Memory layout and padding
```rust
// BAD: 24 bytes due to padding (when the layout is fixed, e.g. #[repr(C)])
struct Item {
    flag: bool,  // 1 byte + 7 padding
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes + 4 padding
}

// GOOD: 16 bytes with reordering
struct Item {
    value: i64,  // 8 bytes
    count: i32,  // 4 bytes
    flag: bool,  // 1 byte + 3 padding
}

// Note: the default Rust repr may reorder fields for you; use #[repr(C)] or
// #[repr(packed)] when the ABI or exact layout matters.
```
Rust: Indices instead of pointers
```rust
// Pointer-heavy: 8 bytes per reference, poor cache locality
struct Node {
    data: i32,
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

// Index-based: smaller references, contiguous storage, better cache locality
// (use Option<NonZeroU32> if you want the index to stay at 4 bytes)
struct Tree {
    nodes: Vec<NodeData>,
}

struct NodeData {
    data: i32,
    left: Option<u32>,   // index into nodes
    right: Option<u32>,  // index into nodes
}
```
Rust: SmallVec and tinyvec for small collections
```rust
use smallvec::SmallVec;

// Stack allocation for up to 8 elements, heap only if larger
let mut items: SmallVec<[Item; 8]> = SmallVec::new();
```
- Reduce allocations
Python: Avoid unnecessary allocations
```python
# BAD: Creates a new list on every call
def process(items):
    return [transform(item) for item in items]

# GOOD: Generator for streaming (no intermediate list)
def process(items):
    for item in items:
        yield transform(item)

# GOOD: Pre-allocate when the size is known
def process(items):
    result = [None] * len(items)
    for i, item in enumerate(items):
        result[i] = transform(item)
    return result
```
Python: Reuse objects
```python
# BAD: New dict on every iteration
for data in stream:
    result = {}  # allocation
    result['key'] = process(data)
    yield result

# GOOD: Reuse the dict (only if consumers don't hold a reference)
result = {}
for data in stream:
    result.clear()
    result['key'] = process(data)
    yield result  # Caveat: only if the consumer processes immediately
```
Python: Use __slots__ to reduce memory

```python
# Without __slots__: ~152 bytes per instance
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# With __slots__: ~56 bytes per instance
class Point:
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y
```
Rust: Pre-allocate with capacity
```rust
// BAD: Multiple reallocations
let mut results = Vec::new();
for item in items {
    results.push(transform(item));
}

// GOOD: Single allocation
let mut results = Vec::with_capacity(items.len());
for item in items {
    results.push(transform(item));
}

// BETTER: Use collect, which uses the iterator's size hint
let results: Vec<_> = items.iter().map(transform).collect();
```
Rust: Avoid cloning when borrowing suffices
```rust
// BAD: Unnecessary clone
fn process(data: &Data) -> String {
    let s = data.name.clone();  // allocation
    format!("Hello, {}", s)
}

// GOOD: Borrow
fn process(data: &Data) -> String {
    format!("Hello, {}", &data.name)
}
```
- Avoid unnecessary work
Fast paths for common cases
Python:
```python
# BAD: Always takes the general path
def parse_int(s: str) -> int:
    return int(s)  # Handles all edge cases

# GOOD: Fast path for common ASCII digits
def parse_int(s: str) -> int:
    if len(s) <= 10 and s.isdigit():  # Fast check
        result = 0
        for c in s:
            result = result * 10 + (ord(c) - 48)
        return result
    return int(s)  # Fallback for edge cases

# Caveat: in CPython, int() is implemented in C, so a pure-Python fast path like
# this mainly pays off when it skips much more work (or under PyPy/compiled code).
```
Rust:
```rust
// Fast path for single-byte varint (the common case)
fn parse_varint(data: &[u8]) -> (u64, usize) {
    if data[0] < 128 {
        return (data[0] as u64, 1);  // ~90% of cases
    }
    parse_varint_slow(data)  // Rare multi-byte case
}
```
Defer expensive computations
Python:
```python
# BAD: Always computes the expensive value
def process(data, config):
    expensive = compute_expensive(data)  # Always runs
    if config.needs_expensive:
        use(expensive)

# GOOD: Defer until needed
def process(data, config):
    if config.needs_expensive:
        expensive = compute_expensive(data)  # Only when needed
        use(expensive)
```
Move loop-invariant code outside loops
```python
# BAD: Repeated attribute lookup
for item in items:
    result = self.config.transform_fn(item)  # 2 attribute lookups per iteration

# GOOD: Hoist invariant lookups
transform = self.config.transform_fn
for item in items:
    result = transform(item)
```
Cache computed results
```python
from functools import lru_cache

# Cache expensive computations
@lru_cache(maxsize=1024)
def expensive_computation(key: str) -> Result:
    ...
```
- Python↔Rust boundary optimization (PyO3)
Minimize GIL crossings
```rust
// BAD: Acquire/release the GIL for each item
fn process_items(py: Python, items: Vec<PyObject>) -> PyResult<Vec<PyObject>> {
    items.iter().map(|item| {
        // Each call acquires the GIL internally
        process_one(py, item)
    }).collect()
}

// GOOD: Hold the GIL for the batch, release it for Rust-only work
fn process_items(py: Python, items: Vec<PyObject>) -> PyResult<Vec<PyObject>> {
    // Extract data while holding the GIL
    let data: Vec<_> = items.iter().map(|i| extract_data(py, i)).collect::<PyResult<_>>()?;

    // Release the GIL for CPU-intensive work
    let results = py.allow_threads(|| {
        data.par_iter().map(|d| rust_process(d)).collect::<Vec<_>>()
    });

    // Re-acquire the GIL to build Python objects
    results.iter().map(|r| to_python(py, r)).collect()
}
```
Batch data across the boundary
```python
# BAD: N PyO3 calls
for item in items:
    rust_process(item)

# GOOD: Single PyO3 call with a batch
rust_process_batch(items)
```
Use zero-copy types
```rust
// Accept bytes without copying
#[pyfunction]
fn process_bytes(data: &[u8]) -> PyResult<Vec<u8>> {
    // data is a view into the Python bytes object, no copy
    Ok(transform(data))
}

// Use Py<PyBytes> for owned bytes without copying
#[pyfunction]
fn process_bytes_owned(py: Python, data: Py<PyBytes>) -> PyResult<Py<PyBytes>> {
    let bytes = data.as_bytes(py);
    // ...
}
```
- Async optimization
Python async patterns
```python
# BAD: Sequential await
async def fetch_all(urls):
    results = []
    for url in urls:
        results.append(await fetch(url))  # Sequential
    return results

# GOOD: Concurrent await
async def fetch_all(urls):
    return await asyncio.gather(*[fetch(url) for url in urls])

# BETTER: Bounded concurrency
async def fetch_all(urls, max_concurrent=10):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_with_limit(url):
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*[fetch_with_limit(url) for url in urls])
```
Rust async patterns
```rust
// BAD: Sequential await
async fn fetch_all(urls: Vec<Url>) -> Vec<Response> {
    let mut results = Vec::new();
    for url in urls {
        results.push(fetch(&url).await);  // Sequential
    }
    results
}

// GOOD: Concurrent with join_all
async fn fetch_all(urls: Vec<Url>) -> Vec<Response> {
    futures::future::join_all(urls.iter().map(fetch)).await
}

// BETTER: Bounded concurrency with buffer_unordered
use futures::stream::{self, StreamExt};

async fn fetch_all(urls: Vec<Url>) -> Vec<Response> {
    stream::iter(urls)
        .map(|url| fetch(url))
        .buffer_unordered(10)  // Max 10 concurrent
        .collect()
        .await
}
```
- Reduce logging and stats costs
Python logging in hot paths
```python
# BAD: Logging overhead even when disabled
for item in items:
    logger.debug(f"Processing {item}")  # f-string formatting always happens

# GOOD: Check the level first
if logger.isEnabledFor(logging.DEBUG):
    for item in items:
        logger.debug(f"Processing {item}")
```

Better still: remove logging from innermost loops entirely.
Rust logging in hot paths
```rust
// BAD: log! macro overhead in a hot loop
for item in items {
    log::debug!("Processing {:?}", item);  // Per-iteration level check even when debug is disabled
}

// GOOD: Check the level once, outside the loop
if log::log_enabled!(log::Level::Debug) {
    for item in items {
        log::debug!("Processing {:?}", item);
    }
}

// BETTER: Use tracing with static filtering
#[tracing::instrument(skip_all)]
fn process_batch(items: &[Item]) {
    // Span created once, not per item
    for item in items {
        // Hot loop without logging
    }
}
```
Flat-profile playbook
If no single hotspot dominates:
- Don't discount many small wins (twenty 1% improvements can matter).
- Look for loops closer to the top of call stacks (flame graphs help).
- Consider structural refactors (one-shot construction instead of incremental mutation).
- Replace overly general abstractions with specialized code.
- Reduce allocations (allocation profiles help; see the sketch after this list).
- Use hardware counters (cache misses, branch misses) to find invisible costs.
- Python-specific: look for GIL contention, excessive object creation, slow imports.
- Rust-specific: look for excessive async overhead, unnecessary clones, poor cache locality.
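For the allocation angle, the stdlib tracemalloc module can produce a first allocation profile without extra dependencies (memray gives richer data when available). A minimal sketch, where workload() is a placeholder for the code path under investigation:

```python
import tracemalloc

def workload():
    # Placeholder: the code path suspected of allocation pressure.
    return [{"id": i, "name": str(i)} for i in range(100_000)]

tracemalloc.start()
workload()
snapshot = tracemalloc.take_snapshot()

# Top allocation sites by total size, grouped by source line.
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)
```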
Quick review checklist
When reviewing a performance-sensitive change, scan for:
Python
- Any O(N²) behavior on realistic N?
- List comprehension where a generator would suffice?
- Missing __slots__ on frequently instantiated classes?
- dict/list allocation inside hot loops?
- Sequential awaits where concurrent would work?
- Repeated attribute lookups in loops?
- Logging/debug in innermost loops?
- Using stdlib json instead of msgspec/orjson?
- Synchronous I/O in async context?
- Missing @lru_cache for repeated expensive computations?
Rust
- Any O(N²) behavior on realistic N?
- Missing with_capacity() for known-size collections?
- Unnecessary .clone() where a borrow would work?
- String allocation in hot loops (use &str or Cow)?
- Box/Arc where stack allocation would work?
- Excessive async task spawning for small operations?
- Poor struct field ordering (padding waste)?
- Missing #[inline] on critical small functions?
- Lock contention in concurrent code?
- Logging format overhead in hot paths?
PyO3 boundary
- Multiple GIL acquisitions where one would suffice?
- Python object creation in hot Rust loops?
- Missing allow_threads() for CPU-bound Rust work?
- Copying data across the boundary when a view would work?
- Single-item APIs where a batch API would reduce crossings?
Examples (inputs → outputs)
Example A — Python list vs generator
Input:
```python
def process_large_file(path):
    lines = open(path).readlines()  # Load entire file into memory
    return [transform(line) for line in lines]  # Another full copy
```
Agent output:
- Issue: Two full copies of the file contents; O(N) memory for O(N) items.
- Fix: Use a generator for streaming:

  ```python
  def process_large_file(path):
      with open(path) as f:
          for line in f:
              yield transform(line)
  ```

- Expected impact: O(1) memory instead of O(N).
- Tradeoff: Can only iterate once; the caller must handle streaming.
Example B — Rust unnecessary clones
Input:
```rust
fn build_response(data: &ResponseData) -> String {
    let name = data.name.clone();
    let id = data.id.clone();
    format!("User {} ({})", name, id)
}
```
Agent output:
- Issue: Unnecessary clones; format! can borrow directly.
- Fix:

  ```rust
  fn build_response(data: &ResponseData) -> String {
      format!("User {} ({})", &data.name, &data.id)
  }
  ```

- Expected impact: 2 fewer allocations per call.
Example C — PyO3 batch optimization
Input:
```python
# Python side
for item in items:
    result = rust_validate(item)  # 10K calls
```
Agent output:
- Issue: 10K GIL acquisitions/releases (~2-5ms of overhead alone).
- Fix: Add a batch API:

  ```rust
  #[pyfunction]
  fn rust_validate_batch(items: Vec<Item>) -> Vec<bool> {
      items.iter().map(|i| validate(i)).collect()
  }
  ```

  ```python
  results = rust_validate_batch(items)  # 1 call
  ```

- Expected impact: ~10-50x faster for large batches.
- Tradeoff: Must collect all items first; can't stream results.