latency-principles

Comprehensive framework for diagnosing, optimizing, and hiding latency in software systems. You MUST use this skill whenever the user mentions slow performance, response time, p99/tail latency, or asks about system throughput and concurrency. It covers Little's Law, Amdahl's Law, and strategies for Data/Compute optimization (e.g., zero-copy, wait-free sync, request hedging). Trigger this even for theoretical questions about latency laws or ballpark estimations using latency constants.


Latency Principles

Based on "Latency: Reduce delay in software systems" by Pekka Enberg (Manning, 2026).

This skill provides a systematic framework for minimizing delay in software systems, covering the entire stack from Physics and Hardware to Application Architecture and User Experience.

Foundational Concepts

  • What is Latency?: The time delay between a cause and its observed effect.
  • Latency vs. Bandwidth: Bandwidth can always be increased by adding links, but latency is bounded by the path itself; the only fix is to shorten or optimize that path.
  • Impact on User Experience (UX): < 100ms (immediate), < 1s (instant), > 10s (slow).
  • Measuring Correctly: Use percentiles (p95, p99) to capture tail latency. Avoid averages. Avoid Coordinated Omission by using fixed-interval benchmarking.
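The fixed-interval idea from the last bullet can be sketched in a few lines of Python. This is an illustrative sketch, not one of the bundled scripts; `run_fixed_interval` and its `op` callback are hypothetical names:

```python
import time

def run_fixed_interval(op, interval_s, n):
    """Issue `op` on a fixed schedule and measure latency from the
    *intended* start time, so stalls are not silently omitted."""
    latencies = []
    start = time.perf_counter()
    for i in range(n):
        intended = start + i * interval_s
        now = time.perf_counter()
        if now < intended:
            time.sleep(intended - now)
        op()
        # Latency includes any queuing delay behind a slow previous op.
        latencies.append(time.perf_counter() - intended)
    return latencies
```

Contrast this with the naive closed loop that timestamps around each call only: there, a single stall hides the queue of requests that builds up behind it, which is exactly the Coordinated Omission error.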

Modeling Performance

Use these laws to establish theoretical bounds and size systems:

  • Little's Law: Concurrency (C) = Throughput (T) * Latency (L)
    • Usage: Calculate required concurrency to support a target throughput at a given latency.
    • Example: If p99 latency is 50ms and you need 1000 RPS, you need 1000 * 0.05 = 50 concurrent execution units.
  • Amdahl's Law: Speedup = 1 / ((1 - P) + (P / N))
    • P: Portion of program that is parallelizable.
    • N: Number of processors.
    • Usage: Understand the limits of parallelization. If 50% of your code is serial, the max speedup is 2x, regardless of how many cores you add.
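Both laws are one-liners to encode; here is a minimal sketch (the function names are ours, not from the book):

```python
def required_concurrency(throughput_rps, latency_s):
    """Little's Law: C = T * L (units must agree: requests/s times seconds)."""
    return throughput_rps * latency_s

def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's Law: Speedup = 1 / ((1 - P) + P / N)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)
```

`required_concurrency(1000, 0.05)` reproduces the 50-unit example above, and `amdahl_speedup(0.5, 1024)` comes out just under 2, showing the serial ceiling no core count can break.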

Measurement & Visualization

Visualizing latency distributions is critical for identifying tail behavior:

  1. Histograms: Show frequency of samples. Good for seeing the "mode" (most common latency) and the spread.
  2. HDR Histograms: Plot latency against percentiles (p50, p99, etc.) on a log scale. Essential for p99.9+ analysis.
  3. eCDF (Empirical Cumulative Distribution Function): A smooth curve showing the probability that a request completes within a given time. Directly answers SLA compliance questions.
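Both the percentile and eCDF views can be computed directly from raw samples. A stdlib-only sketch, assuming the nearest-rank definition of percentile (one of several common conventions):

```python
import math
from bisect import bisect_right

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(s)))
    return s[k - 1]

def ecdf(samples, t):
    """Empirical CDF: fraction of samples with latency <= t.
    Directly answers 'what share of requests meet an SLA of t?'."""
    s = sorted(samples)
    return bisect_right(s, t) / len(s)
```

For serious p99.9+ work, prefer an HDR histogram structure over sorting raw arrays, but the definitions are the same.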

Tooling:

  • Use scripts/ping_collector.py to gather data without Coordinated Omission.
  • Use scripts/visualize_latency.py to generate Histogram, HDR, and eCDF plots.

Quick Decision Guide

| Symptom | Probable Cause | Recommended Strategy |
| --- | --- | --- |
| High Avg Latency | Sequential processing / Slow I/O | Concurrency (Async I/O) or Partitioning |
| High Tail Latency (p99) | Lock contention / GC / Noisy neighbors | Wait-free Sync (Atomics) or Request Hedging |
| Network Slowness | Distance / Protocol overhead | Colocation (Edge) or Binary Serialization (Protobuf) |
| Database Load | Hot keys / Complex queries | Caching (Read-through) or Materialized Views |
| Slow Writes | ACID guarantees / Indexing | Write-Behind Caching or Sharding |
| High CPU Usage | O(n^2) logic / JSON parsing | Algorithmic Fixes or Protobuf/FlatBuffers |
| Micro-stutters | GC pauses / OS interrupts | Object Pooling or Interrupt Affinity |
| Lock Contention | Mutex bottleneck | Wait-free Sync (Atomics) |
| "It feels slow" | UI blocking on network | Optimistic Updates or Prefetching |
| Measurement Looks Too Good | Coordinated Omission | Fixed-Interval Benchmarking |

Part 1: Fundamentals (Start Here)

Core theory and diagnostic approaches.


Part 2: Data Layer (Access Optimization)

Optimizing data access is often the highest-leverage activity for reducing latency.

  • Colocation (Pattern: Move Compute to Data):
    • Edge Computing: Use Near Edge (points of presence) or Far Edge (on-device/IoT) to eliminate geographical distance.
    • Intranode: Colocate protocol handlers with application threads. Turn off Nagle's Algorithm (TCP_NODELAY) to prevent packet batching delays.
    • Kernel-bypass: Use techniques like DPDK to eliminate OS stack overhead.
  • Replication (Pattern: Trade Consistency for Latency):
    • Leaderless/Multi-Leader: Allows local writes to reduce write latency, at the cost of complex conflict resolution.
    • Consistency Models: Choose Eventual Consistency or Read-your-writes to avoid synchronous coordination (Strong Consistency) overhead.
  • Partitioning (Pattern: Divide and Conquer):
    • Horizontal Sharding: Splitting data to increase parallel throughput.
    • Request Routing: Use Direct Routing (client-side) to avoid the extra hop of a proxy.
    • Mitigate Hot Partitions: Use over-partitioning to balance skewed workloads.
  • Caching (Pattern: Memory is Faster than Disk):
    • Write-Behind Caching: Asynchronous writes to the data store to hide write latency.
    • Policy Selection: Use SIEVE or LRU for eviction.
    • Materialized Views: Precompute complex queries to eliminate runtime processing work.
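The TCP_NODELAY tweak from the Intranode bullet is a one-line socket option. A minimal sketch (the wrapper name is ours):

```python
import socket

def connect_low_latency(host, port):
    """Open a TCP connection with Nagle's algorithm disabled, so small
    writes go out immediately instead of waiting to be batched."""
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

This trades a little bandwidth efficiency (more, smaller packets) for lower per-message delay, which is the right trade for chatty request/response protocols.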

See references/data_patterns.md for detailed implementation strategies.

Part 3: Compute Layer (Logic Acceleration)

Optimizing processing logic and synchronization to eliminate overhead.

  • Eliminating Work (Pattern: The Fastest Code is Code that Doesn't Run):
    • Algorithmic Complexity: Replace O(n) scans with O(log n) trees or O(1) hash maps.
    • Zero-Copy Serialization: Use FlatBuffers instead of JSON to eliminate the parsing/unpacking step.
    • Memory Tuning: Avoid dynamic allocation (malloc/new) in hot paths. Use Object Pooling or Stack Allocation to prevent GC pauses or allocator lock contention.
    • Precomputation: Move work from runtime to build-time or startup.
  • Wait-Free Sync (Pattern: Avoid Context Switches):
    • Mutual Exclusion Problems: Locks (Mutexes) cause expensive OS context switches (~µs).
    • Atomics: Use hardware primitives (CAS, fetch_add) for lock-free state updates.
    • Wait-Free Structures: Implement Ring Buffers (SPSC) using memory barriers to allow threads to communicate without ever blocking.
  • Exploiting Concurrency (Pattern: Use Every Core):
    • Thread-per-core: Pin threads to physical cores to maximize cache locality and eliminate scheduler overhead.
    • Concurrency Models: Use Coroutines/Fibers for lightweight userspace multitasking or Actor Model for shared-nothing message passing.
    • SIMD: Use "Single Instruction, Multiple Data" to parallelize arithmetic at the hardware level.
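The Object Pooling idea from the Memory Tuning bullet can be sketched as follows. This is an illustrative, single-threaded pool, not the book's implementation:

```python
class ObjectPool:
    """Reuse expensive objects instead of allocating in the hot path,
    avoiding allocator contention and GC pressure."""

    def __init__(self, factory, size):
        self._factory = factory
        self._free = [factory() for _ in range(size)]  # preallocate at startup

    def acquire(self):
        # Fall back to fresh allocation only when the pool is exhausted.
        return self._free.pop() if self._free else self._factory()

    def release(self, obj):
        self._free.append(obj)
```

A production pool would also bound growth, reset objects on release, and be made thread-safe (ideally with a lock-free free list), but the acquire/release shape is the same.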

See references/compute_optimization.md for detailed implementation patterns.

Part 4: Hiding Latency (Perceived Speed)

When you can't make it faster, make it feel faster by masking delays.

  • Asynchronous Processing (Pattern: Don't Block the Main Thread):
    • Request Hedging: Send the same request to multiple replicas and use the first response to cut tail latency.
    • Request Batching: Group small requests to amortize round-trip and header overhead.
    • Backpressure: Prevents queuing latency by signaling producers to slow down when the system is saturated.
  • Predictive Techniques (Pattern: Guess the Future):
    • Prefetching: Load data (Sequential, Spatial, or Semantic based on user intent) before it's explicitly requested.
    • Optimistic Updates: Update the UI immediately assuming success; reconcile or rollback if the server fails.
  • Speculative Execution (Pattern: Execute Before Needed):
    • Parallel Speculation: Execute multiple possible outcomes in parallel and keep the correct one (e.g., in a search engine).
    • Prewarming: Spin up resources (Lambdas, VM instances) based on historical traffic patterns before they are needed.
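Request Hedging from the first bullet can be sketched with asyncio. The `hedged` helper and its parameters are illustrative; a production version would also cap the hedge rate and propagate cancellation to in-flight RPCs:

```python
import asyncio

async def hedged(request, replicas, hedge_after_s=0.05):
    """Send to the primary replica; if no reply within hedge_after_s,
    fire the same request at the backups and take the first response."""
    tasks = [asyncio.create_task(request(replicas[0]))]
    done, _ = await asyncio.wait(tasks, timeout=hedge_after_s)
    if not done:
        # Primary is slow: hedge to the remaining replicas.
        tasks += [asyncio.create_task(request(r)) for r in replicas[1:]]
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    result = done.pop().result()
    for t in tasks:
        t.cancel()  # abandon the losers
    return result
```

Hedging only after a delay (rather than fanning out immediately) keeps the extra load small while still cutting the tail: only requests already in the slow tail pay for a second attempt.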

See references/hiding_latency.md for detailed masking strategies.

Bundled Resources

Scripts

Use these for diagnostics and quick calculations:

  • scripts/ping_collector.py — gathers latency samples without Coordinated Omission.
  • scripts/visualize_latency.py — generates Histogram, HDR, and eCDF plots.

Code Examples

See code-examples/ for implementations of key techniques from the book.


Latency Constants (Quick Reference)

| Operation | Time | Order (ns) |
| --- | --- | --- |
| CPU cycle (3 GHz) | 0.3 ns | 10⁻¹ |
| L1 cache access | 1 ns | 10⁰ |
| DRAM access | 100 ns | 10² |
| NVMe disk access | 10 μs | 10⁴ |
| SSD disk access | 100 μs | 10⁵ |
| Network NYC → London | 60 ms | 10⁷ |
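The constants above support quick ballpark estimates before any profiling. A minimal sketch (the dictionary keys are our shorthand, not from the book):

```python
# Ballpark costs from the latency constants table, in nanoseconds.
LATENCY_NS = {
    "cpu_cycle": 0.3,
    "l1_cache": 1,
    "dram_access": 100,
    "nvme_access": 10_000,
    "ssd_access": 100_000,
    "net_nyc_london": 60_000_000,
}

def budget_ns(*ops):
    """Sum per-operation costs for a back-of-envelope latency budget."""
    return sum(LATENCY_NS[op] for op in ops)
```

One NYC → London hop costs as much as roughly 600,000 DRAM accesses, which is why eliminating a cross-region round trip usually beats any amount of in-memory tuning.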

Human Perception

| Perception | Time |
| --- | --- |
| Immediate (no delay perceived) | < 100 ms |
| Instant (feels fast) | < 1 s |
| Slow | > 10 s |
