cloudflare-performance-engineering

Cloudflare Performance Engineering Style Guide

Overview

Cloudflare operates one of the world's largest networks, handling over 35 million HTTP requests per second across 330+ cities. Its performance engineering philosophy combines deep kernel expertise, innovative use of eBPF/XDP, Rust-based systems programming, and relentless measurement. Key figures include John Graham-Cumming (former CTO), Marek Majkowski (kernel/networking expert), and the teams behind Pingora, Workers, and quiche.

Core Philosophy

"If you can't measure it, you can't improve it—and you're probably making it worse."

"Move the code to the data, not the data to the code."

"The fastest packet is the one you never have to process."

"Every millisecond matters when you multiply it by a trillion requests."

Cloudflare's approach: push computation as close to the user as possible (edge), eliminate unnecessary work at every layer (kernel bypass), use memory-safe systems languages (Rust), and measure everything in production with real user data.

Key Visionaries

John Graham-Cumming (Former CTO)

Emphasis on security and performance as complementary, not competing
"Help build a better Internet" mission driving architectural decisions
Deep technical roots in formal methods and computer security

Marek Majkowski

Pioneer of XDP/eBPF adoption at scale for packet processing
Author of foundational posts: "How to drop 10 million packets per second"
Kernel bypass expertise, pushing Linux networking to its limits

The Pingora Team

Replaced NGINX with custom Rust proxy handling 35M+ req/sec
70% less CPU, 67% less memory vs. previous Lua/NGINX stack
Demonstrates commitment to owning the entire stack

Design Principles

Edge-First Architecture: Compute at the network edge, not in centralized data centers.

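Kernel Bypass When It Matters: Use XDP/eBPF to process packets in the NIC driver, before they reach the kernel network stack. XDP itself runs inside the kernel; what it bypasses is the stack, not the kernel.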

Memory Safety at Scale: Rust for new systems code—eliminate entire classes of vulnerabilities.

Measure with Real Users: RUM (Real User Measurement) over synthetic benchmarks.

Smart Routing Over Dumb Pipes: Use network intelligence to route around problems.

Isolate, Don't Containerize: V8 isolates for sub-millisecond cold starts.

Performance Numbers to Know

Cloudflare Network Scale:
──────────────────────────────────────────────────────────
Network locations                     330+ cities
Peak requests per second              35,000,000+
Percentage of Internet traffic        ~20%
Average distance to any Internet user <50ms (network latency)

Packet Processing (XDP/eBPF):
──────────────────────────────────────────────────────────
iptables DROP                         ~2M pps/core
XDP DROP (kernel)                     ~10M pps/core
XDP DROP (native driver)              ~26M pps/core
L4Drop (Cloudflare XDP)               ~10M pps/core (with complex rules)

Workers (V8 Isolates):
──────────────────────────────────────────────────────────
Cold start time                       <1ms (vs 100ms+ containers)
Isolate memory overhead               ~2MB (vs 35MB+ containers)
Time to global deployment             <30 seconds

Pingora (Rust Proxy):
──────────────────────────────────────────────────────────
CPU reduction vs NGINX                70%
Memory reduction vs NGINX             67%
Connection reuse improvement          Significant (multi-threaded)

When Engineering for Performance

Always

Measure in production with real users (RUM), not just synthetic tests
Know your p50, p95, p99, and p999 latencies—tail latency kills
Process packets as early as possible in the stack (XDP > iptables > userspace)
Use connection reuse aggressively—TCP handshakes are expensive
Compress on the wire (CPU is cheaper than bandwidth)
Cache at every layer: edge, tiered, origin shield
Design for anycast—route users to nearest healthy PoP
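To make the percentile advice above concrete, here is a minimal sketch that computes nearest-rank percentiles over a set of latency samples. The function names `percentile` and `summarize` and the sample data are illustrative assumptions, not part of any Cloudflare tooling.

```javascript
// Nearest-rank percentile over latency samples (milliseconds).
// `percentile` and `summarize` are hypothetical helper names.
function percentile(sortedSamples, p) {
  const n = sortedSamples.length;
  if (n === 0) throw new Error('no samples');
  // Nearest rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.min(Math.max(Math.ceil((p * n) / 100), 1), n);
  return sortedSamples[rank - 1];
}

function summarize(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return {
    mean,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    p999: percentile(sorted, 99.9),
  };
}

// Made-up data: 980 fast requests (20 ms) and 20 slow ones (2000 ms).
const samples = [...new Array(980).fill(20), ...new Array(20).fill(2000)];
const summary = summarize(samples);
console.log(summary);
```

In this made-up data set the mean and the median both look healthy, while p99 and p999 expose the two-second tail that 2% of requests experience.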

Never

Trust synthetic benchmarks alone—production is different
Process packets in userspace when kernel/XDP can do it
Allocate memory in the hot path
Ignore cold start latency for serverless workloads
Route all traffic through origin—cache what you can
Assume the network path is stable—routes change constantly
Skip graceful degradation—partial service beats total failure
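The "no allocation in the hot path" rule above can be illustrated with a fixed-size ring buffer that allocates once at construction and only overwrites slots afterwards. This is a sketch of the pattern, not Cloudflare code: `LatencyRing` is a hypothetical name, and a JavaScript engine may still allocate internally for its own purposes.

```javascript
// Fixed-size ring buffer: all memory is allocated once, up front.
// `LatencyRing` is a hypothetical class used only for illustration.
class LatencyRing {
  constructor(capacity) {
    this.samples = new Float64Array(capacity); // the only allocation
    this.next = 0;  // index of the slot to overwrite next
    this.count = 0; // number of valid samples (never exceeds capacity)
  }

  // Hot path: one store and two integer updates, no new objects.
  record(ms) {
    this.samples[this.next] = ms;
    this.next = (this.next + 1) % this.samples.length;
    if (this.count < this.samples.length) this.count++;
  }

  // Cold path: scan the valid samples for the maximum.
  max() {
    let m = -Infinity;
    for (let i = 0; i < this.count; i++) {
      if (this.samples[i] > m) m = this.samples[i];
    }
    return m;
  }
}

const ring = new LatencyRing(4);
for (const ms of [5, 7, 6, 30, 8]) ring.record(ms);
console.log(ring.count, ring.max());
```

Once the buffer is full, the oldest sample is overwritten in place, so memory use stays constant regardless of request volume.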

Prefer

XDP/eBPF over iptables for packet filtering
Rust over C/C++ for new systems code
V8 isolates over containers for edge compute
Anycast over DNS-based load balancing
Connection pooling over per-request connections
Tiered caching over single-layer caches
BBR over CUBIC for congestion control (especially on lossy networks)

Architectural Patterns

XDP Packet Processing Pipeline

  Packet arrives at NIC
           │
           ▼
┌─────────────────────┐
│     XDP Program     │ ← Runs in NIC driver, before sk_buff allocation
│   (eBPF bytecode)   │
└─────────────────────┘
           │
      ┌────┴────┬──────────┐
      ▼         ▼          ▼
  XDP_DROP   XDP_TX    XDP_PASS
  (discard)  (reflect)  (to kernel stack)
      │         │          │
      │         │          ▼
      │         │     Normal Linux
      │         │     networking
      ▼         ▼
   ~26Mpps   Modified packet
   per core  sent back out

Key insight: No memory allocation for dropped packets = massive throughput
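The verdict logic in the pipeline above can be modelled in userspace, which is handy for unit-testing filter rules before they are written as eBPF. The sketch below is only a JavaScript model under stated assumptions: it handles untagged Ethernet frames carrying IPv4, and `verdict` and `makeIpv4Frame` are hypothetical helper names. A real XDP program is written in C and loaded into the kernel.

```javascript
// Userspace model of an XDP-style verdict function (illustration only).
const XDP_DROP = 'XDP_DROP';
const XDP_PASS = 'XDP_PASS';
const ETH_HLEN = 14;       // Ethernet header length in bytes
const ETH_P_IP = 0x0800;   // EtherType for IPv4
const IP_MIN_HLEN = 20;    // minimal IPv4 header length in bytes

// frame: Uint8Array of raw bytes; blocked: Set of dotted-quad source IPs.
function verdict(frame, blocked) {
  // Bounds checks come first, mirroring what the eBPF verifier enforces.
  if (frame.length < ETH_HLEN) return XDP_PASS;
  const etherType = (frame[12] << 8) | frame[13];
  if (etherType !== ETH_P_IP) return XDP_PASS;
  if (frame.length < ETH_HLEN + IP_MIN_HLEN) return XDP_PASS;
  // The source address sits at offset 12 inside the IPv4 header.
  const src = frame.subarray(ETH_HLEN + 12, ETH_HLEN + 16).join('.');
  return blocked.has(src) ? XDP_DROP : XDP_PASS;
}

// Hypothetical test helper: builds a minimal IPv4-over-Ethernet frame.
function makeIpv4Frame(srcIp) {
  const frame = new Uint8Array(ETH_HLEN + IP_MIN_HLEN);
  frame[12] = 0x08; // EtherType high byte
  frame[13] = 0x00; // EtherType low byte
  frame[14] = 0x45; // IPv4, header length 5 words
  frame.set(srcIp.split('.').map(Number), ETH_HLEN + 12);
  return frame;
}

const blocked = new Set(['192.0.2.1']);
console.log(verdict(makeIpv4Frame('192.0.2.1'), blocked));
console.log(verdict(makeIpv4Frame('198.51.100.7'), blocked));
```

Anything the function cannot parse is passed to the normal stack rather than dropped, which keeps the filter fail-open for traffic it does not understand.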

Edge Computing with V8 Isolates

Traditional Serverless:        Cloudflare Workers:
─────────────────────          ────────────────────
┌─────────────────┐            ┌─────────────────────────┐
│ Container       │            │ V8 Process              │
│ ┌─────────────┐ │            │ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ Function    │ │            │ │ I │ │ I │ │ I │ │ I │ │
│ │ Code        │ │            │ │ s │ │ s │ │ s │ │ s │ │
│ └─────────────┘ │            │ │ o │ │ o │ │ o │ │ o │ │
│ Node runtime    │            │ │ 1 │ │ 2 │ │ 3 │ │ 4 │ │
│ OS overhead     │            │ └───┘ └───┘ └───┘ └───┘ │
└─────────────────┘            └─────────────────────────┘
~100ms cold start              <1ms cold start
~35MB memory                   ~2MB per isolate

Key insight: Reuse V8 process, isolate tenants with isolates not VMs

Smart Routing (Argo)

Without Smart Routing:
─────────────────────
User → Nearest PoP → Public Internet (BGP) → Origin
       (fast)        (unpredictable)

With Argo Smart Routing:
───────────────────────
User → Nearest PoP → Cloudflare Backbone   → Exit PoP  → Origin
       (fast)        (optimized, measured)   (closest)

Argo continuously measures RTT, packet loss, and jitter across paths, and selects routes dynamically based on real-time conditions. Typical improvement: 30% faster TTFB for dynamic content.
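As a rough illustration of selecting a path from measured RTT, loss, and jitter, the sketch below scores each candidate path and picks the cheapest. The cost formula, the `RETRANSMIT_PENALTY_MS` constant, the path names, and the numbers are all assumptions made for this example; the text does not describe Argo's actual scoring.

```javascript
// Assumed cost model: expected latency = RTT + jitter + loss penalty.
// This is NOT Argo's real algorithm, only a sketch of the idea.
const RETRANSMIT_PENALTY_MS = 200; // assumed cost of recovering one lost packet

function pathCost({ rttMs, jitterMs, lossRate }) {
  return rttMs + jitterMs + lossRate * RETRANSMIT_PENALTY_MS;
}

// Returns the candidate with the lowest cost (paths must be non-empty).
function pickPath(paths) {
  return paths.reduce((best, p) => (pathCost(p) < pathCost(best) ? p : best));
}

// Hypothetical measurements for three candidate paths.
const paths = [
  { name: 'public-bgp', rttMs: 80, jitterMs: 15, lossRate: 0.02 },
  { name: 'backbone-a', rttMs: 90, jitterMs: 2, lossRate: 0.001 },
  { name: 'backbone-b', rttMs: 120, jitterMs: 1, lossRate: 0 },
];

const chosen = pickPath(paths);
console.log(chosen.name, pathCost(chosen));
```

Under these assumed numbers the path with the lowest raw RTT loses out, because its jitter and loss make it slower in expectation than a slightly longer but cleaner path.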

Tiered Caching

┌─────────────────────────────────────────────────────────────┐
│                        User Request                         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                  Edge PoP (330+ locations)                  │
│  Cache HIT? → Return immediately (fastest)                  │
└─────────────────────────────────────────────────────────────┘
                               │ MISS
                               ▼
┌─────────────────────────────────────────────────────────────┐
│          Upper-Tier PoP (Regional, ~20 locations)           │
│  Cache HIT? → Return, populate edge cache                   │
│  Concentrates origin requests, improves hit ratio           │
└─────────────────────────────────────────────────────────────┘
                               │ MISS
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Origin Server                        │
│  Single request even if 100 edge PoPs need the content      │
└─────────────────────────────────────────────────────────────┘

Key insight: Fewer origin requests = lower origin load + better cache hit ratio
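The effect of the upper tier can be seen in a small simulation: 100 edge caches miss on the same object, but only the first miss reaches the origin. The `Tier` and `Origin` classes are hypothetical, and the requests here run one after another; concurrent misses would additionally need request coalescing, which this sketch does not model.

```javascript
// Toy model of tiered caching (sequential requests only).
class Origin {
  constructor() { this.requests = 0; }
  get(key) {
    this.requests++;
    return `body-of-${key}`;
  }
}

class Tier {
  constructor(name, upstream) {
    this.name = name;
    this.upstream = upstream; // next tier up, or the origin
    this.store = new Map();
    this.hits = 0;
    this.misses = 0;
  }
  get(key) {
    if (this.store.has(key)) {
      this.hits++;
      return this.store.get(key);
    }
    this.misses++;
    const value = this.upstream.get(key); // ask the next tier
    this.store.set(key, value);           // populate this tier on the way back
    return value;
  }
}

const origin = new Origin();
const upper = new Tier('upper', origin);
const edges = Array.from({ length: 100 }, (_, i) => new Tier(`edge-${i}`, upper));

// Every edge PoP receives its first request for the same object.
for (const edge of edges) edge.get('/logo.png');
console.log(origin.requests, upper.hits, upper.misses);
```

Without the upper tier, each of the 100 edge misses would have gone to the origin directly.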

Code Patterns

XDP Packet Filtering (eBPF/C)

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// Blocklist keyed by source IPv4 address (network byte order)
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10000);
    __type(key, __u32);    // Source IP
    __type(value, __u64);  // Packet count
} blocked_ips SEC(".maps");

SEC("xdp")
int xdp_filter(struct xdp_md *ctx)
{
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    // Parse Ethernet header (the verifier requires the bounds check)
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    // Parse IP header
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // Check blocklist - O(1) lookup in eBPF map
    __u32 src_ip = ip->saddr;
    __u64 *count = bpf_map_lookup_elem(&blocked_ips, &src_ip);
    if (count) {
        // Atomic add: several CPUs may update the same entry at once
        __sync_fetch_and_add(count, 1);
        return XDP_DROP;  // Dropped at driver level, ~26Mpps
    }

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";

Rust Connection Pool (Pingora Style)

use std::sync::Arc;

use dashmap::DashMap;
use tokio::sync::Semaphore;

// `Connection`, `PooledConnection`, and `Error` are assumed to be defined
// elsewhere; `Error` must implement `From<tokio::sync::AcquireError>`.

/// Connection pool optimized for high-concurrency edge proxying.
///
/// Cloudflare's Pingora uses connection reuse aggressively to avoid
/// TCP handshake overhead. A single connection can serve many requests.
pub struct ConnectionPool<C> {
    pools: DashMap<String, Vec<C>>,
    max_idle_per_host: usize,
    semaphore: Arc<Semaphore>,
}

impl<C: Connection> ConnectionPool<C> {
    pub fn new(max_connections: usize, max_idle_per_host: usize) -> Self {
        Self {
            pools: DashMap::new(),
            max_idle_per_host,
            semaphore: Arc::new(Semaphore::new(max_connections)),
        }
    }

    /// Get a connection, reusing if possible.
    ///
    /// Connection reuse is critical at Cloudflare scale:
    /// - Avoids TCP 3-way handshake (1 RTT saved)
    /// - Avoids TLS handshake (1-2 RTT saved)
    /// - Keeps TCP windows warm (better throughput)
    pub async fn get(&self, host: &str) -> Result<PooledConnection<C>, Error> {
        // Try to reuse an idle connection, skipping unhealthy ones
        if let Some(mut pool) = self.pools.get_mut(host) {
            while let Some(conn) = pool.pop() {
                if conn.is_healthy() {
                    return Ok(PooledConnection::new(conn, self, host.to_string()));
                }
                // Connection unhealthy, let it drop and try the next one
            }
        }

        // Acquire a permit before connecting. The permit is released when
        // this function returns, so it bounds concurrent connection attempts,
        // not total open connections. To cap the latter, store an owned
        // permit (`acquire_owned`) inside the pooled connection.
        let _permit = self.semaphore.acquire().await?;

        // Create new connection
        let conn = C::connect(host).await?;
        Ok(PooledConnection::new(conn, self, host.to_string()))
    }

    /// Return connection to pool for reuse.
    fn return_connection(&self, host: String, conn: C) {
        if !conn.is_healthy() {
            return; // Don't pool unhealthy connections
        }

        let mut pool = self.pools.entry(host).or_insert_with(Vec::new);
        if pool.len() < self.max_idle_per_host {
            pool.push(conn);
        }
        // If pool is full, connection is dropped
    }
}

Workers-Style Edge Handler (JavaScript)

/**
 * Cloudflare Workers run in V8 isolates at the edge.
 * Key performance principles:
 * - Sub-millisecond cold starts (isolates, not containers)
 * - Compute at the edge, close to users
 * - Stream responses, don't buffer
 * - Use the Cache API aggressively
 */
export default {
  async fetch(request, env, ctx) {
    // The Cache API only stores GET requests; pass everything else through
    if (request.method !== 'GET') {
      return fetch(request);
    }

    const url = new URL(request.url);
    const cacheKey = new Request(url.toString(), request);
    const cache = caches.default;

    // Check edge cache first (fastest path)
    let response = await cache.match(cacheKey);
    if (response) {
      // Re-wrap so the headers are mutable; the cached entry is not modified
      response = new Response(response.body, response);
      response.headers.set('X-Cache', 'HIT');
      return response;
    }

    // Cache miss - fetch from origin
    const originResponse = await fetch(request);

    // Only cache successful, cacheable responses
    if (originResponse.ok && isCacheable(originResponse)) {
      // Clone because a response body can only be read once;
      // cache in background (don't block response)
      ctx.waitUntil(cache.put(cacheKey, originResponse.clone()));
    }

    // Return immediately, caching happens async
    return originResponse;
  }
};

function isCacheable(response) {
  const cacheControl = response.headers.get('Cache-Control') || '';
  return !cacheControl.includes('no-store') && !cacheControl.includes('private');
}

Performance Measurement (RUM Style)

/**
 * Real User Measurement (RUM) - Cloudflare's approach to performance data.
 * Key metrics:
 * - TCP Connection Time: Time to establish TCP connection
 * - TTFB (Time to First Byte): Request sent until first response byte
 * - TTLB (Time to Last Byte): Total transfer time
 * Always measure from real users, not synthetic tests.
 */
class PerformanceCollector {
  constructor(endpoint) {
    this.endpoint = endpoint;
    this.buffer = [];
    this.flushInterval = 5000;

    // Flush periodically in batches (amortize network overhead)
    setInterval(() => this.flush(), this.flushInterval);

    // Also flush when the page is hidden, the last reliable point before unload
    document.addEventListener('visibilitychange', () => {
      if (document.visibilityState === 'hidden') this.flush();
    });
  }

  measure(url) {
    // Resource Timing entries are looked up by absolute URL
    const entry = performance.getEntriesByName(url)[0];
    if (!entry) return;

    const metrics = {
      url: url,
      timestamp: Date.now(),

      // DNS lookup (often cached, but important for cold loads)
      dnsLookup: entry.domainLookupEnd - entry.domainLookupStart,

      // TCP connection (XDP/kernel optimization target)
      tcpConnect: entry.connectEnd - entry.connectStart,

      // TLS handshake (QUIC eliminates separate TLS RTT)
      tlsHandshake: entry.secureConnectionStart > 0
        ? entry.connectEnd - entry.secureConnectionStart
        : 0,

      // Time to First Byte (server processing + network)
      ttfb: entry.responseStart - entry.requestStart,

      // Content transfer (CDN/edge cache optimization target)
      contentTransfer: entry.responseEnd - entry.responseStart,

      // Total time
      total: entry.responseEnd - entry.startTime,

      // Protocol (HTTP/2 vs HTTP/3)
      protocol: entry.nextHopProtocol,

      // Likely served from the browser cache. transferSize is also 0 for
      // cross-origin resources without Timing-Allow-Origin, so treat as a hint.
      cached: entry.transferSize === 0,
    };

    // Buffer raw samples so the backend can compute percentiles, not just averages
    this.buffer.push(metrics);
  }

  flush() {
    if (this.buffer.length === 0) return;

    // Take everything currently buffered and leave the buffer empty
    const batch = this.buffer.splice(0, this.buffer.length);

    // Use sendBeacon for reliability (survives page unload)
    navigator.sendBeacon(this.endpoint, JSON.stringify(batch));
  }
}

Mental Model

Cloudflare engineers approach performance with:

Where can we avoid work? The fastest code is code that doesn't run.
How close to the user? Push computation to the edge.
How early in the stack? XDP > kernel > userspace.
What do real users see? RUM over synthetic benchmarks.
What's the tail latency? p99 matters more than average.

The Cloudflare Performance Review

Where is latency added?
- Network path (measure RTT, packet loss)
- Protocol overhead (TLS, TCP handshakes)
- Processing time (edge vs origin)
- Queueing (congestion, bufferbloat)
What can be eliminated?
- Unnecessary round trips (connection reuse, 0-RTT)
- Redundant computation (caching at every layer)
- Wasteful packet processing (XDP for early filtering)
What can be moved closer?
- Compute to edge (Workers)
- Cache to edge (Tiered Cache)
- TLS termination to edge (reduces RTT)
How do we measure improvement?
- Real User Measurements (RUM)
- A/B testing with statistical significance
- Percentile analysis (p50, p95, p99, p999)
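The "unnecessary round trips" item in the review above lends itself to a quick back-of-the-envelope calculation. The sketch below assumes textbook costs for HTTPS over TCP with TLS 1.3: one RTT for the TCP handshake, one RTT for a full TLS handshake (zero with 0-RTT resumption), and one RTT for the request and response. The function names and the 50 ms / 20 ms figures are illustrative assumptions.

```javascript
// Round trips before the first response byte, under the assumptions above.
function roundTripsToFirstByte({ reuseConnection = false, zeroRtt = false } = {}) {
  if (reuseConnection) return 1; // only the request/response exchange remains
  const tcpHandshake = 1;
  const tlsHandshake = zeroRtt ? 0 : 1; // early data rides with the ClientHello
  const requestResponse = 1;
  return tcpHandshake + tlsHandshake + requestResponse;
}

function timeToFirstByteMs(rttMs, serverMs, options) {
  return roundTripsToFirstByte(options) * rttMs + serverMs;
}

// Assumed example: 50 ms RTT to the server, 20 ms of server processing.
const rttMs = 50;
const serverMs = 20;
const cold = timeToFirstByteMs(rttMs, serverMs, {});
const resumed = timeToFirstByteMs(rttMs, serverMs, { zeroRtt: true });
const reused = timeToFirstByteMs(rttMs, serverMs, { reuseConnection: true });
console.log({ cold, resumed, reused });
```

With these numbers a cold connection spends most of its time to first byte on handshakes, which is why connection reuse and edge TLS termination (a shorter RTT for the same number of round trips) pay off.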

Warning Signs

You're violating Cloudflare's principles if:

You're measuring performance with synthetic benchmarks only
You're processing packets in userspace that could be filtered in XDP
You're allocating memory in the hot path
You're ignoring tail latency (p99, p999)
You're running compute in centralized data centers when edge is possible
You're using containers where isolates would work
You're not measuring the impact of every change in production
You're optimizing for average latency instead of percentiles

Technology Stack

Layer                Cloudflare Choice    Why
──────────────────────────────────────────────────────────────────────
Packet filtering     XDP/eBPF             10-26Mpps, skips the kernel network stack
Proxy                Pingora (Rust)       Memory safety, 70% less CPU
Edge compute         V8 Isolates          <1ms cold start
QUIC/HTTP3           quiche (Rust)        0-RTT, better mobile
Congestion control   BBR                  Better on lossy networks
Routing              Anycast + Argo       Automatic failover, smart paths

Additional Resources

Cloudflare Blog: blog.cloudflare.com (technical deep-dives)
Marek Majkowski's posts on XDP and kernel networking
Pingora announcement and architecture posts
Speed Week performance measurement methodology
