
Kafka Architecture & Design Expert

Comprehensive knowledge of Apache Kafka architecture patterns, cluster design principles, and production best practices for building resilient, scalable event streaming platforms.

Core Architecture Concepts

Kafka Cluster Components

Brokers:

  • Individual Kafka servers that store and serve data

  • Each broker handles thousands of partitions

  • Typical: 3-10 brokers per cluster (small), 10-100+ (large enterprises)

Controller:

  • One broker elected as controller (via KRaft or ZooKeeper)

  • Manages partition leaders and replica assignments

  • Failure triggers automatic re-election

Topics:

  • Logical channels for message streams

  • Divided into partitions for parallelism

  • Can have different retention policies per topic

Partitions:

  • Ordered, immutable sequence of records

  • Unit of parallelism (each partition is consumed by at most one consumer per group)

  • Distributed across brokers for load balancing

Replicas:

  • Copies of partitions across multiple brokers

  • 1 leader replica (serves reads/writes)

  • N-1 follower replicas (replication only)

  • In-Sync Replicas (ISR): Followers caught up with leader

KRaft vs ZooKeeper Mode

KRaft Mode (Recommended, Kafka 3.3+):

Cluster Metadata:

  • Stored in Kafka itself (no external ZooKeeper)
  • Metadata topic: __cluster_metadata
  • Controller quorum (3 or 5 nodes)
  • Faster failover (<1s vs 10-30s)
  • Simplified operations

ZooKeeper Mode (legacy; deprecated in Kafka 3.5, removed in 4.0):

External Coordination:

  • Requires separate ZooKeeper ensemble (3-5 nodes)
  • Stores cluster metadata, configs, ACLs
  • Slower failover (10-30 seconds)
  • More complex to operate

Migration: ZooKeeper → KRaft migration supported in Kafka 3.6+

Cluster Sizing Guidelines

Small Cluster (Development/Testing)

Configuration:
  Brokers: 3
  Partitions per broker: ~100-500
  Total partitions: 300-1500
  Replication factor: 3

Hardware:
  CPU: 4-8 cores
  RAM: 8-16 GB
  Disk: 500 GB - 1 TB SSD
  Network: 1 Gbps

Use Cases:

  • Development environments
  • Low-volume production (<10 MB/s)
  • Proof of concepts
  • Single datacenter

Example Workload:

  • 50 topics
  • 5-10 partitions per topic
  • 1 million messages/day
  • 7-day retention

Medium Cluster (Standard Production)

Configuration:
  Brokers: 6-12
  Partitions per broker: 500-2000
  Total partitions: 3K-24K
  Replication factor: 3

Hardware:
  CPU: 16-32 cores
  RAM: 64-128 GB
  Disk: 2-8 TB NVMe SSD
  Network: 10 Gbps

Use Cases:

  • Standard production workloads
  • Multi-team environments
  • Regional deployments
  • Up to 500 MB/s throughput

Example Workload:

  • 200-500 topics
  • 10-50 partitions per topic
  • 100 million messages/day
  • 30-day retention

Large Cluster (High-Scale Production)

Configuration:
  Brokers: 20-100+
  Partitions per broker: 2000-4000
  Total partitions: 40K-400K+
  Replication factor: 3

Hardware:
  CPU: 32-64 cores
  RAM: 128-256 GB
  Disk: 8-20 TB NVMe SSD
  Network: 25-100 Gbps

Use Cases:

  • Large enterprises
  • Multi-region deployments
  • Event-driven architectures
  • 1+ GB/s throughput

Example Workload:

  • 1000+ topics
  • 50-200 partitions per topic
  • 1+ billion messages/day
  • 90-365 day retention

Kafka Streams / Exactly-Once Semantics (EOS) Clusters

Configuration:
  Brokers: 6-12+ (same as standard, but more control-plane load)
  Partitions per broker: 500-1500 (fewer due to transaction overhead)
  Total partitions: 3K-18K
  Replication factor: 3

Hardware:
  CPU: 16-32 cores (more CPU for transactions)
  RAM: 64-128 GB
  Disk: 4-12 TB NVMe SSD (more for transaction logs)
  Network: 10-25 Gbps

Special Considerations:

  • More brokers due to transaction coordinator load
  • Lower partition count per broker (transactions = more overhead)
  • Higher disk IOPS for transaction logs
  • min.insync.replicas=2 mandatory for EOS
  • acks=all required for producers

Use Cases:

  • Stream processing with exactly-once guarantees
  • Financial transactions
  • Event sourcing with strict ordering
  • Multi-step workflows requiring atomicity
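A quick sanity check on the disk figures above: required storage is write rate × retention × replication factor, divided across brokers, plus headroom. A minimal sketch (the function name and the 1.5x headroom factor are illustrative choices, not from any Kafka tooling):

```typescript
// Rough per-broker disk estimate: writeRate × retention × RF ÷ brokers,
// padded for headroom (segment overhead, compaction, growth).
function diskPerBrokerGB(
  writeMBps: number,
  retentionDays: number,
  replicationFactor: number,
  brokers: number,
  headroom = 1.5, // keep roughly a third of the disk free
): number {
  const totalGB =
    (writeMBps * 86_400 * retentionDays * replicationFactor) / 1024;
  return Math.ceil((totalGB / brokers) * headroom);
}

// Example: 10 MB/s writes, 7-day retention, RF=3, 6 brokers
console.log(diskPerBrokerGB(10, 7, 3, 6)); // → 4430 GB (~4.4 TB per broker)
```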

Partitioning Strategy

How Many Partitions?

Formula:

Partitions = max(
  Target Throughput ÷ Single-Partition Throughput,
  Number of Consumers (for parallelism)
) × Future Growth Factor (2-3x)

Single Partition Limits:

  • Write throughput: ~10-50 MB/s
  • Read throughput: ~30-100 MB/s
  • Message rate: ~10K-100K msg/s
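The formula and per-partition limits can be combined into one helper; the 20 MB/s write and 40 MB/s read defaults below are mid-range assumptions drawn from the limits just listed:

```typescript
// Partitions = max(write need, read need, consumer parallelism) × growth factor
function recommendPartitions(opts: {
  writeMBps: number;
  readMBps: number;
  consumers: number;
  growthFactor?: number;         // plan 2-3x ahead
  perPartitionWriteMBps?: number; // assumed mid-range: 20 MB/s
  perPartitionReadMBps?: number;  // assumed mid-range: 40 MB/s
}): number {
  const growth = opts.growthFactor ?? 3;
  const w = opts.perPartitionWriteMBps ?? 20;
  const r = opts.perPartitionReadMBps ?? 40;
  const base = Math.max(
    Math.ceil(opts.writeMBps / w),
    Math.ceil(opts.readMBps / r),
    opts.consumers,
  );
  return base * growth;
}

// High-throughput topic: 200 MB/s in, 500 MB/s out, 10 consumers
console.log(recommendPartitions({ writeMBps: 200, readMBps: 500, consumers: 10 }));
// → 39, in line with the 40-50 recommendation below
```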

Examples:

High Throughput Topic (Logs, Events):

Requirements:

  • Write: 200 MB/s
  • Read: 500 MB/s (multiple consumers)
  • Expected growth: 3x in 1 year

Calculation:
  Write partitions: 200 MB/s ÷ 20 MB/s = 10
  Read partitions: 500 MB/s ÷ 40 MB/s = 13
  Growth factor: 13 × 3 = 39

Recommendation: 40-50 partitions

Low-Latency Topic (Commands, Requests):

Requirements:

  • Write: 5 MB/s
  • Read: 10 MB/s
  • Latency: <10ms p99
  • Order preservation: By user ID

Calculation:
  Throughput partitions: 5 MB/s ÷ 20 MB/s = 1
  Parallelism: 4 (for redundancy)

Recommendation: 4-6 partitions (keyed by user ID)

Dead Letter Queue:

Recommendation: 1-3 partitions
Reason: Low volume, order less important

Partition Key Selection

Good Keys (High Cardinality, Even Distribution):

✅ User ID (UUIDs):

  • Millions of unique values
  • Even distribution
  • Example: "user-123e4567-e89b-12d3-a456-426614174000"

✅ Device ID (IoT):

  • Unique per device
  • Natural sharding
  • Example: "device-sensor-001-zone-a"

✅ Order ID (E-commerce):

  • Unique per transaction
  • Even temporal distribution
  • Example: "order-2024-11-15-abc123"

Bad Keys (Low Cardinality, Hotspots):

❌ Country Code:

  • Only ~200 values
  • Uneven (US, CN >> others)
  • Creates partition hotspots

❌ Boolean Flags:

  • Only 2 values (true/false)
  • Severe imbalance

❌ Date (YYYY-MM-DD):

  • All today's traffic → 1 partition
  • Temporal hotspot

Compound Keys (Best of Both):

✅ Country + User ID:

  • Partition by country for locality
  • Sub-partition by user for distribution
  • Example: "US:user-123" → hash("US:user-123")

✅ Tenant + Event Type + Timestamp:

  • Multi-tenant isolation
  • Event type grouping
  • Temporal ordering
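Under the hood, Kafka's default partitioner hashes the key bytes (murmur2) modulo the partition count, which is why key cardinality drives distribution. A sketch using a simple FNV-1a hash as an illustrative stand-in for murmur2:

```typescript
// Illustrative key → partition mapping (FNV-1a, NOT Kafka's actual murmur2).
// The point: equal keys always land on the same partition, and
// high-cardinality keys spread roughly evenly across partitions.
function partitionFor(key: string, numPartitions: number): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV prime
  }
  return (h >>> 0) % numPartitions;
}

// Same compound key → same partition, so per-user ordering is preserved
const p1 = partitionFor("US:user-123", 12);
const p2 = partitionFor("US:user-123", 12);
console.log(p1 === p2); // true
```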

Replication & High Availability

Replication Factor Guidelines

Development:
  Replication Factor: 1
  Reason: Fast, no durability needed

Production (Standard):
  Replication Factor: 3
  Reason: Balance durability vs cost
  Tolerates: data survives 2 broker failures; stays writable through 1 (with min.insync.replicas=2, acks=all)

Production (Critical):
  Replication Factor: 5
  Reason: Maximum durability
  Tolerates: data survives 4 broker failures; stays writable through 2 (with min.insync.replicas=3)
  Use Cases: Financial transactions, audit logs

Multi-Datacenter:
  Replication Factor: 3 per DC (6 total)
  Reason: DC-level fault tolerance
  Requires: MirrorMaker 2 or Confluent Replicator

min.insync.replicas

Configuration:

min.insync.replicas=2:

  • At least 2 replicas must acknowledge writes
  • Typical for replication.factor=3
  • Prevents data loss if 1 broker fails

min.insync.replicas=1:

  • Only leader must acknowledge (dangerous!)
  • Use only for non-critical topics

min.insync.replicas=3:

  • At least 3 replicas must acknowledge
  • For replication.factor=5 (critical systems)

Rule: min.insync.replicas ≤ replication.factor - 1 (to allow 1 replica failure)
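The two settings trade off as follows: acks=all writes stall once in-sync replicas drop below min.insync.replicas, while data is lost only when every replica is gone. A sketch:

```typescript
// How many broker failures a topic survives, by concern:
//   writes stall once in-sync replicas drop below min.insync.replicas
//   data is lost only when every replica is gone
function faultTolerance(replicationFactor: number, minInsyncReplicas: number) {
  return {
    writableThrough: replicationFactor - minInsyncReplicas, // acks=all keeps working
    dataSurvives: replicationFactor - 1,                    // at least one copy left
  };
}

console.log(faultTolerance(3, 2)); // { writableThrough: 1, dataSurvives: 2 }
console.log(faultTolerance(5, 3)); // { writableThrough: 2, dataSurvives: 4 }
```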

Rack Awareness

Configuration:
  broker.rack=rack1   # Broker 1
  broker.rack=rack2   # Broker 2
  broker.rack=rack3   # Broker 3

Benefit:

  • Replicas spread across racks
  • Survives rack-level failures (power, network)
  • Example: Topic with RF=3 → 1 replica per rack

Placement:
  Leader: rack1
  Follower 1: rack2
  Follower 2: rack3

Retention Strategies

Time-Based Retention

Short-Term (Events, Logs):
  retention.ms: 86400000        # 1 day
  Use Cases: Real-time analytics, monitoring

Medium-Term (Transactions):
  retention.ms: 604800000       # 7 days
  Use Cases: Standard business events

Long-Term (Audit, Compliance):
  retention.ms: 31536000000     # 365 days
  Use Cases: Regulatory requirements, event sourcing

Infinite (Event Sourcing):
  retention.ms: -1              # Forever
  cleanup.policy: compact
  Use Cases: Source of truth, state rebuilding

Size-Based Retention

retention.bytes: 10737418240    # 10 GB per partition

Combined (Time OR Size):
  retention.ms: 604800000       # 7 days
  retention.bytes: 107374182400 # 100 GB

Whichever limit is reached first applies.
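Which limit binds first depends on the partition's write rate: at a steady rate, the size cap is reached after retention.bytes ÷ rate seconds. A sketch:

```typescript
// Given a steady per-partition write rate, report whether the time
// or the size limit is hit first under combined retention.
function bindingLimit(
  retentionMs: number,
  retentionBytes: number,
  bytesPerSec: number,
): "size" | "time" {
  const msToFillSizeCap = (retentionBytes / bytesPerSec) * 1000;
  return msToFillSizeCap < retentionMs ? "size" : "time";
}

// 100 GB cap, 7-day window, 1 MB/s per partition:
// 100 GB ÷ 1 MB/s ≈ 1.2 days, so the size cap wins.
console.log(bindingLimit(604_800_000, 107_374_182_400, 1_048_576)); // "size"
```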

Compaction (Log Compaction)

cleanup.policy: compact

How It Works:

  • Keeps only latest value per key
  • Deletes old versions
  • Preserves full history initially, compacts later

Use Cases:

  • Database changelogs (CDC)
  • User profile updates
  • Configuration management
  • State stores

Example:

Before Compaction:
  user:123 → {name: "Alice", v:1}
  user:123 → {name: "Alice", v:2, email: "alice@ex.com"}
  user:123 → {name: "Alice A.", v:3}

After Compaction:
  user:123 → {name: "Alice A.", v:3}   # Latest only
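Logically, compaction converges to a latest-value-per-key map, with null-valued records acting as tombstones that delete the key. A sketch of that end state (Kafka performs this lazily, segment by segment):

```typescript
// The logical effect of log compaction: for each key, only the most
// recent value survives; a record with a null value is a tombstone.
type LogRecord<V> = { key: string; value: V | null };

function compacted<V>(log: LogRecord<V>[]): Map<string, V> {
  const latest = new Map<string, V>();
  for (const { key, value } of log) {
    if (value === null) latest.delete(key); // tombstone removes the key
    else latest.set(key, value);
  }
  return latest;
}

const log = [
  { key: "user:123", value: "Alice v1" },
  { key: "user:123", value: "Alice v2" },
  { key: "user:123", value: "Alice A. v3" },
];
console.log(compacted(log).get("user:123")); // "Alice A. v3"
```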

Performance Optimization

Broker Configuration

Network threads (handle client connections):
  num.network.threads: 8               # Increase for high connection count

I/O threads (disk operations):
  num.io.threads: 16                   # Set to number of disks × 2

Replica fetcher threads:
  num.replica.fetchers: 4              # Increase for many partitions

Socket buffer sizes:
  socket.send.buffer.bytes: 1048576    # 1 MB
  socket.receive.buffer.bytes: 1048576 # 1 MB

Log flush (default: OS handles flushing):
  log.flush.interval.messages: 10000   # Flush every 10K messages
  log.flush.interval.ms: 1000          # Or every 1 second

Producer Optimization

High Throughput:
  batch.size: 65536          # 64 KB
  linger.ms: 100             # Wait up to 100ms for batching
  compression.type: lz4      # Fast compression
  acks: 1                    # Leader only

Low Latency:
  batch.size: 16384          # 16 KB (default)
  linger.ms: 0               # Send immediately
  compression.type: none
  acks: 1

Durability (Exactly-Once):
  batch.size: 16384
  linger.ms: 10
  compression.type: lz4
  acks: all
  enable.idempotence: true
  transactional.id: "producer-1"
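For templating, the three profiles can be kept as plain property maps using the Java client's property names (the variable names here are illustrative, not part of any Kafka API):

```typescript
// The three producer profiles as property maps (Java client property names).
const producerProfiles: { [profile: string]: { [prop: string]: string } } = {
  highThroughput: {
    "batch.size": "65536",
    "linger.ms": "100",
    "compression.type": "lz4",
    "acks": "1",
  },
  lowLatency: {
    "batch.size": "16384",
    "linger.ms": "0",
    "compression.type": "none",
    "acks": "1",
  },
  exactlyOnce: {
    "batch.size": "16384",
    "linger.ms": "10",
    "compression.type": "lz4",
    "acks": "all",
    "enable.idempotence": "true",
    "transactional.id": "producer-1",
  },
};

// Render one profile as a .properties snippet
const props = Object.entries(producerProfiles.exactlyOnce)
  .map(([k, v]) => `${k}=${v}`)
  .join("\n");
console.log(props);
```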

Consumer Optimization

High Throughput:
  fetch.min.bytes: 1048576   # 1 MB
  fetch.max.wait.ms: 500     # Wait up to 500ms to accumulate

Low Latency:
  fetch.min.bytes: 1         # Immediate fetch
  fetch.max.wait.ms: 100     # Short wait

Max Parallelism:

  • Deploy consumers = number of partitions
  • More consumers than partitions = idle consumers
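Why surplus consumers idle: each partition is assigned to exactly one consumer in the group. A round-robin sketch (illustrative, not the broker's actual group assignor):

```typescript
// Round-robin partition assignment: with more consumers than partitions,
// the surplus consumers receive nothing and sit idle.
function assign(partitions: number, consumers: number): number[][] {
  const out: number[][] = Array.from({ length: consumers }, () => []);
  for (let p = 0; p < partitions; p++) out[p % consumers].push(p);
  return out;
}

console.log(assign(4, 6));
// → [[0], [1], [2], [3], [], []]  (consumers 5 and 6 are idle)
```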

Multi-Datacenter Patterns

Active-Passive (Disaster Recovery)

Architecture:
  Primary DC: Full Kafka cluster
  Secondary DC: Replica cluster (MirrorMaker 2)

Configuration:

  • Producers → Primary only
  • Consumers → Primary only
  • MirrorMaker 2: Primary → Secondary (async replication)

Failover:

  1. Detect primary failure
  2. Switch producers/consumers to secondary
  3. Promote secondary to primary

Recovery Time: 5-30 minutes (manual)
Data Loss: Potential (async replication lag)

Active-Active (Geo-Replication)

Architecture:
  DC1: Kafka cluster (region A)
  DC2: Kafka cluster (region B)
  Bidirectional replication via MirrorMaker 2

Configuration:

  • Producers → Nearest DC
  • Consumers → Nearest DC or both
  • Conflict resolution: Last-write-wins or custom

Challenges:

  • Duplicate messages (at-least-once delivery)
  • Ordering across DCs not guaranteed
  • Circular replication prevention

Use Cases:

  • Global applications
  • Regional compliance (GDPR)
  • Load distribution

Stretch Cluster (Synchronous Replication)

Architecture:
  Single Kafka cluster spanning 2 DCs
  Rack awareness: DC1 = rack1, DC2 = rack2

Configuration:
  min.insync.replicas: 2
  replication.factor: 4      # 2 per DC
  acks: all

Requirements:

  • Low latency between DCs (<10ms)
  • High bandwidth link (10+ Gbps)
  • Dedicated fiber

Trade-offs:
  Pros: Synchronous replication, zero data loss
  Cons: Latency penalty, network dependency

Monitoring & Observability

Key Metrics

Broker Metrics:

UnderReplicatedPartitions:
  Alert: > 0 for > 5 minutes
  Indicates: Replica lag, broker failure

OfflinePartitionsCount:
  Alert: > 0
  Indicates: No leader elected (critical!)

ActiveControllerCount:
  Alert: != 1 (should be exactly 1)
  Indicates: Split brain or no controller

RequestHandlerAvgIdlePercent:
  Alert: < 20%
  Indicates: Broker CPU saturation

Topic Metrics:

MessagesInPerSec:
  Monitor: Throughput trends
  Alert: Sudden drops (producer failure)

BytesInPerSec / BytesOutPerSec:
  Monitor: Network utilization
  Alert: Approaching NIC limits

RecordsLagMax (Consumer):
  Alert: > 10000 or growing
  Indicates: Consumer can't keep up

Disk Metrics:

LogSegmentSize:
  Monitor: Disk usage trends
  Alert: > 80% capacity

LogFlushRateAndTimeMs:
  Monitor: Disk write latency
  Alert: > 100ms p99 (slow disk)

Security Patterns

Authentication & Authorization

SASL/SCRAM-SHA-512:

  • Industry standard
  • User/password authentication
  • Stored in ZooKeeper/KRaft

ACLs (Access Control Lists):

  • Per-topic, per-group permissions
  • Operations: READ, WRITE, CREATE, DELETE, ALTER
  • Example:
      bin/kafka-acls.sh --add \
        --allow-principal User:alice \
        --operation READ \
        --topic orders

mTLS (Mutual TLS):

  • Certificate-based auth
  • Strong cryptographic identity
  • Best for service-to-service

Integration with SpecWeave

Automatic Architecture Detection:

import { ClusterSizingCalculator } from './lib/utils/sizing';

const calculator = new ClusterSizingCalculator();
const recommendation = calculator.calculate({
  throughputMBps: 200,
  retentionDays: 30,
  replicationFactor: 3,
  topicCount: 100
});

console.log(recommendation);
// {
//   brokers: 8,
//   partitionsPerBroker: 1500,
//   diskPerBroker: 6000 GB,
//   ramPerBroker: 64 GB
// }

SpecWeave Commands:

  • /sw-kafka:deploy: Validates cluster sizing before deployment
  • /sw-kafka:monitor-setup: Configures metrics for key indicators

Related Skills

  • /sw-kafka:kafka-mcp-integration: MCP server setup
  • /sw-kafka:kafka-cli-tools: CLI operations

External Links

  • Kafka Documentation - Architecture

  • Confluent - Kafka Sizing

  • KRaft Mode Overview

  • LinkedIn Engineering - Kafka at Scale

