
Kafka Architecture & Design Expert

Comprehensive knowledge of Apache Kafka architecture patterns, cluster design principles, and production best practices for building resilient, scalable event streaming platforms.

Core Architecture Concepts

Kafka Cluster Components

Brokers:

  • Individual Kafka servers that store and serve data

  • Each broker handles thousands of partitions

  • Typical: 3-10 brokers per cluster (small), 10-100+ (large enterprises)

Controller:

  • One broker elected as controller (via KRaft or ZooKeeper)

  • Manages partition leaders and replica assignments

  • Failure triggers automatic re-election

Topics:

  • Logical channels for message streams

  • Divided into partitions for parallelism

  • Can have different retention policies per topic

Partitions:

  • Ordered, immutable sequence of records

  • Unit of parallelism (each partition is consumed by at most one consumer per group)

  • Distributed across brokers for load balancing

Replicas:

  • Copies of partitions across multiple brokers

  • 1 leader replica (serves reads/writes)

  • N-1 follower replicas (replication only)

  • In-Sync Replicas (ISR): Followers caught up with leader

KRaft vs ZooKeeper Mode

KRaft Mode (Recommended, Kafka 3.3+):

Cluster Metadata:

  • Stored in Kafka itself (no external ZooKeeper)
  • Metadata topic: __cluster_metadata
  • Controller quorum (3 or 5 nodes)
  • Faster failover (<1s vs 10-30s)
  • Simplified operations

ZooKeeper Mode (legacy; deprecated in Kafka 3.5, removed in 4.0):

External Coordination:

  • Requires separate ZooKeeper ensemble (3-5 nodes)
  • Stores cluster metadata, configs, ACLs
  • Slower failover (10-30 seconds)
  • More complex to operate

Migration: ZooKeeper → KRaft migration supported in Kafka 3.6+

Cluster Sizing Guidelines

Small Cluster (Development/Testing)

Configuration:
  Brokers: 3
  Partitions per broker: ~100-500
  Total partitions: 300-1500
  Replication factor: 3

Hardware:
  CPU: 4-8 cores
  RAM: 8-16 GB
  Disk: 500 GB - 1 TB SSD
  Network: 1 Gbps

Use Cases:

  • Development environments
  • Low-volume production (<10 MB/s)
  • Proof of concepts
  • Single datacenter

Example Workload:

  • 50 topics
  • 5-10 partitions per topic
  • 1 million messages/day
  • 7-day retention

Medium Cluster (Standard Production)

Configuration:
  Brokers: 6-12
  Partitions per broker: 500-2000
  Total partitions: 3K-24K
  Replication factor: 3

Hardware:
  CPU: 16-32 cores
  RAM: 64-128 GB
  Disk: 2-8 TB NVMe SSD
  Network: 10 Gbps

Use Cases:

  • Standard production workloads
  • Multi-team environments
  • Regional deployments
  • Up to 500 MB/s throughput

Example Workload:

  • 200-500 topics
  • 10-50 partitions per topic
  • 100 million messages/day
  • 30-day retention

Large Cluster (High-Scale Production)

Configuration:
  Brokers: 20-100+
  Partitions per broker: 2000-4000
  Total partitions: 40K-400K+
  Replication factor: 3

Hardware:
  CPU: 32-64 cores
  RAM: 128-256 GB
  Disk: 8-20 TB NVMe SSD
  Network: 25-100 Gbps

Use Cases:

  • Large enterprises
  • Multi-region deployments
  • Event-driven architectures
  • 1+ GB/s throughput

Example Workload:

  • 1000+ topics
  • 50-200 partitions per topic
  • 1+ billion messages/day
  • 90-365 day retention

Kafka Streams / Exactly-Once Semantics (EOS) Clusters

Configuration:
  Brokers: 6-12+ (same as standard, but more control-plane load)
  Partitions per broker: 500-1500 (fewer due to transaction overhead)
  Total partitions: 3K-18K
  Replication factor: 3

Hardware:
  CPU: 16-32 cores (more CPU for transactions)
  RAM: 64-128 GB
  Disk: 4-12 TB NVMe SSD (more for transaction logs)
  Network: 10-25 Gbps

Special Considerations:

  • More brokers due to transaction coordinator load
  • Lower partition count per broker (transactions = more overhead)
  • Higher disk IOPS for transaction logs
  • min.insync.replicas=2 mandatory for EOS
  • acks=all required for producers

Use Cases:

  • Stream processing with exactly-once guarantees
  • Financial transactions
  • Event sourcing with strict ordering
  • Multi-step workflows requiring atomicity
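A quick sanity check on the disk figures above: required storage is write rate × retention × replication factor, divided across brokers, plus headroom. A minimal sketch (the function name and the 1.5x headroom factor are illustrative choices, not from any Kafka tooling):

```typescript
// Rough per-broker disk estimate: writeRate × retention × RF ÷ brokers,
// padded for headroom (segment overhead, compaction, growth).
function diskPerBrokerGB(
  writeMBps: number,
  retentionDays: number,
  replicationFactor: number,
  brokers: number,
  headroom = 1.5, // keep roughly a third of the disk free
): number {
  const totalGB =
    (writeMBps * 86_400 * retentionDays * replicationFactor) / 1024;
  return Math.ceil((totalGB / brokers) * headroom);
}

// Example: 10 MB/s writes, 7-day retention, RF=3, 6 brokers
console.log(diskPerBrokerGB(10, 7, 3, 6)); // → 4430 GB (~4.4 TB per broker)
```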

Partitioning Strategy

How Many Partitions?

Formula:

Partitions = max(
  Target Throughput ÷ Single-Partition Throughput,
  Number of Consumers (for parallelism)
) × Future Growth Factor (2-3x)

Single Partition Limits:

  • Write throughput: ~10-50 MB/s
  • Read throughput: ~30-100 MB/s
  • Message rate: ~10K-100K msg/s
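The formula and per-partition limits can be combined into one helper; the 20 MB/s write and 40 MB/s read defaults below are mid-range assumptions drawn from the limits just listed:

```typescript
// Partitions = max(write need, read need, consumer parallelism) × growth factor
function recommendPartitions(opts: {
  writeMBps: number;
  readMBps: number;
  consumers: number;
  growthFactor?: number;         // plan 2-3x ahead
  perPartitionWriteMBps?: number; // assumed mid-range: 20 MB/s
  perPartitionReadMBps?: number;  // assumed mid-range: 40 MB/s
}): number {
  const growth = opts.growthFactor ?? 3;
  const w = opts.perPartitionWriteMBps ?? 20;
  const r = opts.perPartitionReadMBps ?? 40;
  const base = Math.max(
    Math.ceil(opts.writeMBps / w),
    Math.ceil(opts.readMBps / r),
    opts.consumers,
  );
  return base * growth;
}

// High-throughput topic: 200 MB/s in, 500 MB/s out, 10 consumers
console.log(recommendPartitions({ writeMBps: 200, readMBps: 500, consumers: 10 }));
// → 39, in line with the 40-50 recommendation below
```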

Examples:

High Throughput Topic (Logs, Events):

Requirements:

  • Write: 200 MB/s
  • Read: 500 MB/s (multiple consumers)
  • Expected growth: 3x in 1 year

Calculation:
  Write partitions: 200 MB/s ÷ 20 MB/s = 10
  Read partitions: 500 MB/s ÷ 40 MB/s = 13
  Growth factor: 13 × 3 = 39

Recommendation: 40-50 partitions

Low-Latency Topic (Commands, Requests):

Requirements:

  • Write: 5 MB/s
  • Read: 10 MB/s
  • Latency: <10ms p99
  • Order preservation: By user ID

Calculation:
  Throughput partitions: 5 MB/s ÷ 20 MB/s = 1
  Parallelism: 4 (for redundancy)

Recommendation: 4-6 partitions (keyed by user ID)

Dead Letter Queue:

Recommendation: 1-3 partitions
Reason: Low volume, order less important

Partition Key Selection

Good Keys (High Cardinality, Even Distribution):

✅ User ID (UUIDs):

  • Millions of unique values
  • Even distribution
  • Example: "user-123e4567-e89b-12d3-a456-426614174000"

✅ Device ID (IoT):

  • Unique per device
  • Natural sharding
  • Example: "device-sensor-001-zone-a"

✅ Order ID (E-commerce):

  • Unique per transaction
  • Even temporal distribution
  • Example: "order-2024-11-15-abc123"

Bad Keys (Low Cardinality, Hotspots):

❌ Country Code:

  • Only ~200 values
  • Uneven (US, CN >> others)
  • Creates partition hotspots

❌ Boolean Flags:

  • Only 2 values (true/false)
  • Severe imbalance

❌ Date (YYYY-MM-DD):

  • All today's traffic → 1 partition
  • Temporal hotspot

Compound Keys (Best of Both):

✅ Country + User ID:

  • Partition by country for locality
  • Sub-partition by user for distribution
  • Example: "US:user-123" → hash("US:user-123")

✅ Tenant + Event Type + Timestamp:

  • Multi-tenant isolation
  • Event type grouping
  • Temporal ordering
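Under the hood, Kafka's default partitioner hashes the key bytes (murmur2) modulo the partition count, which is why key cardinality drives distribution. A sketch using a simple FNV-1a hash as an illustrative stand-in for murmur2:

```typescript
// Illustrative key → partition mapping (FNV-1a, NOT Kafka's actual murmur2).
// The point: equal keys always land on the same partition, and
// high-cardinality keys spread roughly evenly across partitions.
function partitionFor(key: string, numPartitions: number): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV prime
  }
  return (h >>> 0) % numPartitions;
}

// Same compound key → same partition, so per-user ordering is preserved
const p1 = partitionFor("US:user-123", 12);
const p2 = partitionFor("US:user-123", 12);
console.log(p1 === p2); // true
```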

Replication & High Availability

Replication Factor Guidelines

Development:
  Replication Factor: 1
  Reason: Fast, no durability needed

Production (Standard):
  Replication Factor: 3
  Reason: Balance durability vs cost
  Tolerates: data survives 2 broker failures; stays writable through 1 (with min.insync.replicas=2, acks=all)

Production (Critical):
  Replication Factor: 5
  Reason: Maximum durability
  Tolerates: data survives 4 broker failures; stays writable through 2 (with min.insync.replicas=3)
  Use Cases: Financial transactions, audit logs

Multi-Datacenter:
  Replication Factor: 3 per DC (6 total)
  Reason: DC-level fault tolerance
  Requires: MirrorMaker 2 or Confluent Replicator

min.insync.replicas

Configuration:

min.insync.replicas=2:

  • At least 2 replicas must acknowledge writes
  • Typical for replication.factor=3
  • Prevents data loss if 1 broker fails

min.insync.replicas=1:

  • Only leader must acknowledge (dangerous!)
  • Use only for non-critical topics

min.insync.replicas=3:

  • At least 3 replicas must acknowledge
  • For replication.factor=5 (critical systems)

Rule: min.insync.replicas ≤ replication.factor - 1 (to allow 1 replica failure)
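The two settings trade off as follows: acks=all writes stall once in-sync replicas drop below min.insync.replicas, while data is lost only when every replica is gone. A sketch:

```typescript
// How many broker failures a topic survives, by concern:
//   writes stall once in-sync replicas drop below min.insync.replicas
//   data is lost only when every replica is gone
function faultTolerance(replicationFactor: number, minInsyncReplicas: number) {
  return {
    writableThrough: replicationFactor - minInsyncReplicas, // acks=all keeps working
    dataSurvives: replicationFactor - 1,                    // at least one copy left
  };
}

console.log(faultTolerance(3, 2)); // { writableThrough: 1, dataSurvives: 2 }
console.log(faultTolerance(5, 3)); // { writableThrough: 2, dataSurvives: 4 }
```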

Rack Awareness

Configuration:
  broker.rack=rack1   # Broker 1
  broker.rack=rack2   # Broker 2
  broker.rack=rack3   # Broker 3

Benefit:

  • Replicas spread across racks
  • Survives rack-level failures (power, network)
  • Example: Topic with RF=3 → 1 replica per rack

Placement:
  Leader: rack1
  Follower 1: rack2
  Follower 2: rack3

Retention Strategies

Time-Based Retention

Short-Term (Events, Logs):
  retention.ms: 86400000        # 1 day
  Use Cases: Real-time analytics, monitoring

Medium-Term (Transactions):
  retention.ms: 604800000       # 7 days
  Use Cases: Standard business events

Long-Term (Audit, Compliance):
  retention.ms: 31536000000     # 365 days
  Use Cases: Regulatory requirements, event sourcing

Infinite (Event Sourcing):
  retention.ms: -1              # Forever
  cleanup.policy: compact
  Use Cases: Source of truth, state rebuilding

Size-Based Retention

retention.bytes: 10737418240    # 10 GB per partition

Combined (Time OR Size):
  retention.ms: 604800000       # 7 days
  retention.bytes: 107374182400 # 100 GB

Whichever limit is reached first applies.
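Which limit binds first depends on the partition's write rate: at a steady rate, the size cap is reached after retention.bytes ÷ rate seconds. A sketch:

```typescript
// Given a steady per-partition write rate, report whether the time
// or the size limit is hit first under combined retention.
function bindingLimit(
  retentionMs: number,
  retentionBytes: number,
  bytesPerSec: number,
): "size" | "time" {
  const msToFillSizeCap = (retentionBytes / bytesPerSec) * 1000;
  return msToFillSizeCap < retentionMs ? "size" : "time";
}

// 100 GB cap, 7-day window, 1 MB/s per partition:
// 100 GB ÷ 1 MB/s ≈ 1.2 days, so the size cap wins.
console.log(bindingLimit(604_800_000, 107_374_182_400, 1_048_576)); // "size"
```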

Compaction (Log Compaction)

cleanup.policy: compact

How It Works:

  • Keeps only latest value per key
  • Deletes old versions
  • Preserves full history initially, compacts later

Use Cases:

  • Database changelogs (CDC)
  • User profile updates
  • Configuration management
  • State stores

Example:

Before Compaction:
  user:123 → {name: "Alice", v:1}
  user:123 → {name: "Alice", v:2, email: "alice@ex.com"}
  user:123 → {name: "Alice A.", v:3}

After Compaction:
  user:123 → {name: "Alice A.", v:3}   # Latest only
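Logically, compaction converges to a latest-value-per-key map, with null-valued records acting as tombstones that delete the key. A sketch of that end state (Kafka performs this lazily, segment by segment):

```typescript
// The logical effect of log compaction: for each key, only the most
// recent value survives; a record with a null value is a tombstone.
type LogRecord<V> = { key: string; value: V | null };

function compacted<V>(log: LogRecord<V>[]): Map<string, V> {
  const latest = new Map<string, V>();
  for (const { key, value } of log) {
    if (value === null) latest.delete(key); // tombstone removes the key
    else latest.set(key, value);
  }
  return latest;
}

const log = [
  { key: "user:123", value: "Alice v1" },
  { key: "user:123", value: "Alice v2" },
  { key: "user:123", value: "Alice A. v3" },
];
console.log(compacted(log).get("user:123")); // "Alice A. v3"
```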

Performance Optimization

Broker Configuration

Network threads (handle client connections):
  num.network.threads: 8               # Increase for high connection count

I/O threads (disk operations):
  num.io.threads: 16                   # Set to number of disks × 2

Replica fetcher threads:
  num.replica.fetchers: 4              # Increase for many partitions

Socket buffer sizes:
  socket.send.buffer.bytes: 1048576    # 1 MB
  socket.receive.buffer.bytes: 1048576 # 1 MB

Log flush (default: OS handles flushing):
  log.flush.interval.messages: 10000   # Flush every 10K messages
  log.flush.interval.ms: 1000          # Or every 1 second

Producer Optimization

High Throughput:
  batch.size: 65536          # 64 KB
  linger.ms: 100             # Wait up to 100ms for batching
  compression.type: lz4      # Fast compression
  acks: 1                    # Leader only

Low Latency:
  batch.size: 16384          # 16 KB (default)
  linger.ms: 0               # Send immediately
  compression.type: none
  acks: 1

Durability (Exactly-Once):
  batch.size: 16384
  linger.ms: 10
  compression.type: lz4
  acks: all
  enable.idempotence: true
  transactional.id: "producer-1"
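For templating, the three profiles can be kept as plain property maps using the Java client's property names (the variable names here are illustrative, not part of any Kafka API):

```typescript
// The three producer profiles as property maps (Java client property names).
const producerProfiles: { [profile: string]: { [prop: string]: string } } = {
  highThroughput: {
    "batch.size": "65536",
    "linger.ms": "100",
    "compression.type": "lz4",
    "acks": "1",
  },
  lowLatency: {
    "batch.size": "16384",
    "linger.ms": "0",
    "compression.type": "none",
    "acks": "1",
  },
  exactlyOnce: {
    "batch.size": "16384",
    "linger.ms": "10",
    "compression.type": "lz4",
    "acks": "all",
    "enable.idempotence": "true",
    "transactional.id": "producer-1",
  },
};

// Render one profile as a .properties snippet
const props = Object.entries(producerProfiles.exactlyOnce)
  .map(([k, v]) => `${k}=${v}`)
  .join("\n");
console.log(props);
```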

Consumer Optimization

High Throughput:
  fetch.min.bytes: 1048576   # 1 MB
  fetch.max.wait.ms: 500     # Wait up to 500ms to accumulate

Low Latency:
  fetch.min.bytes: 1         # Immediate fetch
  fetch.max.wait.ms: 100     # Short wait

Max Parallelism:

  • Deploy consumers = number of partitions
  • More consumers than partitions = idle consumers
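Why surplus consumers idle: each partition is assigned to exactly one consumer in the group. A round-robin sketch (illustrative, not the broker's actual group assignor):

```typescript
// Round-robin partition assignment: with more consumers than partitions,
// the surplus consumers receive nothing and sit idle.
function assign(partitions: number, consumers: number): number[][] {
  const out: number[][] = Array.from({ length: consumers }, () => []);
  for (let p = 0; p < partitions; p++) out[p % consumers].push(p);
  return out;
}

console.log(assign(4, 6));
// → [[0], [1], [2], [3], [], []]  (consumers 5 and 6 are idle)
```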

Multi-Datacenter Patterns

Active-Passive (Disaster Recovery)

Architecture:
  Primary DC: Full Kafka cluster
  Secondary DC: Replica cluster (MirrorMaker 2)

Configuration:

  • Producers → Primary only
  • Consumers → Primary only
  • MirrorMaker 2: Primary → Secondary (async replication)

Failover:

  1. Detect primary failure
  2. Switch producers/consumers to secondary
  3. Promote secondary to primary

Recovery Time: 5-30 minutes (manual)
Data Loss: Potential (async replication lag)

Active-Active (Geo-Replication)

Architecture:
  DC1: Kafka cluster (region A)
  DC2: Kafka cluster (region B)
  Bidirectional replication via MirrorMaker 2

Configuration:

  • Producers → Nearest DC
  • Consumers → Nearest DC or both
  • Conflict resolution: Last-write-wins or custom

Challenges:

  • Duplicate messages (at-least-once delivery)
  • Ordering across DCs not guaranteed
  • Circular replication prevention

Use Cases:

  • Global applications
  • Regional compliance (GDPR)
  • Load distribution

Stretch Cluster (Synchronous Replication)

Architecture:
  Single Kafka cluster spanning 2 DCs
  Rack awareness: DC1 = rack1, DC2 = rack2

Configuration:
  min.insync.replicas: 2
  replication.factor: 4      # 2 per DC
  acks: all

Requirements:

  • Low latency between DCs (<10ms)
  • High bandwidth link (10+ Gbps)
  • Dedicated fiber

Trade-offs:
  Pros: Synchronous replication, zero data loss
  Cons: Latency penalty, network dependency

Monitoring & Observability

Key Metrics

Broker Metrics:

UnderReplicatedPartitions:
  Alert: > 0 for > 5 minutes
  Indicates: Replica lag, broker failure

OfflinePartitionsCount:
  Alert: > 0
  Indicates: No leader elected (critical!)

ActiveControllerCount:
  Alert: != 1 (should be exactly 1)
  Indicates: Split brain or no controller

RequestHandlerAvgIdlePercent:
  Alert: < 20%
  Indicates: Broker CPU saturation

Topic Metrics:

MessagesInPerSec:
  Monitor: Throughput trends
  Alert: Sudden drops (producer failure)

BytesInPerSec / BytesOutPerSec:
  Monitor: Network utilization
  Alert: Approaching NIC limits

RecordsLagMax (Consumer):
  Alert: > 10000 or growing
  Indicates: Consumer can't keep up

Disk Metrics:

LogSegmentSize:
  Monitor: Disk usage trends
  Alert: > 80% capacity

LogFlushRateAndTimeMs:
  Monitor: Disk write latency
  Alert: > 100ms p99 (slow disk)

Security Patterns

Authentication & Authorization

SASL/SCRAM-SHA-512:

  • Industry standard
  • User/password authentication
  • Stored in ZooKeeper/KRaft

ACLs (Access Control Lists):

  • Per-topic, per-group permissions
  • Operations: READ, WRITE, CREATE, DELETE, ALTER
  • Example:
      bin/kafka-acls.sh --add \
        --allow-principal User:alice \
        --operation READ \
        --topic orders

mTLS (Mutual TLS):

  • Certificate-based auth
  • Strong cryptographic identity
  • Best for service-to-service

Integration with SpecWeave

Automatic Architecture Detection:

import { ClusterSizingCalculator } from './lib/utils/sizing';

const calculator = new ClusterSizingCalculator();
const recommendation = calculator.calculate({
  throughputMBps: 200,
  retentionDays: 30,
  replicationFactor: 3,
  topicCount: 100
});

console.log(recommendation);
// {
//   brokers: 8,
//   partitionsPerBroker: 1500,
//   diskPerBroker: 6000 GB,
//   ramPerBroker: 64 GB
// }

SpecWeave Commands:

  • /sw-kafka:deploy: Validates cluster sizing before deployment
  • /sw-kafka:monitor-setup: Configures metrics for key indicators

Related Skills

  • /sw-kafka:kafka-mcp-integration: MCP server setup
  • /sw-kafka:kafka-cli-tools: CLI operations

External Links

  • Kafka Documentation - Architecture

  • Confluent - Kafka Sizing

  • KRaft Mode Overview

  • LinkedIn Engineering - Kafka at Scale

