Kafka Engineer
Purpose
Provides Apache Kafka and event-streaming expertise, specializing in scalable event-driven architectures and real-time data pipelines. Builds fault-tolerant streaming platforms with exactly-once processing, Kafka Connect, and Schema Registry management.
When to Use
- Designing event-driven microservices architectures
- Setting up Kafka Connect pipelines (CDC, S3 Sink)
- Writing stream processing apps (Kafka Streams / ksqlDB)
- Debugging consumer lag, rebalancing storms, or broker performance
- Designing schemas (Avro/Protobuf) with Schema Registry
- Configuring ACLs and mTLS security
Decision Framework
Architecture Selection
```
What is the use case?
│
├─ Data Integration (ETL)
│   ├─ DB to DB / Data Lake? → Kafka Connect (zero code)
│   └─ Complex transformations? → Kafka Streams
│
├─ Real-Time Analytics
│   ├─ SQL-like queries? → ksqlDB (quick aggregation)
│   └─ Complex stateful logic? → Kafka Streams / Flink
│
└─ Microservices Communication
    ├─ Event notification? → Standard Producer/Consumer
    └─ Event sourcing? → State Stores (RocksDB)
```
Config Tuning (The "Big 3")
- Throughput: `batch.size`, `linger.ms`, `compression.type=lz4` (sketch below)
- Latency: `linger.ms=0`, `acks=1`
- Durability: `acks=all`, `min.insync.replicas=2`, `replication.factor=3`
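A minimal sketch of these profiles as Java producer properties; the broker address is a placeholder:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerProfiles {
    // "broker:9092" is a placeholder bootstrap address.
    static Properties throughputProfile() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);      // 64 KB batches
        p.put(ProducerConfig.LINGER_MS_CONFIG, 20);          // wait up to 20 ms to fill a batch
        p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        return p;
    }

    static Properties durabilityProfile() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        // All in-sync replicas must ack; pair with topic settings
        // replication.factor=3 and min.insync.replicas=2.
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        return p;
    }
}
```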
Red Flags → Escalate to `sre-engineer`:
- "Unclean leader election" enabled (data loss risk)
- ZooKeeper dependency in new clusters (use KRaft mode)
- Disk usage > 80% on brokers
- Consumer lag constantly increasing (capacity mismatch)
Core Workflows
Workflow 1: Kafka Connect (CDC)
Goal: Stream changes from PostgreSQL to S3.
Steps:
Source Config (`postgres-source.json`)
{ "name": "postgres-source", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.hostname": "db-host", "database.dbname": "mydb", "database.user": "kafka", "plugin.name": "pgoutput" } }
Sink Config (`s3-sink.json`)
{ "name": "s3-sink", "config": { "connector.class": "io.confluent.connect.s3.S3SinkConnector", "s3.bucket.name": "my-datalake", "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat", "flush.size": "1000" } }
Deploy
```bash
curl -X POST -H "Content-Type: application/json" \
  -d @postgres-source.json http://connect:8083/connectors
```
Repeat with `s3-sink.json` to deploy the sink.
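To verify a connector started, the Connect REST API exposes a status endpoint:

```bash
curl http://connect:8083/connectors/postgres-source/status
```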
Workflow 2: Schema Registry Integration
Goal: Enforce schema compatibility.
Steps:
Define Schema (`user.avsc`)
{ "type": "record", "name": "User", "fields": [ {"name": "id", "type": "int"}, {"name": "name", "type": "string"} ] }
Producer (Java)
- Use `KafkaAvroSerializer` (sketch below).
- Registry URL: `http://schema-registry:8081`.
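A minimal producer sketch, assuming Confluent's `kafka-avro-serializer` artifact is on the classpath; the `users` topic and broker address are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import io.confluent.kafka.serializers.KafkaAvroSerializer;

public class UserProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        // The serializer registers/validates the schema against the registry.
        props.put("schema.registry.url", "http://schema-registry:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1);
        user.put("name", "Alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "1", user));
        }
    }
}
```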
Anti-Patterns & Gotchas
❌ Anti-Pattern 1: Large Messages
What it looks like:
- Sending a 10MB image payload in a Kafka message.
Why it fails:
- Kafka is optimized for small messages (< 1MB default limit). Large messages block broker threads.
Correct approach (claim-check pattern, sketched below):
- Store the image in S3.
- Send a reference URL in the Kafka message.
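A sketch of the claim-check pattern, assuming the AWS SDK v2 (`S3Client`); the `images` bucket and `image-events` topic are hypothetical:

```java
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ClaimCheckProducer {
    public static void main(String[] args) {
        byte[] imageBytes = loadImage(); // placeholder for the 10MB payload
        String key = "uploads/" + UUID.randomUUID();

        // 1. Store the large payload in S3 ("images" is a hypothetical bucket).
        try (S3Client s3 = S3Client.create()) {
            s3.putObject(PutObjectRequest.builder()
                    .bucket("images").key(key).build(),
                RequestBody.fromBytes(imageBytes));
        }

        // 2. Publish only the reference, keeping the Kafka message small.
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("image-events", key, "s3://images/" + key));
        }
    }

    static byte[] loadImage() { return new byte[0]; } // stub
}
```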
❌ Anti-Pattern 2: Too Many Partitions
What it looks like:
- Creating 10,000 partitions on a small cluster.
Why it fails:
- Slow leader election (ZooKeeper/controller overhead).
- High file handle usage.
Correct approach:
- Limit partitions per broker (~4,000). Use fewer topics or larger clusters.
❌ Anti-Pattern 3: Blocking Consumer
What it looks like:
- A consumer making a heavy HTTP call (30s) for each message.
Why it fails:
- Rebalance storm: the consumer exceeds `max.poll.interval.ms` and is ejected from the group.
Correct approach (sketched below):
- Async processing: move work to a thread pool.
- Pause/resume: call `consumer.pause()` while the buffer is full, `consumer.resume()` once it drains.
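A minimal sketch combining a worker pool with pause/resume backpressure; the `orders` topic, group id, and threshold are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NonBlockingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "order-workers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        ExecutorService pool = Executors.newFixedThreadPool(8);
        AtomicInteger inFlight = new AtomicInteger();
        final int MAX_IN_FLIGHT = 1000;

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // poll() keeps running even while workers are busy, so the
                // consumer continues to heartbeat and avoids rebalance storms.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                for (ConsumerRecord<String, String> record : records) {
                    inFlight.incrementAndGet();
                    pool.submit(() -> {
                        try {
                            slowHttpCall(record.value()); // the expensive work
                        } finally {
                            inFlight.decrementAndGet();
                        }
                    });
                }
                // Backpressure: stop fetching while the buffer is full.
                if (inFlight.get() >= MAX_IN_FLIGHT) {
                    consumer.pause(consumer.assignment());
                } else {
                    consumer.resume(consumer.assignment());
                }
                // Offset management elided; in production pair this with manual commits.
            }
        }
    }

    static void slowHttpCall(String payload) { /* placeholder for the 30s call */ }
}
```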
Quality Checklist
Configuration:
- Replication: Factor 3 for production.
- Min ISR: 2 (prevents data loss).
- Retention: Configured correctly (time vs. size).
Observability:
- Lag: Consumer lag monitored (Burrow/Prometheus).
- Under-replicated: Alert on under-replicated partitions (> 0).
- JMX: Metrics exported.
Security:
- Encryption: TLS enabled for client-broker and inter-broker traffic.
- Auth: SASL/SCRAM or mTLS enabled.
- ACLs: Principle of least privilege (topic read/write).
Examples
Example 1: Real-Time Fraud Detection Pipeline
Scenario: A financial services company needs real-time fraud detection using Kafka streaming.
Architecture Implementation:
- Event Ingestion: Kafka Connect CDC from the PostgreSQL transaction database
- Stream Processing: Kafka Streams application for real-time pattern detection
- Alert System: Producer to an alert topic triggering notifications
- Storage: S3 sink for historical analysis and compliance
Pipeline Configuration:

| Component   | Configuration                      | Purpose                 |
|-------------|------------------------------------|-------------------------|
| Topics      | 3 (transactions, alerts, enriched) | Data organization       |
| Partitions  | 12 (3 brokers × 4)                 | Parallelism             |
| Replication | 3                                  | High availability       |
| Compression | LZ4                                | Throughput optimization |
Key Logic:
- Detects velocity patterns (5+ transactions in 1 minute; sketched below)
- Identifies geographic anomalies (impossible travel)
- Flags high-risk merchant categories
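A minimal Kafka Streams sketch of the velocity check, assuming a `transactions` topic keyed by card ID; topic names and the threshold are illustrative:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class VelocityDetector {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey()
            // Count transactions per card in 1-minute tumbling windows.
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
            .count()
            .toStream()
            // Velocity rule: 5+ transactions per card per minute.
            .filter((windowedCardId, count) -> count >= 5)
            .map((windowedCardId, count) ->
                KeyValue.pair(windowedCardId.key(), "VELOCITY_ALERT count=" + count))
            .to("alerts", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-velocity");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```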
Results:
- 99.7% of fraud detected in under 100ms
- False positive rate reduced from 5% to 0.3%
- Compliance audit passed with zero findings
Example 2: E-Commerce Order Processing System
Scenario: Build a resilient order processing system with Kafka for high reliability.
System Design:
- Order Events: Topic for order lifecycle events
- Inventory Service: Consumes orders, updates stock
- Payment Service: Processes payments, publishes results
- Notification Service: Sends confirmations via email/SMS
Resilience Patterns:
- Dead letter queue for failed processing
- Idempotent producers for exactly-once semantics
- Consumer groups with manual offset management (sketched after the configs below)
- Retries with exponential backoff
Configuration:
Producer Configuration
```
acks: all
retries: 3
enable.idempotence: true
```
Consumer Configuration
```
auto.offset.reset: earliest
enable.auto.commit: false
max.poll.records: 500
```
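With `enable.auto.commit: false`, offsets are committed only after successful processing. A minimal sketch, with the dead-letter topic name `orders.dlq` assumed:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "order-processing");
        props.put("enable.auto.commit", "false");
        props.put("auto.offset.reset", "earliest");
        props.put("max.poll.records", "500");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties dlqProps = new Properties();
        dlqProps.put("bootstrap.servers", "broker:9092");
        dlqProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        dlqProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> dlq = new KafkaProducer<>(dlqProps)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        processOrder(record.value());
                    } catch (Exception e) {
                        // Poison message: park it on the DLQ instead of blocking the partition.
                        dlq.send(new ProducerRecord<>("orders.dlq", record.key(), record.value()));
                    }
                }
                // Commit only after the whole batch has been handled.
                consumer.commitSync();
            }
        }
    }

    static void processOrder(String order) { /* business logic placeholder */ }
}
```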
Results:
- 99.99% message delivery reliability
- Zero duplicate orders in 6 months
- Peak processing: 10,000 orders/second
Example 3: IoT Telemetry Platform
Scenario: Process millions of IoT device telemetry messages with Kafka.
Platform Architecture:
- Device Gateway: MQTT-to-Kafka proxy
- Data Enrichment: Stream processing adds device metadata
- Time-Series Storage: S3 sink partitioned by device_id/date
- Real-Time Alerts: Threshold-based alerting for anomalies
Scalability Configuration:
- 50 partitions for parallel processing
- Compression enabled for cost optimization
- Retention: 7 days hot in Kafka, 1 year cold in S3
- Schema Registry for data contracts
Performance Metrics:

| Metric             | Value                          |
|--------------------|--------------------------------|
| Throughput         | 500,000 messages/sec           |
| Latency (P99)      | 50 ms                          |
| Consumer lag       | < 1 second                     |
| Storage efficiency | 60% reduction with compression |
Best Practices
Topic Design
- Naming Conventions: Use clear, hierarchical topic names (domain.entity.event), as in the sketch after this list
- Partition Strategy: Plan for future growth (3× expected throughput)
- Retention Policies: Match retention to business requirements
- Cleanup Policies: Use delete for time-based data, compact for state
- Schema Management: Enforce schemas via Schema Registry
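A sketch of creating a topic that follows these conventions via the `AdminClient`; the topic name and settings are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");

        // domain.entity.event naming; 12 partitions; replication factor 3.
        NewTopic topic = new NewTopic("payments.order.created", 12, (short) 3)
            .configs(Map.of(
                "cleanup.policy", "delete",      // time-based data, not compacted state
                "retention.ms", "604800000",     // 7 days
                "min.insync.replicas", "2"));

        try (AdminClient admin = AdminClient.create(props)) {
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```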
Producer Optimization
- Batching: Increase `batch.size` and `linger.ms` for throughput
- Compression: Use LZ4 for a balance of speed and size
- Acks Configuration: Use `acks=all` for reliability, `acks=1` for latency
- Retry Strategy: Implement retries with backoff
- Idempotence: Enable for exactly-once semantics in critical paths
Consumer Best Practices
- Offset Management: Use manual commit for critical processing
- Batch Processing: Increase `max.poll.records` for efficiency
- Rebalance Handling: Implement graceful shutdown
- Error Handling: Dead letter queues for poison messages
- Monitoring: Track consumer lag and processing time
Security Configuration
- Encryption: TLS for all client-broker communication
- Authentication: SASL/SCRAM or mTLS for production (client sketch below)
- Authorization: ACLs with least-privilege principle
- Quotas: Implement client quotas to prevent abuse
- Audit Logging: Log all access and configuration changes
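A sketch of the client-side properties for SASL/SCRAM over TLS; the username, passwords, and truststore path are placeholders:

```java
import java.util.Properties;

public class SecureClientConfig {
    static Properties scramOverTls() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9093");
        props.put("security.protocol", "SASL_SSL");      // TLS transport + SASL auth
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"app-user\" password=\"<secret>\";");
        // Trust the cluster's CA (path is a placeholder).
        props.put("ssl.truststore.location", "/etc/kafka/secrets/truststore.jks");
        props.put("ssl.truststore.password", "<secret>");
        return props;
    }
}
```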
Performance Tuning
- Broker Configuration: Optimize for workload type (throughput vs. latency)
- JVM Tuning: Heap size and garbage collector selection
- OS Tuning: File descriptor limits, network settings
- Monitoring: Metrics for throughput, latency, and errors
- Capacity Planning: Regular review and scaling assessment