
Stream Processing

Patterns and technologies for real-time data processing, event streaming, and stream analytics.

When to Use This Skill

  • Designing real-time data pipelines

  • Choosing stream processing frameworks

  • Implementing event-driven architectures

  • Building real-time analytics

  • Understanding streaming vs batch trade-offs

Batch vs Streaming

Comparison

Aspect Batch Streaming

Latency Minutes to hours Milliseconds to seconds

Data Bounded (finite) Unbounded (infinite)

Processing Process all at once Process as it arrives

State Recompute each run Maintain continuously

Complexity Lower Higher

Cost Often lower Often higher

When to Use Streaming

Use streaming when:

  • Real-time responses required (<1 minute)
  • Events need immediate action (fraud, alerts)
  • Data arrives continuously
  • Users expect live updates
  • Time-sensitive business decisions

Use batch when:

  • Daily/hourly reports sufficient
  • Complex transformations needed
  • Cost optimization priority
  • Historical analysis
  • One-time processing

Stream Processing Concepts

Event Time vs Processing Time

Event Time: when the event actually occurred.
Processing Time: when the event is processed by the system.

Example:

```
┌──────────────────────────────────────────┐
│ Event: Purchase at 10:00:00 (event time) │
│ Network delay: 5 seconds                 │
│ Processing: 10:00:05 (processing time)   │
└──────────────────────────────────────────┘
```

Why it matters:

  • Late events need handling
  • Ordering not guaranteed
  • Watermarks track progress

Watermarks

Watermark = "All events before this time have arrived"

Event stream: ──[10:01]──[10:02]──[10:00]──[10:03]──[Watermark: 10:00]──

Allows system to:

  • Know when window is complete
  • Handle late events
  • Balance latency vs completeness

Windows

Tumbling Window (fixed, non-overlapping):

```
|─────|─────|─────|
0     5     10    15  (seconds)
```

Sliding Window (fixed, overlapping):

```
|─────|
  |─────|
    |─────|
Size: 5s, Slide: 2s
```

Session Window (activity-based):

```
|──────|   |───────────|   |───|
```

User activity with gaps defines the windows.

Count Window: process every N events.
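
The window shapes above can be sketched as assignment functions: given an event timestamp, which window(s) does the event belong to? A minimal Python sketch, assuming integer epoch-second timestamps and half-open [start, end) windows (function names are illustrative, not any framework's API):

```python
def tumbling_window(ts, size):
    """Return the single [start, end) tumbling window containing ts."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, slide):
    """Return every [start, end) sliding window containing ts.

    Windows are aligned to multiples of `slide`, as in the diagram above.
    """
    # The earliest aligned window that still contains ts starts at the
    # first multiple of `slide` strictly greater than ts - size.
    start = ((ts - size) // slide + 1) * slide
    windows = []
    while start <= ts:
        windows.append((start, start + size))
        start += slide
    return windows

print(tumbling_window(7, 5))       # (5, 10)
print(sliding_windows(7, 5, 2))    # [(4, 9), (6, 11)]
```

With size 5 and slide 2, an event at t=7 falls into two overlapping windows, which is exactly why sliding windows emit each event more than once.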

State Management

Stateful operations require maintained state:

  • Aggregations (sum, count, avg)
  • Joins between streams
  • Pattern detection
  • Deduplication

State backends:

  • In-memory (fast, limited)
  • RocksDB (larger, persistent)
  • External (Redis, database)

Stream Processing Frameworks

Apache Kafka Streams

Characteristics:

  • Library (not a cluster)
  • Exactly-once semantics
  • Kafka-native
  • Java/Scala

Best for:

  • Kafka-centric architectures
  • Simpler transformations
  • Microservices

Example topology: source → filter → map → aggregate → sink
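
Kafka Streams itself is a Java/Scala library, but the shape of that topology can be illustrated in plain Python (the event fields and the 1000 threshold are made-up examples, not part of any real API):

```python
def run_topology(source_events):
    """source -> filter -> map -> aggregate -> sink, as a plain loop."""
    totals = {}                           # aggregate state, keyed by user
    for event in source_events:           # source
        if event["amount"] <= 1000:       # filter: keep large amounts only
            continue
        key = event["user"]               # map: extract the aggregation key
        totals[key] = totals.get(key, 0) + event["amount"]  # aggregate
    return totals                         # sink: emit the final state

events = [
    {"user": "a", "amount": 1500},
    {"user": "a", "amount": 500},     # dropped by the filter
    {"user": "b", "amount": 2000},
]
print(run_topology(events))  # {'a': 1500, 'b': 2000}
```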

Apache Flink

Characteristics:

  • Distributed cluster
  • True streaming (not micro-batch)
  • Advanced state management
  • SQL support

Best for:

  • Complex event processing
  • Large-scale streaming
  • Low-latency requirements

Example:

```java
DataStream<Event> events = env.addSource(kafkaSource);
events
    .keyBy(e -> e.getUserId())
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new CountAggregator())
    .addSink(sink);
```

Apache Spark Streaming

Characteristics:

  • Micro-batch processing
  • Unified batch + streaming API
  • Wide ecosystem
  • Python, Scala, Java, R

Best for:

  • Teams with Spark experience
  • Batch + streaming unified
  • Machine learning integration

Latency: Seconds (micro-batch)

Kafka Streams vs Flink vs Spark

| Factor | Kafka Streams | Flink | Spark Streaming |
|---|---|---|---|
| Deployment | Library | Cluster | Cluster |
| Latency | Low | Lowest | Medium |
| State | Good | Excellent | Good |
| Exactly-once | Yes | Yes | Yes |
| Complexity | Low | High | Medium |
| Scaling | With Kafka | Independent | Independent |
| SQL | Limited | Yes | Yes |
| ML integration | Limited | Limited | Excellent |

Stream Processing Patterns

Filtering

Input: all events
Output: events matching the criteria

Example: Only process events where amount > 1000

Mapping/Transformation

Input: event type A
Output: event type B

Example: Enrich order events with customer data

Aggregation

Input: multiple events
Output: a single aggregated result

Examples:

  • Count events per window
  • Sum amounts per user
  • Average latency per endpoint
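
An aggregation such as "count events per user per window" can be sketched by combining key extraction with tumbling-window assignment (the (user, timestamp) event shape is an illustrative assumption):

```python
from collections import defaultdict

def count_per_window(events, window_size):
    """Count events per (user, tumbling-window-start) pair.

    Events are (user, timestamp) tuples; window_size in seconds.
    """
    counts = defaultdict(int)
    for user, ts in events:
        window_start = (ts // window_size) * window_size
        counts[(user, window_start)] += 1
    return dict(counts)

events = [("a", 1), ("a", 3), ("a", 7), ("b", 2)]
print(count_per_window(events, 5))
# {('a', 0): 2, ('a', 5): 1, ('b', 0): 1}
```

A real streaming engine does the same thing incrementally, emitting each window's count when the watermark passes the window end instead of after reading all input.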

Join Patterns

Stream-Stream Join:

```
┌─────────────┐
│   Orders    │ ──►  ┌──────────────┐
└─────────────┘      │     Join     │
┌─────────────┐      │ (by order_id │
│  Shipments  │ ──►  │  in window)  │
└─────────────┘      └──────────────┘
```

Stream-Table Join (Enrichment):

```
┌─────────────┐
│   Events    │ ──►  ┌──────────────┐
└─────────────┘      │     Join     │
┌─────────────┐      │  (lookup by  │
│  Customer   │      │   customer)  │
│   Table     │ ──►  └──────────────┘
└─────────────┘
```

Deduplication

Problem: Duplicate events from at-least-once delivery

Solution:

  1. Track seen IDs in state (with TTL)
  2. If seen, drop
  3. If new, process and store ID

State: {event_id: timestamp}
TTL: based on the expected duplicate window
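
A minimal sketch of that dedup state, using an in-memory dict with TTL-based eviction (a production job would keep this in a state backend such as RocksDB or Redis):

```python
import time

class Deduplicator:
    """Drop events whose ID was already seen within the TTL window."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}                    # event_id -> first-seen timestamp

    def is_duplicate(self, event_id, now=None):
        now = time.time() if now is None else now
        # Evict expired IDs so the state stays bounded.
        self.seen = {eid: ts for eid, ts in self.seen.items()
                     if now - ts < self.ttl}
        if event_id in self.seen:
            return True                   # seen before: drop
        self.seen[event_id] = now         # new: record ID and process
        return False

dedup = Deduplicator(ttl_seconds=60)
assert dedup.is_duplicate("e1", now=0) is False
assert dedup.is_duplicate("e1", now=10) is True    # within TTL: duplicate
assert dedup.is_duplicate("e1", now=100) is False  # TTL expired: treated as new
```

The TTL is the trade-off knob: too short and duplicates arriving late slip through; too long and the state grows without bound.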

Event Delivery Guarantees

At-Most-Once

May lose events, never duplicates.

Commit → Process → (if failure after commit, event lost)

Use when: Loss acceptable, simplicity preferred

At-Least-Once

Never loses events, may produce duplicates.

Process → Commit → (if failure before commit, event is reprocessed)

Use when: No loss acceptable, handle duplicates downstream

Exactly-Once

Never loses, never duplicates.

Requires:

  • Idempotent operations, OR
  • Transactional processing

How it works:

  1. Read from source transactionally
  2. Process and update state
  3. Write output and commit together

Flink: checkpointing + two-phase commit
Kafka Streams: transactional producer + EOS (exactly-once semantics)
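
One way to get exactly-once *effects* without transactions is an idempotent sink, sketched here: writes are keyed upserts, so a redelivered event overwrites the same row instead of duplicating it (the dict stands in for a real keyed store):

```python
def write_idempotent(table, event):
    """Upsert keyed on the event ID: replaying an event is a no-op
    in effect, because it rewrites the same row."""
    table[event["id"]] = event["value"]

table = {}
for event in [{"id": "e1", "value": 10},
              {"id": "e1", "value": 10},   # redelivered after a crash
              {"id": "e2", "value": 20}]:
    write_idempotent(table, event)
print(table)  # {'e1': 10, 'e2': 20}
```

This turns at-least-once delivery into exactly-once results, which is why "keep processing idempotent" appears in the best practices below.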

Late Event Handling

Strategies

  1. Drop late events: simple, but may lose data

  2. Allow late events (allowed lateness): process if within the lateness threshold

  3. Side-output late events: the main stream processes on-time events; a side stream handles late events separately

  4. Reprocess historically: a batch job corrects the impact of late data
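
Strategies 2 and 3 can be sketched as a routing function that compares each event's timestamp against the current watermark (the timestamps and allowed-lateness value are illustrative):

```python
def route_events(events, watermark, allowed_lateness):
    """Split events into on-time, late-but-allowed, and side-output.

    Each event is an (id, event_time) tuple; times in seconds.
    """
    on_time, allowed_late, side_output = [], [], []
    for event_id, ts in events:
        if ts >= watermark:
            on_time.append(event_id)
        elif ts >= watermark - allowed_lateness:
            allowed_late.append(event_id)     # strategy 2: within threshold
        else:
            side_output.append(event_id)      # strategy 3: handled separately
    return on_time, allowed_late, side_output

events = [("e1", 105), ("e2", 98), ("e3", 80)]
print(route_events(events, watermark=100, allowed_lateness=5))
# (['e1'], ['e2'], ['e3'])
```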

Watermark Strategies

Bounded Out-of-Orderness: watermark = max_event_time - max_lateness

Example:

```
max_event_time = 10:00:00
max_lateness   = 5 seconds
watermark      = 09:59:55
```

Events before 09:59:55 considered complete
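
That strategy can be sketched as a small watermark generator: track the maximum event time seen and subtract the lateness bound. Timestamps below are seconds since midnight, so 36000 is 10:00:00 (the class name is illustrative, not Flink's API):

```python
class BoundedOutOfOrdernessWatermark:
    """watermark = max_event_time - max_lateness; never moves backwards,
    because max_event_time only increases."""

    def __init__(self, max_lateness):
        self.max_lateness = max_lateness
        self.max_event_time = float("-inf")

    def on_event(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        return self.max_event_time - self.max_lateness

wm = BoundedOutOfOrdernessWatermark(max_lateness=5)
assert wm.on_event(36000) == 35995   # 10:00:00 -> watermark 09:59:55
assert wm.on_event(35990) == 35995   # out-of-order event: watermark holds
```

Note the trade-off from the Watermarks section: a larger `max_lateness` admits more late events but delays window results by the same amount.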

Scalability Patterns

Partitioning

Partition by key for parallel processing:

```
┌─────────────────────────────────────────────────────┐
│ Kafka Topic (3 partitions)                          │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐     │
│ │ Partition 0 │ │ Partition 1 │ │ Partition 2 │     │
│ │ user_a, b   │ │ user_c, d   │ │ user_e, f   │     │
│ └─────────────┘ └─────────────┘ └─────────────┘     │
└─────────────────────────────────────────────────────┘
        │               │               │
        ▼               ▼               ▼
   ┌─────────┐     ┌─────────┐     ┌─────────┐
   │Worker 0 │     │Worker 1 │     │Worker 2 │
   └─────────┘     └─────────┘     └─────────┘
```
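
The key-to-partition assignment is typically a stable hash modulo the partition count. A sketch (real clients use their own hash function, e.g. Kafka's murmur2, so actual placements will differ):

```python
import hashlib

def partition_for(key, num_partitions):
    """Stable hash partitioning: the same key always lands on the same
    partition, so all of that key's state lives on one worker."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

for user in ["user_a", "user_b", "user_c"]:
    print(user, "->", "partition", partition_for(user, 3))
```

Stability is the important property: if events for `user_a` could land on different partitions, a per-user aggregation would be split across workers and produce wrong counts.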

Backpressure

When downstream can't keep up:

  1. Buffer (risk: OOM)
  2. Drop (risk: data loss)
  3. Backpressure (slow down source)

Flink: backpressure propagates automatically
Kafka: consumer lag indicates backpressure
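
Option 3 can be sketched with a bounded buffer: when the consumer falls behind, the full queue blocks the producer's `put()`, which is backpressure in miniature:

```python
import queue
import threading

# Bounded buffer between producer and consumer. A full queue blocks
# the producer, slowing the source to the consumer's pace.
buf = queue.Queue(maxsize=10)
consumed = []

def producer():
    for i in range(100):
        buf.put(i)        # blocks while the buffer is full: backpressure
    buf.put(None)         # sentinel marking end of stream

def consumer():
    while True:
        item = buf.get()
        if item is None:
            break
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(len(consumed))  # 100
```

Contrast with the other two options: an unbounded buffer risks OOM, and dropping on a full buffer risks data loss; blocking trades throughput for safety.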

Monitoring Streaming Applications

Key Metrics

Throughput:

  • Events per second
  • Bytes per second

Latency:

  • Processing latency
  • End-to-end latency

Health:

  • Consumer lag
  • Checkpoint duration
  • Backpressure rate
  • Error rate

Consumer Lag

Lag = Latest offset - Consumer offset

High lag indicates:

  • Processing too slow
  • Need more parallelism
  • Downstream bottleneck

Monitor lag continuously and set alerting thresholds.
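
The lag formula above, applied per partition, as a sketch:

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag = latest offset - committed consumer offset.

    A partition with no committed offset counts its full length as lag.
    """
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

latest = {0: 1000, 1: 800}
committed = {0: 950, 1: 780}
lag = consumer_lag(latest, committed)
print(lag)                        # {0: 50, 1: 20}
assert max(lag.values()) < 100    # example alert threshold
```

In practice these offsets come from the broker (e.g. via Kafka's consumer-group tooling); the alert threshold is an illustrative number you would tune per workload.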

Best Practices

  1. Design for exactly-once when needed
  2. Handle late events explicitly
  3. Use event time, not processing time
  4. Monitor consumer lag closely
  5. Plan for state recovery
  6. Test with realistic data volumes
  7. Implement backpressure handling
  8. Keep processing idempotent when possible

Related Skills

  • message-queues: messaging patterns

  • data-architecture: data platform design

  • etl-elt-patterns: data pipeline patterns
