Apache Beam Core Concepts

The Beam Model

The Beam model evolved from Google's MapReduce, FlumeJava, and MillWheel projects. It was originally called the "Dataflow Model."

Key Abstractions

Pipeline

A Pipeline encapsulates the entire data processing task, including reading, transforming, and writing data.

// Java
Pipeline p = Pipeline.create(options);
p.apply(...)
 .apply(...)
 .apply(...);
p.run().waitUntilFinish();

# Python
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('input.txt')
     | 'Transform' >> beam.Map(process)
     | 'Write' >> beam.io.WriteToText('output'))

PCollection

A distributed dataset that can be bounded (batch) or unbounded (streaming).

Properties

Immutable - Once created, cannot be modified
Distributed - Elements processed in parallel
May be bounded or unbounded
Timestamped - Each element has an event timestamp
Windowed - Elements assigned to windows

PTransform

A data processing operation that transforms PCollections.

// Java
PCollection<String> output = input.apply(MyTransform.create());

# Python
output = input | 'Name' >> beam.ParDo(MyDoFn())

Core Transforms

ParDo

General-purpose parallel processing.

// Java
input.apply(ParDo.of(new DoFn<String, Integer>() {
  @ProcessElement
  public void processElement(@Element String element, OutputReceiver<Integer> out) {
    out.output(element.length());
  }
}));

# Python
class LengthFn(beam.DoFn):
    def process(self, element):
        yield len(element)

input | beam.ParDo(LengthFn())

# Or simpler:
input | beam.Map(len)

GroupByKey

Groups elements by key.

PCollection<KV<String, Integer>> input = ...;
PCollection<KV<String, Iterable<Integer>>> grouped = input.apply(GroupByKey.create());

CoGroupByKey

Joins multiple PCollections by key.
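To make the result shape concrete, here is a plain-Python model of what a key-based co-group produces. It does not use the Beam API: the function name `co_group_by_key` and the sample data are illustrative assumptions. In the Python SDK the corresponding transform is `beam.CoGroupByKey()`, applied to a dict of tagged PCollections.

```python
from collections import defaultdict


def co_group_by_key(tagged_inputs):
    """Model of a co-group: for every key, collect the values from each tagged input.

    tagged_inputs maps a tag name to a list of (key, value) pairs.
    Returns {key: {tag: [values...]}}; a tag with no values for a key gets [].
    """
    tags = list(tagged_inputs)
    result = defaultdict(lambda: {tag: [] for tag in tags})
    for tag, pairs in tagged_inputs.items():
        for key, value in pairs:
            result[key][tag].append(value)
    return dict(result)


# Hypothetical sample data: two keyed collections sharing some keys.
emails = [("amy", "amy@example.com"), ("bob", "bob@example.com")]
phones = [("amy", "555-0100"), ("carl", "555-0199")]

joined = co_group_by_key({"emails": emails, "phones": phones})
for key in sorted(joined):
    print(key, joined[key])
```

Keys that appear in only one input still show up in the result, with an empty list under the other tag, which is why a co-group behaves like a full outer join.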

Combine

Combines elements (sum, mean, etc.).

// Global combine
input.apply(Combine.globally(Sum.ofIntegers()));

// Per-key combine
input.apply(Combine.perKey(Sum.ofIntegers()));
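Sums combine directly, but a mean needs an intermediate accumulator, because averaging partial averages gives the wrong answer. The sketch below models this in plain Python using the four-method shape of Beam's `CombineFn` (`create_accumulator`, `add_input`, `merge_accumulators`, `extract_output`). The class is a standalone illustration that does not subclass `beam.CombineFn`, and the split into two "bundles" is an assumption about how a runner might distribute the work.

```python
class MeanFn:
    """Standalone model of a mean combiner with a (sum, count) accumulator."""

    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        # Partial results from different bundles are merged element-wise.
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def combine_bundle(fn, values):
    """Fold one bundle of values into a single accumulator."""
    acc = fn.create_accumulator()
    for v in values:
        acc = fn.add_input(acc, v)
    return acc


fn = MeanFn()
# Two bundles, as if they had been processed on different workers.
partials = [combine_bundle(fn, [1, 2, 3]), combine_bundle(fn, [10])]
mean = fn.extract_output(fn.merge_accumulators(partials))
print(mean)  # 4.0
```

Averaging the two per-bundle means (2.0 and 10.0) would give 6.0; carrying the count in the accumulator is what keeps the merged result correct.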

Flatten

Merges multiple PCollections.

PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3);
PCollection<String> merged = collections.apply(Flatten.pCollections());

Partition

Splits a PCollection into multiple PCollections.
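Partitioning is driven by a function that maps each element to a partition index. A plain-Python model of that contract follows; `partition` and `by_score_band` are hypothetical names, and the score data is made up. The Python SDK exposes the same idea as `beam.Partition(fn, n)`, where `fn(element, num_partitions)` returns an index in the range 0 to n-1.

```python
def partition(elements, partition_fn, num_partitions):
    """Model of a partition: route each element to exactly one output list."""
    parts = [[] for _ in range(num_partitions)]
    for element in elements:
        index = partition_fn(element, num_partitions)
        if not 0 <= index < num_partitions:
            raise ValueError(f"partition index {index} out of range")
        parts[index].append(element)
    return parts


def by_score_band(score, num_partitions):
    # Scores 0-100 split into equal-width bands; 100 falls into the last band.
    return min(score * num_partitions // 100, num_partitions - 1)


low, mid, high = partition([5, 40, 72, 100, 33], by_score_band, 3)
print(low, mid, high)  # [5, 33] [40] [72, 100]
```

The number of partitions is fixed when the pipeline is built, so the function can choose among outputs but cannot create new ones at runtime.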

Windowing

Types

Fixed Windows - Regular, non-overlapping intervals
Sliding Windows - Overlapping intervals
Session Windows - Gaps of inactivity define boundaries
Global Window - All elements in one window (default)

input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))));

input | beam.WindowInto(beam.window.FixedWindows(300))
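The window types above differ only in how an element's timestamp maps to one or more intervals. The following plain-Python sketch models that assignment. It is not Beam code: the function names are hypothetical, timestamps are assumed to be integer seconds, windows are half-open `[start, end)`, and the window offset is assumed to be zero.

```python
def fixed_window(timestamp, size):
    """Return the single [start, end) fixed window containing timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every [start, end) sliding window containing timestamp."""
    last_start = timestamp - (timestamp % period)
    starts = range(last_start, timestamp - size, -period)
    return [(s, s + size) for s in starts]


def session_windows(timestamps, gap):
    """Merge timestamps into sessions separated by at least `gap` of inactivity."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t < sessions[-1][1]:
            # Arrived before the current session timed out: extend it.
            sessions[-1] = (sessions[-1][0], t + gap)
        else:
            sessions.append((t, t + gap))
    return sessions


print(fixed_window(725, 300))                  # (600, 900)
print(sliding_windows(125, 60, 30))            # [(120, 180), (90, 150)]
print(session_windows([0, 20, 100, 110], 30))  # [(0, 50), (100, 140)]
```

A fixed window assigns each element to exactly one interval, a sliding window assigns it to `size / period` intervals, and session boundaries depend on the data itself rather than on the clock.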

Triggers

Control when results are emitted.

input.apply(Window.<T>into(FixedWindows.of(Duration.standardMinutes(5)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1))))
    .withAllowedLateness(Duration.standardHours(1))
    .accumulatingFiredPanes());
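When a trigger fires more than once for the same window, the accumulation mode decides what each pane contains. The sketch below is a simplified plain-Python model of that choice, not Beam code: `fire_panes`, the arrival times, and the firing times are all illustrative assumptions, and real triggers also depend on the watermark. In the Java SDK the two modes correspond to `accumulatingFiredPanes()` and `discardingFiredPanes()`.

```python
def fire_panes(arrivals, firing_points, accumulating):
    """Model the panes emitted for one window.

    arrivals: list of (processing_time, value) pairs.
    firing_points: processing times at which the trigger fires.
    accumulating: if True each pane holds everything seen so far,
        otherwise only the values that arrived since the previous firing.
    """
    panes = []
    seen = []
    pending = sorted(arrivals)
    for fire_at in sorted(firing_points):
        new = [v for t, v in pending if t <= fire_at]
        pending = [(t, v) for t, v in pending if t > fire_at]
        seen.extend(new)
        panes.append(list(seen) if accumulating else new)
    return panes


arrivals = [(1, "a"), (2, "b"), (5, "c"), (9, "d")]
firings = [3, 6, 10]

print(fire_panes(arrivals, firings, accumulating=True))
# [['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]
print(fire_panes(arrivals, firings, accumulating=False))
# [['a', 'b'], ['c'], ['d']]
```

Accumulating panes suit sinks that overwrite a previous result, while discarding panes suit downstream steps that sum the panes themselves.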

Side Inputs

Additional inputs to ParDo.

PCollectionView<Map<String, String>> sideInput = lookupTable.apply(View.asMap());

mainInput.apply(ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    Map<String, String> lookup = c.sideInput(sideInput);
    // Use lookup...
  }
}).withSideInputs(sideInput));

Pipeline Options

Configure pipeline execution.

public interface MyOptions extends PipelineOptions {
  @Description("Input file")
  @Validation.Required
  String getInput();
  void setInput(String value);
}

// withValidation() checks required options when the object is created
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);

Schema

Strongly-typed access to structured data.

@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class User {
  public abstract String getName();
  public abstract int getAge();
}

PCollection<User> users = ...;
PCollection<Row> rows = users.apply(Convert.toRows());

Error Handling

Dead Letter Queue Pattern

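// Anonymous subclasses preserve the element type; explicit type arguments also compile on Java 8
TupleTag<String> successTag = new TupleTag<String>() {};
TupleTag<String> failureTag = new TupleTag<String>() {};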

PCollectionTuple results = input.apply(ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    try {
      c.output(process(c.element()));
    } catch (Exception e) {
      c.output(failureTag, c.element());
    }
  }
}).withOutputTags(successTag, TupleTagList.of(failureTag)));

results.get(successTag).apply(WriteToSuccess());
results.get(failureTag).apply(WriteToDeadLetter());

Cross-Language Pipelines

Use transforms from other SDKs.

# Use Java Kafka connector from Python
from apache_beam.io.kafka import ReadFromKafka

result = pipeline | ReadFromKafka(
    consumer_config={'bootstrap.servers': 'localhost:9092'},
    topics=['my-topic'],
)

Best Practices

Prefer built-in transforms over custom DoFns
Use schemas for type-safe operations
Minimize side inputs for performance
Handle late data explicitly
Test with DirectRunner before deploying
Use TestPipeline for unit tests
