beam-concepts

Apache Beam Core Concepts


Install skill "beam-concepts" with this command: npx skills add apache/beam/apache-beam-beam-concepts


The Beam Model

The Beam model evolved from Google's MapReduce, FlumeJava, and MillWheel projects; it was originally called the "Dataflow Model."

Key Abstractions

Pipeline

A Pipeline encapsulates the entire data processing task, including reading, transforming, and writing data.

```java
// Java
Pipeline p = Pipeline.create(options);
p.apply(...)
 .apply(...)
 .apply(...);
p.run().waitUntilFinish();
```

Python

```python
with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('input.txt')
     | 'Transform' >> beam.Map(process)
     | 'Write' >> beam.io.WriteToText('output'))
```

PCollection

A distributed dataset that can be bounded (batch) or unbounded (streaming).

Properties

  • Immutable - Once created, cannot be modified

  • Distributed - Elements processed in parallel

  • Bounded or unbounded - Finite (batch) or infinite (streaming)

  • Timestamped - Each element has an event timestamp

  • Windowed - Elements assigned to windows

PTransform

A data processing operation that transforms PCollections.

```java
// Java
PCollection<String> output = input.apply(MyTransform.create());
```

Python

output = input | 'Name' >> beam.ParDo(MyDoFn())

Core Transforms

ParDo

General-purpose parallel processing.

```java
// Java
input.apply(ParDo.of(new DoFn<String, Integer>() {
  @ProcessElement
  public void processElement(@Element String element, OutputReceiver<Integer> out) {
    out.output(element.length());
  }
}));
```

Python

```python
class LengthFn(beam.DoFn):
    def process(self, element):
        yield len(element)
```

input | beam.ParDo(LengthFn())

Or simpler:

input | beam.Map(len)

GroupByKey

Groups elements by key.

```java
PCollection<KV<String, Integer>> input = ...;
PCollection<KV<String, Iterable<Integer>>> grouped =
    input.apply(GroupByKey.create());
```

CoGroupByKey

Joins multiple PCollections by key.
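CoGroupByKey's semantics can be sketched in plain Python, without the Beam SDK: every key seen in any input maps to one tuple of iterables, one per input. (`co_group_by_key` below is an illustrative helper, not a Beam API.)

```python
from collections import defaultdict

def co_group_by_key(*keyed_collections):
    """Sketch of CoGroupByKey semantics: for every key seen in any input,
    emit key -> (values_from_input_0, values_from_input_1, ...)."""
    grouped = defaultdict(lambda: tuple([] for _ in keyed_collections))
    for i, collection in enumerate(keyed_collections):
        for key, value in collection:
            grouped[key][i].append(value)
    return dict(grouped)

emails = [('alice', 'a@x.com'), ('bob', 'b@x.com')]
phones = [('alice', '555-0100')]
result = co_group_by_key(emails, phones)
# result['alice'] == (['a@x.com'], ['555-0100'])
# result['bob']   == (['b@x.com'], [])
```

Keys present in only some inputs still appear in the output, paired with empty iterables for the missing inputs.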

Combine

Combines elements (sum, mean, etc.).

```java
// Global combine
input.apply(Combine.globally(Sum.ofIntegers()));

// Per-key combine
input.apply(Combine.perKey(Sum.ofIntegers()));
```
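Custom combiners in the Python SDK subclass beam.CombineFn, which has a four-method lifecycle the runner can distribute across workers. The sketch below mimics that lifecycle in plain Python (this `MeanFn` does not touch the SDK) and simulates two workers producing partial accumulators that are then merged:

```python
class MeanFn:
    """Plain-Python sketch of the CombineFn lifecycle (the real class
    subclasses beam.CombineFn). Accumulators may be built on different
    workers and merged later, so each method must be order-independent."""

    def create_accumulator(self):
        return (0.0, 0)  # (running sum, count)

    def add_input(self, accumulator, element):
        total, count = accumulator
        return total + element, count + 1

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float('nan')

# Simulate two workers each combining one shard, then merging.
fn = MeanFn()
accs = []
for shard in [[1, 2, 3], [4, 5]]:
    acc = fn.create_accumulator()
    for x in shard:
        acc = fn.add_input(acc, x)
    accs.append(acc)
mean = fn.extract_output(fn.merge_accumulators(accs))  # 3.0
```

Keeping the accumulator small (here a sum and a count, rather than the full list of elements) is what makes a Combine cheaper than a GroupByKey followed by a ParDo.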

Flatten

Merges multiple PCollections.

```java
PCollectionList<String> collections =
    PCollectionList.of(pc1).and(pc2).and(pc3);
PCollection<String> merged = collections.apply(Flatten.pCollections());
```

Partition

Splits a PCollection into multiple PCollections.
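Beam's Partition takes a function from (element, num_partitions) to a partition index; each element is routed to exactly one of a fixed number of outputs. The routing can be sketched in plain Python (`partition` here is an illustrative helper, not the Beam transform):

```python
def partition(elements, num_partitions, partition_fn):
    """Sketch of Partition semantics: route each element to exactly one
    of num_partitions output collections via partition_fn(element, n)."""
    outputs = [[] for _ in range(num_partitions)]
    for element in elements:
        outputs[partition_fn(element, num_partitions)].append(element)
    return outputs

# Split numbers into "small" and "large" buckets by value.
small, large = partition([1, 50, 7, 99], 2, lambda x, n: 0 if x < 10 else 1)
# small == [1, 7], large == [50, 99]
```

Unlike Flatten, the number of partitions must be known at pipeline construction time, because each output is a separate PCollection.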

Windowing

Types

  • Fixed Windows - Regular, non-overlapping intervals

  • Sliding Windows - Overlapping intervals

  • Session Windows - Gaps of inactivity define boundaries

  • Global Window - All elements in one window (default)

```java
// Java
input.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))));
```

```python
# Python -- window size is in seconds (300 s = 5 min)
input | beam.WindowInto(beam.window.FixedWindows(300))
```
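The assignment rules for the window types above can be sketched in plain Python (illustrative helpers, not Beam APIs; timestamps and durations are in seconds):

```python
def fixed_window(ts, size):
    """Fixed windows: [k*size, (k+1)*size); each timestamp lands in one."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Sliding windows of length `size` starting every `period`; a
    timestamp falls into roughly size/period overlapping windows."""
    windows = []
    start = (ts // period) * period
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

def session_windows(timestamps, gap):
    """Session windows: timestamps separated by less than `gap` merge
    into one (start, end) session; a quiet gap starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1] = (sessions[-1][0], ts)
        else:
            sessions.append((ts, ts))
    return sessions

fixed_window(65, 60)            # (60, 120)
sliding_windows(65, 60, 30)     # [(30, 90), (60, 120)]
session_windows([1, 2, 10], 5)  # [(1, 2), (10, 10)]
```

Fixed and sliding windows depend only on the element's timestamp, so they can be assigned independently per element; session windows are data-driven and require merging, which is why Beam treats them specially.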

Triggers

Control when results are emitted.

```java
input.apply(
    Window.<T>into(FixedWindows.of(Duration.standardMinutes(5)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))))
        .withAllowedLateness(Duration.standardHours(1))
        .accumulatingFiredPanes());
```
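When a trigger fires more than once per window, accumulating mode re-emits everything seen so far while discarding mode emits only what arrived since the last firing. A plain-Python sketch of that difference (`fire_panes` is an illustrative helper, not a Beam API):

```python
def fire_panes(elements_by_firing, accumulating):
    """Sketch of pane accumulation: each inner list is the batch of
    elements that arrived before one trigger firing."""
    panes, seen = [], []
    for batch in elements_by_firing:
        seen.extend(batch)
        panes.append(list(seen) if accumulating else list(batch))
    return panes

fire_panes([[1, 2], [3]], accumulating=True)   # [[1, 2], [1, 2, 3]]
fire_panes([[1, 2], [3]], accumulating=False)  # [[1, 2], [3]]
```

Accumulating panes suit downstream sinks that overwrite previous results; discarding panes suit sinks that sum or append, since re-emitted elements would otherwise be double-counted.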

Side Inputs

Additional inputs to ParDo.

PCollectionView<Map<String, String>> sideInput = lookupTable.apply(View.asMap());

```java
mainInput.apply(ParDo.of(new DoFn<String, String>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    Map<String, String> lookup = c.sideInput(sideInput);
    // Use lookup...
  }
}).withSideInputs(sideInput));
```
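Conceptually, a side input is a small, read-only value computed once and broadcast to every parallel invocation of the DoFn. A plain-Python sketch of that access pattern (`enrich` is an illustrative helper, not a Beam API):

```python
def enrich(main_input, lookup):
    """Sketch of a side input: `lookup` is materialized once and made
    available, read-only, to every per-element invocation."""
    return [f"{word}:{lookup.get(word, 'unknown')}" for word in main_input]

lookup_table = {'beam': 'dataflow'}  # the materialized side input
enriched = enrich(['beam', 'spark'], lookup_table)
# enriched == ['beam:dataflow', 'spark:unknown']
```

Because every worker receives the whole side input, it must fit in memory; large lookups are usually better expressed as a CoGroupByKey join.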

Pipeline Options

Configure pipeline execution.

```java
public interface MyOptions extends PipelineOptions {
  @Description("Input file")
  @Validation.Required
  String getInput();
  void setInput(String value);
}

MyOptions options = PipelineOptionsFactory.fromArgs(args).as(MyOptions.class);
```

Schema

Strongly-typed access to structured data.

```java
@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class User {
  public abstract String getName();
  public abstract int getAge();
}
```

```java
PCollection<User> users = ...;
PCollection<Row> rows = users.apply(Convert.toRows());
```
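In the Python SDK, a schema can be inferred from a typing.NamedTuple. A minimal sketch of the typed structure itself (the registration step mentioned in the comment is an assumed setup detail; check the Beam schema docs for your SDK version):

```python
from typing import NamedTuple

class User(NamedTuple):
    name: str
    age: int

# Beam's Python SDK can infer a schema from a NamedTuple like this
# (typically registered with beam.coders.RowCoder); here we only show
# the typed, field-addressable structure.
u = User(name='Ada', age=36)
# u.name == 'Ada', u.age == 36
```

Schema-aware PCollections let you refer to fields by name in transforms such as Select and Group, instead of writing positional lambdas.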

Error Handling

Dead Letter Queue Pattern

```java
TupleTag<String> successTag = new TupleTag<String>() {};
TupleTag<String> failureTag = new TupleTag<String>() {};

PCollectionTuple results = input.apply(
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        try {
          c.output(process(c.element()));
        } catch (Exception e) {
          c.output(failureTag, c.element());
        }
      }
    }).withOutputTags(successTag, TupleTagList.of(failureTag)));

results.get(successTag).apply(WriteToSuccess());
results.get(failureTag).apply(WriteToDeadLetter());
```
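Stripped of the Beam machinery, the routing logic of the pattern is simple: catch per-element failures and divert the offending element instead of crashing the pipeline. A plain-Python sketch (`process_with_dead_letter` is an illustrative helper, not a Beam API):

```python
def process_with_dead_letter(elements, process):
    """Sketch of the dead-letter pattern: failed elements are captured
    alongside the successes rather than failing the whole job."""
    successes, failures = [], []
    for element in elements:
        try:
            successes.append(process(element))
        except Exception:
            failures.append(element)
    return successes, failures

ok, dead = process_with_dead_letter(['1', 'x', '3'], int)
# ok == [1, 3], dead == ['x']
```

In production the failure branch usually carries the exception message and a timestamp along with the raw element, so the dead-letter sink is debuggable.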

Cross-Language Pipelines

Use transforms from other SDKs.

Use Java Kafka connector from Python

```python
from apache_beam.io.kafka import ReadFromKafka

result = pipeline | ReadFromKafka(
    consumer_config={'bootstrap.servers': 'localhost:9092'},
    topics=['my-topic'])
```

Best Practices

  • Prefer built-in transforms over custom DoFns

  • Use schemas for type-safe operations

  • Minimize side inputs for performance

  • Handle late data explicitly

  • Test with DirectRunner before deploying

  • Use TestPipeline for unit tests

