# Data Engineering Best Practices
Use this skill for production architecture and standards decisions: storage layout, lifecycle, incremental semantics, schema evolution, quality checks, and cost/performance tradeoffs.
## When to use this skill

Use for:

- Designing Bronze/Silver/Gold or equivalent data layers
- Choosing append vs. overwrite vs. merge behavior
- Partitioning and file-size strategy
- Defining schema evolution policy
- Setting testing/observability guardrails
- Establishing retention and cost controls
Use domain skills for implementation details:

- @data-engineering-core
- @data-engineering-storage-lakehouse
- @data-engineering-storage-formats
- @data-engineering-storage-remote-access
- @data-engineering-storage-authentication
## Decision checklist (apply in order)

1. Data contract: required columns/types? Nullability and key uniqueness?
2. Layer semantics: Bronze immutable? Silver deduplicated/validated? Gold business-ready aggregates?
3. Write mode: append, partition overwrite, or merge?
4. Layout: partition keys and target file size set?
5. Incremental logic: watermark/checkpoint strategy defined?
6. Evolution policy: additive-only by default?
7. Operational controls: tests, observability, retention, and backfill process?
## Core standards

### Layering (Medallion)

- Bronze: raw, immutable ingestion; append-only
- Silver: cleaned, validated, conformed schema
- Gold: consumption-specific marts/features/aggregates

Do not skip Silver validation for convenience; silent quality drift is costly.
### Write semantics

| Operation | Use when | Notes |
| --- | --- | --- |
| Append | Strictly new immutable events | Simplest, cheapest |
| Partition overwrite | Deterministic reprocessing of a date/key slice | Safe for backfills |
| Merge/Upsert | Corrections, late updates, deletes | Needs a key plus conflict semantics |
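A minimal sketch of the three modes, assuming PySpark with Delta Lake (delta-spark) configured; the path, column names, and merge key are illustrative, and `replaceWhere` semantics vary by Delta version:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
events_path = "/tmp/lake/bronze/events"  # hypothetical location

batch = spark.createDataFrame(
    [Row(event_id=1, event_date="2024-01-15", value=10.0)]
)

# Append: strictly new immutable events (simplest, cheapest).
batch.write.format("delta").mode("append").save(events_path)

# Partition overwrite: deterministically replace one date slice; safe for backfills.
(batch.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "event_date = '2024-01-15'")
    .save(events_path))

# Merge/upsert: needs an explicit key and conflict semantics.
batch.createOrReplaceTempView("updates")
spark.sql(f"""
    MERGE INTO delta.`{events_path}` AS t
    USING updates AS s
      ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```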
### Partitioning

Good partition keys:

- Frequently filtered dimensions (often date plus a low/moderate-cardinality dimension)

Avoid:

- High-cardinality keys (e.g., user_id)
- Over-partitioning that creates tiny files
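As a sketch with Spark's `DataFrameWriter` (path and column names are assumptions, not fixed conventions):

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

events_df = spark.createDataFrame(
    [Row(event_date="2024-01-15", region="eu", user_id=42, value=1.0)]
)

# Good: date plus a bounded-cardinality dimension that queries filter on.
(events_df.write.format("delta")
    .partitionBy("event_date", "region")
    .mode("append")
    .save("/tmp/lake/silver/events"))

# Avoid: .partitionBy("user_id") -- one tiny directory per user.
```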
### File sizing

Target file size: ~256 MB–1 GB (workload-dependent).

- Too small → metadata/listing overhead and slow scans
- Too large → poor parallelism and skewed processing
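Two common levers, sketched for Spark plus Delta Lake; the row cap and table path are example values, and `OPTIMIZE` requires a Delta table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cap rows per output file so a single large write cannot emit huge files.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5_000_000)

# Periodically compact accumulated small files into larger ones (Delta Lake).
spark.sql("OPTIMIZE delta.`/tmp/lake/silver/events`")
```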
### Schema evolution

Default policy:

- ✅ Additive changes first (new nullable columns)
- ⚠️ Type widening only when compatibility is clear
- ❌ No destructive in-place rename/drop on shared production tables
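With Delta Lake, an additive change can be sketched as an append with `mergeSchema` (column names and path are illustrative); renames and drops go through a new table or an explicit migration instead:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (DoubleType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# Batch carrying a new, nullable column not yet in the table schema.
schema = StructType([
    StructField("event_id", LongType()),
    StructField("event_date", StringType()),
    StructField("value", DoubleType()),
    StructField("source", StringType()),  # the new nullable column
])
batch = spark.createDataFrame([(2, "2024-01-16", 3.0, None)], schema)

(batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # permits additive evolution only
    .save("/tmp/lake/bronze/events"))
```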
### Incremental processing

- Persist the watermark/checkpoint externally
- Make re-runs idempotent
- Include a late-arriving-data strategy (lag window/backfill)
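A minimal watermark pattern in plain Python; where the state lives (file, table, metadata store) is an implementation choice, and the paths and lag window here are placeholders:

```python
import json
from datetime import datetime, timedelta
from pathlib import Path

STATE = Path("/tmp/lake/state/orders_watermark.json")  # hypothetical location
LAG = timedelta(days=2)  # re-read a lag window to catch late-arriving rows

def read_watermark() -> datetime:
    if STATE.exists():
        return datetime.fromisoformat(json.loads(STATE.read_text())["watermark"])
    return datetime(1970, 1, 1)  # first run processes everything

def write_watermark(value: datetime) -> None:
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps({"watermark": value.isoformat()}))

# Each run reads rows with ts >= read_watermark() - LAG, writes them via
# partition overwrite or merge so re-runs stay idempotent, and advances the
# watermark only after the write commits.
since = read_watermark() - LAG
```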
### Quality and reliability

Minimum controls:

- Required columns and types
- Primary-key uniqueness (or an explicit dedupe policy)
- Null thresholds on critical fields
- Freshness/SLA checks
- Run-level metrics (rows in/out, failures, latency)
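A pandas sketch of these controls; the thresholds, dtypes, and metrics shape are example choices, not fixed standards:

```python
import pandas as pd

def check_batch(df: pd.DataFrame, required: dict[str, str],
                key: str, max_null_frac: float = 0.01) -> dict:
    # Required columns + types.
    for col, dtype in required.items():
        assert col in df.columns, f"missing column: {col}"
        assert df[col].dtype == dtype, f"{col}: expected {dtype}, got {df[col].dtype}"
    # Primary-key uniqueness.
    assert not df[key].duplicated().any(), f"duplicate values in key {key}"
    # Null thresholds on critical fields.
    for col in required:
        frac = df[col].isna().mean()
        assert frac <= max_null_frac, f"{col}: null fraction {frac:.2%}"
    # Run-level metrics to emit alongside freshness/latency checks.
    return {"rows_in": len(df)}

metrics = check_batch(
    pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 12.0]}),
    required={"order_id": "int64", "amount": "float64"},
    key="order_id",
)
```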
## Anti-patterns (reject in review)

- Full-table overwrite for small incremental changes
- No checkpoint/watermark for a recurring pipeline
- Unbounded tiny-file generation
- Dynamic SQL built from user values without parameter binding (see the sketch after this list)
- Production credentials committed to the repo in code or config
- No backfill plan and no rollback strategy
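The parameter-binding point deserves a concrete contrast; a DuckDB sketch (any driver with placeholders works the same way):

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")

hostile = "alice'; DROP TABLE users; --"  # hostile user input

# Bad: string-built SQL lets the value rewrite the statement.
# con.execute(f"SELECT * FROM users WHERE name = '{hostile}'")

# Good: the driver binds the value; it can never become SQL.
rows = con.execute("SELECT * FROM users WHERE name = ?", [hostile]).fetchall()
```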
## Minimal production blueprint

1. Ingest raw data to Bronze (append-only)
2. Validate and standardize into Silver
3. Build Gold outputs
4. Emit metrics and a quality report
5. Persist the checkpoint/watermark
6. Apply lifecycle rules plus periodic compaction/maintenance
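A skeleton tying the six steps together; every function here is a stub standing in for real logic, included only to make the ordering and the metrics/checkpoint placement concrete:

```python
def ingest_to_bronze(run_date: str) -> list[dict]:      # 1. append-only raw
    return [{"id": 1, "ts": run_date, "value": 10.0}]

def validate_to_silver(rows: list[dict]) -> tuple[list[dict], dict]:  # 2.
    clean = [r for r in rows if r["value"] is not None]
    return clean, {"rows_in": len(rows), "rows_out": len(clean)}

def build_gold(rows: list[dict]) -> dict:               # 3. business-ready
    return {"total_value": sum(r["value"] for r in rows)}

def run_pipeline(run_date: str) -> None:
    rows = ingest_to_bronze(run_date)
    clean, report = validate_to_silver(rows)
    gold = build_gold(clean)
    print({"run_date": run_date, **report, "gold": gold})  # 4. emit metrics
    # 5. persist the checkpoint/watermark only after writes commit
    # 6. retention + compaction run on a separate maintenance schedule

run_pipeline("2024-01-15")
```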
## Progressive disclosure (read next as needed)

- best-practices-detailed.md — comprehensive deep-dive examples
- @data-engineering-core/patterns/incremental.md — incremental loading patterns
- @data-engineering-storage-lakehouse — Delta/Iceberg/Hudi-specific behavior
- @data-engineering-quality — validation framework implementation
- @data-engineering-observability — metrics/tracing/alerting
## References

- Delta Lake Schema Evolution
- Apache Iceberg Evolution
- DuckDB MERGE INTO
- PyArrow Dataset API