
Data Engineering Best Practices

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "data-engineering-best-practices" with this command:

```shell
npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-best-practices
```


Use this skill for production architecture and standards decisions: storage layout, lifecycle, incremental semantics, schema evolution, quality checks, and cost/performance tradeoffs.

When to use this skill

Use for:

  • Designing Bronze/Silver/Gold or equivalent data layers

  • Choosing append vs overwrite vs merge behavior

  • Partitioning and file-size strategy

  • Defining schema evolution policy

  • Setting testing/observability guardrails

  • Establishing retention + cost controls

Use domain skills for implementation details:

  • @data-engineering-core

  • @data-engineering-storage-lakehouse

  • @data-engineering-storage-formats

  • @data-engineering-storage-remote-access

  • @data-engineering-storage-authentication

Decision checklist (apply in order)

  • Data contract

  • Required columns/types?

  • Nullability and key uniqueness?

  • Layer semantics

  • Bronze immutable?

  • Silver deduplicated/validated?

  • Gold business-ready aggregates?

  • Write mode

  • Append, partition overwrite, or merge?

  • Layout

  • Partition keys + target file size set?

  • Incremental logic

  • Watermark/checkpoint strategy defined?

  • Evolution policy

  • Additive-only by default?

  • Operational controls

  • Tests + observability + retention + backfill process?

Core standards

  1. Layering (Medallion)
  • Bronze: raw immutable ingestion; append-only

  • Silver: cleaned, validated, conformed schema

  • Gold: consumption-specific marts/features/aggregates

Do not skip Silver validation for convenience; silent quality drift is costly.

  2. Write semantics

| Operation | Use when | Notes |
| --- | --- | --- |
| Append | strictly new immutable events | simplest, cheapest |
| Partition overwrite | deterministic reprocessing of a date/key slice | safe for backfills |
| Merge/Upsert | corrections, late updates, deletes | needs key + conflict semantics |
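The "needs key + conflict semantics" requirement for merge/upsert can be sketched with SQLite's `ON CONFLICT` clause as a stand-in for a lakehouse MERGE; the table and column names here are illustrative, not from any real pipeline:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE silver_orders (order_id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
)

def upsert(rows):
    # Key + conflict semantics: the primary key decides insert vs update,
    # and the WHERE guard refuses to apply stale late-arriving records.
    conn.executemany(
        """
        INSERT INTO silver_orders (order_id, status, updated_at)
        VALUES (?, ?, ?)
        ON CONFLICT(order_id) DO UPDATE SET
            status = excluded.status,
            updated_at = excluded.updated_at
        WHERE excluded.updated_at > silver_orders.updated_at
        """,
        rows,
    )
    conn.commit()

upsert([("o1", "created", "2024-01-01")])
upsert([("o1", "shipped", "2024-01-02"),   # newer correction: applied
        ("o1", "created", "2023-12-31")])  # stale late arrival: ignored
print(conn.execute("SELECT status FROM silver_orders WHERE order_id='o1'").fetchone()[0])  # shipped
```

The same shape applies to Delta/Iceberg MERGE INTO: without an explicit key and a deterministic conflict rule, re-running the merge is not safe.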

  3. Partitioning

Good partition keys:

  • Frequently filtered dimensions (often date + low/moderate-cardinality dimension)

Avoid:

  • High-cardinality keys (e.g., user_id)

  • Over-partitioning creating tiny files
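A hive-style layout following these rules might look like the sketch below; the bucket, dataset, and key names are illustrative:

```python
from datetime import date

def partition_path(base: str, event_date: date, region: str) -> str:
    # Hive-style layout keyed on frequently filtered, low-cardinality
    # dimensions (date + region) -- never user_id or similar.
    return f"{base}/event_date={event_date.isoformat()}/region={region}"

print(partition_path("s3://lake/silver/orders", date(2024, 1, 15), "eu"))
# s3://lake/silver/orders/event_date=2024-01-15/region=eu
```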

  4. File sizing

Target file size: ~256MB–1GB (workload-dependent).

  • Too small → metadata/listing overhead + slow scans

  • Too large → poor parallelism and skewed processing
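One way to apply the target is to derive a file count from partition size before writing or compacting; a minimal sketch, assuming a mid-band 512 MB target:

```python
def target_file_count(total_bytes: int, target_file_bytes: int = 512 * 1024**2) -> int:
    # Aim for the middle of the ~256MB-1GB band; always write at least one file.
    return max(1, round(total_bytes / target_file_bytes))

# A 40 GB partition at a 512 MB target -> 80 files
print(target_file_count(40 * 1024**3))  # 80
```

Compaction jobs typically feed this number into the writer (e.g. as a repartition count) rather than letting file count track input fragmentation.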

  5. Schema evolution

Default policy:

  • ✅ additive changes first (new nullable columns)

  • ⚠️ type widening only when compatibility is clear

  • ❌ destructive rename/drop in-place for shared production tables
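This policy can be enforced mechanically in review or CI. The sketch below classifies a proposed change against the additive-first rules; the widening table is illustrative and should follow your engine's actual compatibility matrix:

```python
# Illustrative widening pairs; extend per your engine's compatibility rules.
WIDENINGS = {("int32", "int64"), ("float32", "float64"), ("int32", "float64")}

def classify_change(old: dict, new: dict) -> str:
    """Classify a schema change. old/new map column name -> type name."""
    if any(col not in new for col in old):
        return "destructive"              # in-place drop/rename: rejected
    widened = False
    for col, typ in new.items():
        if col in old and old[col] != typ:
            if (old[col], typ) in WIDENINGS:
                widened = True            # compatible type widening
            else:
                return "destructive"      # narrowing or incompatible retype
    if widened:
        return "widening"
    if any(col not in old for col in new):
        return "additive"                 # new (nullable) columns only
    return "unchanged"

print(classify_change({"id": "int32"}, {"id": "int32", "note": "string"}))  # additive
```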

  6. Incremental processing
  • Persist watermark/checkpoint externally

  • Make re-runs idempotent

  • Include late-arriving data strategy (lag window/backfill)
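The three rules above fit together as in this minimal sketch (a JSON file stands in for an external checkpoint store; in production that would be object storage or a metadata table):

```python
import json
import tempfile
from pathlib import Path

def load_watermark(path: Path) -> str:
    return json.loads(path.read_text())["watermark"] if path.exists() else "1970-01-01T00:00:00"

def run_increment(path: Path, events: list[dict]) -> list[dict]:
    # Idempotent: re-running with the same events selects nothing new,
    # because the watermark is persisted only after a successful run.
    # A real pipeline would also subtract a lag window from `wm`
    # (or backfill) to catch late-arriving data.
    wm = load_watermark(path)
    batch = [e for e in events if e["ts"] > wm]
    if batch:
        path.write_text(json.dumps({"watermark": max(e["ts"] for e in batch)}))
    return batch

ckpt = Path(tempfile.mkdtemp()) / "checkpoint.json"
events = [{"ts": "2024-01-01T10:00:00"}, {"ts": "2024-01-01T11:00:00"}]
print(len(run_increment(ckpt, events)))  # 2
print(len(run_increment(ckpt, events)))  # 0 (idempotent re-run)
```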

  7. Quality and reliability

Minimum controls:

  • Required columns + types

  • Primary key uniqueness (or dedupe policy)

  • Null thresholds on critical fields

  • Freshness/SLA checks

  • Run-level metrics (rows in/out, failures, latency)
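A minimal sketch of the first three controls, operating on plain dict rows (a real implementation would sit in a validation framework and run per batch):

```python
def quality_report(rows, required, key, null_threshold=0.0):
    # Minimum controls: required columns present, key uniqueness,
    # and a null-rate threshold on critical fields.
    issues = []
    for col in required:
        if any(col not in r for r in rows):
            issues.append(f"missing column: {col}")
    keys = [r.get(key) for r in rows]
    if len(keys) != len(set(keys)):
        issues.append(f"duplicate keys on {key}")
    for col in required:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if rows and nulls / len(rows) > null_threshold:
            issues.append(f"null rate exceeded on {col}")
    return issues

rows = [{"id": 1, "amount": 10.0}, {"id": 1, "amount": None}]
print(quality_report(rows, required=["id", "amount"], key="id"))
```

Freshness/SLA checks and run-level metrics live outside the batch itself (scheduler or observability stack), so they are not shown here.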

Anti-patterns (reject in review)

  • Full table overwrite for small incremental changes

  • No checkpoint/watermark for recurring pipeline

  • Unbounded tiny-file generation

  • Dynamic SQL built from user values without parameter binding

  • Production credentials in code/config committed to repo

  • No backfill plan / no rollback strategy
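The parameter-binding anti-pattern is worth making concrete. The sketch below contrasts string-built SQL with a bound parameter, using SQLite and illustrative table/column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

user_input = "alice' OR '1'='1"

# Rejected in review: dynamic SQL built from user values (injectable).
# query = f"SELECT * FROM users WHERE name = '{user_input}'"

# Parameter binding: the driver treats the value as data, never as SQL.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()
print(len(rows))  # 0 -- the malicious string matches no row
```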

Minimal production blueprint

  • Ingest raw to Bronze (append-only)

  • Validate + standardize to Silver

  • Build Gold outputs

  • Emit metrics + quality report

  • Persist checkpoint/watermark

  • Apply lifecycle rules + periodic compaction/maintenance
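The blueprint steps wire together roughly as below. This is a hypothetical skeleton: the in-memory layer logic stands in for real ingest/validate/build jobs, and lifecycle/compaction would run as separate maintenance tasks:

```python
def run_pipeline(raw_events, checkpoint):
    bronze = list(raw_events)                                # 1. raw, append-only
    silver = [e for e in bronze if e.get("id") is not None]  # 2. validate/standardize
    gold = {"order_count": len(silver)}                      # 3. business aggregate
    metrics = {                                              # 4. quality report
        "rows_in": len(bronze),
        "rows_out": len(silver),
        "rejected": len(bronze) - len(silver),
    }
    # 5. persist the watermark only after the run succeeds
    checkpoint["watermark"] = max(
        (e["ts"] for e in silver), default=checkpoint.get("watermark")
    )
    return gold, metrics

ckpt = {}
gold, metrics = run_pipeline(
    [{"id": 1, "ts": "2024-01-01"}, {"id": None, "ts": "2024-01-02"}], ckpt
)
print(gold, metrics, ckpt)
```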

Progressive disclosure (read next as needed)

  • best-practices-detailed.md — comprehensive deep-dive examples

  • @data-engineering-core/patterns/incremental.md — incremental loading patterns

  • @data-engineering-storage-lakehouse — Delta/Iceberg/Hudi-specific behavior

  • @data-engineering-quality — validation framework implementation

  • @data-engineering-observability — metrics/tracing/alerting

References

  • Delta Lake Schema Evolution

  • Apache Iceberg Evolution

  • DuckDB MERGE INTO

  • PyArrow Dataset API

