Data Catalogs

Comprehensive guide to data catalog systems: purpose, Iceberg catalog implementations (Hive Metastore, AWS Glue, Tabular), using DuckDB as a lightweight multi-source catalog, and comparisons of open-source catalog tools (Amundsen, DataHub, OpenMetadata). Learn selection criteria, setup patterns, and best practices for data discovery, governance, and unified querying.

Why Catalogs Matter

Data catalogs are centralized metadata repositories that enable:

  • Data discovery: Find datasets by name, schema, owner, tags

  • Governance: Access control, data lineage, PII tagging

  • Schema management: Track table schemas, partitions, evolution over time

  • Table format abstraction: Iceberg/Delta/Hudi tables registered in a catalog can be queried by multiple engines (Spark, Trino, Flink, DuckDB) without knowing the underlying storage URIs

  • Multi-engine consistency: Same table name across Spark/DuckDB/Trino

Without a catalog, you must manage table locations and schemas manually in each engine.

Iceberg Catalogs

Apache Iceberg requires a catalog to store table metadata (schema, location, snapshots, partition specs). The catalog owns the mapping from catalog name and table identifier to storage location.
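As a concrete illustration, a minimal PyIceberg sketch of this lookup (the REST endpoint and table name are illustrative, not from this guide):

from pyiceberg.catalog import load_catalog

# Any configured catalog works the same way; a local REST catalog is assumed here
catalog = load_catalog("demo", type="rest", uri="http://localhost:8181")

table = catalog.load_table(("db", "orders"))  # identifier -> metadata lookup
print(table.location())  # engines resolve this storage URI via the catalog, never hard-code it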

Catalog Types

Catalog | Backend | Managed? | Best For
--- | --- | --- | ---
Hive Metastore | RDBMS (Postgres/MySQL) | Self-operated | Existing Hadoop installations, high partition counts
AWS Glue | AWS-managed (serverless) | AWS-managed | AWS-native stacks, Athena/EMR
Tabular | SaaS (Iceberg REST) | Tabular-managed | Iceberg-native, Git-like branching
Custom REST | Any HTTP service | Self-built | Custom auth, multi-cloud

Hive Metastore

The traditional, battle-tested catalog used by Hadoop/Hive for over a decade. Stores table metadata in a relational database.

Setup:

Install Hive Metastore (Docker quickstart):

docker run -p 9083:9083 \
  -e METASTORE_DB_TYPE=postgresql \
  -e METASTORE_DB_URL=jdbc:postgresql://localhost/metastore \
  apache/hive:4.0

Initialize the metastore schema (run once):

schematool -dbType postgresql -initSchema

PyIceberg configuration:

from pyiceberg.catalog import load_catalog

# The metastore's Postgres connection is configured on the server side;
# the client only needs the thrift URI and a warehouse location.
catalog = load_catalog(
    "my_hive",
    type="hive",
    uri="thrift://localhost:9083",
    warehouse="s3://my-bucket/warehouse/",
)

Create a table:

catalog.create_table(
    identifier=("my_db", "my_table"),
    schema=table_schema,  # a pyiceberg or pyarrow schema defined elsewhere
    location="s3://my-bucket/my_table/",
)
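To read the table back, continuing with the same catalog (scanning to pandas assumes pyarrow and pandas are installed):

table = catalog.load_table(("my_db", "my_table"))
df = table.scan().to_pandas()  # or .to_arrow() for an Arrow table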

Pros:

  • Mature, battle-tested (10+ years)

  • Handles 500k+ partitions with proper DB tuning

  • Works with any Hadoop ecosystem (Spark, Presto, Hive, Flink)

  • No vendor lock-in

Cons:

  • Self-managed (DB ops, backups, HA)

  • Limited Iceberg-specific features (no branching or tags)

  • Thrift protocol (legacy)

AWS Glue Data Catalog

A fully managed, serverless catalog from AWS with a Hive-compatible API.

Setup:

from pyiceberg.catalog import load_catalog

# Region and credentials come from the standard AWS environment/config chain
catalog = load_catalog(
    "my_glue",
    type="glue",
    warehouse="s3://my-bucket/warehouse/",
)

Tables automatically appear in the AWS Glue console.
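A hedged sketch of registering a new namespace and table through the Glue catalog (names, schema, and location are illustrative; AWS credentials are assumed in the environment):

from pyiceberg.catalog import load_catalog
import pyarrow as pa

catalog = load_catalog("my_glue", type="glue")

catalog.create_namespace("analytics")  # appears as a Glue database
schema = pa.schema([("order_id", pa.int64()), ("amount", pa.float64())])
catalog.create_table(
    ("analytics", "orders"),
    schema=schema,  # PyIceberg accepts a pyarrow schema here
    location="s3://my-bucket/warehouse/analytics/orders/",
)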

Unity Catalog Federation (Databricks):

Databricks Unity Catalog can federate to AWS Glue, exposing Glue-registered tables to Databricks queries without re-registration. The federation connection is configured on the Databricks side; client access to the Glue catalog itself is unchanged from the example above.
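Outside Databricks, Spark can also point an Iceberg catalog directly at Glue. A sketch assuming the iceberg-spark-runtime and AWS bundle jars are on the classpath (catalog name and paths are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register an Iceberg catalog named "glue" backed by the Glue Data Catalog
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse/")
    .getOrCreate()
)

spark.sql("SELECT * FROM glue.my_db.my_table").show()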

Pros:

  • Serverless, no operations

  • Integrated with Athena, EMR, Redshift Spectrum

  • Fine-grained IAM permissions via Lake Formation

  • Works with Delta Lake, Iceberg, Hudi

Cons:

  • Performance degrades >10k partitions (known issue)

  • Some metadata operations are slow or throttled by API rate limits

  • AWS lock-in

Tabular (Iceberg-Native SaaS)

Tabular is a managed catalog service built for Iceberg with Git-like versioning (branching, tags, time travel on metadata).

# Tabular speaks the Iceberg REST catalog protocol; there is no dedicated
# TabularCatalog class in PyIceberg.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "tabular",
    type="rest",
    uri="https://api.tabular.io/ws",  # Tabular's REST endpoint
    credential="tabular-token-...",
    warehouse="s3://my-bucket/warehouse/",
)

Git-like operations (illustrative pseudocode; branch management goes through Tabular's own API/UI rather than a stable PyIceberg call):

catalog.create_table(("my_db", "my_table"), schema)
catalog.create_branch("dev")         # pseudocode: branch for development
catalog.set_current_branch("dev")    # pseudocode

# ... later, merge "dev" back to main in Tabular

Pros:

  • Native Iceberg features (branching, tags, atomic multi-table ops)

  • No ops overhead

  • Git-like workflow for datasets

Cons:

  • Commercial SaaS (cost)

  • Newer ecosystem

Using DuckDB as a Multi-Source Catalog

DuckDB does NOT provide a production data catalog service (no REST API, limited concurrency). However, you can use it as a lightweight embedded catalog to unify queries across multiple heterogeneous sources using the ATTACH statement. This pattern is suitable for:

  • Single-user analytics notebooks

  • PoC/MVP data platforms

  • Local dev environments

  • Small teams (< 10 users)

Architecture

DuckDB Process
├── ATTACH 'postgres://...' AS postgres
├── ATTACH 's3://lakehouse/delta/' AS delta_uc (TYPE unity_catalog)
├── ATTACH 's3://lakehouse/iceberg/' AS iceberg
├── ATTACH 'ducklake:data/catalog.ducklake' AS ducklake
└── CREATE VIEW unified_dataset AS
        SELECT 'postgres' AS source, * FROM postgres.public.orders
        UNION ALL
        SELECT 'delta' AS source, * FROM delta.default.orders
        UNION ALL
        SELECT 'iceberg' AS source, * FROM iceberg.db.orders

Attaching PostgreSQL

Use Postgres as a metadata storage backend (DuckLake) or query external Postgres tables:

-- Attach an external Postgres database (requires the postgres extension)
INSTALL postgres;
LOAD postgres;

ATTACH 'postgres://user:pass@host:5432/mydb' AS pg_db (TYPE postgres);

-- Query Postgres tables
SELECT * FROM pg_db.public.orders LIMIT 10;

DuckLake pattern: Attach a Postgres database as a DuckLake catalog:

-- Install and load the DuckLake extension
INSTALL ducklake;
LOAD ducklake;

-- Attach Postgres as the DuckLake catalog (note the ducklake: prefix)
ATTACH 'ducklake:postgres:dbname=lakehouse_catalog host=host user=user' AS my_lakehouse (
    DATA_PATH 's3://my-bucket/lakehouse/'
);

-- Create tables managed by DuckLake (data is written as Parquet under DATA_PATH;
-- DuckLake uses its own table format rather than Delta or Iceberg)
CREATE TABLE my_lakehouse.silver.orders (
    order_id BIGINT,
    amount DOUBLE,
    order_date DATE
);

Benefits: a single SQL catalog with ACID transactions and snapshot-based time travel (snapshots are recorded in the Postgres catalog).
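For example, snapshot-based time travel can be queried directly. A sketch assuming the attach above and at least one earlier snapshot (the version number is illustrative):

import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake;")
con.execute("""
    ATTACH 'ducklake:postgres:dbname=lakehouse_catalog' AS my_lakehouse (
        DATA_PATH 's3://my-bucket/lakehouse/'
    );
""")

# Read the table as of an earlier snapshot
rows = con.execute(
    "SELECT * FROM my_lakehouse.silver.orders AT (VERSION => 1)"
).fetchall()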

Attaching Delta Lake (Unity Catalog)

Use the uc_catalog extension (experimental; works without a Databricks runtime, but needs a running Unity Catalog server):

INSTALL uc_catalog;
LOAD uc_catalog;

-- Credentials for the Unity Catalog server and the underlying storage
CREATE SECRET uc_cred (
    TYPE UC,
    TOKEN 'uc-token-...',
    ENDPOINT 'https://my-uc-server.example.com',
    AWS_REGION 'us-east-1'
);

ATTACH 'my_uc' AS uc (TYPE UC_CATALOG);

-- Query Delta tables
SELECT * FROM uc.my_db.my_delta_table;

Note: This extension is experimental and requires a Unity Catalog server (the open-source Unity Catalog or Databricks). For local Delta tables, use delta_scan() directly:

INSTALL delta;
LOAD delta;

SELECT * FROM delta_scan('s3://bucket/delta_table/');
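A view over delta_scan() makes the table feel catalog-registered for downstream queries. A sketch (bucket path illustrative; S3 credentials are assumed configured via httpfs secrets):

import duckdb

con = duckdb.connect()
con.execute("INSTALL delta; LOAD delta;")

# Register the scan as a view so it can be queried like a catalog table
con.execute("""
    CREATE VIEW orders_delta AS
    SELECT * FROM delta_scan('s3://bucket/delta_table/');
""")
print(con.execute("SELECT COUNT(*) FROM orders_delta").fetchone())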

Attaching Iceberg

Use the iceberg extension, which attaches catalogs over the Iceberg REST protocol:

import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")

# Attach an Iceberg catalog. The extension speaks the Iceberg REST protocol;
# a thrift Hive Metastore is not attached directly, so the endpoint below
# assumes a REST catalog service in front of the warehouse.
con.execute("""
    ATTACH 'warehouse' AS iceberg (
        TYPE iceberg,
        ENDPOINT 'http://localhost:8181'
    );
""")

# Query
df = con.execute("SELECT * FROM iceberg.db.my_table").df()

Unified Multi-Source View

Create a virtual view that unions data from all sources:

CREATE VIEW unified_orders AS
SELECT 'postgres' AS source_system,
       o.order_id, o.amount, o.order_date,
       NULL AS metadata
FROM pg_db.public.orders o
UNION ALL
SELECT 'delta' AS source_system,
       o.order_id, o.amount, o.order_date,
       o._metadata
FROM uc.production.orders o
UNION ALL
SELECT 'iceberg' AS source_system,
       o.order_id, o.amount, o.order_date,
       o._metadata
FROM iceberg.analytics.orders o;

Now query the unified view:

SELECT source_system, COUNT(*) FROM unified_orders GROUP BY source_system;

Use case: Cross-platform migration validation (ensure counts match between Postgres source and Delta target).
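A minimal validation sketch along these lines, assuming the pg_db and uc attaches from earlier in this section:

import duckdb

con = duckdb.connect("catalog.duckdb")

# Compare row counts between the Postgres source and the Delta target
src, tgt = con.execute("""
    SELECT
        (SELECT COUNT(*) FROM pg_db.public.orders),
        (SELECT COUNT(*) FROM uc.production.orders)
""").fetchone()
assert src == tgt, f"row-count mismatch: postgres={src}, delta={tgt}"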

Limitations of DuckDB-as-Catalog

Limitation | Impact | Workaround
--- | --- | ---
No REST API | Only in-process access; no multi-service sharing | Run a query gateway (e.g., FastAPI) that wraps DuckDB, at the cost of a single point of failure (see the sketch after this table)
Write lock | Only one writer at a time (file-based DB) | Use Postgres as the DuckLake backend for multi-client access
No fine-grained auth | All queries share the same DB credentials | Use separate DuckDB files per user (not shared)
Scalability | Metadata must fit in memory; works for ≤ 100k tables | Large enterprises need a dedicated catalog service (Glue, Hive, Tabular)
Availability | Single file corruption = total loss | Back up the catalog DB frequently
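A minimal sketch of the gateway workaround from the first row (endpoint shape is hypothetical; a real deployment needs authentication and query validation):

import duckdb
from fastapi import FastAPI

app = FastAPI()
con = duckdb.connect("catalog.duckdb", read_only=True)  # shared read-only handle

@app.get("/query")
def run_query(sql: str):
    # Real deployments need auth, SQL validation, and result-size limits
    cur = con.cursor()  # per-request cursor over the shared connection
    return {"rows": cur.execute(sql).fetchall()}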

Conclusion: Use DuckDB as a single-user/developer catalog or for a small-team PoC. For production multi-user data platforms, use AWS Glue (AWS), Hive Metastore (self-hosted), or Tabular (Iceberg-native).

Open Source Catalogs Comparison (2024)

Amundsen (Lyft)

Focus: Data discovery, lightweight

  • Metadata store: Neo4j (graph); search via Elasticsearch

  • Lineage: Limited (no built-in column-level)

  • Search: Simple keyword search, easy to use

  • Governance: Minimal

  • Deployment: Easiest (fewer services)

  • Best for: Small teams needing basic discovery

  • Status: Development slowed after the 2023 acquisition of Stemma (the company commercializing Amundsen) by Teradata

DataHub (LinkedIn)

Focus: Scalable, enterprise-grade

  • Metadata store: MySQL (RDBMS) + Kafka (streaming updates)

  • Lineage: Full end-to-end (auto-ingested from Spark, Flink, etc.)

  • Search: Advanced faceted search (Elasticsearch-backed)

  • Governance: Strong (PII tagging, access policies)

  • Deployment: Complex (requires multiple services)

  • Best for: Large enterprises (LinkedIn-scale)

  • Extensibility: High (many metadata source connectors)

OpenMetadata

Focus: Unified governance & quality

  • Metadata store: Postgres + Elasticsearch/OpenSearch

  • Lineage: Built-in, column-level

  • Search: Good

  • Governance: Excellent (workflows, tasks, approvals, data quality tests integration)

  • Deployment: Moderate (needs Postgres, OpenSearch, Airflow for ingestion)

  • Best for: Teams needing strong governance + quality testing

  • Modern UI: Cleanest interface

Comparison Summary

Tool | Discovery | Lineage | Governance | Scale | Ops Complexity
--- | --- | --- | --- | --- | ---
Amundsen | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐ | Small | Low
DataHub | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Enterprise | High
OpenMetadata | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Medium-Large | Moderate

Recommendation:

  • Start with OpenMetadata for new projects (governance + quality focus)

  • Use DataHub if you need massive scale and lineage automation

  • Avoid Amundsen for new deployments (development has stalled)

Catalog Selection Matrix for Lakehouse

Scenario | Recommended Catalog
--- | ---
AWS-native (Athena, Redshift) | AWS Glue
Azure-native (Synapse, Fabric) | Microsoft Purview / Unity Catalog
Self-hosted Hadoop/Spark | Hive Metastore
Iceberg-first, multi-cloud | Tabular or Hive Metastore
Small team, PoC | DuckDB + DuckLake (single file)
Governance-heavy | OpenMetadata (with a separate metastore for Iceberg/Delta)
LinkedIn-scale | DataHub

Setting Up a Multi-Source Catalog (DuckDB Pattern)

Use this concise flow in the main skill:

  • Create a local DuckDB catalog DB and load extensions (httpfs, postgres, delta, iceberg, ducklake)

  • Attach each source system (Postgres, Delta, Iceberg)

  • Build a unified view over all sources

  • Run validation queries across sources

For complete runnable examples, see:

  • duckdb-multisource.md

Best Practices

Catalog Selection

  • Default to Hive Metastore if you have existing Hadoop investments.

  • Use AWS Glue for pure AWS stacks (Athena, EMR).

  • Choose Tabular for Iceberg-first with branching needs.

  • Keep the Iceberg catalog separate from the business metadata catalog (OpenMetadata); they serve different purposes

DuckDB Multi-Source Pattern

  • Use for development only; not for production multi-user access

  • Store the catalog DuckDB file in version control (encrypted if it contains credentials)

  • Separate credentials from the catalog; use environment variables or a secret manager

  • Attach source systems read-only to prevent accidental writes (see the sketch after this list)

  • Back up the DuckDB file regularly if using it as the primary catalog
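A sketch of the read-only attach practice (connection string illustrative):

import duckdb

con = duckdb.connect("catalog.duckdb")
con.execute("INSTALL postgres; LOAD postgres;")

# READ_ONLY guards against accidental writes to the source system
con.execute("""
    ATTACH 'postgres://user:pass@host:5432/mydb' AS pg_db (TYPE postgres, READ_ONLY);
""")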

References

  • @data-engineering-storage-lakehouse — Delta Lake, Iceberg table formats

  • @data-engineering-best-practices — Medallion architecture, dataset lifecycle, partitioning, schema evolution

  • duckdb-multisource.md — Step-by-step DuckDB multi-source catalog setup

  • Apache Iceberg Catalog Documentation

  • PyIceberg Catalog API

  • DuckDB ATTACH Documentation

  • DuckLake Documentation

  • Unity Catalog Extension

  • OpenMetadata vs DataHub vs Amundsen
