Data Catalogs
Comprehensive guide to data catalog systems: purpose, Iceberg catalog implementations (Hive Metastore, AWS Glue, Tabular), using DuckDB as a lightweight multi-source catalog, and comparisons of open-source catalog tools (Amundsen, DataHub, OpenMetadata). Learn selection criteria, setup patterns, and best practices for data discovery, governance, and unified querying.
Why Catalogs Matter
Data catalogs are centralized metadata repositories that enable:
-
Data discovery: Find datasets by name, schema, owner, tags
-
Governance: Access control, data lineage, PII tagging
-
Schema management: Track table schemas, partitions, evolution over time
-
Table format abstraction: Iceberg/Delta/Hudi tables registered in catalog can be queried by multiple engines (Spark, Trino, Flink, DuckDB) without knowing underlying storage URIs
-
Multi-engine consistency: Same table name across Spark/DuckDB/Trino
Without a catalog, you must manage table locations and schemas manually in each engine.
Iceberg Catalogs
Apache Iceberg requires a catalog to store table metadata (schema, location, snapshots, partition specs). The catalog name → table identifier → storage location mapping.
Catalog Types
Catalog Backend Managed? Best For
Hive Metastore RDBMS (Postgres/MySQL) Self-operated Existing Hadoop installations, high partition counts
AWS Glue AWS-managed (serverless) AWS-managed AWS-native stacks, Athena/EMR
Tabular SaaS (Nessie-backed) Tabular-managed Iceberg-native, Git-like branching
Custom REST Any HTTP service Self-built Custom auth, multi-cloud
Hive Metastore
The traditional, battle-tested catalog used by Hadoop/Hive for over a decade. Stores table metadata in a relational database.
Setup:
Install Hive Metastore (Docker quickstart)
docker run -p 9083:9083 -e METASTORE_DB_TYPE=postgresql
-e METASTORE_DB_URL=jdbc:postgresql://localhost/metastore
apache/hive:4.0
Initialize metastore schema (run once)
schematool -dbType postgresql -initSchema
PyIceberg configuration:
from pyiceberg.catalog import Catalog from pyiceberg.catalog.hive import HiveCatalog
catalog = HiveCatalog( name="my_hive", uri="thrift://localhost:9083", warehouse="s3://my-bucket/warehouse/", properties={ "postgresql.connection": "jdbc:postgresql://localhost/metastore", "postgresql.user": "hive", "postgresql.password": "hive" } )
Create table
catalog.create_table( identifier=("my_db", "my_table"), schema=table_schema, location="s3://my-bucket/my_table/" )
Pros:
-
Mature, battle-tested (10+ years)
-
Handles 500k+ partitions with proper DB tuning
-
Works with any Hadoop ecosystem (Spark, Presto, Hive, Flink)
-
No vendor lock-in
Cons:
-
Self-managed (DB ops, backups, HA)
-
Limited Iceberg-specific features (no branching/namespace)
-
Thrift protocol (legacy)
AWS Glue Data Catalog
Fully managed catalog by AWS, serverless, Hive-compatible API.
Setup:
from pyiceberg.catalog.glue import GlueCatalog
catalog = GlueCatalog( name="my_glue", region="us-east-1", warehouse="s3://my-bucket/warehouse/" )
Tables automatically appear in AWS Glue console
Unity Catalog Federation (Databricks):
Databricks can read AWS Glue tables via federation
catalog = GlueCatalog( name="aws_glue_fed", region="us-east-1" )
Spark on Databricks queries glue catalog
Pros:
-
Serverless, no operations
-
Integrated with Athena, EMR, Redshift Spectrum
-
Fine-grained IAM permissions via Lake Formation
-
Works with Delta Lake, Iceberg, Hudi
Cons:
-
Performance degrades >10k partitions (known issue)
-
Limited metadata operations (slow operations)
-
AWS lock-in
Tabular (Iceberg-Native SaaS)
Tabular is a managed catalog service built for Iceberg with Nessie-like versioning (branching, tags, time travel on metadata).
from pyiceberg.catalog.tabular import TabularCatalog
catalog = TabularCatalog( name="tabular", warehouse="s3://my-bucket/warehouse/", token="tabular-token-..." )
Git-like operations
catalog.create_table("my_table", schema) catalog.create_branch("dev") # Branch for development catalog.set_current_branch("dev")
Later merge to main
Pros:
-
Native Iceberg features (branching, tags, atomic multi-table ops)
-
No ops overhead
-
Git-like workflow for datasets
Cons:
-
Commercial SaaS (cost)
-
Newer ecosystem
Using DuckDB as a Multi-Source Catalog
DuckDB does NOT provide a production data catalog service (no REST API, limited concurrency). However, you can use it as a lightweight embedded catalog to unify queries across multiple heterogeneous sources using the ATTACH statement. This pattern is suitable for:
-
Single-user analytics notebooks
-
PoC/mVP data platforms
-
Local dev environments
-
Small teams (< 10 users)
Architecture
DuckDB Process ├── ATTACH 'postgres://...' AS postgres ├── ATTACH 's3://lakehouse/delta/' AS delta_uc (TYPE unity_catalog) ├── ATTACH 's3://lakehouse/iceberg/' AS iceberg ├── ATTACH 'ducklake:data/catalog.ducklake' AS ducklake └── CREATE VIEW unified_dataset AS SELECT 'postgres' AS source, * FROM postgres.public.orders UNION ALL SELECT 'delta' AS source, * FROM delta.default.orders UNION ALL SELECT 'iceberg' AS source, * FROM iceberg.db.orders
Attaching PostgreSQL
Use Postgres as a metadata storage backend (DuckLake) or query external Postgres tables:
-- Attach external Postgres database ATTACH 'postgres://user:pass@host:5432/mydb' AS pg_db;
-- Query Postgres tables SELECT * FROM pg_db.public.orders LIMIT 10;
DuckLake pattern: Attach a Postgres database as a DuckLake catalog:
-- Install and load DuckLake extension INSTALL ducklake; LOAD ducklake;
-- Attach Postgres as DuckLake catalog ATTACH 'postgres://user:pass@host:5432/lakehouse_catalog' AS my_lakehouse ( DATA_PATH 's3://my-bucket/lakehouse/' );
-- Create Delta/Iceberg tables managed by DuckLake CREATE TABLE my_lakehouse.silver.orders ( order_id BIGINT, amount DOUBLE, order_date DATE ) USING DELTA; -- Or USING ICEBERG
Benefits: Single SQL catalog with ACID transactions, time travel via Postgres WAL.
Attaching Delta Lake (Unity Catalog)
Use the Unity Catalog extension (experimental, works without Databricks):
INSTALL unity_catalog; LOAD unity_catalog;
-- Create a Unity Catalog connection CREATE SECRET uc_cred ( TYPE 'aws', REGION 'us-east-1', ACCESS_KEY_ID 'AKIA...', SECRET_ACCESS_KEY '...' );
ATTACH 'my_uc' AS uc ( TYPE unity_catalog, ENDPOINT 'https://api.uc.tabular.io', SECRET 'uc_cred' );
-- Query Delta tables SELECT * FROM uc.my_db.my_delta_table;
Note: This is experimental and may require Tabular's hosted Unity Catalog. For local Delta tables, use delta_scan() directly:
INSTALL delta; LOAD delta;
SELECT * FROM delta_scan('s3://bucket/delta_table/');
Attaching Iceberg
Use Iceberg extension or read via REST catalog:
from duckdb import DuckDBPyConnection
con = duckdb.connect() con.execute("INSTALL iceberg; LOAD iceberg;")
Attach Iceberg catalog
con.execute(""" ATTACH 'iceberg_catalog' ( TYPE iceberg, CATALOG 'hive', URI 'thrift://localhost:9083', WAREHOUSE 's3://bucket/warehouse/' ); """)
Query
df = con.execute("SELECT * FROM iceberg.db.my_table").df()
Unified Multi-Source View
Create a virtual view that unions data from all sources:
CREATE VIEW unified_orders AS SELECT 'postgres' AS source_system, o.order_id, o.amount, o.order_date, NULL AS metadata FROM pg_db.public.orders o UNION ALL SELECT 'delta' AS source_system, o.order_id, o.amount, o.order_date, o._metadata FROM uc.production.orders o UNION ALL SELECT 'iceberg' AS source_system, o.order_id, o.amount, o.order_date, o._metadata FROM iceberg.analytics.orders o;
Now query the unified view:
SELECT source_system, COUNT(*) FROM unified_orders GROUP BY source_system;
Use case: Cross-platform migration validation (ensure counts match between Postgres source and Delta target).
Limitations of DuckDB-as-Catalog
Limitation Impact Workaround
No REST API Only in-process access; no multi-service sharing Run query gateway (FastAPI) that wraps DuckDB (but single point of failure)
Write lock Only one writer at a time (file-based DB) Use Postgres as DuckLake backend for multi-client
No fine-grained auth All queries share same DB credentials Use separate DuckDB files per user (not shared)
Scalability Metadata fits in memory; works for ≤ 100k tables Large enterprises need dedicated catalog service (Glue, Hive, Tabular)
Availability Single file corruption = total loss Back up catalog DB frequently
Conclusion: Use DuckDB for single-user/developer catalog or small team PoC. For production multi-user data platforms, use AWS Glue (AWS), Hive Metastore (self-hosted), or Tabular (Iceberg-native).
Open Source Catalogs Comparison (2024)
Amundsen (Lyft)
Focus: Data discovery, lightweight
-
Metadata store: Neo4j (graph) + MySQL (text search)
-
Lineage: Limited (no built-in column-level)
-
Search: Simple keyword search, easy to use
-
Governance: Minimal
-
Deployment: Easiest (fewer services)
-
Best for: Small teams needing basic discovery
-
Status: Slower development post-2023 acquisition
DataHub (LinkedIn)
Focus: Scalable, enterprise-grade
-
Metadata store: MySQL (RDBMS) + Kafka (streaming updates)
-
Lineage: Full end-to-end (auto-ingested from Spark, Flink, etc.)
-
Search: Advanced faceted search, elastic
-
Governance: Strong (PII tagging, access policies)
-
Deployment: Complex (requires multiple services)
-
Best for: Large enterprises ( LinkedIn-scale)
-
Extensibility: High (many metadata source connectors)
OpenMetadata
Focus: Unified governance & quality
-
Metadata store: Postgres + Elasticsearch/OpenSearch
-
Lineage: Built-in, column-level
-
Search: Good
-
Governance: Excellent (workflows, tasks, approvals, data quality tests integration)
-
Deployment: Moderate (needs Postgres, OpenSearch, Airflow for ingestion)
-
Best for: Teams needing strong governance + quality testing
-
Modern UI: Cleanest interface
Comparison Summary
Tool Discovery Lineage Governance Scale Ops Complexity
Amundsen ⭐⭐⭐⭐⭐ ⭐⭐ ⭐⭐ Small Low
DataHub ⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐ Enterprise High
OpenMetadata ⭐⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ Medium-Large Moderate
Recommendation:
-
Start with OpenMetadata for new projects (governance + quality focus)
-
Use DataHub if you need massive scale and lineage automation
-
Avoid Amundsen for new deployments (stale)
Catalog Selection Matrix for Lakehouse
Scenario Recommended Catalog
AWS-native (Athena, Redshift) AWS Glue
Azure-native (Synapse, Fabric) Azure Data Catalog / Unity Catalog
Self-hosted Hadoop/Spark Hive Metastore
Iceberg-first, multi-cloud Tabular or Hive Metastore
Small team, PoC DuckDB + DuckLake (single file)
Governance-heavy OpenMetadata (with separate metastore for Iceberg/Delta)
LinkedIn-scale DataHub
Setting Up a Multi-Source Catalog (DuckDB Pattern)
Use this concise flow in the main skill:
-
Create a local DuckDB catalog DB and load extensions (httpfs , postgres , delta , iceberg , ducklake )
-
Attach each source system (Postgres, Delta, Iceberg)
-
Build a unified view over all sources
-
Run validation queries across sources
For complete runnable examples, see:
- duckdb-multisource.md
Best Practices
Catalog Selection
-
Default to Hive Metastore if you have existing Hadoop investments.
-
Use AWS Glue for pure AWS stacks (Athena, EMR).
-
Choose Tabular for Iceberg-first with branching needs.
-
Keep separate Iceberg catalog from business metadata catalog (OpenMetadata) - they serve different purposes.
DuckDB Multi-Source Pattern
-
Use for development only - Not production multi-user
-
Store catalog DuckDB file in version control (encrypted if credentials)
-
Separate credentials from catalog - use environment variables or secret manager
-
Read-only attaches for source systems - prevent accidental writes
-
Back up DuckDB regularly if using as primary catalog
References
-
@data-engineering-storage-lakehouse
-
Delta Lake, Iceberg table formats
-
@data-engineering-best-practices
-
Medallion architecture, dataset lifecycle, partitioning, schema evolution
-
duckdb-multisource.md
-
Step-by-step DuckDB multi-source catalog setup
-
Apache Iceberg Catalog Documentation
-
PyIceberg Catalog API
-
DuckDB ATTACH Documentation
-
DuckLake Documentation
-
Unity Catalog Extension
-
OpenMetadata vs DataHub vs Amundsen