Database Management
Decision guidance for PostgreSQL, DuckDB, Parquet, and Neo4j in hybrid storage architectures.
Contents
- When to use which database
- PostgreSQL quick reference
- DuckDB quick reference
- Parquet quick reference
- PGVector quick reference
- Neo4j quick reference
- Cross-database conventions
- Performance debugging checklist
When to use which database
| Workload | Use | Why |
| --- | --- | --- |
| Transactional (CRUD, users, sessions) | PostgreSQL | ACID, row-level locking, indexes |
| Analytical (aggregations, scans) | DuckDB | Columnar, vectorized, parallel |
| Data storage/interchange | Parquet | Compressed, columnar, portable |
| Metadata + relationships | PostgreSQL | Foreign keys, constraints |
| Ad-hoc exploration | DuckDB | Fast on Parquet, no ETL needed |
| Time-series with point lookups | PostgreSQL + partitioning | Partition pruning + indexes |
| Time-series analytics | DuckDB on Parquet | Scan performance |
| Vector similarity search | PostgreSQL + PGVector | HNSW/IVFFlat indexes, hybrid search |
| RAG / semantic search | PostgreSQL + PGVector | Embeddings + metadata in same DB |
| Graph traversals / relationships | Neo4j | Native graph, index-free adjacency |
| Pattern matching / fraud detection | Neo4j | Multi-hop traversal, path finding |
| Knowledge graphs / ontologies | Neo4j | Flexible schema, relationship-first |
Hybrid pattern example:
- PostgreSQL: transactional data, relationships, users (metadata)
- DuckDB + Parquet: analytical content, aggregations, time-series
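The hybrid pattern can be sketched from a single DuckDB session using its `postgres` extension; the connection string, `users` table, and `events/*.parquet` paths here are illustrative, not part of any fixed schema:

```sql
-- Query PostgreSQL metadata and Parquet analytics together in DuckDB
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=app host=localhost' AS pg (TYPE postgres);

SELECT u.name, sum(e.amount) AS total_amount
FROM pg.public.users AS u
JOIN read_parquet('events/*.parquet') AS e ON e.user_id = u.id
GROUP BY u.name;
```

The join runs in DuckDB: metadata rows stream out of PostgreSQL while the analytical scan stays columnar over Parquet.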
PostgreSQL quick reference
Use for: Metadata, relationships, OLTP workloads, anything needing ACID.
Key decisions:
- Partition tables >100M rows or with retention requirements
- Index columns in WHERE/JOIN clauses, not everything
- Tune autovacuum for high-churn tables
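These decisions combine naturally; a minimal sketch using declarative range partitioning (PostgreSQL 11+), with a hypothetical `events` table:

```sql
-- Range-partition a large table by time; names are illustrative
CREATE TABLE events (
    id         bigint GENERATED ALWAYS AS IDENTITY,
    user_id    bigint NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Index only the filter/join columns; cascades to all partitions
CREATE INDEX ON events (user_id, created_at);

-- Vacuum a high-churn partition more aggressively than the 20% default
ALTER TABLE events_2024_01 SET (autovacuum_vacuum_scale_factor = 0.02);
```

Dropping a partition then satisfies retention requirements without a long-running DELETE.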
See references/postgres-architecture.md for maintenance patterns. See references/postgres-querying.md for advanced query techniques.
DuckDB quick reference
Use for: Analytics, aggregations, Parquet queries, data exploration.
Key decisions:
- Prefer Parquet files over CSV (10-100x faster)
- Let DuckDB auto-parallelize; don't micro-optimize
- For remote data, increase threads beyond CPU count
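A short sketch of all three decisions in one DuckDB session; file paths are illustrative:

```sql
-- One-time CSV-to-Parquet conversion
COPY (SELECT * FROM read_csv_auto('raw/events.csv'))
TO 'events.parquet' (FORMAT parquet);

-- DuckDB parallelizes this scan automatically
SELECT date_trunc('day', created_at) AS day, count(*) AS n
FROM read_parquet('events.parquet')
GROUP BY day;

-- For network-bound remote reads, threads can exceed core count
SET threads = 16;
```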
See references/duckdb-architecture.md for storage and parallelism. See references/duckdb-querying.md for DuckDB-specific SQL features.
Parquet quick reference
Use for: Storing analytical data, data interchange, columnar compression.
Key decisions:
- Target 128MB-1GB file sizes
- Partition by low-to-moderate cardinality columns (date, region)
- Sort by columns used in filters for better pruning
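DuckDB's `COPY` can apply the partitioning and sorting decisions when writing; the `events` table, `event_date` partition key, and `user_id` sort column below are hypothetical:

```sql
-- Write Hive-partitioned, sorted Parquet from an events table
COPY (
    SELECT * FROM events
    ORDER BY user_id          -- sort by the common filter column
)
TO 'datasets/events'
(FORMAT parquet, PARTITION_BY (event_date), ROW_GROUP_SIZE 1000000);
```

Sorting tightens the per-row-group min/max statistics, which is what makes later predicate pruning effective.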
See references/parquet-architecture.md for file design. See references/parquet-querying.md for query optimization.
PGVector quick reference
Use for: Similarity search, RAG applications, semantic search, recommendations.
Key decisions:
- HNSW for low-latency, high-recall (default choice)
- IVFFlat for memory-constrained or batch-updated data
- Use iterative scan for filtered queries
- Consider hybrid search (vector + keyword) for 8-15% accuracy boost
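A minimal HNSW setup, assuming a hypothetical `documents` table with 1536-dimensional embeddings:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Illustrative schema: embeddings and metadata side by side
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(1536)
);

-- HNSW index using cosine distance
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

-- Top-10 nearest neighbours; $1 is the query embedding
SELECT id, content
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;
```

The `<=>` operator is pgvector's cosine distance; it must match the index's operator class (`vector_cosine_ops`) for the index to be used.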
See references/pgvector-architecture.md for index configuration. See references/pgvector-querying.md for hybrid search and filtering.
Neo4j quick reference
Use for: Graph traversals, relationship-heavy queries, pattern matching, knowledge graphs.
Key decisions:
- Model around your queries, not your source data
- Promote properties to nodes when you need to traverse through shared values
- Use specific relationship types to avoid supernode bottlenecks
- Bound all variable-length paths (`[*1..5]`, never `[*]`)
- Use parameters in Cypher for execution plan caching
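A bounded, parameterized traversal putting the last two decisions together; the `:UserAccount` label and `:TRANSFERRED_TO` relationship type are illustrative:

```cypher
// Bounded variable-length path with a parameter (plan is cached per shape)
MATCH (a:UserAccount {id: $userId})-[:TRANSFERRED_TO*1..3]->(b:UserAccount)
RETURN DISTINCT b.id
```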
See references/neo4j-architecture.md for data modeling, indexing, and maintenance. See references/neo4j-querying.md for Cypher optimization and anti-patterns.
Cross-database conventions
Naming
| Convention | Example | Applies to |
| --- | --- | --- |
| snake_case tables | dataset_jobs | All |
| snake_case columns | created_at | PG, DuckDB, Parquet |
| camelCase properties | createdAt | Neo4j |
| PascalCase labels | :UserAccount | Neo4j |
| Singular table names | dataset, not datasets | PostgreSQL |
| Plural for collections | datasets/ directory | Parquet files |
Normalization decisions
| Pattern | When to normalize | When to denormalize |
| --- | --- | --- |
| Lookup tables | PostgreSQL, changes frequently | DuckDB/Parquet, static data |
| Repeated values | PostgreSQL, storage matters | Parquet, compression handles it |
| Joins at query time | PostgreSQL, complex relationships | Parquet, pre-join for analytics |
Timestamps
- Always store as UTC
- PostgreSQL: TIMESTAMPTZ
- Parquet: TIMESTAMP with isAdjustedToUTC=true
- DuckDB: reads both correctly
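On the PostgreSQL side, this convention is a one-liner; the `dataset_jobs` table here is illustrative:

```sql
-- TIMESTAMPTZ stores UTC internally and converts on display
CREATE TABLE dataset_jobs (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Keep sessions explicit about the rendering time zone
SET TIME ZONE 'UTC';
```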
Performance debugging checklist
PostgreSQL slow query
- Run EXPLAIN (ANALYZE, BUFFERS) on the query
- Check for sequential scans on large tables
- Verify indexes exist on filter/join columns
- Check pg_stat_user_tables for bloat (dead tuples)
- Review work_mem if seeing disk sorts
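The first and fourth checks, sketched against a hypothetical `events` table:

```sql
-- Timing plus buffer I/O per plan node
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE user_id = 42;

-- Tables with the most dead tuples (bloat candidates)
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```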
DuckDB slow query
- Check if reading CSV instead of Parquet
- Verify not doing SELECT * on remote data
- Check thread count matches workload
- Look for unnecessary type conversions
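DuckDB's equivalents for these checks, with an illustrative file path:

```sql
-- Per-operator timings for the query under investigation
EXPLAIN ANALYZE
SELECT user_id, count(*)
FROM read_parquet('events.parquet')
GROUP BY user_id;

-- Current thread count
SELECT current_setting('threads');
```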
Parquet slow reads
- Verify predicate pushdown is working (check query plan)
- Check file sizes (too small = overhead, too large = no parallelism)
- Confirm data is sorted by filter columns
- Look for high-cardinality partition keys (too many small files)
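Sort order can be verified from DuckDB's Parquet metadata; the file and column names are illustrative:

```sql
-- Row-group min/max statistics drive pruning; overlapping wide
-- ranges on the filter column mean the sort order is not helping
SELECT row_group_id, stats_min, stats_max
FROM parquet_metadata('events.parquet')
WHERE path_in_schema = 'user_id';
```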
PGVector slow search
- Verify index exists and is being used (EXPLAIN)
- Check ef_search (HNSW) or probes (IVFFlat) settings
- Enable iterative scan for filtered queries
- Check if IVFFlat recall degraded (rebuild index if heavily updated)
- Consider partial indexes for common filters
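The tuning knobs are session-level settings; iterative scan assumes pgvector 0.8.0 or later:

```sql
-- HNSW: higher ef_search = better recall, slower queries
SET hnsw.ef_search = 100;
-- Keep scanning until filtered queries satisfy their LIMIT
SET hnsw.iterative_scan = relaxed_order;

-- For IVFFlat indexes instead:
SET ivfflat.probes = 20;
```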
Neo4j slow query
- Run PROFILE on the query, read operators bottom-up
- Look for AllNodesScan or NodeByLabelScan (missing index)
- Check for CartesianProduct (disconnected MATCH patterns)
- Verify parameters are used instead of literals (plan caching)
- Check for unbounded variable-length paths
- Monitor page_cache.hit_ratio (below 98% = need more page cache memory)
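A profiling run might look like this; the label and relationship type are illustrative:

```cypher
// PROFILE executes the query and reports db hits per operator;
// read from the bottom operator upward
PROFILE
MATCH (a:UserAccount {id: $userId})-[:TRANSFERRED_TO*1..3]->(b)
RETURN count(b) AS reachable
```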