database-design

Database Design

Good database design determines the long-term maintainability, performance, and correctness of any data-driven application. Schema decisions made early are expensive to reverse later. Every table, column, index, and constraint should exist for a reason backed by access patterns and business rules.

Data Modeling Principles

Start from Access Patterns

Design tables around how the application reads and writes data, not around how entities look in a domain model. Two questions drive every schema decision:

What queries will run most frequently? -- these determine table structure, indexes, and denormalization choices
What consistency guarantees does the data need? -- these determine normalization level, constraints, and transaction boundaries

Entity Relationships

Relationship Implementation When to Use

One-to-one Foreign key with UNIQUE constraint on the child table Splitting rarely-accessed columns into a separate table, or enforcing exactly-one semantics

One-to-many Foreign key on the child table referencing the parent Orders to order items, users to addresses

Many-to-many Join table with composite primary key Tags to articles, students to courses

Many-to-many with attributes Join table with its own columns beyond the two foreign keys Enrollment with grade, membership with role

Self-referential Foreign key referencing the same table Org charts, category trees, threaded comments

Naming Conventions

Consistent naming prevents confusion across teams and tools:

Tables: plural nouns in snake_case (order_items , user_addresses )
Columns: singular snake_case describing the value (created_at , total_amount )
Foreign keys: <referenced_table_singular>_id (user_id , order_id )
Indexes: idx_<table>_<columns> (idx_orders_user_id_created_at )
Constraints: chk_<table><rule> , uq<table><columns> , fk<table>_<referenced>

Normalization vs Denormalization

Normal Forms

Form Rule Violation Example

1NF Every column holds atomic values; no repeating groups Storing comma-separated tags in a single column

2NF Every non-key column depends on the entire primary key In a composite-key table, a column depending on only part of the key

3NF No non-key column depends on another non-key column Storing both city and zip_code when zip determines city

BCNF Every determinant is a candidate key A scheduling table where room determines building but room is not a key

When to Denormalize

Normalization prevents anomalies but adds JOINs. Denormalize selectively when:

Read-heavy workloads dominate and JOIN cost is measurable in profiling
Reporting tables need pre-aggregated data that would otherwise require expensive queries
Caching a computed value avoids recalculating on every read (e.g., order_total stored on the order row)
Document-oriented access retrieves an entire aggregate in one read

Rules for safe denormalization:

Always keep the normalized source of truth -- denormalized data is a derived cache
Define how and when the denormalized copy is updated (trigger, application event, batch job)
Monitor for drift between the source and the copy
Document why the denormalization exists and what access pattern it serves

Choosing a Database Type

Type Strengths Fits When

Relational (PostgreSQL, MySQL) ACID transactions, complex queries, mature tooling, JOINs Structured data with relationships, transactional workloads, most CRUD applications

Document (MongoDB, DynamoDB) Flexible schema, nested data, horizontal scaling Aggregates accessed as a unit, rapidly evolving schemas, per-tenant isolation

Key-value (Redis, Memcached) Sub-millisecond reads, simple data model Session storage, caching, counters, rate limiting

Column-family (Cassandra, ScyllaDB) High write throughput, wide rows, linear scaling Time-series, IoT telemetry, append-heavy workloads

Graph (Neo4j, Neptune) Traversal queries, relationship-centric data Social networks, recommendation engines, fraud detection

Time-series (TimescaleDB, InfluxDB) Optimized for time-stamped data, automatic partitioning Metrics, monitoring, financial tick data

Polyglot persistence -- using different databases for different parts of the same system -- is valid when access patterns genuinely differ. It is not valid as a way to avoid learning one database well.

Indexing Fundamentals

Indexes accelerate reads at the cost of slower writes and additional storage. Every index must justify its existence through query patterns.

Index Types

Type Structure Best For

B-tree Balanced tree, sorted data Equality and range queries, ORDER BY, most general-purpose indexing

Hash Hash table Exact equality lookups only; no range support

GiST Generalized search tree Spatial data, geometric queries, range types, nearest-neighbor

GIN Generalized inverted index Full-text search, JSONB containment, array membership

BRIN Block range index Large tables with naturally ordered data (timestamps, sequential IDs)

Composite Index Design

The order of columns in a composite index matters. The leftmost prefix rule means a composite index on (a, b, c) supports queries filtering on (a) , (a, b) , or (a, b, c) , but not (b, c) alone.

Column ordering guidelines:

Equality conditions first -- columns compared with =
Range conditions last -- columns compared with > , < , BETWEEN
Most selective column first among equals

Covering and Partial Indexes

Covering index: includes all columns the query needs, so the database reads only the index. Use INCLUDE (PostgreSQL) or just add columns to the index key.
Partial index: indexes only rows matching a condition, reducing size and write overhead. Ideal for querying a small subset of a large table (e.g., WHERE status = 'pending' ).

See Indexing Strategies Reference for detailed index types, EXPLAIN analysis, and anti-patterns.

Schema Evolution

Schema changes are inevitable. The question is whether they break running applications.

Backward-Compatible Changes (Safe)

Adding a new nullable column
Adding a new table
Adding a new index (may lock briefly on some engines)
Widening a column type (e.g., VARCHAR(50) to VARCHAR(100) )

Breaking Changes (Require Migration Strategy)

Renaming or removing a column
Changing a column type in incompatible ways
Adding a NOT NULL constraint to an existing column with null data
Splitting or merging tables

The Expand-Contract Pattern

For breaking changes in production with zero downtime:

Expand -- add the new structure alongside the old one
Migrate -- backfill data from old to new, dual-write during transition
Switch -- update application code to use the new structure
Contract -- remove the old structure once nothing references it

See Migration Patterns Reference for zero-downtime strategies, rollback techniques, and multi-tool examples.

Partitioning and Sharding

Table Partitioning (Single Database)

Strategy How It Works Use Case

Range Rows split by value ranges (e.g., by month) Time-series data, log tables, archival

List Rows split by discrete values (e.g., by region) Multi-tenant data, geographic segmentation

Hash Rows distributed by hash of a column Even distribution when no natural range exists

Sharding (Multiple Databases)

Sharding distributes data across separate database instances. Use it only after single-instance optimizations (indexing, caching, read replicas) are exhausted.

Shard key selection criteria:

High cardinality -- many distinct values to distribute evenly
Present in most queries -- avoids scatter-gather across all shards
Stable -- values that do not change after creation
Avoid hotspots -- do not shard by a value that concentrates writes (e.g., current date)

Quick Reference: Common Design Mistakes

Mistake Consequence Fix

No foreign key constraints Orphaned rows, inconsistent data Always define foreign keys unless there is a documented reason not to

Over-indexing Slow writes, wasted storage Index only columns used in WHERE, JOIN, ORDER BY of actual queries

Storing computed values without a refresh strategy Stale data, silent bugs Define update triggers, events, or batch jobs alongside any denormalization

Using ENUM types for values that change Schema migration for every new value Use a lookup table with a foreign key instead

Storing money as floating-point Rounding errors Use DECIMAL/NUMERIC or store as integer cents

Missing created_at / updated_at timestamps No auditability, difficult debugging Add timestamp columns to every table by default

Generic type

type_id polymorphism everywhere No referential integrity, complex queries Evaluate STI, CTI, or separate tables first

Reference Files

Reference Contents

Modeling Patterns Polymorphic associations (STI/CTI/TPT), soft deletes, audit trails, temporal data, self-referential trees, JSON columns

Indexing Strategies B-tree/hash/GiST/GIN details, composite index design, covering and partial indexes, EXPLAIN analysis, anti-patterns

Migration Patterns Version-based vs state-based migrations, expand-contract, data migrations, rollback strategies, multi-tool examples

Integration with Other Skills

Situation Recommended Skill

Optimizing query performance and caching Install knowledge-virtuoso from krzysztofsurdy/code-virtuoso for performance optimization patterns

Designing domain models and aggregates Install knowledge-virtuoso from krzysztofsurdy/code-virtuoso for clean architecture and DDD guidance

Building APIs that expose database-backed resources Install knowledge-virtuoso from krzysztofsurdy/code-virtuoso for API design principles

Testing database interactions Install knowledge-virtuoso from krzysztofsurdy/code-virtuoso for testing strategies

database-design

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

solid

refactoring

symfony-components