Scalable Data Schema Design
Design database schemas that grow with your application—from prototype to millions of users.
When to Use
Use this skill when:
-
Starting a new project and designing initial data models
-
Existing schema is causing performance issues
-
Planning to scale from thousands to millions of records
-
Migrating between database systems
-
Data model needs to evolve without breaking changes
Core Principles
- Start Simple, Plan for Complex
Prototype Phase (< 10K records):
-
Optimize for development speed
-
Denormalize for convenience
-
Use simple indexes
-
Single database instance
Growth Phase (10K - 1M records):
-
Normalize to reduce redundancy
-
Add strategic indexes
-
Introduce read replicas
-
Monitor query performance
Scale Phase (> 1M records):
-
Partition large tables
-
Shard by logical boundaries
-
Add caching layers
-
Consider polyglot persistence
- Query Patterns Drive Schema
Anti-pattern: Design schema, then write queries Best practice: Identify queries, then design schema
Workflow:
-
List top 10 most frequent queries
-
List top 5 most expensive queries
-
Design indexes to support both
-
Benchmark with realistic data volumes
-
Plan for Evolution
Schema versioning strategy:
-
Additive changes only (new columns, new tables)
-
Deprecate old fields instead of deleting
-
Use database migration tools (Flyway, Liquibase, Alembic)
-
Version your schema in git
Design Patterns
Pattern 1: Event Sourcing for Audit Trails
Use when: You need complete history of changes
-- Anti-pattern: Update in place UPDATE users SET email = 'new@example.com' WHERE id = 123; -- Lost: Previous email, who changed it, when
-- Better: Event log CREATE TABLE user_events ( id BIGSERIAL PRIMARY KEY, user_id BIGINT NOT NULL, event_type VARCHAR(50) NOT NULL, -- 'email_changed', 'profile_updated' event_data JSONB NOT NULL, -- {old: 'old@...', new: 'new@...'} created_at TIMESTAMPTZ DEFAULT NOW(), created_by BIGINT -- Who made the change );
CREATE INDEX idx_user_events_user_id ON user_events(user_id); CREATE INDEX idx_user_events_created_at ON user_events(created_at DESC);
-- Current state is derived from events CREATE VIEW current_users AS SELECT DISTINCT ON (user_id) user_id, event_data->>'email' as email, created_at as last_updated FROM user_events WHERE event_type = 'email_changed' ORDER BY user_id, created_at DESC;
Trade-offs:
-
✓ Complete audit trail
-
✓ Time travel queries
-
✗ More storage
-
✗ Queries more complex
Pattern 2: Soft Deletes for Recovery
Use when: Data is valuable or deletion is rare
-- Basic soft delete CREATE TABLE products ( id BIGSERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL, price DECIMAL(10,2), deleted_at TIMESTAMPTZ, -- NULL = active, NOT NULL = deleted deleted_by BIGINT -- Who deleted it );
-- Index for active records only (partial index) CREATE INDEX idx_products_active ON products(id) WHERE deleted_at IS NULL;
-- Queries SELECT * FROM products WHERE deleted_at IS NULL; -- Active only SELECT * FROM products WHERE deleted_at IS NOT NULL; -- Deleted only
Trade-offs:
-
✓ Easy recovery
-
✓ Audit compliance
-
✗ Every query needs WHERE deleted_at IS NULL
-
✗ Unique constraints more complex
Pattern 3: Polymorphic Associations
Use when: Multiple entity types share a relationship
-- Anti-pattern: Nullable foreign keys CREATE TABLE comments ( id BIGSERIAL PRIMARY KEY, text TEXT NOT NULL, post_id BIGINT, -- Comment on post photo_id BIGINT, -- OR comment on photo video_id BIGINT, -- OR comment on video -- Only one should be set, but DB can't enforce this created_at TIMESTAMPTZ DEFAULT NOW() );
-- Better: Polymorphic with type CREATE TABLE comments ( id BIGSERIAL PRIMARY KEY, text TEXT NOT NULL, commentable_type VARCHAR(50) NOT NULL, -- 'Post', 'Photo', 'Video' commentable_id BIGINT NOT NULL, created_at TIMESTAMPTZ DEFAULT NOW(),
-- Ensure only valid types CONSTRAINT valid_commentable_type CHECK (commentable_type IN ('Post', 'Photo', 'Video')) );
CREATE INDEX idx_comments_commentable ON comments(commentable_type, commentable_id);
-- Query SELECT * FROM comments WHERE commentable_type = 'Post' AND commentable_id = 123;
Trade-offs:
-
✓ Flexible, extensible
-
✗ Can't use foreign key constraints
-
✗ Joins more complex
When to avoid: Use separate tables if types have very different fields.
Pattern 4: Hierarchical Data (Adjacency List vs Nested Sets)
Adjacency List (simple, common):
CREATE TABLE categories ( id BIGSERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL, parent_id BIGINT REFERENCES categories(id), created_at TIMESTAMPTZ DEFAULT NOW() );
CREATE INDEX idx_categories_parent ON categories(parent_id);
-- Get immediate children (fast) SELECT * FROM categories WHERE parent_id = 5;
-- Get all descendants (requires recursive CTE - slower) WITH RECURSIVE subcategories AS ( SELECT * FROM categories WHERE id = 5 UNION ALL SELECT c.* FROM categories c JOIN subcategories s ON c.parent_id = s.id ) SELECT * FROM subcategories;
Nested Sets (fast reads, slow writes):
CREATE TABLE categories ( id BIGSERIAL PRIMARY KEY, name VARCHAR(255) NOT NULL, lft INT NOT NULL, -- Left boundary rgt INT NOT NULL, -- Right boundary created_at TIMESTAMPTZ DEFAULT NOW() );
CREATE INDEX idx_categories_lft_rgt ON categories(lft, rgt);
-- Get all descendants (fast!) SELECT * FROM categories WHERE lft > 10 AND rgt < 20;
-- Get path to root (fast!) SELECT * FROM categories c1 JOIN categories c2 ON c1.lft BETWEEN c2.lft AND c2.rgt WHERE c1.id = 15 ORDER BY c2.lft;
Choose:
-
Adjacency List: Frequent updates, shallow hierarchies
-
Nested Sets: Mostly reads, deep hierarchies
-
Materialized Path: Mix of both (store path as string: /1/5/12/ )
Pattern 5: Time-Series Data (Partitioning)
Use when: Data grows continuously over time
-- Basic time-series table CREATE TABLE metrics ( id BIGSERIAL, metric_name VARCHAR(100) NOT NULL, value DOUBLE PRECISION NOT NULL, recorded_at TIMESTAMPTZ NOT NULL, tags JSONB, PRIMARY KEY (id, recorded_at) -- Include partition key ) PARTITION BY RANGE (recorded_at);
-- Create partitions (monthly) CREATE TABLE metrics_2026_01 PARTITION OF metrics FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE metrics_2026_02 PARTITION OF metrics FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');
-- Index on each partition (automatically created on partition key) CREATE INDEX idx_metrics_2026_01_name ON metrics_2026_01(metric_name); CREATE INDEX idx_metrics_2026_02_name ON metrics_2026_02(metric_name);
-- Queries automatically use correct partition SELECT AVG(value) FROM metrics WHERE recorded_at BETWEEN '2026-01-15' AND '2026-01-20' AND metric_name = 'cpu_usage';
Benefits:
-
✓ Query only relevant partitions (partition pruning)
-
✓ Drop old partitions easily (instant delete)
-
✓ Parallel queries across partitions
-
✓ Manage storage per partition
Indexing Strategy
When to Index
Index when:
-
Column used in WHERE clause frequently
-
Column used in JOIN conditions
-
Column used in ORDER BY
-
Foreign keys (even if DB doesn't auto-create)
Don't index when:
-
Table is small (< 1000 rows)
-
Column has low cardinality (few distinct values)
-
Column is updated frequently
-
Index would be larger than table
Index Types
B-Tree (default, most common):
CREATE INDEX idx_users_email ON users(email); -- Use for: =, <, >, <=, >=, BETWEEN, IN, LIKE 'prefix%'
Hash (exact matches only):
CREATE INDEX idx_users_email ON users USING HASH(email); -- Use for: = only (faster than B-Tree for equality)
GIN (Generalized Inverted Index - for arrays/JSONB):
CREATE INDEX idx_products_tags ON products USING GIN(tags); -- Use for: JSONB @>, arrays && or @> operators
Partial Index (index subset of rows):
CREATE INDEX idx_active_users ON users(email) WHERE deleted_at IS NULL; -- Smaller, faster for common queries
Composite Index (multiple columns):
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC); -- Order matters! Works for: -- WHERE user_id = X -- WHERE user_id = X ORDER BY created_at DESC -- Doesn't help: -- WHERE created_at = Y (user_id not in query)
Index Maintenance
-- Find unused indexes (PostgreSQL) SELECT schemaname, tablename, indexname, idx_scan FROM pg_stat_user_indexes WHERE idx_scan = 0 AND indexname NOT LIKE '%_pkey';
-- Find duplicate indexes SELECT tablename, array_agg(indexname) FROM pg_indexes GROUP BY tablename, indexdef HAVING COUNT(*) > 1;
-- Rebuild bloated indexes REINDEX INDEX CONCURRENTLY idx_users_email;
Schema Evolution
Additive-Only Migrations
Safe (zero downtime):
-- Add new column with default ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- Add new table CREATE TABLE user_preferences (...);
-- Add new index concurrently (PostgreSQL) CREATE INDEX CONCURRENTLY idx_users_created ON users(created_at);
Risky (requires downtime or careful planning):
-- Drop column (breaks old code) ALTER TABLE users DROP COLUMN old_field;
-- Rename column (breaks old code) ALTER TABLE users RENAME COLUMN old_name TO new_name;
-- Change column type (may lock table) ALTER TABLE users ALTER COLUMN age TYPE BIGINT;
Safe Migration Pattern
3-Step Deploy:
-- Step 1: Add new column (deploy, code ignores it) ALTER TABLE users ADD COLUMN email_new VARCHAR(255);
-- Step 2: Backfill + dual write (deploy, code writes both) UPDATE users SET email_new = email WHERE email_new IS NULL; -- App code now writes to both email and email_new
-- Step 3: Switch over (deploy, code uses only email_new) ALTER TABLE users DROP COLUMN email; ALTER TABLE users RENAME COLUMN email_new TO email;
Versioned Schema
-- Track migrations CREATE TABLE schema_migrations ( version VARCHAR(50) PRIMARY KEY, applied_at TIMESTAMPTZ DEFAULT NOW() );
-- Example migration file: 20260121_add_user_phone.sql BEGIN;
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
INSERT INTO schema_migrations (version) VALUES ('20260121_add_user_phone');
COMMIT;
Database Selection Guide
SQL (Relational)
PostgreSQL:
-
✓ Best for: Complex queries, ACID transactions, JSONB
-
✓ Advanced features: Full-text search, geospatial (PostGIS)
-
Use when: You need strong consistency and rich query capabilities
MySQL/MariaDB:
-
✓ Best for: Read-heavy workloads, replication
-
✓ Widely supported, mature ecosystem
-
Use when: You need simple replication or hosting constraints
SQLite:
-
✓ Best for: Embedded databases, single-user apps
-
Use when: No separate DB server needed
NoSQL
MongoDB (Document):
-
✓ Best for: Flexible schema, rapid iteration
-
✓ Schema-less (actually schema-on-read)
-
Use when: Schema changes frequently, hierarchical data
Redis (Key-Value):
-
✓ Best for: Caching, sessions, pub/sub
-
Use when: Ultra-fast reads, temporary data
DynamoDB (Key-Value):
-
✓ Best for: Serverless, predictable performance
-
Use when: AWS ecosystem, pay-per-request pricing
Cassandra (Wide-Column):
-
✓ Best for: Time-series, write-heavy workloads
-
Use when: Multi-datacenter, massive scale
Decision Matrix
Use Case Database Why
User accounts, transactions PostgreSQL ACID, relations
Product catalog (e-commerce) PostgreSQL Complex queries, inventory
Session storage Redis Fast, ephemeral
Activity feed Cassandra Write-heavy, time-series
Analytics events ClickHouse Columnar, OLAP
Document storage MongoDB Flexible schema
Geospatial queries PostgreSQL + PostGIS Best geo support
Performance Optimization
Query Analysis
-- Analyze query plan (PostgreSQL) EXPLAIN ANALYZE SELECT u.name, COUNT(o.id) as order_count FROM users u LEFT JOIN orders o ON u.id = o.user_id WHERE u.created_at > '2026-01-01' GROUP BY u.id, u.name;
-- Look for: -- "Seq Scan" on large tables → Add index -- "Hash Join" instead of "Nested Loop" → May need different index -- High "actual time" → Slow query
Common Fixes
Problem: Slow JOIN on large tables
-- Before SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE o.status = 'pending';
-- Fix: Index foreign key CREATE INDEX idx_orders_user_id ON orders(user_id); CREATE INDEX idx_orders_status ON orders(status);
Problem: Slow COUNT(*) on large table
-- Before (scans entire table) SELECT COUNT(*) FROM orders;
-- Fix: Use estimate for large tables SELECT reltuples::BIGINT AS estimate FROM pg_class WHERE relname = 'orders';
-- Or: Maintain counter CREATE TABLE table_stats ( table_name VARCHAR(50) PRIMARY KEY, row_count BIGINT, updated_at TIMESTAMPTZ );
Problem: N+1 queries
-- Before (1 query + N queries for each user) users = query("SELECT * FROM users LIMIT 10") for user in users: orders = query("SELECT * FROM orders WHERE user_id = ?", user.id)
-- Fix: Eager load with JOIN or IN SELECT u., o. FROM users u LEFT JOIN orders o ON u.id = o.user_id LIMIT 10;
-- Or use IN for large sets user_ids = [1, 2, 3, ...] SELECT * FROM orders WHERE user_id IN (?, ?, ?)
Anti-Patterns to Avoid
Anti-Pattern Why It's Bad Better Approach
ENUM in column type Hard to change Lookup table with foreign key
Storing arrays as strings Can't query efficiently Array column or join table
GUID/UUID as primary key Fragmented indexes, 16 bytes BIGSERIAL (8 bytes, sequential)
Premature optimization Complexity without benefit Start simple, optimize when needed
Missing foreign keys Data integrity issues Always use FK constraints
No created_at/updated_at Can't audit or debug Add to all tables
Over-normalization Too many JOINs Denormalize for read patterns
Under-normalization Data duplication, update anomalies Normalize until 3NF, then denormalize strategically
Output Format
When helping with schema design:
Schema Analysis
Current Pain Points
- [Specific issue 1]
- [Specific issue 2]
Recommended Schema
[SQL DDL for new tables/changes]
Indexes
[Recommended indexes with rationale]
Migration Strategy
[Step-by-step migration plan]
Trade-offs
- ✓ Advantages
- ✗ Disadvantages
Expected Performance Impact
- Reads: [faster/slower, by how much]
- Writes: [faster/slower, by how much]
- Storage: [increase/decrease]
Integration
Works with:
-
database-schema - Initial schema design
-
api-design - Align schema with API needs
-
systems-decompose - Feature-driven schema design
-
data-infrastructure-at-scale - Infrastructure for scaled schemas