Vector Databases
When to Use This Skill
Use this skill when:
-
Choosing between vector database options
-
Designing semantic/similarity search systems
-
Optimizing vector search performance
-
Understanding ANN algorithm trade-offs
-
Scaling vector search infrastructure
-
Implementing hybrid search (vectors + filters)
Keywords: vector database, embeddings, vector search, similarity search, ANN, approximate nearest neighbor, HNSW, IVF, FAISS, Pinecone, Weaviate, Milvus, Qdrant, Chroma, pgvector, cosine similarity, semantic search
Vector Database Comparison
Managed Services
Database Strengths Limitations Best For
Pinecone Fully managed, easy scaling, enterprise Vendor lock-in, cost at scale Enterprise production
Weaviate Cloud GraphQL, hybrid search, modules Complexity Knowledge graphs
Zilliz Cloud Milvus-based, high performance Learning curve High-scale production
MongoDB Atlas Vector Existing MongoDB users Newer feature MongoDB shops
Elastic Vector Existing Elastic stack Resource heavy Search platforms
Self-Hosted Options
Database Strengths Limitations Best For
Milvus Feature-rich, scalable, GPU support Operational complexity Large-scale production
Qdrant Rust performance, filtering, easy Smaller ecosystem Performance-focused
Weaviate Modules, semantic, hybrid Memory usage Knowledge applications
Chroma Simple, Python-native Limited scale Development, prototyping
pgvector PostgreSQL extension Performance limits Postgres shops
FAISS Library, not DB, fastest No persistence, no filtering Research, embedded
Selection Decision Tree
Need managed, don't want operations? ├── Yes → Pinecone (simplest) or Weaviate Cloud └── No (self-hosted) └── Already using PostgreSQL? ├── Yes, <1M vectors → pgvector └── No └── Need maximum performance at scale? ├── Yes → Milvus or Qdrant └── No └── Prototyping/development? ├── Yes → Chroma └── No → Qdrant (balanced choice)
ANN Algorithms
Algorithm Overview
Exact KNN: • Search ALL vectors • O(n) time complexity • Perfect accuracy • Impractical at scale
Approximate NN (ANN): • Search SUBSET of vectors • O(log n) to O(1) complexity • Near-perfect accuracy • Practical at any scale
HNSW (Hierarchical Navigable Small World)
Layer 3: ○───────────────────────○ (sparse, long connections) │ │ Layer 2: ○───○───────○───────○───○ (medium density) │ │ │ │ │ Layer 1: ○─○─○─○─○─○─○─○─○─○─○─○─○ (denser) │││││││││││││││││││││││ Layer 0: ○○○○○○○○○○○○○○○○○○○○○○○○○ (all vectors)
Search: Start at top layer, greedily descend • Fast: O(log n) search time • High recall: >95% typically • Memory: Extra graph storage
HNSW Parameters:
Parameter Description Trade-off
M
Connections per node Memory vs. recall
ef_construction
Build-time search width Build time vs. recall
ef_search
Query-time search width Latency vs. recall
IVF (Inverted File Index)
Clustering Phase: ┌─────────────────────────────────────────┐ │ Cluster vectors into K centroids │ │ │ │ ● ● ● ● │ │ /│\ /│\ /│\ /│\ │ │ ○○○○○ ○○○○○ ○○○○○ ○○○○○ │ │ Cluster 1 Cluster 2 Cluster 3 Cluster 4│ └─────────────────────────────────────────┘
Search Phase:
- Find nprobe nearest centroids
- Search only those clusters
- Much faster than exhaustive
IVF Parameters:
Parameter Description Trade-off
nlist
Number of clusters Build time vs. search quality
nprobe
Clusters to search Latency vs. recall
IVF-PQ (Product Quantization)
Original Vector (128 dim): [0.1, 0.2, ..., 0.9] (128 × 4 bytes = 512 bytes)
PQ Compressed (8 subvectors, 8-bit codes): [23, 45, 12, 89, 56, 34, 78, 90] (8 bytes)
Memory reduction: 64x Accuracy trade-off: ~5% recall drop
Algorithm Comparison
Algorithm Search Speed Memory Build Time Recall
Flat/Brute Slow (O(n)) Low None 100%
IVF Fast Low Medium 90-95%
IVF-PQ Very fast Very low Medium 85-92%
HNSW Very fast High Slow 95-99%
HNSW+PQ Very fast Medium Slow 90-95%
When to Use Which
< 100K vectors: └── Flat index (exact search is fast enough)
100K - 1M vectors: └── HNSW (best recall/speed trade-off)
1M - 100M vectors: ├── Memory available → HNSW └── Memory constrained → IVF-PQ or HNSW+PQ
100M vectors: └── Sharded IVF-PQ or distributed HNSW
Distance Metrics
Common Metrics
Metric Formula Range Best For
Cosine Similarity A·B / (||A|| ||B||)
[-1, 1] Normalized embeddings
Dot Product A·B
(-∞, ∞) When magnitude matters
Euclidean (L2) √Σ(A-B)²
[0, ∞) Absolute distances
Manhattan (L1) Σ|A-B|
[0, ∞) High-dimensional sparse
Metric Selection
Embeddings pre-normalized (unit vectors)? ├── Yes → Cosine = Dot Product (use Dot, faster) └── No └── Magnitude meaningful? ├── Yes → Dot Product └── No → Cosine Similarity
Note: Most embedding models output normalized vectors → Dot product is usually the best choice
Filtering and Hybrid Search
Pre-filtering vs Post-filtering
Pre-filtering (Filter → Search): ┌─────────────────────────────────────────┐ │ 1. Apply metadata filter │ │ (category = "electronics") │ │ Result: 10K of 1M vectors │ │ │ │ 2. Vector search on 10K vectors │ │ Much faster, guaranteed filter match │ └─────────────────────────────────────────┘
Post-filtering (Search → Filter): ┌─────────────────────────────────────────┐ │ 1. Vector search on 1M vectors │ │ Return top-1000 │ │ │ │ 2. Apply metadata filter │ │ May return < K results! │ └─────────────────────────────────────────┘
Hybrid Search Architecture
Query: "wireless headphones under $100" │ ┌─────┴─────┐ ▼ ▼ ┌───────┐ ┌───────┐ │Vector │ │Filter │ │Search │ │ Build │ │"wire- │ │price │ │less │ │< 100 │ │head- │ │ │ │phones"│ │ │ └───────┘ └───────┘ │ │ └─────┬─────┘ ▼ ┌───────────┐ │ Combine │ │ Results │ └───────────┘
Metadata Index Design
Metadata Type Index Strategy Query Example
Categorical Bitmap/hash index category = "books"
Numeric range B-tree price BETWEEN 10 AND 50
Keyword search Inverted index tags CONTAINS "sale"
Geospatial R-tree/geohash location NEAR (lat, lng)
Scaling Strategies
Sharding Approaches
Naive Sharding (by ID): ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │ IDs 0-N │ │IDs N-2N │ │IDs 2N-3N│ └─────────┘ └─────────┘ └─────────┘ Query → Search ALL shards → Merge results
Semantic Sharding (by cluster): ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │ Tech │ │ Health │ │ Finance │ │ docs │ │ docs │ │ docs │ └─────────┘ └─────────┘ └─────────┘ Query → Route to relevant shard(s) → Faster!
Replication
┌─────────────────────────────────────────┐ │ Load Balancer │ └─────────────────────────────────────────┘ │ │ │ ▼ ▼ ▼ ┌─────────┐ ┌─────────┐ ┌─────────┐ │Replica 1│ │Replica 2│ │Replica 3│ │ (Read) │ │ (Read) │ │ (Read) │ └─────────┘ └─────────┘ └─────────┘ │ │ │ └───────────┼───────────┘ │ ┌─────────┐ │ Primary │ │ (Write) │ └─────────┘
Scaling Decision Matrix
Scale (vectors) Architecture Replication
< 1M Single node Optional
1-10M Single node, more RAM For HA
10-100M Sharded, few nodes Required
100M-1B Sharded, many nodes Required
1B Sharded + tiered Required
Performance Optimization
Index Build Optimization
Optimization Description Impact
Batch insertion Insert in batches of 1K-10K 10x faster
Parallel build Multi-threaded index construction 2-4x faster
Incremental index Add to existing index Avoids rebuild
GPU acceleration Use GPU for training (IVF) 10-100x faster
Query Optimization
Optimization Description Impact
Warm cache Keep index in memory 10x latency reduction
Query batching Batch similar queries Higher throughput
Reduce dimensions PCA, random projection 2-4x faster
Early termination Stop when "good enough" Lower latency
Memory Optimization
Memory per vector: ┌────────────────────────────────────────┐ │ 1536 dims × 4 bytes = 6KB per vector │ │ │ │ 1M vectors: │ │ Raw: 6GB │ │ + HNSW graph: +2-4GB (M-dependent) │ │ = 8-10GB total │ │ │ │ With PQ (64 subquantizers): │ │ 1M vectors: ~64MB │ │ = 100x reduction │ └────────────────────────────────────────┘
Operational Considerations
Backup and Recovery
Strategy Description RPO/RTO
Snapshots Periodic full backup Hours
WAL replication Write-ahead log streaming Minutes
Real-time sync Synchronous replication Seconds
Monitoring Metrics
Metric Description Alert Threshold
Query latency p99 99th percentile latency
100ms
Recall Search accuracy < 90%
QPS Queries per second Capacity dependent
Memory usage Index memory
80%
Index freshness Time since last update Domain dependent
Index Maintenance
┌─────────────────────────────────────────┐ │ Index Maintenance Tasks │ ├─────────────────────────────────────────┤ │ • Compaction: Merge small segments │ │ • Reindex: Rebuild degraded index │ │ • Vacuum: Remove deleted vectors │ │ • Optimize: Tune parameters │ │ │ │ Schedule during low-traffic periods │ └─────────────────────────────────────────┘
Common Patterns
Multi-Tenant Vector Search
Option 1: Namespace/Collection per tenant ┌─────────────────────────────────────────┐ │ tenant_1_collection │ │ tenant_2_collection │ │ tenant_3_collection │ └─────────────────────────────────────────┘ Pro: Complete isolation Con: Many indexes, operational overhead
Option 2: Single collection + tenant filter ┌─────────────────────────────────────────┐ │ shared_collection │ │ metadata: { tenant_id: "..." } │ │ Pre-filter by tenant_id │ └─────────────────────────────────────────┘ Pro: Simpler operations Con: Requires efficient filtering
Real-Time Updates
Write Path: ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Write │ │ Buffer │ │ Merge │ │ Request │───▶│ (Memory) │───▶│ to Index │ └─────────────┘ └─────────────┘ └─────────────┘
Strategy:
- Buffer writes in memory
- Periodically merge to main index
- Search: main index + buffer
- Compact periodically
Embedding Versioning
Version 1 embeddings ──┐ │ Version 2 embeddings ──┼──▶ Parallel indexes during migration │ │ ┌─────────────────────┐ └───▶│ Gradual reindexing │ │ Blue-green switch │ └─────────────────────┘
Cost Estimation
Storage Costs
Cost = (vectors × dimensions × bytes × replication) / GB × $/GB/month
Example: 10M vectors × 1536 dims × 4 bytes × 3 replicas = 184 GB At $0.10/GB/month = $18.40/month storage
Note: Memory (for serving) costs more than storage
Compute Costs
Factors: • QPS (queries per second) • Latency requirements • Index type (HNSW needs more RAM) • Filtering complexity
Rule of thumb: • 1M vectors, HNSW, <50ms latency: 16GB RAM • 10M vectors, HNSW, <50ms latency: 64-128GB RAM • 100M vectors: Distributed system required
Related Skills
-
rag-architecture
-
Using vector databases in RAG systems
-
llm-serving-patterns
-
LLM inference with vector retrieval
-
ml-system-design
-
End-to-end ML pipeline design
-
estimation-techniques
-
Capacity planning for vector systems
Version History
- v1.0.0 (2025-12-26): Initial release - Vector database patterns for systems design
Last Updated
Date: 2025-12-26