vector-databases

Vector Databases

When to Use This Skill

Use this skill when:

Choosing between vector database options
Designing semantic/similarity search systems
Optimizing vector search performance
Understanding ANN algorithm trade-offs
Scaling vector search infrastructure
Implementing hybrid search (vectors + filters)

Keywords: vector database, embeddings, vector search, similarity search, ANN, approximate nearest neighbor, HNSW, IVF, FAISS, Pinecone, Weaviate, Milvus, Qdrant, Chroma, pgvector, cosine similarity, semantic search

Vector Database Comparison

Managed Services

Database Strengths Limitations Best For

Pinecone Fully managed, easy scaling, enterprise Vendor lock-in, cost at scale Enterprise production

Weaviate Cloud GraphQL, hybrid search, modules Complexity Knowledge graphs

Zilliz Cloud Milvus-based, high performance Learning curve High-scale production

MongoDB Atlas Vector Existing MongoDB users Newer feature MongoDB shops

Elastic Vector Existing Elastic stack Resource heavy Search platforms

Self-Hosted Options

Database Strengths Limitations Best For

Milvus Feature-rich, scalable, GPU support Operational complexity Large-scale production

Qdrant Rust performance, filtering, easy Smaller ecosystem Performance-focused

Weaviate Modules, semantic, hybrid Memory usage Knowledge applications

Chroma Simple, Python-native Limited scale Development, prototyping

pgvector PostgreSQL extension Performance limits Postgres shops

FAISS Library, not DB, fastest No persistence, no filtering Research, embedded

Selection Decision Tree

Need managed, don't want operations? ├── Yes → Pinecone (simplest) or Weaviate Cloud └── No (self-hosted) └── Already using PostgreSQL? ├── Yes, <1M vectors → pgvector └── No └── Need maximum performance at scale? ├── Yes → Milvus or Qdrant └── No └── Prototyping/development? ├── Yes → Chroma └── No → Qdrant (balanced choice)

ANN Algorithms

Algorithm Overview

Exact KNN: • Search ALL vectors • O(n) time complexity • Perfect accuracy • Impractical at scale

Approximate NN (ANN): • Search SUBSET of vectors • O(log n) to O(1) complexity • Near-perfect accuracy • Practical at any scale

HNSW (Hierarchical Navigable Small World)

Layer 3: ○───────────────────────○ (sparse, long connections) │ │ Layer 2: ○───○───────○───────○───○ (medium density) │ │ │ │ │ Layer 1: ○─○─○─○─○─○─○─○─○─○─○─○─○ (denser) │││││││││││││││││││││││ Layer 0: ○○○○○○○○○○○○○○○○○○○○○○○○○ (all vectors)

Search: Start at top layer, greedily descend • Fast: O(log n) search time • High recall: >95% typically • Memory: Extra graph storage

HNSW Parameters:

Parameter Description Trade-off

Connections per node Memory vs. recall

ef_construction

Build-time search width Build time vs. recall

ef_search

Query-time search width Latency vs. recall

IVF (Inverted File Index)

Clustering Phase: ┌─────────────────────────────────────────┐ │ Cluster vectors into K centroids │ │ │ │ ● ● ● ● │ │ /│\ /│\ /│\ /│\ │ │ ○○○○○ ○○○○○ ○○○○○ ○○○○○ │ │ Cluster 1 Cluster 2 Cluster 3 Cluster 4│ └─────────────────────────────────────────┘

Search Phase:

Find nprobe nearest centroids
Search only those clusters
Much faster than exhaustive

IVF Parameters:

Parameter Description Trade-off

nlist

Number of clusters Build time vs. search quality

nprobe

Clusters to search Latency vs. recall

IVF-PQ (Product Quantization)

Original Vector (128 dim): [0.1, 0.2, ..., 0.9] (128 × 4 bytes = 512 bytes)

PQ Compressed (8 subvectors, 8-bit codes): [23, 45, 12, 89, 56, 34, 78, 90] (8 bytes)

Memory reduction: 64x Accuracy trade-off: ~5% recall drop

Algorithm Comparison

Algorithm Search Speed Memory Build Time Recall

Flat/Brute Slow (O(n)) Low None 100%

IVF Fast Low Medium 90-95%

IVF-PQ Very fast Very low Medium 85-92%

HNSW Very fast High Slow 95-99%

HNSW+PQ Very fast Medium Slow 90-95%

When to Use Which

< 100K vectors: └── Flat index (exact search is fast enough)

100K - 1M vectors: └── HNSW (best recall/speed trade-off)

1M - 100M vectors: ├── Memory available → HNSW └── Memory constrained → IVF-PQ or HNSW+PQ

100M vectors: └── Sharded IVF-PQ or distributed HNSW

Distance Metrics

Common Metrics

Metric Formula Range Best For

Cosine Similarity A·B / (||A|| ||B||)

[-1, 1] Normalized embeddings

Dot Product A·B

(-∞, ∞) When magnitude matters

Euclidean (L2) √Σ(A-B)²

[0, ∞) Absolute distances

Manhattan (L1) Σ|A-B|

[0, ∞) High-dimensional sparse

Metric Selection

Embeddings pre-normalized (unit vectors)? ├── Yes → Cosine = Dot Product (use Dot, faster) └── No └── Magnitude meaningful? ├── Yes → Dot Product └── No → Cosine Similarity

Note: Most embedding models output normalized vectors → Dot product is usually the best choice

Filtering and Hybrid Search

Pre-filtering vs Post-filtering

Pre-filtering (Filter → Search): ┌─────────────────────────────────────────┐ │ 1. Apply metadata filter │ │ (category = "electronics") │ │ Result: 10K of 1M vectors │ │ │ │ 2. Vector search on 10K vectors │ │ Much faster, guaranteed filter match │ └─────────────────────────────────────────┘

Post-filtering (Search → Filter): ┌─────────────────────────────────────────┐ │ 1. Vector search on 1M vectors │ │ Return top-1000 │ │ │ │ 2. Apply metadata filter │ │ May return < K results! │ └─────────────────────────────────────────┘

Hybrid Search Architecture

Query: "wireless headphones under $100" │ ┌─────┴─────┐ ▼ ▼ ┌───────┐ ┌───────┐ │Vector │ │Filter │ │Search │ │ Build │ │"wire- │ │price │ │less │ │< 100 │ │head- │ │ │ │phones"│ │ │ └───────┘ └───────┘ │ │ └─────┬─────┘ ▼ ┌───────────┐ │ Combine │ │ Results │ └───────────┘

Metadata Index Design

Metadata Type Index Strategy Query Example

Categorical Bitmap/hash index category = "books"

Numeric range B-tree price BETWEEN 10 AND 50

Keyword search Inverted index tags CONTAINS "sale"

Geospatial R-tree/geohash location NEAR (lat, lng)

Scaling Strategies

Sharding Approaches

Naive Sharding (by ID): ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │ IDs 0-N │ │IDs N-2N │ │IDs 2N-3N│ └─────────┘ └─────────┘ └─────────┘ Query → Search ALL shards → Merge results

Semantic Sharding (by cluster): ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │ │ Tech │ │ Health │ │ Finance │ │ docs │ │ docs │ │ docs │ └─────────┘ └─────────┘ └─────────┘ Query → Route to relevant shard(s) → Faster!

Replication

┌─────────────────────────────────────────┐ │ Load Balancer │ └─────────────────────────────────────────┘ │ │ │ ▼ ▼ ▼ ┌─────────┐ ┌─────────┐ ┌─────────┐ │Replica 1│ │Replica 2│ │Replica 3│ │ (Read) │ │ (Read) │ │ (Read) │ └─────────┘ └─────────┘ └─────────┘ │ │ │ └───────────┼───────────┘ │ ┌─────────┐ │ Primary │ │ (Write) │ └─────────┘

Scaling Decision Matrix

Scale (vectors) Architecture Replication

< 1M Single node Optional

1-10M Single node, more RAM For HA

10-100M Sharded, few nodes Required

100M-1B Sharded, many nodes Required

1B Sharded + tiered Required

Performance Optimization

Index Build Optimization

Optimization Description Impact

Batch insertion Insert in batches of 1K-10K 10x faster

Parallel build Multi-threaded index construction 2-4x faster

Incremental index Add to existing index Avoids rebuild

GPU acceleration Use GPU for training (IVF) 10-100x faster

Query Optimization

Optimization Description Impact

Warm cache Keep index in memory 10x latency reduction

Query batching Batch similar queries Higher throughput

Reduce dimensions PCA, random projection 2-4x faster

Early termination Stop when "good enough" Lower latency

Memory Optimization

Memory per vector: ┌────────────────────────────────────────┐ │ 1536 dims × 4 bytes = 6KB per vector │ │ │ │ 1M vectors: │ │ Raw: 6GB │ │ + HNSW graph: +2-4GB (M-dependent) │ │ = 8-10GB total │ │ │ │ With PQ (64 subquantizers): │ │ 1M vectors: ~64MB │ │ = 100x reduction │ └────────────────────────────────────────┘

Operational Considerations

Backup and Recovery

Strategy Description RPO/RTO

Snapshots Periodic full backup Hours

WAL replication Write-ahead log streaming Minutes

Real-time sync Synchronous replication Seconds

Monitoring Metrics

Metric Description Alert Threshold

Query latency p99 99th percentile latency

100ms

Recall Search accuracy < 90%

QPS Queries per second Capacity dependent

Memory usage Index memory

80%

Index freshness Time since last update Domain dependent

Index Maintenance

┌─────────────────────────────────────────┐ │ Index Maintenance Tasks │ ├─────────────────────────────────────────┤ │ • Compaction: Merge small segments │ │ • Reindex: Rebuild degraded index │ │ • Vacuum: Remove deleted vectors │ │ • Optimize: Tune parameters │ │ │ │ Schedule during low-traffic periods │ └─────────────────────────────────────────┘

Common Patterns

Multi-Tenant Vector Search

Option 1: Namespace/Collection per tenant ┌─────────────────────────────────────────┐ │ tenant_1_collection │ │ tenant_2_collection │ │ tenant_3_collection │ └─────────────────────────────────────────┘ Pro: Complete isolation Con: Many indexes, operational overhead

Option 2: Single collection + tenant filter ┌─────────────────────────────────────────┐ │ shared_collection │ │ metadata: { tenant_id: "..." } │ │ Pre-filter by tenant_id │ └─────────────────────────────────────────┘ Pro: Simpler operations Con: Requires efficient filtering

Real-Time Updates

Write Path: ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Write │ │ Buffer │ │ Merge │ │ Request │───▶│ (Memory) │───▶│ to Index │ └─────────────┘ └─────────────┘ └─────────────┘

Strategy:

Buffer writes in memory
Periodically merge to main index
Search: main index + buffer
Compact periodically

Embedding Versioning

Version 1 embeddings ──┐ │ Version 2 embeddings ──┼──▶ Parallel indexes during migration │ │ ┌─────────────────────┐ └───▶│ Gradual reindexing │ │ Blue-green switch │ └─────────────────────┘

Cost Estimation

Storage Costs

Cost = (vectors × dimensions × bytes × replication) / GB × $/GB/month

Example: 10M vectors × 1536 dims × 4 bytes × 3 replicas = 184 GB At $0.10/GB/month = $18.40/month storage

Note: Memory (for serving) costs more than storage

Compute Costs

Factors: • QPS (queries per second) • Latency requirements • Index type (HNSW needs more RAM) • Filtering complexity

Rule of thumb: • 1M vectors, HNSW, <50ms latency: 16GB RAM • 10M vectors, HNSW, <50ms latency: 64-128GB RAM • 100M vectors: Distributed system required

Related Skills

rag-architecture
Using vector databases in RAG systems
llm-serving-patterns
LLM inference with vector retrieval
ml-system-design
End-to-end ML pipeline design
estimation-techniques
Capacity planning for vector systems

Version History

v1.0.0 (2025-12-26): Initial release - Vector database patterns for systems design

Last Updated

Date: 2025-12-26

vector-databases

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

design-thinking

plantuml-syntax

system-prompt-engineering