Mapping Mistakes
- Always define explicit mappings—dynamic mapping guesses wrong (first "123" makes field integer, later "abc" fails)
textfor full-text search,keywordfor exact match/aggregations—using text for IDs breaks filters- Can't change field type after indexing—must reindex to new index with correct mapping
- Set
dynamic: "strict"to reject unmapped fields—catches typos in field names
Text vs Keyword
textis analyzed (tokenized, lowercased)—"Quick Brown" matches search for "quick"keywordis exact bytes—"Quick Brown" only matches exactly "Quick Brown"- Need both? Use multi-field:
"title": { "type": "text", "fields": { "raw": { "type": "keyword" }}} - Sort/aggregate on
title.raw, search ontitle
Query vs Filter Context
- Query context calculates relevance score—expensive, use for search ranking
- Filter context is yes/no—cacheable, use for exact conditions (status, date ranges)
- Combine:
bool.mustfor scoring,bool.filterfor filtering without scoring - Range queries on dates/numbers almost always belong in filter, not query
Analyzers
standardanalyzer lowercases and removes punctuation—fine for most textkeywordanalyzer keeps exact string—use for codes, SKUs, emails- Language analyzers (
english) stem words—"running" matches "run" - Test analyzer with
_analyzeendpoint before indexing—surprises in production hurt
Nested vs Object
- Object type flattens arrays—
{"tags": [{"key":"a","val":1}, {"key":"b","val":2}]}becomestags.key: [a,b], tags.val: [1,2] - Flattened loses association—query
key=a AND val=2incorrectly matches above - Use
nestedtype to preserve object boundaries—requiresnestedquery wrapper - Nested is expensive—avoid for high-cardinality arrays
Pagination Traps
from+sizelimited to 10,000 hits—deep pagination failssearch_afterfor deep pagination—requires consistent sort, typically_id- Scroll API for bulk export—keeps point-in-time view, but ties up resources
- Don't use scroll for user pagination—search_after is correct choice
Bulk Operations
- Never index documents one-by-one—use
_bulkAPI, 5-15MB batches - Bulk format: newline-delimited JSON, action line then document line
- Check response for partial failures—bulk can succeed overall with individual doc errors
- Set
refresh=falseduring bulk loads—refresh after batch completes
Performance
_source: falsewithstored_fieldsif you don't need full document—reduces I/O- Use
filterfor cacheable conditions—Elasticsearch caches filter results - Avoid leading wildcards (
*term)—forces full scan; usereversefield for suffix search profile: trueshows query execution breakdown—find slow clauses
Sharding
- Shard size 10-50GB optimal—too small = overhead, too large = slow recovery
- Number of shards fixed at creation—can't reshard without reindexing
- Replicas for read throughput and availability—set based on query load
- Start with 1 shard for small indices—over-sharding kills performance
Index Management
- Use index templates—new indices get consistent mappings and settings
- Use aliases for zero-downtime reindexing—point alias to new index after reindex
- ILM (Index Lifecycle Management) for time-series—auto-rollover, delete old indices
- Close unused indices to free memory—closed index uses no heap
Aggregations
termsagg needskeywordfield—text fields fail or give garbage- Default
size: 10on terms agg—increase to get all buckets, or use composite - Cardinality is approximate (HyperLogLog)—exact count requires scanning all docs
- Nested aggs require
nestedwrapper—matches nested query pattern
Common Errors
- "cluster_block_exception"—disk > 85%, cluster goes read-only; clear disk, reset with
_cluster/settings - "version conflict"—concurrent update; retry with
retry_on_conflictor use optimistic locking - "circuit_breaker_exception"—query uses too much memory; reduce aggregation scope
- Mapping explosion from dynamic fields—set
index.mapping.total_fields.limitand use strict mapping