Elasticsearch Analysis

Authentication

IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for ELASTICSEARCH_URL, ES_USER, or ES_PASSWORD in environment variables, because they won't be visible to you. Run the scripts directly; authentication is handled transparently.

MANDATORY: Statistics-First Investigation

NEVER dump raw logs. Always follow this pattern:

STATISTICS → SAMPLE → PATTERNS → CORRELATE

1. Statistics First - Know volume, error rate, and top patterns before sampling
2. Strategic Sampling - Choose the right strategy based on statistics
3. Pattern Extraction - Cluster similar errors to find root causes
4. Context Correlation - Investigate around anomaly timestamps
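
The pattern-extraction step can be illustrated with a small sketch. This is not how get_statistics.py is implemented (its internals are not shown here); it is one common approach, assuming that replacing variable tokens (UUIDs, IPs, numbers) with placeholders is enough to group similar messages. The rule list and the function names normalize and top_patterns are hypothetical.

```python
import re
from collections import Counter

# Hypothetical normalization rules, applied in order (most specific first).
_RULES = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-f]+\b", re.I), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def normalize(message: str) -> str:
    """Replace variable tokens with placeholders so similar messages collapse."""
    for pattern, token in _RULES:
        message = pattern.sub(token, message)
    return message


def top_patterns(messages, limit=5):
    """Return the most frequent normalized patterns as (pattern, count) pairs."""
    return Counter(normalize(m) for m in messages).most_common(limit)


if __name__ == "__main__":
    # Invented sample messages, for illustration only.
    sample = [
        "Connection refused to 10.0.0.1:5432 after 3 retries",
        "Connection refused to 10.0.0.2:5432 after 5 retries",
        "User 42 not found",
    ]
    for pattern, count in top_patterns(sample):
        print(count, pattern)
```

Grouping on the normalized form keeps the triage list short: two hosts failing the same way show up as one pattern with a count of 2 rather than as two unrelated messages.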

Available Scripts

All scripts are in .claude/skills/observability-elasticsearch/scripts/

PRIMARY INVESTIGATION SCRIPTS

get_statistics.py - ALWAYS START HERE

Comprehensive statistics with pattern extraction.

python .claude/skills/observability-elasticsearch/scripts/get_statistics.py [--index INDEX] [--time-range MINUTES]

Examples:

python .claude/skills/observability-elasticsearch/scripts/get_statistics.py --time-range 60
python .claude/skills/observability-elasticsearch/scripts/get_statistics.py --index logs-production

Output includes:

Total count, error count, error rate percentage
Status distribution (info, warn, error)
Top services/sources by log volume
Top error patterns (crucial for quick triage)
Actionable recommendation

sample_logs.py - Strategic Sampling

Choose the right sampling strategy based on statistics.

python .claude/skills/observability-elasticsearch/scripts/sample_logs.py --strategy STRATEGY [--index INDEX] [--limit N]

Strategies:

errors_only - Only error logs (default for incidents)

warnings_up - Warning and error logs

around_time - Logs around a specific timestamp

all - All log levels

Examples:

python .claude/skills/observability-elasticsearch/scripts/sample_logs.py --strategy errors_only --index logs-production
python .claude/skills/observability-elasticsearch/scripts/sample_logs.py --strategy around_time --timestamp "2026-01-27T05:00:00Z" --window 5

Lucene Query Syntax

Basic Searches

Simple term

error

Phrase

"connection refused"

Field search

level:ERROR

Wildcard

message:timeout*

Multiple terms (implicit OR)

error warning

Required term (AND)

+error +timeout

Field Queries

Exact match

level:ERROR

Wildcard

host:web-*

Range (numeric)

status:[400 TO 599]

Range (dates)

@timestamp:[2024-01-15T10:00:00 TO 2024-01-15T11:00:00]

Exists

_exists_:error.stack_trace

Boolean Operators

AND

error AND timeout

OR

error OR warning

NOT

error NOT debug

Grouping

(error OR warning) AND service:api
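
A Lucene expression like the ones above can be sent through the Query DSL by wrapping it in a query_string query. The sketch below builds such a request body and bounds it by time and size; the function name lucene_search_body and its defaults are assumptions for illustration, and the @timestamp field name is taken from the examples in this document.

```python
import json


def lucene_search_body(lucene_query, minutes=60, limit=20):
    """Wrap a Lucene query string in a bounded Query DSL search body."""
    return {
        "size": limit,  # always cap the number of hits returned
        "query": {
            "bool": {
                "must": [{"query_string": {"query": lucene_query}}],
                # Time bound goes in filter context: no scoring, cacheable.
                "filter": [{"range": {"@timestamp": {"gte": f"now-{minutes}m"}}}],
            }
        },
        "sort": [{"@timestamp": {"order": "desc"}}],
    }


if __name__ == "__main__":
    body = lucene_search_body("(error OR warning) AND service:api", minutes=30, limit=10)
    print(json.dumps(body, indent=2))
```

Keeping the time range and size inside the helper makes it harder to issue an unbounded query by accident.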

Query DSL (JSON)

Match Query

{ "query": { "match": { "message": "connection error" } } }

Term Query (Exact Match)

{ "query": { "term": { "level": "ERROR" } } }

Bool Query (Compound)

{
  "query": {
    "bool": {
      "must": [
        {"term": {"level": "ERROR"}},
        {"match": {"message": "timeout"}}
      ],
      "must_not": [
        {"term": {"service": "healthcheck"}}
      ],
      "filter": [
        {"range": {"@timestamp": {"gte": "now-1h"}}}
      ]
    }
  }
}

Aggregations

{
  "size": 0,
  "aggs": {
    "errors_by_service": {
      "terms": {"field": "service.keyword", "size": 10}
    }
  }
}

Investigation Workflow

Standard Incident Investigation

┌─────────────────────────────────────────────────────────────┐
│ 1. STATISTICS FIRST (mandatory)                             │
│    python get_statistics.py --index <index>                 │
│    → Know volume, error rate, top patterns                  │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
                       High Error Rate?
                 ┌─────────────┴─────────────┐
                 │                           │
             YES (>5%)                       NO
                 │                           │
                 ▼                           ▼
┌─────────────────────────────┐ ┌───────────────────────────────────────────┐
│ 2. FAST PATH                │ │ 2. TARGETED INVESTIGATION                 │
│    Sample errors directly   │ │    Filter by specific criteria            │
│    python sample_logs.py    │ │    python sample_logs.py --strategy all   │
│    --strategy errors_only   │ │    → Look for anomalies                   │
└─────────────────────────────┘ └───────────────────────────────────────────┘
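
The branch in the workflow above comes down to one comparison against the 5% threshold. As a minimal sketch, assuming the total and error counts have already been read from the get_statistics.py output (its exact output format is not specified here), the next command could be chosen like this. The helper names error_rate and next_step are hypothetical; the script path and flags are the ones documented above.

```python
SAMPLE_SCRIPT = ".claude/skills/observability-elasticsearch/scripts/sample_logs.py"


def error_rate(total, errors):
    """Error rate as a percentage; 0.0 when there are no logs at all."""
    return 0.0 if total == 0 else 100.0 * errors / total


def next_step(total, errors, index, threshold_pct=5.0):
    """Pick the sampling strategy from the workflow: fast path above the threshold."""
    rate = error_rate(total, errors)
    strategy = "errors_only" if rate > threshold_pct else "all"
    command = ["python", SAMPLE_SCRIPT, "--strategy", strategy, "--index", index]
    return rate, command


if __name__ == "__main__":
    # Invented counts, for illustration only.
    rate, command = next_step(total=2000, errors=150, index="logs-production")
    print(f"error rate {rate:.1f}% ->", " ".join(command))
```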

Quick Commands Reference

| Goal                | Command                                                |
|---------------------|--------------------------------------------------------|
| Start investigation | get_statistics.py --index X                            |
| Sample errors only  | sample_logs.py --strategy errors_only --index X        |
| Investigate spike   | sample_logs.py --strategy around_time --timestamp T    |
| All logs            | sample_logs.py --strategy all --index X --limit 20     |

Common Aggregation Patterns

Errors Over Time

{
  "size": 0,
  "query": {"term": {"level": "ERROR"}},
  "aggs": {
    "errors_over_time": {
      "date_histogram": {"field": "@timestamp", "fixed_interval": "5m"}
    }
  }
}
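
The result of this aggregation can feed the around_time strategy directly. The sketch below picks the busiest bucket from a date_histogram response and formats its start time as a timestamp; it relies on the standard bucket shape (key in epoch milliseconds, doc_count). The function name find_spike and the sample response are illustrative, not output from a real cluster.

```python
from datetime import datetime, timezone


def find_spike(response, agg_name="errors_over_time"):
    """Return (ISO timestamp, count) of the bucket with the most errors, or None."""
    buckets = response["aggregations"][agg_name]["buckets"]
    if not buckets:
        return None
    peak = max(buckets, key=lambda b: b["doc_count"])
    # date_histogram bucket keys are epoch milliseconds (bucket start time).
    start = datetime.fromtimestamp(peak["key"] / 1000, tz=timezone.utc)
    return start.strftime("%Y-%m-%dT%H:%M:%SZ"), peak["doc_count"]


if __name__ == "__main__":
    base = int(datetime(2026, 1, 27, 4, 50, tzinfo=timezone.utc).timestamp() * 1000)
    step = 5 * 60 * 1000  # 5-minute buckets, matching fixed_interval above
    # Invented response, for illustration only.
    response = {"aggregations": {"errors_over_time": {"buckets": [
        {"key": base, "doc_count": 12},
        {"key": base + step, "doc_count": 9},
        {"key": base + 2 * step, "doc_count": 240},
    ]}}}
    print(find_spike(response))
```

The returned timestamp can be passed as --timestamp to sample_logs.py with --strategy around_time.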

Top Error Messages

{
  "size": 0,
  "query": {"term": {"level": "ERROR"}},
  "aggs": {
    "top_errors": {
      "terms": {"field": "message.keyword", "size": 10}
    }
  }
}

Nested Aggregation (Errors by Service, then by Message)

{
  "size": 0,
  "aggs": {
    "by_service": {
      "terms": {"field": "service.keyword", "size": 10},
      "aggs": {
        "by_message": {
          "terms": {"field": "message.keyword", "size": 5}
        }
      }
    }
  }
}
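
Nested terms aggregations come back as buckets inside buckets, which is awkward to scan. A minimal sketch for flattening the response of the query above into (service, message, count) rows follows; the function name flatten_service_errors and the sample response are hypothetical, while the bucket layout (key, doc_count, sub-aggregation by name) is the standard terms aggregation response shape.

```python
def flatten_service_errors(response):
    """Flatten by_service -> by_message buckets into rows, largest count first."""
    rows = []
    for service in response["aggregations"]["by_service"]["buckets"]:
        for message in service["by_message"]["buckets"]:
            rows.append((service["key"], message["key"], message["doc_count"]))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Invented response, for illustration only.
    response = {"aggregations": {"by_service": {"buckets": [
        {"key": "api", "doc_count": 130, "by_message": {"buckets": [
            {"key": "upstream timeout", "doc_count": 100},
            {"key": "connection refused", "doc_count": 30},
        ]}},
        {"key": "worker", "doc_count": 45, "by_message": {"buckets": [
            {"key": "job failed", "doc_count": 45},
        ]}},
    ]}}}
    for service, message, count in flatten_service_errors(response):
        print(f"{count:>5}  {service:<8} {message}")
```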

Field Types

Keyword vs Text

keyword: Exact match, aggregatable (service.keyword)
text: Full-text search, not aggregatable (message)

// For aggregation, use .keyword suffix
"terms": {"field": "service.keyword"}

// For full-text search, use text field
"match": {"message": "connection error"}
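
Whether a .keyword sub-field exists depends on the index mapping, so it can help to check before building a terms aggregation. The sketch below takes the "properties" section of a mapping (as returned by the get mapping API) and returns the field name to aggregate on. The function name aggregatable_field is hypothetical, it handles top-level fields only, and the sample mapping is invented.

```python
def aggregatable_field(properties, name):
    """Return the field name to use in a terms aggregation for `name`."""
    spec = properties.get(name)
    if spec is None:
        raise KeyError(f"field {name!r} is not in the mapping")
    if spec.get("type") != "text":
        return name  # keyword, numeric, date, etc. aggregate directly
    # text fields need a keyword sub-field to be usable in terms aggs
    sub = spec.get("fields", {}).get("keyword", {})
    if sub.get("type") == "keyword":
        return f"{name}.keyword"
    raise ValueError(f"text field {name!r} has no keyword sub-field")


if __name__ == "__main__":
    # Invented mapping, for illustration only.
    properties = {
        "service": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
        "level": {"type": "keyword"},
        "message": {"type": "text"},
    }
    print(aggregatable_field(properties, "service"))
    print(aggregatable_field(properties, "level"))
```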

Anti-Patterns to Avoid

❌ NEVER skip statistics - get_statistics.py is MANDATORY first step
❌ Unbounded queries - Always specify time ranges and limits
❌ Fetching all logs - Use sampling strategies, not unbounded searches
❌ Ignoring error rate - High error rate means immediate investigation
❌ Text field in aggregation - Use .keyword suffix for terms aggs
❌ Wildcard prefix - *error is expensive, prefer error* or exact match

Related Skills

log-analysis

metrics-analysis

knowledge-base