# Debug Slow Queries

## Quick Start

Goal: go from "slow" to a verified fix with minimal risk.
## When to use this skill
- p95 latency spikes, timeouts, or sudden CPU growth tied to database load
- a deploy introduces a query regression
- tables grow and plans flip (index → seq scan, join strategy changes)
## Inputs to collect (before changing anything)
- database engine + version
- query text + typical parameters (not placeholders only)
- runtime evidence: p50/p95, rows returned, timeouts, error logs
- table sizes / row counts (approx)
- relevant indexes + constraints
- environment: prod vs staging, read replica vs primary
## Workflow (default)
1. Reproduce or capture: reproduce in staging with production-like data, or capture the exact query + params from logs.
2. Baseline: record current p50/p95 and the explain plan.
3. Read the plan: identify where time is spent (scan, join, sort, aggregate).
4. Pick the cheapest safe fix: rewrite query → add/adjust index → update stats → change schema.
5. Verify: re-run explain, compare timings, and validate correctness.
6. Roll out safely: ship behind a flag if needed, monitor, and keep a rollback path.
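The baseline → fix → verify loop can be sketched end to end with SQLite's `EXPLAIN QUERY PLAN` (SQLite chosen because it is self-contained; the `orders` table and index names here are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
con.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                [(i % 100, i * 1.5) for i in range(10_000)])

query = "SELECT total FROM orders WHERE customer_id = ?"

def plan(q):
    # each EXPLAIN QUERY PLAN row's last column is a human-readable plan step
    return " | ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + q, (42,)))

before = plan(query)   # baseline: expect a full table scan
con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)    # verify: the plan should now seek through the index
```

The same loop applies on other engines; only the plan-fetching command changes.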
## Explain plan patterns (what to look for)
### Scans

- Sequential scan on a large table: likely a missing index on a selective predicate, or the predicate is not sargable.
- Index scan returning many rows: the index exists but selectivity is poor; consider different leading columns or a partial index.
### Joins

- Nested loop with huge inner iterations: the inner side needs an index on the join key, or the join order needs improvement.
- Hash join spilling to disk: work_mem/memory pressure or too many input rows; reduce join input with earlier filters.
### Sorts / aggregates

- Sort on a large result: add an index matching the ORDER BY, or paginate differently.
- GROUP BY over many rows: pre-filter, pre-aggregate, or add a covering index on the grouping columns.
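To illustrate the scan patterns above, SQLite's `EXPLAIN QUERY PLAN` shows how a non-sargable predicate (indexed column wrapped in a function) forces a scan, while the equivalent range predicate can seek (hypothetical `events` table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
con.execute("CREATE INDEX idx_events_created ON events (created_at)")

def plan(q):
    return " | ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + q))

# Non-sargable: the function call hides the column from the index
non_sargable = plan("SELECT id FROM events WHERE date(created_at) = '2024-01-01'")

# Sargable: a range on the raw column can seek into the index
sargable = plan("SELECT id FROM events WHERE created_at >= '2024-01-01' "
                "AND created_at < '2024-01-02'")
```

The same rewrite (function-on-column → range on the raw column) carries over to PostgreSQL and MySQL.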
## Engine-specific commands
### PostgreSQL

Get plan + timing:

```sql
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) <query>;
```

Check table stats freshness:

```sql
SELECT relname, n_live_tup, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
```

See indexes:

```sql
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'public' AND tablename = 'your_table';
```

Safe index creation (builds without blocking writes):

```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_name ON your_table (col1, col2);
```
### MySQL

Get plan:

```sql
EXPLAIN <query>;
```

Get runtime plan details (MySQL 8+):

```sql
EXPLAIN ANALYZE <query>;
```

See indexes:

```sql
SHOW INDEX FROM your_table;
```
### SQLite

Get plan:

```sql
EXPLAIN QUERY PLAN <query>;
```
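From application code, the plan rows can be read programmatically: each SQLite plan row is `(id, parent, notused, detail)`, with `parent` linking nodes into a tree. A minimal sketch with a hypothetical table `t`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INT, b INT)")
con.execute("CREATE INDEX idx_t_a ON t (a)")

# Expect an index search on a, plus a temp b-tree for the unindexed ORDER BY
rows = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1 ORDER BY b").fetchall()
for node_id, parent, _, detail in rows:
    print(f"{node_id} (parent {parent}): {detail}")
```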
## Fix toolbox (ordered by typical safety)
### 1) Query rewrites (often best first)

- Limit early.
- Move filters into subqueries/CTEs only if they reduce rows before joins.
- Avoid non-sargable predicates: don't wrap indexed columns in functions in WHERE clauses.
- Replace `SELECT *` with an explicit column list to reduce IO and sort payload.
- Avoid OFFSET pagination at scale; use keyset pagination when possible.
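The OFFSET-vs-keyset point can be checked directly: both queries below return the same page, but the keyset form seeks past the last key seen instead of walking and discarding skipped rows (hypothetical `items` table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO items (name) VALUES (?)",
                [(f"item-{i}",) for i in range(1, 501)])

page_size = 50

# OFFSET pagination: the engine must produce and discard all skipped rows
offset_page = con.execute(
    "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
    (page_size, 200)).fetchall()

# Keyset pagination: seek directly past the last id from the previous page
last_seen_id = 200
keyset_page = con.execute(
    "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
    (last_seen_id, page_size)).fetchall()
```

Keyset pagination requires a stable, indexed sort key and cannot jump to an arbitrary page number, which is the usual tradeoff.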
### 2) Index changes

Use this checklist:

- Does the WHERE clause filter on a selective column?
- Does the join predicate have an index on the inner side?
- Does ORDER BY match an index prefix?
- Is a composite index needed (leading columns matter)?
- Can a partial index reduce size (PostgreSQL)?
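The leading-column rule from the checklist, illustrated in SQLite with a hypothetical `logs` table: a composite index on `(tenant_id, level)` serves filters on `tenant_id` (or both columns), but not on `level` alone:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (tenant_id INT, level TEXT, msg TEXT)")
con.execute("CREATE INDEX idx_logs_tenant_level ON logs (tenant_id, level)")

def plan(q):
    return " | ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + q))

# A prefix of the index (or the full key) can be used to seek
uses_index = plan("SELECT msg FROM logs WHERE tenant_id = 7 AND level = 'ERROR'")

# Filtering only on the second column cannot seek into the composite index
no_prefix = plan("SELECT msg FROM logs WHERE level = 'ERROR'")
```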
### 3) Statistics / maintenance

- `ANALYZE` / `VACUUM` (PostgreSQL) or `ANALYZE TABLE` / `OPTIMIZE TABLE` (MySQL) can fix bad estimates.
- Confirm cardinality estimates vs actual rows in the EXPLAIN ANALYZE output.
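In SQLite the equivalent maintenance step is `ANALYZE`, which writes per-index statistics into `sqlite_stat1` for the planner's cardinality estimates. A minimal sketch (hypothetical `users` table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT)")
con.execute("CREATE INDEX idx_users_country ON users (country)")
con.executemany("INSERT INTO users (country) VALUES (?)",
                [("US" if i % 10 else "NZ",) for i in range(1000)])

# ANALYZE gathers row-count statistics that drive the planner's estimates
con.execute("ANALYZE")
stats = con.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```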
### 4) Schema-level changes (highest cost)

- Denormalize only after proving query/index fixes are insufficient.
- Consider materialized views / summary tables for heavy aggregates.
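A minimal summary-table sketch (hypothetical `sales` table): the heavy GROUP BY is computed once into a small table, and reads are served from it. Note the summary goes stale and must be refreshed when base data changes, which is the cost of this approach:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(f"2024-01-{d:02d}", d * 10.0)
                 for d in range(1, 11) for _ in range(3)])

# Pre-aggregate once; reads then scan 10 summary rows instead of 30 base rows
con.execute("""
    CREATE TABLE sales_daily AS
    SELECT day, SUM(amount) AS total FROM sales GROUP BY day
""")

live = con.execute(
    "SELECT day, SUM(amount) FROM sales GROUP BY day ORDER BY day").fetchall()
summary = con.execute(
    "SELECT day, total FROM sales_daily ORDER BY day").fetchall()
```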
## Output format (copy/paste)
```markdown
## slow query report
### symptom
- endpoint/job:
- p50/p95:
- query:
### evidence
- explain plan notes:
- row counts:
- indexes involved:
### root cause
- primary bottleneck:
- why it happens:
### fix
- change:
- risk:
- rollback:
### verification
- before:
- after:
- correctness checks:
```