Flaky Test Detective

Hunt down flaky tests — the ones that pass sometimes and fail sometimes with no code change. Analyze CI run history, detect timing-sensitive tests, find shared mutable state, identify race conditions, and generate targeted fixes for each flaky test.

Use when: "find flaky tests", "test keeps failing randomly", "CI is unreliable", "non-deterministic test failures", "test stability", "intermittent failures", or when builds fail but pass on retry.

Commands

1. `detect` — Find Flaky Tests

Step 1: Analyze CI History

# GitHub Actions — get recent test failures
gh run list --limit 30 --json conclusion,headBranch,createdAt,databaseId | \
  python3 -c "
import json, sys
runs = json.load(sys.stdin)
failures = [r for r in runs if r['conclusion'] == 'failure']
retried = [r for r in runs if r['headBranch'] == runs[0]['headBranch'] and r['conclusion'] != failures[0]['conclusion']]
print(f'Total runs: {len(runs)}')
print(f'Failures: {len(failures)} ({len(failures)/len(runs)*100:.0f}%)')
if retried:
    print(f'Flaky signal: same branch has both pass and fail')
"

# Download test results (JUnit XML)
gh run download <RUN_ID> --name test-results 2>/dev/null

Step 2: Statistical Detection

Run the test suite multiple times and track results:

# Run tests N times and collect results
RESULTS_FILE="/tmp/flaky-results.json"
echo '[]' > "$RESULTS_FILE"

for i in $(seq 1 5); do
  echo "=== Run $i/5 ==="
  # Capture per-test results (adjust for your framework)
  npm test -- --json 2>/dev/null | python3 -c "
import json, sys
try:
    data = json.load(sys.stdin)
    for suite in data.get('testResults', []):
        for test in suite.get('testResults', []):
            print(f'{test[\"status\"]}\t{test[\"fullName\"]}')
except: pass
" >> "/tmp/run-$i.txt"
done

# Find tests with inconsistent results across runs
python3 -c "
import os, collections
results = collections.defaultdict(list)
for i in range(1, 6):
    path = f'/tmp/run-{i}.txt'
    if os.path.exists(path):
        for line in open(path):
            parts = line.strip().split('\t', 1)
            if len(parts) == 2:
                results[parts[1]].append(parts[0])

for test, outcomes in sorted(results.items()):
    unique = set(outcomes)
    if len(unique) > 1:
        pass_rate = outcomes.count('passed') / len(outcomes) * 100
        print(f'🎯 FLAKY ({pass_rate:.0f}% pass rate): {test}')
        print(f'   Outcomes: {\" → \".join(outcomes)}')
"

Step 3: Pattern Analysis

For each flaky test, analyze the failure to classify the root cause:

Timing-dependent:

# Check for setTimeout, sleep, waitFor with hardcoded timeouts
rg "setTimeout|sleep|waitFor|delay|\.timeout" --type ts --type js -g '*test*' -g '*spec*' 2>/dev/null

Shared state:

# Check for global variables, singletons, shared fixtures
rg "beforeAll|before\(|global\.|singleton|shared" --type ts --type js -g '*test*' -g '*spec*' 2>/dev/null

Test order dependency:

# Run tests in random order
npm test -- --randomize 2>/dev/null
pytest --randomly 2>/dev/null
go test -shuffle=on ./... 2>/dev/null

Environment dependency:

# Check for hardcoded ports, paths, dates, timezones
rg "localhost:[0-9]+|/tmp/|/var/|new Date\(\)|Date\.now\(\)" --type ts --type js -g '*test*' 2>/dev/null

Network dependency:

# Check for real HTTP calls in tests
rg "fetch\(|axios\.|http\.get|requests\.(get|post)" --type ts --type js --type py -g '*test*' 2>/dev/null

Step 4: Generate Report

# Flaky Test Report

## Summary
- Tests analyzed: 342
- Flaky tests found: 7
- Test suite reliability: 98% (target: 99.9%)

## Flaky Tests

### 1. `UserService.test.ts` — "should send welcome email"
- **Pass rate:** 60% (3/5 runs)
- **Root cause:** Timing — test waits 100ms but email service sometimes takes 200ms
- **Fix:** Replace `setTimeout(100)` with `waitFor(() => expect(emailSent).toBe(true))`
- **Category:** ⏱️ Timing

### 2. `OrderController.test.ts` — "should calculate total"
- **Pass rate:** 80% (4/5 runs)
- **Root cause:** Shared state — previous test modifies global discount config
- **Fix:** Reset `DiscountConfig` in `beforeEach()` or isolate with `jest.isolateModules()`
- **Category:** 🔗 Shared State

### 3. `DateFormatter.test.ts` — "should format date correctly"
- **Pass rate:** 40% (2/5 runs)
- **Root cause:** Timezone — test assumes UTC but CI runs in different timezone
- **Fix:** Use `new Date('2024-01-15T00:00:00Z')` instead of `new Date('2024-01-15')`
- **Category:** 🌐 Environment

2. `fix` — Generate Fixes for Flaky Tests

For each root cause category, apply the appropriate fix pattern:

Timing: Replace fixed delays with polling/retry assertions
Shared state: Add setup/teardown, use test isolation
Order dependency: Make tests independent, reset state
Network: Mock external calls, use test fixtures
Environment: Pin timezone, use deterministic dates, avoid temp paths

3. `quarantine` — Manage Flaky Test Quarantine

Move known-flaky tests to a quarantine suite that runs separately:

Tag with @flaky or skip with documentation
Track quarantined tests in a manifest file
Alert when quarantine grows beyond threshold
Automatically un-quarantine after fix is merged and verified stable (5 consecutive passes)