writing-quality-tests

Guide for writing robust, high-signal automated tests. Use when asking "How do I test this effectively?", fixing flaky tests, refactoring test suites, or deciding between unit/integration/E2E strategies. Distinct from TDD, which is a process; this skill focuses on test quality and architecture.


Writing Quality Tests

Overview

High-signal tests prove behavior, not implementation. While TDD focuses on the process of writing tests first, this skill focuses on the artifact—making tests stable, explicit, and valuable long-term.

Core rule: If a test is nondeterministic or tied to internals, it is debt. Fix it.

When to Use

  • New features: "I need to add tests for this new API endpoint/component."
  • Bug fixes: "Help me write a regression test for this bug before fixing it."
  • Flaky tests: "This test fails randomly on CI. How do I make it deterministic?"
  • Refactoring: "I want to refactor this legacy code but the tests are brittle. How do I improve them first?"
  • Slow tests: "The test suite takes too long. How can I speed it up or mock dependencies effectively?"
  • Test Design: "Should I use a unit test or integration test for this logic?"
  • Review: "Check these tests for maintainability, coverage, and clarity."
  • Not for manual exploratory testing or load/perf-only work.

Non-Negotiables

  • Deterministic: same input -> same result; no hidden time/network randomness
  • Behavioral oracles: assertions map to business behavior or contracts, never incidental internals
  • Minimal coupling: tests fail for product changes, not helper refactors
  • Focused scope: one behavior per test; isolated fixtures; clear names
  • Fast feedback: prefer fast layers; cache expensive setup; parallelize safely
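The determinism rule above usually comes down to injecting sources of nondeterminism rather than reaching for them globally. A minimal sketch, using a hypothetical `TokenIssuer` (the class and its names are illustration only, not a real library):

```python
from datetime import datetime, timezone

# Hypothetical service. Production code would default the clock to
# datetime.now; tests inject a frozen clock so same input -> same result.
class TokenIssuer:
    def __init__(self, clock):
        self._clock = clock  # callable returning an aware datetime

    def issue(self, user_id):
        now = self._clock()
        return {"user": user_id, "issued_at": now.isoformat()}

def test_issue_uses_injected_clock():
    fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
    issuer = TokenIssuer(clock=lambda: fixed)
    token = issuer.issue("u1")
    # Behavioral oracle: the observable timestamp, not internal state
    assert token["issued_at"] == "2024-01-02T03:04:05+00:00"
```

The same injection pattern applies to randomness (pass a seeded `random.Random`) and network clients (pass a stub at the boundary).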

Workflow

  1. Prove it fails: watch the test fail (or reproduce the bug) before any code changes, so you know it exercises the behavior in question.
  2. Clarify behavior: preconditions, action, postconditions, invariants. If fixing a bug, capture the regression input.
  3. Pick level: unit for pure logic; contract for external calls; integration for seams; E2E only to prove flows or contracts end-to-end.
  4. Design oracle: assert outputs, state, events, and invariants; avoid implementation details or transient UI.
  5. Shape fixtures: use builders/factories; avoid globals; randomize with seeds only when helpful and log the seed.
  6. Write the test: AAA (arrange-act-assert) or GWT; table-driven for variants; property-based for algebraic invariants.
  7. Validate: run focused test first, then suite. If flaky, hunt nondeterminism (time, randomness, order, network) and remove it.
  8. Document intent: name states behavior; failure message points to the expected contract.
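Steps 6 and 8 above can be sketched together with an AAA-structured test. The `Cart` class here is hypothetical, introduced only to make the shape concrete:

```python
# Hypothetical Cart used only for illustration; not from a real library.
class Cart:
    def __init__(self):
        self._items = {}

    def add(self, sku, qty=1):
        if qty <= 0:
            raise ValueError("qty must be positive")
        self._items[sku] = self._items.get(sku, 0) + qty

    def quantity(self, sku):
        return self._items.get(sku, 0)

# Name states the behavior; structure is Arrange-Act-Assert.
def test_add_item_increases_quantity():
    # Arrange
    cart = Cart()
    # Act
    cart.add("sku-1", qty=2)
    # Assert: behavioral oracle -- observable quantity, not the internal dict
    assert cart.quantity("sku-1") == 2, "adding 2 of a sku should yield quantity 2"
```

Note the assertion message names the expected contract, so a failure points at the behavior rather than at an implementation detail.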

Patterns to Prefer

  • Boundary and mutation pairs: min/max/empty/null plus one mutated variation to prove invariants.
  • Table-driven cases: enumerate input/output pairs to avoid duplicate tests and improve diffability.
  • Property-based checks: algebraic properties (idempotence, reversibility, ordering), round-trips, monotonic counters.
  • Contracts at seams: mock at boundaries you own; for third-party calls, pin to contract tests or recorded fixtures.
  • Guarded goldens: only for complex structured output; require explicit review of golden updates.
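The table-driven pattern above can be sketched without any framework machinery; the `clamp` function is a hypothetical subject under test:

```python
# Hypothetical function under test.
def clamp(value, low, high):
    return max(low, min(value, high))

# Table-driven cases: each row is (value, low, high, expected, description).
# Boundary pairs (min/max/exact edges) live in one diffable place.
CASES = [
    (5, 0, 10, 5, "inside range"),
    (-1, 0, 10, 0, "below min"),
    (11, 0, 10, 10, "above max"),
    (0, 0, 10, 0, "exact lower boundary"),
    (10, 0, 10, 10, "exact upper boundary"),
]

def test_clamp_table():
    for value, low, high, expected, desc in CASES:
        # The description surfaces in the failure message, naming the row.
        assert clamp(value, low, high) == expected, desc
```

With pytest, the same table fits `@pytest.mark.parametrize`, which reports each row as its own test case.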

Coverage Strategy

  • Coverage is opt-in: never run coverage unless explicitly requested by the user in the current session (e.g., "improve coverage on file X to Y%"). PM/teammate/CI pressure does not override this rule.
  • Pyramid discipline: many unit tests, fewer integration, very few E2E. Use E2E to prove cross-service flow or UI contract.
  • Change-based coverage: every test should fail without the code change and pass with it; capture the regression input/output.
  • Critical paths first: auth, billing, migrations, data loss, irreversible actions. Add invariants that must never be violated.
  • Data and time: cover time zones, DST, leap years, ordering, pagination, idempotency, and retry semantics.
  • Observability: log seeds for randomized tests; emit diagnostics on failure (inputs, seed, environment versions).
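As one concrete instance of the "data and time" bullet, a DST boundary makes a good explicit case. This sketch uses the stdlib `zoneinfo` module and the documented US spring-forward date in 2024; adapt the zone and date to the transitions your code must survive:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# US Eastern "springs forward" on 2024-03-10: 02:00 local jumps to 03:00,
# so one elapsed hour across the gap changes the UTC offset.
def test_dst_spring_forward_changes_offset():
    tz = ZoneInfo("America/New_York")
    before = datetime(2024, 3, 10, 1, 30, tzinfo=tz)  # EST, UTC-5
    one_hour_later = (
        before.astimezone(timezone.utc) + timedelta(hours=1)
    ).astimezone(tz)
    assert before.utcoffset() == timedelta(hours=-5)
    assert one_hour_later.utcoffset() == timedelta(hours=-4)
```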

Example (explicit coverage request): User: "improve coverage on file X to 80%". Run targeted coverage for that file only, add behavior-driven tests to hit missing branches, and avoid coverage runs outside that request.

pytest --cov=path/to/file.py --cov-report=term-missing

Flake Prevention

  • Remove time races: replace sleeps with waits on explicit conditions; freeze or inject clocks.
  • Isolate state: fresh fixtures per test; unique temp dirs/ports; clean databases; no shared singletons.
  • Control randomness: seed RNG, capture seed in failure output, prefer deterministic builders.
  • Network and IO: stub external calls; if unavoidable, record/replay; set tight timeouts and retries with jitter disabled in tests.
  • Parallel safety: ensure fixtures are parallel-safe or mark tests serial; avoid global mutable state.
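Replacing sleeps with waits on explicit conditions can be as small as a polling helper. This is a minimal sketch; when your framework already ships an equivalent wait utility, prefer that over rolling your own:

```python
import time

def wait_until(predicate, timeout=2.0, interval=0.01):
    """Poll predicate until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    raise TimeoutError("condition not met within timeout")

def test_wait_until_returns_when_condition_holds():
    calls = {"n": 0}

    def becomes_true():
        calls["n"] += 1
        return calls["n"] >= 3  # true on the third poll

    assert wait_until(becomes_true, timeout=1.0)
```

Unlike a fixed sleep, the wait ends as soon as the condition holds, so the test is both faster on good runs and more tolerant of slow CI machines.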

Review Checklist

  • Name states behavior and level (e.g., "adds item to cart (integration)").
  • Single reason to fail; assertions map to user-visible behavior or contract.
  • Fixtures minimal and local; builders hide irrelevant details; no shared hidden state.
  • Negative and edge cases present; regression case for the original bug captured.
  • Tests run quickly; slow/expensive flows justified and focused.

Hygiene (adaptable patterns)

  • Structure: Given–When–Then or AAA so intent is obvious.
  • Hypothesis: fix generators or code instead of suppressing health checks; log seeds for repro.
  • Async correctness: use real async paths/fakes; don’t hide missing awaits with sync doubles.
  • Assertion scope: assert behavior/contract fields; avoid brittle full-payload snapshots unless testing a contract.
  • Coverage as health, not blocker: focus on low-coverage behavior-heavy files; be pragmatic with legacy or infra-heavy areas.
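Seed logging for repro, mentioned twice above, can be sketched with a small helper. The `SEED` environment variable name and the `build_user` builder are assumptions for illustration, not a convention of any framework:

```python
import os
import random

def seeded_rng():
    """Seed once per run; print the seed so failures can be replayed."""
    seed = int(os.environ.get("SEED", random.randrange(2**32)))
    print(f"test RNG seed: {seed}")  # captured in failure output
    return random.Random(seed)

def build_user(rng):
    # Deterministic builder: same seed -> same user
    return {"id": rng.randrange(10**6), "name": f"user-{rng.randrange(10**4)}"}

def test_builder_is_deterministic_for_a_seed():
    assert build_user(random.Random(42)) == build_user(random.Random(42))
```

To replay a failure, rerun with `SEED=<printed value>`. Hypothesis offers the same idea natively via its reported `--hypothesis-seed`.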

Marks (for selective runs)

  • unit: isolated logic with external deps mocked
  • contract/integration: cross-component seams with real wiring or adapters
  • async: true async paths; avoid sync fakes masking awaits
  • property: Hypothesis-based invariants in dedicated property files
  • slow: >1s or real infra; justify and keep focused
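In pytest, the marks above map onto `pytest.mark` decorators so suites can be sliced with `-m` (e.g. `pytest -m unit` or `pytest -m "not slow"`). A sketch, with an import fallback only so the snippet stays self-contained; register the marker names in your pytest config to avoid unknown-mark warnings:

```python
# Assumes pytest; the fallback below is illustration-only so this snippet
# can be imported even where pytest is absent.
try:
    import pytest
    unit = pytest.mark.unit
    slow = pytest.mark.slow
except ImportError:
    unit = slow = lambda f: f  # no-op decorators for illustration

@unit
def test_pure_logic():
    assert sorted([3, 1, 2]) == [1, 2, 3]
```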

Common Anti-Patterns

  • Brittle UI or text snapshots without intent; prefer semantic assertions or scoped snapshots.
  • Over-mocking internals; mocking within the module under test; asserting call order that is not part of the contract.
  • Sleep-based waits; reliance on wall-clock time; unseeded randomness.
  • Combined scenarios covering multiple behaviors in one test; global fixtures that hide setup.
  • Golden files updated blindly; tests that assert logging implementation rather than outcomes.
  • Running coverage by default instead of waiting for explicit coverage requests.

Red Flags - Stop and Fix

  • Tests pass or fail intermittently
  • Assertions tied to private methods or call order instead of observable behavior
  • Unseeded randomness, sleeps instead of explicit waits, or shared mutable fixtures
  • Golden updates accepted without review of intent
  • A test never failed before the code change
  • Running coverage without the user explicitly asking
  • Running coverage due to PM/teammate/CI pressure
