JSON Data File Validation Test Design
Extracted: 2026-02-11
Context: Validating a large JSON data file (exam questions) generated by a build script against its schema, source data, and business rules
Problem
JSON data files generated by scripts (from text, CSV, API, etc.) can contain subtle issues:
- Stray characters from OCR/copy-paste (e.g., ß mixed into Japanese text)
- Schema violations that the app silently swallows
- Cross-reference mismatches (source data vs. generated output)
- Missing or duplicate entries
- Business rule violations (e.g., correct answer not in choices)
Manual review of large files (60+ entries, 3000+ lines) is unreliable.
Solution: Layered Pytest Validation
Structure tests in layers from structural to semantic:
Layer 1: Top-level structure
class TestTopLevelStructure:
    def test_required_fields(self, data):
        ...

    def test_count_matches(self, data):
        assert data["totalItems"] == len(data["items"])
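The data and entries fixtures used throughout the layers are assumed to be module-scoped loaders. A minimal sketch (the file path and fixture names here are hypothetical; the loading logic is split into a plain helper so it can be exercised outside pytest):

```python
import json

import pytest


def load_json(path):
    # Plain helper: parse a JSON file as UTF-8 and return the data
    with open(path, encoding="utf-8") as f:
        return json.load(f)


@pytest.fixture(scope="module")
def data():
    # Parsed once per test module, shared by every layer below
    return load_json("data/questions.json")  # hypothetical path


@pytest.fixture(scope="module")
def entries(data):
    # Convenience fixture for the per-entry layers
    return data["items"]
```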
Layer 2: Per-entry schema validation
class TestEntryFields:
    def test_required_fields(self, entries):
        for e in entries:
            missing = REQUIRED - e.keys()
            assert not missing, f"Entry {e['id']}: missing {missing}"

    def test_enum_values(self, entries):
        for e in entries:
            assert e["type"] in VALID_TYPES
Layer 3: Cross-entry consistency
class TestConsistency:
    def test_no_duplicates(self, entries):
        ids = [e["id"] for e in entries]
        assert len(ids) == len(set(ids))

    def test_references_resolve(self, entries, categories):
        # Every entry's category exists in the categories list
        ...
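The duplicate check above fails without naming the offending ids. A small hypothetical helper that reports them, using collections.Counter:

```python
from collections import Counter


def duplicate_ids(entries):
    # Return every id that occurs more than once, sorted for stable output
    counts = Counter(e["id"] for e in entries)
    return sorted(id_ for id_, n in counts.items() if n > 1)
```

A test can then assert not duplicate_ids(entries), so the failure message lists the actual culprits rather than just a length mismatch.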
Layer 4: Source cross-reference
class TestSourceCrossReference:
    @pytest.fixture
    def source_data(self):
        # Parse original source files
        ...

    def test_values_match_source(self, entries, source_data):
        mismatches = []
        for e in entries:
            if e["answer"] != source_data[e["id"]]:
                mismatches.append(...)
        assert not mismatches, f"{len(mismatches)} mismatches"
Layer 5: Content quality heuristics
class TestContentQuality:
    def test_min_text_length(self, entries):
        for e in entries:
            assert len(e["text"]) >= THRESHOLD

    def test_no_stray_characters(self, entries):
        stray = {"ß", "€", "£"}  # Characters unlikely in this domain
        issues = []
        for e in entries:
            for ch in stray:
                if ch in e["text"]:
                    issues.append(f"{e['id']}: '{ch}'")
        assert not issues
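The fixed stray-character set only catches characters you already anticipated. A broader variant (a sketch, assuming the expected text is ASCII plus Japanese scripts; the prefix list would need tuning per domain) flags any character whose Unicode name falls outside an allow-list of script-name prefixes:

```python
import unicodedata

# Unicode name prefixes for scripts expected in this (assumed Japanese) domain
ALLOWED_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "IDEOGRAPHIC",
                    "FULLWIDTH", "HALFWIDTH")


def stray_chars(text):
    # Flag characters that are neither ASCII nor in an expected script
    flagged = []
    for ch in text:
        if ch.isascii():
            continue
        name = unicodedata.name(ch, "")
        if not name.startswith(ALLOWED_PREFIXES):
            flagged.append(ch)
    return flagged
```

This would catch a ß or € that slipped in via OCR without having to enumerate every possible stray character up front.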
Key Design Decisions
- Module-scoped fixtures for the parsed JSON (scope="module") to avoid re-reading the file for every test
- Collect-all-errors pattern: accumulate issues in a list and assert at the end, so a single test run reports every problem
- Graceful degradation: source cross-reference tests call pytest.skip() when the source files are absent
- Domain-aware thresholds: the minimum text length depends on the domain (e.g., 2 characters for Japanese terms like "過学習")
When to Use
- After generating or rebuilding JSON data files from external sources
- As a CI gate for data files that feed into apps
- When a data file is too large for manual review
- When data is parsed from inconsistent sources (OCR, PDF export, manual entry)