JSON Data File Validation Test Design
Extracted: 2026-02-11
Context: Validating a large JSON data file (exam questions) generated by a build script against its schema, source data, and business rules
Problem
JSON data files generated by scripts (from text, CSV, API, etc.) can contain subtle issues:
- Stray characters from OCR/copy-paste (e.g., ß mixed into Japanese text)
- Schema violations that the app silently swallows
- Cross-reference mismatches (source data vs. generated output)
- Missing or duplicate entries
- Business rule violations (e.g., correct answer not in choices)
Manual review of large files (60+ entries, 3000+ lines) is unreliable.
Solution: Layered Pytest Validation
Structure tests in layers from structural to semantic:
Layer 1: Top-level structure
class TestTopLevelStructure:
    def test_required_fields(self, data):
        ...

    def test_count_matches(self, data):
        assert data["totalItems"] == len(data["items"])
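The data and entries fixtures used throughout the layers are assumed to be module-scoped loaders. A minimal sketch (the file path and fixture names here are hypothetical; the loading logic is split into a plain helper so it can be exercised outside pytest):

```python
import json

import pytest


def load_json(path):
    # Plain helper: parse a JSON file as UTF-8 and return the data
    with open(path, encoding="utf-8") as f:
        return json.load(f)


@pytest.fixture(scope="module")
def data():
    # Parsed once per test module, shared by every layer below
    return load_json("data/questions.json")  # hypothetical path


@pytest.fixture(scope="module")
def entries(data):
    # Convenience fixture for the per-entry layers
    return data["items"]
```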
Layer 2: Per-entry schema validation
class TestEntryFields:
    def test_required_fields(self, entries):
        for e in entries:
            missing = REQUIRED - e.keys()
            assert not missing, f"Entry {e['id']}: missing {missing}"

    def test_enum_values(self, entries):
        for e in entries:
            assert e["type"] in VALID_TYPES
Layer 3: Cross-entry consistency
class TestConsistency:
    def test_no_duplicates(self, entries):
        ids = [e["id"] for e in entries]
        assert len(ids) == len(set(ids))

    def test_references_resolve(self, entries, categories):
        # Every entry's category exists in the categories list
        ...
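The duplicate check above fails without naming the offending ids. A small hypothetical helper that reports them, using collections.Counter:

```python
from collections import Counter


def duplicate_ids(entries):
    # Return every id that occurs more than once, sorted for stable output
    counts = Counter(e["id"] for e in entries)
    return sorted(id_ for id_, n in counts.items() if n > 1)
```

A test can then assert not duplicate_ids(entries), so the failure message lists the actual culprits rather than just a length mismatch.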
Layer 4: Source cross-reference
class TestSourceCrossReference:
    @pytest.fixture
    def source_data(self):
        # Parse original source files
        ...

    def test_values_match_source(self, entries, source_data):
        mismatches = []
        for e in entries:
            if e["answer"] != source_data[e["id"]]:
                mismatches.append(...)
        assert not mismatches, f"{len(mismatches)} mismatches"
Layer 5: Content quality heuristics
class TestContentQuality:
    def test_min_text_length(self, entries):
        for e in entries:
            assert len(e["text"]) >= THRESHOLD

    def test_no_stray_characters(self, entries):
        stray = {"ß", "€", "£"}  # Characters unlikely in this domain
        issues = []
        for e in entries:
            for ch in stray:
                if ch in e["text"]:
                    issues.append(f"{e['id']}: '{ch}'")
        assert not issues
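The fixed stray-character set only catches characters you already anticipated. A broader variant (a sketch, assuming the expected text is ASCII plus Japanese scripts; the prefix list would need tuning per domain) flags any character whose Unicode name falls outside an allow-list of script-name prefixes:

```python
import unicodedata

# Unicode name prefixes for scripts expected in this (assumed Japanese) domain
ALLOWED_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "IDEOGRAPHIC",
                    "FULLWIDTH", "HALFWIDTH")


def stray_chars(text):
    # Flag characters that are neither ASCII nor in an expected script
    flagged = []
    for ch in text:
        if ch.isascii():
            continue
        name = unicodedata.name(ch, "")
        if not name.startswith(ALLOWED_PREFIXES):
            flagged.append(ch)
    return flagged
```

This would catch a ß or € that slipped in via OCR without having to enumerate every possible stray character up front.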
Key Design Decisions
- Module-scoped fixtures for the parsed JSON (scope="module") to avoid re-reading the file for every test
- Collect-all-errors pattern: accumulate issues in a list and assert at the end, so a single test run reports every problem
- Graceful degradation: source cross-reference tests call pytest.skip() when the source files are absent
- Domain-aware thresholds: the minimum text length depends on the domain (e.g., 2 characters for Japanese terms like "過学習")
When to Use
- After generating or rebuilding JSON data files from external sources
- As a CI gate for data files that feed into apps
- When a data file is too large for manual review
- When data is parsed from inconsistent sources (OCR, PDF export, manual entry)