Golden Dataset
Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in rules/ that are loaded on demand.
Quick Reference
| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Curation | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| Management | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| Validation | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| Add Workflow | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |
Total: 10 rules across 4 categories
Curation
Content collection, multi-agent annotation, and diversity analysis for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Collection | rules/curation-collection.md | Content type classification, quality thresholds, duplicate prevention |
| Annotation | rules/curation-annotation.md | Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | rules/curation-diversity.md | Difficulty stratification, domain coverage, balance guidelines |
Management
Versioning, storage, and CI/CD automation for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Versioning | rules/management-versioning.md | JSON backup format, embedding regeneration, disaster recovery |
| Storage | rules/management-storage.md | Backup strategies, URL contract, data integrity checks |
| CI Integration | rules/management-ci.md | GitHub Actions automation, pre-deployment validation, weekly backups |
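The backup pattern this skill recommends (JSON that is version controlled, with embeddings excluded and regenerated on restore) might look roughly like the sketch below. The `embedding` field name, the top-level `documents` key, and both function names are assumptions made for illustration; the actual format is defined in rules/management-versioning.md.

```python
import json
from typing import Any

# Hypothetical field name for the stored vector; adjust to the real schema.
EMBEDDING_FIELD = "embedding"


def to_backup_json(documents: list[dict[str, Any]]) -> str:
    """Serialize documents to stable JSON, leaving embedding vectors out."""
    stripped = [
        {key: value for key, value in doc.items() if key != EMBEDDING_FIELD}
        for doc in documents
    ]
    # sort_keys and a fixed indent keep diffs small under version control.
    return json.dumps(
        {"documents": stripped}, indent=2, sort_keys=True, ensure_ascii=False
    )


def from_backup_json(payload: str) -> list[dict[str, Any]]:
    """Load documents from a backup; embeddings must be regenerated afterwards."""
    return json.loads(payload)["documents"]
```

Restored documents come back without vectors, so a restore procedure built on this sketch would still need a separate embedding regeneration step before the dataset is usable for retrieval.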
Validation
Quality scoring, drift detection, and regression testing for golden datasets.
| Rule | File | Key Pattern |
|---|---|---|
| Quality | rules/validation-quality.md | Schema validation, content quality, referential integrity |
| Drift | rules/validation-drift.md | Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | rules/validation-regression.md | Difficulty distribution, pre-commit hooks, full dataset validation |
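As an illustration of duplicate detection by semantic similarity, the sketch below compares a candidate embedding against existing ones and applies the thresholds from Key Decisions (0.90 blocks, 0.85 warns). Cosine similarity is assumed as the metric, and `classify_duplicate` and its return labels are hypothetical names; the rule file may define the check differently.

```python
import math
from typing import Sequence

BLOCK_THRESHOLD = 0.90  # at or above: treat as a duplicate and block
WARN_THRESHOLD = 0.85   # at or above: allow, but flag for review


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_duplicate(
    candidate: Sequence[float], existing: Sequence[Sequence[float]]
) -> tuple[str, float]:
    """Return ("block" | "warn" | "ok", highest similarity to any existing vector)."""
    best = max((cosine_similarity(candidate, e) for e in existing), default=0.0)
    if best >= BLOCK_THRESHOLD:
        return "block", best
    if best >= WARN_THRESHOLD:
        return "warn", best
    return "ok", best
```

This pure-Python version scans every existing vector, which is fine for small datasets; a larger dataset would more likely delegate the nearest-neighbor lookup to the vector store.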
Add Workflow
Structured workflow for adding new documents to the golden dataset.
| Rule | File | Key Pattern |
|---|---|---|
| Add Document | rules/curation-add-workflow.md | 9-phase curation, parallel quality analysis, bias detection |
Quick Start Example
from app.shared.services.embeddings import embed_text
async def validate_before_add(document: dict, source_url_map: dict) -> dict:
    """Pre-addition validation for golden dataset entries."""
    errors = []

    # 1. URL contract check
    if "placeholder" in document.get("source_url", ""):
        errors.append("URL must be canonical, not a placeholder")

    # 2. Content quality
    if len(document.get("title", "")) < 10:
        errors.append("Title too short (min 10 chars)")

    # 3. Tag requirements
    if len(document.get("tags", [])) < 2:
        errors.append("At least 2 domain tags required")

    return {"valid": len(errors) == 0, "errors": errors}
Key Decisions
| Decision | Recommendation |
|---|---|
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | >= 0.90 similarity blocks, >= 0.85 warns |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Trivial 3, Easy 3, Medium 5, Hard 3 minimum |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |
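The numeric decisions above can be encoded as simple checks. The sketch below is one possible reading, not the skill's own implementation: the function names, the lowercase difficulty labels, and the `reject` and `manual_review` outcomes are assumptions (the source only states the 0.70 inclusion threshold, the 0.65 auto-include threshold, and the per-difficulty minimums).

```python
from collections import Counter

MIN_QUALITY = 0.70     # minimum quality score for inclusion
MIN_CONFIDENCE = 0.65  # minimum confidence for auto-include
MIN_PER_DIFFICULTY = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}


def inclusion_decision(quality: float, confidence: float) -> str:
    """Map scores to an outcome; "reject" and "manual_review" are assumed labels."""
    if quality < MIN_QUALITY:
        return "reject"
    if confidence >= MIN_CONFIDENCE:
        return "auto_include"
    return "manual_review"


def difficulty_shortfalls(difficulties: list[str]) -> dict[str, int]:
    """Return how many more queries each difficulty level needs to reach its minimum."""
    counts = Counter(level.lower() for level in difficulties)
    return {
        level: minimum - counts[level]
        for level, minimum in MIN_PER_DIFFICULTY.items()
        if counts[level] < minimum
    }
```

An empty dictionary from `difficulty_shortfalls` means every level meets its minimum; any entries show which levels need more test queries and how many.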
Common Mistakes
- Using placeholder URLs instead of canonical source URLs
- Skipping embedding regeneration after restore
- Not validating referential integrity between documents and queries
- Over-indexing on articles (neglecting tutorials, research papers)
- Missing difficulty distribution balance in test queries
- Not running verification after backup/restore operations
- Testing restore procedures in production instead of staging
- Committing SQL dumps instead of JSON (not version-control friendly)
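The referential integrity mistake in this list can be guarded against with a check along these lines. It is a minimal sketch assuming each document has an `id` and each query has an `id` plus an `expected_document_ids` list; those field names and the function name are hypothetical, while the minimum of 3 test queries per document comes from Key Decisions.

```python
from typing import Any


def check_referential_integrity(
    documents: list[dict[str, Any]],
    queries: list[dict[str, Any]],
    min_queries_per_document: int = 3,
) -> list[str]:
    """Report queries that point at unknown documents and documents with too few queries."""
    errors: list[str] = []
    document_ids = {doc["id"] for doc in documents}
    queries_per_document = {doc_id: 0 for doc_id in document_ids}

    for query in queries:
        for doc_id in query.get("expected_document_ids", []):
            if doc_id not in document_ids:
                errors.append(
                    f"Query {query['id']} references unknown document {doc_id}"
                )
            else:
                queries_per_document[doc_id] += 1

    # Sorted so the report is deterministic across runs.
    for doc_id in sorted(queries_per_document):
        count = queries_per_document[doc_id]
        if count < min_queries_per_document:
            errors.append(
                f"Document {doc_id} has {count} test queries "
                f"(min {min_queries_per_document})"
            )
    return errors
```

Running a check like this after every restore would also cover the "not running verification after backup/restore operations" item.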
Evaluations
See test-cases.json for 9 test cases across all categories.
Related Skills
- ork:rag-retrieval: Retrieval evaluation using golden dataset
- langfuse-observability: Tracing patterns for curation workflows
- ork:testing-unit: Unit testing patterns and strategies
- ai-native-development: Embedding generation for restore
Capability Details
curation
Keywords: golden dataset, curation, content collection, annotation, quality criteria
Solves:
- Classify document content types for golden dataset
- Run multi-agent quality analysis pipelines
- Generate test queries for new documents
management
Keywords: golden dataset, backup, restore, versioning, disaster recovery
Solves:
- Backup and restore golden datasets with JSON
- Regenerate embeddings after restore
- Automate backups with CI/CD
validation
Keywords: golden dataset, validation, schema, duplicate detection, quality metrics
Solves:
- Validate entries against document schema
- Detect duplicate or near-duplicate entries
- Analyze dataset coverage and distribution gaps