# Data Quality Validator

Toolkit for validating and profiling tabular data quality.
## Features
- Schema validation - Check column types, constraints, and rules
- Completeness analysis - Missing value detection and reporting
- Anomaly detection - Statistical outlier detection
- Profiling - Summary statistics and distribution analysis
- Constraint checking - Range checks, uniqueness, regex patterns
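Completeness analysis of the kind listed above can be sketched with plain pandas. The snippet below is an illustrative example (the frame and column names are made up), not the toolkit's internal implementation:

```python
import pandas as pd

# Hypothetical example frame; the toolkit operates on any pandas DataFrame.
df = pd.DataFrame({
    "age": [34, None, 29, 41],
    "email": ["a@x.com", "b@y.org", None, None],
})

# Fraction of missing values per column, the core of a completeness report.
missing_ratio = df.isna().mean()
print(missing_ratio)
```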
## Quick Start
```python
from scripts.data_profiler import DataProfiler
from scripts.schema_validator import SchemaValidator

# Profile a dataset
profiler = DataProfiler()
report = profiler.profile(df)  # pandas DataFrame
print(report["missing"])
print(report["outliers"])

# Validate against schema
schema = {
    "age": {"type": "int", "min": 0, "max": 150},
    "email": {"type": "str", "regex": r"^\S+@\S+\.\S+$"},
    "id": {"type": "int", "unique": True},
}

validator = SchemaValidator(schema)
errors = validator.validate(df)
for err in errors:
    print(err)
```
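Validation of this kind can be approximated in a few lines of pandas. The helper below is a hypothetical sketch of how range, regex, and uniqueness rules might be checked; it is not `SchemaValidator`'s actual code:

```python
import re
import pandas as pd

def check_column(series: pd.Series, rules: dict) -> list:
    """Return human-readable violations of range/regex/uniqueness rules."""
    errors = []
    values = series.dropna()
    if "min" in rules and (values < rules["min"]).any():
        errors.append(f"{series.name}: value below min {rules['min']}")
    if "max" in rules and (values > rules["max"]).any():
        errors.append(f"{series.name}: value above max {rules['max']}")
    if "regex" in rules:
        pattern = re.compile(rules["regex"])
        if values.astype(str).map(lambda v: not pattern.match(v)).any():
            errors.append(f"{series.name}: values not matching regex")
    if rules.get("unique") and series.duplicated().any():
        errors.append(f"{series.name}: duplicate values found")
    return errors

# Illustrative data with deliberate violations.
df = pd.DataFrame({"age": [25, 200], "id": [1, 1]})
print(check_column(df["age"], {"min": 0, "max": 150}))
print(check_column(df["id"], {"unique": True}))
```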
## Scripts

- `scripts/data_profiler.py` - Dataset profiling and summary stats
- `scripts/schema_validator.py` - Schema-based validation engine
- `scripts/anomaly_detector.py` - Statistical anomaly detection
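A common statistical outlier technique behind tools like the anomaly detector is the z-score rule: flag values far from the mean in standard-deviation units. The snippet below is an illustrative sketch of that technique, not necessarily the script's actual logic:

```python
import pandas as pd

def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return values more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / series.std()
    return series[z.abs() > threshold]

# One extreme value among otherwise tightly clustered points.
s = pd.Series([10, 11, 9, 10, 12, 10, 11, 300])
print(zscore_outliers(s, threshold=2.0))
```

Note that a single extreme value inflates the standard deviation, which is why a looser threshold is used here; robust variants swap in the median and MAD.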
## References

- `references/validation_rules.md` - Common validation patterns