# ML Training Validation Scripts

**Purpose:** Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.
**Activation Triggers:**

- Validating training datasets before fine-tuning
- Checking model checkpoints and outputs
- Testing end-to-end training pipelines
- Verifying system dependencies and GPU availability
- Debugging training failures or data issues
- Ensuring data format compliance
- Validating model configurations
- Testing inference pipelines
**Key Resources:**

- `scripts/validate-data.sh` - Comprehensive dataset validation
- `scripts/validate-model.sh` - Model checkpoint and config validation
- `scripts/test-pipeline.sh` - End-to-end pipeline testing
- `scripts/check-dependencies.sh` - System and package dependency checking
- `templates/test-config.yaml` - Pipeline test configuration template
- `templates/validation-schema.json` - Data validation schema template
- `examples/data-validation-example.md` - Dataset validation workflow
- `examples/pipeline-testing-example.md` - Complete pipeline testing guide
## Quick Start

### Validate Training Data

```bash
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --format jsonl \
  --schema templates/validation-schema.json
```

### Validate Model Checkpoint

```bash
bash scripts/validate-model.sh \
  --model-path ./checkpoints/epoch-3 \
  --framework pytorch \
  --check-weights
```

### Test Complete Pipeline

```bash
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --data ./data/sample.jsonl \
  --verbose
```

### Check System Dependencies

```bash
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16
```
## Validation Scripts

### Data Validation (`validate-data.sh`)

**Purpose:** Validate training datasets for format compliance, data quality, and schema conformance.

**Usage:**

```bash
bash scripts/validate-data.sh [OPTIONS]
```
**Options:**

- `--data-path PATH` - Path to dataset file or directory (required)
- `--format FORMAT` - Data format: jsonl, csv, parquet, arrow (default: jsonl)
- `--schema PATH` - Path to validation schema JSON file
- `--sample-size N` - Number of samples to validate (default: all)
- `--check-duplicates` - Check for duplicate entries
- `--check-null` - Check for null/missing values
- `--check-length` - Validate text length constraints
- `--check-tokens` - Validate tokenization compatibility
- `--tokenizer MODEL` - Tokenizer to use for token validation (default: gpt2)
- `--max-length N` - Maximum sequence length (default: 2048)
- `--output REPORT` - Output validation report path (default: validation-report.json)
**Validation Checks:**

- **Format Compliance**: Verify file format and structure
- **Schema Validation**: Check against JSON schema if provided
- **Data Types**: Validate field types (string, int, float, etc.)
- **Required Fields**: Ensure all required fields are present
- **Value Ranges**: Check numeric values within expected ranges
- **Text Quality**: Detect empty strings, excessive whitespace, encoding issues
- **Duplicates**: Identify duplicate entries (exact or near-duplicate)
- **Token Counts**: Verify sequences fit within model context length
- **Label Distribution**: Check class balance for classification tasks
- **Missing Values**: Detect and report null/missing values
**Example Output:**

```json
{
  "status": "PASS",
  "total_samples": 10000,
  "valid_samples": 9987,
  "invalid_samples": 13,
  "validation_errors": [
    {
      "sample_id": 42,
      "field": "text",
      "error": "Exceeds max token length: 2150 > 2048"
    },
    {
      "sample_id": 156,
      "field": "label",
      "error": "Invalid label value: 'unknwn' (typo)"
    }
  ],
  "statistics": {
    "avg_text_length": 487,
    "avg_token_count": 128,
    "label_distribution": {
      "positive": 4892,
      "negative": 4895,
      "neutral": 213
    }
  },
  "recommendations": [
    "Remove or fix 13 invalid samples before training",
    "Label distribution is imbalanced - consider class weighting"
  ]
}
```
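The imbalance recommendation follows from the ratio of the most to the least frequent label, which the schema template caps via `max_label_imbalance_ratio`. A sketch of that check (the threshold name mirrors the template; the script's internal logic may differ):

```python
def label_imbalance_ratio(distribution):
    """Ratio of the most frequent to the least frequent class count."""
    counts = distribution.values()
    return max(counts) / min(counts)

dist = {"positive": 4892, "negative": 4895, "neutral": 213}
ratio = label_imbalance_ratio(dist)
print(round(ratio, 1))  # 4895 / 213 ≈ 23.0
if ratio > 10.0:  # max_label_imbalance_ratio from the schema template
    print("Label distribution is imbalanced - consider class weighting")
```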
**Exit Codes:**

- `0` - All validations passed
- `1` - Validation errors found
- `2` - Script error (invalid arguments, file not found)
### Model Validation (`validate-model.sh`)

**Purpose:** Validate model checkpoints, configurations, and inference readiness.

**Usage:**

```bash
bash scripts/validate-model.sh [OPTIONS]
```
**Options:**

- `--model-path PATH` - Path to model checkpoint directory (required)
- `--framework FRAMEWORK` - Framework: pytorch, tensorflow, jax (default: pytorch)
- `--check-weights` - Verify model weights are loadable
- `--check-config` - Validate model configuration file
- `--check-tokenizer` - Verify tokenizer files are present
- `--check-inference` - Test inference with sample input
- `--sample-input TEXT` - Sample text for inference test
- `--expected-output TEXT` - Expected output for verification
- `--output REPORT` - Output validation report path
**Validation Checks:**

- **File Structure**: Verify required files are present
- **Config Validation**: Check model_config.json for correctness
- **Weight Integrity**: Load and verify model weights
- **Tokenizer Files**: Ensure tokenizer files exist and are loadable
- **Model Architecture**: Validate architecture matches config
- **Memory Requirements**: Estimate GPU/CPU memory needed
- **Inference Test**: Run sample inference if requested
- **Quantization**: Verify quantized models load correctly
- **LoRA/PEFT**: Validate adapter weights and configuration
**Example Output:**

```json
{
  "status": "PASS",
  "model_path": "./checkpoints/llama-7b-finetuned",
  "framework": "pytorch",
  "checks": {
    "file_structure": "PASS",
    "config": "PASS",
    "weights": "PASS",
    "tokenizer": "PASS",
    "inference": "PASS"
  },
  "model_info": {
    "architecture": "LlamaForCausalLM",
    "parameters": "7.2B",
    "precision": "float16",
    "lora_enabled": true,
    "lora_rank": 16
  },
  "memory_estimate": {
    "model_size_gb": 13.5,
    "inference_vram_gb": 16.2,
    "training_vram_gb": 24.8
  },
  "inference_test": {
    "input": "Hello, world!",
    "output": "Hello, world! How can I help you today?",
    "latency_ms": 142
  }
}
```
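The `model_size_gb` figure is essentially parameter count times bytes per parameter. A back-of-the-envelope helper (a sketch, not the script's actual estimator; inference and training typically need additional headroom for activations, KV cache, gradients, and optimizer states):

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}

def estimate_model_size_gb(num_params, precision="float16"):
    """Raw weight size in GiB: parameter count times bytes per parameter."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3

print(f"{estimate_model_size_gb(7.2e9):.1f} GB")  # a 7.2B float16 model is ~13.4 GB of weights
```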
### Pipeline Testing (`test-pipeline.sh`)

**Purpose:** Test end-to-end ML training and inference pipelines with sample data.

**Usage:**

```bash
bash scripts/test-pipeline.sh [OPTIONS]
```
**Options:**

- `--config PATH` - Pipeline test configuration file (required)
- `--data PATH` - Sample data for testing
- `--steps STEPS` - Comma-separated steps to test (default: all)
- `--verbose` - Enable detailed output
- `--output REPORT` - Output test report path
- `--fail-fast` - Stop on first failure
- `--cleanup` - Clean up temporary files after test
**Pipeline Steps:**

- `data_loading`: Test data loading and preprocessing
- `tokenization`: Verify tokenization pipeline
- `model_loading`: Load model and verify initialization
- `training_step`: Run single training step
- `validation_step`: Run single validation step
- `checkpoint_save`: Test checkpoint saving
- `checkpoint_load`: Test checkpoint loading
- `inference`: Test inference pipeline
- `metrics`: Verify metrics calculation
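The steps run in order, and `--fail-fast` stops at the first failure. A framework-free sketch of that orchestration logic (the step functions are stand-ins; the real script runs the actual pipeline stages):

```python
import time

def run_steps(steps, fail_fast=True):
    """Run named step callables in order, recording status and timing."""
    results = []
    for name, fn in steps:
        start = time.perf_counter()
        try:
            fn()
            status = "PASS"
        except Exception as exc:
            status = f"FAIL: {exc}"
        results.append((name, status, time.perf_counter() - start))
        if fail_fast and status != "PASS":
            break  # --fail-fast: skip remaining steps after the first failure
    return results

def oom_step():
    raise RuntimeError("CUDA out of memory")

steps = [
    ("data_loading", lambda: None),  # stand-in: succeeds
    ("training_step", oom_step),     # stand-in: fails
    ("inference", lambda: None),     # skipped under fail-fast
]
for name, status, secs in run_steps(steps):
    print(f"{name}: {status} ({secs:.1f}s)")
```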
**Test Configuration (`test-config.yaml`):**

```yaml
pipeline:
  name: llama-7b-finetuning
  framework: pytorch

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  load_in_8bit: false
  lora:
    enabled: true
    r: 16
    alpha: 32

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
```
**Example Output:**

```
Pipeline Test Report

Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds

Test Results:
✓ data_loading (2.3s) - Loaded 10 samples successfully
✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
✓ model_loading (8.7s) - Model loaded, 7.2B parameters
✓ training_step (15.4s) - Training step completed, loss: 1.847
✓ validation_step (12.1s) - Validation step completed, loss: 1.923
✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
✓ checkpoint_load (6.8s) - Checkpoint loaded successfully
✓ inference (2.9s) - Inference completed, latency: 142ms
✓ metrics (0.4s) - Metrics calculated correctly

Overall: PASS (9/9 tests passed)

Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB

Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput
```
### Dependency Checking (`check-dependencies.sh`)

**Purpose:** Verify all required system dependencies, packages, and GPU availability.

**Usage:**

```bash
bash scripts/check-dependencies.sh [OPTIONS]
```
**Options:**

- `--framework FRAMEWORK` - ML framework: pytorch, tensorflow, jax (default: pytorch)
- `--gpu-required` - Require GPU availability
- `--min-vram GB` - Minimum GPU VRAM required (GB)
- `--cuda-version VERSION` - Required CUDA version
- `--packages FILE` - Path to requirements.txt or packages list
- `--platform PLATFORM` - Platform: modal, lambda, runpod, local (default: local)
- `--fix` - Attempt to install missing packages
- `--output REPORT` - Output dependency report path
**Dependency Checks:**

- **Python Version**: Verify compatible Python version
- **ML Framework**: Check PyTorch/TensorFlow/JAX installation
- **CUDA/cuDNN**: Verify CUDA toolkit and cuDNN
- **GPU Availability**: Detect and validate GPUs
- **VRAM**: Check GPU memory capacity
- **System Packages**: Verify system-level dependencies
- **Python Packages**: Check required pip packages
- **Platform Tools**: Validate platform-specific tools (Modal, Lambda CLI)
- **Storage**: Check available disk space
- **Network**: Test internet connectivity for model downloads
**Example Output:**

```json
{
  "status": "PASS",
  "platform": "modal",
  "checks": {
    "python": { "status": "PASS", "version": "3.10.12", "required": ">=3.9" },
    "pytorch": { "status": "PASS", "version": "2.1.0", "cuda_available": true, "cuda_version": "12.1" },
    "gpu": { "status": "PASS", "count": 1, "type": "NVIDIA A100", "vram_gb": 40, "driver_version": "535.129.03" },
    "packages": { "status": "PASS", "installed": 42, "missing": 0, "outdated": 3 },
    "storage": { "status": "PASS", "available_gb": 128, "required_gb": 50 }
  },
  "recommendations": [
    "Update transformers to latest version (4.36.0 -> 4.37.2)",
    "Consider upgrading to PyTorch 2.2.0 for better performance"
  ]
}
```
## Templates

### Validation Schema Template (`templates/validation-schema.json`)

Defines expected data structure and validation rules for training datasets.

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["text", "label"],
  "properties": {
    "text": {
      "type": "string",
      "minLength": 10,
      "maxLength": 5000,
      "description": "Input text for training"
    },
    "label": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"],
      "description": "Classification label"
    },
    "metadata": {
      "type": "object",
      "properties": {
        "source": { "type": "string" },
        "timestamp": { "type": "string", "format": "date-time" }
      }
    }
  },
  "validation_rules": {
    "max_token_length": 2048,
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "check_duplicates": true,
    "min_label_count": 100,
    "max_label_imbalance_ratio": 10.0
  }
}
```
**Customization:**

- Modify required fields for your use case
- Add custom properties and validation rules
- Set appropriate length constraints
- Define label enums for classification
- Configure tokenizer and max length
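In practice schema conformance can be checked with the `jsonschema` package (listed under Missing Dependencies in Troubleshooting). For illustration, a hand-rolled, stdlib-only validator covering just the core constraints of the template — required fields, string type, length bounds, and enums:

```python
def validate_sample(sample, schema):
    """Minimal check of required fields, string lengths, and enums."""
    errors = []
    for field in schema.get("required", []):
        if field not in sample:
            errors.append(f"missing required field: {field}")
    for field, rules in schema.get("properties", {}).items():
        if field not in sample:
            continue
        value = sample[field]
        if rules.get("type") == "string":
            if not isinstance(value, str):
                errors.append(f"{field}: expected string")
                continue
            if "minLength" in rules and len(value) < rules["minLength"]:
                errors.append(f"{field}: shorter than {rules['minLength']}")
            if "maxLength" in rules and len(value) > rules["maxLength"]:
                errors.append(f"{field}: longer than {rules['maxLength']}")
            if "enum" in rules and value not in rules["enum"]:
                errors.append(f"{field}: invalid value {value!r}")
    return errors

schema = {
    "required": ["text", "label"],
    "properties": {
        "text": {"type": "string", "minLength": 10, "maxLength": 5000},
        "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    },
}
print(validate_sample({"text": "a fine training sample", "label": "unknwn"}, schema))
# → ["label: invalid value 'unknwn'"]
```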
### Test Configuration Template (`templates/test-config.yaml`)

Defines pipeline testing parameters and expected behaviors.

```yaml
pipeline:
  name: test-pipeline
  framework: pytorch
  platform: modal  # modal, lambda, runpod, local

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl  # jsonl, csv, parquet
  sample_size: 10  # Number of samples to use for testing

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  quantization: null  # null, 8bit, 4bit
  lora:
    enabled: true
    r: 16
    alpha: 32
    dropout: 0.05
    target_modules: ["q_proj", "v_proj"]

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5
  warmup_steps: 0
  eval_steps: 2

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
  fail_fast: true
  cleanup: true

validation:
  metrics: ["loss", "perplexity"]
  check_gradients: true
  check_memory_leak: true

gpu:
  required: true
  min_vram_gb: 16
  allow_cpu_fallback: false
```
## Integration with ML Training Workflow

### Pre-Training Validation

1. Check system dependencies:

```bash
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16
```

2. Validate training data:

```bash
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --schema templates/validation-schema.json \
  --check-duplicates \
  --check-tokens
```

3. Test pipeline with sample data:

```bash
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --verbose
```
### Post-Training Validation

1. Validate model checkpoint:

```bash
bash scripts/validate-model.sh \
  --model-path ./checkpoints/final \
  --framework pytorch \
  --check-weights \
  --check-inference
```

2. Test inference pipeline:

```bash
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --steps inference,metrics
```
### CI/CD Integration

```yaml
# .github/workflows/validate-training.yml
- name: Validate Training Data
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
      --data-path ./data/train.jsonl \
      --schema ./validation-schema.json

- name: Test Training Pipeline
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
      --config ./test-config.yaml \
      --fail-fast
```
## Error Handling and Debugging

### Common Validation Failures

**Data Validation Failures:**

- Token length exceeded → Truncate or filter long sequences
- Missing required fields → Fix data preprocessing
- Duplicate entries → Deduplicate dataset
- Invalid labels → Clean label data

**Model Validation Failures:**

- Missing config files → Ensure complete checkpoint
- Weight loading errors → Check framework compatibility
- Tokenizer mismatch → Verify tokenizer version

**Pipeline Test Failures:**

- Out of memory → Reduce batch size, enable gradient checkpointing
- CUDA errors → Check GPU drivers, CUDA version
- Slow performance → Profile and optimize data loading
### Debug Mode

All scripts support verbose debugging:

```bash
bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug
```

Outputs:

- Detailed progress logs
- Stack traces for errors
- Performance metrics
- Memory usage statistics
## Best Practices

- **Always validate before training** - Catch data issues early
- **Use schema validation** - Enforce data quality standards
- **Test with sample data first** - Verify pipeline before full training
- **Check dependencies on new platforms** - Avoid runtime failures
- **Validate checkpoints** - Ensure model integrity before deployment
- **Monitor validation metrics** - Track data quality over time
- **Automate validation in CI/CD** - Prevent bad data from entering the pipeline
- **Keep validation schemas updated** - Reflect current data requirements
## Performance Tips

- Use `--sample-size` for quick validation of large datasets
- Run dependency checks once per environment and cache the results
- Parallelize validation with GNU parallel for multi-file datasets
- Use `--fail-fast` in CI/CD to save time
- Cache tokenizers to avoid re-downloading
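The multi-file parallelization tip applies equally inside Python. A sketch using `concurrent.futures` to validate several shards at once (`validate_file` is a trivial stand-in that only counts JSON-parseable lines; a thread pool is used here since the work is I/O-bound, but CPU-heavy validation would favor a process pool or GNU parallel):

```python
import json
from concurrent.futures import ThreadPoolExecutor

def validate_file(path):
    """Stand-in per-shard validator: count JSON-parseable vs broken lines."""
    valid = invalid = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                invalid += 1
    return path, valid, invalid

def validate_shards(paths, workers=4):
    """Validate multiple dataset shards concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(validate_file, paths))
```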
## Troubleshooting

**Script Not Found:**

```bash
# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help
```

**Permission Denied:**

```bash
# Make scripts executable
chmod +x scripts/*.sh
```

**Missing Dependencies:**

```bash
# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc
```
---

**Plugin:** ml-training | **Version:** 1.0.0 | **Supported Frameworks:** PyTorch, TensorFlow, JAX | **Platforms:** Modal, Lambda Labs, RunPod, Local