VCF Toolkit
Toolkit for VCF/BCF variant file analysis: calculate statistics, filter variants, and export as JSON. Designed for WGS/WES sequencing result inspection and quality control.
Quick Start
Install
uv pip install pysam typer
Basic Usage
1. VCF 統計情報を取得
python scripts/vcf_stats.py --vcf variants.vcf.gz --chrom chr1
2. 高品質バリアントのみをフィルタして新しい VCF を作成
python scripts/filter_vcf.py
--vcf variants.vcf.gz
--output high_quality.vcf
--min-qual 30
--min-dp 10
3. フィルタされたバリアントを JSON で出力(≤100 エントリ)
python scripts/inspect_vcf.py
--vcf high_quality.vcf
--chrom chr1
--output chr1.json
Scripts
inspect_vcf.py - VCF Inspection & JSON Export
Extract variants from VCF files for specific chromosomes or regions and export as JSON format.
Required Arguments
-
--vcf PATH
-
Input VCF file path
-
--chrom TEXT or --region TEXT
-
Either one required
-
--chrom : Entire chromosome (e.g., chr1 )
-
--region : Specific region (e.g., chr1:1000000-2000000 )
Optional Arguments
Output:
- --output PATH
- JSON output path (default: stdout)
Filter Conditions:
-
--min-qual FLOAT
-
Minimum quality score (QUAL >= X)
-
--min-dp INT
-
Minimum depth (INFO/DP >= X)
-
--min-af FLOAT
-
Minimum allele frequency (INFO/AF >= X)
-
--max-af FLOAT
-
Maximum allele frequency (INFO/AF <= X)
-
--pass-only / --all-filters
-
PASS only (default) / Include all filters
Limits:
-
--max-variants INT
-
Maximum variant count (default: 100)
-
--force
-
Ignore entry limit (allows large JSON output)
Output Format (JSON)
{ "num_variants": 45, "samples": ["sample1", "sample2"], "variants": [ { "chrom": "chr1", "pos": 12345, "id": "rs123456", "ref": "A", "alts": ["G"], "qual": 100.0, "filter": ["PASS"], "info": { "DP": 50, "AF": [0.5], "AC": [25] }, "samples": { "sample1": {"GT": "0/1", "DP": 25, "GQ": 99}, "sample2": {"GT": "0/0", "DP": 25, "GQ": 99} } } ] }
vcf_stats.py - VCF Statistics
Calculate comprehensive statistics from VCF files and output as JSON. Includes variant counts, quality distributions, depth distributions, and allele frequency statistics.
Arguments
Required:
- --vcf PATH
- Input VCF file path
Optional:
-
--chrom TEXT
-
Chromosome specification (default: all chromosomes)
-
--region TEXT
-
Region specification (e.g., chr1:1000-2000 )
-
--output PATH
-
JSON output path (default: stdout)
Output Content (JSON)
-
total_variants
-
Total variant count
-
filter_counts
-
Breakdown by filter (PASS, LowQual, etc.)
-
variant_types
-
Breakdown by variant type (SNP, insertion, deletion)
-
chrom_counts
-
Variant count per chromosome
-
quality_stats
-
Quality score statistics (min, max, mean, median)
-
depth_stats
-
Depth statistics (INFO/DP)
-
allele_frequency_stats
-
Allele frequency statistics (INFO/AF)
Usage Examples
Calculate statistics for chr1
python scripts/vcf_stats.py --vcf variants.vcf.gz --chrom chr1
Calculate statistics for all chromosomes (output to JSON file)
python scripts/vcf_stats.py --vcf variants.vcf.gz --output stats.json
Calculate statistics for specific region
python scripts/vcf_stats.py --vcf variants.vcf.gz --region chr1:10000-20000
filter_vcf.py - VCF Filtering
Filter VCF files by quality, depth, and allele frequency criteria. Output filtered variants as a new VCF file.
Arguments
Required:
-
--vcf PATH
-
Input VCF file path
-
--output PATH
-
Output VCF file path
Optional:
-
--chrom TEXT
-
Chromosome specification
-
--region TEXT
-
Region specification (e.g., chr1:1000-2000 )
-
--min-qual FLOAT
-
Minimum quality score
-
--min-dp INT
-
Minimum depth (INFO/DP)
-
--min-af FLOAT
-
Minimum allele frequency (INFO/AF)
-
--max-af FLOAT
-
Maximum allele frequency (INFO/AF)
-
--pass-only
-
PASS variants only (default: False)
Usage Examples
Extract chr1 PASS variants only
python scripts/filter_vcf.py
--vcf variants.vcf.gz
--output chr1_pass.vcf
--chrom chr1
--pass-only
Extract high-quality variants (QUAL >= 30, DP >= 10)
python scripts/filter_vcf.py
--vcf variants.vcf.gz
--output high_quality.vcf
--min-qual 30
--min-dp 10
Extract rare variants (AF <= 0.01)
python scripts/filter_vcf.py
--vcf variants.vcf.gz
--output rare_variants.vcf
--max-af 0.01
Workflow Examples
Example 1: Comprehensive Variant Analysis Workflow
Combine all three scripts for complete VCF analysis:
Step 1: Calculate overall statistics
python scripts/vcf_stats.py --vcf variants.vcf.gz --chrom chr1 --output stats.json
Step 2: Filter high-quality variants to new VCF
python scripts/filter_vcf.py
--vcf variants.vcf.gz
--output high_quality.vcf
--chrom chr1
--min-qual 30
--min-dp 10
--pass-only
Step 3: Export filtered variants as JSON for downstream analysis
python scripts/inspect_vcf.py
--vcf high_quality.vcf
--chrom chr1
--output chr1_filtered.json
Example 2: Rare Variant Discovery
Identify and export rare variants from specific region:
Filter rare variants (AF <= 0.01)
python scripts/filter_vcf.py
--vcf variants.vcf.gz
--output rare.vcf
--region chr17:41196312-41277500
--max-af 0.01
Export as JSON for analysis
python scripts/inspect_vcf.py
--vcf rare.vcf
--region chr17:41196312-41277500
--output brca1_rare.json
Error Handling
Variant Count Exceeds Limit
$ python scripts/inspect_vcf.py --vcf huge.vcf --chrom chr1 --output out.json
Error: VCF contains 1,234+ variants after filtering (limit: 100).
Suggestions:
- Apply more restrictive filters: --min-qual, --min-dp, --pass-only
- Specify a genomic region: --region chr1:1000-2000
- Override limit with --force (warning: may produce very large JSON)
- Use bcftools directly for large-scale processing
Current filter conditions: --chrom chr1 --pass-only
Solutions:
-
Apply more restrictive filters: --min-qual 30 , --min-dp 10
-
Narrow down the region: --region chr1:1000000-1100000
-
Override limit with --force (use cautiously)
Missing Chromosome/Region Specification
$ python scripts/inspect_vcf.py --vcf variants.vcf --output out.json
Error: Either --chrom or --region must be specified.
Solutions:
- Add --chrom chr1 or --region chr1:1000-2000 to the command
Best Practices
- Always Specify Chromosome or Region
Always specify chromosome or region when using inspect_vcf.py to avoid processing entire VCF files inefficiently.
❌ Bad: No chromosome specified
python scripts/inspect_vcf.py --vcf variants.vcf
✅ Good: Chromosome specified
python scripts/inspect_vcf.py --vcf variants.vcf --chrom chr1
- Apply Additional Filters for Efficiency
Combine quality and depth filters with default PASS-only filtering for better results.
✅ Good: Multiple filters applied
python scripts/inspect_vcf.py
--vcf variants.vcf
--chrom chr1
--min-qual 30
--min-dp 10
- Respect 100-Entry Limit for JSON Export
Use inspect_vcf.py for small datasets only. Pre-filter large VCF files with filter_vcf.py or bcftools before JSON export.
Pre-filter large datasets with bcftools
bcftools view -i 'QUAL>=30 && DP>=10' -r chr1:1000000-2000000 variants.vcf > filtered.vcf
Then export to JSON
python scripts/inspect_vcf.py --vcf filtered.vcf --chrom chr1 --output filtered.json
- Use --force Cautiously
Use --force only when necessary. JSON files with thousands of entries can become several MB to tens of MB in size.
When to Use vcf-toolkit vs bcftools
Task vcf-toolkit bcftools
Small dataset JSON export ✅ inspect_vcf.py
Large-scale filtering filter_vcf.py ✅ bcftools view
Complex filter expressions
✅ bcftools
VCF-to-VCF conversion filter_vcf.py ✅ bcftools
Variant statistics ✅ vcf_stats.py ✅ bcftools stats
Recommended Workflow:
-
Pre-filter large datasets with bcftools or filter_vcf.py
-
Export filtered results to JSON with inspect_vcf.py for detailed inspection
-
Perform downstream analysis in Python/R using JSON output
Related Skills
-
pysam - BAM/CRAM alignment file operations
-
sequence-io - FASTA/FASTQ sequence file operations
-
blast-search - BLAST homology search
-
blat-api-searching - BLAT genome mapping
Troubleshooting
VCF File Too Large
Specify a narrower region or pre-filter with bcftools before JSON export.
Specify narrower region
python scripts/inspect_vcf.py --vcf variants.vcf --region chr1:1000000-1100000
Pre-filter with bcftools
bcftools view -i 'QUAL>=50' variants.vcf | python scripts/inspect_vcf.py --vcf - --chrom chr1
Index Error
Create tabix index for compressed VCF files.
Compress with bgzip
bgzip variants.vcf
Create tabix index
tabix -p vcf variants.vcf.gz
Use indexed VCF
python scripts/inspect_vcf.py --vcf variants.vcf.gz --chrom chr1
Include Non-PASS Variants
Use --all-filters flag to include all variants regardless of FILTER field.
python scripts/inspect_vcf.py --vcf variants.vcf --chrom chr1 --all-filters