# Joint Calling

Call variants jointly across multiple samples for improved accuracy and consistent genotyping.
## Why Joint Calling?

- **Improved sensitivity** - leverage information across samples
- **Consistent genotyping** - the same sites are called across all samples
- **VQSR eligible** - machine-learning filtering requires a cohort
- **Population analysis** - allele frequencies across the cohort
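The "allele frequencies across the cohort" point can be made concrete with a toy example. The sketch below derives AF from the GT columns of a joint VCF using plain awk; in a real pipeline you would use `bcftools +fill-tags` to compute AF/AC/AN, and the file `toy.vcf` here is an invented, simplified stand-in.

```bash
# Sketch: cohort allele frequency from the GT fields of a joint VCF.
# toy.vcf is a made-up, whitespace-simplified example for illustration.
cat > toy.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s1 s2 s3
chr1 100 . A G 50 PASS . GT 0/1 1/1 0/0
chr1 200 . C T 50 PASS . GT 0/0 0/1 0/1
EOF

awk '!/^#/ {
    alt = 0; total = 0
    for (i = 10; i <= NF; i++) {            # genotype columns start at field 10
        n = split($i, gt, "[/|]")           # handle both 0/1 and 0|1
        for (j = 1; j <= n; j++) { total++; if (gt[j] == "1") alt++ }
    }
    printf "%s:%s AF=%.3f\n", $1, $2, alt / total
}' toy.vcf
```

For site chr1:100 the genotypes 0/1, 1/1, 0/0 carry 3 alternate alleles out of 6, so AF=0.500; this is exactly the per-cohort statistic that joint genotyping makes consistent across samples.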
## Workflow Overview

```
Sample BAMs
    │
    ├── HaplotypeCaller (per-sample, -ERC GVCF)
    │       └── sample1.g.vcf.gz, sample2.g.vcf.gz, ...
    │
    ├── CombineGVCFs or GenomicsDBImport
    │       └── Combine into cohort database
    │
    ├── GenotypeGVCFs
    │       └── Joint genotyping
    │
    └── VQSR or Hard Filtering
            └── Final VCF
```
## Step 1: Per-Sample gVCF Generation

```bash
# Generate a gVCF for each sample
gatk HaplotypeCaller \
    -R reference.fa \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF

# With intervals (faster)
gatk HaplotypeCaller \
    -R reference.fa \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF \
    -L intervals.bed
```
### Batch Processing

```bash
# Process all samples in parallel
for bam in *.bam; do
    sample=$(basename "$bam" .bam)
    gatk HaplotypeCaller \
        -R reference.fa \
        -I "$bam" \
        -O "${sample}.g.vcf.gz" \
        -ERC GVCF &
done
wait
```
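Before moving on to combining, it is worth confirming that every gVCF produced by the batch loop also has its tabix index (HaplotypeCaller writes the `.tbi` alongside the output; a missing index usually means a background job died). A minimal check, as a sketch:

```bash
# Sketch: flag any gVCF that lacks its .tbi index before combining.
missing=0
for gvcf in *.g.vcf.gz; do
    [ -e "$gvcf" ] || continue          # glob matched nothing
    if [ ! -f "${gvcf}.tbi" ]; then
        echo "missing index: $gvcf" >&2
        missing=$((missing + 1))
    fi
done
echo "gVCFs without an index: $missing"
```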
## Step 2a: CombineGVCFs (Small Cohorts)

For <100 samples:

```bash
gatk CombineGVCFs \
    -R reference.fa \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -V sample3.g.vcf.gz \
    -O cohort.g.vcf.gz
```
### From Sample Map

```bash
# Create a sample map file (sample name, tab, gVCF path), e.g.:
#   sample1 <TAB> /path/to/sample1.g.vcf.gz
#   sample2 <TAB> /path/to/sample2.g.vcf.gz
for f in *.g.vcf.gz; do
    echo -e "$(basename "$f" .g.vcf.gz)\t$f"
done > sample_map.txt

# CombineGVCFs takes no sample map, so expand the paths into -V arguments
gatk CombineGVCFs \
    -R reference.fa \
    $(cut -f2 sample_map.txt | sed 's/^/-V /') \
    -O cohort.g.vcf.gz
```
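A malformed sample map is a common failure mode for both CombineGVCFs-style expansion and GenomicsDBImport. The sketch below (my own validation snippet, not a GATK tool) checks that each line of `sample_map.txt` has exactly two tab-separated fields and that the referenced gVCF exists:

```bash
# Sketch: validate sample_map.txt before handing it to downstream tools.
bad=0
while IFS=$'\t' read -r name path rest; do
    if [ -z "$name" ] || [ -z "$path" ] || [ -n "$rest" ]; then
        echo "malformed line (sample '$name')" >&2
        bad=$((bad + 1))
    elif [ ! -f "$path" ]; then
        echo "missing gVCF for $name: $path" >&2
        bad=$((bad + 1))
    fi
done < sample_map.txt
echo "problems: $bad"
```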
## Step 2b: GenomicsDBImport (Large Cohorts)

For >100 samples, use GenomicsDB:

```bash
# Create the sample map
for f in *.g.vcf.gz; do
    echo -e "$(basename "$f" .g.vcf.gz)\t$f"
done > sample_map.txt

# Import into GenomicsDB (per chromosome for parallelism)
gatk GenomicsDBImport \
    --sample-name-map sample_map.txt \
    --genomicsdb-workspace-path genomicsdb_chr1 \
    -L chr1 \
    --reader-threads 4

# Or all chromosomes in parallel
for chr in {1..22} X Y; do
    gatk GenomicsDBImport \
        --sample-name-map sample_map.txt \
        --genomicsdb-workspace-path genomicsdb_chr${chr} \
        -L chr${chr} &
done
wait
```
### Update GenomicsDB with New Samples

```bash
# Intervals are read from the existing workspace, so -L is not passed here
gatk GenomicsDBImport \
    --genomicsdb-update-workspace-path genomicsdb_chr1 \
    --sample-name-map new_samples.txt
```
## Step 3: GenotypeGVCFs

### From Combined gVCF

```bash
gatk GenotypeGVCFs \
    -R reference.fa \
    -V cohort.g.vcf.gz \
    -O cohort.vcf.gz
```

### From GenomicsDB

```bash
gatk GenotypeGVCFs \
    -R reference.fa \
    -V gendb://genomicsdb_chr1 \
    -O chr1.vcf.gz

# All chromosomes in parallel
for chr in {1..22} X Y; do
    gatk GenotypeGVCFs \
        -R reference.fa \
        -V gendb://genomicsdb_chr${chr} \
        -O chr${chr}.vcf.gz &
done
wait

# Merge per-chromosome VCFs
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz \
    -Oz -o cohort.vcf.gz
```
### With Allele-Specific Annotations

```bash
gatk GenotypeGVCFs \
    -R reference.fa \
    -V gendb://genomicsdb \
    -O cohort.vcf.gz \
    -G StandardAnnotation \
    -G AS_StandardAnnotation
```
## Step 4: Filtering

### VQSR (Recommended for >30 Samples)

```bash
# SNPs
gatk VariantRecalibrator \
    -R reference.fa \
    -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
    --resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -O snps.recal \
    --tranches-file snps.tranches

gatk ApplyVQSR \
    -R reference.fa \
    -V cohort.vcf.gz \
    --recal-file snps.recal \
    --tranches-file snps.tranches \
    -mode SNP \
    --truth-sensitivity-filter-level 99.5 \
    -O cohort.snps.vcf.gz

# Indels (run on the SNP-recalibrated VCF)
gatk VariantRecalibrator \
    -R reference.fa \
    -V cohort.snps.vcf.gz \
    --resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode INDEL \
    -O indels.recal \
    --tranches-file indels.tranches

gatk ApplyVQSR \
    -R reference.fa \
    -V cohort.snps.vcf.gz \
    --recal-file indels.recal \
    --tranches-file indels.tranches \
    -mode INDEL \
    --truth-sensitivity-filter-level 99.0 \
    -O cohort.filtered.vcf.gz
```
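ApplyVQSR does not remove failing sites; it labels them in the FILTER column with their tranche, leaving only passing sites as `PASS`. The sketch below tallies FILTER values with awk on an invented toy VCF to illustrate what the output looks like:

```bash
# Sketch: tally FILTER values after VQSR. filtered_toy.vcf is a made-up
# example; filtered sites stay in the VCF with a tranche label in FILTER.
cat > filtered_toy.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER
chr1 100 . A G 50 PASS
chr1 200 . C T 12 VQSRTrancheSNP99.50to100.00
chr1 300 . G A 48 PASS
EOF

awk '!/^#/ { count[$7]++ } END { for (f in count) print f, count[f] }' filtered_toy.vcf
```

On a real callset, `bcftools view -H -f PASS cohort.filtered.vcf.gz | wc -l` gives the count of passing records.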
### Hard Filtering (Small Cohorts)

See the filtering-best-practices skill:

```bash
gatk VariantFiltration \
    -R reference.fa \
    -V cohort.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    -O cohort.filtered.vcf.gz
```
## Complete Pipeline Script

```bash
#!/bin/bash
set -euo pipefail

REFERENCE=$1
OUTPUT_DIR=$2
THREADS=16

# GenomicsDBImport creates its workspace directory itself, so only the
# gVCF and VCF output directories are pre-created here.
mkdir -p "$OUTPUT_DIR"/{gvcfs,vcfs}

echo "=== Step 1: Generate gVCFs ==="
for bam in data/*.bam; do
    sample=$(basename "$bam" .bam)
    gatk HaplotypeCaller \
        -R "$REFERENCE" \
        -I "$bam" \
        -O "$OUTPUT_DIR/gvcfs/${sample}.g.vcf.gz" \
        -ERC GVCF &
    # Limit parallelism
    while [ "$(jobs -r | wc -l)" -ge "$THREADS" ]; do sleep 1; done
done
wait

echo "=== Step 2: Create sample map ==="
for f in "$OUTPUT_DIR"/gvcfs/*.g.vcf.gz; do
    echo -e "$(basename "$f" .g.vcf.gz)\t$(realpath "$f")"
done > "$OUTPUT_DIR/sample_map.txt"

echo "=== Step 3: GenomicsDBImport ==="
gatk GenomicsDBImport \
    --sample-name-map "$OUTPUT_DIR/sample_map.txt" \
    --genomicsdb-workspace-path "$OUTPUT_DIR/genomicsdb" \
    -L intervals.bed \
    --reader-threads 4

echo "=== Step 4: Joint genotyping ==="
gatk GenotypeGVCFs \
    -R "$REFERENCE" \
    -V "gendb://$OUTPUT_DIR/genomicsdb" \
    -O "$OUTPUT_DIR/vcfs/cohort.vcf.gz"

echo "=== Step 5: Index ==="
bcftools index -t "$OUTPUT_DIR/vcfs/cohort.vcf.gz"

echo "=== Statistics ==="
bcftools stats "$OUTPUT_DIR/vcfs/cohort.vcf.gz" > "$OUTPUT_DIR/vcfs/cohort_stats.txt"

echo "=== Complete ==="
echo "Joint VCF: $OUTPUT_DIR/vcfs/cohort.vcf.gz"
```
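The parallelism-limiting idiom in Step 1 (`jobs -r | wc -l` polled against a cap) can be exercised on its own with `sleep` as a stand-in for HaplotypeCaller; a sketch:

```bash
# Sketch: cap concurrent background jobs, as in Step 1 of the pipeline,
# using short sleeps instead of real per-sample GATK jobs.
MAX_JOBS=3
launched=0
for i in $(seq 1 8); do
    sleep 0.2 &                         # stand-in for a per-sample job
    launched=$((launched + 1))
    # Block until a slot frees up
    while [ "$(jobs -r | wc -l)" -ge "$MAX_JOBS" ]; do sleep 0.05; done
done
wait
echo "launched $launched jobs, at most $MAX_JOBS at a time"
```

`jobs -r` lists only running background jobs, so the inner loop spins until one finishes, keeping at most `MAX_JOBS` GATK processes alive at any moment.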
## Tips

### Memory for Large Cohorts

```bash
# Increase the Java heap
gatk --java-options "-Xmx64g" GenotypeGVCFs ...

# Batch size for GenomicsDBImport
gatk GenomicsDBImport --batch-size 50 ...
```
### Incremental Updates

```bash
# Add new samples to an existing database
gatk GenomicsDBImport \
    --genomicsdb-update-workspace-path existing_db \
    --sample-name-map new_samples.txt
```
## Related Skills

- `variant-calling/gatk-variant-calling` - Single-sample calling
- `variant-calling/filtering-best-practices` - VQSR and hard filtering
- `population-genetics/plink-basics` - Population analysis of joint calls
- `workflows/fastq-to-variants` - End-to-end germline pipeline