GATK CNV Workflow
Somatic CNV Workflow Overview
- PreprocessIntervals → preprocessed.interval_list
- CollectReadCounts → sample.counts.hdf5
- CreateReadCountPanelOfNormals → cnv_pon.hdf5
- DenoiseReadCounts → sample.standardized.tsv, sample.denoised.tsv
- CollectAllelicCounts → sample.allelicCounts.tsv
- ModelSegments → sample.cr.seg, sample.modelFinal.seg, sample.hets.tsv
- CallCopyRatioSegments → sample.called.seg
Step 1: Preprocess Intervals
For WES or targeted panels (--bin-length 0 disables binning, so each target stays one interval):
gatk PreprocessIntervals \
    -R reference.fa \
    -L targets.interval_list \
    --bin-length 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O preprocessed.interval_list
For WGS (fixed 1000 bp bins across the reference, no padding):
gatk PreprocessIntervals \
    -R reference.fa \
    --bin-length 1000 \
    --padding 0 \
    -O wgs.interval_list
Step 2: Collect Read Counts
Run once per sample, using the interval list from Step 1:
gatk CollectReadCounts \
    -R reference.fa \
    -I sample.bam \
    -L preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sample.counts.hdf5
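Running the step once per sample is usually scripted as a loop. The following is a minimal sketch, assuming BAMs are named <sample>.bam; print_count_cmds is a hypothetical helper (not part of GATK) that prints the commands for review instead of running them.

```shell
#!/usr/bin/env bash
# Print (do not run) one CollectReadCounts command per BAM.
# Assumption: each BAM is named <sample>.bam; output is <outdir>/<sample>.counts.hdf5.
print_count_cmds() {
    local ref=$1 intervals=$2 outdir=$3
    shift 3
    local bam sample
    for bam in "$@"; do
        sample=$(basename "$bam" .bam)
        printf 'gatk CollectReadCounts -R %s -I %s -L %s --interval-merging-rule OVERLAPPING_ONLY -O %s/%s.counts.hdf5\n' \
            "$ref" "$bam" "$intervals" "$outdir" "$sample"
    done
}
```

For example, print_count_cmds reference.fa preprocessed.interval_list counts bams/*.bam lists one command per BAM. The printed lines are not quoted, so paths containing spaces would need extra handling before being executed.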
Step 3: Create Panel of Normals
Combine counts from multiple normal samples, all collected over the same intervals:
gatk CreateReadCountPanelOfNormals \
    -I normal1.counts.hdf5 \
    -I normal2.counts.hdf5 \
    -I normal3.counts.hdf5 \
    --minimum-interval-median-percentile 5.0 \
    -O cnv_pon.hdf5
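With more than a handful of normals, typing each -I by hand gets error-prone. A minimal bash sketch, assuming all normal counts files sit in one directory and end in .counts.hdf5; build_pon_args and PON_ARGS are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Fill the PON_ARGS array with one "-I <file>" pair per counts file in a directory.
build_pon_args() {
    local dir=$1 f
    PON_ARGS=()
    for f in "$dir"/*.counts.hdf5; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        PON_ARGS+=(-I "$f")
    done
}

# Usage (not run here):
#   build_pon_args normals
#   gatk CreateReadCountPanelOfNormals "${PON_ARGS[@]}" \
#       --minimum-interval-median-percentile 5.0 -O cnv_pon.hdf5
```

An array keeps each path as a separate argument, so file names with spaces survive intact.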
Step 4: Denoise Read Counts
Using the panel of normals:
gatk DenoiseReadCounts \
    -I tumor.counts.hdf5 \
    --count-panel-of-normals cnv_pon.hdf5 \
    --standardized-copy-ratios tumor.standardized.tsv \
    --denoised-copy-ratios tumor.denoised.tsv
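The denoised values are on a log2 scale, so a value of -1 means half the expected coverage and 0 means no change. A small sketch for converting them to linear copy ratios, assuming the file has @-prefixed header lines followed by the columns CONTIG, START, END, LOG2_COPY_RATIO; log2_to_linear is a hypothetical helper, and the layout should be checked against your own output.

```shell
#!/usr/bin/env bash
# Convert the log2 copy ratio in column 4 to a linear ratio (2^x).
# Assumption: tab-separated input with columns CONTIG, START, END, LOG2_COPY_RATIO.
log2_to_linear() {
    awk -F'\t' -v OFS='\t' '
        /^@/ { next }                                         # skip SAM-style header lines
        $1 == "CONTIG" { print $1, $2, $3, "COPY_RATIO"; next }
        { printf "%s\t%s\t%s\t%.4f\n", $1, $2, $3, exp($4 * log(2)) }
    ' "$1"
}
```

Translating a ratio into an absolute copy number additionally depends on tumor purity and ploidy, which this sketch does not attempt.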
Step 5: Collect Allelic Counts
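At common germline SNP sites (used for allele-fraction modelling and LOH detection):
gatk CollectAllelicCounts \
    -R reference.fa \
    -I tumor.bam \
    -L common_snps.vcf \
    -O tumor.allelicCounts.tsv
Repeat with -I normal.bam and -O normal.allelicCounts.tsv to produce the matched-normal input for Step 6.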
Step 6: Model Segments
Tumor with matched-normal allelic counts:
gatk ModelSegments \
    --denoised-copy-ratios tumor.denoised.tsv \
    --allelic-counts tumor.allelicCounts.tsv \
    --normal-allelic-counts normal.allelicCounts.tsv \
    --output-prefix tumor \
    -O results/
Output files include: tumor.cr.seg, tumor.modelFinal.seg, tumor.hets.tsv
Step 7: Call Copy Ratio Segments
gatk CallCopyRatioSegments \
    -I results/tumor.cr.seg \
    -O results/tumor.called.seg
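A quick tally of the calls is a useful sanity check on the result. The sketch below assumes the called file has @-prefixed header lines, a column header starting with CONTIG, and the call (+, - or 0) in the last column; count_calls is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Count amplified (+), deleted (-) and neutral (0) segments in a called.seg file.
# Assumption: tab-separated, call in the last column, header lines start with "@".
count_calls() {
    awk -F'\t' '
        /^@/ { next }               # skip SAM-style header lines
        $1 == "CONTIG" { next }     # skip the column header
        { n[$NF]++ }
        END { printf "amp=%d del=%d neutral=%d\n", n["+"], n["-"], n["0"] }
    ' "$1"
}
```

Running count_calls results/tumor.called.seg prints a single summary line. A sample in which nearly every segment is called as altered often points to a problem upstream, such as an unsuitable panel of normals, rather than real biology.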
Plotting
Plot copy ratios before and after denoising:
gatk PlotDenoisedCopyRatios \
    --standardized-copy-ratios tumor.standardized.tsv \
    --denoised-copy-ratios tumor.denoised.tsv \
    --sequence-dictionary reference.dict \
    --minimum-contig-length 46709983 \
    --output-prefix tumor \
    -O plots/
Plot segments with allelic information:
gatk PlotModeledSegments \
    --denoised-copy-ratios tumor.denoised.tsv \
    --allelic-counts results/tumor.hets.tsv \
    --segments results/tumor.modelFinal.seg \
    --sequence-dictionary reference.dict \
    --minimum-contig-length 46709983 \
    --output-prefix tumor \
    -O plots/
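The value 46709983 is the length of chr21 in GRCh38, the shortest of chr1-22, X and Y, so the plots keep the primary contigs and drop shorter alt and decoy contigs. For a different reference build the threshold can be read from the sequence dictionary. A minimal sketch, assuming contigs are named 1-22/X/Y with or without a chr prefix; min_primary_contig_length is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print the length of the shortest primary contig listed in a .dict file.
# Assumption: primary contigs are named 1..22, X, Y, optionally prefixed with "chr".
min_primary_contig_length() {
    awk -F'\t' '
        $1 == "@SQ" {
            name = ""; len = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) len = substr($i, 4) + 0
            }
            if (name ~ /^(chr)?([0-9]+|X|Y)$/ && (min == "" || len < min)) min = len
        }
        END { if (min != "") print min }
    ' "$1"
}
```

The result can be passed directly, for example --minimum-contig-length "$(min_primary_contig_length reference.dict)".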
Germline CNV Workflow
For germline: use cohort mode
1. Collect counts (same as above)
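2. Determine contig ploidy (cohort mode takes priors and intervals; --model is only for case mode)
gatk DetermineGermlineContigPloidy \
    -L preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I sample1.counts.hdf5 \
    -I sample2.counts.hdf5 \
    --contig-ploidy-priors ploidy_priors.tsv \
    --output-prefix ploidy \
    -O ploidy-calls/
This writes ploidy-calls/ploidy-model and ploidy-calls/ploidy-calls.
3. Call germline CNVs (annotated_intervals.tsv comes from AnnotateIntervals and is optional)
gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L preprocessed.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -I sample1.counts.hdf5 \
    -I sample2.counts.hdf5 \
    --contig-ploidy-calls ploidy-calls/ploidy-calls \
    --annotated-intervals annotated_intervals.tsv \
    --output-prefix cohort \
    -O germline_cnv_calls/
4. Post-process calls per sample (--sample-index N selects the SAMPLE_N directory under cohort-calls)
gatk PostprocessGermlineCNVCalls \
    --calls-shard-path germline_cnv_calls/cohort-calls \
    --model-shard-path germline_cnv_calls/cohort-model \
    --sample-index 0 \
    --contig-ploidy-calls ploidy-calls/ploidy-calls \
    --sequence-dictionary reference.dict \
    --output-genotyped-intervals sample1.genotyped_intervals.vcf \
    --output-genotyped-segments sample1.genotyped_segments.vcf \
    --output-denoised-copy-ratios sample1.denoised.tsv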
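Post-processing addresses samples by index, so it helps to confirm which sample each index refers to before looping over a cohort. The sketch below assumes the calls directory contains one SAMPLE_<n> subdirectory per sample, each holding a sample_name.txt file; list_sample_indices is a hypothetical helper, and the layout should be verified against your GATK version.

```shell
#!/usr/bin/env bash
# List "<index><TAB><sample name>" for every SAMPLE_<n> directory in a calls directory.
# Assumption: each SAMPLE_<n> directory contains a sample_name.txt file.
list_sample_indices() {
    local calls_dir=$1 d idx name
    for d in "$calls_dir"/SAMPLE_*; do
        [ -d "$d" ] || continue
        idx=$(basename "$d")
        idx=${idx#SAMPLE_}
        if [ -f "$d/sample_name.txt" ]; then
            name=$(head -n 1 "$d/sample_name.txt")
        else
            name=unknown
        fi
        printf '%s\t%s\n' "$idx" "$name"
    done | sort -n
}
```

The first column of list_sample_indices germline_cnv_calls/cohort-calls can then drive a loop that sets --sample-index for each sample.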
Complete Somatic Pipeline Script
#!/bin/bash
set -euo pipefail
REFERENCE=reference.fa
INTERVALS=preprocessed.interval_list   # must be the interval list the PoN was built with
PON=cnv_pon.hdf5
SNP_SITES=common_snps.vcf
TUMOR=$1
NORMAL=$2
OUTDIR=$3
mkdir -p "$OUTDIR"
# Collect read counts (the normal counts are not consumed later in this script)
gatk CollectReadCounts -R "$REFERENCE" -I "$TUMOR" -L "$INTERVALS" \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O "$OUTDIR/tumor.counts.hdf5"
gatk CollectReadCounts -R "$REFERENCE" -I "$NORMAL" -L "$INTERVALS" \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O "$OUTDIR/normal.counts.hdf5"
# Denoise
gatk DenoiseReadCounts -I "$OUTDIR/tumor.counts.hdf5" \
    --count-panel-of-normals "$PON" \
    --standardized-copy-ratios "$OUTDIR/tumor.standardized.tsv" \
    --denoised-copy-ratios "$OUTDIR/tumor.denoised.tsv"
# Allelic counts
gatk CollectAllelicCounts -R "$REFERENCE" -I "$TUMOR" -L "$SNP_SITES" \
    -O "$OUTDIR/tumor.allelicCounts.tsv"
gatk CollectAllelicCounts -R "$REFERENCE" -I "$NORMAL" -L "$SNP_SITES" \
    -O "$OUTDIR/normal.allelicCounts.tsv"
# Model and call
gatk ModelSegments \
    --denoised-copy-ratios "$OUTDIR/tumor.denoised.tsv" \
    --allelic-counts "$OUTDIR/tumor.allelicCounts.tsv" \
    --normal-allelic-counts "$OUTDIR/normal.allelicCounts.tsv" \
    --output-prefix tumor -O "$OUTDIR/"
gatk CallCopyRatioSegments -I "$OUTDIR/tumor.cr.seg" -O "$OUTDIR/tumor.called.seg"
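A pipeline like this can run for a long time before failing on a missing input. One option is to check the inputs up front. This is a minimal sketch; require_files is a hypothetical helper, not part of the script above or of GATK.

```shell
#!/usr/bin/env bash
# Return non-zero (and report on stderr) if any listed file is missing or empty.
require_files() {
    local f missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage inside a pipeline script (not run here):
#   require_files "$REFERENCE" "$INTERVALS" "$PON" "$SNP_SITES" "$TUMOR" "$NORMAL" || exit 1
```

Checking every file before reporting, rather than stopping at the first failure, lets one run surface all missing inputs at once.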
Key Output Files
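| File | Description |
| --- | --- |
| .counts.hdf5 | Raw read counts per interval |
| .denoised.tsv | Denoised log2 copy ratios per interval |
| .cr.seg | Copy-ratio segments; input to CallCopyRatioSegments |
| .modelFinal.seg | Final segments with posterior summaries of log2 copy ratio and minor-allele fraction |
| .called.seg | Segments called as amplified (+), deleted (-), or neutral (0) |
| .hets.tsv | Allelic counts at sites genotyped as heterozygous |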
Related Skills
- copy-number/cnvkit-analysis - Alternative CNV caller
- copy-number/cnv-visualization - Plotting results
- alignment-files/bam-statistics - Input BAM QC
- variant-calling/variant-calling - SNP calling for allelic counts