bio-gatk-variant-calling

GATK HaplotypeCaller is a widely used caller for germline short variants (SNPs and indels). This skill covers the GATK Best Practices workflow, from a preprocessed BAM to a filtered VCF.

Install skill "bio-gatk-variant-calling" with this command: npx skills add gptomics/bioskills/gptomics-bioskills-bio-gatk-variant-calling

GATK Variant Calling

Prerequisites

BAM files should be preprocessed:

  • Mark duplicates

  • Base quality score recalibration (BQSR) - optional but recommended
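Before calling, it can help to confirm that the companion files GATK reads alongside its inputs are present. The sketch below assumes the usual naming: a `.bai` index next to the BAM, and `.fai` and `.dict` files next to an uncompressed `.fa` or `.fasta` reference. `check_gatk_inputs` is a hypothetical helper, not part of GATK.

```shell
# Hypothetical pre-flight check: returns non-zero if a companion file is missing.
check_gatk_inputs() {
    local bam="$1" ref="$2" status=0

    # The BAM index may be named sample.bam.bai or sample.bai
    if [ ! -e "${bam}.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
        echo "missing BAM index for ${bam} (try: samtools index ${bam})" >&2
        status=1
    fi
    # FASTA index: reference.fa.fai
    if [ ! -e "${ref}.fai" ]; then
        echo "missing ${ref}.fai (try: samtools faidx ${ref})" >&2
        status=1
    fi
    # Sequence dictionary: reference.dict (extension replaced, not appended)
    if [ ! -e "${ref%.*}.dict" ]; then
        echo "missing ${ref%.*}.dict (try: gatk CreateSequenceDictionary -R ${ref})" >&2
        status=1
    fi
    return "$status"
}
```

Running `check_gatk_inputs sample.bam reference.fa` before the first HaplotypeCaller command surfaces a missing index early instead of partway through a long job.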

Single-Sample Calling

Basic HaplotypeCaller

gatk HaplotypeCaller \
    -R reference.fa \
    -I sample.bam \
    -O sample.vcf.gz

With Standard Annotations

gatk HaplotypeCaller \
    -R reference.fa \
    -I sample.bam \
    -O sample.vcf.gz \
    -A Coverage \
    -A QualByDepth \
    -A FisherStrand \
    -A StrandOddsRatio \
    -A MappingQualityRankSumTest \
    -A ReadPosRankSumTest

Target Intervals (Exome/Panel)

gatk HaplotypeCaller \
    -R reference.fa \
    -I sample.bam \
    -L targets.interval_list \
    -O sample.vcf.gz
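Capture kits are often distributed as BED files, which use 0-based, half-open coordinates, whereas GATK-style interval strings (`chr:start-end`) are 1-based and inclusive. As a minimal sketch, assuming a tab-separated BED with at least three columns, the hypothetical `bed_to_intervals` helper below converts one into a plain list of interval strings. GATK is documented to accept such a list through `-L` when the file name ends in `.list` or `.intervals`; it can also read `.bed` directly, so the conversion is mainly useful for checking coordinates by eye.

```shell
# Hypothetical helper: convert BED (0-based, half-open) into 1-based,
# inclusive chr:start-end strings, skipping BED header lines.
bed_to_intervals() {
    awk -F '\t' '
        /^(#|track|browser)/ { next }                    # header or comment line
        NF >= 3 { printf "%s:%d-%d\n", $1, $2 + 1, $3 }  # shift start by one
    ' "$1"
}
```

For example, `bed_to_intervals targets.bed > targets.intervals` writes a file that can be passed as `-L targets.intervals`; the file names here are placeholders.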

Adjust Calling Confidence

gatk HaplotypeCaller \
    -R reference.fa \
    -I sample.bam \
    -O sample.vcf.gz \
    --standard-min-confidence-threshold-for-calling 20

GVCF Workflow (Recommended for Cohorts)

The GVCF workflow enables joint genotyping across samples for better variant calls.

Step 1: Generate GVCFs per Sample

gatk HaplotypeCaller \
    -R reference.fa \
    -I sample.bam \
    -O sample.g.vcf.gz \
    -ERC GVCF

Step 2: Combine GVCFs (GenomicsDBImport)

# Create a sample map file (tab-separated: sample name, then GVCF path)
# sample_map.txt:
# sample1	/path/to/sample1.g.vcf.gz
# sample2	/path/to/sample2.g.vcf.gz

gatk GenomicsDBImport \
    --genomicsdb-workspace-path genomicsdb \
    --sample-name-map sample_map.txt \
    -L intervals.interval_list
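For more than a handful of samples the map is easier to generate than to type. The sketch below assumes every GVCF sits in one directory and is named `<sample>.g.vcf.gz`, so the sample name can be taken from the file name; `make_sample_map` is a hypothetical helper, and the names it writes should be checked against the sample names you expect in the final VCF.

```shell
# Hypothetical helper: print "sample<TAB>path" for every GVCF in a directory.
# Assumes files are named <sample>.g.vcf.gz.
make_sample_map() {
    local dir="$1" f
    for f in "$dir"/*.g.vcf.gz; do
        [ -e "$f" ] || continue    # no matches: the glob stays literal, skip it
        printf '%s\t%s\n' "$(basename "$f" .g.vcf.gz)" "$f"
    done
}
```

Redirecting the output, as in `make_sample_map /path/to/gvcfs > sample_map.txt`, produces a file in the layout shown above.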

Alternative: CombineGVCFs (smaller cohorts)

gatk CombineGVCFs \
    -R reference.fa \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -V sample3.g.vcf.gz \
    -O cohort.g.vcf.gz

Step 3: Joint Genotyping

# From GenomicsDB
gatk GenotypeGVCFs \
    -R reference.fa \
    -V gendb://genomicsdb \
    -O cohort.vcf.gz

# From combined GVCF
gatk GenotypeGVCFs \
    -R reference.fa \
    -V cohort.g.vcf.gz \
    -O cohort.vcf.gz

Variant Quality Score Recalibration (VQSR)

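VQSR is a machine learning-based filter: it trains a model on annotation values at known variant sites and uses that model to score every call. It needs many variants to train reliably, so whole-genome data is preferred.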
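Since VQSR depends on having enough variants, a quick record count can inform the choice between VQSR and hard filtering. No threshold is given here, so the sketch below only reports the number; `count_vcf_records` is a hypothetical helper and assumes a gzip- or bgzip-compressed VCF.

```shell
# Hypothetical helper: count non-header records in a compressed VCF.
count_vcf_records() {
    # bgzip output is gzip-compatible, so gzip -dc can read it
    gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}
```

For example, `count_vcf_records cohort.vcf.gz` prints the total number of variant records in the joint-genotyped callset.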

SNP Recalibration

# Build SNP model
gatk VariantRecalibrator \
    -R reference.fa \
    -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
    --resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -O snp.recal \
    --tranches-file snp.tranches

# Apply SNP filter
gatk ApplyVQSR \
    -R reference.fa \
    -V cohort.vcf.gz \
    -O cohort.snp_recal.vcf.gz \
    --recal-file snp.recal \
    --tranches-file snp.tranches \
    --truth-sensitivity-filter-level 99.5 \
    -mode SNP

Indel Recalibration

# Build Indel model
gatk VariantRecalibrator \
    -R reference.fa \
    -V cohort.snp_recal.vcf.gz \
    --resource:mills,known=false,training=true,truth=true,prior=12.0 Mills.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode INDEL \
    --max-gaussians 4 \
    -O indel.recal \
    --tranches-file indel.tranches

# Apply Indel filter
gatk ApplyVQSR \
    -R reference.fa \
    -V cohort.snp_recal.vcf.gz \
    -O cohort.vqsr.vcf.gz \
    --recal-file indel.recal \
    --tranches-file indel.tranches \
    --truth-sensitivity-filter-level 99.0 \
    -mode INDEL

Hard Filtering (When VQSR Not Suitable)

Use hard filters for small datasets, exomes, or single samples, where there are too few variants for VQSR to train a reliable model.

Extract SNPs and Indels

gatk SelectVariants \
    -R reference.fa \
    -V cohort.vcf.gz \
    --select-type-to-include SNP \
    -O snps.vcf.gz

gatk SelectVariants \
    -R reference.fa \
    -V cohort.vcf.gz \
    --select-type-to-include INDEL \
    -O indels.vcf.gz

Apply Hard Filters

# Filter SNPs
gatk VariantFiltration \
    -R reference.fa \
    -V snps.vcf.gz \
    -O snps.filtered.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    --filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    --filter-expression "SOR > 3.0" --filter-name "SOR3"
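Each expression describes a failing condition: a record matching it is kept in the output but tagged with the paired filter name in its FILTER column. The sketch below mirrors the six SNP thresholds above for a single set of annotation values, which can help when reasoning about why a site was flagged. `snp_filter_names` is a hypothetical helper; it does not model how GATK handles missing annotations, and the order of names in a real FILTER column may differ.

```shell
# Hypothetical helper: given QD FS MQ MQRankSum ReadPosRankSum SOR (in that
# order), print the SNP filter names that would apply, or PASS if none do.
snp_filter_names() {
    awk -v qd="$1" -v fs="$2" -v mq="$3" -v mqrs="$4" -v rprs="$5" -v sor="$6" '
    BEGIN {
        out = ""
        if (qd   <   2.0) out = out ";QD2"
        if (fs   >  60.0) out = out ";FS60"
        if (mq   <  40.0) out = out ";MQ40"
        if (mqrs < -12.5) out = out ";MQRankSum-12.5"
        if (rprs <  -8.0) out = out ";ReadPosRankSum-8"
        if (sor  >   3.0) out = out ";SOR3"
        result = (out == "") ? "PASS" : substr(out, 2)   # drop leading ";"
        print result
    }'
}
```

For instance, `snp_filter_names 1.5 1.2 60 0.1 0.3 0.9` reports only `QD2`, because QD is the one value on the failing side of its threshold.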

# Filter Indels
gatk VariantFiltration \
    -R reference.fa \
    -V indels.vcf.gz \
    -O indels.filtered.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 200.0" --filter-name "FS200" \
    --filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
    --filter-expression "SOR > 10.0" --filter-name "SOR10"

Merge Filtered Variants

gatk MergeVcfs \
    -I snps.filtered.vcf.gz \
    -I indels.filtered.vcf.gz \
    -O cohort.filtered.vcf.gz

Base Quality Score Recalibration (BQSR)

BQSR is a preprocessing step that corrects systematic errors in base quality scores. Run it on each BAM before HaplotypeCaller.

Step 1: BaseRecalibrator

gatk BaseRecalibrator \
    -R reference.fa \
    -I sample.bam \
    --known-sites dbsnp.vcf.gz \
    --known-sites known_indels.vcf.gz \
    -O recal_data.table

Step 2: ApplyBQSR

gatk ApplyBQSR \
    -R reference.fa \
    -I sample.bam \
    --bqsr-recal-file recal_data.table \
    -O sample.recal.bam

Parallel Processing

Scatter by Interval

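# Split calling across intervals; each HaplotypeCaller runs as a background job
for interval in chr{1..22} chrX chrY; do
    gatk HaplotypeCaller \
        -R reference.fa \
        -I sample.bam \
        -L "$interval" \
        -O "sample.${interval}.g.vcf.gz" \
        -ERC GVCF &
done
wait  # block until every background job has finished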

# Gather GVCFs
gatk GatherVcfs \
    -I sample.chr1.g.vcf.gz \
    -I sample.chr2.g.vcf.gz \
    ... \
    -O sample.g.vcf.gz
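GatherVcfs is documented as expecting its inputs in genomic order, and a shell glob such as `sample.chr*.g.vcf.gz` sorts lexically (chr1, chr10, chr11, ...), which does not match the scatter loop. As a sketch, the hypothetical `gather_inputs` helper below prints the `-I` arguments in the loop's order, assuming the `sample.<interval>.g.vcf.gz` naming used above and a reference whose contigs run chr1 to chr22, then chrX, then chrY.

```shell
# Hypothetical helper: print -I arguments in scatter order
# (chr1..chr22, chrX, chrY), one token per line.
gather_inputs() {
    local sample="$1" i=1
    while [ "$i" -le 22 ]; do
        printf '%s\n' -I "${sample}.chr${i}.g.vcf.gz"
        i=$((i + 1))
    done
    printf '%s\n' -I "${sample}.chrX.g.vcf.gz" -I "${sample}.chrY.g.vcf.gz"
}
```

With paths that contain no whitespace, the output can be spliced into the command through unquoted substitution: `gatk GatherVcfs $(gather_inputs sample) -O sample.g.vcf.gz`.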

Native PairHMM Threads

gatk HaplotypeCaller \
    -R reference.fa \
    -I sample.bam \
    -O sample.vcf.gz \
    --native-pair-hmm-threads 4

CNN Score Variant Filter (Deep Learning)

An alternative to VQSR that scores variants with a convolutional neural network.

Score Variants

gatk CNNScoreVariants \
    -R reference.fa \
    -V cohort.vcf.gz \
    -O cohort.cnn_scored.vcf.gz \
    --tensor-type reference

Filter by CNN Score

gatk FilterVariantTranches \
    -V cohort.cnn_scored.vcf.gz \
    -O cohort.cnn_filtered.vcf.gz \
    --resource hapmap.vcf.gz \
    --resource mills.vcf.gz \
    --info-key CNN_1D \
    --snp-tranche 99.95 \
    --indel-tranche 99.4

Complete Single-Sample Pipeline

#!/bin/bash
SAMPLE=$1
REF=reference.fa
DBSNP=dbsnp.vcf.gz
KNOWN_INDELS=known_indels.vcf.gz

# BQSR
gatk BaseRecalibrator -R "$REF" -I "${SAMPLE}.bam" \
    --known-sites "$DBSNP" --known-sites "$KNOWN_INDELS" \
    -O "${SAMPLE}.recal.table"

gatk ApplyBQSR -R "$REF" -I "${SAMPLE}.bam" \
    --bqsr-recal-file "${SAMPLE}.recal.table" \
    -O "${SAMPLE}.recal.bam"

# Call variants
gatk HaplotypeCaller -R "$REF" -I "${SAMPLE}.recal.bam" \
    -O "${SAMPLE}.g.vcf.gz" -ERC GVCF

# Single-sample genotyping
gatk GenotypeGVCFs -R "$REF" -V "${SAMPLE}.g.vcf.gz" \
    -O "${SAMPLE}.vcf.gz"

# Hard filter
gatk VariantFiltration -R "$REF" -V "${SAMPLE}.vcf.gz" \
    -O "${SAMPLE}.filtered.vcf.gz" \
    --filter-expression "QD < 2.0" --filter-name "LowQD" \
    --filter-expression "FS > 60.0" --filter-name "HighFS" \
    --filter-expression "MQ < 40.0" --filter-name "LowMQ"

Key Annotations

Annotation       Description              Good Values
QD               Quality by Depth         > 2.0
FS               Fisher Strand            < 60 (SNP), < 200 (Indel)
SOR              Strand Odds Ratio        < 3 (SNP), < 10 (Indel)
MQ               Mapping Quality          > 40
MQRankSum        MQ Rank Sum Test         > -12.5
ReadPosRankSum   Read Position Rank Sum   > -8.0 (SNP), > -20.0 (Indel)

Resource Files

Resource       Use
dbSNP          Known variants (prior=2.0)
HapMap         Training/truth SNPs (prior=15.0)
Omni           Training SNPs (prior=12.0)
1000G SNPs     Training SNPs (prior=10.0)
Mills Indels   Training/truth indels (prior=12.0)

Related Skills

  • variant-calling - bcftools alternative

  • alignment-files - BAM preprocessing

  • filtering-best-practices - Post-calling filtering

  • variant-normalization - Normalize before annotation

  • vep-snpeff-annotation - Annotate final calls
