Polygenic Risk Score (PRS) Builder
Build and interpret polygenic risk scores for complex diseases using genome-wide association study (GWAS) data.
Overview
Use Cases:
- "Calculate my genetic risk for type 2 diabetes"
- "Build a polygenic risk score for coronary artery disease"
- "What's my genetic predisposition to Alzheimer's disease?"
- "Interpret my PRS percentile for breast cancer risk"
What This Skill Does:
- Extracts genome-wide significant variants (p < 5e-8) from the GWAS Catalog
- Builds weighted PRS models using effect sizes (beta coefficients)
- Calculates individual risk scores from genotype data
- Interprets PRS as population percentiles and risk categories
What This Skill Does NOT Do:
- Diagnose disease (PRS is probabilistic, not deterministic)
- Replace clinical assessment or genetic counseling
- Account for non-genetic factors (lifestyle, environment)
- Provide treatment recommendations
Methodology
PRS Calculation Formula
A polygenic risk score is calculated as a weighted sum across genetic variants:
PRS = Σ (dosage_i × effect_size_i)
Where:
- dosage_i: Number of effect alleles at SNP i (0, 1, or 2)
- effect_size_i: Beta coefficient or log(odds ratio) from GWAS
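As a minimal sketch of the formula above, the weighted sum can be written in a few lines of Python. The function name `weighted_prs`, the dictionary layout, and the rsIDs and weights are made up for illustration; they are not part of this skill's API.

```python
def weighted_prs(dosages, effect_sizes):
    """Sum dosage_i * effect_size_i over SNPs present in both inputs.

    dosages:      rsID -> effect-allele count (0, 1, or 2)
    effect_sizes: rsID -> beta or ln(OR)
    Returns (score, number of SNPs used).
    """
    shared = sorted(set(dosages) & set(effect_sizes))
    score = sum(dosages[rsid] * effect_sizes[rsid] for rsid in shared)
    return score, len(shared)


# Hypothetical weights and genotypes, for illustration only.
effect_sizes = {"rs0001": 0.10, "rs0002": -0.05, "rs0003": 0.20}
dosages = {"rs0001": 2, "rs0002": 1, "rs0004": 1}  # rs0003 was not genotyped

score, n_used = weighted_prs(dosages, effect_sizes)
print(f"PRS = {score:.2f} from {n_used} SNPs")
```

SNPs missing from either input drop out of the sum, so it is worth reporting how many SNPs actually contributed alongside the score.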
Standardization
Raw PRS is standardized to z-scores for interpretation:
z-score = (PRS - population_mean) / population_std
This allows comparison to population distribution and percentile calculation.
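The sketch below illustrates the standardization step and the percentile lookup, assuming the PRS is normally distributed in the reference population. The function names and the population mean and standard deviation are hypothetical values chosen for the example.

```python
import math


def prs_z_score(raw_prs, population_mean, population_std):
    """Standardize a raw PRS against a reference population."""
    if population_std <= 0:
        raise ValueError("population_std must be positive")
    return (raw_prs - population_mean) / population_std


def z_to_percentile(z):
    """Percentile under a standard normal distribution (normal CDF x 100)."""
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical reference values, for illustration only.
z = prs_z_score(raw_prs=1.3, population_mean=0.9, population_std=0.25)
print(f"z = {z:.2f}, percentile = {z_to_percentile(z):.1f}")
```

The percentile is only as good as the reference mean and standard deviation, which should come from a population matched to the individual's ancestry.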
Significance Thresholds
- Genome-wide significance: p < 5×10⁻⁸ (default threshold)
- This corrects for ~1 million independent tests across the genome
- Relaxed thresholds (e.g., p < 1×10⁻⁵) can include more SNPs but may add noise
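The arithmetic behind the default threshold, and the effect of relaxing it, can be sketched as follows. This assumes the conventional family-wise alpha of 0.05; the association records and the helper `select_by_p_value` are hypothetical.

```python
ALPHA = 0.05
N_INDEPENDENT_TESTS = 1_000_000  # approximate figure quoted above

# Bonferroni correction: 0.05 / 1,000,000 = 5e-8
bonferroni_threshold = ALPHA / N_INDEPENDENT_TESTS


def select_by_p_value(associations, p_threshold=bonferroni_threshold):
    """Keep associations whose p-value is strictly below the threshold."""
    return [a for a in associations if a["p_value"] < p_threshold]


# Hypothetical association records, for illustration only.
associations = [
    {"rsid": "rs0001", "p_value": 3e-12},
    {"rsid": "rs0002", "p_value": 2e-7},
    {"rsid": "rs0003", "p_value": 4e-6},
]

strict = select_by_p_value(associations)
relaxed = select_by_p_value(associations, p_threshold=1e-5)
print(len(strict), "SNP(s) at 5e-8;", len(relaxed), "SNP(s) at 1e-5")
```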
Effect Size Handling
- Continuous traits (e.g., height, BMI): Beta coefficient (units of trait per allele)
- Binary traits (e.g., disease): Odds ratio converted to log-odds (beta = ln(OR))
- Missing effect sizes or non-significant SNPs are excluded
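One way to apply these rules is sketched below. The record fields (`odds_ratio`, `beta`) and the helper `to_beta` are assumptions made for the example and may not match the field names returned by the GWAS tools.

```python
import math


def to_beta(record, binary_trait):
    """Return the per-allele weight for one association, or None to exclude it."""
    if binary_trait:
        odds_ratio = record.get("odds_ratio")
        if odds_ratio is None or odds_ratio <= 0:
            return None  # missing or invalid OR: exclude
        return math.log(odds_ratio)  # beta = ln(OR)
    return record.get("beta")  # continuous trait: use beta directly


# Hypothetical records for a binary trait, for illustration only.
records = [
    {"rsid": "rs0001", "odds_ratio": 1.5},
    {"rsid": "rs0002", "odds_ratio": 0.8},   # protective allele -> negative weight
    {"rsid": "rs0003", "odds_ratio": None},  # missing effect size -> excluded
]

weights = {}
for record in records:
    beta = to_beta(record, binary_trait=True)
    if beta is not None:
        weights[record["rsid"]] = beta

print(weights)
```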
Data Sources
This skill uses ToolUniverse GWAS tools to query:
- GWAS Catalog (EMBL-EBI)
  - Curated GWAS associations
  - 5000+ studies, millions of variants
  - Tools: gwas_get_associations_for_trait, gwas_get_snp_by_id
- Open Targets Genetics
  - Integrated genetics platform
  - Fine-mapped credible sets
  - Tools: OpenTargets_search_gwas_studies_by_disease, OpenTargets_get_variant_info
Key Concepts
Polygenic Risk Scores (PRS)
Polygenic risk scores aggregate the effects of many genetic variants to estimate an individual's genetic predisposition to a trait or disease. Unlike Mendelian diseases caused by single mutations, complex diseases involve hundreds to thousands of variants, each with small effects.
Key Properties:
- Continuous distribution: PRS forms a bell curve in populations
- Relative risk: Compares individual to population average
- Probabilistic: High PRS doesn't guarantee disease, low PRS doesn't guarantee protection
- Ancestry-specific: PRS accuracy depends on matching GWAS and target ancestry
GWAS (Genome-Wide Association Studies)
GWAS compare allele frequencies between cases and controls (or correlate with trait values) across millions of SNPs to identify disease-associated variants.
Study Design:
- Discovery cohort: Initial identification of associations
- Replication cohort: Validation in independent samples
- Sample size: Larger studies detect smaller effects (standard errors shrink roughly as 1/√N)
- Multiple testing correction: Bonferroni-type correction for ~1M tests
Effect Sizes and Odds Ratios
- Beta (β): Change in trait per copy of effect allele
- Example: β = 0.5 kg/m² means each allele increases BMI by 0.5 units
- Odds Ratio (OR): Multiplicative change in disease odds
- OR = 1.5 means 50% increased odds per allele
- Convert to beta: β = ln(OR)
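An odds ratio multiplies the odds, not the probability, so "50% increased odds" is not the same as "50% increased risk". The worked example below assumes a hypothetical 10% baseline risk and effects that add on the log-odds scale; the function name is illustrative.

```python
def risk_after_odds_ratio(baseline_risk, odds_ratio, copies=1):
    """Apply an odds ratio `copies` times to a baseline probability."""
    if not 0 < baseline_risk < 1:
        raise ValueError("baseline_risk must be strictly between 0 and 1")
    baseline_odds = baseline_risk / (1 - baseline_risk)
    new_odds = baseline_odds * odds_ratio ** copies
    return new_odds / (1 + new_odds)


# Hypothetical 10% baseline risk, OR = 1.5 per effect allele.
one_copy = risk_after_odds_ratio(0.10, 1.5, copies=1)
two_copies = risk_after_odds_ratio(0.10, 1.5, copies=2)
print(f"1 copy: {one_copy:.1%}, 2 copies: {two_copies:.1%}")
```

With these assumed numbers, one copy raises the risk from 10% to about 14.3% rather than 15%, and two copies raise it to 20%.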
Linkage Disequilibrium (LD) and Clumping
Nearby variants are often inherited together (LD). To avoid double-counting:
- LD clumping: Select independent variants (r² < 0.1 within 1 Mb windows)
- Fine-mapping: Statistical methods to identify causal variants
- This skill uses raw associations; production PRS should include LD pruning
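A greedy clumping pass can be sketched as below. It is a simplified illustration, not what this skill runs: it assumes pairwise r² values are already available (here as a hypothetical lookup dictionary), treats missing pairs as unlinked, and interprets the window as a maximum distance from each retained lead SNP. Dedicated tools such as PLINK (`--clump`) are the usual choice in practice.

```python
def clump(snps, r2, r2_threshold=0.1, window_bp=1_000_000):
    """Greedy clumping: keep the most significant SNP, drop nearby SNPs in LD with it.

    snps: list of dicts with "rsid", "chrom", "pos", "p_value"
    r2:   dict mapping frozenset({rsid_a, rsid_b}) -> pairwise r^2
          (pairs absent from the dict are treated as r^2 = 0)
    """
    kept = []
    for snp in sorted(snps, key=lambda s: s["p_value"]):
        in_ld_with_lead = any(
            snp["chrom"] == lead["chrom"]
            and abs(snp["pos"] - lead["pos"]) <= window_bp
            and r2.get(frozenset((snp["rsid"], lead["rsid"])), 0.0) >= r2_threshold
            for lead in kept
        )
        if not in_ld_with_lead:
            kept.append(snp)
    return kept


# Hypothetical SNPs and LD values, for illustration only.
snps = [
    {"rsid": "rs0001", "chrom": "1", "pos": 1_000_000, "p_value": 1e-20},
    {"rsid": "rs0002", "chrom": "1", "pos": 1_050_000, "p_value": 1e-10},
    {"rsid": "rs0003", "chrom": "1", "pos": 1_200_000, "p_value": 1e-9},
    {"rsid": "rs0004", "chrom": "2", "pos": 500_000, "p_value": 1e-12},
]
r2 = {
    frozenset(("rs0001", "rs0002")): 0.80,  # tags the same signal as rs0001
    frozenset(("rs0001", "rs0003")): 0.02,  # effectively independent
}

independent = [s["rsid"] for s in clump(snps, r2)]
print(independent)
```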
Population Stratification
GWAS and PRS are most accurate when ancestries match:
- Population structure: Different ancestries have different allele frequencies
- Transferability: European-trained PRS perform worse in non-European populations
- Solution: Train PRS on diverse cohorts or use ancestry-matched references
Applications
Clinical Risk Assessment
PRS can stratify individuals for:
- Screening programs: Target high-risk individuals (e.g., mammography, colonoscopy)
- Prevention strategies: Lifestyle interventions for high genetic risk
- Drug response: Pharmacogenomics based on metabolism genes
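Example: Khera et al. (2018) reported that a genome-wide PRS identified about 8% of the population at ≥3-fold increased risk of coronary artery disease, roughly 20× the carrier frequency of rare monogenic mutations conferring comparable risk.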
Research Applications
- Gene discovery: PRS-based phenome-wide association studies (PheWAS)
- Genetic correlation: Compare PRS across traits
- Causal inference: Mendelian randomization using PRS as instruments
- Simulation studies: Model polygenic architecture
Personal Genomics
Consumer genetic testing (23andMe, Ancestry DNA) provides raw genotypes. Users can:
- Calculate PRS for traits not reported
- Compare to published PRS models
- Understand genetic contribution vs. lifestyle factors
Caution: Personal PRS should not replace medical advice. Results may cause anxiety if not properly contextualized.
Limitations and Considerations
Scientific Limitations
- Heritability Gap: PRS explains a fraction of genetic heritability
  - Type 2 diabetes: ~50% heritable, PRS explains ~10-20%
  - Rare variants, epistasis, and gene-environment interactions not captured
- Ancestry Bias: Most GWAS participants are of European ancestry
  - PRS accuracy drops in non-European populations
  - Need for diverse cohort recruitment
- Winner's Curse: Discovery effect sizes often overestimated
  - Replication studies show smaller effects
  - Meta-analyses provide better estimates
- Missing Heritability: Unexplained genetic contribution from:
  - Rare variants not captured by SNP arrays
  - Structural variants (CNVs, inversions)
  - Epigenetic factors
Clinical Limitations
- Not Diagnostic: PRS is probabilistic, not deterministic
  - High PRS doesn't mean you will get disease
  - Low PRS doesn't mean you won't get disease
- Environmental Factors: Many complex diseases are 50%+ environmental
  - Smoking, diet, exercise, stress, pollution
  - PRS doesn't account for these
- Pleiotropy: Same variants affect multiple traits
  - Genetic correlation between diseases
  - Risk for one may protect against another
- Actionability: Not all high-risk predictions have interventions
  - Alzheimer's PRS has limited actionability currently
  - Ethical considerations for testing
Ethical Considerations
- Privacy: Genetic data is identifiable and permanent
  - Can't be changed like passwords
  - Familial implications (relatives share genetics)
- Discrimination: Potential for genetic discrimination
  - GINA prohibits genetic discrimination in health insurance and employment (US)
  - Life, disability, and long-term care insurance are not covered by GINA
- Psychological Impact: Knowledge of high risk can cause anxiety
  - Need for genetic counseling
  - Risk communication training
- Equity: Ancestry bias means unequal benefits
  - Europeans benefit most from current PRS
  - Exacerbates health disparities
References
Key Publications
- Lambert et al. (2021): "The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation"
  - Nature Genetics, 53:420–425
  - PGS Catalog: https://www.pgscatalog.org/
  - Repository of published PRS models
- Khera et al. (2018): "Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations"
  - Nature Genetics, 50:1219–1224
  - Demonstrated clinical utility of PRS
- Torkamani et al. (2018): "The personal and clinical utility of polygenic risk scores"
  - Nature Reviews Genetics, 19:581–590
  - Comprehensive review of PRS applications
- Martin et al. (2019): "Clinical use of current polygenic risk scores may exacerbate health disparities"
  - Nature Genetics, 51:584–591
  - Addresses ancestry bias and equity concerns
- Choi et al. (2020): "Tutorial: a guide to performing polygenic risk score analyses"
  - Nature Protocols, 15:2759–2772
  - Practical guide to PRS calculation and evaluation
Resources
- PGS Catalog: https://www.pgscatalog.org/ - Published PRS models
- LD Hub: http://ldsc.broadinstitute.org/ - Genetic correlations
- PRSice: https://www.prsice.info/ - PRS calculation software
- GWAS Catalog: https://www.ebi.ac.uk/gwas/ - Association database
Workflow
1. Trait Selection
Identify the disease or trait of interest:
- Use standard terminology (e.g., "type 2 diabetes" not "T2D")
- Check GWAS Catalog for availability
- Verify that sufficiently large GWAS exist (ideally n > 10,000 samples)
2. Association Collection
Query GWAS databases for genome-wide significant associations:
prs = build_polygenic_risk_score(
    trait="coronary artery disease",
    p_threshold=5e-8,  # Genome-wide significance
    max_snps=1000
)
Considerations:
- P-value threshold: 5e-8 is conservative, 1e-5 includes more variants
- LD clumping: Production systems should prune correlated SNPs
- Study quality: Prefer large meta-analyses over small studies
3. Effect Size Extraction
Extract beta coefficients or odds ratios:
- Beta for continuous traits (direct use)
- OR for binary traits (convert to log-odds)
- Handle missing values (exclude or impute from meta-analysis)
4. SNP Filtering
Quality control filters:
- MAF filter: Exclude rare variants (MAF < 0.01) for robustness
- Genotype QC: Remove SNPs with high missingness (> 10%)
- Hardy-Weinberg: Exclude SNPs violating HWE (p < 1e-6)
- Ambiguous SNPs: Remove A/T and G/C SNPs (strand ambiguity)
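The filters above can be combined into a single predicate, sketched below. It assumes per-SNP MAF, missingness, and HWE p-values have already been computed elsewhere (for example by a genotype QC tool); the field names and the helper functions are hypothetical.

```python
# A/T and C/G pairs read the same on both strands, so the strand cannot be inferred.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele_a, allele_b):
    """True for A/T and C/G SNPs."""
    return frozenset((allele_a.upper(), allele_b.upper())) in AMBIGUOUS_PAIRS


def passes_qc(snp, min_maf=0.01, max_missing=0.10, min_hwe_p=1e-6):
    """Apply the MAF, missingness, HWE, and strand-ambiguity filters."""
    return (
        snp["maf"] >= min_maf
        and snp["missing_rate"] <= max_missing
        and snp["hwe_p"] >= min_hwe_p
        and not is_strand_ambiguous(snp["effect_allele"], snp["other_allele"])
    )


# Hypothetical SNP summaries, for illustration only.
snps = [
    {"rsid": "rs0001", "maf": 0.25, "missing_rate": 0.01, "hwe_p": 0.4,
     "effect_allele": "A", "other_allele": "G"},   # passes
    {"rsid": "rs0002", "maf": 0.005, "missing_rate": 0.01, "hwe_p": 0.6,
     "effect_allele": "C", "other_allele": "T"},   # fails MAF
    {"rsid": "rs0003", "maf": 0.30, "missing_rate": 0.02, "hwe_p": 0.5,
     "effect_allele": "A", "other_allele": "T"},   # strand-ambiguous
    {"rsid": "rs0004", "maf": 0.12, "missing_rate": 0.02, "hwe_p": 1e-9,
     "effect_allele": "G", "other_allele": "A"},   # fails HWE
]

kept = [s["rsid"] for s in snps if passes_qc(s)]
print(kept)
```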
5. Score Calculation
Calculate weighted sum of genotype dosages:
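result = calculate_personal_prs(
    prs_weights=prs,         # model returned by build_polygenic_risk_score
    genotypes=my_genotypes,  # the individual's genotype data
    population_mean=0.0,     # placeholder; use the reference population's raw-PRS mean
    population_std=1.0       # placeholder; use the reference population's raw-PRS std
)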
Genotype Sources:
- 23andMe raw data export
- Ancestry DNA raw data
- Whole genome sequencing (VCF files)
- SNP array data (Illumina, Affymetrix)
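Whatever the source, genotypes have to be turned into effect-allele dosages before scoring. The sketch below assumes the tab-separated layout commonly seen in 23andMe raw exports (`#` comment lines, then rsid, chromosome, position, genotype) and assumes genotypes and effect alleles are reported on the same strand. The helper `read_dosages` is hypothetical and is not the parser this skill uses.

```python
import io


def read_dosages(lines, effect_alleles):
    """Count effect alleles per SNP from 23andMe-style rows.

    lines:          iterable of text lines (rsid, chromosome, position, genotype)
    effect_alleles: rsID -> effect allele taken from the PRS weights
    """
    dosages = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        rsid, genotype = fields[0], fields[3].strip().upper()
        if rsid not in effect_alleles:
            continue
        if len(genotype) != 2 or not set(genotype) <= set("ACGT"):
            continue  # skip no-calls ("--"), indels, and hemizygous calls
        dosages[rsid] = genotype.count(effect_alleles[rsid].upper())
    return dosages


# Tiny in-memory example standing in for a raw data file.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0001\t1\t1000\tAG\n"
    "rs0002\t1\t2000\tCC\n"
    "rs0003\t2\t3000\t--\n"
)
dosages = read_dosages(sample, {"rs0001": "G", "rs0002": "C", "rs0003": "T"})
print(dosages)
```

Strand flips and allele mismatches between the genotype file and the GWAS are not handled here; a real pipeline would need to harmonize alleles first.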
6. Risk Interpretation
Convert to percentiles and risk categories:
result = interpret_prs_percentile(result)
print(f"Percentile: {result.percentile:.1f}%")
print(f"Risk: {result.risk_category}")
Risk Categories:
- Low risk: < 20th percentile (genetic protection)
- Average risk: 20-80th percentile (typical genetic predisposition)
- Elevated risk: 80-95th percentile (moderately increased risk)
- High risk: > 95th percentile (substantially increased risk)
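The category cut-offs above map onto a simple lookup, sketched here. This is not the implementation of `interpret_prs_percentile`; the function name is illustrative, and the handling of values that fall exactly on a boundary (80 counted as average, 95 as elevated) is an assumption, since the list does not specify it.

```python
def risk_category(percentile):
    """Map a PRS percentile (0-100) to the risk categories listed above."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if percentile < 20:
        return "Low risk"
    if percentile <= 80:   # assumption: exactly 80 counts as average
        return "Average risk"
    if percentile <= 95:   # assumption: exactly 95 counts as elevated
        return "Elevated risk"
    return "High risk"


for p in (5.0, 50.0, 90.0, 99.0):
    print(f"{p:5.1f}th percentile -> {risk_category(p)}")
```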
Clinical Interpretation:
- Percentiles assume normal distribution
- Relative risk vs. average (not absolute risk)
- Combine with family history, clinical risk factors
- PRS is NOT diagnostic - many high-risk individuals never develop disease
Best Practices
PRS Construction
- Use validated PRS from PGS Catalog when available
  - Published models have been externally validated
  - Include LD clumping and ancestry-specific weights
- Match ancestries between GWAS and target population
  - European GWAS for European individuals
  - Use multi-ancestry GWAS when available
- Include as many SNPs as practical
  - More SNPs = better prediction (up to a point)
  - Balance between coverage and genotyping cost
- Consider trait architecture
  - Highly polygenic traits (height, education): benefit from relaxed thresholds
  - Oligogenic traits (IBD, T1D): few large-effect variants, strict thresholds
Clinical Use
- Combine with clinical risk scores
  - Add PRS to Framingham Risk Score, QRISK, etc.
  - Integrated models improve prediction
- Stratify screening and prevention
  - Intensify surveillance for high PRS (e.g., earlier mammography)
  - Lifestyle interventions for modifiable risk
- Provide genetic counseling
  - Explain probabilistic nature of PRS
  - Discuss limitations and uncertainty
  - Address psychological impact
- Consider actionability
  - Is there an intervention for high risk?
  - Benefits vs. harms of knowing genetic risk
Research Use
- Report methods transparently
  - Document SNP selection criteria
  - Report LD clumping parameters
  - Specify ancestry of GWAS and target
- Validate in held-out cohorts
  - Split data: training vs. testing
  - Report out-of-sample prediction accuracy (R², AUC)
- Compare to existing PRS
  - Benchmark against PGS Catalog models
  - Report incremental improvement
- Test across ancestries
  - Evaluate transferability to non-European populations
  - Report performance stratified by ancestry
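For the held-out validation step, AUC can be computed directly from its definition as the probability that a randomly chosen case has a higher PRS than a randomly chosen control. The sketch below uses made-up scores and labels; it compares every case with every control, so it is only suitable for small examples (libraries such as scikit-learn provide `roc_auc_score` for real cohorts).

```python
def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if case > control else 0.5 if case == control else 0.0
        for case in cases
        for control in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical held-out test set: PRS values and disease status (1 = case).
test_scores = [0.2, 0.9, 0.8, 0.7, 0.1, 0.3]
test_labels = [0, 1, 0, 1, 0, 1]

print(f"Out-of-sample AUC = {auc(test_scores, test_labels):.3f}")
```

An AUC of 0.5 means the score does no better than chance; the same function can be run separately per ancestry group to report stratified performance.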
Disclaimer
This skill is for educational and research purposes only.
- Not for clinical diagnosis or treatment decisions
- Not validated for clinical use - use PGS Catalog models for clinical-grade PRS
- Requires genetic counseling - interpretation requires expertise
- Does not account for family history, environment, or lifestyle factors
- Ancestry-specific - accuracy depends on matching GWAS ancestry
For clinical genetic testing, consult:
- Genetic counselors (certified by ABGC)
- Medical geneticists (certified by ABMGG)
- Healthcare providers with genomics training
PRS is a rapidly evolving field. Guidelines and best practices will continue to change as research progresses.
Regulatory Status:
- FDA does not currently regulate PRS (as of 2024)
- Some countries restrict direct-to-consumer genetic risk reporting
- Check local regulations before clinical implementation