Genomics and Epigenomics Data Processing
Production-ready computational skill for processing and analyzing epigenomics data. Combines local Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for regulatory context. Designed to solve BixBench-style questions about methylation, ChIP-seq, ATAC-seq, and multi-omics integration.
When to Use This Skill
Triggers:
- User provides methylation data (beta-value matrices, Illumina arrays) and asks about CpG sites
- Questions about differential methylation analysis
- Age-related CpG detection or epigenetic clock questions
- Chromosome-level methylation density or statistics
- ChIP-seq peak files (BED format) with analysis questions
- ATAC-seq chromatin accessibility questions
- Multi-omics integration (expression + methylation, expression + ChIP-seq)
- Genome-wide epigenomic statistics
- Questions mentioning "methylation", "CpG", "ChIP-seq", "ATAC-seq", "histone", "chromatin", "epigenetic"
- Questions about missing data across clinical/genomic/epigenomic modalities
- Regulatory element annotation for processed epigenomic data
Example Questions:
- "How many patients have no missing data for vital status, gene expression, and methylation data?"
- "What is the ratio of filtered age-related CpG density between chromosomes?"
- "What is the genome-wide average chromosomal density of unique age-related CpGs per base pair?"
- "How many CpG sites show significant differential methylation (padj < 0.05)?"
- "What is the Pearson correlation between methylation and expression for gene X?"
- "How many ChIP-seq peaks overlap with promoter regions?"
- "What fraction of ATAC-seq peaks are in enhancer regions?"
- "Which chromosome has the highest density of hypermethylated CpGs?"
- "Filter CpG sites by variance > threshold and map to nearest genes"
- "What is the average beta value difference between tumor and normal for chromosome 17?"
NOT for (use other skills instead):
- Gene regulation lookup without data files -> Use existing epigenomics annotation pattern
- RNA-seq differential expression -> Use
tooluniverse-rnaseq-deseq2 - Variant calling/annotation from VCF -> Use
tooluniverse-variant-analysis - Gene enrichment analysis -> Use
tooluniverse-gene-enrichment - Protein structure analysis -> Use
tooluniverse-protein-structure-retrieval
Required Python Packages
# Core (MUST be available)
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.multitest as mt
# Optional but useful
import pysam # BAM/CRAM file access
import gseapy # Enrichment of genes from methylation analysis
# ToolUniverse (for annotation)
from tooluniverse import ToolUniverse
Key Principles
- Data-first approach - Load and inspect data files BEFORE any analysis
- Question-driven - Parse what the user is actually asking and extract the specific numeric answer
- File format detection - Automatically detect methylation arrays, BED files, BigWig, clinical data
- Coordinate system awareness - Track genome build (hg19, hg38, mm10), handle chr prefix differences
- Statistical rigor - Proper multiple testing correction, effect size filtering, sample size awareness
- Missing data handling - Explicitly report and handle NaN/missing values
- Chromosome normalization - Always normalize chromosome names (chr1 vs 1, chrX vs X)
- CpG site identification - Parse Illumina probe IDs (cg/ch probes), genomic coordinates
- Report-first - Create output file first, populate progressively
- English-first queries - Use English in all tool calls
Workflow Overview
Phase 0: Question Parsing and Data Discovery
Before writing any code, parse the question to identify:
- What data files are available (methylation, ChIP-seq, ATAC-seq, clinical, expression, manifest)
- What specific statistic or answer is being asked for
- What thresholds apply (significance, effect size, variance, chromosome filters)
- What genome build to use
Categorize files by scanning for keywords: methyl/beta/cpg/illumina, chip/peak/narrowpeak, atac/accessibility, clinical/patient/sample, express/rnaseq/fpkm, manifest/annotation/probe.
See ANALYSIS_PROCEDURES.md for the full decision tree and parameter extraction table.
Phase 1: Methylation Data Processing
Core functions for methylation analysis:
- Load methylation data - Supports CSV, TSV, parquet, HDF5; auto-detects beta vs M-values
- Load probe manifest - Illumina 450K/EPIC manifest with chromosome, position, gene annotation
- CpG filtering - Filter by variance, missing rate, probe type (cg/ch), chromosome, CpG island relation, gene group
- Differential methylation - T-test/Wilcoxon/KS between groups with FDR correction; identify DMPs (hyper/hypo)
- Age-related CpG analysis - Pearson/Spearman correlation of probes with age, FDR correction
- Chromosome-level density - CpG count per chromosome divided by chromosome length; density ratios; genome-wide average
See CODE_REFERENCE.md Phase 1 for full function implementations.
Phase 2: ChIP-seq Peak Analysis
- Load BED/narrowPeak/broadPeak - Auto-detect format, normalize chromosomes
- Peak statistics - Count, length distribution, signal values, q-values
- Peak annotation - Map peaks to nearest gene, classify as promoter/gene_body/proximal/distal
- Peak overlap - Pure Python interval intersection between two BED files; Jaccard similarity
See CODE_REFERENCE.md Phase 2 for full function implementations.
Phase 3: ATAC-seq Analysis
- Load ATAC peaks - Wrapper around BED loader for narrowPeak format
- ATAC-specific stats - Nucleosome-free region (NFR) detection (<150bp peaks), region classification
- Chromatin accessibility by region - Distribution of open chromatin across promoter/enhancer/intergenic
See CODE_REFERENCE.md Phase 3 for full function implementations.
Phase 4: Multi-Omics Integration
- Methylation-expression correlation - Align samples, compute per-probe-gene Pearson/Spearman with FDR
- ChIP-seq + expression - Find genes with promoter peaks and compare expression levels
See CODE_REFERENCE.md Phase 4 for full function implementations.
Phase 5: Clinical Data Integration
- Missing data analysis - Count samples present across clinical, expression, and methylation modalities
- Complete case identification - Find intersection of samples with non-missing values for specified variables
See CODE_REFERENCE.md Phase 5 for full function implementations.
Phase 6: ToolUniverse Annotation
Use ToolUniverse tools to add biological context after computational analysis:
- Gene annotation - Ensembl lookup for coordinates, biotype, cross-references
- Regulatory elements - SCREEN cCREs (enhancers, promoters, insulators)
- ChIPAtlas - Query available ChIP-seq experiments by antigen/cell type
- Ensembl regulatory features - Annotate genomic regions with regulatory overlaps
See CODE_REFERENCE.md Phase 6 and TOOLS_REFERENCE.md for parameters.
Phase 7: Genome-Wide Statistics
- Comprehensive methylation stats - Global mean/median beta, probe variance, chromosome density
- Differential methylation summary - Count significant, hyper/hypo split, effect sizes
See CODE_REFERENCE.md Phase 7 for full function implementations.
Common Analysis Patterns
| Pattern | Input | Key Steps | Output |
|---|---|---|---|
| Differential methylation | Beta matrix + clinical | Filter probes -> define groups -> t-test -> FDR -> threshold | Count of significant DMPs |
| Age-related CpG density | Beta matrix + manifest + ages | Correlate with age -> FDR -> map to chr -> density per chr | Density ratio between chromosomes |
| Multi-omics missing data | Clinical + expression + methylation | Extract sample IDs -> intersect -> check NaN | Complete case count |
| ChIP-seq peak annotation | BED/narrowPeak + gene annotation | Load peaks -> annotate to genes -> classify regions | Fraction in promoters |
| Methylation-expression | Beta matrix + expression + probe-gene map | Align samples -> correlate -> FDR | Significant anti-correlations |
See ANALYSIS_PROCEDURES.md for detailed step-by-step flows and edge case handling.
Key Functions Reference
| Function | Purpose | Input | Output |
|---|---|---|---|
load_methylation_data() | Load beta/M-value matrix | file path | DataFrame |
detect_methylation_type() | Detect beta vs M-values | DataFrame | 'beta' or 'mvalue' |
filter_cpg_probes() | Filter probes by criteria | DataFrame + filters | filtered DataFrame |
differential_methylation() | DM analysis between groups | beta + samples | DataFrame with padj |
identify_age_related_cpgs() | Age-correlated CpGs | beta + ages | DataFrame with padj |
chromosome_cpg_density() | CpG density per chromosome | probes + manifest | density DataFrame |
genome_wide_average_density() | Overall genome density | density DataFrame | float |
chromosome_density_ratio() | Ratio between chromosomes | density + chr names | float |
load_bed_file() | Load BED/narrowPeak | file path | DataFrame |
peak_statistics() | Basic peak stats | BED DataFrame | dict |
annotate_peaks_to_genes() | Annotate peaks to genes | peaks + genes | annotated DataFrame |
find_overlaps() | Peak overlap analysis | two BED DataFrames | overlap DataFrame |
missing_data_analysis() | Cross-modality completeness | multiple DataFrames | dict |
correlate_methylation_expression() | Meth-expression correlation | beta + expression | correlation DataFrame |
ToolUniverse Tools Used
Regulatory Annotation Tools
ensembl_lookup_gene- Gene coordinates, biotype (REQUIRESspecies='homo_sapiens')ensembl_get_regulatory_features- Regulatory features by region (NO "chr" prefix in region)ensembl_get_overlap_features- Gene/transcript overlap dataSCREEN_get_regulatory_elements- cCREs: enhancers, promoters, insulatorsReMap_get_transcription_factor_binding- TF binding sitesRegulomeDB_query_variant- Variant regulatory scorejaspar_search_matrices- TF binding matricesENCODE_search_experiments- Experiment metadata (assay_title must be "TF ChIP-seq" not "ChIP-seq")ChIPAtlas_get_experiments- ChIP-seq experiments (REQUIRESoperationparam)ChIPAtlas_search_datasets- Dataset search (REQUIRESoperationparam)ChIPAtlas_enrichment_analysis- Enrichment from BED/motifs/genesChIPAtlas_get_peak_data- Peak data download (REQUIRESoperationparam)FourDN_search_data- Chromatin conformation data (REQUIRESoperationparam)
Gene Annotation Tools
MyGene_query_genes- Gene queryMyGene_batch_query- Batch gene queryHGNC_get_gene_info- Gene symbol, aliases, IDsGO_get_annotations_for_gene- GO annotations
See TOOLS_REFERENCE.md for full parameter details and return schemas.
Data Format Notes
- Methylation data: Probes (rows) x samples (columns), beta values 0-1
- BED files: Tab-separated, 0-based half-open coordinates
- narrowPeak: 10-column BED extension with signalValue, pValue, qValue, peak
- Illumina manifests: Probe ID, chromosome, position, gene annotation
- Clinical data: Patient/sample-centric with clinical variables as columns
Genome Builds Supported
| Build | Species | Autosomes | Sex Chromosomes |
|---|---|---|---|
| hg38 (GRCh38) | Human | chr1-chr22 | chrX, chrY |
| hg19 (GRCh37) | Human | chr1-chr22 | chrX, chrY |
| mm10 (GRCm38) | Mouse | chr1-chr19 | chrX, chrY |
Limitations
- No native pybedtools: uses pure Python interval operations
- No native pyBigWig: cannot read BigWig files directly without package
- No R bridge: does not use methylKit, ChIPseeker, or DiffBind
- Illumina-centric: methylation functions designed for 450K/EPIC arrays
- Uses t-test/Wilcoxon for differential methylation (not limma/bumphunter)
- No peak calling: assumes peaks are pre-called
- API rate limits: ToolUniverse annotation limited to ~20 genes per batch
Reference Files
CODE_REFERENCE.md- Full Python function implementations for all phasesTOOLS_REFERENCE.md- ToolUniverse tool parameter details and return schemasANALYSIS_PROCEDURES.md- Decision trees, step-by-step analysis patterns, edge cases, fallback strategiesQUICK_START.md- Quick start examples for common analysis types