Genomics and Epigenomics Data Processing

Production-ready computational skill for processing and analyzing epigenomics data. Combines local Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for regulatory context. Designed to solve BixBench-style questions about methylation, ChIP-seq, ATAC-seq, and multi-omics integration.

When to Use This Skill

Triggers:

User provides methylation data (beta-value matrices, Illumina arrays) and asks about CpG sites
Questions about differential methylation analysis
Age-related CpG detection or epigenetic clock questions
Chromosome-level methylation density or statistics
ChIP-seq peak files (BED format) with analysis questions
ATAC-seq chromatin accessibility questions
Multi-omics integration (expression + methylation, expression + ChIP-seq)
Genome-wide epigenomic statistics
Questions mentioning "methylation", "CpG", "ChIP-seq", "ATAC-seq", "histone", "chromatin", "epigenetic"
Questions about missing data across clinical/genomic/epigenomic modalities
Regulatory element annotation for processed epigenomic data

Example Questions:

"How many patients have no missing data for vital status, gene expression, and methylation data?"
"What is the ratio of filtered age-related CpG density between chromosomes?"
"What is the genome-wide average chromosomal density of unique age-related CpGs per base pair?"
"How many CpG sites show significant differential methylation (padj < 0.05)?"
"What is the Pearson correlation between methylation and expression for gene X?"
"How many ChIP-seq peaks overlap with promoter regions?"
"What fraction of ATAC-seq peaks are in enhancer regions?"
"Which chromosome has the highest density of hypermethylated CpGs?"
"Filter CpG sites by variance > threshold and map to nearest genes"
"What is the average beta value difference between tumor and normal for chromosome 17?"

NOT for (use other skills instead):

Gene regulation lookup without data files -> Use existing epigenomics annotation pattern
RNA-seq differential expression -> Use tooluniverse-rnaseq-deseq2
Variant calling/annotation from VCF -> Use tooluniverse-variant-analysis
Gene enrichment analysis -> Use tooluniverse-gene-enrichment
Protein structure analysis -> Use tooluniverse-protein-structure-retrieval

Required Python Packages

# Core (MUST be available)
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.multitest as mt

# Optional but useful
import pysam      # BAM/CRAM file access
import gseapy     # Enrichment of genes from methylation analysis

# ToolUniverse (for annotation)
from tooluniverse import ToolUniverse

Key Principles

Data-first approach - Load and inspect data files BEFORE any analysis
Question-driven - Parse what the user is actually asking and extract the specific numeric answer
File format detection - Automatically detect methylation arrays, BED files, BigWig, clinical data
Coordinate system awareness - Track genome build (hg19, hg38, mm10), handle chr prefix differences
Statistical rigor - Proper multiple testing correction, effect size filtering, sample size awareness
Missing data handling - Explicitly report and handle NaN/missing values
Chromosome normalization - Always normalize chromosome names (chr1 vs 1, chrX vs X)
CpG site identification - Parse Illumina probe IDs (cg/ch probes), genomic coordinates
Report-first - Create output file first, populate progressively
English-first queries - Use English in all tool calls

Workflow Overview

Phase 0: Question Parsing and Data Discovery

Before writing any code, parse the question to identify:

What data files are available (methylation, ChIP-seq, ATAC-seq, clinical, expression, manifest)
What specific statistic or answer is being asked for
What thresholds apply (significance, effect size, variance, chromosome filters)
What genome build to use

Categorize files by scanning for keywords: methyl/beta/cpg/illumina, chip/peak/narrowpeak, atac/accessibility, clinical/patient/sample, express/rnaseq/fpkm, manifest/annotation/probe.

See ANALYSIS_PROCEDURES.md for the full decision tree and parameter extraction table.

Phase 1: Methylation Data Processing

Core functions for methylation analysis:

Load methylation data - Supports CSV, TSV, parquet, HDF5; auto-detects beta vs M-values
Load probe manifest - Illumina 450K/EPIC manifest with chromosome, position, gene annotation
CpG filtering - Filter by variance, missing rate, probe type (cg/ch), chromosome, CpG island relation, gene group
Differential methylation - T-test/Wilcoxon/KS between groups with FDR correction; identify DMPs (hyper/hypo)
Age-related CpG analysis - Pearson/Spearman correlation of probes with age, FDR correction
Chromosome-level density - CpG count per chromosome divided by chromosome length; density ratios; genome-wide average

See CODE_REFERENCE.md Phase 1 for full function implementations.

Phase 2: ChIP-seq Peak Analysis

Load BED/narrowPeak/broadPeak - Auto-detect format, normalize chromosomes
Peak statistics - Count, length distribution, signal values, q-values
Peak annotation - Map peaks to nearest gene, classify as promoter/gene_body/proximal/distal
Peak overlap - Pure Python interval intersection between two BED files; Jaccard similarity

See CODE_REFERENCE.md Phase 2 for full function implementations.

Phase 3: ATAC-seq Analysis

Load ATAC peaks - Wrapper around BED loader for narrowPeak format
ATAC-specific stats - Nucleosome-free region (NFR) detection (<150bp peaks), region classification
Chromatin accessibility by region - Distribution of open chromatin across promoter/enhancer/intergenic

See CODE_REFERENCE.md Phase 3 for full function implementations.

Phase 4: Multi-Omics Integration

Methylation-expression correlation - Align samples, compute per-probe-gene Pearson/Spearman with FDR
ChIP-seq + expression - Find genes with promoter peaks and compare expression levels

See CODE_REFERENCE.md Phase 4 for full function implementations.

Phase 5: Clinical Data Integration

Missing data analysis - Count samples present across clinical, expression, and methylation modalities
Complete case identification - Find intersection of samples with non-missing values for specified variables

See CODE_REFERENCE.md Phase 5 for full function implementations.

Phase 6: ToolUniverse Annotation

Use ToolUniverse tools to add biological context after computational analysis:

Gene annotation - Ensembl lookup for coordinates, biotype, cross-references
Regulatory elements - SCREEN cCREs (enhancers, promoters, insulators)
ChIPAtlas - Query available ChIP-seq experiments by antigen/cell type
Ensembl regulatory features - Annotate genomic regions with regulatory overlaps

See CODE_REFERENCE.md Phase 6 and TOOLS_REFERENCE.md for parameters.

Phase 7: Genome-Wide Statistics

Comprehensive methylation stats - Global mean/median beta, probe variance, chromosome density
Differential methylation summary - Count significant, hyper/hypo split, effect sizes

See CODE_REFERENCE.md Phase 7 for full function implementations.

Common Analysis Patterns

Pattern	Input	Key Steps	Output
Differential methylation	Beta matrix + clinical	Filter probes -> define groups -> t-test -> FDR -> threshold	Count of significant DMPs
Age-related CpG density	Beta matrix + manifest + ages	Correlate with age -> FDR -> map to chr -> density per chr	Density ratio between chromosomes
Multi-omics missing data	Clinical + expression + methylation	Extract sample IDs -> intersect -> check NaN	Complete case count
ChIP-seq peak annotation	BED/narrowPeak + gene annotation	Load peaks -> annotate to genes -> classify regions	Fraction in promoters
Methylation-expression	Beta matrix + expression + probe-gene map	Align samples -> correlate -> FDR	Significant anti-correlations

See ANALYSIS_PROCEDURES.md for detailed step-by-step flows and edge case handling.

Key Functions Reference

Function	Purpose	Input	Output
`load_methylation_data()`	Load beta/M-value matrix	file path	DataFrame
`detect_methylation_type()`	Detect beta vs M-values	DataFrame	'beta' or 'mvalue'
`filter_cpg_probes()`	Filter probes by criteria	DataFrame + filters	filtered DataFrame
`differential_methylation()`	DM analysis between groups	beta + samples	DataFrame with padj
`identify_age_related_cpgs()`	Age-correlated CpGs	beta + ages	DataFrame with padj
`chromosome_cpg_density()`	CpG density per chromosome	probes + manifest	density DataFrame
`genome_wide_average_density()`	Overall genome density	density DataFrame	float
`chromosome_density_ratio()`	Ratio between chromosomes	density + chr names	float
`load_bed_file()`	Load BED/narrowPeak	file path	DataFrame
`peak_statistics()`	Basic peak stats	BED DataFrame	dict
`annotate_peaks_to_genes()`	Annotate peaks to genes	peaks + genes	annotated DataFrame
`find_overlaps()`	Peak overlap analysis	two BED DataFrames	overlap DataFrame
`missing_data_analysis()`	Cross-modality completeness	multiple DataFrames	dict
`correlate_methylation_expression()`	Meth-expression correlation	beta + expression	correlation DataFrame

ToolUniverse Tools Used

Regulatory Annotation Tools

ensembl_lookup_gene - Gene coordinates, biotype (REQUIRES species='homo_sapiens')
ensembl_get_regulatory_features - Regulatory features by region (NO "chr" prefix in region)
ensembl_get_overlap_features - Gene/transcript overlap data
SCREEN_get_regulatory_elements - cCREs: enhancers, promoters, insulators
ReMap_get_transcription_factor_binding - TF binding sites
RegulomeDB_query_variant - Variant regulatory score
jaspar_search_matrices - TF binding matrices
ENCODE_search_experiments - Experiment metadata (assay_title must be "TF ChIP-seq" not "ChIP-seq")
ChIPAtlas_get_experiments - ChIP-seq experiments (REQUIRES operation param)
ChIPAtlas_search_datasets - Dataset search (REQUIRES operation param)
ChIPAtlas_enrichment_analysis - Enrichment from BED/motifs/genes
ChIPAtlas_get_peak_data - Peak data download (REQUIRES operation param)
FourDN_search_data - Chromatin conformation data (REQUIRES operation param)

Gene Annotation Tools

MyGene_query_genes - Gene query
MyGene_batch_query - Batch gene query
HGNC_get_gene_info - Gene symbol, aliases, IDs
GO_get_annotations_for_gene - GO annotations

See TOOLS_REFERENCE.md for full parameter details and return schemas.

Data Format Notes

Methylation data: Probes (rows) x samples (columns), beta values 0-1
BED files: Tab-separated, 0-based half-open coordinates
narrowPeak: 10-column BED extension with signalValue, pValue, qValue, peak
Illumina manifests: Probe ID, chromosome, position, gene annotation
Clinical data: Patient/sample-centric with clinical variables as columns

Genome Builds Supported

Build	Species	Autosomes	Sex Chromosomes
hg38 (GRCh38)	Human	chr1-chr22	chrX, chrY
hg19 (GRCh37)	Human	chr1-chr22	chrX, chrY
mm10 (GRCm38)	Mouse	chr1-chr19	chrX, chrY

Limitations

No native pybedtools: uses pure Python interval operations
No native pyBigWig: cannot read BigWig files directly without package
No R bridge: does not use methylKit, ChIPseeker, or DiffBind
Illumina-centric: methylation functions designed for 450K/EPIC arrays
Uses t-test/Wilcoxon for differential methylation (not limma/bumphunter)
No peak calling: assumes peaks are pre-called
API rate limits: ToolUniverse annotation limited to ~20 genes per batch

Reference Files

CODE_REFERENCE.md - Full Python function implementations for all phases
TOOLS_REFERENCE.md - ToolUniverse tool parameter details and return schemas
ANALYSIS_PROCEDURES.md - Decision trees, step-by-step analysis patterns, edge cases, fallback strategies
QUICK_START.md - Quick start examples for common analysis types

tooluniverse-epigenomics

Safety Notice

Copy this and send it to your AI assistant to learn