tooluniverse-epigenomics

Production-ready genomics and epigenomics data processing for BixBench questions. Handles methylation array analysis (CpG filtering, differential methylation, age-related CpG detection, chromosome-level density), ChIP-seq peak analysis (peak calling, motif enrichment, coverage stats), ATAC-seq chromatin accessibility, multi-omics integration (expression + methylation correlation), and genome-wide statistics. Pure Python computation (pandas, scipy, numpy, pysam, statsmodels) plus ToolUniverse annotation tools (Ensembl, ENCODE, SCREEN, JASPAR, ReMap, RegulomeDB, ChIPAtlas). Supports BED, BigWig, methylation beta-value matrices, Illumina manifest files, and multi-sample clinical data. Use when processing methylation data, ChIP-seq peaks, ATAC-seq signals, or answering questions about CpG sites, differential methylation, chromatin accessibility, histone marks, or epigenomic statistics.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "tooluniverse-epigenomics" with this command: npx skills add mims-harvard/tooluniverse/mims-harvard-tooluniverse-tooluniverse-epigenomics

Genomics and Epigenomics Data Processing

Production-ready computational skill for processing and analyzing epigenomics data. Combines local Python computation (pandas, scipy, numpy, pysam, statsmodels) with ToolUniverse annotation tools for regulatory context. Designed to solve BixBench-style questions about methylation, ChIP-seq, ATAC-seq, and multi-omics integration.

When to Use This Skill

Triggers:

  • User provides methylation data (beta-value matrices, Illumina arrays) and asks about CpG sites
  • Questions about differential methylation analysis
  • Age-related CpG detection or epigenetic clock questions
  • Chromosome-level methylation density or statistics
  • ChIP-seq peak files (BED format) with analysis questions
  • ATAC-seq chromatin accessibility questions
  • Multi-omics integration (expression + methylation, expression + ChIP-seq)
  • Genome-wide epigenomic statistics
  • Questions mentioning "methylation", "CpG", "ChIP-seq", "ATAC-seq", "histone", "chromatin", "epigenetic"
  • Questions about missing data across clinical/genomic/epigenomic modalities
  • Regulatory element annotation for processed epigenomic data

Example Questions:

  1. "How many patients have no missing data for vital status, gene expression, and methylation data?"
  2. "What is the ratio of filtered age-related CpG density between chromosomes?"
  3. "What is the genome-wide average chromosomal density of unique age-related CpGs per base pair?"
  4. "How many CpG sites show significant differential methylation (padj < 0.05)?"
  5. "What is the Pearson correlation between methylation and expression for gene X?"
  6. "How many ChIP-seq peaks overlap with promoter regions?"
  7. "What fraction of ATAC-seq peaks are in enhancer regions?"
  8. "Which chromosome has the highest density of hypermethylated CpGs?"
  9. "Filter CpG sites by variance > threshold and map to nearest genes"
  10. "What is the average beta value difference between tumor and normal for chromosome 17?"

NOT for (use other skills instead):

  • Gene regulation lookup without data files -> Use existing epigenomics annotation pattern
  • RNA-seq differential expression -> Use tooluniverse-rnaseq-deseq2
  • Variant calling/annotation from VCF -> Use tooluniverse-variant-analysis
  • Gene enrichment analysis -> Use tooluniverse-gene-enrichment
  • Protein structure analysis -> Use tooluniverse-protein-structure-retrieval

Required Python Packages

# Core (MUST be available)
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.stats.multitest as mt

# Optional but useful
import pysam      # BAM/CRAM file access
import gseapy     # Enrichment of genes from methylation analysis

# ToolUniverse (for annotation)
from tooluniverse import ToolUniverse

Key Principles

  1. Data-first approach - Load and inspect data files BEFORE any analysis
  2. Question-driven - Parse what the user is actually asking and extract the specific numeric answer
  3. File format detection - Automatically detect methylation arrays, BED files, BigWig, clinical data
  4. Coordinate system awareness - Track genome build (hg19, hg38, mm10), handle chr prefix differences
  5. Statistical rigor - Proper multiple testing correction, effect size filtering, sample size awareness
  6. Missing data handling - Explicitly report and handle NaN/missing values
  7. Chromosome normalization - Always normalize chromosome names (chr1 vs 1, chrX vs X)
  8. CpG site identification - Parse Illumina probe IDs (cg/ch probes), genomic coordinates
  9. Report-first - Create output file first, populate progressively
  10. English-first queries - Use English in all tool calls

Workflow Overview

Phase 0: Question Parsing and Data Discovery

Before writing any code, parse the question to identify:

  • What data files are available (methylation, ChIP-seq, ATAC-seq, clinical, expression, manifest)
  • What specific statistic or answer is being asked for
  • What thresholds apply (significance, effect size, variance, chromosome filters)
  • What genome build to use

Categorize files by scanning for keywords: methyl/beta/cpg/illumina, chip/peak/narrowpeak, atac/accessibility, clinical/patient/sample, express/rnaseq/fpkm, manifest/annotation/probe.

See ANALYSIS_PROCEDURES.md for the full decision tree and parameter extraction table.

Phase 1: Methylation Data Processing

Core functions for methylation analysis:

  • Load methylation data - Supports CSV, TSV, parquet, HDF5; auto-detects beta vs M-values
  • Load probe manifest - Illumina 450K/EPIC manifest with chromosome, position, gene annotation
  • CpG filtering - Filter by variance, missing rate, probe type (cg/ch), chromosome, CpG island relation, gene group
  • Differential methylation - T-test/Wilcoxon/KS between groups with FDR correction; identify DMPs (hyper/hypo)
  • Age-related CpG analysis - Pearson/Spearman correlation of probes with age, FDR correction
  • Chromosome-level density - CpG count per chromosome divided by chromosome length; density ratios; genome-wide average

See CODE_REFERENCE.md Phase 1 for full function implementations.

Phase 2: ChIP-seq Peak Analysis

  • Load BED/narrowPeak/broadPeak - Auto-detect format, normalize chromosomes
  • Peak statistics - Count, length distribution, signal values, q-values
  • Peak annotation - Map peaks to nearest gene, classify as promoter/gene_body/proximal/distal
  • Peak overlap - Pure Python interval intersection between two BED files; Jaccard similarity

See CODE_REFERENCE.md Phase 2 for full function implementations.

Phase 3: ATAC-seq Analysis

  • Load ATAC peaks - Wrapper around BED loader for narrowPeak format
  • ATAC-specific stats - Nucleosome-free region (NFR) detection (<150bp peaks), region classification
  • Chromatin accessibility by region - Distribution of open chromatin across promoter/enhancer/intergenic

See CODE_REFERENCE.md Phase 3 for full function implementations.

Phase 4: Multi-Omics Integration

  • Methylation-expression correlation - Align samples, compute per-probe-gene Pearson/Spearman with FDR
  • ChIP-seq + expression - Find genes with promoter peaks and compare expression levels

See CODE_REFERENCE.md Phase 4 for full function implementations.

Phase 5: Clinical Data Integration

  • Missing data analysis - Count samples present across clinical, expression, and methylation modalities
  • Complete case identification - Find intersection of samples with non-missing values for specified variables

See CODE_REFERENCE.md Phase 5 for full function implementations.

Phase 6: ToolUniverse Annotation

Use ToolUniverse tools to add biological context after computational analysis:

  • Gene annotation - Ensembl lookup for coordinates, biotype, cross-references
  • Regulatory elements - SCREEN cCREs (enhancers, promoters, insulators)
  • ChIPAtlas - Query available ChIP-seq experiments by antigen/cell type
  • Ensembl regulatory features - Annotate genomic regions with regulatory overlaps

See CODE_REFERENCE.md Phase 6 and TOOLS_REFERENCE.md for parameters.

Phase 7: Genome-Wide Statistics

  • Comprehensive methylation stats - Global mean/median beta, probe variance, chromosome density
  • Differential methylation summary - Count significant, hyper/hypo split, effect sizes

See CODE_REFERENCE.md Phase 7 for full function implementations.


Common Analysis Patterns

PatternInputKey StepsOutput
Differential methylationBeta matrix + clinicalFilter probes -> define groups -> t-test -> FDR -> thresholdCount of significant DMPs
Age-related CpG densityBeta matrix + manifest + agesCorrelate with age -> FDR -> map to chr -> density per chrDensity ratio between chromosomes
Multi-omics missing dataClinical + expression + methylationExtract sample IDs -> intersect -> check NaNComplete case count
ChIP-seq peak annotationBED/narrowPeak + gene annotationLoad peaks -> annotate to genes -> classify regionsFraction in promoters
Methylation-expressionBeta matrix + expression + probe-gene mapAlign samples -> correlate -> FDRSignificant anti-correlations

See ANALYSIS_PROCEDURES.md for detailed step-by-step flows and edge case handling.


Key Functions Reference

FunctionPurposeInputOutput
load_methylation_data()Load beta/M-value matrixfile pathDataFrame
detect_methylation_type()Detect beta vs M-valuesDataFrame'beta' or 'mvalue'
filter_cpg_probes()Filter probes by criteriaDataFrame + filtersfiltered DataFrame
differential_methylation()DM analysis between groupsbeta + samplesDataFrame with padj
identify_age_related_cpgs()Age-correlated CpGsbeta + agesDataFrame with padj
chromosome_cpg_density()CpG density per chromosomeprobes + manifestdensity DataFrame
genome_wide_average_density()Overall genome densitydensity DataFramefloat
chromosome_density_ratio()Ratio between chromosomesdensity + chr namesfloat
load_bed_file()Load BED/narrowPeakfile pathDataFrame
peak_statistics()Basic peak statsBED DataFramedict
annotate_peaks_to_genes()Annotate peaks to genespeaks + genesannotated DataFrame
find_overlaps()Peak overlap analysistwo BED DataFramesoverlap DataFrame
missing_data_analysis()Cross-modality completenessmultiple DataFramesdict
correlate_methylation_expression()Meth-expression correlationbeta + expressioncorrelation DataFrame

ToolUniverse Tools Used

Regulatory Annotation Tools

  • ensembl_lookup_gene - Gene coordinates, biotype (REQUIRES species='homo_sapiens')
  • ensembl_get_regulatory_features - Regulatory features by region (NO "chr" prefix in region)
  • ensembl_get_overlap_features - Gene/transcript overlap data
  • SCREEN_get_regulatory_elements - cCREs: enhancers, promoters, insulators
  • ReMap_get_transcription_factor_binding - TF binding sites
  • RegulomeDB_query_variant - Variant regulatory score
  • jaspar_search_matrices - TF binding matrices
  • ENCODE_search_experiments - Experiment metadata (assay_title must be "TF ChIP-seq" not "ChIP-seq")
  • ChIPAtlas_get_experiments - ChIP-seq experiments (REQUIRES operation param)
  • ChIPAtlas_search_datasets - Dataset search (REQUIRES operation param)
  • ChIPAtlas_enrichment_analysis - Enrichment from BED/motifs/genes
  • ChIPAtlas_get_peak_data - Peak data download (REQUIRES operation param)
  • FourDN_search_data - Chromatin conformation data (REQUIRES operation param)

Gene Annotation Tools

  • MyGene_query_genes - Gene query
  • MyGene_batch_query - Batch gene query
  • HGNC_get_gene_info - Gene symbol, aliases, IDs
  • GO_get_annotations_for_gene - GO annotations

See TOOLS_REFERENCE.md for full parameter details and return schemas.


Data Format Notes

  • Methylation data: Probes (rows) x samples (columns), beta values 0-1
  • BED files: Tab-separated, 0-based half-open coordinates
  • narrowPeak: 10-column BED extension with signalValue, pValue, qValue, peak
  • Illumina manifests: Probe ID, chromosome, position, gene annotation
  • Clinical data: Patient/sample-centric with clinical variables as columns

Genome Builds Supported

BuildSpeciesAutosomesSex Chromosomes
hg38 (GRCh38)Humanchr1-chr22chrX, chrY
hg19 (GRCh37)Humanchr1-chr22chrX, chrY
mm10 (GRCm38)Mousechr1-chr19chrX, chrY

Limitations

  • No native pybedtools: uses pure Python interval operations
  • No native pyBigWig: cannot read BigWig files directly without package
  • No R bridge: does not use methylKit, ChIPseeker, or DiffBind
  • Illumina-centric: methylation functions designed for 450K/EPIC arrays
  • Uses t-test/Wilcoxon for differential methylation (not limma/bumphunter)
  • No peak calling: assumes peaks are pre-called
  • API rate limits: ToolUniverse annotation limited to ~20 genes per batch

Reference Files

  • CODE_REFERENCE.md - Full Python function implementations for all phases
  • TOOLS_REFERENCE.md - ToolUniverse tool parameter details and return schemas
  • ANALYSIS_PROCEDURES.md - Decision trees, step-by-step analysis patterns, edge cases, fallback strategies
  • QUICK_START.md - Quick start examples for common analysis types

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

devtu-optimize-skills

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

devtu-create-tool

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

devtu-optimize-descriptions

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

tooluniverse-clinical-trial-design

No summary provided by upstream source.

Repository SourceNeeds Review