genomics

Genomics and transcriptomics analysis strategies

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "genomics" with this command: npx skills add justaddcoffee/open-science-skills/justaddcoffee-open-science-skills-genomics

Genomics and Transcriptomics Analysis

When to Use This Skill

  • When data contains gene expression measurements (RNA-seq, microarray)
  • When analyzing differential gene expression
  • When performing pathway or gene set enrichment analysis
  • When interpreting genetic variants or mutations

Core Concepts

Gene Expression Data Types

RNA-seq counts:

  • Raw read counts per gene
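  • Requires normalization (e.g., TPM or RPKM/FPKM for within-sample comparison; DESeq2 median-of-ratios or edgeR TMM for between-sample comparison)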
  • Suitable for differential expression analysis
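As a rough illustration of what library-size normalization does, the sketch below computes CPM and TPM from a toy counts table. The gene names, sample names, counts, and gene lengths are all made up, and this is not a substitute for the model-based normalization that dedicated differential expression tools perform.

```python
import numpy as np
import pandas as pd

# Toy raw counts (genes × samples); all values are made up for illustration
counts = pd.DataFrame(
    {"S1": [100, 300, 600], "S2": [200, 600, 1200]},
    index=["GeneA", "GeneB", "GeneC"],
)
# Hypothetical gene lengths in kilobases
gene_length_kb = pd.Series([1.0, 2.0, 3.0], index=counts.index)

# CPM: scale each sample (column) to one million reads
cpm = counts / counts.sum(axis=0) * 1e6

# TPM: divide by gene length first, then scale each sample to one million
rate = counts.div(gene_length_kb, axis=0)
tpm = rate / rate.sum(axis=0) * 1e6

# Log transform with a pseudocount for downstream analysis
log_cpm = np.log2(cpm + 1)
```

In this toy table S2 has exactly twice the depth of S1, so both samples end up with identical CPM values once library size is accounted for.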

Microarray intensities:

  • Probe fluorescence intensities
  • Log-transformed, background-corrected
  • Legacy platform, less common now

Single-cell RNA-seq:

  • Expression per cell (not bulk tissue)
  • High sparsity (many zeros)
  • Specialized analysis methods
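To make the sparsity point concrete, here is a small sketch that measures the fraction of zeros in a simulated single-cell count matrix. The matrix is drawn from a Poisson distribution purely for illustration; real single-cell data would be loaded from a file and usually shows a different count distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated counts (1000 genes × 50 cells); the low Poisson mean is an assumption
sc_counts = rng.poisson(0.3, size=(1000, 50))

# Fraction of entries that are exactly zero
sparsity = (sc_counts == 0).mean()

# Number of genes detected (count > 0) in each cell, a common per-cell QC metric
genes_detected_per_cell = (sc_counts > 0).sum(axis=0)

print(f"Sparsity: {sparsity:.1%}")
print(f"Median genes detected per cell: {np.median(genes_detected_per_cell):.0f}")
```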

Gene Nomenclature

Human genes:

  • Official symbols: HUGO Gene Nomenclature Committee (HGNC)
  • Example: TP53 (tumor protein p53)
  • Italicized in publications

Mouse genes:

  • Similar to human but capitalization differs
  • Example: Trp53 (first letter capital, rest lowercase; note that the official mouse symbol is Trp53, not Tp53)

Protein names:

  • Not italicized
  • Example: p53 protein

Always verify gene symbols - aliases and outdated names are common.
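One way to approach symbol verification is to map known aliases to official symbols before any analysis. The sketch below assumes a tiny hand-written alias table and a hypothetical helper named standardize_symbols; in practice the table would be built from an HGNC download rather than typed by hand.

```python
# Illustrative alias table (alias -> official HGNC symbol); a real one would be much larger
ALIAS_TO_OFFICIAL = {
    "P53": "TP53",
    "HER2": "ERBB2",
    "P21": "CDKN1A",
}

def standardize_symbols(symbols, alias_map=ALIAS_TO_OFFICIAL):
    """Return (standardized symbols, list of (original, official) replacements)."""
    standardized, replaced = [], []
    for sym in symbols:
        key = sym.strip()
        # Look up aliases case-insensitively, but leave unknown symbols unchanged
        official = alias_map.get(key.upper(), key)
        if official != key:
            replaced.append((key, official))
        standardized.append(official)
    return standardized, replaced

clean, changes = standardize_symbols(["P53", "HER2", "BRCA1 "])
print(clean)
print(changes)
```

Returning the list of replacements makes it easy to log which symbols were changed, so the mapping can be reviewed rather than applied silently.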

Differential Expression Analysis

Workflow

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Load expression data (genes × samples)
# Rows = genes, Columns = samples
expr_data = pd.read_csv("expression_data.csv", index_col=0)

# Define groups
group1_samples = ["Sample1", "Sample2", "Sample3"]
group2_samples = ["Sample4", "Sample5", "Sample6"]

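# Log-transform with a pseudocount so zeros stay defined and variance is stabilized
# (assumes expr_data holds normalized, non-log values; skip this step if already log-scale)
log_expr = np.log2(expr_data + 1)

results = []

for gene in log_expr.index:
    group1_expr = log_expr.loc[gene, group1_samples]
    group2_expr = log_expr.loc[gene, group2_samples]

    # Welch's t-test on log-scale values (does not assume equal variances)
    # With few replicates, dedicated tools (DESeq2, edgeR, limma) are usually preferable
    t_stat, p_value = ttest_ind(group1_expr, group2_expr, equal_var=False)

    # log2 fold change = difference of group means in log space
    mean1 = group1_expr.mean()
    mean2 = group2_expr.mean()
    log2fc = mean1 - mean2

    results.append({
        "gene": gene,
        "log2FC": log2fc,
        "p_value": p_value,
        "mean_group1": mean1,  # log2 scale
        "mean_group2": mean2   # log2 scale
    })

# Genes with zero variance in both groups yield NaN p-values; drop them before correction
results_df = pd.DataFrame(results).dropna(subset=["p_value"]).reset_index(drop=True)

# Multiple testing correction (Benjamini-Hochberg FDR)
results_df["p_adj"] = multipletests(results_df["p_value"], method="fdr_bh")[1]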

# Define significant genes
significant = results_df[
    (results_df["p_adj"] < 0.05) &
    (abs(results_df["log2FC"]) > 1)  # 2-fold change
]

print(f"Significant genes: {len(significant)}")
print(f"Upregulated: {sum(significant['log2FC'] > 0)}")
print(f"Downregulated: {sum(significant['log2FC'] < 0)}")

Volcano Plot

Visualize differential expression:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(
    results_df["log2FC"],
    -np.log10(results_df["p_adj"]),
    alpha=0.5, s=10, c="gray"
)

# Highlight significant genes
sig_mask = (results_df["p_adj"] < 0.05) & (abs(results_df["log2FC"]) > 1)
plt.scatter(
    results_df.loc[sig_mask, "log2FC"],
    -np.log10(results_df.loc[sig_mask, "p_adj"]),
    alpha=0.7, s=20, c="red", label="Significant"
)

plt.xlabel("log2 Fold Change")
plt.ylabel("-log10(adjusted p-value)")
plt.axhline(-np.log10(0.05), linestyle="--", color="black", linewidth=0.5)
plt.axvline(-1, linestyle="--", color="black", linewidth=0.5)
plt.axvline(1, linestyle="--", color="black", linewidth=0.5)
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)

Gene Set Enrichment

Simple Pathway Enrichment

When: You have a list of significant genes and want to know which pathways are affected

# Define gene sets (pathways)
gene_sets = {
    "Cell Cycle": ["TP53", "CDK1", "CCNB1", "CDC20", ...],
    "Apoptosis": ["TP53", "BAX", "BCL2", "CASP3", ...],
    "DNA Repair": ["TP53", "BRCA1", "BRCA2", "ATM", ...],
    # ... more pathways
}

# Fisher's exact test for enrichment
from scipy.stats import fisher_exact

all_genes = set(expr_data.index)
sig_genes = set(significant["gene"])

enrichment_results = []

for pathway, pathway_genes in gene_sets.items():
    pathway_genes = set(pathway_genes) & all_genes  # Only genes in dataset

    # 2x2 contingency table
    a = len(sig_genes & pathway_genes)  # Sig & in pathway
    b = len(sig_genes - pathway_genes)  # Sig & not in pathway
    c = len(pathway_genes - sig_genes)  # Not sig & in pathway
    d = len(all_genes - sig_genes - pathway_genes)  # Not sig & not in pathway

    oddsratio, p_value = fisher_exact([[a, b], [c, d]], alternative='greater')

    enrichment_results.append({
        "pathway": pathway,
        "overlap": a,
        "pathway_size": len(pathway_genes),
        "odds_ratio": oddsratio,
        "p_value": p_value
    })

enrich_df = pd.DataFrame(enrichment_results)
enrich_df["p_adj"] = multipletests(enrich_df["p_value"], method="fdr_bh")[1]
enrich_df = enrich_df.sort_values("p_adj")

print(enrich_df.head(10))

Gene Ontology (GO) Terms

Common GO categories:

  • Biological Process (BP): What the gene does (e.g., "cell cycle", "apoptosis")
  • Molecular Function (MF): Biochemical activity (e.g., "kinase activity")
  • Cellular Component (CC): Where it acts (e.g., "nucleus", "mitochondrion")

KEGG Pathway Enrichment

KEGG = Kyoto Encyclopedia of Genes and Genomes

Provides curated pathway maps for:

  • Metabolic pathways
  • Signaling pathways
  • Disease pathways

Example pathways:

  • hsa04110: Cell cycle
  • hsa04210: Apoptosis
  • hsa04151: PI3K-Akt signaling

Common Analysis Patterns

Pattern 1: Transcription Factor Activity

Observation: Many genes upregulated

Hypothesis: Shared transcription factor (TF)

Test:

# Check whether significant genes are enriched for known targets of a TF
tf_targets = {
    "TP53": ["BAX", "CDKN1A", "MDM2", "GADD45A", ...],
    "MYC": ["CDK4", "CCND1", "E2F1", ...],
    # ... more TFs
}

# Test for enrichment (same as pathway enrichment)

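Interpretation: Enrichment of a TF's known targets among the significant genes suggests that the TF's activity differs between conditions. Whether it is more or less active depends on the direction of change and on whether the TF activates or represses those targets.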
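The enrichment test mentioned above can be wrapped in a small reusable function. This is a minimal sketch, assuming a hypothetical helper named target_enrichment and a made-up background of placeholder genes (G0 to G95); only the four TP53 targets come from the example in this section.

```python
from scipy.stats import fisher_exact

def target_enrichment(sig_genes, target_genes, background):
    """One-sided Fisher's exact test for over-representation of TF targets."""
    universe = set(background)
    sig = set(sig_genes) & universe
    targets = set(target_genes) & universe
    a = len(sig & targets)             # significant and a target
    b = len(sig - targets)             # significant, not a target
    c = len(targets - sig)             # target, not significant
    d = len(universe - sig - targets)  # neither
    odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
    return {"overlap": a, "odds_ratio": odds_ratio, "p_value": p_value}

# Toy inputs: 96 placeholder genes plus four TP53 targets
tp53_targets = ["BAX", "CDKN1A", "MDM2", "GADD45A"]
background = [f"G{i}" for i in range(96)] + tp53_targets
sig_up = ["BAX", "CDKN1A", "MDM2", "G1", "G2"]  # made-up significant list

result = target_enrichment(sig_up, tp53_targets, background)
print(result)
```

Restricting both gene lists to the background first keeps the 2x2 table consistent with the set of genes that could actually have been called significant.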

Pattern 2: Pathway Coordination

Observation: Genes in same pathway all up/down together

Interpretation: Pathway-level regulation (not individual genes)

Example:

All glycolysis genes ↑↑ → Increased glycolysis
All oxidative phosphorylation genes ↓↓ → Metabolic shift
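A simple way to check for this kind of coordination is to summarize fold changes per pathway. The sketch below uses made-up log2 fold changes, small illustrative gene sets, and an arbitrary 80% cutoff for calling a pathway coordinated; none of these come from a real dataset.

```python
import pandas as pd

# Made-up per-gene log2 fold changes
log2fc = pd.Series({
    "HK2": 1.4, "PFKP": 1.1, "PKM": 0.9, "LDHA": 1.6,
    "NDUFA4": -1.2, "COX5A": -0.8, "ATP5F1A": -1.0, "SDHB": -0.7,
})

# Illustrative gene sets (real sets would come from a pathway database)
pathways = {
    "Glycolysis": ["HK2", "PFKP", "PKM", "LDHA"],
    "Oxidative phosphorylation": ["NDUFA4", "COX5A", "ATP5F1A", "SDHB"],
}

rows = []
for name, genes in pathways.items():
    fc = log2fc.reindex(genes).dropna()  # ignore genes missing from the dataset
    rows.append({
        "pathway": name,
        "n_genes": len(fc),
        "mean_log2FC": fc.mean(),
        "frac_up": (fc > 0).mean(),
    })
summary = pd.DataFrame(rows).set_index("pathway")

# Label a pathway as coordinated when at least 80% of genes move the same way
summary["direction"] = "mixed"
summary.loc[summary["frac_up"] >= 0.8, "direction"] = "up"
summary.loc[summary["frac_up"] <= 0.2, "direction"] = "down"

print(summary)
```

The same summary also exposes opposite regulation of related pathways, since one pathway labeled "up" next to a related one labeled "down" is the signature of a shift between them.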

Pattern 3: Compensatory Response

Observation: Opposite regulation of related pathways

Example:

De novo biosynthesis genes ↓
Salvage pathway genes ↑
→ Metabolic switch to energy-efficient salvage

Correlation Analysis

Co-expression Networks

When: Identify genes that change together

# Compute pairwise Pearson correlations (the pandas .corr() default)
# Use the 50 most significant genes for tractability
genes = significant.sort_values("p_adj")["gene"].tolist()[:50]
corr_matrix = expr_data.loc[genes].T.corr()

# Filter high correlations
high_corr = []
for i in range(len(genes)):
    for j in range(i+1, len(genes)):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            high_corr.append({
                "gene1": genes[i],
                "gene2": genes[j],
                "correlation": corr_matrix.iloc[i, j]
            })

print(f"High correlations (|r| > 0.8): {len(high_corr)}")

Interpretation:

  • Positive correlation → co-regulated (same pathway, shared TF)
  • Negative correlation → antagonistic regulation

Network Visualization

import networkx as nx

# Build network
G = nx.Graph()
for item in high_corr:
    G.add_edge(item["gene1"], item["gene2"], weight=abs(item["correlation"]))

# Find communities (clusters of co-expressed genes)
from networkx.algorithms import community
communities = community.greedy_modularity_communities(G)

for i, comm in enumerate(communities):
    print(f"Community {i}: {list(comm)}")

Literature Search Strategies

Effective Queries

For gene function:

"[GENE] function"
"[GENE] role in [PROCESS]"
"[GENE] knockout phenotype"

For pathway context:

"[GENE] pathway"
"[GENE] interacting proteins"
"[GENE] regulation"

For disease relevance:

"[GENE] [DISEASE]"
"[GENE] mutation [DISEASE]"

Key Databases

  1. NCBI Gene: Gene summaries and references
  2. UniProt: Protein function and domains
  3. STRING: Protein-protein interactions
  4. GeneCards: Comprehensive gene info
  5. PubMed: Literature search
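The query templates listed under Effective Queries can be filled in programmatically when many genes need to be searched. The sketch below is one possible approach; the QUERY_TEMPLATES dictionary and the build_queries helper are hypothetical names introduced for illustration, and the code only builds strings without contacting any database.

```python
# Templates taken from the query patterns listed in this section
QUERY_TEMPLATES = {
    "function": ["{gene} function", "{gene} role in {process}", "{gene} knockout phenotype"],
    "pathway": ["{gene} pathway", "{gene} interacting proteins", "{gene} regulation"],
    "disease": ["{gene} {disease}", "{gene} mutation {disease}"],
}

def build_queries(gene, process=None, disease=None):
    """Fill the templates, skipping any that need a value that was not given."""
    values = {"gene": gene, "process": process, "disease": disease}
    queries = []
    for templates in QUERY_TEMPLATES.values():
        for template in templates:
            needed = [key for key in values if "{" + key + "}" in template]
            if all(values[key] for key in needed):
                queries.append(template.format(**values))
    return queries

queries = build_queries("TP53", process="apoptosis")
for q in queries:
    print(q)
```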

Genomics-Specific Hypotheses

Template Hypotheses

H1: Transcriptional Regulation

"Condition X activates transcription factor [TF], upregulating
target genes [G1, G2, G3] in pathway [P]"

H2: Pathway Activation

"Condition X activates [pathway], evidenced by coordinated
upregulation of pathway genes and increased activity signature"

H3: Epigenetic Regulation

"Condition X alters chromatin state at [locus], changing
expression of genes [G1, G2]"

H4: Post-transcriptional Regulation

"MicroRNA [miR] is upregulated, suppressing target genes [G1, G2],
explaining decreased protein levels despite unchanged mRNA"

Quality Control

Before interpreting results:

  • Check for batch effects (PCA colored by batch)
  • Verify sample labels are correct
  • Check for outlier samples (hierarchical clustering)
  • Confirm expression distribution (should be roughly normal after log transform)
  • Verify normalization (samples should have similar distributions)
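The batch-effect check in the list above can be sketched with a plain SVD-based PCA. The data here are simulated with an artificial batch shift, and the sample names and batch labels are assumptions; with real data, log_expr would be the log-transformed expression matrix and batch would come from the sample metadata.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated log-expression (200 genes × 6 samples)
samples = [f"S{i}" for i in range(1, 7)]
batch = pd.Series(["A", "A", "A", "B", "B", "B"], index=samples)
log_expr = pd.DataFrame(rng.normal(8, 1, size=(200, 6)), columns=samples)

# Add an artificial shift to batch B in the first 100 genes
batch_b = batch[batch == "B"].index
log_expr.loc[:99, batch_b] += 2.0

# PCA over samples: center each gene across samples, then take the SVD
centered = log_expr.sub(log_expr.mean(axis=1), axis=0)
u, s, vt = np.linalg.svd(centered.to_numpy(), full_matrices=False)
explained = s**2 / np.sum(s**2)
pcs = pd.DataFrame(vt[:2].T * s[:2], index=samples, columns=["PC1", "PC2"])

# If PC1 separates the batches, batch is a dominant source of variation
print(pcs.assign(batch=batch))
print(f"Variance explained by PC1: {explained[0]:.1%}")
```

If the leading component tracks batch rather than the biological condition, the batch should be modeled or corrected before any differential expression results are interpreted.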

Common Pitfalls

Ignoring log transformation

  • Expression data should be log-transformed for most analyses
  • A log fold change is a difference in log space: log2FC = log2(A) - log2(B)
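A short worked example of the relationship between ratios and log differences, using made-up expression values:

```python
import numpy as np

treated, control = 400.0, 100.0  # made-up normalized expression values

# Ratio on the linear scale
fold_change = treated / control

# The same comparison as a difference on the log2 scale
log2fc = np.log2(treated) - np.log2(control)

# Swapping the groups flips the sign, so up and down changes are symmetric
log2fc_reversed = np.log2(control) - np.log2(treated)

print(fold_change, log2fc, log2fc_reversed)
```

A 4-fold increase and a 4-fold decrease appear as +2 and -2 on the log2 scale, whereas on the linear scale they would be 4 and 0.25, which is one reason log-scale values are easier to compare and plot.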

Using nominal p-values for many genes

  • Always correct for multiple testing (FDR)
  • Use adjusted p-values for significance
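To show what the FDR adjustment does, here is a minimal sketch of the Benjamini-Hochberg procedure written out in NumPy. The function name benjamini_hochberg and the input p-values are illustrative; its output should agree with the fdr_bh method of multipletests used earlier in this document.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downward, then cap at 1
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

p_values = [0.001, 0.008, 0.039, 0.041, 0.6]  # made-up nominal p-values
p_adj = benjamini_hochberg(p_values)

print(p_adj)
print("Nominal p < 0.05:", sum(p < 0.05 for p in p_values))
print("Adjusted p < 0.05:", int((p_adj < 0.05).sum()))
```

In this example four tests pass the nominal 0.05 threshold but only two remain significant after adjustment.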

Overinterpreting small fold changes

  • |log2FC| < 0.5 (less than about 1.4-fold) may not be biologically meaningful
  • Use stricter thresholds for noisy data

Confusing gene expression with protein activity

  • mRNA ≠ protein levels
  • Protein activity may require post-translational modifications

Cherry-picking genes

  • Don't select genes to fit a story
  • Use unbiased pathway enrichment

Integration with Other Data Types

Transcriptomics + Metabolomics

Strategy:

1. Identify differentially expressed metabolic enzymes
2. Map to KEGG pathways
3. Check if corresponding metabolites are changed
4. Build integrated metabolic model

Example:

Gene: PHGDH (phosphoglycerate dehydrogenase) ↑
Metabolite: Serine ↑
→ Integrated finding: Increased serine biosynthesis
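Steps 1 to 3 of the strategy above can be sketched as a simple join between gene-level and metabolite-level changes. All fold changes below are made up, the 0.5 cutoff is arbitrary, and the enzyme-to-metabolite mapping is a hand-written assumption that links each enzyme to a metabolite of its pathway; a real mapping would be derived from a pathway database.

```python
import pandas as pd

# Made-up log2 fold changes
gene_fc = pd.Series({"PHGDH": 1.8, "PSAT1": 1.2, "HK2": 0.1})
metabolite_fc = pd.Series({"Serine": 1.1, "Glucose-6-phosphate": -0.2})

# Hypothetical mapping from enzyme gene to a metabolite of its pathway
enzyme_to_metabolite = {
    "PHGDH": "Serine",
    "PSAT1": "Serine",
    "HK2": "Glucose-6-phosphate",
}

cutoff = 0.5  # arbitrary threshold for calling a change

rows = []
for gene, metabolite in enzyme_to_metabolite.items():
    g, m = gene_fc[gene], metabolite_fc[metabolite]
    # Both must change beyond the cutoff and in the same direction
    supported = bool(abs(g) > cutoff and abs(m) > cutoff and (g > 0) == (m > 0))
    rows.append({
        "gene": gene,
        "metabolite": metabolite,
        "gene_log2FC": g,
        "metabolite_log2FC": m,
        "same_direction": supported,
    })

integrated = pd.DataFrame(rows)
print(integrated)
```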

Transcriptomics + Proteomics

Compare mRNA vs protein changes:

  • Concordant (both up/down) → transcriptional regulation
  • Discordant (mRNA ≠ protein) → post-transcriptional regulation
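The comparison above can be sketched as a per-gene classification. The fold changes, gene names, and the 0.5 cutoff are made up for illustration, and the helper names call_direction and classify are hypothetical.

```python
import pandas as pd

# Made-up log2 fold changes for the same genes at mRNA and protein level
fc = pd.DataFrame(
    {
        "mrna_log2FC": [1.5, -1.2, 0.1, 1.3],
        "protein_log2FC": [1.1, -0.9, -1.4, 0.05],
    },
    index=["GeneA", "GeneB", "GeneC", "GeneD"],
)

CUTOFF = 0.5  # arbitrary threshold for calling a change

def call_direction(x, cutoff=CUTOFF):
    """Label a log2 fold change as up, down, or flat."""
    if x > cutoff:
        return "up"
    if x < -cutoff:
        return "down"
    return "flat"

fc["mrna_call"] = fc["mrna_log2FC"].map(call_direction)
fc["protein_call"] = fc["protein_log2FC"].map(call_direction)

def classify(row):
    """Concordant if both levels change the same way, discordant if they differ."""
    if row["mrna_call"] == row["protein_call"]:
        return "unchanged" if row["mrna_call"] == "flat" else "concordant"
    return "discordant"

fc["status"] = fc.apply(classify, axis=1)
print(fc[["mrna_call", "protein_call", "status"]])
```

Genes flagged as discordant are the candidates for post-transcriptional regulation and are worth checking against the miRNA and protein stability mechanisms described in this document.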

Key Principle

Gene expression is the messenger, not the message.

mRNA changes indicate potential for protein changes. Always consider:

  • Post-transcriptional regulation (miRNA, RNA stability)
  • Translational control
  • Protein stability and degradation
  • Post-translational modifications

Connect expression changes to phenotype through pathways and functional validation.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
