PyDESeq2

Overview

PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.

When to Use This Skill

This skill should be used when:

Analyzing bulk RNA-seq count data for differential expression
Comparing gene expression between experimental conditions (e.g., treated vs control)
Performing multi-factor designs accounting for batch effects or covariates
Converting R-based DESeq2 workflows to Python
Integrating differential expression analysis into Python-based pipelines
Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2"

Quick Start Workflow

For users who want to perform a standard differential expression analysis:

import pandas as pd from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats

1. Load data

counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes metadata = pd.read_csv("metadata.csv", index_col=0)

2. Filter low-count genes

genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10] counts_df = counts_df[genes_to_keep]

3. Initialize and fit DESeq2

dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~condition", refit_cooks=True ) dds.deseq2()

4. Perform statistical testing

ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary()

5. Access results

results = ds.results_df significant = results[results.padj < 0.05] print(f"Found {len(significant)} significant genes")

Core Workflow Steps

Step 1: Data Preparation

Input requirements:

Count matrix: Samples × genes DataFrame with non-negative integer read counts
Metadata: Samples × variables DataFrame with experimental factors

Common data loading patterns:

From CSV (typical format: genes × samples, needs transpose)

counts_df = pd.read_csv("counts.csv", index_col=0).T metadata = pd.read_csv("metadata.csv", index_col=0)

From TSV

counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T

From AnnData

import anndata as ad adata = ad.read_h5ad("data.h5ad") counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names) metadata = adata.obs

Data filtering:

Remove low-count genes

genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10] counts_df = counts_df[genes_to_keep]

Remove samples with missing metadata

samples_to_keep = ~metadata.condition.isna() counts_df = counts_df.loc[samples_to_keep] metadata = metadata.loc[samples_to_keep]

Step 2: Design Specification

The design formula specifies how gene expression is modeled.

Single-factor designs:

design = "~condition" # Simple two-group comparison

Multi-factor designs:

design = "~batch + condition" # Control for batch effects design = "~age + condition" # Include continuous covariate design = "~group + condition + group:condition" # Interaction effects

Design formula guidelines:

Use Wilkinson formula notation (R-style)
Put adjustment variables (e.g., batch) before the main variable of interest
Ensure variables exist as columns in the metadata DataFrame
Use appropriate data types (categorical for discrete variables)

Step 3: DESeq2 Fitting

Initialize the DeseqDataSet and run the complete pipeline:

from pydeseq2.dds import DeseqDataSet

dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~condition", refit_cooks=True, # Refit after removing outliers n_cpus=1 # Parallel processing (adjust as needed) )

Run the complete DESeq2 pipeline

dds.deseq2()

What deseq2() does:

Computes size factors (normalization)
Fits genewise dispersions
Fits dispersion trend curve
Computes dispersion priors
Fits MAP dispersions (shrinkage)
Fits log fold changes
Calculates Cook's distances (outlier detection)
Refits if outliers detected (optional)

Step 4: Statistical Testing

Perform Wald tests to identify differentially expressed genes:

from pydeseq2.ds import DeseqStats

ds = DeseqStats( dds, contrast=["condition", "treated", "control"], # Test treated vs control alpha=0.05, # Significance threshold cooks_filter=True, # Filter outliers independent_filter=True # Filter low-power tests )

ds.summary()

Contrast specification:

Format: [variable, test_level, reference_level]
Example: ["condition", "treated", "control"] tests treated vs control
If None , uses the last coefficient in the design

Result DataFrame columns:

baseMean : Mean normalized count across samples
log2FoldChange : Log2 fold change between conditions
lfcSE : Standard error of LFC
stat : Wald test statistic
pvalue : Raw p-value
padj : Adjusted p-value (FDR-corrected via Benjamini-Hochberg)

Step 5: Optional LFC Shrinkage

Apply shrinkage to reduce noise in fold change estimates:

ds.lfc_shrink() # Applies apeGLM shrinkage

When to use LFC shrinkage:

For visualization (volcano plots, heatmaps)
For ranking genes by effect size
When prioritizing genes for follow-up experiments

Important: Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance.

Step 6: Result Export

Save results and intermediate objects:

import pickle

Export results as CSV

ds.results_df.to_csv("deseq2_results.csv")

Save significant genes only

significant = ds.results_df[ds.results_df.padj < 0.05] significant.to_csv("significant_genes.csv")

Save DeseqDataSet for later use

with open("dds_result.pkl", "wb") as f: pickle.dump(dds.to_picklable_anndata(), f)

Common Analysis Patterns

Two-Group Comparison

Standard case-control comparison:

dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition") dds.deseq2()

ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary()

results = ds.results_df significant = results[results.padj < 0.05]

Multiple Comparisons

Testing multiple treatment groups against control:

dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition") dds.deseq2()

treatments = ["treatment_A", "treatment_B", "treatment_C"] all_results = {}

for treatment in treatments: ds = DeseqStats(dds, contrast=["condition", treatment, "control"]) ds.summary() all_results[treatment] = ds.results_df

sig_count = len(ds.results_df[ds.results_df.padj &#x3C; 0.05])
print(f"{treatment}: {sig_count} significant genes")

Accounting for Batch Effects

Control for technical variation:

Include batch in design

dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition") dds.deseq2()

Test condition while controlling for batch

ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary()

Continuous Covariates

Include continuous variables like age or dosage:

Ensure continuous variable is numeric

metadata["age"] = pd.to_numeric(metadata["age"])

dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition") dds.deseq2()

ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary()

Using the Analysis Script

This skill includes a complete command-line script for standard analyses:

Basic usage

python scripts/run_deseq2_analysis.py
--counts counts.csv
--metadata metadata.csv
--design "~condition"
--contrast condition treated control
--output results/

With additional options

python scripts/run_deseq2_analysis.py
--counts counts.csv
--metadata metadata.csv
--design "~batch + condition"
--contrast condition treated control
--output results/
--min-counts 10
--alpha 0.05
--n-cpus 4
--plots

Script features:

Automatic data loading and validation
Gene and sample filtering
Complete DESeq2 pipeline execution
Statistical testing with customizable parameters
Result export (CSV, pickle)
Optional visualization (volcano and MA plots)

Refer users to scripts/run_deseq2_analysis.py when they need a standalone analysis tool or want to batch process multiple datasets.

Result Interpretation

Identifying Significant Genes

Filter by adjusted p-value

significant = ds.results_df[ds.results_df.padj < 0.05]

Filter by both significance and effect size

sig_and_large = ds.results_df[ (ds.results_df.padj < 0.05) & (abs(ds.results_df.log2FoldChange) > 1) ]

Separate up- and down-regulated

upregulated = significant[significant.log2FoldChange > 0] downregulated = significant[significant.log2FoldChange < 0]

print(f"Upregulated: {len(upregulated)}") print(f"Downregulated: {len(downregulated)}")

Ranking and Sorting

Sort by adjusted p-value

top_by_padj = ds.results_df.sort_values("padj").head(20)

Sort by absolute fold change (use shrunk values)

ds.lfc_shrink() ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange) top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20)

Sort by a combined metric

ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange) top_combined = ds.results_df.sort_values("score", ascending=False).head(20)

Quality Metrics

Check normalization (size factors should be close to 1)

print("Size factors:", dds.obsm["size_factors"])

Examine dispersion estimates

import matplotlib.pyplot as plt plt.hist(dds.varm["dispersions"], bins=50) plt.xlabel("Dispersion") plt.ylabel("Frequency") plt.title("Dispersion Distribution") plt.show()

Check p-value distribution (should be mostly flat with peak near 0)

plt.hist(ds.results_df.pvalue.dropna(), bins=50) plt.xlabel("P-value") plt.ylabel("Frequency") plt.title("P-value Distribution") plt.show()

Visualization Guidelines

Volcano Plot

Visualize significance vs effect size:

import matplotlib.pyplot as plt import numpy as np

results = ds.results_df.copy() results["-log10(padj)"] = -np.log10(results.padj)

plt.figure(figsize=(10, 6)) significant = results.padj < 0.05

plt.scatter( results.loc[~significant, "log2FoldChange"], results.loc[~significant, "-log10(padj)"], alpha=0.3, s=10, c='gray', label='Not significant' ) plt.scatter( results.loc[significant, "log2FoldChange"], results.loc[significant, "-log10(padj)"], alpha=0.6, s=10, c='red', label='padj < 0.05' )

plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5) plt.xlabel("Log2 Fold Change") plt.ylabel("-Log10(Adjusted P-value)") plt.title("Volcano Plot") plt.legend() plt.savefig("volcano_plot.png", dpi=300)

MA Plot

Show fold change vs mean expression:

plt.figure(figsize=(10, 6))

plt.scatter( np.log10(results.loc[~significant, "baseMean"] + 1), results.loc[~significant, "log2FoldChange"], alpha=0.3, s=10, c='gray' ) plt.scatter( np.log10(results.loc[significant, "baseMean"] + 1), results.loc[significant, "log2FoldChange"], alpha=0.6, s=10, c='red' )

plt.axhline(0, color='blue', linestyle='--', alpha=0.5) plt.xlabel("Log10(Base Mean + 1)") plt.ylabel("Log2 Fold Change") plt.title("MA Plot") plt.savefig("ma_plot.png", dpi=300)

Troubleshooting Common Issues

Data Format Problems

Issue: "Index mismatch between counts and metadata"

Solution: Ensure sample names match exactly

print("Counts samples:", counts_df.index.tolist()) print("Metadata samples:", metadata.index.tolist())

Take intersection if needed

common = counts_df.index.intersection(metadata.index) counts_df = counts_df.loc[common] metadata = metadata.loc[common]

Issue: "All genes have zero counts"

Solution: Check if data needs transposition

print(f"Counts shape: {counts_df.shape}")

If genes > samples, transpose is needed

if counts_df.shape[1] < counts_df.shape[0]: counts_df = counts_df.T

Design Matrix Issues

Issue: "Design matrix is not full rank"

Cause: Confounded variables (e.g., all treated samples in one batch)

Solution: Remove confounded variable or add interaction term

Check confounding

print(pd.crosstab(metadata.condition, metadata.batch))

Either simplify design or add interaction

design = "~condition" # Remove batch

OR

design = "~condition + batch + condition:batch" # Model interaction

No Significant Genes

Diagnostics:

Check dispersion distribution

plt.hist(dds.varm["dispersions"], bins=50) plt.show()

Check size factors

print(dds.obsm["size_factors"])

Look at top genes by raw p-value

print(ds.results_df.nsmallest(20, "pvalue"))

Possible causes:

Small effect sizes
High biological variability
Insufficient sample size
Technical issues (batch effects, outliers)

Reference Documentation

For comprehensive details beyond this workflow-oriented guide:

API Reference (references/api_reference.md ): Complete documentation of PyDESeq2 classes, methods, and data structures. Use when needing detailed parameter information or understanding object attributes.

Workflow Guide (references/workflow_guide.md ): In-depth guide covering complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use when handling complex experimental designs or encountering issues.

Load these references into context when users need:

Detailed API documentation: Read references/api_reference.md
Comprehensive workflow examples: Read references/workflow_guide.md
Troubleshooting guidance: Read references/workflow_guide.md (see Troubleshooting section)

Key Reminders

Data orientation matters: Count matrices typically load as genes × samples but need to be samples × genes. Always transpose with .T if needed.

Sample filtering: Remove samples with missing metadata before analysis to avoid errors.

Gene filtering: Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computational time.

Design formula order: Put adjustment variables before the variable of interest (e.g., "~batch + condition" not "~condition + batch" ).

LFC shrinkage timing: Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates.

Result interpretation: Use padj < 0.05 for significance, not raw p-values. The Benjamini-Hochberg procedure controls false discovery rate.

Contrast specification: The format is [variable, test_level, reference_level] where test_level is compared against reference_level.

Save intermediate objects: Use pickle to save DeseqDataSet objects for later use or additional analyses without re-running the expensive fitting step.

Installation and Requirements

uv pip install pydeseq2

System requirements:

Python 3.10-3.11
pandas 1.4.3+
numpy 1.23.0+
scipy 1.11.0+
scikit-learn 1.1.1+
anndata 0.8.0+

Optional for visualization:

matplotlib
seaborn

Additional Resources

Official Documentation: https://pydeseq2.readthedocs.io
GitHub Repository: https://github.com/owkin/PyDESeq2
Publication: Muzellec et al. (2023) Bioinformatics, DOI: 10.1093/bioinformatics/btad547
Original DESeq2 (R): Love et al. (2014) Genome Biology, DOI: 10.1186/s13059-014-0550-8

pydeseq2

Safety Notice

Copy this and send it to your AI assistant to learn

1. Load data

2. Filter low-count genes

3. Initialize and fit DESeq2

4. Perform statistical testing

5. Access results

From CSV (typical format: genes × samples, needs transpose)

From TSV

From AnnData

Remove low-count genes

Remove samples with missing metadata

Run the complete DESeq2 pipeline

Export results as CSV

Save significant genes only

Save DeseqDataSet for later use

Include batch in design

Test condition while controlling for batch

Ensure continuous variable is numeric

Basic usage

With additional options

Filter by adjusted p-value

Filter by both significance and effect size

Separate up- and down-regulated

Sort by adjusted p-value

Sort by absolute fold change (use shrunk values)

Sort by a combined metric

Check normalization (size factors should be close to 1)

Examine dispersion estimates

Check p-value distribution (should be mostly flat with peak near 0)

Take intersection if needed

If genes > samples, transpose is needed

Check confounding

Either simplify design or add interaction

OR

Check dispersion distribution

Check size factors

Look at top genes by raw p-value

Source Transparency

Related Skills

clinical-reports

clinical-decision-support

clinpgx-database

clinvar-database