Gene Set Enrichment Analysis (GSEA)

Core Concept

GSEA uses all genes ranked by a statistic (log2FC, signed p-value) rather than a subset of significant genes. It finds gene sets where members are enriched at the top or bottom of the ranked list.

Prepare Ranked Gene List

library(clusterProfiler) library(org.Hs.eg.db)

de_results <- read.csv('de_results.csv')

Create named vector: values = statistic, names = gene IDs

gene_list <- de_results$log2FoldChange names(gene_list) <- de_results$gene_id

Sort in decreasing order (REQUIRED)

gene_list <- sort(gene_list, decreasing = TRUE)

Convert Gene IDs for GSEA

Convert symbols to Entrez IDs

gene_ids <- bitr(names(gene_list), fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)

Create ranked list with Entrez IDs

gene_list_entrez <- gene_list[names(gene_list) %in% gene_ids$SYMBOL] names(gene_list_entrez) <- gene_ids$ENTREZID[match(names(gene_list_entrez), gene_ids$SYMBOL)] gene_list_entrez <- sort(gene_list_entrez, decreasing = TRUE)

Alternative Ranking Statistics

Signed p-value (recommended for detecting both up and down)

gene_list <- -log10(de_results$pvalue) * sign(de_results$log2FoldChange) names(gene_list) <- de_results$gene_id gene_list <- sort(gene_list, decreasing = TRUE)

Wald statistic (from DESeq2)

gene_list <- de_results$stat names(gene_list) <- de_results$gene_id gene_list <- sort(gene_list, decreasing = TRUE)

GSEA with GO

gse_go <- gseGO( geneList = gene_list_entrez, OrgDb = org.Hs.eg.db, ont = 'BP', # BP, MF, CC, or ALL minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE, pAdjustMethod = 'BH' )

Make readable

gse_go <- setReadable(gse_go, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')

GSEA with KEGG

gse_kegg <- gseKEGG( geneList = gene_list_entrez, organism = 'hsa', minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE )

Make readable

gse_kegg <- setReadable(gse_kegg, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')

GSEA with Custom Gene Sets

Read GMT file (Gene Matrix Transposed)

gene_sets <- read.gmt('msigdb_hallmarks.gmt')

gse_custom <- GSEA( geneList = gene_list_entrez, TERM2GENE = gene_sets, minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05 )

MSigDB Gene Sets

Use msigdbr package for MSigDB gene sets

library(msigdbr)

Hallmark gene sets

hallmarks <- msigdbr(species = 'Homo sapiens', category = 'H') hallmarks_t2g <- hallmarks[, c('gs_name', 'entrez_gene')]

gse_hallmark <- GSEA( geneList = gene_list_entrez, TERM2GENE = hallmarks_t2g, pvalueCutoff = 0.05 )

Other categories: C1 (positional), C2 (curated), C3 (motif), C5 (GO), C6 (oncogenic), C7 (immunologic)

Understanding Results

View results

head(gse_go) results <- as.data.frame(gse_go)

Key columns:

- NES: Normalized Enrichment Score (positive = upregulated, negative = downregulated)

- pvalue: Nominal p-value

- p.adjust: FDR-adjusted p-value

- core_enrichment: Leading edge genes

Interpreting NES (Normalized Enrichment Score)

NES Interpretation

Positive (> 0) Gene set enriched in upregulated genes

Negative (< 0) Gene set enriched in downregulated genes

NES

Key Parameters

Parameter Default Description

geneList required Named, sorted numeric vector

OrgDb required Organism database (for gseGO)

organism hsa KEGG organism code (for gseKEGG)

ont BP Ontology: BP, MF, CC, ALL

minGSSize 10 Min genes in gene set

maxGSSize 500 Max genes in gene set

pvalueCutoff 0.05 P-value threshold

pAdjustMethod BH Adjustment method

nPerm 10000 Permutations (if permutation test used)

eps 1e-10 Boundary for p-value calculation

Export Results

results_df <- as.data.frame(gse_go) write.csv(results_df, 'gsea_go_results.csv', row.names = FALSE)

Get leading edge genes for a term

leading_edge <- strsplit(results_df$core_enrichment[1], '/')[[1]]

Notes

Must be sorted - gene list must be sorted in decreasing order
Named vector - names are gene IDs, values are statistics
No arbitrary cutoffs - uses all genes, not just significant ones
NES sign matters - positive = upregulated enrichment
Leading edge - core_enrichment contains driving genes

Related Skills

go-enrichment - Over-representation analysis for GO
kegg-pathways - Over-representation analysis for KEGG
enrichment-visualization - GSEA plots, ridge plots

bio-pathway-gsea

Safety Notice

Copy this and send it to your AI assistant to learn

Create named vector: values = statistic, names = gene IDs

Sort in decreasing order (REQUIRED)

Convert symbols to Entrez IDs

Create ranked list with Entrez IDs

Signed p-value (recommended for detecting both up and down)

Wald statistic (from DESeq2)

Make readable

Make readable

Read GMT file (Gene Matrix Transposed)

Use msigdbr package for MSigDB gene sets

Hallmark gene sets

Other categories: C1 (positional), C2 (curated), C3 (motif), C5 (GO), C6 (oncogenic), C7 (immunologic)

View results

Key columns:

- NES: Normalized Enrichment Score (positive = upregulated, negative = downregulated)

- pvalue: Nominal p-value

- p.adjust: FDR-adjusted p-value

- core_enrichment: Leading edge genes

Get leading edge genes for a term

Source Transparency

Related Skills

bio-microbiome-diversity-analysis

bio-restriction-fragment-analysis

bio-spatial-transcriptomics-image-analysis

bio-proteomics-dia-analysis