bio-pathway-gsea

Gene Set Enrichment Analysis (GSEA)

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "bio-pathway-gsea" with this command: npx skills add gptomics/bioskills/gptomics-bioskills-bio-pathway-gsea

Gene Set Enrichment Analysis (GSEA)

Core Concept

GSEA uses all genes ranked by a statistic (log2FC, signed p-value) rather than a subset of significant genes. It finds gene sets where members are enriched at the top or bottom of the ranked list.

Prepare Ranked Gene List

library(clusterProfiler) library(org.Hs.eg.db)

de_results <- read.csv('de_results.csv')

Create named vector: values = statistic, names = gene IDs

gene_list <- de_results$log2FoldChange names(gene_list) <- de_results$gene_id

Sort in decreasing order (REQUIRED)

gene_list <- sort(gene_list, decreasing = TRUE)

Convert Gene IDs for GSEA

Convert symbols to Entrez IDs

gene_ids <- bitr(names(gene_list), fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)

Create ranked list with Entrez IDs

gene_list_entrez <- gene_list[names(gene_list) %in% gene_ids$SYMBOL] names(gene_list_entrez) <- gene_ids$ENTREZID[match(names(gene_list_entrez), gene_ids$SYMBOL)] gene_list_entrez <- sort(gene_list_entrez, decreasing = TRUE)

Alternative Ranking Statistics

Signed p-value (recommended for detecting both up and down)

gene_list <- -log10(de_results$pvalue) * sign(de_results$log2FoldChange) names(gene_list) <- de_results$gene_id gene_list <- sort(gene_list, decreasing = TRUE)

Wald statistic (from DESeq2)

gene_list <- de_results$stat names(gene_list) <- de_results$gene_id gene_list <- sort(gene_list, decreasing = TRUE)

GSEA with GO

gse_go <- gseGO( geneList = gene_list_entrez, OrgDb = org.Hs.eg.db, ont = 'BP', # BP, MF, CC, or ALL minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE, pAdjustMethod = 'BH' )

Make readable

gse_go <- setReadable(gse_go, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')

GSEA with KEGG

gse_kegg <- gseKEGG( geneList = gene_list_entrez, organism = 'hsa', minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE )

Make readable

gse_kegg <- setReadable(gse_kegg, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')

GSEA with Custom Gene Sets

Read GMT file (Gene Matrix Transposed)

gene_sets <- read.gmt('msigdb_hallmarks.gmt')

gse_custom <- GSEA( geneList = gene_list_entrez, TERM2GENE = gene_sets, minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05 )

MSigDB Gene Sets

Use msigdbr package for MSigDB gene sets

library(msigdbr)

Hallmark gene sets

hallmarks <- msigdbr(species = 'Homo sapiens', category = 'H') hallmarks_t2g <- hallmarks[, c('gs_name', 'entrez_gene')]

gse_hallmark <- GSEA( geneList = gene_list_entrez, TERM2GENE = hallmarks_t2g, pvalueCutoff = 0.05 )

Other categories: C1 (positional), C2 (curated), C3 (motif), C5 (GO), C6 (oncogenic), C7 (immunologic)

Understanding Results

View results

head(gse_go) results <- as.data.frame(gse_go)

Key columns:

- NES: Normalized Enrichment Score (positive = upregulated, negative = downregulated)

- pvalue: Nominal p-value

- p.adjust: FDR-adjusted p-value

- core_enrichment: Leading edge genes

Interpreting NES (Normalized Enrichment Score)

NES Interpretation

Positive (> 0) Gene set enriched in upregulated genes

Negative (< 0) Gene set enriched in downregulated genes

NES

Key Parameters

Parameter Default Description

geneList required Named, sorted numeric vector

OrgDb required Organism database (for gseGO)

organism hsa KEGG organism code (for gseKEGG)

ont BP Ontology: BP, MF, CC, ALL

minGSSize 10 Min genes in gene set

maxGSSize 500 Max genes in gene set

pvalueCutoff 0.05 P-value threshold

pAdjustMethod BH Adjustment method

nPerm 10000 Permutations (if permutation test used)

eps 1e-10 Boundary for p-value calculation

Export Results

results_df <- as.data.frame(gse_go) write.csv(results_df, 'gsea_go_results.csv', row.names = FALSE)

Get leading edge genes for a term

leading_edge <- strsplit(results_df$core_enrichment[1], '/')[[1]]

Notes

  • Must be sorted - gene list must be sorted in decreasing order

  • Named vector - names are gene IDs, values are statistics

  • No arbitrary cutoffs - uses all genes, not just significant ones

  • NES sign matters - positive = upregulated enrichment

  • Leading edge - core_enrichment contains driving genes

Related Skills

  • go-enrichment - Over-representation analysis for GO

  • kegg-pathways - Over-representation analysis for KEGG

  • enrichment-visualization - GSEA plots, ridge plots

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Research

bio-microbiome-diversity-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

bio-restriction-fragment-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

bio-spatial-transcriptomics-image-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

bio-proteomics-dia-analysis

No summary provided by upstream source.

Repository SourceNeeds Review