TCGA Bulk Data Preprocessing with OmicVerse

Overview

Use this skill for loading TCGA data from GDC downloads, building normalised expression matrices, attaching clinical metadata, and running survival analyses through ov.bulk.pyTCGA.

Instructions

Gather required downloads

Confirm the user has three items from the GDC Data Portal:

gdc_sample_sheet.<date>.tsv — the sample sheet export
Decompressed gdc_download_xxxxx/ directory with expression archives
clinical.cart.<date>/ directory with clinical XML/JSON files
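The three items above can be located programmatically before any OmicVerse call. The following is a minimal sketch that assumes all three sit directly under one project folder; the helper name `find_gdc_inputs` is hypothetical and not part of OmicVerse.

```python
from pathlib import Path


def find_gdc_inputs(root):
    """Locate the three GDC items directly under `root`.

    Returns a dict of pathlib.Path objects; a value is None when nothing matches.
    Assumes the naming patterns listed above (sample sheet TSV, gdc_download_*
    directory, clinical.cart.* directory).
    """
    root = Path(root)
    patterns = {
        "sample_sheet": "gdc_sample_sheet.*.tsv",
        "download_dir": "gdc_download_*",
        "clinical_dir": "clinical.cart.*",
    }
    found = {}
    for key, pattern in patterns.items():
        matches = sorted(root.glob(pattern))
        if key != "sample_sheet":
            # Skip undecompressed archives such as gdc_download_xxxxx.tar.gz
            matches = [m for m in matches if m.is_dir()]
        found[key] = matches[0] if matches else None
    return found
```

Calling `find_gdc_inputs('data/')` and checking for `None` values gives an early signal that a download is missing or still compressed.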

Initialise the TCGA helper

import omicverse as ov
import scanpy as sc
ov.plot_set()

aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
aml_tcga.adata_init()  # Builds AnnData with raw counts, FPKM, and TPM layers

Persist and reload

aml_tcga.adata.write_h5ad('data/ov_tcga_raw.h5ad', compression='gzip')

To reload later:

new_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)
new_tcga.adata_read('data/ov_tcga_raw.h5ad')

Initialise metadata and survival

aml_tcga.adata_meta_init()  # Gene ID → symbol mapping, patient info
aml_tcga.survial_init()  # NOTE: "survial" spelling; see Critical API Reference below

Run survival analysis

Single gene

aml_tcga.survival_analysis('MYC', layer='deseq_normalize', plot=True)

All genes (can take minutes for large gene sets)

aml_tcga.survial_analysis_all() # NOTE: "survial" spelling

Export results

aml_tcga.adata.write_h5ad('data/ov_tcga_survival.h5ad', compression='gzip')

Critical API Reference

IMPORTANT: Method Name Spelling Inconsistency

The pyTCGA API has a spelling inconsistency that must be reproduced exactly. Two methods use "survial" (the second 'v' is missing), while one uses the correct "survival":

| Method | Spelling | Purpose |
| --- | --- | --- |
| survial_init() | survial (second 'v' missing) | Initialize survival metadata columns |
| survival_analysis(gene, layer, plot) | survival (correct) | Single-gene Kaplan-Meier curve |
| survial_analysis_all() | survial (second 'v' missing) | Sweep all genes for survival significance |

CORRECT — use the exact method names as documented

aml_tcga.survial_init()  # "survial": second 'v' missing
aml_tcga.survival_analysis('MYC', layer='deseq_normalize', plot=True)  # "survival": correct
aml_tcga.survial_analysis_all()  # "survial": second 'v' missing

WRONG — these will raise AttributeError

aml_tcga.survival_init() # AttributeError! Use survial_init()

aml_tcga.survival_analysis_all() # AttributeError! Use survial_analysis_all()

Survival Analysis Methodology

survival_analysis() performs Kaplan-Meier analysis:

Splits patients into high/low expression groups using the median as cutoff
Computes a log-rank test p-value to assess significance
If plot=True, renders survival curves with confidence intervals
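The steps above can be illustrated with a dependency-free sketch. It is not the pyTCGA implementation: the tie rule at the median (values equal to the median go to "low"), the function names, and the toy data are all assumptions, and confidence intervals are omitted.

```python
import math
from statistics import median


def median_split(expr):
    """Label each patient 'high' if expression exceeds the median, else 'low'."""
    cutoff = median(expr)
    return ["high" if x > cutoff else "low" for x in expr]


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1 = death, 0 = censored)."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def logrank_p(times, events, groups):
    """Two-group log-rank test; p-value from a chi-square distribution with 1 df."""
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for u in times if u >= t)  # at risk, both groups
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == "high")
        d = sum(1 for u, e in zip(times, events) if u == t and e)  # deaths at t
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e and g == "high")
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return float("nan")
    chi2 = obs_minus_exp ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, 1 df


# Toy data: expression, follow-up time in days, event flag
expr = [5.1, 7.8, 2.3, 9.4, 1.2, 8.8]
times = [300, 120, 450, 90, 500, 150]
events = [1, 1, 0, 1, 0, 1]
groups = median_split(expr)
print(groups)
print(kaplan_meier(times, events))
print(logrank_p(times, events, groups))
```

A `nan` p-value from this sketch means no event time had patients at risk in both groups, which mirrors the "NaN p-values" symptom listed under Troubleshooting.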

Layer selection matters: Use layer='deseq_normalize' (recommended) because DESeq2 normalization accounts for library size and composition bias, making expression comparable across samples. Alternative: layer='tpm' for TPM-normalized values.
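To make the library-size and composition argument concrete, here is a sketch of the median-of-ratios size-factor estimate that DESeq2 is known for. Whether the deseq_normalize layer is computed in exactly this way is an assumption; the function names and toy counts are illustrative only.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of samples, each a list of raw per-gene counts in the same gene order.
    """
    n_genes = len(counts[0])
    # Only genes with non-zero counts in every sample have a defined log geometric mean.
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in counts)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in counts) / len(counts) for g in usable
    }
    # Per sample: median log-ratio to the per-gene geometric mean, back-transformed.
    return [
        math.exp(median(math.log(s[g]) - log_geo_mean[g] for g in usable))
        for s in counts
    ]


def normalise(counts):
    """Divide each sample's counts by its size factor."""
    factors = size_factors(counts)
    return [[c / f for c in s] for s, f in zip(counts, factors)]


# Sample 2 was sequenced twice as deeply as sample 1; expression is otherwise identical.
raw = [[10, 20, 30], [20, 40, 60]]
print(size_factors(raw))
print(normalise(raw))
```

Because the factor is a median across genes, a handful of very highly expressed genes in one sample shifts it far less than a simple total-count scaling would.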

Defensive Validation Patterns

import os

Before pyTCGA init: verify all paths exist

for name, path in [
    ('sample_sheet', sample_sheet_path),
    ('downloads', download_dir),
    ('clinical', clinical_dir),
]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"TCGA {name} path not found: {path}")

After adata_init(): verify expected layers were created

expected_layers = ['counts', 'fpkm', 'tpm']
for layer in expected_layers:
    if layer not in aml_tcga.adata.layers:
        print(f"WARNING: Missing layer '{layer}'; check if TCGA archives are fully extracted")

Before survival analysis: verify metadata is initialized

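# Heuristic only: hasattr() confirms the method exists, not that it has been called;
# the obs column count is a rough proxy for populated clinical metadata.
if not hasattr(aml_tcga, 'survial_init') or aml_tcga.adata.obs.shape[1] < 5:
    print("WARNING: Run adata_meta_init() and survial_init() before survival analysis")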

Troubleshooting

AttributeError: 'pyTCGA' object has no attribute 'survival_init': Use the misspelled name survial_init() (second 'v' missing). The same applies to survial_analysis_all(). See Critical API Reference above.
KeyError during adata_meta_init(): Gene IDs in the expression matrix don't match the expected format. TCGA uses ENSG IDs; the method maps them to symbols internally. Ensure archives are from the same GDC download.
Empty survival plot or NaN p-values: Clinical XML files are missing date fields (days_to_death, days_to_last_follow_up). Check that the clinical.cart.* directory contains complete XML files, not just metadata JSONs.
survial_analysis_all() runs very slowly: This tests every gene individually. For a genome with ~20,000 genes, expect 5-15 minutes. Consider filtering to genes of interest first.
Sample sheet column mismatch: Verify the TSV uses tab separators and the header row matches GDC's expected format. Re-download from GDC if column names differ.
Missing deseq_normalize layer: This layer is created during adata_meta_init(). If absent, re-run the metadata initialization step.
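For the empty-plot or NaN case, a quick scan of a clinical XML file can show whether the date fields are present at all. This sketch uses only the standard library; the helper name is hypothetical, and the default field names are taken from the troubleshooting entry above, so the actual tag names in your files may differ and can be passed explicitly.

```python
import xml.etree.ElementTree as ET


def missing_survival_fields(xml_path, fields=("days_to_death", "days_to_last_follow_up")):
    """Return the field names that never appear with non-empty text in the file."""
    seen = set()
    for elem in ET.parse(xml_path).iter():
        local = elem.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        if local in fields and (elem.text or "").strip():
            seen.add(local)
    return [f for f in fields if f not in seen]
```

One empty field is normal (a living patient has no days_to_death value); a file that reports both fields as missing carries no usable follow-up time.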

Examples

"Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts."
"Reload a saved AnnData file, attach survival annotations, and export the updated .h5ad."
"Run survival analysis for all genes and store the enriched dataset."

References

Tutorial notebook: t_tcga.ipynb
Quick copy/paste commands: reference.md
