cBioPortal Database

Overview

cBioPortal for Cancer Genomics (https://www.cbioportal.org/) is an open-access resource for exploring, visualizing, and analyzing multidimensional cancer genomics data. It hosts data from The Cancer Genome Atlas (TCGA), AACR Project GENIE, MSK-IMPACT, and hundreds of other cancer studies — covering mutations, copy number alterations (CNA), structural variants, mRNA/protein expression, methylation, and clinical data for thousands of cancer samples.

When to Use This Skill

Use cBioPortal for questions such as:

  • Mutation landscape: What fraction of a cancer type has mutations in a specific gene?

  • Oncogene/TSG validation: Is a gene frequently mutated, amplified, or deleted in cancer?

  • Co-mutation patterns: Are mutations in gene A and gene B mutually exclusive or co-occurring?

  • Survival analysis: Do mutations in a gene associate with better or worse patient outcomes?

  • Alteration profiles: What types of alterations (missense, truncating, amplification, deletion) affect a gene?

  • Pan-cancer analysis: Compare alteration frequencies across cancer types

  • Clinical associations: Link genomic alterations to clinical variables (stage, grade, treatment response)

  • TCGA/GENIE exploration: Systematic access to TCGA and clinical sequencing datasets
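
The co-mutation question above (are alterations in gene A and gene B mutually exclusive or co-occurring?) comes down to a 2x2 contingency table over samples. The following is a minimal sketch using only the standard library, not the portal's own implementation. The function name `cooccurrence_stats` and its inputs (plain collections of sample IDs) are assumptions made for illustration.

```python
from math import comb, log2

def cooccurrence_stats(samples_a, samples_b, all_samples):
    """Return 2x2 counts, log2 odds ratio, and two-sided Fisher exact p-value."""
    universe = set(all_samples)
    a = set(samples_a) & universe
    b = set(samples_b) & universe
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = len(universe) - both - only_a - only_b

    # Haldane-Anscombe correction (+0.5 per cell) keeps the ratio finite when a cell is 0
    odds_ratio = ((both + 0.5) * (neither + 0.5)) / ((only_a + 0.5) * (only_b + 0.5))

    n, n_a, n_b = len(universe), len(a), len(b)

    def table_prob(k):
        # Hypergeometric probability of k doubly-altered samples given fixed margins
        return comb(n_b, k) * comb(n - n_b, n_a - k) / comb(n, n_a)

    p_observed = table_prob(both)
    k_min, k_max = max(0, n_a + n_b - n), min(n_a, n_b)
    # Two-sided p-value: sum over all tables no more likely than the observed one
    p_value = sum(
        p for p in (table_prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return {
        "both": both, "only_a": only_a, "only_b": only_b, "neither": neither,
        "log2_odds_ratio": log2(odds_ratio),
        "p_value": min(p_value, 1.0),
        "tendency": "co-occurrence" if odds_ratio > 1 else "mutual exclusivity",
    }
```

A positive log2 odds ratio suggests co-occurrence and a negative one suggests mutual exclusivity. With many gene pairs, the p-values would still need multiple-testing correction.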

Core Capabilities

  1. cBioPortal REST API

Base URL: https://www.cbioportal.org/api

The API is RESTful, returns JSON, and requires no API key for public data.

import requests

BASE_URL = "https://www.cbioportal.org/api"
HEADERS = {"Accept": "application/json", "Content-Type": "application/json"}

def cbioportal_get(endpoint, params=None):
    url = f"{BASE_URL}/{endpoint}"
    response = requests.get(url, params=params, headers=HEADERS)
    response.raise_for_status()
    return response.json()

def cbioportal_post(endpoint, body):
    url = f"{BASE_URL}/{endpoint}"
    response = requests.post(url, json=body, headers=HEADERS)
    response.raise_for_status()
    return response.json()

  2. Browse Studies

def get_all_studies():
    """List all available cancer studies."""
    return cbioportal_get("studies", {"pageSize": 500})

# Each study has:
#   studyId: unique identifier (e.g., "brca_tcga")
#   name: human-readable name
#   description: dataset description
#   cancerTypeId: cancer type abbreviation
#   referenceGenome: hg19 (GRCh37) or hg38 (GRCh38)
#   pmid: associated publication

studies = get_all_studies()
print(f"Total studies: {len(studies)}")

# Common TCGA study IDs:
#   brca_tcga, luad_tcga, coadread_tcga, gbm_tcga, prad_tcga,
#   skcm_tcga, blca_tcga, hnsc_tcga, lihc_tcga, stad_tcga

# Filter for TCGA studies
tcga_studies = [s for s in studies if "tcga" in s["studyId"]]
print([s["studyId"] for s in tcga_studies[:10]])
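
A fixed pageSize of 500 silently truncates the list if the portal hosts more studies than that. The sketch below pages through results until a short page signals the end, assuming the endpoint accepts `pageSize` and a zero-based `pageNumber` query parameter. The helper name `fetch_all_pages` is hypothetical, and the page-fetching callable is passed in so the loop can be exercised without network access.

```python
def fetch_all_pages(fetch_page, page_size=200):
    """Collect results from a paged endpoint until a short page signals the end.

    fetch_page is any callable that takes a params dict and returns a list,
    e.g. lambda params: cbioportal_get("studies", params).
    """
    results = []
    page_number = 0  # assumed to be zero-based
    while True:
        page = fetch_page({"pageSize": page_size, "pageNumber": page_number})
        results.extend(page)
        if len(page) < page_size:
            break  # last (possibly empty) page reached
        page_number += 1
    return results
```

When the total is an exact multiple of the page size, the loop issues one extra request that returns an empty page and then stops.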

  3. Molecular Profiles

Each study has multiple molecular profiles (mutation, CNA, expression, etc.):

def get_molecular_profiles(study_id):
    """Get all molecular profiles for a study."""
    return cbioportal_get(f"studies/{study_id}/molecular-profiles")

profiles = get_molecular_profiles("brca_tcga")
for p in profiles:
    print(f"  {p['molecularProfileId']}: {p['name']} ({p['molecularAlterationType']})")

# Alteration types:
#   MUTATION_EXTENDED: somatic mutations
#   COPY_NUMBER_ALTERATION: CNA (GISTIC)
#   MRNA_EXPRESSION: mRNA expression
#   PROTEIN_LEVEL: RPPA protein expression
#   STRUCTURAL_VARIANT: fusions/rearrangements
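
Profile IDs are not guaranteed to follow a single naming pattern across studies, so it can be safer to look a profile up by its alteration type than to build the ID from the study name. The sketch below works on the list of profile records shown above. The helper names `profiles_by_type` and `pick_profile` are hypothetical.

```python
def profiles_by_type(profiles):
    """Group molecular profile IDs by their molecularAlterationType."""
    index = {}
    for profile in profiles:
        index.setdefault(profile["molecularAlterationType"], []).append(
            profile["molecularProfileId"]
        )
    return index

def pick_profile(profiles, alteration_type):
    """Return the first profile ID of the given type, or None if the study has none."""
    ids = profiles_by_type(profiles).get(alteration_type, [])
    return ids[0] if ids else None
```

A study may expose several profiles of one type (for example multiple expression profiles), so taking the first match is a simplification. Inspect the grouped index when the choice matters.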

  4. Mutation Data

def get_mutations(molecular_profile_id, entrez_gene_ids, sample_list_id=None):
    """Get mutations for specified genes in a molecular profile."""
    body = {
        "entrezGeneIds": entrez_gene_ids,
        # Default assumes the "<study>_mutations" / "<study>_all" naming convention
        "sampleListId": sample_list_id or molecular_profile_id.replace("_mutations", "_all"),
    }
    return cbioportal_post(
        f"molecular-profiles/{molecular_profile_id}/mutations/fetch", body
    )

# BRCA1 Entrez ID is 672, TP53 is 7157, PTEN is 5728
mutations = get_mutations("brca_tcga_mutations", entrez_gene_ids=[7157])  # TP53

# Each mutation record contains:
#   patientId, sampleId, entrezGeneId
#   gene.hugoGeneSymbol (nested gene object; included with projection=DETAILED)
#   mutationType (Missense_Mutation, Nonsense_Mutation, Frame_Shift_Del, etc.)
#   proteinChange (e.g., "R175H")
#   variantType
#   ncbiBuild, chr, startPosition, endPosition, referenceAllele, variantAllele
#   mutationStatus (Somatic/Germline)
#   tumorAltCount, tumorRefCount (tumor VAF = alt / (alt + ref))

import pandas as pd

df = pd.DataFrame(mutations)
# Derive tumor VAF from read counts; counts can be missing in some studies
df["tumorVaf"] = df["tumorAltCount"] / (df["tumorAltCount"] + df["tumorRefCount"])
print(df[["patientId", "mutationType", "proteinChange", "tumorVaf"]].head())
print(f"\nMutation types:\n{df['mutationType'].value_counts()}")
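
To summarize which kinds of mutation affect a gene, the mutationType values can be collapsed into a few broader classes. The mapping below is an illustrative assumption, not the portal's official grouping, and the helper names are hypothetical. It counts each sample once per class so that a sample with two missense calls is not double-counted.

```python
from collections import Counter

# Illustrative grouping of MAF-style mutationType values (an assumption; adjust as needed)
MUTATION_CLASSES = {
    "Missense_Mutation": "missense",
    "Nonsense_Mutation": "truncating",
    "Frame_Shift_Del": "truncating",
    "Frame_Shift_Ins": "truncating",
    "Splice_Site": "truncating",
    "In_Frame_Del": "inframe",
    "In_Frame_Ins": "inframe",
}

def classify_mutation(mutation_type):
    """Map a mutationType string to a coarse class; unknown types become 'other'."""
    return MUTATION_CLASSES.get(mutation_type, "other")

def mutation_class_counts(mutations):
    """Count unique samples per mutation class (a sample counts once per class)."""
    seen = {(m["sampleId"], classify_mutation(m.get("mutationType"))) for m in mutations}
    return Counter(cls for _, cls in seen)
```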

  5. Copy Number Alteration Data

def get_cna(molecular_profile_id, entrez_gene_ids):
    """Get discrete CNA data (GISTIC: -2, -1, 0, 1, 2)."""
    body = {
        "entrezGeneIds": entrez_gene_ids,
        "sampleListId": molecular_profile_id.replace("_gistic", "_all").replace("_cna", "_all"),
    }
    # discreteCopyNumberEventType defaults to HOMDEL_AND_AMP (only -2 and 2);
    # ALL requests every GISTIC level
    return cbioportal_post(
        f"molecular-profiles/{molecular_profile_id}/discrete-copy-number/fetch"
        "?discreteCopyNumberEventType=ALL",
        body,
    )

# GISTIC values:
#   -2 = Deep deletion (homozygous loss)
#   -1 = Shallow deletion (heterozygous loss)
#    0 = Diploid (neutral)
#    1 = Low-level gain
#    2 = High-level amplification

cna_data = get_cna("brca_tcga_gistic", entrez_gene_ids=[1956])  # EGFR
df_cna = pd.DataFrame(cna_data)
# The GISTIC level is returned in the "alteration" field
print(df_cna["alteration"].value_counts())
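
The GISTIC codes listed above are easier to read in a report once they are mapped to labels. Here is a small sketch that takes a plain list of integer levels, however they were extracted from the response, and returns counts and percentages per label. The names `GISTIC_LABELS` and `summarize_gistic` are hypothetical.

```python
from collections import Counter

GISTIC_LABELS = {
    -2: "Deep deletion",
    -1: "Shallow deletion",
    0: "Diploid",
    1: "Low-level gain",
    2: "High-level amplification",
}

def summarize_gistic(levels):
    """Return {label: (count, percent)} for a sequence of GISTIC levels."""
    counts = Counter(GISTIC_LABELS.get(int(level), "Unknown") for level in levels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: (n, 100.0 * n / total) for label, n in counts.items()}
```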

  6. Alteration Frequency (OncoPrint-style)

def get_alteration_frequency(study_id, gene_symbols, alteration_types=None):
    """Compute mutation frequencies (% of samples) for genes in a cancer study.

    alteration_types is currently unused; only mutations are counted.
    """
    # Get the sample list that covers all cases in the study
    sample_lists = cbioportal_get(f"studies/{study_id}/sample-lists")
    all_samples_id = next(
        (s["sampleListId"] for s in sample_lists if s["category"] == "all_cases_in_study"),
        None,
    )
    if all_samples_id is None:
        raise ValueError(f"No all-cases sample list found for {study_id}")
    total_samples = len(cbioportal_get(f"sample-lists/{all_samples_id}/sample-ids"))

    # Map gene symbols to Entrez IDs (the body is a list of IDs; geneIdType selects the kind)
    gene_resp = requests.post(
        f"{BASE_URL}/genes/fetch",
        params={"geneIdType": "HUGO_GENE_SYMBOL"},
        json=gene_symbols,
        headers=HEADERS,
    )
    gene_resp.raise_for_status()
    entrez_by_symbol = {g["hugoGeneSymbol"]: g["entrezGeneId"] for g in gene_resp.json()}

    # Get mutations (assumes the study has a "<study>_mutations" profile)
    mutations = get_mutations(
        f"{study_id}_mutations", list(entrez_by_symbol.values()), all_samples_id
    )

    freq = {}
    for symbol, e_id in entrez_by_symbol.items():
        # Count samples, not patients, so the numerator matches the denominator
        mutated = len({m["sampleId"] for m in mutations if m["entrezGeneId"] == e_id})
        freq[symbol] = mutated / total_samples * 100
    return freq

# Example
freq = get_alteration_frequency("brca_tcga", ["TP53", "PIK3CA", "BRCA1", "BRCA2"])
for gene, pct in sorted(freq.items(), key=lambda x: -x[1]):
    print(f"  {gene}: {pct:.1f}%")
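
An OncoPrint-style view usually combines mutations with high-level copy number events, so a sample counts as altered if it has either. The sketch below merges the two per sample, assuming the mutated sample IDs and a sample-to-GISTIC-level mapping have already been extracted. The function names and the "MUT" / "AMP" / "HOMDEL" labels are illustrative assumptions.

```python
def combine_alterations(all_samples, mutated_samples, cna_levels):
    """Return {sampleId: [event labels]} combining mutations and deep CNA events."""
    mutated = set(mutated_samples)
    per_sample = {}
    for sample in all_samples:
        events = []
        if sample in mutated:
            events.append("MUT")
        level = cna_levels.get(sample, 0)  # missing CNA data is treated as diploid
        if level == 2:
            events.append("AMP")
        elif level == -2:
            events.append("HOMDEL")
        per_sample[sample] = events
    return per_sample

def altered_percentage(per_sample):
    """Percentage of samples with at least one event."""
    if not per_sample:
        return 0.0
    altered = sum(1 for events in per_sample.values() if events)
    return 100.0 * altered / len(per_sample)
```

Treating samples without CNA data as diploid is a simplification. A stricter analysis would restrict the denominator to samples profiled for both data types.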

  7. Clinical Data

def get_clinical_data(study_id, attribute_ids=None):
    """Get patient-level clinical data."""
    # GET /studies/{studyId}/clinical-data serves a single study;
    # /clinical-data/fetch only accepts POST
    all_clinical = cbioportal_get(
        f"studies/{study_id}/clinical-data", {"clinicalDataType": "PATIENT"}
    )
    # Returns list of {patientId, studyId, clinicalAttributeId, value}
    if attribute_ids:
        all_clinical = [c for c in all_clinical if c["clinicalAttributeId"] in attribute_ids]
    return all_clinical

# Clinical attributes include:
#   OS_STATUS, OS_MONTHS, DFS_STATUS, DFS_MONTHS (survival)
#   TUMOR_STAGE, GRADE, AGE, SEX, RACE
#   Study-specific attributes vary

def get_clinical_attributes(study_id):
    """List all available clinical attributes for a study."""
    return cbioportal_get(f"studies/{study_id}/clinical-attributes")
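
Clinical records arrive in long format, one record per patient and attribute, while most analyses want one row per patient. The sketch below reshapes such records with plain dictionaries and converts numeric strings defensively. The helper names `clinical_to_wide` and `to_float` are hypothetical.

```python
def clinical_to_wide(records, id_key="patientId"):
    """Reshape [{patientId, clinicalAttributeId, value}, ...] into {patientId: {attr: value}}."""
    wide = {}
    for rec in records:
        wide.setdefault(rec[id_key], {})[rec["clinicalAttributeId"]] = rec["value"]
    return wide

def to_float(value):
    """Convert a clinical value to float, returning None for missing or non-numeric text."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None
```

Values are kept as strings in the wide table, so `to_float` can be applied to attributes such as OS_MONTHS only where a number is expected.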

Query Workflows

Workflow 1: Gene Alteration Profile in a Cancer Type

import requests
import pandas as pd

def alteration_profile(study_id, gene_symbol):
    """Full alteration profile for a gene in a cancer study."""
    # 1. Get gene Entrez ID (the body is a list of IDs; geneIdType selects the kind)
    gene_resp = requests.post(
        f"{BASE_URL}/genes/fetch",
        params={"geneIdType": "HUGO_GENE_SYMBOL"},
        json=[gene_symbol],
        headers=HEADERS,
    )
    gene_resp.raise_for_status()
    entrez_id = gene_resp.json()[0]["entrezGeneId"]

    # 2. Get mutations
    mutations = get_mutations(f"{study_id}_mutations", [entrez_id])
    mut_df = pd.DataFrame(mutations) if mutations else pd.DataFrame()

    # 3. Get CNAs (assumes the study has a "<study>_gistic" profile)
    cna = get_cna(f"{study_id}_gistic", [entrez_id])
    cna_df = pd.DataFrame(cna) if cna else pd.DataFrame()

    # 4. Summary (the GISTIC level is in the "alteration" field)
    n_mut = mut_df["patientId"].nunique() if not mut_df.empty else 0
    n_amp = int((cna_df["alteration"] == 2).sum()) if not cna_df.empty else 0
    n_del = int((cna_df["alteration"] == -2).sum()) if not cna_df.empty else 0

    return {"mutations": n_mut, "amplifications": n_amp, "deep_deletions": n_del}

result = alteration_profile("brca_tcga", "PIK3CA")
print(result)

Workflow 2: Pan-Cancer Gene Mutation Frequency

import requests
import pandas as pd

def pan_cancer_mutation_freq(gene_symbol, cancer_study_ids=None):
    """Mutation frequency of a gene across multiple cancer types."""
    studies = get_all_studies()
    if cancer_study_ids:
        studies = [s for s in studies if s["studyId"] in cancer_study_ids]

    results = []
    for study in studies[:20]:  # Limit for demo
        try:
            freq = get_alteration_frequency(study["studyId"], [gene_symbol])
        except Exception as exc:
            # Studies without a usable mutation profile fail here; report and skip them
            print(f"Skipping {study['studyId']}: {exc}")
            continue
        results.append({
            "study": study["studyId"],
            "cancer": study.get("cancerTypeId", ""),
            "mutation_pct": freq.get(gene_symbol, 0),
        })

    # Explicit columns keep sort_values from failing when no study succeeded
    df = pd.DataFrame(results, columns=["study", "cancer", "mutation_pct"])
    return df.sort_values("mutation_pct", ascending=False)

Workflow 3: Survival Analysis by Mutation Status

import requests
import pandas as pd

def survival_by_mutation(study_id, gene_symbol):
    """Get survival data split by mutation status."""
    # This workflow fetches clinical and mutation data for downstream analysis
    gene_resp = requests.post(
        f"{BASE_URL}/genes/fetch",
        params={"geneIdType": "HUGO_GENE_SYMBOL"},
        json=[gene_symbol],
        headers=HEADERS,
    )
    gene_resp.raise_for_status()
    entrez_id = gene_resp.json()[0]["entrezGeneId"]

    mutations = get_mutations(f"{study_id}_mutations", [entrez_id])
    mutated_patients = {m["patientId"] for m in mutations}

    # Patient-level clinical data in long format (one row per patient and attribute);
    # /clinical-data/fetch only accepts POST, so use the per-study GET endpoint
    clinical = cbioportal_get(
        f"studies/{study_id}/clinical-data", {"clinicalDataType": "PATIENT"}
    )
    clinical_df = pd.DataFrame(clinical)

    os_data = clinical_df[clinical_df["clinicalAttributeId"].isin(["OS_MONTHS", "OS_STATUS"])]
    os_wide = os_data.pivot(index="patientId", columns="clinicalAttributeId", values="value")
    os_wide["mutated"] = os_wide.index.isin(mutated_patients)

    # Values are strings; convert OS_MONTHS with pd.to_numeric before analysis
    return os_wide
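
The table returned by this workflow still needs a survival estimator before the two groups can be compared. Below is a dependency-free sketch of the Kaplan-Meier product-limit estimate. It assumes OS_STATUS values look like "1:DECEASED" / "0:LIVING" or "DECEASED" / "LIVING", which can vary by study, and both function names are hypothetical. For real analyses, a dedicated survival library with a log-rank test would be the more appropriate tool.

```python
def parse_os_status(value):
    """True if a death event was observed (assumed formats: '1:DECEASED' or 'DECEASED')."""
    text = str(value).strip().upper()
    return text.startswith("1:") or text == "DECEASED"

def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with at least one event."""
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing this time point
        while i < len(pairs) and pairs[i][0] == t:
            deaths += 1 if pairs[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # both events and censored subjects leave the risk set
    return curve
```

Running `kaplan_meier` once on the mutated group and once on the wild-type group gives two curves that can be plotted side by side.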

Key API Endpoints Summary

| Endpoint | Description |
|---|---|
| GET /studies | List all studies |
| GET /studies/{studyId}/molecular-profiles | Molecular profiles for a study |
| POST /molecular-profiles/{profileId}/mutations/fetch | Get mutation data |
| POST /molecular-profiles/{profileId}/discrete-copy-number/fetch | Get CNA data |
| POST /molecular-profiles/{profileId}/molecular-data/fetch | Get expression data |
| GET /studies/{studyId}/clinical-attributes | Available clinical variables |
| GET /studies/{studyId}/clinical-data | Clinical data for one study |
| POST /clinical-data/fetch | Clinical data selected by a filter in the request body |
| POST /genes/fetch | Gene metadata by symbol or Entrez ID |
| GET /studies/{studyId}/sample-lists | Sample lists |

Best Practices

  • Know your study IDs: Use the Swagger UI or GET /studies to find the correct study ID

  • Use sample lists: Each study has an all-samples list ({studyId}_all) and subsets; always specify the appropriate one

  • TCGA vs. GENIE: TCGA data is comprehensive but older; GENIE has more recent clinical sequencing data

  • Entrez gene IDs: The API uses Entrez IDs — use /genes/fetch to convert from symbols

  • Handle 404s: Some molecular profiles may not exist for all studies

  • Rate limiting: Add delays for bulk queries; consider downloading data files for large-scale analyses
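
For the rate-limiting advice above, one possible pattern is to pause between requests and back off exponentially on failures. The sketch below is an assumption-laden illustration: the function name `throttled_fetch`, the default delay, and the retry count are hypothetical and not documented limits of the API. The fetch and sleep callables are injected so the logic can be tested without network access.

```python
import time

def throttled_fetch(fetch, items, delay_s=0.5, max_retries=2, sleep=time.sleep):
    """Call fetch(item) for each item with a pause between items and retries on error.

    Returns (results, errors): two dicts keyed by item.
    """
    results, errors = {}, {}
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_s)  # fixed pause between consecutive items
        for attempt in range(max_retries + 1):
            try:
                results[item] = fetch(item)
                break
            except Exception as exc:
                if attempt == max_retries:
                    errors[item] = exc  # give up on this item, keep going
                else:
                    sleep(delay_s * 2 ** (attempt + 1))  # exponential backoff
    return results, errors
```

A typical call might pass `lambda study_id: cbioportal_get(f"studies/{study_id}/molecular-profiles")` as `fetch`. Items that fail every attempt end up in `errors`, so a missing profile (404) does not abort the whole batch.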

Data Downloads

For large-scale analyses, download study data directly:

# Download TCGA BRCA data
wget https://cbioportal-datahub.s3.amazonaws.com/brca_tcga.tar.gz
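
Once an archive is extracted, the data files are tab-separated text, and some begin with metadata lines prefixed by "#". The sketch below parses such text with the standard library. The function name is hypothetical, and the file names mentioned in the usage note are assumptions to check against the extracted archive.

```python
import csv

def read_cbioportal_tsv(text):
    """Parse tab-separated text into a list of dicts, skipping '#' metadata and blank lines."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    # QUOTE_NONE: treat quote characters in annotation fields as literal text
    return list(csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))
```

Usage might look like `read_cbioportal_tsv(open("brca_tcga/data_clinical_patient.txt").read())`, assuming that path exists after extraction. For files with millions of rows, a streaming reader or pandas would be a better fit than loading everything into a list.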
