CZ CELLxGENE Census
Overview
The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.
The Census includes:
- 61+ million cells from human and mouse
- Standardized metadata (cell types, tissues, diseases, donors)
- Raw gene expression matrices
- Pre-calculated embeddings and statistics
- Integration with PyTorch, scanpy, and other analysis tools
When to Use This Skill
This skill should be used when:
- Querying single-cell expression data by cell type, tissue, or disease
- Exploring available single-cell datasets and metadata
- Training machine learning models on single-cell data
- Performing large-scale cross-dataset analyses
- Integrating Census data with scanpy or other analysis frameworks
- Computing statistics across millions of cells
- Accessing pre-calculated embeddings or model predictions
Installation and Setup
Install the Census API:
uv pip install cellxgene-census
For machine learning workflows, install additional dependencies:
uv pip install "cellxgene-census[experimental]"
Core Workflow Patterns
1. Opening the Census
Always use the context manager to ensure proper resource cleanup:
import cellxgene_census

# Open the latest stable version
with cellxgene_census.open_soma() as census:
    ...  # Work with census data

# Open a specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    ...  # Work with census data
Key points:
- Use the context manager (with statement) for automatic cleanup
- Specify census_version for reproducible analyses
- The default opens the latest "stable" release
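A malformed census_version only fails once the release directory is consulted, so a local sanity check can catch typos earlier. The sketch below is hypothetical helper code, not part of cellxgene_census. It assumes a version tag is either one of the aliases "stable" or "latest", or a zero-padded ISO date such as the "2023-07-25" used above.

```python
from datetime import date

# Hypothetical helper (not part of cellxgene_census): sanity-check a tag
# before passing it to cellxgene_census.open_soma(census_version=...).
ALIASES = {"stable", "latest"}


def is_plausible_census_version(tag: str) -> bool:
    """Return True for a known alias or a well-formed YYYY-MM-DD date."""
    if tag in ALIASES:
        return True
    try:
        # fromisoformat rejects impossible dates such as 2023-13-40
        parsed = date.fromisoformat(tag)
    except ValueError:
        return False
    # Require the canonical zero-padded form
    return parsed.isoformat() == tag


def is_pinned(tag: str) -> bool:
    """A tag is reproducible only if it names a dated release, not an alias."""
    return is_plausible_census_version(tag) and tag not in ALIASES
```

A check like is_pinned() is useful in production scripts, where an alias such as "stable" would silently move to a newer release over time.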
2. Exploring Census Information
Before querying expression data, explore available datasets and metadata.
Access summary information:
# Get summary statistics (the summary table stores one label/value pair per row)
summary = census["census_info"]["summary"].read().concat().to_pandas()
total_cells = summary.set_index("label")["value"]["total_cell_count"]
print(f"Total cells: {total_cells}")

# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()

# Filter datasets by criteria (this table describes collections and titles;
# per-cell fields such as disease live in obs)
covid_datasets = datasets[datasets["collection_name"].str.contains("COVID", na=False)]

Query cell metadata to understand available data:
# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["cell_type", "tissue"],
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")

# Count cells by specific tissue within the brain
tissue_counts = cell_metadata.groupby("tissue").size()
Important: Always filter for is_primary_data == True to avoid counting duplicate cells unless specifically analyzing duplicates.
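To see why the filter matters, consider a toy example in which one cell was published in two datasets. The records below are invented for illustration and only mimic the shape of obs rows; the helper name count_cell_types is hypothetical.

```python
from collections import Counter

# Invented obs-like records: the same T cell appears in two datasets,
# and only one copy carries is_primary_data == True.
toy_obs = [
    {"dataset_id": "ds_A", "cell_type": "T cell", "is_primary_data": True},
    {"dataset_id": "ds_B", "cell_type": "T cell", "is_primary_data": False},
    {"dataset_id": "ds_A", "cell_type": "B cell", "is_primary_data": True},
]


def count_cell_types(rows, primary_only):
    """Count cells per type, optionally keeping only primary copies."""
    return Counter(
        row["cell_type"]
        for row in rows
        if row["is_primary_data"] or not primary_only
    )


all_counts = count_cell_types(toy_obs, primary_only=False)
primary_counts = count_cell_types(toy_obs, primary_only=True)
print(all_counts["T cell"], primary_counts["T cell"])  # prints "2 1"
```

Without the filter the T cell is counted twice; with it, each biological cell contributes once.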
3. Querying Expression Data (Small to Medium Scale)
For queries that return fewer than about 100k cells and fit in memory, use get_anndata():
# Basic query with cell type and tissue filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",  # or "Mus musculus"
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
    obs_column_names=["assay", "disease", "sex", "donor_id"],
)

# Query specific genes with multiple filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
    obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
    obs_column_names=["cell_type", "tissue_general", "donor_id"],
)
Filter syntax:
- Use obs_value_filter for cell filtering
- Use var_value_filter for gene filtering
- Combine conditions with and, or
- Use in for multiple values: tissue in ['lung', 'liver']
- Select only needed columns with obs_column_names
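Filter strings are easy to mistype when assembled by hand. The rules above can be sketched as a small builder; build_value_filter is a hypothetical helper, not part of cellxgene_census, and it only covers equality and in clauses joined with and.

```python
def build_value_filter(conditions, primary_only=True):
    """Build a value_filter string from {column: value or list of values}."""
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            # Several allowed values become an `in [...]` clause
            clauses.append(f"{column} in {[str(v) for v in value]!r}")
        else:
            clauses.append(f"{column} == {str(value)!r}")
    if primary_only:
        # Exclude duplicate copies of cells by default
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


obs_filter = build_value_filter(
    {"cell_type": "B cell", "tissue_general": ["lung", "liver"]}
)
print(obs_filter)
# cell_type == 'B cell' and tissue_general in ['lung', 'liver'] and is_primary_data == True
```

The resulting string can be passed as obs_value_filter to get_anndata() or as value_filter to get_obs().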
Getting metadata separately:
# Query cell metadata
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general", "donor_id"],
)

# Query gene metadata
gene_metadata = cellxgene_census.get_var(
    census,
    "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["feature_id", "feature_name", "feature_length"],
)
4. Large-Scale Queries (Out-of-Core Processing)
For queries exceeding available RAM, use axis_query() with iterative processing:
import tiledbsoma as soma

# Create axis query
query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=soma.AxisQuery(
        value_filter="tissue_general == 'brain' and is_primary_data == True"
    ),
    var_query=soma.AxisQuery(
        value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
    ),
)

# Iterate through the expression matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
    # batch is a pyarrow.Table with columns:
    # - soma_data: expression value
    # - soma_dim_0: cell (obs) coordinate
    # - soma_dim_1: gene (var) coordinate
    process_batch(batch)  # placeholder for your own per-batch function
Computing incremental statistics:
# Example: calculate the mean over stored (non-zero) expression values.
# The sparse matrix yields only non-zero entries, so zeros are not counted here.
n_observations = 0
sum_values = 0.0

iterator = query.X("raw").tables()
for batch in iterator:
    values = batch["soma_data"].to_numpy()
    n_observations += len(values)
    sum_values += values.sum()

mean_expression = sum_values / n_observations
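Sparse reads yield only stored (non-zero) entries, so a per-gene mean that should include zeros needs the number of queried cells as its denominator. The sketch below is a hypothetical helper that works on plain mappings, such as the result of calling to_pydict() on each pyarrow.Table batch; the toy batches are invented, and in a real query n_cells would come from the query itself (for example its n_obs property).

```python
from collections import defaultdict


def per_gene_mean(batches, n_cells):
    """Mean expression per gene coordinate, counting unstored entries as zero.

    `batches` yields mappings with "soma_dim_1" (gene coordinate) and
    "soma_data" (expression value) sequences.
    """
    totals = defaultdict(float)
    for batch in batches:
        for gene, value in zip(batch["soma_dim_1"], batch["soma_data"]):
            totals[gene] += value
    # Dividing by all queried cells accounts for the implicit zeros
    return {gene: total / n_cells for gene, total in totals.items()}


# Toy batches standing in for table.to_pydict() output (values are invented)
toy_batches = [
    {"soma_dim_0": [0, 1], "soma_dim_1": [7, 7], "soma_data": [2.0, 4.0]},
    {"soma_dim_0": [3], "soma_dim_1": [9], "soma_data": [8.0]},
]
means = per_gene_mean(toy_batches, n_cells=4)
print(means)  # {7: 1.5, 9: 2.0}
```

A pure-Python loop is slow for large batches; the same accumulation can be vectorized with NumPy once the logic is clear.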
5. Machine Learning with PyTorch
For training models, use the experimental PyTorch integration. In releases that ship cellxgene_census.experimental.ml, an ExperimentDataPipe selects the cells to stream and experiment_dataloader() wraps it in a PyTorch DataLoader. This experimental API has changed between releases, so check the documentation for the installed version:
import tiledbsoma as soma
from cellxgene_census.experimental.ml import ExperimentDataPipe, experiment_dataloader

with cellxgene_census.open_soma() as census:
    # Describe which cells and obs columns to stream
    datapipe = ExperimentDataPipe(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'liver' and is_primary_data == True"
        ),
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )
    # Wrap the data pipe in a PyTorch DataLoader
    dataloader = experiment_dataloader(datapipe)

    # Training loop (model, criterion, optimizer, and num_epochs are defined elsewhere)
    for epoch in range(num_epochs):
        for X, obs in dataloader:
            # X: gene expression tensor; obs: integer-encoded obs columns
            labels = obs[:, -1].long()  # cell_type is the last requested column
            # Forward pass
            outputs = model(X)
            loss = criterion(outputs, labels)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Train/test splitting:
# Split the data pipe before creating the loaders
train_datapipe, test_datapipe = datapipe.random_split(
    weights={"train": 0.8, "test": 0.2}, seed=42
)
train_loader = experiment_dataloader(train_datapipe)
test_loader = experiment_dataloader(test_datapipe)
6. Integration with Scanpy
AnnData objects returned by the Census plug directly into scanpy workflows:
import scanpy as sc

# Load data from Census
# tissue_general holds coarse values such as 'brain'; use tissue for specific regions
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="cell_type == 'neuron' and tissue == 'cerebral cortex' and is_primary_data == True",
)

# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# Visualization
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])
7. Multi-Dataset Integration
Query and integrate multiple datasets:
import anndata as ad

# Strategy 1: Query multiple tissues separately
tissues = ["lung", "liver", "kidney"]
adatas = []

for tissue in tissues:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
    )
    # Record which query produced each cell without overwriting the obs "tissue" column
    adata.obs["query_tissue"] = tissue
    adatas.append(adata)

# Concatenate (AnnData.concatenate is deprecated in favor of anndata.concat)
combined = ad.concat(adatas, index_unique="-")

# Strategy 2: Query multiple tissues in a single call
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
)
Key Concepts and Best Practices
Always Filter for Primary Data
Unless analyzing duplicates, always include is_primary_data == True in queries to avoid counting cells multiple times:
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
Specify Census Version for Reproducibility
Always specify the Census version in production analyses:
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    ...  # Run the analysis against the pinned release
Estimate Query Size Before Loading
For large queries, first check the number of cells to avoid memory issues:
# Get cell count
metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["soma_joinid"],
)
n_cells = len(metadata)
print(f"Query will return {n_cells:,} cells")
# If too large (>100k), use out-of-core processing
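Once the cell count is known, a rough estimate helps decide how to load the data. The sketch below is hypothetical helper code, not part of cellxgene_census. It assumes a dense float32 matrix as a worst-case upper bound (the Census returns sparse matrices, which are usually much smaller) and reuses the ~100k-cell rule of thumb from this section.

```python
def estimate_dense_gib(n_cells, n_genes, bytes_per_value=4):
    """Upper-bound memory for a dense cells x genes matrix, in GiB."""
    return n_cells * n_genes * bytes_per_value / 1024**3


def choose_strategy(n_cells, max_cells=100_000):
    """Pick a loading approach using the ~100k-cell rule of thumb."""
    return "get_anndata" if n_cells <= max_cells else "axis_query"


# Example: 100k cells x 20k genes as dense float32
gib = estimate_dense_gib(100_000, 20_000)
print(f"{gib:.2f} GiB, strategy: {choose_strategy(100_000)}")
```

Restricting genes with var_value_filter shrinks n_genes and therefore the estimate, which is often enough to keep a query in memory.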
Use tissue_general for Broader Groupings
The tissue_general field provides coarser categories than tissue, which is useful for cross-tissue analyses:
# Broader grouping
obs_value_filter="tissue_general == 'immune system'"
# Specific tissue
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
Select Only Needed Columns
Minimize data transfer by specifying only required metadata columns:
obs_column_names=["cell_type", "tissue_general", "disease"] # Not all columns
Check Dataset Presence for Gene-Specific Queries
When analyzing specific genes, verify which datasets measured them:
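# The presence matrix is sparse: rows are datasets, columns are genes (var soma_joinid)
presence = cellxgene_census.get_presence_matrix(census, "homo_sapiens", "RNA")
genes = cellxgene_census.get_var(
    census,
    "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["soma_joinid", "feature_name"],
)
gene_presence = presence[:, genes["soma_joinid"].to_numpy()]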
Two-Step Workflow: Explore Then Query
First explore metadata to understand available data, then query expression:
# Step 1: Explore what's available
metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general"],
)
print(metadata.value_counts())

# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)
Available Metadata Fields
Cell Metadata (obs)
Key fields for filtering:
- cell_type, cell_type_ontology_term_id
- tissue, tissue_general, tissue_ontology_term_id
- disease, disease_ontology_term_id
- assay, assay_ontology_term_id
- donor_id, sex, self_reported_ethnicity
- development_stage, development_stage_ontology_term_id
- dataset_id
- is_primary_data (Boolean: True = unique cell)
Gene Metadata (var)
- feature_id (Ensembl gene ID, e.g., "ENSG00000161798")
- feature_name (Gene symbol, e.g., "FOXP2")
- feature_length (Gene length in base pairs)
Reference Documentation
This skill includes detailed reference documentation:
references/census_schema.md
Comprehensive documentation of:
- Census data structure and organization
- All available metadata fields
- Value filter syntax and operators
- SOMA object types
- Data inclusion criteria
When to read: When you need detailed schema information, full list of metadata fields, or complex filter syntax.
references/common_patterns.md
Examples and patterns for:
- Exploratory queries (metadata only)
- Small-to-medium queries (AnnData)
- Large queries (out-of-core processing)
- PyTorch integration
- Scanpy integration workflows
- Multi-dataset integration
- Best practices and common pitfalls
When to read: When implementing specific query patterns, looking for code examples, or troubleshooting common issues.
Common Use Cases
Use Case 1: Explore Cell Types in a Tissue
with cellxgene_census.open_soma() as census:
    cells = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["cell_type"],
    )
    print(cells["cell_type"].value_counts())
Use Case 2: Query Marker Gene Expression
with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
        obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
    )
Use Case 3: Train Cell Type Classifier
import tiledbsoma as soma
from cellxgene_census.experimental.ml import ExperimentDataPipe, experiment_dataloader

with cellxgene_census.open_soma() as census:
    # Select the cells and obs columns to stream
    datapipe = ExperimentDataPipe(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_query=soma.AxisQuery(value_filter="is_primary_data == True"),
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )
    dataloader = experiment_dataloader(datapipe)

    # Train model (epochs is defined elsewhere)
    for epoch in range(epochs):
        for X, obs in dataloader:
            # Training logic
            pass
Use Case 4: Cross-Tissue Analysis
import scanpy as sc

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
    )

# Normalize and log-transform the raw counts before differential expression
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
# Analyze macrophage differences across tissues
sc.tl.rank_genes_groups(adata, groupby="tissue_general")
Troubleshooting
Query Returns Too Many Cells
- Add more specific filters to reduce scope
- Use tissue instead of tissue_general for finer granularity
- Filter by specific dataset_id if known
- Switch to out-of-core processing for large queries
Memory Errors
- Reduce query scope with more restrictive filters
- Select fewer genes with var_value_filter
- Use out-of-core processing with axis_query()
- Process data in batches
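One way to process data in batches is to fetch the matching soma_joinid values first and then load them a chunk at a time. The sketch below shows only the chunking step; chunked is a hypothetical helper, and the IDs are invented stand-ins for a soma_joinid column returned by get_obs(). Each chunk could then be passed as the obs_coords argument of get_anndata().

```python
def chunked(ids, chunk_size):
    """Yield consecutive slices of `ids` with at most `chunk_size` items each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(ids), chunk_size):
        yield ids[start:start + chunk_size]


# Invented soma_joinid values standing in for metadata["soma_joinid"].tolist()
toy_joinids = list(range(10))
batches = list(chunked(toy_joinids, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Choosing the chunk size from the memory estimate for a single chunk keeps peak usage predictable.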
Duplicate Cells in Results
- Always include is_primary_data == True in filters
- Check if intentionally querying across multiple datasets
Gene Not Found
- Verify gene name spelling (case-sensitive)
- Try Ensembl ID with feature_id instead of feature_name
- Check dataset presence matrix to see if gene was measured
- Some genes may have been filtered during Census construction
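When a gene list mixes symbols and Ensembl IDs, the filter has to target the right column for each. The sketch below is a hypothetical helper, not part of cellxgene_census. It assumes Ensembl gene IDs follow the pattern ENS, an optional species code, G, and 11 digits (as in the "ENSG00000161798" example above), and treats everything else as a gene symbol.

```python
import re

# Assumed Ensembl gene ID shape: ENS + optional species letters + G + 11 digits
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}$")


def build_var_filter(genes):
    """Route Ensembl IDs to feature_id and other names to feature_name."""
    ids = [g for g in genes if ENSEMBL_GENE_ID.match(g)]
    names = [g for g in genes if not ENSEMBL_GENE_ID.match(g)]
    clauses = []
    if ids:
        clauses.append(f"feature_id in {ids!r}")
    if names:
        clauses.append(f"feature_name in {names!r}")
    return " or ".join(clauses)


var_filter = build_var_filter(["FOXP2", "ENSG00000161798"])
print(var_filter)
# feature_id in ['ENSG00000161798'] or feature_name in ['FOXP2']
```

The result can be used as var_value_filter in get_anndata() or as value_filter in get_var(); an empty result from get_var() then points to a misspelled or unmeasured gene.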
Version Inconsistencies
- Always specify census_version explicitly
- Use same version across all analyses
- Check release notes for version-specific changes