# AnnData

## Overview
AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.
## When to Use This Skill

Use this skill when:

- Creating, reading, or writing AnnData objects
- Working with h5ad, zarr, or other genomics data formats
- Performing single-cell RNA-seq analysis
- Managing large datasets with sparse matrices or backed mode
- Concatenating multiple datasets or experimental batches
- Subsetting, filtering, or transforming annotated data
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools
## Installation

```bash
uv pip install anndata
```

With optional dependencies (quoted so the brackets survive shells like zsh):

```bash
uv pip install "anndata[dev,test,doc]"
```
## Quick Start

### Creating an AnnData object

```python
import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)
```
### Reading data

```python
# Read an h5ad file
adata = ad.read_h5ad('data.h5ad')

# Read in backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')

# Read other formats
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')

# 10x HDF5 files are read via scanpy, not anndata
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
```
### Writing data

```python
# Write an h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')
```
### Basic operations

```python
# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")
```
Core Capabilities
- Data Structure
Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.
See: references/data_structure.md for comprehensive information on:
-
Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
-
Creating AnnData objects from various sources
-
Accessing and manipulating data components
-
Memory-efficient practices
### Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

See references/io_operations.md for details on:

- Native formats (h5ad, zarr)
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
- Backed mode for large datasets
- Remote data access
- Format conversion
- Performance optimization
Common commands:

```python
# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# Read 10x data (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format (transpose so observations are rows)
adata = ad.read_mtx('matrix.mtx').T
```
### Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

See references/concatenation.md for comprehensive coverage of:

- Basic concatenation (axis=0 for observations, axis=1 for variables)
- Join types (inner, outer)
- Merge strategies (same, unique, first, only)
- Tracking data sources with labels
- Lazy concatenation (AnnCollection)
- On-disk concatenation for large datasets
Common commands:

```python
# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation (AnnCollection takes AnnData objects,
# e.g. opened in backed mode, not file paths)
from anndata.experimental import AnnCollection

adatas = [ad.read_h5ad(p, backed='r') for p in ['data1.h5ad', 'data2.h5ad']]
collection = AnnCollection(adatas, join_obs='outer', label='dataset')
```
### Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

See references/manipulation.md for detailed guidance on:

- Subsetting (by indices, names, boolean masks, metadata conditions)
- Transposition
- Copying (full copies vs views)
- Renaming (observations, variables, categories)
- Type conversions (strings to categoricals, sparse/dense)
- Adding/removing data components
- Reordering
- Quality control filtering
Common commands:

```python
# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]         # view (lightweight reference)
copy = adata[0:100, :].copy()  # independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()
```
### Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

See references/best_practices.md for guidelines on:

- Memory management (sparse matrices, categoricals, backed mode)
- Views vs copies
- Data storage optimization
- Performance optimization
- Working with raw data
- Metadata management
- Reproducibility
- Error handling
- Integration with other tools
- Common pitfalls and solutions
Key recommendations:

```python
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```
## Integration with the Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem.

### Scanpy (single-cell analysis)

```python
import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])
```
### Muon (multimodal data)

```python
import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```
### PyTorch integration

```python
from anndata.experimental import AnnLoader

# Create a DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X = batch.X  # feed each mini-batch to the model's training step
```
## Common Workflows

### Single-cell RNA-seq analysis

```python
import anndata as ad
import numpy as np
import scanpy as sc

# 1. Load data (10x HDF5 reading lives in scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control (flatten the sparse-matrix sums before storing in obs)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[adata.obs['n_genes'] > 200].copy()
adata = adata[adata.obs['n_counts'] < 50000].copy()

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']].copy()

# 5. Save processed data
adata.write_h5ad('processed.h5ad')
```
### Batch integration

```python
# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```
### Working with large datasets

```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter on metadata (no matrix data loaded yet)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load the filtered subset into memory
adata_subset = high_quality.to_memory()

# Process the subset
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i + chunk_size, :].to_memory()
    process(chunk)
```
## Troubleshooting

### Out-of-memory errors

Use backed mode or convert to sparse matrices:

```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```

### Slow file reading

Use compression and appropriate formats:

```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```
### Index alignment issues

Always align external data on the index:

```python
# Wrong: assumes the external rows happen to be in the same order as obs
adata.obs['new_col'] = external_data['values']

# Correct: align on the cell identifier
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```
## Additional Resources

- Official documentation: https://anndata.readthedocs.io/
- Scanpy tutorials: https://scanpy.readthedocs.io/
- Scverse ecosystem: https://scverse.org/
- GitHub repository: https://github.com/scverse/anndata