Arboreto
Overview
Arboreto is a computational library for inferring gene regulatory networks (GRNs) from gene expression data using parallelized algorithms that scale from single machines to multi-node clusters.
Core capability: Identify which transcription factors (TFs) regulate which target genes based on expression patterns across observations (cells, samples, conditions).
Quick Start
Install arboreto:
pip install arboreto
Basic GRN inference:
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load expression data (genes as columns)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')

    # Infer regulatory network
    network = grnboost2(expression_data=expression_matrix)

    # Save results (TF, target, importance)
    network.to_csv('network.tsv', sep='\t', index=False, header=False)
Critical: Always wrap the entry point in an if __name__ == '__main__': guard, because Dask spawns new processes.
Core Capabilities
- Basic GRN Inference
Covers standard GRN inference workflows, including:
- Input data preparation (Pandas DataFrame or NumPy array)
- Running inference with GRNBoost2 or GENIE3
- Filtering by transcription factors
- Output format and interpretation
See: references/basic_inference.md
For standard inference tasks, use the ready-to-run script scripts/basic_grn_inference.py:
python scripts/basic_grn_inference.py expression_data.tsv output_network.tsv --tf-file tfs.txt --seed 777
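The input-preparation and TF-filtering steps listed above can be sketched with plain pandas. The helper names `to_observations_by_genes` and `split_known_tfs` are hypothetical and not part of arboreto. The sketch assumes the raw table stores genes as rows, which should be checked against your own data, and the toy values are made up.

```python
import pandas as pd


def to_observations_by_genes(genes_by_obs: pd.DataFrame) -> pd.DataFrame:
    """Transpose a genes x observations table so that genes become columns."""
    matrix = genes_by_obs.T
    # Column labels serve as gene names downstream, so keep them as strings.
    matrix.columns = matrix.columns.astype(str)
    return matrix


def split_known_tfs(matrix: pd.DataFrame, tf_names: list) -> tuple:
    """Split candidate TFs into those present in the matrix columns and those missing."""
    genes = set(matrix.columns)
    present = [tf for tf in tf_names if tf in genes]
    missing = [tf for tf in tf_names if tf not in genes]
    return present, missing


# Toy genes x observations table (hypothetical values).
raw = pd.DataFrame(
    {"cell_1": [1.0, 0.0, 3.0], "cell_2": [2.0, 5.0, 0.0]},
    index=["GATA1", "SPI1", "HBB"],
)
matrix = to_observations_by_genes(raw)
present, missing = split_known_tfs(matrix, ["GATA1", "SPI1", "TAL1"])
```

Passing only `present` as `tf_names` and inspecting `missing` first is one way to catch naming mismatches (for example, symbols versus Ensembl IDs) before a long inference run.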
- Algorithm Selection
Arboreto provides two algorithms:
GRNBoost2 (Recommended):
- Fast gradient boosting-based inference
- Optimized for large datasets (10k+ observations)
- Default choice for most analyses
GENIE3:
- Random Forest-based inference
- Original multiple regression approach
- Use for comparison or validation
Quick comparison:
from arboreto.algo import grnboost2, genie3

# Fast, recommended
network_grnboost = grnboost2(expression_data=matrix)

# Classic algorithm
network_genie3 = genie3(expression_data=matrix)
For detailed algorithm comparison, parameters, and selection guidance: references/algorithms.md
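When GENIE3 is run for comparison or validation, one simple check is how much the strongest links from the two algorithms overlap. The sketch below compares the top-k links by rank rather than by raw score, on the assumption that importance values from the two algorithms are not necessarily on the same scale. `top_links` and `jaccard_overlap` are hypothetical helpers, and the toy DataFrames stand in for real `grnboost2` and `genie3` outputs.

```python
import pandas as pd


def top_links(network: pd.DataFrame, k: int) -> set:
    """Return the k highest-importance (TF, target) pairs as a set of tuples."""
    top = network.nlargest(k, "importance")
    return set(zip(top["TF"], top["target"]))


def jaccard_overlap(network_a: pd.DataFrame, network_b: pd.DataFrame, k: int) -> float:
    """Jaccard index between the top-k link sets of two networks."""
    links_a = top_links(network_a, k)
    links_b = top_links(network_b, k)
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0


# Toy outputs standing in for grnboost2(...) and genie3(...) results.
network_grnboost = pd.DataFrame({
    "TF": ["A", "A", "B"], "target": ["x", "y", "x"], "importance": [9.0, 4.0, 1.0],
})
network_genie3 = pd.DataFrame({
    "TF": ["A", "B", "A"], "target": ["x", "x", "y"], "importance": [0.6, 0.3, 0.1],
})
overlap = jaccard_overlap(network_grnboost, network_genie3, k=2)
```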
- Distributed Computing
Scale inference from local multi-core to cluster environments:
Local (default) - Uses all available cores automatically:
network = grnboost2(expression_data=matrix)
Custom local client - Control resources:
from distributed import LocalCluster, Client

local_cluster = LocalCluster(n_workers=10, memory_limit='8GB')
client = Client(local_cluster)

network = grnboost2(expression_data=matrix, client_or_address=client)

client.close()
local_cluster.close()
Cluster computing - Connect to remote Dask scheduler:
from distributed import Client

client = Client('tcp://scheduler:8786')
network = grnboost2(expression_data=matrix, client_or_address=client)
For cluster setup, performance optimization, and large-scale workflows: references/distributed_computing.md
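Choosing `n_workers` and `memory_limit` for a custom local client is mostly arithmetic. The sketch below derives them from a total memory budget. `local_cluster_kwargs` is a hypothetical helper, not an arboreto or Dask function; it assumes one single-threaded worker per usable core and relies on `memory_limit` being applied per worker by `LocalCluster`.

```python
import os


def local_cluster_kwargs(total_memory_gb: float, reserved_cores: int = 1, cpu_count=None) -> dict:
    """Build keyword arguments for distributed.LocalCluster from a memory budget.

    One single-threaded worker per usable core; memory is split evenly.
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    n_workers = max(1, cores - reserved_cores)
    per_worker_gb = total_memory_gb / n_workers
    return {
        "n_workers": n_workers,
        "threads_per_worker": 1,
        "memory_limit": f"{per_worker_gb:.1f}GB",
    }


# cpu_count is passed explicitly here only to make the example deterministic.
kwargs = local_cluster_kwargs(total_memory_gb=32, reserved_cores=1, cpu_count=9)
# Assumed usage, inside the __main__ guard:
#     local_cluster = LocalCluster(**kwargs)
```

Reserving a core for the scheduler and the driving process is a judgment call; on a dedicated machine `reserved_cores=0` may be reasonable.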
Installation
Recommended (Conda):
conda install -c bioconda arboreto
Alternative (pip):
pip install arboreto
For isolated environment:
conda create --name arboreto-env
conda activate arboreto-env
conda install -c bioconda arboreto
Dependencies: scipy, scikit-learn, numpy, pandas, dask, distributed
Common Use Cases
Single-Cell RNA-seq Analysis
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load single-cell expression matrix (cells x genes)
    sc_data = pd.read_csv('scrna_counts.tsv', sep='\t')

    # Infer cell-type-specific regulatory network
    network = grnboost2(expression_data=sc_data, seed=42)

    # Filter high-confidence links
    high_confidence = network[network['importance'] > 0.5]
    high_confidence.to_csv('grn_high_confidence.tsv', sep='\t', index=False)
Bulk RNA-seq with TF Filtering
import pandas as pd
from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Load data
    expression_data = pd.read_csv('rnaseq_tpm.tsv', sep='\t')
    tf_names = load_tf_names('human_tfs.txt')

    # Infer with TF restriction
    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        seed=123
    )

    network.to_csv('tf_target_network.tsv', sep='\t', index=False)
Comparative Analysis (Multiple Conditions)
import pandas as pd
from arboreto.algo import grnboost2

if __name__ == '__main__':
    # Infer networks for different conditions
    conditions = ['control', 'treatment_24h', 'treatment_48h']

    for condition in conditions:
        data = pd.read_csv(f'{condition}_expression.tsv', sep='\t')
        network = grnboost2(expression_data=data, seed=42)
        network.to_csv(f'{condition}_network.tsv', sep='\t', index=False)
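Once per-condition networks exist, a natural follow-up is to see which links are shared and which are condition-specific. The sketch below is one way to do that with an outer merge. `link_changes` is a hypothetical helper, and the toy DataFrames stand in for networks that have already been filtered to their strongest links.

```python
import pandas as pd


def link_changes(baseline: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Outer-join two networks on (TF, target) and label where each link occurs."""
    merged = baseline.merge(
        other, on=["TF", "target"], how="outer",
        suffixes=("_baseline", "_other"), indicator=True,
    )
    # Map pandas' merge indicator onto more descriptive labels.
    merged["status"] = merged["_merge"].astype(str).map(
        {"left_only": "lost", "right_only": "gained", "both": "shared"}
    )
    return merged.drop(columns="_merge")


# Toy filtered networks (hypothetical values).
control = pd.DataFrame({"TF": ["A", "B"], "target": ["x", "y"], "importance": [5.0, 2.0]})
treated = pd.DataFrame({"TF": ["A", "C"], "target": ["x", "z"], "importance": [4.0, 3.0]})
changes = link_changes(control, treated)
```

For shared links, the `importance_baseline` and `importance_other` columns sit side by side, which makes it easy to look at how the score shifts between conditions.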
Output Interpretation
Arboreto returns a DataFrame with regulatory links:
Column      Description
TF          Transcription factor (regulator)
target      Target gene
importance  Regulatory importance score (higher = stronger)
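The Quick Start example writes the network with `header=False`, so the column names have to be supplied when the file is read back. A minimal sketch, assuming the three-column tab-separated layout described above; `read_network` is a hypothetical helper, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

NETWORK_COLUMNS = ["TF", "target", "importance"]


def read_network(path_or_buffer) -> pd.DataFrame:
    """Load a headerless, tab-separated network file and sort by importance."""
    network = pd.read_csv(path_or_buffer, sep="\t", header=None, names=NETWORK_COLUMNS)
    return network.sort_values("importance", ascending=False).reset_index(drop=True)


# In-memory stand-in for a file such as 'network.tsv' (toy values).
example_file = io.StringIO("GATA1\tHBB\t12.5\nSPI1\tCSF1R\t30.1\n")
network = read_network(example_file)
```

Files written with a header row (as in the later examples, which omit `header=False`) should be read with a plain `pd.read_csv(path, sep='\t')` instead.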
Filtering strategy:
- Top N links per target gene
- Importance threshold (e.g., > 0.5)
- Statistical significance testing (permutation tests)
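The first two strategies above can be sketched in a few lines of pandas. `top_n_per_target` and `above_quantile` are hypothetical helpers; the quantile cutoff is shown as an alternative to a fixed threshold, on the assumption that importance scores are relative and a fixed value such as 0.5 may not transfer between datasets. Permutation testing is not covered here.

```python
import pandas as pd


def top_n_per_target(network: pd.DataFrame, n: int) -> pd.DataFrame:
    """Keep the n highest-importance regulators for every target gene."""
    ranked = network.sort_values("importance", ascending=False)
    return ranked.groupby("target", sort=False).head(n).reset_index(drop=True)


def above_quantile(network: pd.DataFrame, q: float) -> pd.DataFrame:
    """Keep links whose importance is at or above the q-th quantile."""
    cutoff = network["importance"].quantile(q)
    return network[network["importance"] >= cutoff].reset_index(drop=True)


# Toy network (hypothetical values).
network = pd.DataFrame({
    "TF": ["A", "B", "C", "A"],
    "target": ["x", "x", "x", "y"],
    "importance": [3.0, 8.0, 1.0, 2.0],
})
top2 = top_n_per_target(network, n=2)
strongest = above_quantile(network, q=0.75)
```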
Integration with pySCENIC
Arboreto is a core component of the SCENIC pipeline for single-cell regulatory network analysis:
# Step 1: Use arboreto for GRN inference
from arboreto.algo import grnboost2
network = grnboost2(expression_data=sc_data, tf_names=tf_list)

# Step 2: Use pySCENIC for regulon identification and activity scoring
# (See pySCENIC documentation for downstream analysis)
Reproducibility
Always set a seed for reproducible results:
network = grnboost2(expression_data=matrix, seed=777)
Run multiple seeds for robustness analysis:
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    client = Client(LocalCluster())

    seeds = [42, 123, 777]
    networks = []

    for seed in seeds:
        net = grnboost2(expression_data=matrix, client_or_address=client, seed=seed)
        networks.append(net)

    # Combine networks and filter consensus links
    # (analyze_consensus is a user-defined helper, not part of arboreto)
    consensus = analyze_consensus(networks)

    client.close()
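The `analyze_consensus` call above is not provided by arboreto and has to be supplied by the user. One possible sketch is shown below: it keeps links that appear in at least `min_support` runs and reports their mean importance. The `min_support` parameter and the output column names are assumptions made for illustration, and the helper assumes each run lists a given (TF, target) pair at most once.

```python
import pandas as pd


def analyze_consensus(networks: list, min_support: int = 2) -> pd.DataFrame:
    """Keep links present in at least min_support networks, with their mean importance."""
    stacked = pd.concat(networks, ignore_index=True)
    summary = (
        stacked.groupby(["TF", "target"], as_index=False)
        .agg(support=("importance", "size"), mean_importance=("importance", "mean"))
    )
    consensus = summary[summary["support"] >= min_support]
    return consensus.sort_values("mean_importance", ascending=False).reset_index(drop=True)


# Toy results standing in for three seeded grnboost2 runs (hypothetical values).
run_a = pd.DataFrame({"TF": ["A", "B"], "target": ["x", "y"], "importance": [4.0, 1.0]})
run_b = pd.DataFrame({"TF": ["A", "C"], "target": ["x", "z"], "importance": [6.0, 2.0]})
run_c = pd.DataFrame({"TF": ["A", "B"], "target": ["x", "y"], "importance": [5.0, 3.0]})
consensus = analyze_consensus([run_a, run_b, run_c], min_support=2)
```

Because a full network can contain very many weak links, it may be worth filtering each run to its strongest links before counting support; otherwise most links could trivially appear in every run.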
Troubleshooting
Memory errors: Reduce dataset size by filtering low-variance genes or use distributed computing
Slow performance: Use GRNBoost2 instead of GENIE3, enable distributed client, filter TF list
Dask errors: Ensure the if __name__ == '__main__': guard is present in scripts
Empty results: Check data format (genes as columns), verify TF names match gene names
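For the memory-error case, filtering low-variance genes might look like the sketch below. `keep_most_variable_genes` is a hypothetical helper, not an arboreto function. The `always_keep` argument is an assumption added so that TFs of interest are not dropped just because their expression varies little.

```python
import pandas as pd


def keep_most_variable_genes(matrix: pd.DataFrame, n_genes: int, always_keep=()) -> pd.DataFrame:
    """Keep the n_genes highest-variance columns, plus any genes in always_keep."""
    variances = matrix.var(axis=0)
    selected = set(variances.nlargest(n_genes).index) | (set(always_keep) & set(matrix.columns))
    # Preserve the original column order.
    return matrix.loc[:, [gene for gene in matrix.columns if gene in selected]]


# Toy observations x genes matrix (hypothetical values).
matrix = pd.DataFrame({
    "flat": [1.0, 1.0, 1.0, 1.0],
    "noisy": [0.0, 9.0, 1.0, 7.0],
    "tf_low": [2.0, 2.1, 2.0, 2.1],
    "mid": [1.0, 3.0, 2.0, 4.0],
})
reduced = keep_most_variable_genes(matrix, n_genes=2, always_keep=["tf_low"])
```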