TorchDrug
Overview
TorchDrug is a PyTorch-based machine learning toolbox for drug discovery and molecular science. It applies graph neural networks, pre-trained models, and task definitions to molecules, proteins, and biological knowledge graphs. Supported tasks include molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning, backed by 40+ curated datasets and 20+ model architectures.
When to Use This Skill
This skill should be used when working with:
Data Types:
- SMILES strings or molecular structures
- Protein sequences or 3D structures (PDB files)
- Chemical reactions and retrosynthesis
- Biomedical knowledge graphs
- Drug discovery datasets
Tasks:
- Predicting molecular properties (solubility, toxicity, activity)
- Protein function or structure prediction
- Drug-target binding prediction
- Generating new molecular structures
- Planning chemical synthesis routes
- Link prediction in biomedical knowledge bases
- Training graph neural networks on scientific data
Libraries and Integration:
- TorchDrug is the primary library
- Often used with RDKit for cheminformatics
- Compatible with PyTorch and PyTorch Lightning
- Integrates with AlphaFold and ESM for proteins
Getting Started
Installation
uv pip install torchdrug
# Or with optional dependencies (quoted so the shell does not expand the brackets)
uv pip install "torchdrug[full]"
Quick Example
import torch
from torchdrug import data, datasets, models, tasks

# Load molecular dataset and split it 80/10/10 at random
dataset = datasets.BBBP("~/molecule-datasets/")
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)

# Define GNN model
model = models.GIN(
    input_dim=dataset.node_feature_dim,
    hidden_dims=[256, 256, 256],
    edge_input_dim=dataset.edge_feature_dim,
    batch_norm=True,
    readout="mean",
)

# Create property prediction task
task = tasks.PropertyPrediction(
    model,
    task=dataset.tasks,
    criterion="bce",
    metric=["auroc", "auprc"],
)
# Build the prediction head and label statistics (core.Engine does this automatically)
task.preprocess(train_set, valid_set, test_set)

# Train with PyTorch
optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
# TorchDrug's DataLoader knows how to batch graphs
train_loader = data.DataLoader(train_set, batch_size=32, shuffle=True)

for epoch in range(100):
    for batch in train_loader:
        loss, metric = task(batch)  # forward returns (loss, metric dict)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Core Capabilities
Molecular Property Prediction
Predict chemical, physical, and biological properties of molecules from structure.
Use Cases:
- Drug-likeness and ADMET properties
- Toxicity screening
- Quantum chemistry properties
- Binding affinity prediction
Key Components:
- 20+ molecular datasets (BBBP, HIV, Tox21, QM9, etc.)
- GNN models (GIN, GAT, SchNet)
- PropertyPrediction and MultipleBinaryClassification tasks
Reference: See references/molecular_property_prediction.md for:
- Complete dataset catalog
- Model selection guide
- Training workflows and best practices
- Feature engineering details
Protein Modeling
Work with protein sequences, structures, and properties.
Use Cases:
- Enzyme function prediction
- Protein stability and solubility
- Subcellular localization
- Protein-protein interactions
- Structure prediction
Key Components:
- 15+ protein datasets (EnzymeCommission, GeneOntology, PDBBind, etc.)
- Sequence models (ESM, ProteinBERT, ProteinLSTM)
- Structure models (GearNet, SchNet)
- Multiple task types for different prediction levels
Reference: See references/protein_modeling.md for:
- Protein-specific datasets
- Sequence vs structure models
- Pre-training strategies
- Integration with AlphaFold and ESM
Knowledge Graph Reasoning
Predict missing links and relationships in biological knowledge graphs.
Use Cases:
- Drug repurposing
- Disease mechanism discovery
- Gene-disease associations
- Multi-hop biomedical reasoning
Key Components:
- General KGs (FB15k, WN18) and biomedical (Hetionet)
- Embedding models (TransE, RotatE, ComplEx)
- KnowledgeGraphCompletion task
Reference: See references/knowledge_graphs.md for:
- Knowledge graph datasets (including Hetionet with about 47k biomedical entities)
- Embedding model comparison
- Evaluation metrics and protocols
- Biomedical applications
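To make the embedding models listed above more concrete, the sketch below scores a triple the way TransE and RotatE are usually described: TransE treats a relation as a translation (h + r ≈ t), and RotatE treats it as a rotation in the complex plane. This is plain Python with hand-picked toy vectors, not TorchDrug's implementation; the function names, entity names, and embedding values are illustrative assumptions.

```python
import cmath
import math


def transe_distance(head, relation, tail):
    """L1 distance ||h + r - t||; smaller means a more plausible triple."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def rotate_distance(head, phases, tail):
    """Sum of |h_i * e^{i * phase_i} - t_i| over complex coordinates."""
    return sum(
        abs(h * cmath.exp(1j * phase) - t)
        for h, phase, t in zip(head, phases, tail)
    )


# Toy 2-dimensional embeddings (hypothetical values chosen by hand).
aspirin = [1.0, 0.0]
treats = [0.5, 1.0]
headache = [1.5, 1.0]   # exactly aspirin + treats
fracture = [3.0, -2.0]  # far from aspirin + treats

d_true = transe_distance(aspirin, treats, headache)
d_false = transe_distance(aspirin, treats, fracture)

# RotatE: rotating 1+0j by 90 degrees lands (numerically) on 0+1j.
d_rot = rotate_distance([1 + 0j], [math.pi / 2], [0 + 1j])
```

A trained model learns the vectors so that observed triples get small distances and corrupted (negative) triples get large ones.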
Molecular Generation
Generate novel molecular structures with desired properties.
Use Cases:
- De novo drug design
- Lead optimization
- Chemical space exploration
- Property-guided generation
Key Components:
- Autoregressive generation
- GCPN (policy-based generation)
- GraphAutoregressiveFlow
- Property optimization workflows
Reference: See references/molecular_generation.md for:
- Generation strategies (unconditional, conditional, scaffold-based)
- Multi-objective optimization
- Validation and filtering
- Integration with property prediction
Retrosynthesis
Predict synthetic routes from target molecules to starting materials.
Use Cases:
- Synthesis planning
- Route optimization
- Synthetic accessibility assessment
- Multi-step planning
Key Components:
- USPTO-50k reaction dataset
- CenterIdentification (reaction center prediction)
- SynthonCompletion (reactant prediction)
- End-to-end Retrosynthesis pipeline
Reference: See references/retrosynthesis.md for:
- Task decomposition (center ID → synthon completion)
- Multi-step synthesis planning
- Commercial availability checking
- Integration with other retrosynthesis tools
Graph Neural Network Models
Comprehensive catalog of GNN architectures for different data types and tasks.
Available Models:
- General GNNs: GCN, GAT, GIN, RGCN, MPNN
- 3D-aware: SchNet, GearNet
- Protein-specific: ESM, ProteinBERT, GearNet
- Knowledge graph: TransE, RotatE, ComplEx, SimplE
- Generative: GraphAutoregressiveFlow
Reference: See references/models_architectures.md for:
- Detailed model descriptions
- Model selection guide by task and dataset
- Architecture comparisons
- Implementation tips
Datasets
40+ curated datasets spanning chemistry, biology, and knowledge graphs.
Categories:
- Molecular properties (drug discovery, quantum chemistry)
- Protein properties (function, structure, interactions)
- Knowledge graphs (general and biomedical)
- Retrosynthesis reactions
Reference: See references/datasets.md for:
- Complete dataset catalog with sizes and tasks
- Dataset selection guide
- Loading and preprocessing
- Splitting strategies (random, scaffold)
Common Workflows
Workflow 1: Molecular Property Prediction
Scenario: Predict blood-brain barrier penetration for drug candidates.
Steps:
- Load dataset: datasets.BBBP()
- Choose model: GIN for molecular graphs
- Define task: PropertyPrediction with binary classification
- Train with scaffold split for realistic evaluation
- Evaluate using AUROC and AUPRC
Navigation: references/molecular_property_prediction.md → Dataset selection → Model selection → Training
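The scaffold split mentioned in the steps above keeps molecules that share a core structure in the same subset, so the test set contains chemotypes the model has not seen. The following is a minimal pure-Python sketch of one common variant (largest scaffold groups assigned first); it is not TorchDrug's own splitting helper. The scaffold keys are placeholder strings, and in practice they would be computed per molecule, for example as Bemis-Murcko scaffolds with RDKit.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that no scaffold is shared across subsets.

    scaffolds[i] is the scaffold key of sample i. Groups are assigned
    largest-first to train, then valid, then test.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold keys for ten molecules.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 3 + ["indole", "furan"]
train_idx, valid_idx, test_idx = scaffold_split(scaffolds)
```

Because whole groups move together, the realized split sizes only approximate the requested fractions when a few scaffolds dominate the dataset.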
Workflow 2: Protein Function Prediction
Scenario: Predict enzyme function from sequence.
Steps:
- Load dataset: datasets.EnzymeCommission()
- Choose model: ESM (pre-trained) or GearNet (with structure)
- Define task: PropertyPrediction with multi-class classification
- Fine-tune pre-trained model or train from scratch
- Evaluate using accuracy and per-class metrics
Navigation: references/protein_modeling.md → Model selection (sequence vs structure) → Pre-training strategies
Workflow 3: Drug Repurposing via Knowledge Graphs
Scenario: Find new disease treatments in Hetionet.
Steps:
- Load dataset: datasets.Hetionet()
- Choose model: RotatE or ComplEx
- Define task: KnowledgeGraphCompletion
- Train with negative sampling
- Query for "Compound-treats-Disease" predictions
- Filter by plausibility and mechanism
Navigation: references/knowledge_graphs.md → Hetionet dataset → Model selection → Biomedical applications
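When querying a trained model for new "Compound-treats-Disease" links, candidates are typically ranked by score, and already-known edges are filtered out so they do not crowd the ranking. The sketch below shows this filtered ranking together with mean reciprocal rank (MRR) and Hits@k in plain Python. The scores and disease names are invented for illustration, and the helper names are assumptions rather than TorchDrug functions.

```python
def filtered_rank(scores, true_tail, known_tails):
    """Rank of true_tail among candidates, ignoring other known-true tails.

    scores maps candidate tail -> plausibility (higher is better).
    """
    true_score = scores[true_tail]
    better = sum(
        1
        for tail, score in scores.items()
        if score > true_score and tail not in known_tails
    )
    return better + 1


def mrr_and_hits(ranks, k=3):
    """Mean reciprocal rank and Hits@k over a list of ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits


# Hypothetical model scores for one (compound, treats, ?) query.
scores = {"migraine": 0.9, "asthma": 0.7, "epilepsy": 0.6, "gout": 0.2}
# "migraine" is already a known treatment, so it must not penalise "epilepsy".
rank = filtered_rank(scores, "epilepsy", known_tails={"migraine"})
mrr, hits_at_1 = mrr_and_hits([rank, 1, 4], k=1)
```

Without the filter, "epilepsy" would be ranked third instead of second, which understates the model's quality on queries with several correct answers.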
Workflow 4: De Novo Molecule Generation
Scenario: Generate drug-like molecules optimized for target binding.
Steps:
- Train property predictor on activity data
- Choose generation approach: GCPN for RL-based optimization
- Define reward function combining affinity, drug-likeness, synthesizability
- Generate candidates with property constraints
- Validate chemistry and filter by drug-likeness
- Rank by multi-objective scoring
Navigation: references/molecular_generation.md → Conditional generation → Multi-objective optimization
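The reward and ranking steps above can be sketched as a weighted sum over normalized objectives followed by a hard filter. This is one simple way to combine objectives, not the scheme TorchDrug or GCPN prescribes. The Candidate class, the weights, the threshold, and all property values below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    smiles: str
    affinity: float          # predicted binding score, scaled to [0, 1]
    drug_likeness: float     # e.g. a QED-style score in [0, 1]
    synthesizability: float  # 1.0 = easy to make, 0.0 = hard


def reward(c, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three objectives (weights are an arbitrary choice)."""
    w_aff, w_qed, w_syn = weights
    return w_aff * c.affinity + w_qed * c.drug_likeness + w_syn * c.synthesizability


def rank_candidates(candidates, min_drug_likeness=0.4):
    """Drop candidates below the drug-likeness floor, then sort by reward."""
    kept = [c for c in candidates if c.drug_likeness >= min_drug_likeness]
    return sorted(kept, key=reward, reverse=True)


# Made-up property values, for illustration only.
pool = [
    Candidate("CCO", affinity=0.2, drug_likeness=0.5, synthesizability=1.0),
    Candidate("c1ccccc1O", affinity=0.8, drug_likeness=0.6, synthesizability=0.9),
    Candidate("CC(C)(C)C", affinity=0.9, drug_likeness=0.1, synthesizability=0.8),
]
ranked = rank_candidates(pool)
```

The third candidate has the highest predicted affinity but is removed by the drug-likeness floor, which is the usual reason to pair a soft reward with hard constraints.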
Workflow 5: Retrosynthesis Planning
Scenario: Plan synthesis route for target molecule.
Steps:
- Load dataset: datasets.USPTO50k()
- Train center identification model (RGCN)
- Train synthon completion model (GIN)
- Combine into end-to-end retrosynthesis pipeline
- Apply recursively for multi-step planning
- Check commercial availability of building blocks
Navigation: references/retrosynthesis.md → Task types → Multi-step planning
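The last two steps above (recursive application and the availability check) amount to a depth-limited tree search. The sketch below shows that control flow in plain Python, with a lookup table standing in for a trained single-step model. The function plan_route, the molecule names, and the stock set are all hypothetical. A real planner would also explore several candidate disconnections per molecule rather than just one.

```python
def plan_route(target, single_step, purchasable, max_depth=5):
    """Recursively expand target until every leaf is purchasable.

    single_step(molecule) returns a list of reactants for one
    retrosynthetic disconnection, or None if none is known.
    Returns a nested dict describing the route, or None on failure.
    """
    if target in purchasable:
        return {"molecule": target, "buy": True}
    if max_depth == 0:
        return None
    reactants = single_step(target)
    if reactants is None:
        return None
    children = []
    for reactant in reactants:
        subtree = plan_route(reactant, single_step, purchasable, max_depth - 1)
        if subtree is None:
            return None
        children.append(subtree)
    return {"molecule": target, "buy": False, "from": children}


# Hypothetical lookup table standing in for a trained single-step model.
KNOWN_STEPS = {
    "amide_product": ["acid_chloride", "amine"],
    "acid_chloride": ["carboxylic_acid"],
}
stock = {"amine", "carboxylic_acid"}

route = plan_route("amide_product", KNOWN_STEPS.get, stock)
```

The depth limit keeps the search from running away on targets that never reduce to purchasable building blocks.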
Integration Patterns
With RDKit
Convert between TorchDrug molecules and RDKit:
from torchdrug import data
from rdkit import Chem

# SMILES → TorchDrug molecule
smiles = "CCO"
mol = data.Molecule.from_smiles(smiles)

# TorchDrug → RDKit
rdkit_mol = mol.to_molecule()

# RDKit → TorchDrug
rdkit_mol = Chem.MolFromSmiles(smiles)
mol = data.Molecule.from_molecule(rdkit_mol)
With AlphaFold/ESM
Use predicted structures:
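from torchdrug import data, layers
from torchdrug.layers import geometry

# Load AlphaFold predicted structure
protein = data.Protein.from_pdb("AF-P12345-F1-model_v4.pdb")

# Build a residue-level graph with sequential and spatial (radius) edges
graph_construction = layers.GraphConstruction(
    node_layers=[geometry.AlphaCarbonNode()],
    edge_layers=[
        geometry.SequentialEdge(max_distance=2),
        geometry.SpatialEdge(radius=10.0),
    ],
)
# GraphConstruction operates on a packed batch of proteins
graph = graph_construction(data.Protein.pack([protein]))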
With PyTorch Lightning
Wrap tasks for Lightning training:
import pytorch_lightning as pl
import torch


class LightningTask(pl.LightningModule):
    def __init__(self, torchdrug_task):
        super().__init__()
        self.task = torchdrug_task

    def training_step(self, batch, batch_idx):
        # TorchDrug tasks return (loss, metric dict) from forward
        loss, metric = self.task(batch)
        return loss

    def validation_step(self, batch, batch_idx):
        pred = self.task.predict(batch)
        target = self.task.target(batch)
        return {"pred": pred, "target": target}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
Technical Details
For deep dives into TorchDrug's architecture:
Core Concepts: See references/core_concepts.md for:
- Architecture philosophy (modular, configurable)
- Data structures (Graph, Molecule, Protein, PackedGraph)
- Model interface and forward function signature
- Task interface (predict, target, forward, evaluate)
- Training workflows and best practices
- Loss functions and metrics
- Common pitfalls and debugging
Quick Reference Cheat Sheet
Choose Dataset:
- Molecular property → references/datasets.md → Molecular section
- Protein task → references/datasets.md → Protein section
- Knowledge graph → references/datasets.md → Knowledge graph section
Choose Model:
- Molecules → references/models_architectures.md → GNN section → GIN/GAT/SchNet
- Proteins (sequence) → references/models_architectures.md → Protein section → ESM
- Proteins (structure) → references/models_architectures.md → Protein section → GearNet
- Knowledge graph → references/models_architectures.md → KG section → RotatE/ComplEx
Common Tasks:
- Property prediction → references/molecular_property_prediction.md or references/protein_modeling.md
- Generation → references/molecular_generation.md
- Retrosynthesis → references/retrosynthesis.md
- KG reasoning → references/knowledge_graphs.md
Understand Architecture:
- Data structures → references/core_concepts.md → Data Structures
- Model design → references/core_concepts.md → Model Interface
- Task design → references/core_concepts.md → Task Interface
Troubleshooting Common Issues
Issue: Dimension mismatch errors
→ Check that model.input_dim matches dataset.node_feature_dim
→ See references/core_concepts.md → Essential Attributes
Issue: Poor performance on molecular tasks
→ Use scaffold splitting, not random
→ Try GIN instead of GCN
→ See references/molecular_property_prediction.md → Best Practices
Issue: Protein model not learning
→ Use pre-trained ESM for sequence tasks
→ Check edge construction for structure models
→ See references/protein_modeling.md → Training Workflows
Issue: Memory errors with large graphs
→ Reduce batch size
→ Use gradient accumulation
→ See references/core_concepts.md → Memory Efficiency
Issue: Generated molecules are invalid
→ Add validity constraints
→ Post-process with RDKit validation
→ See references/molecular_generation.md → Validation and Filtering
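For the memory issue above, gradient accumulation works because the mean gradient over a large batch equals the average of the mean gradients over equally sized micro-batches. The toy example below checks this numerically for a one-parameter linear model in plain Python; the model, data, and function names are illustrative assumptions. In a PyTorch loop, the same idea corresponds to dividing each micro-batch loss by the number of accumulation steps, calling backward on every micro-batch, and calling the optimizer step only once per accumulated batch.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Average gradient built up from equally sized micro-batches."""
    chunks = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    total = 0.0
    for chunk in chunks:
        # Scale each micro-batch so the sum equals the full-batch mean.
        total += grad_mse(w, chunk) / len(chunks)
    return total


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full = grad_mse(0.5, samples)
accumulated = accumulated_grad(0.5, samples, micro_batch_size=2)
```

The equivalence is exact only when the micro-batches have equal size and the model has no batch-dependent layers; batch normalization statistics, for example, are still computed per micro-batch.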
Resources
Official Documentation: https://torchdrug.ai/docs/
GitHub: https://github.com/DeepGraphLearning/torchdrug
Paper: TorchDrug: A Powerful and Flexible Machine Learning Platform for Drug Discovery
Summary
Navigate to the appropriate reference file based on your task:
- Molecular property prediction → molecular_property_prediction.md
- Protein modeling → protein_modeling.md
- Knowledge graphs → knowledge_graphs.md
- Molecular generation → molecular_generation.md
- Retrosynthesis → retrosynthesis.md
- Model selection → models_architectures.md
- Dataset selection → datasets.md
- Technical details → core_concepts.md
Each reference provides comprehensive coverage of its domain with examples, best practices, and common use cases.