DiffDock: Molecular Docking with Diffusion Models
Overview
DiffDock is a diffusion-based deep learning tool for molecular docking that predicts 3D binding poses of small molecule ligands to protein targets. It represents the state-of-the-art in computational docking, crucial for structure-based drug discovery and chemical biology.
Core Capabilities:
-
Predict ligand binding poses with high accuracy using deep learning
-
Support protein structures (PDB files) or sequences (via ESMFold)
-
Process single complexes or batch virtual screening campaigns
-
Generate confidence scores to assess prediction reliability
-
Handle diverse ligand inputs (SMILES, SDF, MOL2)
Key Distinction: DiffDock predicts binding poses (3D structure) and confidence (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine with scoring functions (GNINA, MM/GBSA) for affinity assessment.
When to Use This Skill
This skill should be used when:
-
"Dock this ligand to a protein" or "predict binding pose"
-
"Run molecular docking" or "perform protein-ligand docking"
-
"Virtual screening" or "screen compound library"
-
"Where does this molecule bind?" or "predict binding site"
-
Structure-based drug design or lead optimization tasks
-
Tasks involving PDB files + SMILES strings or ligand structures
-
Batch docking of multiple protein-ligand pairs
Installation and Environment Setup
Check Environment Status
Before proceeding with DiffDock tasks, verify the environment setup:
Use the provided setup checker
python scripts/setup_check.py
This script validates Python version, PyTorch with CUDA, PyTorch Geometric, RDKit, ESM, and other dependencies.
Installation Options
Option 1: Conda (Recommended)
git clone https://github.com/gcorso/DiffDock.git cd DiffDock conda env create --file environment.yml conda activate diffdock
Option 2: Docker
docker pull rbgcsail/diffdock docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock micromamba activate diffdock
Important Notes:
-
GPU strongly recommended (10-100x speedup vs CPU)
-
First run pre-computes SO(2)/SO(3) lookup tables (~2-5 minutes)
-
Model checkpoints (~500MB) download automatically if not present
Core Workflows
Workflow 1: Single Protein-Ligand Docking
Use Case: Dock one ligand to one protein target
Input Requirements:
-
Protein: PDB file OR amino acid sequence
-
Ligand: SMILES string OR structure file (SDF/MOL2)
Command:
python -m inference
--config default_inference_args.yaml
--protein_path protein.pdb
--ligand "CC(=O)Oc1ccccc1C(=O)O"
--out_dir results/single_docking/
Alternative (protein sequence):
python -m inference
--config default_inference_args.yaml
--protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKF..."
--ligand ligand.sdf
--out_dir results/sequence_docking/
Output Structure:
results/single_docking/ ├── rank_1.sdf # Top-ranked pose ├── rank_2.sdf # Second-ranked pose ├── ... ├── rank_10.sdf # 10th pose (default: 10 samples) └── confidence_scores.txt
Workflow 2: Batch Processing Multiple Complexes
Use Case: Dock multiple ligands to proteins, virtual screening campaigns
Step 1: Prepare Batch CSV
Use the provided script to create or validate batch input:
Create template
python scripts/prepare_batch_csv.py --create --output batch_input.csv
Validate existing CSV
python scripts/prepare_batch_csv.py my_input.csv --validate
CSV Format:
complex_name,protein_path,ligand_description,protein_sequence complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O, complex2,,COc1ccc(C#N)cc1,MSKGEELFT... complex3,protein3.pdb,ligand3.sdf,
Required Columns:
-
complex_name : Unique identifier
-
protein_path : PDB file path (leave empty if using sequence)
-
ligand_description : SMILES string or ligand file path
-
protein_sequence : Amino acid sequence (leave empty if using PDB)
Step 2: Run Batch Docking
python -m inference
--config default_inference_args.yaml
--protein_ligand_csv batch_input.csv
--out_dir results/batch/
--batch_size 10
For Large Virtual Screening (>100 compounds):
Pre-compute protein embeddings for faster processing:
Pre-compute embeddings
python datasets/esm_embedding_preparation.py
--protein_ligand_csv screening_input.csv
--out_file protein_embeddings.pt
Run with pre-computed embeddings
python -m inference
--config default_inference_args.yaml
--protein_ligand_csv screening_input.csv
--esm_embeddings_path protein_embeddings.pt
--out_dir results/screening/
Workflow 3: Analyzing Results
After docking completes, analyze confidence scores and rank predictions:
Analyze all results
python scripts/analyze_results.py results/batch/
Show top 5 per complex
python scripts/analyze_results.py results/batch/ --top 5
Filter by confidence threshold
python scripts/analyze_results.py results/batch/ --threshold 0.0
Export to CSV
python scripts/analyze_results.py results/batch/ --export summary.csv
Show top 20 predictions across all complexes
python scripts/analyze_results.py results/batch/ --best 20
The analysis script:
-
Parses confidence scores from all predictions
-
Classifies as High (>0), Moderate (-1.5 to 0), or Low (<-1.5)
-
Ranks predictions within and across complexes
-
Generates statistical summaries
-
Exports results to CSV for downstream analysis
Confidence Score Interpretation
Understanding Scores:
Score Range Confidence Level Interpretation
0 High Strong prediction, likely accurate
-1.5 to 0 Moderate Reasonable prediction, validate carefully
< -1.5 Low Uncertain prediction, requires validation
Critical Notes:
-
Confidence ≠ Affinity: High confidence means model certainty about structure, NOT strong binding
-
Context Matters: Adjust expectations for:
-
Large ligands (>500 Da): Lower confidence expected
-
Multiple protein chains: May decrease confidence
-
Novel protein families: May underperform
-
Multiple Samples: Review top 3-5 predictions, look for consensus
For detailed guidance: Read references/confidence_and_limitations.md using the Read tool
Parameter Customization
Using Custom Configuration
Create custom configuration for specific use cases:
Copy template
cp assets/custom_inference_config.yaml my_config.yaml
Edit parameters (see template for presets)
Then run with custom config
python -m inference
--config my_config.yaml
--protein_ligand_csv input.csv
--out_dir results/
Key Parameters to Adjust
Sampling Density:
-
samples_per_complex: 10 → Increase to 20-40 for difficult cases
-
More samples = better coverage but longer runtime
Inference Steps:
-
inference_steps: 20 → Increase to 25-30 for higher accuracy
-
More steps = potentially better quality but slower
Temperature Parameters (control diversity):
-
temp_sampling_tor: 7.04 → Increase for flexible ligands (8-10)
-
temp_sampling_tor: 7.04 → Decrease for rigid ligands (5-6)
-
Higher temperature = more diverse poses
Presets Available in Template:
-
High Accuracy: More samples + steps, lower temperature
-
Fast Screening: Fewer samples, faster
-
Flexible Ligands: Increased torsion temperature
-
Rigid Ligands: Decreased torsion temperature
For complete parameter reference: Read references/parameters_reference.md using the Read tool
Advanced Techniques
Ensemble Docking (Protein Flexibility)
For proteins with known flexibility, dock to multiple conformations:
Create ensemble CSV
import pandas as pd
conformations = ["conf1.pdb", "conf2.pdb", "conf3.pdb"] ligand = "CC(=O)Oc1ccccc1C(=O)O"
data = { "complex_name": [f"ensemble_{i}" for i in range(len(conformations))], "protein_path": conformations, "ligand_description": [ligand] * len(conformations), "protein_sequence": [""] * len(conformations) }
pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
Run docking with increased sampling:
python -m inference
--config default_inference_args.yaml
--protein_ligand_csv ensemble_input.csv
--samples_per_complex 20
--out_dir results/ensemble/
Integration with Scoring Functions
DiffDock generates poses; combine with other tools for affinity:
GNINA (Fast neural network scoring):
for pose in results/*.sdf; do gnina -r protein.pdb -l "$pose" --score_only done
MM/GBSA (More accurate, slower): Use AmberTools MMPBSA.py or gmx_MMPBSA after energy minimization
Free Energy Calculations (Most accurate): Use OpenMM + OpenFE or GROMACS for FEP/TI calculations
Recommended Workflow:
-
DiffDock → Generate poses with confidence scores
-
Visual inspection → Check structural plausibility
-
GNINA or MM/GBSA → Rescore and rank by affinity
-
Experimental validation → Biochemical assays
Limitations and Scope
DiffDock IS Designed For:
-
Small molecule ligands (typically 100-1000 Da)
-
Drug-like organic compounds
-
Small peptides (<20 residues)
-
Single or multi-chain proteins
DiffDock IS NOT Designed For:
-
Large biomolecules (protein-protein docking) → Use DiffDock-PP or AlphaFold-Multimer
-
Large peptides (>20 residues) → Use alternative methods
-
Covalent docking → Use specialized covalent docking tools
-
Binding affinity prediction → Combine with scoring functions
-
Membrane proteins → Not specifically trained, use with caution
For complete limitations: Read references/confidence_and_limitations.md using the Read tool
Troubleshooting
Common Issues
Issue: Low confidence scores across all predictions
-
Cause: Large/unusual ligands, unclear binding site, protein flexibility
-
Solution: Increase samples_per_complex (20-40), try ensemble docking, validate protein structure
Issue: Out of memory errors
-
Cause: GPU memory insufficient for batch size
-
Solution: Reduce --batch_size 2 or process fewer complexes at once
Issue: Slow performance
-
Cause: Running on CPU instead of GPU
-
Solution: Verify CUDA with python -c "import torch; print(torch.cuda.is_available())" , use GPU
Issue: Unrealistic binding poses
-
Cause: Poor protein preparation, ligand too large, wrong binding site
-
Solution: Check protein for missing residues, remove far waters, consider specifying binding site
Issue: "Module not found" errors
-
Cause: Missing dependencies or wrong environment
-
Solution: Run python scripts/setup_check.py to diagnose
Performance Optimization
For Best Results:
-
Use GPU (essential for practical use)
-
Pre-compute ESM embeddings for repeated protein use
-
Batch process multiple complexes together
-
Start with default parameters, then tune if needed
-
Validate protein structures (resolve missing residues)
-
Use canonical SMILES for ligands
Graphical User Interface
For interactive use, launch the web interface:
python app/main.py
Navigate to http://localhost:7860
Or use the online demo without installation:
Resources
Helper Scripts (scripts/ )
prepare_batch_csv.py : Create and validate batch input CSV files
-
Create templates with example entries
-
Validate file paths and SMILES strings
-
Check for required columns and format issues
analyze_results.py : Analyze confidence scores and rank predictions
-
Parse results from single or batch runs
-
Generate statistical summaries
-
Export to CSV for downstream analysis
-
Identify top predictions across complexes
setup_check.py : Verify DiffDock environment setup
-
Check Python version and dependencies
-
Verify PyTorch and CUDA availability
-
Test RDKit and PyTorch Geometric installation
-
Provide installation instructions if needed
Reference Documentation (references/ )
parameters_reference.md : Complete parameter documentation
-
All command-line options and configuration parameters
-
Default values and acceptable ranges
-
Temperature parameters for controlling diversity
-
Model checkpoint locations and version flags
Read this file when users need:
-
Detailed parameter explanations
-
Fine-tuning guidance for specific systems
-
Alternative sampling strategies
confidence_and_limitations.md : Confidence score interpretation and tool limitations
-
Detailed confidence score interpretation
-
When to trust predictions
-
Scope and limitations of DiffDock
-
Integration with complementary tools
-
Troubleshooting prediction quality
Read this file when users need:
-
Help interpreting confidence scores
-
Understanding when NOT to use DiffDock
-
Guidance on combining with other tools
-
Validation strategies
workflows_examples.md : Comprehensive workflow examples
-
Detailed installation instructions
-
Step-by-step examples for all workflows
-
Advanced integration patterns
-
Troubleshooting common issues
-
Best practices and optimization tips
Read this file when users need:
-
Complete workflow examples with code
-
Integration with GNINA, OpenMM, or other tools
-
Virtual screening workflows
-
Ensemble docking procedures
Assets (assets/ )
batch_template.csv : Template for batch processing
-
Pre-formatted CSV with required columns
-
Example entries showing different input types
-
Ready to customize with actual data
custom_inference_config.yaml : Configuration template
-
Annotated YAML with all parameters
-
Four preset configurations for common use cases
-
Detailed comments explaining each parameter
-
Ready to customize and use
Best Practices
-
Always verify environment with setup_check.py before starting large jobs
-
Validate batch CSVs with prepare_batch_csv.py to catch errors early
-
Start with defaults then tune parameters based on system-specific needs
-
Generate multiple samples (10-40) for robust predictions
-
Visual inspection of top poses before downstream analysis
-
Combine with scoring functions for affinity assessment
-
Use confidence scores for initial ranking, not final decisions
-
Pre-compute embeddings for virtual screening campaigns
-
Document parameters used for reproducibility
-
Validate results experimentally when possible
Citations
When using DiffDock, cite the appropriate papers:
DiffDock-L (current default model):
Stärk et al. (2024) "DiffDock-L: Improving Molecular Docking with Diffusion Models" arXiv:2402.18396
Original DiffDock:
Corso et al. (2023) "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking" ICLR 2023, arXiv:2210.01776
Additional Resources
-
GitHub Repository: https://github.com/gcorso/DiffDock
-
Online Demo: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
-
DiffDock-L Paper: https://arxiv.org/abs/2402.18396
-
Original Paper: https://arxiv.org/abs/2210.01776