RDKit Cheminformatics Toolkit
Overview
RDKit is a comprehensive cheminformatics library providing Python APIs for molecular analysis and manipulation. This skill provides guidance for reading/writing molecular structures, calculating descriptors, fingerprinting, substructure searching, chemical reactions, 2D/3D coordinate generation, and molecular visualization. Use this skill for drug discovery, computational chemistry, and cheminformatics research tasks.
Core Capabilities
- Molecular I/O and Creation
Reading Molecules:
Read molecular structures from various formats:
from rdkit import Chem
From SMILES strings
mol = Chem.MolFromSmiles('Cc1ccccc1') # Returns Mol object or None
From MOL files
mol = Chem.MolFromMolFile('path/to/file.mol')
From MOL blocks (string data)
mol = Chem.MolFromMolBlock(mol_block_string)
From InChI
mol = Chem.MolFromInchi('InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H')
Writing Molecules:
Convert molecules to text representations:
To canonical SMILES
smiles = Chem.MolToSmiles(mol)
To MOL block
mol_block = Chem.MolToMolBlock(mol)
To InChI
inchi = Chem.MolToInchi(mol)
Batch Processing:
For processing multiple molecules, use Supplier/Writer objects:
Read SDF files
suppl = Chem.SDMolSupplier('molecules.sdf') for mol in suppl: if mol is not None: # Check for parsing errors # Process molecule pass
Read SMILES files
suppl = Chem.SmilesMolSupplier('molecules.smi', titleLine=False)
For large files or compressed data
with gzip.open('molecules.sdf.gz') as f: suppl = Chem.ForwardSDMolSupplier(f) for mol in suppl: # Process molecule pass
Multithreaded processing for large datasets
suppl = Chem.MultithreadedSDMolSupplier('molecules.sdf')
Write molecules to SDF
writer = Chem.SDWriter('output.sdf') for mol in molecules: writer.write(mol) writer.close()
Important Notes:
-
All MolFrom* functions return None on failure with error messages
-
Always check for None before processing molecules
-
Molecules are automatically sanitized on import (validates valence, perceives aromaticity)
- Molecular Sanitization and Validation
RDKit automatically sanitizes molecules during parsing, executing 13 steps including valence checking, aromaticity perception, and chirality assignment.
Sanitization Control:
Disable automatic sanitization
mol = Chem.MolFromSmiles('C1=CC=CC=C1', sanitize=False)
Manual sanitization
Chem.SanitizeMol(mol)
Detect problems before sanitization
problems = Chem.DetectChemistryProblems(mol) for problem in problems: print(problem.GetType(), problem.Message())
Partial sanitization (skip specific steps)
from rdkit.Chem import rdMolStandardize Chem.SanitizeMol(mol, sanitizeOps=Chem.SANITIZE_ALL ^ Chem.SANITIZE_PROPERTIES)
Common Sanitization Issues:
-
Atoms with explicit valence exceeding maximum allowed will raise exceptions
-
Invalid aromatic rings will cause kekulization errors
-
Radical electrons may not be properly assigned without explicit specification
- Molecular Analysis and Properties
Accessing Molecular Structure:
Iterate atoms and bonds
for atom in mol.GetAtoms(): print(atom.GetSymbol(), atom.GetIdx(), atom.GetDegree())
for bond in mol.GetBonds(): print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondType())
Ring information
ring_info = mol.GetRingInfo() ring_info.NumRings() ring_info.AtomRings() # Returns tuples of atom indices
Check if atom is in ring
atom = mol.GetAtomWithIdx(0) atom.IsInRing() atom.IsInRingSize(6) # Check for 6-membered rings
Find smallest set of smallest rings (SSSR)
from rdkit.Chem import GetSymmSSSR rings = GetSymmSSSR(mol)
Stereochemistry:
Find chiral centers
from rdkit.Chem import FindMolChiralCenters chiral_centers = FindMolChiralCenters(mol, includeUnassigned=True)
Returns list of (atom_idx, chirality) tuples
Assign stereochemistry from 3D coordinates
from rdkit.Chem import AssignStereochemistryFrom3D AssignStereochemistryFrom3D(mol)
Check bond stereochemistry
bond = mol.GetBondWithIdx(0) stereo = bond.GetStereo() # STEREONONE, STEREOZ, STEREOE, etc.
Fragment Analysis:
Get disconnected fragments
frags = Chem.GetMolFrags(mol, asMols=True)
Fragment on specific bonds
from rdkit.Chem import FragmentOnBonds frag_mol = FragmentOnBonds(mol, [bond_idx1, bond_idx2])
Count ring systems
from rdkit.Chem.Scaffolds import MurckoScaffold scaffold = MurckoScaffold.GetScaffoldForMol(mol)
- Molecular Descriptors and Properties
Basic Descriptors:
from rdkit.Chem import Descriptors
Molecular weight
mw = Descriptors.MolWt(mol) exact_mw = Descriptors.ExactMolWt(mol)
LogP (lipophilicity)
logp = Descriptors.MolLogP(mol)
Topological polar surface area
tpsa = Descriptors.TPSA(mol)
Number of hydrogen bond donors/acceptors
hbd = Descriptors.NumHDonors(mol) hba = Descriptors.NumHAcceptors(mol)
Number of rotatable bonds
rot_bonds = Descriptors.NumRotatableBonds(mol)
Number of aromatic rings
aromatic_rings = Descriptors.NumAromaticRings(mol)
Batch Descriptor Calculation:
Calculate all descriptors at once
all_descriptors = Descriptors.CalcMolDescriptors(mol)
Returns dictionary: {'MolWt': 180.16, 'MolLogP': 1.23, ...}
Get list of available descriptor names
descriptor_names = [desc[0] for desc in Descriptors._descList]
Lipinski's Rule of Five:
Check drug-likeness
mw = Descriptors.MolWt(mol) <= 500 logp = Descriptors.MolLogP(mol) <= 5 hbd = Descriptors.NumHDonors(mol) <= 5 hba = Descriptors.NumHAcceptors(mol) <= 10
is_drug_like = mw and logp and hbd and hba
- Fingerprints and Molecular Similarity
Fingerprint Types:
from rdkit.Chem import rdFingerprintGenerator from rdkit.Chem import MACCSkeys
RDKit topological fingerprint
rdk_gen = rdFingerprintGenerator.GetRDKitFPGenerator(minPath=1, maxPath=7, fpSize=2048) fp = rdk_gen.GetFingerprint(mol)
Morgan fingerprints (circular fingerprints, similar to ECFP)
Modern API using rdFingerprintGenerator
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048) fp = morgan_gen.GetFingerprint(mol)
Count-based fingerprint
fp_count = morgan_gen.GetCountFingerprint(mol)
MACCS keys (166-bit structural key)
fp = MACCSkeys.GenMACCSKeys(mol)
Atom pair fingerprints
ap_gen = rdFingerprintGenerator.GetAtomPairGenerator() fp = ap_gen.GetFingerprint(mol)
Topological torsion fingerprints
tt_gen = rdFingerprintGenerator.GetTopologicalTorsionGenerator() fp = tt_gen.GetFingerprint(mol)
Avalon fingerprints (if available)
from rdkit.Avalon import pyAvalonTools fp = pyAvalonTools.GetAvalonFP(mol)
Similarity Calculation:
from rdkit import DataStructs from rdkit.Chem import rdFingerprintGenerator
Generate fingerprints using generator
mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048) fp1 = mfpgen.GetFingerprint(mol1) fp2 = mfpgen.GetFingerprint(mol2)
Calculate Tanimoto similarity
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
Calculate similarity for multiple molecules
fps = [mfpgen.GetFingerprint(m) for m in [mol2, mol3, mol4]] similarities = DataStructs.BulkTanimotoSimilarity(fp1, fps)
Other similarity metrics
dice = DataStructs.DiceSimilarity(fp1, fp2) cosine = DataStructs.CosineSimilarity(fp1, fp2)
Clustering and Diversity:
Butina clustering based on fingerprint similarity
from rdkit.ML.Cluster import Butina
Calculate distance matrix
dists = [] mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048) fps = [mfpgen.GetFingerprint(mol) for mol in mols] for i in range(len(fps)): sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i]) dists.extend([1-sim for sim in sims])
Cluster with distance cutoff
clusters = Butina.ClusterData(dists, len(fps), distThresh=0.3, isDistData=True)
- Substructure Searching and SMARTS
Basic Substructure Matching:
Define query using SMARTS
query = Chem.MolFromSmarts('[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1') # Benzene ring
Check if molecule contains substructure
has_match = mol.HasSubstructMatch(query)
Get all matches (returns tuple of tuples with atom indices)
matches = mol.GetSubstructMatches(query)
Get only first match
match = mol.GetSubstructMatch(query)
Common SMARTS Patterns:
Primary alcohols
primary_alcohol = Chem.MolFromSmarts('[CH2][OH1]')
Carboxylic acids
carboxylic_acid = Chem.MolFromSmarts('C(=O)[OH]')
Amides
amide = Chem.MolFromSmarts('C(=O)N')
Aromatic heterocycles
aromatic_n = Chem.MolFromSmarts('[nR]') # Aromatic nitrogen in ring
Macrocycles (rings > 12 atoms)
macrocycle = Chem.MolFromSmarts('[r{12-}]')
Matching Rules:
-
Unspecified properties in query match any value in target
-
Hydrogens are ignored unless explicitly specified
-
Charged query atom won't match uncharged target atom
-
Aromatic query atom won't match aliphatic target atom (unless query is generic)
- Chemical Reactions
Reaction SMARTS:
from rdkit.Chem import AllChem
Define reaction using SMARTS: reactants >> products
rxn = AllChem.ReactionFromSmarts('[C:1]=[O:2]>>[C:1][O:2]') # Ketone reduction
Apply reaction to molecules
reactants = (mol1,) products = rxn.RunReactants(reactants)
Products is tuple of tuples (one tuple per product set)
for product_set in products: for product in product_set: # Sanitize product Chem.SanitizeMol(product)
Reaction Features:
-
Atom mapping preserves specific atoms between reactants and products
-
Dummy atoms in products are replaced by corresponding reactant atoms
-
"Any" bonds inherit bond order from reactants
-
Chirality preserved unless explicitly changed
Reaction Similarity:
Generate reaction fingerprints
fp = AllChem.CreateDifferenceFingerprintForReaction(rxn)
Compare reactions
similarity = DataStructs.TanimotoSimilarity(fp1, fp2)
- 2D and 3D Coordinate Generation
2D Coordinate Generation:
from rdkit.Chem import AllChem
Generate 2D coordinates for depiction
AllChem.Compute2DCoords(mol)
Align molecule to template structure
template = Chem.MolFromSmiles('c1ccccc1') AllChem.Compute2DCoords(template) AllChem.GenerateDepictionMatching2DStructure(mol, template)
3D Coordinate Generation and Conformers:
Generate single 3D conformer using ETKDG
AllChem.EmbedMolecule(mol, randomSeed=42)
Generate multiple conformers
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42)
Optimize geometry with force field
AllChem.UFFOptimizeMolecule(mol) # UFF force field AllChem.MMFFOptimizeMolecule(mol) # MMFF94 force field
Optimize all conformers
for conf_id in conf_ids: AllChem.MMFFOptimizeMolecule(mol, confId=conf_id)
Calculate RMSD between conformers
from rdkit.Chem import AllChem rms = AllChem.GetConformerRMS(mol, conf_id1, conf_id2)
Align molecules
AllChem.AlignMol(probe_mol, ref_mol)
Constrained Embedding:
Embed with part of molecule constrained to specific coordinates
AllChem.ConstrainedEmbed(mol, core_mol)
- Molecular Visualization
Basic Drawing:
from rdkit.Chem import Draw
Draw single molecule to PIL image
img = Draw.MolToImage(mol, size=(300, 300)) img.save('molecule.png')
Draw to file directly
Draw.MolToFile(mol, 'molecule.png')
Draw multiple molecules in grid
mols = [mol1, mol2, mol3, mol4] img = Draw.MolsToGridImage(mols, molsPerRow=2, subImgSize=(200, 200))
Highlighting Substructures:
Highlight substructure match
query = Chem.MolFromSmarts('c1ccccc1') match = mol.GetSubstructMatch(query)
img = Draw.MolToImage(mol, highlightAtoms=match)
Custom highlight colors
highlight_colors = {atom_idx: (1, 0, 0) for atom_idx in match} # Red img = Draw.MolToImage(mol, highlightAtoms=match, highlightAtomColors=highlight_colors)
Customizing Visualization:
from rdkit.Chem.Draw import rdMolDraw2D
Create drawer with custom options
drawer = rdMolDraw2D.MolDraw2DCairo(300, 300) opts = drawer.drawOptions()
Customize options
opts.addAtomIndices = True opts.addStereoAnnotation = True opts.bondLineWidth = 2
Draw molecule
drawer.DrawMolecule(mol) drawer.FinishDrawing()
Save to file
with open('molecule.png', 'wb') as f: f.write(drawer.GetDrawingText())
Jupyter Notebook Integration:
Enable inline display in Jupyter
from rdkit.Chem.Draw import IPythonConsole
Customize default display
IPythonConsole.ipython_useSVG = True # Use SVG instead of PNG IPythonConsole.molSize = (300, 300) # Default size
Molecules now display automatically
mol # Shows molecule image
Visualizing Fingerprint Bits:
Show what molecular features a fingerprint bit represents
from rdkit.Chem import Draw
For Morgan fingerprints
bit_info = {} fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, bitInfo=bit_info)
Draw environment for specific bit
img = Draw.DrawMorganBit(mol, bit_id, bit_info)
- Molecular Modification
Adding/Removing Hydrogens:
Add explicit hydrogens
mol_h = Chem.AddHs(mol)
Remove explicit hydrogens
mol = Chem.RemoveHs(mol_h)
Kekulization and Aromaticity:
Convert aromatic bonds to alternating single/double
Chem.Kekulize(mol)
Set aromaticity
Chem.SetAromaticity(mol)
Replacing Substructures:
Replace substructure with another structure
query = Chem.MolFromSmarts('c1ccccc1') # Benzene replacement = Chem.MolFromSmiles('C1CCCCC1') # Cyclohexane
new_mol = Chem.ReplaceSubstructs(mol, query, replacement)[0]
Neutralizing Charges:
Remove formal charges by adding/removing hydrogens
from rdkit.Chem.MolStandardize import rdMolStandardize
Using Uncharger
uncharger = rdMolStandardize.Uncharger() mol_neutral = uncharger.uncharge(mol)
- Working with Molecular Hashes and Standardization
Molecular Hashing:
from rdkit.Chem import rdMolHash
Generate Murcko scaffold hash
scaffold_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.MurckoScaffold)
Canonical SMILES hash
canonical_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.CanonicalSmiles)
Regioisomer hash (ignores stereochemistry)
regio_hash = rdMolHash.MolHash(mol, rdMolHash.HashFunction.Regioisomer)
Randomized SMILES:
Generate random SMILES representations (for data augmentation)
from rdkit.Chem import MolToRandomSmilesVect
random_smiles = MolToRandomSmilesVect(mol, numSmiles=10, randomSeed=42)
- Pharmacophore and 3D Features
Pharmacophore Features:
from rdkit.Chem import ChemicalFeatures from rdkit import RDConfig import os
Load feature factory
fdef_path = os.path.join(RDConfig.RDDataDir, 'BaseFeatures.fdef') factory = ChemicalFeatures.BuildFeatureFactory(fdef_path)
Get pharmacophore features
features = factory.GetFeaturesForMol(mol)
for feat in features: print(feat.GetFamily(), feat.GetType(), feat.GetAtomIds())
Common Workflows
Drug-likeness Analysis
from rdkit import Chem from rdkit.Chem import Descriptors
def analyze_druglikeness(smiles): mol = Chem.MolFromSmiles(smiles) if mol is None: return None
# Calculate Lipinski descriptors
results = {
'MW': Descriptors.MolWt(mol),
'LogP': Descriptors.MolLogP(mol),
'HBD': Descriptors.NumHDonors(mol),
'HBA': Descriptors.NumHAcceptors(mol),
'TPSA': Descriptors.TPSA(mol),
'RotBonds': Descriptors.NumRotatableBonds(mol)
}
# Check Lipinski's Rule of Five
results['Lipinski'] = (
results['MW'] <= 500 and
results['LogP'] <= 5 and
results['HBD'] <= 5 and
results['HBA'] <= 10
)
return results
Similarity Screening
from rdkit import Chem from rdkit.Chem import AllChem from rdkit import DataStructs
def similarity_screen(query_smiles, database_smiles, threshold=0.7): query_mol = Chem.MolFromSmiles(query_smiles) query_fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, 2)
hits = []
for idx, smiles in enumerate(database_smiles):
mol = Chem.MolFromSmiles(smiles)
if mol:
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
sim = DataStructs.TanimotoSimilarity(query_fp, fp)
if sim >= threshold:
hits.append((idx, smiles, sim))
return sorted(hits, key=lambda x: x[2], reverse=True)
Substructure Filtering
from rdkit import Chem
def filter_by_substructure(smiles_list, pattern_smarts): query = Chem.MolFromSmarts(pattern_smarts)
hits = []
for smiles in smiles_list:
mol = Chem.MolFromSmiles(smiles)
if mol and mol.HasSubstructMatch(query):
hits.append(smiles)
return hits
Best Practices
Error Handling
Always check for None when parsing molecules:
mol = Chem.MolFromSmiles(smiles) if mol is None: print(f"Failed to parse: {smiles}") continue
Performance Optimization
Use binary formats for storage:
import pickle
Pickle molecules for fast loading
with open('molecules.pkl', 'wb') as f: pickle.dump(mols, f)
Load pickled molecules (much faster than reparsing)
with open('molecules.pkl', 'rb') as f: mols = pickle.load(f)
Use bulk operations:
Calculate fingerprints for all molecules at once
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]
Use bulk similarity calculations
similarities = DataStructs.BulkTanimotoSimilarity(fps[0], fps[1:])
Thread Safety
RDKit operations are generally thread-safe for:
-
Molecule I/O (SMILES, mol blocks)
-
Coordinate generation
-
Fingerprinting and descriptors
-
Substructure searching
-
Reactions
-
Drawing
Not thread-safe: MolSuppliers when accessed concurrently.
Memory Management
For large datasets:
Use ForwardSDMolSupplier to avoid loading entire file
with open('large.sdf') as f: suppl = Chem.ForwardSDMolSupplier(f) for mol in suppl: # Process one molecule at a time pass
Use MultithreadedSDMolSupplier for parallel processing
suppl = Chem.MultithreadedSDMolSupplier('large.sdf', numWriterThreads=4)
Common Pitfalls
-
Forgetting to check for None: Always validate molecules after parsing
-
Sanitization failures: Use DetectChemistryProblems() to debug
-
Missing hydrogens: Use AddHs() when calculating properties that depend on hydrogen
-
2D vs 3D: Generate appropriate coordinates before visualization or 3D analysis
-
SMARTS matching rules: Remember that unspecified properties match anything
-
Thread safety with MolSuppliers: Don't share supplier objects across threads
Resources
references/
This skill includes detailed API reference documentation:
-
api_reference.md
-
Comprehensive listing of RDKit modules, functions, and classes organized by functionality
-
descriptors_reference.md
-
Complete list of available molecular descriptors with descriptions
-
smarts_patterns.md
-
Common SMARTS patterns for functional groups and structural features
Load these references when needing specific API details, parameter information, or pattern examples.
scripts/
Example scripts for common RDKit workflows:
-
molecular_properties.py
-
Calculate comprehensive molecular properties and descriptors
-
similarity_search.py
-
Perform fingerprint-based similarity screening
-
substructure_filter.py
-
Filter molecules by substructure patterns
These scripts can be executed directly or used as templates for custom workflows.