RCSB Protein Data Bank
The Protein Data Bank hosts over 200,000 experimentally determined macromolecular structures plus computed models from AlphaFold and ModelArchive. This skill provides programmatic access to search, download, and analyze structural data.
Applicable Scenarios
| Task | Examples |
|---|---|
| Structure Retrieval | Download coordinates for a known PDB ID |
| Similarity Search | Find structures similar by sequence or 3D fold |
| Metadata Access | Get resolution, method, organism, ligands |
| Dataset Building | Compile structures for ML training or analysis |
| Drug Discovery | Identify ligand-bound structures for a target |
| Quality Filtering | Select high-resolution, well-refined structures |
Setup
pip install rcsb-api requests
The rcsb-api package (v1.5.0+) provides:
- rcsbapi.search: query construction and execution
- rcsbapi.data: DataQuery for batch retrieval
Quick Reference
Search Queries
from rcsbapi.search import TextQuery, AttributeQuery, SeqSimilarityQuery, StructSimilarityQuery
# Text search
results = list(TextQuery("kinase inhibitor")())
# Filter by organism (use string attribute paths)
human = AttributeQuery(
    attribute="rcsb_entity_source_organism.scientific_name",
    operator="exact_match",
    value="Homo sapiens"
)
# Filter by resolution
high_res = AttributeQuery(
    attribute="rcsb_entry_info.resolution_combined",
    operator="less",
    value=2.0
)
# Filter by experimental method
xray = AttributeQuery(
    attribute="exptl.method",
    operator="exact_match",
    value="X-RAY DIFFRACTION"
)
# Combine queries: & (AND), | (OR), ~ (NOT)
results = list((TextQuery("kinase") & human & high_res)())
# Sequence similarity (MMseqs2) - minimum 25 residues required
seq_query = SeqSimilarityQuery(
    value="VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
    evalue_cutoff=1e-5,
    identity_cutoff=0.7
)
# Structure similarity (3D fold)
struct_query = StructSimilarityQuery(
    structure_search_type="entry_id",
    entry_id="4HHB"
)
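Sequences often arrive as FASTA text rather than a bare string. The sketch below strips header lines and whitespace and enforces the 25-residue minimum noted above before the value is handed to SeqSimilarityQuery. `clean_sequence` and the `>example_1` header are hypothetical names used for illustration, not part of rcsb-api.

```python
MIN_SEQ_LEN = 25  # minimum query length stated in this document


def clean_sequence(raw: str) -> str:
    """Join FASTA lines into one residue string and check the minimum length."""
    lines = [ln.strip() for ln in raw.splitlines()]
    # Drop blank lines and ">" header lines, then normalise case.
    residues = "".join(ln for ln in lines if ln and not ln.startswith(">")).upper()
    if len(residues) < MIN_SEQ_LEN:
        raise ValueError(
            f"sequence has {len(residues)} residues; at least {MIN_SEQ_LEN} are required"
        )
    return residues


# Illustrative FASTA text; the header line is made up.
fasta = ">example_1\nVLSPADKTNVKAAWGKVGAHAGEYG\nAEALERMFLSFPTTKTYFPHFDLSH\n"
sequence = clean_sequence(fasta)
print(len(sequence))
```

The cleaned string can then be passed as `value=` to SeqSimilarityQuery as in the example above.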
Data Retrieval (REST API)
The rcsb-api package's data module has limited functionality. Use the REST API directly for metadata:
import requests
def fetch_entry(pdb_id: str) -> dict:
    """Fetch entry metadata from RCSB REST API."""
    resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}", timeout=30)
    resp.raise_for_status()
    return resp.json()

# Example usage
data = fetch_entry("4HHB")
print(data["struct"]["title"])
print(data["rcsb_entry_info"]["resolution_combined"])  # a list of values in angstroms

# Polymer entity (chain info + sequence)
def fetch_polymer_entity(pdb_id: str, entity_id: int = 1) -> dict:
    """Fetch one polymer entity record, including its sequence."""
    resp = requests.get(
        f"https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/{entity_id}",
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

entity = fetch_polymer_entity("4HHB", 1)
sequence = entity["entity_poly"]["pdbx_seq_one_letter_code"]
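Entry records are nested, some fields are lists (`resolution_combined`, `exptl`), and some can be missing for a given entry. A minimal sketch of a defensive extractor follows, assuming the field layout used in the examples above. `summarize_entry` is a hypothetical helper and the sample record is hand-written with illustrative values.

```python
from typing import Any, Optional


def summarize_entry(data: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Pull a few common fields out of an entry record without raising on gaps."""
    info = data.get("rcsb_entry_info") or {}
    resolutions = info.get("resolution_combined") or []
    methods = [e.get("method") for e in (data.get("exptl") or []) if e.get("method")]
    return {
        "title": (data.get("struct") or {}).get("title"),
        # resolution_combined is a list; report the lowest value if any exist.
        "resolution": min(resolutions) if resolutions else None,
        "methods": methods,
    }


# Hand-written sample that mimics the shape of a response (values illustrative).
sample = {
    "struct": {"title": "Example structure"},
    "rcsb_entry_info": {"resolution_combined": [1.74]},
    "exptl": [{"method": "X-RAY DIFFRACTION"}],
}
print(summarize_entry(sample))
```

Returning None instead of raising keeps batch jobs running when an entry has no resolution recorded.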
File Downloads
| Format | URL Pattern |
|---|---|
| PDB | https://files.rcsb.org/download/{ID}.pdb |
| mmCIF | https://files.rcsb.org/download/{ID}.cif |
| Assembly | https://files.rcsb.org/download/{ID}.pdb1 |
| FASTA | https://www.rcsb.org/fasta/entry/{ID} |
import requests
from pathlib import Path
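def download_structure(pdb_id: str, fmt: str = "cif", outdir: str = ".") -> Path:
    """Download a structure file and return the path it was written to."""
    url = f"https://files.rcsb.org/download/{pdb_id}.{fmt}"
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    outpath = Path(outdir) / f"{pdb_id}.{fmt}"
    outpath.parent.mkdir(parents=True, exist_ok=True)  # create outdir if missing
    outpath.write_text(resp.text)
    return outpath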
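The FASTA pattern in the table lives on a different host from the coordinate files, so a single `{ID}.{fmt}` template does not cover every row. One way to handle this is a small lookup keyed by format, sketched below. `FILE_URL_PATTERNS` and `structure_url` are hypothetical names, and upper-casing the ID is an assumption made for consistency with the examples here.

```python
# URL templates copied from the table above.
FILE_URL_PATTERNS = {
    "pdb": "https://files.rcsb.org/download/{id}.pdb",
    "cif": "https://files.rcsb.org/download/{id}.cif",
    "assembly": "https://files.rcsb.org/download/{id}.pdb1",
    "fasta": "https://www.rcsb.org/fasta/entry/{id}",
}


def structure_url(pdb_id: str, kind: str = "cif") -> str:
    """Return the download URL for one entry in the requested format."""
    try:
        pattern = FILE_URL_PATTERNS[kind]
    except KeyError:
        known = ", ".join(sorted(FILE_URL_PATTERNS))
        raise ValueError(f"unknown format {kind!r}; expected one of: {known}") from None
    # Assumption: IDs are normalised to upper case, as written in this document.
    return pattern.format(id=pdb_id.strip().upper())


print(structure_url("4hhb"))
print(structure_url("4HHB", "fasta"))
```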
Common Workflows
Find High-Quality Human Structures
from rcsbapi.search import TextQuery, AttributeQuery
query = (
    TextQuery("receptor") &
    AttributeQuery(
        attribute="rcsb_entity_source_organism.scientific_name",
        operator="exact_match",
        value="Homo sapiens"
    ) &
    AttributeQuery(
        attribute="rcsb_entry_info.resolution_combined",
        operator="less",
        value=2.5
    ) &
    AttributeQuery(
        attribute="exptl.method",
        operator="exact_match",
        value="X-RAY DIFFRACTION"
    )
)
results = list(query())
Batch Metadata Retrieval
import requests
import time
def fetch_batch(pdb_ids: list[str], delay: float = 0.3) -> dict:
    """Fetch metadata with rate limiting."""
    results = {}
    for pdb_id in pdb_ids:
        try:
            resp = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}", timeout=30)
            resp.raise_for_status()
            data = resp.json()
            results[pdb_id] = {
                "title": data["struct"]["title"],
                "resolution": data.get("rcsb_entry_info", {}).get("resolution_combined"),
                "method": data.get("exptl", [{}])[0].get("method"),
            }
        except Exception as e:
            # Record the failure and continue so one bad ID does not abort the batch
            results[pdb_id] = {"error": str(e)}
        time.sleep(delay)  # pause between requests to stay under the rate limit
    return results
Find Drug-Bound Structures
from rcsbapi.search import AttributeQuery
# Find structures containing imatinib (ligand ID: STI)
query = AttributeQuery(
    attribute="rcsb_nonpolymer_entity_instance_container_identifiers.comp_id",
    operator="exact_match",
    value="STI"
)
drug_complexes = list(query())
GraphQL for Complex Queries
import requests
query = """
{
  entry(entry_id: "4HHB") {
    struct { title }
    rcsb_entry_info {
      resolution_combined
      deposited_atom_count
    }
    polymer_entities {
      rcsb_polymer_entity { pdbx_description }
      entity_poly { pdbx_seq_one_letter_code }
    }
  }
}
"""
response = requests.post(
    "https://data.rcsb.org/graphql",
    json={"query": query},
    timeout=30
)
response.raise_for_status()  # fail early on HTTP errors before parsing
result = response.json()["data"]["entry"]
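GraphQL can also fetch several entries in one request, which is often cheaper than looping over the REST endpoint. The sketch below only builds the request body. It assumes the schema exposes a plural `entries(entry_ids: [...])` root field alongside `entry`, which should be checked against the live schema before relying on it. `build_entries_payload` is a hypothetical helper name.

```python
import json


def build_entries_payload(pdb_ids: list[str]) -> dict[str, str]:
    """Build a GraphQL request body asking for several entries at once."""
    # A JSON array of strings is also a valid GraphQL list literal.
    id_list = json.dumps([pdb_id.upper() for pdb_id in pdb_ids])
    query = f"""
{{
  entries(entry_ids: {id_list}) {{
    rcsb_id
    struct {{ title }}
    rcsb_entry_info {{ resolution_combined }}
  }}
}}
"""
    return {"query": query}


payload = build_entries_payload(["4hhb", "1crn"])
print(payload["query"])
```

The payload can be sent with the same `requests.post(..., json=payload)` call shown above. Under the stated assumption, the records come back as a list under `data.entries`.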
Key Concepts
| Term | Definition |
|---|---|
| PDB ID | 4-character alphanumeric code (e.g., "4HHB"). AlphaFold uses "AF_" prefix |
| Entity | Distinct molecular species. A homodimer has one entity appearing twice |
| Resolution | Quality metric in angstroms. Lower is better; <2.0 Å is high quality |
| Biological Assembly | Functional oligomeric state (may differ from asymmetric unit) |
| mmCIF | Modern format replacing legacy PDB; required for large structures |
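When a pipeline mixes experimental entries and computed models, it helps to sort identifiers before choosing an endpoint. A rough sketch, based on the ID conventions in the table above: the regular expression (a leading digit 1-9 followed by three alphanumerics) is an assumption about classic PDB IDs, and `classify_id` is a hypothetical helper.

```python
import re

# Assumption: classic PDB IDs are a digit 1-9 followed by three alphanumerics.
_PDB_ID = re.compile(r"^[1-9][A-Z0-9]{3}$")


def classify_id(identifier: str) -> str:
    """Label an identifier as 'pdb_entry', 'alphafold_model', or 'unknown'."""
    ident = identifier.strip().upper()
    if ident.startswith("AF_"):
        return "alphafold_model"
    if _PDB_ID.match(ident):
        return "pdb_entry"
    return "unknown"


for example in ["4HHB", "4hhb", "AF_AFP68871F1", "not-an-id"]:
    print(example, classify_id(example))
```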
Common Attribute Paths
Use these string paths with AttributeQuery:
| Attribute | Description |
|---|---|
| rcsb_entity_source_organism.scientific_name | Source organism (e.g., "Homo sapiens") |
| rcsb_entry_info.resolution_combined | Resolution in angstroms |
| exptl.method | Experimental method (X-RAY DIFFRACTION, ELECTRON MICROSCOPY, SOLUTION NMR) |
| rcsb_nonpolymer_entity_instance_container_identifiers.comp_id | Ligand/small molecule ID |
| struct.title | Structure title |
| rcsb_accession_info.deposit_date | Deposition date |
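When debugging a search, it can help to know roughly what the JSON sent to the Search API looks like. The sketch below builds that structure by hand for two of the attribute paths above. The node layout (`terminal` and `group` nodes, `logical_operator`, `return_type`) reflects the Search API request format as understood here and should be compared against the output of a real query object. `attribute_node` and `and_group` are hypothetical helper names.

```python
import json
from typing import Any


def attribute_node(attribute: str, operator: str, value: Any) -> dict[str, Any]:
    """One attribute filter expressed as a raw search node."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def and_group(*nodes: dict[str, Any]) -> dict[str, Any]:
    """Combine nodes so that all of them must match."""
    return {"type": "group", "logical_operator": "and", "nodes": list(nodes)}


payload = {
    "query": and_group(
        attribute_node("rcsb_entity_source_organism.scientific_name", "exact_match", "Homo sapiens"),
        attribute_node("rcsb_entry_info.resolution_combined", "less", 2.0),
    ),
    "return_type": "entry",
}
print(json.dumps(payload, indent=2))
```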
Best Practices
| Practice | Rationale |
|---|---|
| Use mmCIF format | Legacy PDB format cannot hold more than 99,999 atoms or 62 chains in one file |
| Filter by resolution | <2.5 Å for most analyses; <2.0 Å for detailed work |
| Check experimental method | X-ray vs cryo-EM vs NMR have different quality metrics |
| Rate limit requests | 2-3 req/s to avoid 429 errors |
| Cache downloads | Structures rarely change after release |
| Prefer GraphQL | Reduces requests for complex data needs |
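The "Cache downloads" practice can be sketched as a thin wrapper that checks a local directory before fetching. The fetcher is passed in as a function so the example runs offline; in real use it would wrap a download call. `cached_fetch` and `fake_fetch` are hypothetical names, and the `.cif` file naming is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(pdb_id: str, cache_dir: Path, fetch: Callable[[str], str]) -> str:
    """Return file text from cache_dir, calling fetch(pdb_id) only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{pdb_id.upper()}.cif"
    if path.exists():
        return path.read_text()
    text = fetch(pdb_id)
    path.write_text(text)
    return text


# Demo with a stand-in fetcher that records how often it is called.
calls: list[str] = []


def fake_fetch(pdb_id: str) -> str:
    calls.append(pdb_id)
    return f"data_{pdb_id}\n"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("4HHB", Path(tmp), fake_fetch)
    second = cached_fetch("4HHB", Path(tmp), fake_fetch)  # served from disk

print(len(calls))  # number of times the fetcher actually ran
```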
Troubleshooting
| Issue | Resolution |
|---|---|
| 404 on entry fetch | Entry may be obsoleted; check RCSB website for superseding ID |
| 429 Too Many Requests | Implement exponential backoff; reduce request rate |
| Empty search results | Check query syntax; use query.to_dict() to debug |
| Large structure fails | Use mmCIF format instead of PDB |
| Missing sequence data | Query polymer entity endpoint, not entry |
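The exponential backoff suggested for 429 responses can be sketched as a generic retry wrapper. `RateLimited` is a hypothetical exception that the calling code would raise when it sees `resp.status_code == 429`. The retry count and base delay are illustrative defaults, not values recommended by RCSB.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical marker raised by the caller on an HTTP 429 response."""


def with_backoff(
    call: Callable[[], T],
    retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() on RateLimited, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be at least 1")


# Offline demo: fail twice, then succeed; record the waits instead of sleeping.
attempts = {"n": 0}
waits: list[float] = []


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"


outcome = with_backoff(flaky, sleep=waits.append)
print(outcome, waits)
```

Passing `sleep` as a parameter keeps the helper testable without real delays.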
References
See references/api-reference.md for:
- Complete REST endpoint documentation
- All searchable attributes and operators
- Advanced query patterns
- Rate limiting strategies
External Links