Biopython: Python Tools for Computational Biology
Summary
Biopython (v1.85+) is a comprehensive Python library for biological data analysis. It requires Python 3 and NumPy, and provides modular components for sequences, alignments, database access, BLAST, structures, and phylogenetics.
Applicable Scenarios
This skill applies when you need to:
| Task Category | Examples |
|---|---|
| Sequence Operations | Create, modify, translate DNA/RNA/protein sequences |
| File Format Handling | Parse or convert FASTA, GenBank, FASTQ, PDB, mmCIF |
| NCBI Database Access | Query GenBank, PubMed, Protein, Gene, Taxonomy |
| Similarity Searches | Execute BLAST locally or via NCBI, parse results |
| Alignment Work | Pairwise or multiple sequence alignments |
| Structural Analysis | Parse PDB files, compute distances, DSSP assignment |
| Tree Construction | Build, manipulate, visualize phylogenetic trees |
| Motif Discovery | Find and score sequence patterns |
| Sequence Statistics | GC content, molecular weight, melting temperature |
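As a rough illustration of the Sequence Statistics row, the sketch below computes GC content, an approximate molecular weight, and a Wallace-rule melting temperature. The 16-nt primer is an invented sequence used only to exercise the calls; gc_fraction, molecular_weight, and MeltingTemp.Tm_Wallace come from Bio.SeqUtils.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils import MeltingTemp as mt

# Hypothetical primer (assumption for illustration, not from any dataset).
primer = Seq("ACGTTGCAATGCCGTA")

gc = gc_fraction(primer)                       # fraction in [0, 1], not a percentage
mw = molecular_weight(primer, seq_type="DNA")  # daltons, single-stranded DNA
tm = mt.Tm_Wallace(primer)                     # 4 * (G+C) + 2 * (A+T), degrees C

print(f"GC: {gc:.1%}  MW: {mw:.1f} Da  Tm (Wallace): {tm:.1f} C")
```

The Wallace rule is a coarse estimate intended for short oligos; Bio.SeqUtils.MeltingTemp also offers nearest-neighbor methods when more accuracy is needed.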
Module Organization
| Module | Purpose | Reference |
|---|---|---|
| Bio.Seq / Bio.SeqIO | Sequence objects and file I/O | references/sequence-io.md |
| Bio.Align / Bio.AlignIO | Pairwise and multiple alignments | references/alignment.md |
| Bio.Entrez | NCBI database programmatic access | references/databases.md |
| Bio.Blast | BLAST execution and result parsing | references/blast.md |
| Bio.PDB | 3D structure manipulation | references/structure.md |
| Bio.Phylo | Phylogenetic tree operations | references/phylogenetics.md |
| Bio.motifs, Bio.SeqUtils, etc. | Motifs, utilities, restriction sites | references/advanced.md |
Setup
Install with uv's pip interface (plain pip install biopython works as well):
uv pip install biopython
Configure NCBI access (mandatory for Entrez operations):
from Bio import Entrez
Entrez.email = "researcher@institution.edu"
Entrez.api_key = "your_ncbi_api_key" # Optional: increases rate limit to 10 req/s
Quick Reference
Parse Sequences
from Bio import SeqIO
records = SeqIO.parse("data.fasta", "fasta")
for rec in records:
    print(f"{rec.id}: {len(rec)} bp")
Translate DNA
from Bio.Seq import Seq
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
protein = dna.translate()
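A short sketch of what the translation above yields, plus two closely related calls. It reuses the same sequence; the variable names full, first_orf, and rev are just illustrative.

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

full = dna.translate()                   # stop codons are rendered as "*"
first_orf = dna.translate(to_stop=True)  # truncate at the first in-frame stop
rev = dna.reverse_complement()           # opposite strand, read 5' to 3'

print(full)       # MAIVMGR*KGAR*
print(first_orf)  # MAIVMGR
print(rev)
```

translate() uses the standard genetic code by default; pass table= to select another NCBI translation table.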
Query NCBI
from Bio import Entrez
Entrez.email = "researcher@institution.edu"
handle = Entrez.esearch(db="nucleotide", term="insulin[Gene] AND human[Organism]")
results = Entrez.read(handle)
handle.close()
Run BLAST
from Bio.Blast import NCBIWWW, NCBIXML
result = NCBIWWW.qblast("blastp", "swissprot", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH")
record = NCBIXML.read(result)
Parse Protein Structure
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "structure.pdb")
for atom in structure.get_atoms():
    print(atom.name, atom.coord)
Build Phylogenetic Tree
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
alignment = AlignIO.read("aligned.fasta", "fasta")
calc = DistanceCalculator("identity")
dm = calc.get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)
Phylo.draw_ascii(tree)
Reference Files
| File | Contents |
|---|---|
| references/sequence-io.md | Bio.Seq objects, SeqIO parsing/writing, large file handling, format conversion |
| references/alignment.md | Pairwise alignment, BLOSUM matrices, AlignIO, external aligners |
| references/databases.md | NCBI Entrez API, esearch/efetch/elink, batch downloads, search syntax |
| references/blast.md | Remote/local BLAST, XML parsing, result filtering, batch queries |
| references/structure.md | Bio.PDB, SMCRA hierarchy, DSSP, superimposition, spatial queries |
| references/phylogenetics.md | Tree I/O, distance matrices, tree construction, consensus, visualization |
| references/advanced.md | Motifs, SeqUtils, restriction enzymes, population genetics, GenomeDiagram |
Implementation Patterns
Retrieve and Analyze GenBank Record
from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction
Entrez.email = "researcher@institution.edu"
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(f"Organism: {record.annotations['organism']}")
print(f"Length: {len(record)} bp")
print(f"GC: {gc_fraction(record.seq):.1%}")
Batch Sequence Processing
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction
output_records = []
for record in SeqIO.parse("input.fasta", "fasta"):
    if len(record) >= 200 and gc_fraction(record.seq) > 0.4:
        output_records.append(record)
SeqIO.write(output_records, "filtered.fasta", "fasta")
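The same filter can be tried without any files on disk. The sketch below is one possible variant, not the only way to do it: it streams records through a generator and writes to an in-memory handle. The FASTA text, the filter_records helper, and the thresholds are all invented for illustration.

```python
from io import StringIO

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

# Made-up records standing in for input.fasta.
FASTA_TEXT = """>short_gc_rich
GCGCGCGC
>long_gc_rich
GCGCGCGCATGCGCGCGCGC
>long_at_rich
ATATATATATATATATATAT
"""


def filter_records(handle, min_length, min_gc):
    """Yield records at least min_length long with GC fraction above min_gc."""
    for record in SeqIO.parse(handle, "fasta"):
        if len(record) >= min_length and gc_fraction(record.seq) > min_gc:
            yield record


out = StringIO()
# SeqIO.write accepts any iterable of records and returns how many it wrote.
written = SeqIO.write(filter_records(StringIO(FASTA_TEXT), 10, 0.4), out, "fasta")
kept_ids = [rec.id for rec in SeqIO.parse(StringIO(out.getvalue()), "fasta")]
print(written, kept_ids)
```

Passing the generator straight to SeqIO.write means no intermediate list is built, which matters once the input no longer fits comfortably in memory.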
BLAST with Result Filtering
from Bio.Blast import NCBIWWW, NCBIXML
query = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
result_handle = NCBIWWW.qblast("blastp", "nr", query, hitlist_size=20)
record = NCBIXML.read(result_handle)
for alignment in record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-10:
            identity_pct = (hsp.identities / hsp.align_length) * 100
            print(f"{alignment.accession}: {identity_pct:.1f}% identity, E={hsp.expect:.2e}")
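The filtering loop can be factored into a reusable helper and exercised offline. In the sketch below, filter_hits is a hypothetical function name, and fake_record is a hand-built stand-in that only mimics the attributes used above (alignments, hsps, accession, expect, identities, align_length); its accessions and numbers are invented, not real BLAST output.

```python
from types import SimpleNamespace


def filter_hits(record, max_evalue=1e-10, min_identity=0.0):
    """Return (accession, identity_pct, evalue) for HSPs passing both thresholds."""
    hits = []
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            identity_pct = 100.0 * hsp.identities / hsp.align_length
            if hsp.expect < max_evalue and identity_pct >= min_identity:
                hits.append((alignment.accession, identity_pct, hsp.expect))
    # Most significant hits (smallest E-value) first.
    return sorted(hits, key=lambda hit: hit[2])


# Stand-in for a parsed BLAST record; every value here is made up.
fake_record = SimpleNamespace(alignments=[
    SimpleNamespace(accession="HYPO_1", hsps=[
        SimpleNamespace(expect=1e-30, identities=45, align_length=50)]),
    SimpleNamespace(accession="HYPO_2", hsps=[
        SimpleNamespace(expect=1e-3, identities=20, align_length=50)]),
    SimpleNamespace(accession="HYPO_3", hsps=[
        SimpleNamespace(expect=1e-50, identities=50, align_length=50)]),
])

result = filter_hits(fake_record, min_identity=80.0)
for accession, pct, evalue in result:
    print(f"{accession}: {pct:.1f}% identity, E={evalue:.2e}")
```

Because the helper only relies on attribute access, a record returned by NCBIXML.read should be usable in place of fake_record.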
Phylogeny from Alignment
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import matplotlib.pyplot as plt
alignment = AlignIO.read("sequences.aln", "clustal")
calculator = DistanceCalculator("blosum62")
dm = calculator.get_distance(alignment)
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
tree.root_at_midpoint()
tree.ladderize()
fig, ax = plt.subplots(figsize=(12, 8))
Phylo.draw(tree, axes=ax, do_show=False)  # do_show=False skips the blocking plt.show() so the figure can be saved
fig.savefig("phylogeny.png", dpi=150)
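To try the distance-based workflow without an alignment file or matplotlib, a toy alignment can be built in memory. This is a minimal sketch: the four taxa and their sequences are invented, and the identity model is used because the toy data is DNA.

```python
from io import StringIO

from Bio import Phylo
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Invented, already-aligned sequences of equal length.
toy = {
    "taxon_a": "ACGTACGT",
    "taxon_b": "ACGTACGA",
    "taxon_c": "ACGAACGA",
    "taxon_d": "TCGAACCA",
}
alignment = MultipleSeqAlignment(
    [SeqRecord(Seq(seq), id=name) for name, seq in toy.items()]
)

# Identity distance: fraction of alignment columns that differ.
dm = DistanceCalculator("identity").get_distance(alignment)
print(dm["taxon_a", "taxon_b"])  # one mismatch in eight columns

tree = DistanceTreeConstructor().nj(dm)

# Serialize to Newick so the tree can be stored or passed to other tools.
newick = StringIO()
Phylo.write(tree, newick, "newick")
print(newick.getvalue().strip())
```

Neighbor joining produces an unrooted topology, which is why the pattern above roots at the midpoint before drawing.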
Guidelines
Imports: Use explicit imports
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
File Handling: Always close handles or use context managers
with open("sequences.fasta") as f:
    for record in SeqIO.parse(f, "fasta"):
        process(record)
Memory Efficiency: Use iterators for large datasets
# Correct: iterate lazily inside a generator function
def matching_records(path):
    for record in SeqIO.parse(path, "fasta"):
        if meets_criteria(record):
            yield record
# Avoid: loading the entire file into a list
all_records = list(SeqIO.parse("huge.fasta", "fasta"))
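When records must be handled in groups (for example, writing output in chunks or submitting IDs in batches), a small batching generator keeps memory bounded. The helper name batched and the chunk size are assumptions for illustration; integers stand in for SeqRecord objects so the sketch runs on its own.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Integers stand in for SeqRecord objects here.
chunks = list(batched(range(7), 3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

In practice the first argument would be the iterator returned by SeqIO.parse, so at most one chunk of records is held in memory at a time.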
Error Handling: Wrap network operations
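from urllib.error import HTTPError
from Bio import Entrez, SeqIO
try:
    # rettype/retmode request GenBank flat-file text, matching the "genbank" parser
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
except HTTPError as e:
    print(f"Fetch failed: HTTP {e.code}")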
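Transient failures (HTTP 429 or 5xx) are often worth retrying rather than just reporting. The following is one possible retry wrapper with exponential backoff, not part of Biopython: fetch_with_retry, its parameters, and flaky_fetch are hypothetical names, and the sleep function is injectable so the sketch runs instantly.

```python
import time
from urllib.error import HTTPError


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on HTTP 429 or 5xx, wait and retry with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except HTTPError as err:
            transient = err.code == 429 or 500 <= err.code < 600
            if not transient or attempt == retries:
                raise  # permanent error, or out of retries
            sleep(base_delay * 2 ** attempt)


# Stand-in for a real network call: fails twice with 503, then succeeds.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise HTTPError("https://example.invalid", 503, "Service Unavailable", {}, None)
    return "payload"


delays = []
result = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(result, delays)  # payload [1.0, 2.0]
```

With Biopython, the callable would typically be a lambda wrapping the Entrez.efetch call; client errors such as HTTP 400 are re-raised immediately because retrying a malformed request does not help.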
NCBI Compliance: Set email, respect rate limits, cache downloads locally
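Rate limiting and local caching can be combined in a few lines of standard-library code. The sketch below is an assumption-laden outline rather than an official pattern: MinIntervalLimiter, cached_fetch, and fake_fetch are hypothetical names, the cache key is assumed to be safe to use as a file name, and 3 requests per second reflects NCBI's limit without an API key.

```python
import tempfile
import time
from pathlib import Path


class MinIntervalLimiter:
    """Block so that successive calls are at least 1/rate seconds apart."""

    def __init__(self, rate_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate_per_second
        self.clock, self.sleep = clock, sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None and now - self.last < self.interval:
            self.sleep(self.interval - (now - self.last))
            now = self.clock()
        self.last = now


def cached_fetch(key, fetch, cache_dir, limiter):
    """Return cached text for key, calling fetch(key) only on a cache miss."""
    path = Path(cache_dir) / f"{key}.txt"
    if path.exists():
        return path.read_text()
    limiter.wait()  # throttle only real network calls
    text = fetch(key)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return text


calls = []


def fake_fetch(key):
    """Hypothetical stand-in for an Entrez.efetch call returning text."""
    calls.append(key)
    return f"record for {key}"


with tempfile.TemporaryDirectory() as tmp:
    limiter = MinIntervalLimiter(rate_per_second=3)
    first = cached_fetch("NM_001301717", fake_fetch, tmp, limiter)
    second = cached_fetch("NM_001301717", fake_fetch, tmp, limiter)
print(calls)  # the second lookup is served from disk
```

With an API key configured, rate_per_second could be raised to 10 to match the higher limit mentioned in Setup.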
Troubleshooting
| Issue | Resolution |
|---|---|
| UserWarning: "Email address is not specified" from Bio.Entrez | Set Entrez.email before any queries |
| HTTP 400 from NCBI | Verify accession/ID format is correct |
| "ValueError: EOF" during parse | Confirm file format matches format string |
| Alignment length mismatch | Sequences must be pre-aligned for AlignIO |
| Slow BLAST queries | Use local BLAST for large-scale searches |
| PDB parser warnings | Use PDBParser(QUIET=True) or check structure quality |
External Resources
- Biopython Documentation: https://biopython.org/docs/latest/
- Biopython Tutorial: https://biopython.org/docs/latest/Tutorial/
- GitHub Repository: https://github.com/biopython/biopython