biopython

Python toolkit for computational biology. Use when asked to "parse FASTA", "read GenBank", "query NCBI", "run BLAST", "analyze protein structure", "build phylogenetic tree", or work with biological sequences. Handles sequence I/O, database access, alignments, structure analysis, and phylogenetics.

Install skill "biopython" with this command: npx skills add aminoanalytica/amina-skills/aminoanalytica-amina-skills-biopython

Biopython: Python Tools for Computational Biology

Summary

Biopython (v1.85+) is a comprehensive Python library for biological data analysis. It requires Python 3 and NumPy, and provides modular components for sequences, alignments, database access, BLAST, structures, and phylogenetics.

Applicable Scenarios

This skill applies when you need to:

| Task Category | Examples |
| --- | --- |
| Sequence Operations | Create, modify, translate DNA/RNA/protein sequences |
| File Format Handling | Parse or convert FASTA, GenBank, FASTQ, PDB, mmCIF |
| NCBI Database Access | Query GenBank, PubMed, Protein, Gene, Taxonomy |
| Similarity Searches | Execute BLAST locally or via NCBI, parse results |
| Alignment Work | Pairwise or multiple sequence alignments |
| Structural Analysis | Parse PDB files, compute distances, DSSP assignment |
| Tree Construction | Build, manipulate, visualize phylogenetic trees |
| Motif Discovery | Find and score sequence patterns |
| Sequence Statistics | GC content, molecular weight, melting temperature |

Module Organization

| Module | Purpose | Reference |
| --- | --- | --- |
| Bio.Seq / Bio.SeqIO | Sequence objects and file I/O | references/sequence-io.md |
| Bio.Align / Bio.AlignIO | Pairwise and multiple alignments | references/alignment.md |
| Bio.Entrez | NCBI database programmatic access | references/databases.md |
| Bio.Blast | BLAST execution and result parsing | references/blast.md |
| Bio.PDB | 3D structure manipulation | references/structure.md |
| Bio.Phylo | Phylogenetic tree operations | references/phylogenetics.md |
| Bio.motifs, Bio.SeqUtils, etc. | Motifs, utilities, restriction sites | references/advanced.md |

Setup

Install with pip (the command below uses uv's pip interface; plain "pip install biopython" is equivalent):

uv pip install biopython

Configure NCBI access (NCBI asks for a contact email on every Entrez request, and Biopython warns if it is unset):

from Bio import Entrez

Entrez.email = "researcher@institution.edu"
Entrez.api_key = "your_ncbi_api_key"  # Optional: raises the rate limit from 3 to 10 req/s

Quick Reference

Parse Sequences

from Bio import SeqIO

records = SeqIO.parse("data.fasta", "fasta")
for rec in records:
    print(f"{rec.id}: {len(rec)} bp")
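SeqIO.parse returns an iterator that yields one SeqRecord at a time. When random access by ID is more convenient and the data fits in memory, SeqIO.to_dict can build a dictionary instead. The following is a minimal sketch under the assumption of a small in-memory FASTA text; the record names seq1 and seq2 are invented for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA text, used instead of a file on disk.
fasta_text = ">seq1 first example\nATGCATGC\n>seq2 second example\nGGGCCC\n"

# SeqIO.parse yields SeqRecord objects lazily from any text handle.
lengths = {rec.id: len(rec) for rec in SeqIO.parse(StringIO(fasta_text), "fasta")}

# SeqIO.to_dict loads every record into memory, keyed by record.id.
by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))
seq2_text = str(by_id["seq2"].seq)

print(lengths)
print(seq2_text)
```

The same calls accept a filename in place of the StringIO handle.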

Translate DNA

from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
protein = dna.translate()
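By default, translate() reads through stop codons and marks them with "*". A short sketch of a few related Seq methods on the same sequence follows; it is illustrative only and uses the standard genetic code.

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Default translation keeps stop codons as "*".
full = dna.translate()

# to_stop=True ends at the first in-frame stop codon and omits the "*".
first_orf = dna.translate(to_stop=True)

# Related operations on the same Seq object.
mrna = dna.transcribe()             # DNA -> RNA (T becomes U)
revcomp = dna.reverse_complement()  # reverse complement strand

print(full)       # MAIVMGR*KGAR*
print(first_orf)  # MAIVMGR
```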

Query NCBI

from Bio import Entrez

Entrez.email = "researcher@institution.edu"
handle = Entrez.esearch(db="nucleotide", term="insulin[Gene] AND human[Organism]")
results = Entrez.read(handle)  # dict-like; results["IdList"] holds the matching IDs
handle.close()

Run BLAST

from Bio.Blast import NCBIWWW, NCBIXML

result = NCBIWWW.qblast("blastp", "swissprot", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH")
record = NCBIXML.read(result)

Parse Protein Structure

from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "structure.pdb")
for atom in structure.get_atoms():
    print(atom.name, atom.coord)
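Parsed structures follow the structure, model, chain, residue, atom hierarchy, and subtracting two atoms gives their distance. The sketch below assumes a toy two-atom structure built in memory; atom_line is a hypothetical helper written for this example, and the coordinates are invented so that the expected distance is easy to check.

```python
from io import StringIO

from Bio.PDB import PDBParser


def atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Format one fixed-width PDB ATOM record (hypothetical helper)."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2s}"
    )


# Invented coordinates: the two atoms are 5 angstroms apart (3-4-5 triangle).
pdb_text = "\n".join([
    atom_line(1, "N", "GLY", "A", 1, 0.0, 0.0, 0.0, "N"),
    atom_line(2, "CA", "GLY", "A", 1, 3.0, 4.0, 0.0, "C"),
]) + "\n"

parser = PDBParser(QUIET=True)
structure = parser.get_structure("toy", StringIO(pdb_text))

# Walk the hierarchy: model 0 -> chain "A" -> residue 1 -> atoms.
residue = structure[0]["A"][1]
n_atom, ca_atom = residue["N"], residue["CA"]

# Subtracting two Atom objects returns their Euclidean distance.
distance = ca_atom - n_atom
print(f"N-CA distance: {distance:.2f}")
```

With a real file, pass its path to get_structure in place of the StringIO handle.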

Build Phylogenetic Tree

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("aligned.fasta", "fasta")
calc = DistanceCalculator("identity")
dm = calc.get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)
Phylo.draw_ascii(tree)
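To see what the distance step produces without any input files, here is a hedged sketch on a made-up four-sequence alignment. The sequence names and letters are assumptions for illustration. With the "identity" model and no gaps, each distance is 1 minus the fraction of matching columns.

```python
from io import StringIO

from Bio import Phylo
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical pre-aligned sequences of equal length.
alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACGTACGT"), id="alpha"),
    SeqRecord(Seq("ACGTACGA"), id="beta"),
    SeqRecord(Seq("ACGAACTA"), id="gamma"),
    SeqRecord(Seq("TCGAACTA"), id="delta"),
])

dm = DistanceCalculator("identity").get_distance(alignment)
d_alpha_beta = dm["alpha", "beta"]  # 1 mismatch in 8 columns

# Neighbor joining on the distance matrix.
tree = DistanceTreeConstructor().nj(dm)
leaf_names = sorted(leaf.name for leaf in tree.get_terminals())

# Serialize to Newick so the tree can be stored or passed to other tools.
buffer = StringIO()
Phylo.write(tree, buffer, "newick")
newick = buffer.getvalue().strip()
print(d_alpha_beta)
print(newick)
```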

Reference Files

| File | Contents |
| --- | --- |
| references/sequence-io.md | Bio.Seq objects, SeqIO parsing/writing, large file handling, format conversion |
| references/alignment.md | Pairwise alignment, BLOSUM matrices, AlignIO, external aligners |
| references/databases.md | NCBI Entrez API, esearch/efetch/elink, batch downloads, search syntax |
| references/blast.md | Remote/local BLAST, XML parsing, result filtering, batch queries |
| references/structure.md | Bio.PDB, SMCRA hierarchy, DSSP, superimposition, spatial queries |
| references/phylogenetics.md | Tree I/O, distance matrices, tree construction, consensus, visualization |
| references/advanced.md | Motifs, SeqUtils, restriction enzymes, population genetics, GenomeDiagram |

Implementation Patterns

Retrieve and Analyze GenBank Record

from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction

Entrez.email = "researcher@institution.edu"

handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"Organism: {record.annotations['organism']}")
print(f"Length: {len(record)} bp")
print(f"GC: {gc_fraction(record.seq):.1%}")

Batch Sequence Processing

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

output_records = []
for record in SeqIO.parse("input.fasta", "fasta"):
    if len(record) >= 200 and gc_fraction(record.seq) > 0.4:
        output_records.append(record)

SeqIO.write(output_records, "filtered.fasta", "fasta")
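For inputs too large to hold in a list, the same filter can be written as a generator, since SeqIO.write accepts any iterable of records and returns the number written. This is a sketch, not the skill's own method: filter_records is a hypothetical helper, and the toy FASTA text and lowered thresholds are assumptions chosen so the example runs in memory.

```python
from io import StringIO

from Bio import SeqIO
from Bio.SeqUtils import gc_fraction


def filter_records(records, min_length=200, min_gc=0.4):
    """Yield records that pass both thresholds, one at a time."""
    for record in records:
        if len(record) >= min_length and gc_fraction(record.seq) > min_gc:
            yield record


# Hypothetical input; thresholds are lowered so a toy sequence can pass.
fasta_text = ">keep\nGGCCGGCCAT\n>too_short\nGGCC\n>low_gc\nATATATATAT\n"
source = SeqIO.parse(StringIO(fasta_text), "fasta")

# Records stream from parser to writer without an intermediate list.
out = StringIO()
written = SeqIO.write(filter_records(source, min_length=8, min_gc=0.4), out, "fasta")
print(written)
```

Replacing the two StringIO objects with file paths gives the file-to-file version.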

BLAST with Result Filtering

from Bio.Blast import NCBIWWW, NCBIXML

query = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
result_handle = NCBIWWW.qblast("blastp", "nr", query, hitlist_size=20)
record = NCBIXML.read(result_handle)

for alignment in record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-10:
            identity_pct = (hsp.identities / hsp.align_length) * 100
            print(f"{alignment.accession}: {identity_pct:.1f}% identity, E={hsp.expect:.2e}")
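The filtering logic can be pulled into a function and checked offline before spending time on a remote search. The sketch below is an assumption-laden illustration: significant_hits is a hypothetical helper, and the SimpleNamespace objects are stand-ins with made-up values that only mimic the attributes used above (alignments, hsps, accession, identities, align_length, expect).

```python
from types import SimpleNamespace


def significant_hits(record, max_evalue=1e-10, min_identity=0.0):
    """Return (accession, percent_identity, evalue) for HSPs passing both cutoffs."""
    hits = []
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            identity_pct = 100.0 * hsp.identities / hsp.align_length
            if hsp.expect < max_evalue and identity_pct >= min_identity:
                hits.append((alignment.accession, identity_pct, hsp.expect))
    # Most significant (lowest E-value) first.
    return sorted(hits, key=lambda hit: hit[2])


# Stand-in record with invented values; not real BLAST output.
fake_record = SimpleNamespace(alignments=[
    SimpleNamespace(accession="HYPOTHETICAL_1", hsps=[
        SimpleNamespace(identities=45, align_length=50, expect=1e-30),
    ]),
    SimpleNamespace(accession="HYPOTHETICAL_2", hsps=[
        SimpleNamespace(identities=20, align_length=50, expect=1e-3),
    ]),
])

hits = significant_hits(fake_record, min_identity=50.0)
print(hits)
```

The same function should work on a record returned by NCBIXML.read, since it relies only on the attributes already used in the pattern above.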

Phylogeny from Alignment

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import matplotlib.pyplot as plt

alignment = AlignIO.read("sequences.aln", "clustal")
calculator = DistanceCalculator("blosum62")
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
tree.root_at_midpoint()
tree.ladderize()

fig, ax = plt.subplots(figsize=(12, 8))
Phylo.draw(tree, axes=ax, do_show=False)  # do_show=False skips plt.show() so the figure can be saved
fig.savefig("phylogeny.png", dpi=150)

Guidelines

Imports: Use explicit imports

from Bio import SeqIO, Entrez
from Bio.Seq import Seq

File Handling: Always close handles or use context managers

with open("sequences.fasta") as f:
    for record in SeqIO.parse(f, "fasta"):
        process(record)

Memory Efficiency: Use iterators for large datasets

# Correct: stream records one at a time inside a generator
def matching_records(path):
    for record in SeqIO.parse(path, "fasta"):
        if meets_criteria(record):
            yield record

# Avoid: loading entire file
all_records = list(SeqIO.parse("huge.fasta", "fasta"))

Error Handling: Wrap network operations

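from urllib.error import HTTPError

from Bio import Entrez, SeqIO

try:
    # rettype/retmode request GenBank flat-file text, which the "genbank" parser expects
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
except HTTPError as e:
    print(f"Fetch failed: {e.code}")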
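Transient failures (HTTP 429 or 5xx) are often worth retrying rather than just reporting. One possible sketch is a small wrapper with exponential backoff. fetch_with_retry and its parameters are hypothetical names for this example, and the fake fetch function exists only so the behaviour can be checked without network access.

```python
import time
from urllib.error import HTTPError


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry on HTTP 429 or 5xx with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except HTTPError as err:
            transient = err.code == 429 or 500 <= err.code < 600
            if not transient or attempt == retries:
                raise  # permanent error, or out of retries
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake fetch that fails twice with HTTP 503, then succeeds.
attempts = []
delays = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise HTTPError("https://example.invalid", 503, "Service Unavailable", None, None)
    return "payload"


result = fetch_with_retry(flaky_fetch, sleep=delays.append)
print(result, delays)
```

In real use, fetch would be a zero-argument callable wrapping the Entrez.efetch call, for example a lambda.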

NCBI Compliance: Set email, respect rate limits, cache downloads locally
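As a rough illustration of the caching and rate-limit points, the sketch below wraps any download function with an on-disk cache and a minimum spacing between real requests. CachedFetcher, its parameters, and the fake fetch function are all hypothetical. The 0.34 s default approximates the 3 requests per second NCBI allows without an API key, and accessions are assumed to be safe to use as file names.

```python
import tempfile
import time
from pathlib import Path


class CachedFetcher:
    """Cache downloaded text on disk and space out real requests."""

    def __init__(self, fetch, cache_dir, min_interval=0.34,
                 clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch                # callable: accession -> text
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.min_interval = min_interval  # seconds between real requests
        self.clock, self.sleep = clock, sleep
        self._last_request = None

    def get(self, accession):
        path = self.cache_dir / f"{accession}.gb"
        if path.exists():
            return path.read_text()       # cache hit: no network call
        if self._last_request is not None:
            wait = self.min_interval - (self.clock() - self._last_request)
            if wait > 0:
                self.sleep(wait)
        text = self.fetch(accession)
        self._last_request = self.clock()
        path.write_text(text)
        return text


# Demonstration with a fake fetch function instead of a network call.
calls = []


def fake_fetch(accession):
    calls.append(accession)
    return f"LOCUS {accession}\n"


with tempfile.TemporaryDirectory() as tmp:
    fetcher = CachedFetcher(fake_fetch, tmp, sleep=lambda seconds: None)
    first = fetcher.get("DEMO_1")
    second = fetcher.get("DEMO_1")  # served from the cache

print(len(calls), first == second)
```

In practice, fetch would wrap Entrez.efetch and return the downloaded text; with an API key set, min_interval could be lowered to about 0.1 s.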

Troubleshooting

| Issue | Resolution |
| --- | --- |
| "No handlers could be found for logger 'Bio.Entrez'" | Set Entrez.email before any queries |
| HTTP 400 from NCBI | Verify accession/ID format is correct |
| "ValueError: EOF" during parse | Confirm file format matches format string |
| Alignment length mismatch | Sequences must be pre-aligned for AlignIO |
| Slow BLAST queries | Use local BLAST for large-scale searches |
| PDB parser warnings | Use PDBParser(QUIET=True) or check structure quality |
