# Biological Sequence Retrieval
Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.
IMPORTANT: Always use English terms in tool calls (gene names, organism names, sequence descriptions), even if the user writes in another language. Only try original-language terms as a fallback if English returns no results. Respond in the user's language.
## Workflow Overview
```
Phase 0: Clarify (if needed)
↓
Phase 1: Disambiguate Gene/Organism
↓
Phase 2: Search & Retrieve (Internal)
↓
Phase 3: Report Sequence Profile
```
## Phase 0: Clarification (When Needed)
Ask the user ONLY if:
- Gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?)
- Sequence type unclear (mRNA, genomic, protein?)
- Strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.)
Skip clarification for:
- Specific accession numbers (NC_, NM_, U*, etc.)
- Clear organism + gene combinations
- Complete genome requests with organism specified
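
The "skip clarification for accession numbers" rule can be checked mechanically before asking the user anything. The sketch below is a minimal illustration, not part of ToolUniverse: the function name `looks_like_accession` is hypothetical, and the two patterns are a deliberate simplification of RefSeq and INSDC accession formats (for example, four-letter WGS accessions are not covered).

```python
import re

# Hypothetical helper. Patterns are simplified assumptions, not a full accession grammar.
_REFSEQ = re.compile(r"^[A-Z]{2}_[A-Z]*\d+(\.\d+)?$")  # e.g. NC_000913.3, NM_007294
_INSDC = re.compile(r"^[A-Z]{1,2}\d{5,8}(\.\d+)?$")    # e.g. U00096.3, CP001509.3


def looks_like_accession(text: str) -> bool:
    """Return True if the input resembles an accession rather than a gene or organism name."""
    token = text.strip().upper()
    return bool(_REFSEQ.match(token) or _INSDC.match(token))
```

If this returns True, Phase 0 can be skipped and the input passed straight to retrieval; gene symbols such as "BRCA1" do not match either pattern.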
## Phase 1: Gene/Organism Disambiguation
### 1.1 Resolve Identifiers
```python
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# user_provided_accession, user_provided_gene_and_organism, organism, and gene
# are placeholders filled in from the user's request.
# Strategy depends on input type
if user_provided_accession:
    # Direct retrieval based on accession type
    accession = user_provided_accession
elif user_provided_gene_and_organism:
    # Search NCBI Nucleotide
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )
```
### 1.2 Accession Type Decision Tree
CRITICAL: Accession prefix determines which tools to use.
| Prefix | Type | Use With |
|---|---|---|
| NC_* | RefSeq chromosome | NCBI only |
| NM_* | RefSeq mRNA | NCBI only |
| NR_* | RefSeq ncRNA | NCBI only |
| NP_* | RefSeq protein | NCBI only |
| XM_* | RefSeq predicted mRNA | NCBI only |
| U*, M*, K*, X* | GenBank | NCBI or ENA |
| CP* | GenBank genome | NCBI or ENA |
| NZ_* | RefSeq genomic record | NCBI only |
| EMBL format | EMBL | ENA preferred |
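
The prefix table can be turned into a small routing helper. This is a sketch under stated assumptions: the name `allowed_sources` and the `REFSEQ_PREFIXES` tuple are illustrative, and the tuple lists the common RefSeq prefixes rather than every one NCBI has ever issued. `NZ_` is treated as RefSeq here because NCBI assigns that prefix to RefSeq genomic records.

```python
# Illustrative list of common RefSeq prefixes (two letters followed by an underscore).
REFSEQ_PREFIXES = (
    "NC_", "NG_", "NT_", "NW_", "NZ_",  # genomic
    "NM_", "NR_", "XM_", "XR_",         # transcripts
    "NP_", "XP_", "YP_", "WP_",         # proteins
)


def allowed_sources(accession: str) -> list[str]:
    """Return the databases worth querying for this accession."""
    acc = accession.strip().upper()
    if acc.startswith(REFSEQ_PREFIXES):
        return ["NCBI"]          # RefSeq records are served by NCBI only
    return ["NCBI", "ENA"]       # GenBank/EMBL accessions are mirrored across INSDC
```

Checking the prefix once, up front, keeps the "NCBI only" rule in a single place instead of repeating it at every call site.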
### 1.3 Identity Resolution Checklist
- Organism confirmed (scientific name)
- Gene symbol/name identified
- Sequence type determined (genomic/mRNA/protein)
- Strain specified (if relevant)
- Accession prefix identified → tool selection
## Phase 2: Data Retrieval (Internal)
Retrieve silently. Do NOT narrate the search process.
### 2.1 Search for Sequences
```python
# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,        # Optional
    keywords=keywords,    # Optional
    seq_type=seq_type,    # "complete_genome", "mrna", or "refseq"
    limit=10
)

# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)
```
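
The lookup `result["data"]["uids"]` raises if the search fails or returns nothing. A defensive variant might look like the sketch below. The only thing assumed about the response is the `{"data": {"uids": [...]}}` shape used above; the helper name `extract_uids` is hypothetical.

```python
def extract_uids(result: dict) -> list[str]:
    """Pull UIDs out of a search response shaped like {"data": {"uids": [...]}}.

    Returns an empty list instead of raising when the expected keys are missing,
    so the caller can fall through to the "No results found" handling.
    """
    data = result.get("data") if isinstance(result, dict) else None
    uids = data.get("uids") if isinstance(data, dict) else None
    return [str(u) for u in uids] if uids else []
```

An empty list is then the single signal for "broaden the search", whether the tool returned an error payload or a genuine zero-hit response.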
### 2.2 Retrieve Sequence Data
```python
# Get sequence in desired format
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)

# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)
```
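
The report needs a length and a GC content figure, both of which can be derived from the FASTA text. The sketch below assumes the FASTA payload is available as a plain string (the return type of `NCBI_get_sequence` is not documented here, so the text may need to be extracted from the response first). The names `parse_fasta` and `gc_content` are illustrative.

```python
def parse_fasta(text: str) -> tuple[str, str]:
    """Split FASTA text into (header, sequence), reading only the first record."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA: missing '>' header line")
    seq_lines = []
    for ln in lines[1:]:
        if ln.startswith(">"):
            break  # ignore any further records
        seq_lines.append(ln)
    return lines[0][1:], "".join(seq_lines).upper()


def gc_content(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T, U)."""
    counted = [b for b in seq.upper() if b in "ACGTU"]
    if not counted:
        return 0.0
    return 100.0 * sum(b in "GC" for b in counted) / len(counted)
```

Ambiguity codes such as `N` are excluded from the denominator, which is one common convention; a report that uses total length instead should say so.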
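### 2.3 ENA Alternative (for GenBank/EMBL accessions)
```python
# Only for non-RefSeq accessions!
# RefSeq prefixes are two letters plus an underscore; this tuple lists the common ones.
REFSEQ_PREFIXES = ("NC_", "NG_", "NT_", "NW_", "NZ_", "NM_", "NR_",
                   "XM_", "XR_", "NP_", "XP_", "YP_", "WP_")
if not accession.startswith(REFSEQ_PREFIXES):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)
    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)
    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)
```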
### Fallback Chains
| Primary | Fallback | Notes |
|---|---|---|
| `NCBI_get_sequence` | ENA tools (GenBank/EMBL accessions only) | NCBI unavailable |
| `ena_get_entry` | `NCBI_get_sequence` | ENA does not hold RefSeq records |
| `NCBI_search_nucleotide` | Retry with broader keywords | No results |
Critical Rule: Never try ENA tools with RefSeq accessions (NC_, NM_, etc.); they will return 404 errors.
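
The fallback chain and the RefSeq rule can be combined into one wrapper. This is a sketch, not the skill's actual implementation: `fetch_with_fallback` is a hypothetical name, the two fetchers are passed in as plain callables, and it assumes a failed tool call surfaces as a Python exception, which is not confirmed here.

```python
from typing import Any, Callable

# Illustrative list of common RefSeq prefixes.
REFSEQ_PREFIXES = ("NC_", "NG_", "NT_", "NW_", "NZ_", "NM_", "NR_",
                   "XM_", "XR_", "NP_", "XP_", "YP_", "WP_")


def fetch_with_fallback(
    accession: str,
    fetch_ncbi: Callable[[str], Any],
    fetch_ena: Callable[[str], Any],
) -> tuple[str, Any]:
    """Try NCBI first, then ENA unless the accession is RefSeq. Returns (source, payload)."""
    try:
        return "NCBI", fetch_ncbi(accession)
    except Exception as ncbi_error:
        if accession.strip().upper().startswith(REFSEQ_PREFIXES):
            # ENA does not hold RefSeq records, so there is nothing to fall back to.
            raise RuntimeError(
                f"NCBI failed for RefSeq accession {accession}"
            ) from ncbi_error
        return "ENA", fetch_ena(accession)
```

In practice the callables would wrap the tools, for example `lambda acc: tu.tools.NCBI_get_sequence(operation="fetch_sequence", accession=acc, format="fasta")` and `lambda acc: tu.tools.ena_get_sequence_fasta(accession=acc)`. Injecting them keeps the routing logic testable without network access.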
## Phase 3: Report Sequence Profile
### Output Structure
Present as a Sequence Profile Report. Hide search process.
# Sequence Profile: [Gene/Organism]
**Search Summary**
- Query: [gene] in [organism]
- Database: NCBI Nucleotide
- Results: [N] sequences found
---
## Primary Sequence
### [Accession]: [Definition/Title]
| Attribute | Value |
|-----------|-------|
| **Accession** | [accession] |
| **Type** | RefSeq / GenBank |
| **Organism** | [scientific name] |
| **Strain** | [strain if applicable] |
| **Length** | [X,XXX bp / aa] |
| **Molecule** | DNA / mRNA / Protein |
| **Topology** | Linear / Circular |
**Curation Level**: ●●●● RefSeq Reference / ●●●○ RefSeq Predicted / ●●○○ GenBank Validated / ●○○○ GenBank Direct / ○○○○ Third Party
### Sequence Statistics
| Statistic | Value |
|-----------|-------|
| **Length** | [X,XXX] bp |
| **GC Content** | [XX.X]% |
| **Genes** | [N] (if genome) |
| **CDS** | [N] (if annotated) |
### Sequence Preview
```fasta
>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]
```
### Annotations Summary (from GenBank format)
| Feature | Count | Examples |
|---|---|---|
| CDS | [N] | [gene names] |
| tRNA | [N] | - |
| rRNA | [N] | 16S, 23S |
| Regulatory | [N] | promoters |
## Alternative Sequences
Ranked by relevance and curation level:
| Accession | Type | Length | Description | ENA Compatible |
|---|---|---|---|---|
| NC_000913.3 | RefSeq | 4.6 Mb | E. coli K-12 reference | ✗ |
| U00096.3 | GenBank | 4.6 Mb | E. coli K-12 | ✓ |
| CP000948.1 | GenBank | 4.7 Mb | E. coli DH10B | ✓ |
## Cross-Database References
| Database | Accession | Link |
|---|---|---|
| RefSeq | [NC_*] | [NCBI link] |
| GenBank | [U*] | [NCBI link] |
| ENA/EMBL | [same as GenBank] | [ENA link] |
| BioProject | [PRJNA*] | [link] |
| BioSample | [SAMN*] | [link] |
## Download Options
### Formats Available
| Format | Description | Use Case |
|---|---|---|
| FASTA | Sequence only | BLAST, alignment |
| GenBank | Sequence + annotations | Gene analysis |
| GFF3 | Annotations only | Genome browsers |
### Direct Commands
```python
# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)

# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)
```
## Related Sequences
### Other Strains/Isolates
| Accession | Strain | Similarity | Notes |
|---|---|---|---|
| [acc1] | [strain1] | 99.9% | [notes] |
| [acc2] | [strain2] | 99.5% | [notes] |
### Protein Products (if applicable)
| Protein Accession | Product Name | Length |
|---|---|---|
| [NP_*] | [protein name] | [X] aa |
Retrieved: [date] | Database: NCBI Nucleotide
---
## Curation Level Tiers
| Tier | Symbol | Accession Prefix | Description |
|------|--------|------------------|-------------|
| RefSeq Reference | ●●●● | NC_, NM_, NP_ | NCBI-curated, gold standard |
| RefSeq Predicted | ●●●○ | XM_, XP_, XR_ | Computationally predicted |
| GenBank Validated | ●●○○ | Various | Submitted, some curation |
| GenBank Direct | ●○○○ | Various | Direct submission |
| Third Party | ○○○○ | TPA_ | Third-party annotation |
Include in report:
```markdown
**Curation Level**: ●●●● RefSeq Reference
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use
```
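
The two RefSeq tiers can be assigned from the accession prefix alone; the GenBank and third-party tiers are listed with "Various" prefixes, so they need record metadata instead. A minimal sketch of that split, with the hypothetical name `curation_tier`, returning `None` when the prefix is not enough:

```python
from typing import Optional


def curation_tier(accession: str) -> Optional[tuple[str, str]]:
    """Map an accession prefix to (symbol, tier name) per the tier table.

    Returns None for accessions whose tier cannot be decided from the prefix,
    such as GenBank submissions, which must be classified from record metadata.
    """
    acc = accession.strip().upper()
    if acc.startswith(("NC_", "NM_", "NP_")):
        return "●●●●", "RefSeq Reference"
    if acc.startswith(("XM_", "XP_", "XR_")):
        return "●●●○", "RefSeq Predicted"
    return None
```

Returning `None` rather than guessing a lower tier makes the "needs metadata" case explicit to the caller.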
## Completeness Checklist
Every sequence report MUST include:
### Per Sequence (Required)
- Accession number
- Organism (scientific name)
- Sequence type (DNA/RNA/protein)
- Length
- Curation level
- Database source
### Search Summary (Required)
- Query parameters
- Number of results
- Ranking rationale
### Include Even If Limited
- Alternative sequences (or "Only one sequence found")
- Cross-database references (or "No cross-references available")
- Download instructions
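
The per-sequence requirements lend themselves to a quick pre-flight check before a report is rendered. The sketch below assumes each sequence is held as a dict; the key names in `REQUIRED_FIELDS` are hypothetical and simply mirror the six required items.

```python
# Hypothetical key names mirroring the "Per Sequence (Required)" items.
REQUIRED_FIELDS = (
    "accession",
    "organism",
    "sequence_type",
    "length",
    "curation_level",
    "database",
)


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a sequence record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "", [])]
```

An empty result means the record is complete; anything else names exactly what still has to be looked up.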
## Common Use Cases
### Reference Genome
User: "Get E. coli K-12 complete genome"
```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)
# Return NC_000913.3 (RefSeq reference)
```
### Gene Sequence
User: "Find human BRCA1 mRNA"
```python
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)
```
### Specific Accession
User: "Get sequence for NC_045512.2" → Direct retrieval with full metadata
### Strain Comparison
User: "Compare E. coli K-12 and O157:H7 genomes" → Search both strains, provide comparison table
## Error Handling
| Error | Response |
|---|---|
| "No search criteria provided" | Add organism, gene, or keywords |
| "ENA 404 error" | Accession is likely RefSeq → use NCBI only |
| "No results found" | Broaden search, check spelling, try synonyms |
| "Sequence too large" | Note size, provide download link instead of preview |
| "API rate limit" | Tools auto-retry; if persistent, wait briefly |
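
For the persistent rate-limit case, "wait briefly" can be made concrete with exponential backoff. This is a generic sketch, not how the tools retry internally: `call_with_backoff` is a hypothetical name, and it catches any exception because the exception type the tools raise on rate limiting is not documented here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on failure wait base_delay, 2*base_delay, ... and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the delay schedule can be exercised in tests without actually waiting.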
## Tool Reference
### NCBI Tools (All Accessions)
| Tool | Purpose |
|---|---|
| `NCBI_search_nucleotide` | Search by gene/organism |
| `NCBI_fetch_accessions` | Convert UIDs to accessions |
| `NCBI_get_sequence` | Retrieve sequence data |
### ENA Tools (GenBank/EMBL Only)
| Tool | Purpose |
|---|---|
| `ena_get_entry` | Entry metadata |
| `ena_get_sequence_fasta` | FASTA sequence |
| `ena_get_entry_summary` | Summary info |
## Search Parameters Reference
### NCBI_search_nucleotide
| Parameter | Description | Example |
|---|---|---|
| `operation` | Always "search" | "search" |
| `organism` | Scientific name | "Homo sapiens" |
| `gene` | Gene symbol | "BRCA1" |
| `strain` | Specific strain | "K-12" |
| `keywords` | Free text | "complete genome" |
| `seq_type` | Sequence type | "complete_genome", "mrna", "refseq" |
| `limit` | Max results | 10 |
### NCBI_get_sequence
| Parameter | Description | Example |
|---|---|---|
| `operation` | Always "fetch_sequence" | "fetch_sequence" |
| `accession` | Accession number | "NC_000913.3" |
| `format` | Output format | "fasta", "genbank" |