ETE Toolkit Skill
Overview
ETE (Environment for Tree Exploration) is a toolkit for phylogenetic and hierarchical tree analysis. Manipulate trees, analyze evolutionary events, visualize results, and integrate with biological databases for phylogenomic research and clustering analysis.
Core Capabilities
1. Tree Manipulation and Analysis
Load, manipulate, and analyze hierarchical tree structures with support for:
- Tree I/O: Read and write Newick, NHX, PhyloXML, and NeXML formats
- Tree traversal: Navigate trees using preorder, postorder, or levelorder strategies
- Topology modification: Prune, root, collapse nodes, resolve polytomies
- Distance calculations: Compute branch-length and topological distances between nodes
- Tree comparison: Calculate Robinson-Foulds distances and identify topological differences
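To make the three traversal strategies concrete, here is a minimal pure-Python sketch on a toy tree stored as a dict of children. The `CHILDREN` dict and the three functions are illustrative names, not part of ETE; in ETE itself the strategy is chosen with `tree.traverse("preorder")`, `"postorder"`, or `"levelorder"`.

```python
from collections import deque

# Toy tree (hypothetical data): root R has children X and Y; X -> A, B; Y -> C, D.
CHILDREN = {"R": ["X", "Y"], "X": ["A", "B"], "Y": ["C", "D"]}


def preorder(node):
    """Visit a node before its descendants (top-down)."""
    yield node
    for child in CHILDREN.get(node, []):
        yield from preorder(child)


def postorder(node):
    """Visit a node after all of its descendants (bottom-up)."""
    for child in CHILDREN.get(node, []):
        yield from postorder(child)
    yield node


def levelorder(root):
    """Visit nodes level by level, starting from the root."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(CHILDREN.get(node, []))


print(list(preorder("R")))
print(list(postorder("R")))
print(list(levelorder("R")))
```

Postorder suits computations that need child results first (for example, summing leaves per clade); preorder suits passing information down from the root.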
Common patterns:
from ete3 import Tree

# Load tree from file (format=1 reads internal node names)
tree = Tree("tree.nw", format=1)

# Basic statistics
print(f"Leaves: {len(tree)}")
print(f"Total nodes: {len(list(tree.traverse()))}")

# Prune to taxa of interest
taxa_to_keep = ["species1", "species2", "species3"]
tree.prune(taxa_to_keep, preserve_branch_length=True)

# Midpoint root
midpoint = tree.get_midpoint_outgroup()
tree.set_outgroup(midpoint)

# Save modified tree
tree.write(outfile="rooted_tree.nw")
Use scripts/tree_operations.py for command-line tree manipulation:
# Display tree statistics
python scripts/tree_operations.py stats tree.nw

# Convert format
python scripts/tree_operations.py convert tree.nw output.nw --in-format 0 --out-format 1

# Reroot tree
python scripts/tree_operations.py reroot tree.nw rooted.nw --midpoint

# Prune to specific taxa
python scripts/tree_operations.py prune tree.nw pruned.nw --keep-taxa "sp1,sp2,sp3"

# Show ASCII visualization
python scripts/tree_operations.py ascii tree.nw
2. Phylogenetic Analysis
Analyze gene trees with evolutionary event detection:
- Sequence alignment integration: Link trees to multiple sequence alignments (FASTA, Phylip)
- Species naming: Automatic or custom species extraction from gene names
- Evolutionary events: Detect duplication and speciation events using Species Overlap or tree reconciliation
- Orthology detection: Identify orthologs and paralogs based on evolutionary events
- Gene family analysis: Split trees by duplications, collapse lineage-specific expansions
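The Species Overlap idea can be sketched without ETE: an internal node is called a duplication when the species sets of its two child subtrees share at least one species, and a speciation otherwise. The nested-tuple tree, the `species_of` helper, and the gene names below are made up for illustration; ETE's own implementation also supports an overlap-score threshold that this sketch omits.

```python
def species_of(gene_name):
    """Species code is assumed to be the prefix before the first underscore."""
    return gene_name.split("_")[0]


def classify(node, events):
    """Return the leaf names under `node`, recording one event per internal node.

    A leaf is a string; an internal node is a 2-tuple (left, right).
    """
    if isinstance(node, str):
        return [node]
    left = classify(node[0], events)
    right = classify(node[1], events)
    shared = {species_of(g) for g in left} & {species_of(g) for g in right}
    etype = "D" if shared else "S"  # overlap -> duplication, else speciation
    events.append((etype, sorted(left), sorted(right)))
    return left + right


# Hypothetical gene tree: human (Hsa) and mouse (Mmu) each carry two gene copies.
gene_tree = (("Hsa_g1", "Mmu_g1"), ("Hsa_g2", "Mmu_g2"))
events = []
classify(gene_tree, events)
for etype, left, right in events:
    print(etype, left, right)
```

Genes separated by a speciation node are treated as orthologs, and genes separated by a duplication node as paralogs.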
Workflow for gene tree analysis:
from ete3 import PhyloTree

# Load gene tree with alignment
tree = PhyloTree("gene_tree.nw", alignment="alignment.fasta")

# Set species naming function (assumes names like "species1_gene1")
def get_species(gene_name):
    return gene_name.split("_")[0]

tree.set_species_naming_function(get_species)

# Detect evolutionary events
events = tree.get_descendant_evol_events()

# Analyze events
for node in tree.traverse():
    if hasattr(node, "evoltype"):
        if node.evoltype == "D":
            print(f"Duplication at {node.name}")
        elif node.evoltype == "S":
            print(f"Speciation at {node.name}")

# Extract ortholog groups. get_speciation_trees() returns
# (number of trees, number of duplications, iterator over the trees).
ntrees, ndups, ortho_groups = tree.get_speciation_trees()
for i, ortho_tree in enumerate(ortho_groups):
    ortho_tree.write(outfile=f"ortholog_group_{i}.nw")
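Gene names do not always put the species code first. As a sketch, the hypothetical factory below builds a naming function for a chosen separator and field position; the resulting callable could be handed to `set_species_naming_function`. The example gene names are invented.

```python
def make_species_parser(sep="_", position=0):
    """Build a function that extracts the species field from a gene name.

    position=0 takes the first field ("Hsa_gene1" -> "Hsa");
    position=-1 takes the last field ("gene1|Mmu" -> "Mmu" with sep="|").
    """
    def parse(gene_name):
        fields = gene_name.split(sep)
        if len(fields) < 2:
            raise ValueError(f"No {sep!r} separator in gene name: {gene_name!r}")
        return fields[position]
    return parse


prefix_parser = make_species_parser("_", 0)
suffix_parser = make_species_parser("|", -1)
print(prefix_parser("Hsa_gene1"))
print(suffix_parser("gene1|Mmu"))
```

Raising on names without a separator surfaces naming problems early instead of silently treating a whole gene name as a species.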
Finding orthologs and paralogs:
# Find orthologs and paralogs of a query gene.
# event.in_seqs and event.out_seqs hold leaf names (strings) on either side of each event.
query = "species1_gene1"

orthologs = set()
paralogs = set()

for event in events:
    if query in event.in_seqs:
        others = event.out_seqs
    elif query in event.out_seqs:
        others = event.in_seqs
    else:
        continue
    if event.etype == "S":
        orthologs.update(others)
    elif event.etype == "D":
        paralogs.update(others)
3. NCBI Taxonomy Integration
Integrate taxonomic information from the NCBI Taxonomy database:
- Database access: Automatic download and local caching of NCBI taxonomy (~300MB)
- Taxid/name translation: Convert between taxonomic IDs and scientific names
- Lineage retrieval: Get complete evolutionary lineages
- Taxonomy trees: Build species trees connecting specified taxa
- Tree annotation: Automatically annotate trees with taxonomic information
Building taxonomy-based trees:
from ete3 import NCBITaxa

ncbi = NCBITaxa()

# Build tree from species names
species = ["Homo sapiens", "Pan troglodytes", "Mus musculus"]
name2taxid = ncbi.get_name_translator(species)
taxids = [name2taxid[sp][0] for sp in species]

# Get minimal tree connecting taxa
tree = ncbi.get_topology(taxids)

# Print the taxonomy info attached to each node
for node in tree.traverse():
    if hasattr(node, "sci_name"):
        print(f"{node.sci_name} - Rank: {node.rank} - TaxID: {node.taxid}")
Annotating existing trees:
# Get taxonomy info for tree leaves.
# extract_species_from_name is a placeholder for your own function that maps
# a leaf name to a scientific name.
for leaf in tree:
    species = extract_species_from_name(leaf.name)
    taxid = ncbi.get_name_translator([species])[species][0]

    # Get lineage
    lineage = ncbi.get_lineage(taxid)
    ranks = ncbi.get_rank(lineage)
    names = ncbi.get_taxid_translator(lineage)

    # Add to node
    leaf.add_feature("taxid", taxid)
    leaf.add_feature("lineage", [names[t] for t in lineage])
4. Tree Visualization
Create publication-quality tree visualizations:
- Output formats: PNG (raster), PDF, and SVG (vector) for publications
- Layout modes: Rectangular and circular tree layouts
- Interactive GUI: Explore trees interactively with zoom, pan, and search
- Custom styling: NodeStyle for node appearance (colors, shapes, sizes)
- Faces: Add graphical elements (text, images, charts, heatmaps) to nodes
- Layout functions: Dynamic styling based on node properties
Basic visualization workflow:
from ete3 import Tree, TreeStyle, NodeStyle

tree = Tree("tree.nw")

# Configure tree style
ts = TreeStyle()
ts.show_leaf_name = True
ts.show_branch_support = True
ts.scale = 50  # pixels per branch length unit

# Style nodes
for node in tree.traverse():
    nstyle = NodeStyle()
    if node.is_leaf():
        nstyle["fgcolor"] = "blue"
        nstyle["size"] = 8
    else:
        # Color by support
        if node.support > 0.9:
            nstyle["fgcolor"] = "darkgreen"
        else:
            nstyle["fgcolor"] = "red"
        nstyle["size"] = 5
    node.set_style(nstyle)

# Render to file (pass the same tree style to both outputs)
tree.render("tree.pdf", tree_style=ts)
tree.render("tree.png", w=800, h=600, units="px", dpi=300, tree_style=ts)
Use scripts/quick_visualize.py for rapid visualization:
# Basic visualization
python scripts/quick_visualize.py tree.nw output.pdf

# Circular layout with custom styling
python scripts/quick_visualize.py tree.nw output.pdf --mode c --color-by-support

# High-resolution PNG
python scripts/quick_visualize.py tree.nw output.png --width 1200 --height 800 --units px --dpi 300

# Custom title and styling
python scripts/quick_visualize.py tree.nw output.pdf --title "Species Phylogeny" --show-support
Advanced visualization with faces:
from ete3 import Tree, TreeStyle, TextFace, CircleFace, faces

tree = Tree("tree.nw")

# Add features to nodes
for leaf in tree:
    leaf.add_feature("habitat", "marine" if "fish" in leaf.name else "land")

# Layout function: called for each node every time the tree is drawn.
# faces.add_face_to_node attaches faces for the current drawing only, so
# repeated renders do not accumulate duplicate faces on the nodes.
def layout(node):
    if node.is_leaf():
        # Add colored circle
        color = "blue" if node.habitat == "marine" else "green"
        circle = CircleFace(radius=5, color=color)
        faces.add_face_to_node(circle, node, column=0, position="aligned")

        # Add label
        label = TextFace(node.name, fsize=10)
        faces.add_face_to_node(label, node, column=1, position="aligned")

ts = TreeStyle()
ts.layout_fn = layout
ts.show_leaf_name = False

tree.render("annotated_tree.pdf", tree_style=ts)
5. Clustering Analysis
Analyze hierarchical clustering results with data integration:
- ClusterTree: Specialized class for clustering dendrograms
- Data matrix linking: Connect tree leaves to numerical profiles
- Cluster metrics: Silhouette coefficient, Dunn index, inter/intra-cluster distances
- Validation: Test cluster quality with different distance metrics
- Heatmap visualization: Display data matrices alongside trees
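For intuition about the silhouette coefficient, the sketch below computes it for one item as s = (b - a) / max(a, b), where a measures the item's distance to its own cluster and b its distance to another cluster. It uses Euclidean distance to cluster centroids as a simplification; the profiles are invented, and ETE's exact definitions of a and b may differ.

```python
import math


def centroid(profiles):
    """Mean profile of a cluster (column-wise average)."""
    return [sum(col) / len(col) for col in zip(*profiles)]


def euclidean(p, q):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette(item, own_cluster, other_cluster):
    """Silhouette of one item: (b - a) / max(a, b), in [-1, 1]."""
    a = euclidean(item, centroid(own_cluster))
    b = euclidean(item, centroid(other_cluster))
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)


# Hypothetical profiles: two tight clusters that lie far apart.
cluster1 = [[1.0, 1.0], [1.0, 3.0]]
cluster2 = [[9.0, 2.0], [11.0, 2.0]]
s = silhouette(cluster1[0], cluster1, cluster2)
print(round(s, 3))
```

Values near 1 indicate an item that sits much closer to its own cluster than to the other one; values near 0 or below suggest a questionable assignment.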
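Clustering workflow:
from ete3 import ClusterTree

# Load tree with data matrix (tab-delimited, one row per leaf)
matrix = """#Names\tSample1\tSample2\tSample3
Gene1\t1.5\t2.3\t0.8
Gene2\t0.9\t1.1\t1.8
Gene3\t2.1\t2.5\t0.5"""

tree = ClusterTree("((Gene1,Gene2),Gene3);", text_array=matrix)

# Evaluate cluster quality (the root is skipped: it has no sister cluster to compare against)
for node in tree.traverse():
    if not node.is_leaf() and not node.is_root():
        # get_silhouette() returns (silhouette, intracluster distance, intercluster distance)
        silhouette, intra_dist, inter_dist = node.get_silhouette()
        print(f"Cluster: {node.name}")
        print(f"  Silhouette: {silhouette:.3f}")

# The Dunn index is computed over a set of clusters, here the root's children
dunn = tree.get_dunn(tree.get_children())
print(f"Dunn index: {dunn:.3f}")

# Visualize with heatmap
tree.show("heatmap")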
6. Tree Comparison
Quantify topological differences between trees:
- Robinson-Foulds distance: Standard metric for tree comparison
- Normalized RF: Scale-invariant distance (0.0 to 1.0)
- Partition analysis: Identify unique and shared bipartitions
- Consensus trees: Analyze support across multiple trees
- Batch comparison: Compare multiple trees pairwise
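The Robinson-Foulds distance counts the bipartitions (splits) found in one tree but not the other. A minimal sketch with plain sets, assuming both trees are unrooted and share the same five leaves; the split sets are written by hand here, whereas ETE derives them from the trees.

```python
def rf_distance(splits1, splits2):
    """Number of splits present in exactly one of the two trees."""
    return len(splits1 ^ splits2)


def normalized_rf(splits1, splits2):
    """RF scaled to [0, 1] by the total number of splits in both trees."""
    total = len(splits1) + len(splits2)
    return rf_distance(splits1, splits2) / total if total else 0.0


# Non-trivial splits of two unrooted trees on leaves A-E. Each split is
# represented by the side that does not contain leaf A.
tree1_splits = {frozenset("DE"), frozenset("CDE")}  # ((A,B),C,(D,E))
tree2_splits = {frozenset("DE"), frozenset("BDE")}  # ((A,C),B,(D,E))

print(rf_distance(tree1_splits, tree2_splits))
print(normalized_rf(tree1_splits, tree2_splits))
```

Both trees share the D,E grouping and disagree on where B and C attach, so one split from each tree is unique.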
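Compare two trees:
from ete3 import Tree

tree1 = Tree("tree1.nw")
tree2 = Tree("tree2.nw")

# Calculate RF distance. The first five returned values are the distance, its maximum,
# the shared leaf names, and the partitions of each tree; slicing keeps the unpacking
# valid if the call returns additional values.
# For unrooted input trees, pass unrooted_trees=True.
result = tree1.robinson_foulds(tree2)
rf, max_rf, common_leaves, parts_t1, parts_t2 = result[:5]

print(f"RF distance: {rf}/{max_rf}")
print(f"Normalized RF: {rf/max_rf:.3f}")
print(f"Common leaves: {len(common_leaves)}")

# Find unique partitions
unique_t1 = parts_t1 - parts_t2
unique_t2 = parts_t2 - parts_t1

print(f"Unique to tree1: {len(unique_t1)}")
print(f"Unique to tree2: {len(unique_t2)}")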
Compare multiple trees:
import numpy as np
from ete3 import Tree

trees = [Tree(f"tree{i}.nw") for i in range(4)]

# Create distance matrix
n = len(trees)
dist_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i + 1, n):
        # Only the distance and its maximum are needed here
        rf, max_rf = trees[i].robinson_foulds(trees[j])[:2]
        norm_rf = rf / max_rf if max_rf > 0 else 0
        dist_matrix[i, j] = norm_rf
        dist_matrix[j, i] = norm_rf
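Once a pairwise distance matrix exists, a common next step is to find which trees agree most and least. The sketch below does this on a plain nested list with invented values; `extreme_pairs` is an illustrative helper and makes no assumption about how the matrix was produced.

```python
from itertools import combinations


def extreme_pairs(dist_matrix):
    """Return ((i, j) of the closest pair, (i, j) of the most distant pair)."""
    pairs = list(combinations(range(len(dist_matrix)), 2))
    closest = min(pairs, key=lambda p: dist_matrix[p[0]][p[1]])
    farthest = max(pairs, key=lambda p: dist_matrix[p[0]][p[1]])
    return closest, farthest


# Hypothetical normalized RF distances between four trees.
matrix = [
    [0.0, 0.2, 0.6, 0.8],
    [0.2, 0.0, 0.5, 0.7],
    [0.6, 0.5, 0.0, 0.1],
    [0.8, 0.7, 0.1, 0.0],
]
closest, farthest = extreme_pairs(matrix)
print(f"Most similar trees: {closest}, most different: {farthest}")
```

The same indexing works on a NumPy array converted with `.tolist()`.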
Installation and Setup
Install ETE toolkit:
# Basic installation
pip install ete3

# With external dependencies for rendering (optional but recommended)
# On macOS:
brew install qt@5
# On Ubuntu/Debian:
sudo apt-get install python3-pyqt5 python3-pyqt5.qtsvg

# For full features including GUI
pip install ete3[gui]

First-time NCBI Taxonomy setup:
The first time NCBITaxa is instantiated, it automatically downloads the NCBI taxonomy database (~300MB) to ~/.etetoolkit/taxa.sqlite. This happens only once:
from ete3 import NCBITaxa
ncbi = NCBITaxa()  # Downloads database on first run

Update taxonomy database:
ncbi.update_taxonomy_database()  # Download latest NCBI data
Common Use Cases
Use Case 1: Phylogenomic Pipeline
Complete workflow from gene tree to ortholog identification:
from ete3 import PhyloTree, NCBITaxa

# 1. Load gene tree with alignment
tree = PhyloTree("gene_tree.nw", alignment="alignment.fasta")

# 2. Configure species naming
tree.set_species_naming_function(lambda x: x.split("_")[0])

# 3. Detect evolutionary events
tree.get_descendant_evol_events()

# 4. Annotate with taxonomy
# species_to_taxid is assumed to be a dict you supply, mapping species codes to NCBI taxids
ncbi = NCBITaxa()
for leaf in tree:
    if leaf.species in species_to_taxid:
        taxid = species_to_taxid[leaf.species]
        lineage = ncbi.get_lineage(taxid)
        leaf.add_feature("lineage", lineage)

# 5. Extract ortholog groups (returns tree count, duplication count, and the trees)
ntrees, ndups, ortho_groups = tree.get_speciation_trees()

# 6. Save each ortholog group
for i, ortho in enumerate(ortho_groups):
    ortho.write(outfile=f"ortho_{i}.nw")
Use Case 2: Tree Preprocessing and Formatting
Batch process trees for analysis:
# Convert format
python scripts/tree_operations.py convert input.nw output.nw --in-format 0 --out-format 1

# Root at midpoint
python scripts/tree_operations.py reroot input.nw rooted.nw --midpoint

# Prune to focal taxa
python scripts/tree_operations.py prune rooted.nw pruned.nw --keep-taxa taxa_list.txt

# Get statistics
python scripts/tree_operations.py stats pruned.nw
Use Case 3: Publication-Quality Figures
Create styled visualizations:
from ete3 import Tree, TreeStyle, NodeStyle, TextFace, faces

tree = Tree("tree.nw")

# Define clade colors
clade_colors = {
    "Mammals": "red",
    "Birds": "blue",
    "Fish": "green",
}

def layout(node):
    if node.is_leaf():
        # Highlight clades
        for clade, color in clade_colors.items():
            if clade in node.name:
                nstyle = NodeStyle()
                nstyle["fgcolor"] = color
                nstyle["size"] = 8
                node.set_style(nstyle)
    else:
        # Add support values
        if node.support > 0.95:
            support = TextFace(f"{node.support:.2f}", fsize=8)
            faces.add_face_to_node(support, node, column=0, position="branch-top")

ts = TreeStyle()
ts.layout_fn = layout
ts.show_scale = True

# Render for publication
tree.render("figure.pdf", w=200, units="mm", tree_style=ts)
tree.render("figure.svg", tree_style=ts)  # Editable vector
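When a journal specifies figure width in millimetres and a raster resolution, the pixel width follows from px = mm / 25.4 * dpi. A small helper (the function name is illustrative) for checking the expected size of a PNG export:

```python
MM_PER_INCH = 25.4


def mm_to_pixels(mm, dpi):
    """Convert a physical length in millimetres to whole pixels at a given DPI."""
    return round(mm / MM_PER_INCH * dpi)


print(mm_to_pixels(200, 300))
```

Vector outputs such as PDF and SVG do not need this conversion, since they scale without loss.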
Use Case 4: Automated Tree Analysis
Process multiple trees systematically:
from ete3 import Tree
import os

input_dir = "trees"
output_dir = "processed"
os.makedirs(output_dir, exist_ok=True)

for filename in os.listdir(input_dir):
    if filename.endswith(".nw"):
        tree = Tree(os.path.join(input_dir, filename))

        # Standardize: midpoint root, resolve polytomies
        midpoint = tree.get_midpoint_outgroup()
        tree.set_outgroup(midpoint)
        tree.resolve_polytomy(recursive=True)

        # Collapse low-support branches. Collect the nodes first so the tree
        # is not modified while it is being traversed.
        low_support = [
            node for node in tree.traverse()
            if not node.is_leaf() and not node.is_root() and node.support < 0.5
        ]
        for node in low_support:
            node.delete()

        # Save processed tree
        output_file = os.path.join(output_dir, f"processed_{filename}")
        tree.write(outfile=output_file)
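In batch runs, one malformed file should not stop the whole job. The sketch below separates file discovery and error handling from the per-tree work: `transform` is a hypothetical callable that takes and returns Newick text, standing in for whatever ETE processing you apply, and `process_directory` is an illustrative name.

```python
from pathlib import Path


def process_directory(input_dir, output_dir, transform, pattern="*.nw"):
    """Apply `transform` to every matching file; return (written, failures)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written, failures = [], []
    for path in sorted(Path(input_dir).glob(pattern)):
        try:
            result = transform(path.read_text())
        except Exception as exc:  # keep going, but remember what failed
            failures.append((path.name, str(exc)))
            continue
        target = out / f"processed_{path.name}"
        target.write_text(result)
        written.append(target)
    return written, failures
```

Returning the failures as a list lets the caller log or retry them after the run instead of losing the remaining files to the first exception.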
Reference Documentation
For comprehensive API documentation, code examples, and detailed guides, refer to the following resources in the references/ directory:
- api_reference.md: Complete API documentation for all ETE classes and methods (Tree, PhyloTree, ClusterTree, NCBITaxa), including parameters, return types, and code examples
- workflows.md: Common workflow patterns organized by task (tree operations, phylogenetic analysis, tree comparison, taxonomy integration, clustering analysis)
- visualization.md: Comprehensive visualization guide covering TreeStyle, NodeStyle, Faces, layout functions, and advanced visualization techniques
Load these references when detailed information is needed:
# To use API reference
Read references/api_reference.md for complete method signatures and parameters
# To implement workflows
Read references/workflows.md for step-by-step workflow examples
# To create visualizations
Read references/visualization.md for styling and rendering options
Troubleshooting
Import errors:
# If "ModuleNotFoundError: No module named 'ete3'"
pip install ete3
# For GUI and rendering issues
pip install ete3[gui]

Rendering issues:
If tree.render() or tree.show() fails with Qt-related errors, install system dependencies:
# macOS
brew install qt@5
# Ubuntu/Debian
sudo apt-get install python3-pyqt5 python3-pyqt5.qtsvg

NCBI Taxonomy database:
If database download fails or becomes corrupted:
from ete3 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()  # Redownload database

Memory issues with large trees:
For very large trees (>10,000 leaves), use iterators instead of methods that build a full list:
# Memory-efficient iteration
for leaf in tree.iter_leaves():
    process(leaf)

# Instead of
for leaf in tree.get_leaves():  # Loads all into memory
    process(leaf)
Newick Format Reference
ETE supports several Newick format specifications, selected by an integer code (0-9 and 100):
- Format 0: Flexible, with support values (default)
- Format 1: Flexible, with internal node names
- Format 2: All branch lengths, leaf names, and internal support values
- Format 5: Internal and leaf branch lengths, with leaf names
- Format 8: All names (leaf and internal), no branch lengths
- Format 9: Leaf names only
- Format 100: Topology only
Specify format when reading/writing:
tree = Tree("tree.nw", format=1)
tree.write(outfile="output.nw", format=5)
NHX (New Hampshire eXtended) format preserves custom features:
tree.write(outfile="tree.nhx", features=["habitat", "temperature", "depth"])
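NHX stores extra features as key=value pairs in a bracketed comment after each node, of the form `[&&NHX:key=value:key=value]`. The helper below is a hypothetical function, not ETE's writer; it only shows how one annotated leaf would look when serialized.

```python
def nhx_leaf(name, dist, features):
    """Serialize one leaf as name:dist[&&NHX:k=v:...] with keys in sorted order."""
    tags = ":".join(f"{key}={features[key]}" for key in sorted(features))
    annotation = f"[&&NHX:{tags}]" if tags else ""
    return f"{name}:{dist}{annotation}"


print(nhx_leaf("salmon", 0.12, {"habitat": "marine", "depth": 20}))
```

Because the annotation lives inside a Newick comment, parsers that do not understand NHX can still read the topology and branch lengths.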
Best Practices
- Preserve branch lengths: Use preserve_branch_length=True when pruning for phylogenetic analysis
- Cache content: Use get_cached_content() for repeated access to node contents on large trees
- Use iterators: Employ iter_* methods for memory-efficient processing of large trees
- Choose appropriate traversal: Postorder for bottom-up analysis, preorder for top-down
- Validate monophyly: Always check returned clade type (monophyletic/paraphyletic/polyphyletic)
- Vector formats for publication: Use PDF or SVG for publication figures (scalable, editable)
- Interactive testing: Use tree.show() to test visualizations before rendering to file
- PhyloTree for phylogenetics: Use PhyloTree class for gene trees and evolutionary analysis
- Copy method selection: "newick" for speed, "cpickle" for full fidelity, "deepcopy" for complex objects
- NCBI query caching: Store NCBI taxonomy query results to avoid repeated database access
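For the NCBI query caching advice, `functools.lru_cache` is often enough. In this sketch `slow_lineage_lookup` is a stand-in for a real taxonomy query (for example a call to `ncbi.get_lineage`); the lineage it returns is fake data, used only to show that repeated queries hit the cache.

```python
from functools import lru_cache

CALLS = {"count": 0}


def slow_lineage_lookup(taxid):
    """Stand-in for an expensive database query (returns fake data)."""
    CALLS["count"] += 1
    return (1, taxid)


@lru_cache(maxsize=None)
def cached_lineage(taxid):
    """Memoized wrapper: each distinct taxid is looked up only once."""
    return slow_lineage_lookup(taxid)


for taxid in [9606, 9598, 9606, 9606]:
    cached_lineage(taxid)

print(CALLS["count"])
print(cached_lineage.cache_info())
```

Returning a tuple rather than a list keeps cached results immutable, so one caller cannot accidentally alter what later callers receive.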