GTF/GFF Handling
GTF and GFF3 are standard gene annotation formats. Both use 1-based coordinates.
Format Comparison
Feature GTF GFF3
Coordinate system 1-based, inclusive 1-based, inclusive
Hierarchy Implicit (gene_id, transcript_id) Explicit (Parent attribute)
Attribute format key "value"; key=value;
Comments
Fasta sequences Not standard ##FASTA directive
GTF Format
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";
GFF3 Format
chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972;Name=DDX11L1 chr1 HAVANA mRNA 11869 14409 . + . ID=ENST00000456328;Parent=ENSG00000223972 chr1 HAVANA exon 11869 12227 . + . ID=exon1;Parent=ENST00000456328
Parse GTF with gtfparse (Python)
Installation
pip install gtfparse
Basic Parsing
import gtfparse
Load entire GTF
df = gtfparse.read_gtf('annotation.gtf')
View columns
print(df.columns)
['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame',
'gene_id', 'transcript_id', 'gene_name', ...]
Filter by feature type
genes = df[df['feature'] == 'gene'] transcripts = df[df['feature'] == 'transcript'] exons = df[df['feature'] == 'exon']
Get specific gene
gene_df = df[df['gene_name'] == 'TP53']
Extract Gene Coordinates
import gtfparse
df = gtfparse.read_gtf('annotation.gtf')
All genes
genes = df[df['feature'] == 'gene'][['seqname', 'start', 'end', 'strand', 'gene_id', 'gene_name']]
Convert to BED format (0-based)
genes_bed = genes.copy() genes_bed['start'] = genes_bed['start'] - 1 # GTF is 1-based, BED is 0-based genes_bed = genes_bed[['seqname', 'start', 'end', 'gene_name', 'gene_id', 'strand']] genes_bed.to_csv('genes.bed', sep='\t', header=False, index=False)
Get Exons for Gene
import gtfparse
df = gtfparse.read_gtf('annotation.gtf')
Get all exons for TP53
tp53_exons = df[(df['gene_name'] == 'TP53') & (df['feature'] == 'exon')] tp53_exons = tp53_exons[['seqname', 'start', 'end', 'transcript_id', 'exon_number']] print(tp53_exons)
Parse GFF with gffutils (Python)
Installation
pip install gffutils
Create Database
import gffutils
Create database (slow first time, fast for subsequent queries)
db = gffutils.create_db('annotation.gff3', 'annotation.db', force=True, merge_strategy='create_unique')
Or load existing database
db = gffutils.FeatureDB('annotation.db')
Query Features
import gffutils
db = gffutils.FeatureDB('annotation.db')
Count features by type
for featuretype in db.featuretypes(): count = db.count_features_of_type(featuretype) print(f'{featuretype}: {count}')
Get all genes
for gene in db.features_of_type('gene'): print(f'{gene.id}: {gene.seqid}:{gene.start}-{gene.end}')
Get gene by ID
gene = db['ENSG00000141510'] # TP53 print(f'{gene.attributes["Name"][0]}: {gene.seqid}:{gene.start}-{gene.end}')
Get children (transcripts, exons)
for transcript in db.children(gene, featuretype='mRNA'): print(f' Transcript: {transcript.id}') for exon in db.children(transcript, featuretype='exon'): print(f' Exon: {exon.start}-{exon.end}')
Get Introns
import gffutils
db = gffutils.FeatureDB('annotation.db')
Get introns for a transcript
transcript = db['ENST00000269305'] introns = list(db.interfeatures(db.children(transcript, featuretype='exon'), new_featuretype='intron')) for intron in introns: print(f'Intron: {intron.start}-{intron.end}')
Convert Formats with gffread (CLI)
Installation
conda install -c bioconda gffread
GTF to GFF3
gffread annotation.gtf -o annotation.gff3
GFF3 to GTF
gffread annotation.gff3 -T -o annotation.gtf
Extract Sequences
Extract transcript sequences
gffread -w transcripts.fa -g genome.fa annotation.gtf
Extract CDS sequences
gffread -x cds.fa -g genome.fa annotation.gtf
Extract protein sequences
gffread -y proteins.fa -g genome.fa annotation.gtf
Filter Features
Keep only protein-coding genes
gffread annotation.gtf -C -o coding.gtf
Keep specific gene types
gffread annotation.gtf --keep-genes=protein_coding -o coding.gtf
Extract Regions with bedtools
Get Promoters
Extract TSS (transcript start sites)
awk '$3 == "transcript"' annotation.gtf |
awk -v OFS='\t' '{
if ($7 == "+") print $1, $4-1, $4, ".", ".", $7;
else print $1, $5-1, $5, ".", ".", $7;
}' > tss.bed
Get promoter regions (2kb upstream of TSS)
bedtools flank -i tss.bed -g genome.txt -l 2000 -r 0 -s > promoters.bed
Get Gene Bodies
Extract gene coordinates to BED
awk '$3 == "gene"' annotation.gtf |
awk -v OFS='\t' '{
split($0, a, "gene_id ""); split(a[2], b, """);
print $1, $4-1, $5, b[1], ".", $7;
}' > genes.bed
Get Exons
Extract unique exons
awk '$3 == "exon"' annotation.gtf |
awk -v OFS='\t' '{print $1, $4-1, $5, ".", ".", $7}' |
sort -k1,1 -k2,2n | uniq > exons.bed
Python: GTF to BED Conversion
import gtfparse import pandas as pd
def gtf_to_bed(gtf_path, feature_type='gene', output_path=None): '''Convert GTF features to BED format.''' df = gtfparse.read_gtf(gtf_path) features = df[df['feature'] == feature_type].copy()
# Convert to 0-based coordinates
bed = pd.DataFrame({
'chrom': features['seqname'],
'start': features['start'] - 1,
'end': features['end'],
'name': features.get('gene_name', features.get('gene_id', '.')),
'score': 0,
'strand': features['strand']
})
if output_path:
bed.to_csv(output_path, sep='\t', header=False, index=False)
return bed
Usage
genes_bed = gtf_to_bed('annotation.gtf', 'gene', 'genes.bed') exons_bed = gtf_to_bed('annotation.gtf', 'exon', 'exons.bed')
Validate GTF/GFF
Check GTF format
gffread -E annotation.gtf
Check GFF3 format
gffread -E annotation.gff3
Detailed validation
gt gff3validator annotation.gff3 # requires genometools
Common Attributes
GTF Attributes
Attribute Description
gene_id Ensembl gene ID
gene_name Gene symbol
gene_biotype protein_coding, lncRNA, etc.
transcript_id Ensembl transcript ID
transcript_name Transcript symbol
exon_number Exon position in transcript
exon_id Ensembl exon ID
GFF3 Attributes
Attribute Description
ID Unique feature identifier
Name Display name
Parent Parent feature ID
Dbxref Database cross-references
gene_biotype Gene type
Memory-Efficient Processing
import gtfparse
Process large files in chunks (gtfparse loads all into memory)
For very large files, use gffutils database approach
Or filter during parsing
df = gtfparse.read_gtf('annotation.gtf', features=['gene', 'exon']) # Only load specific features
Related Skills
-
bed-file-basics - BED format and conversion
-
interval-arithmetic - Gene/exon overlap analysis
-
proximity-operations - TSS proximity analysis
-
differential-expression/de-results - Gene coordinate mapping