bio-genome-intervals-gtf-gff-handling

GTF and GFF3 are standard gene annotation formats. Both use 1-based coordinates.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "bio-genome-intervals-gtf-gff-handling" with this command: npx skills add gptomics/bioskills/gptomics-bioskills-bio-genome-intervals-gtf-gff-handling

GTF/GFF Handling

GTF and GFF3 are standard gene annotation formats. Both use 1-based coordinates.

Format Comparison

Feature GTF GFF3

Coordinate system 1-based, inclusive 1-based, inclusive

Hierarchy Implicit (gene_id, transcript_id) Explicit (Parent attribute)

Attribute format key "value"; key=value;

Comments

Fasta sequences Not standard ##FASTA directive

GTF Format

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";

GFF3 Format

chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972;Name=DDX11L1 chr1 HAVANA mRNA 11869 14409 . + . ID=ENST00000456328;Parent=ENSG00000223972 chr1 HAVANA exon 11869 12227 . + . ID=exon1;Parent=ENST00000456328

Parse GTF with gtfparse (Python)

Installation

pip install gtfparse

Basic Parsing

import gtfparse

Load entire GTF

df = gtfparse.read_gtf('annotation.gtf')

View columns

print(df.columns)

['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame',

'gene_id', 'transcript_id', 'gene_name', ...]

Filter by feature type

genes = df[df['feature'] == 'gene'] transcripts = df[df['feature'] == 'transcript'] exons = df[df['feature'] == 'exon']

Get specific gene

gene_df = df[df['gene_name'] == 'TP53']

Extract Gene Coordinates

import gtfparse

df = gtfparse.read_gtf('annotation.gtf')

All genes

genes = df[df['feature'] == 'gene'][['seqname', 'start', 'end', 'strand', 'gene_id', 'gene_name']]

Convert to BED format (0-based)

genes_bed = genes.copy() genes_bed['start'] = genes_bed['start'] - 1 # GTF is 1-based, BED is 0-based genes_bed = genes_bed[['seqname', 'start', 'end', 'gene_name', 'gene_id', 'strand']] genes_bed.to_csv('genes.bed', sep='\t', header=False, index=False)

Get Exons for Gene

import gtfparse

df = gtfparse.read_gtf('annotation.gtf')

Get all exons for TP53

tp53_exons = df[(df['gene_name'] == 'TP53') & (df['feature'] == 'exon')] tp53_exons = tp53_exons[['seqname', 'start', 'end', 'transcript_id', 'exon_number']] print(tp53_exons)

Parse GFF with gffutils (Python)

Installation

pip install gffutils

Create Database

import gffutils

Create database (slow first time, fast for subsequent queries)

db = gffutils.create_db('annotation.gff3', 'annotation.db', force=True, merge_strategy='create_unique')

Or load existing database

db = gffutils.FeatureDB('annotation.db')

Query Features

import gffutils

db = gffutils.FeatureDB('annotation.db')

Count features by type

for featuretype in db.featuretypes(): count = db.count_features_of_type(featuretype) print(f'{featuretype}: {count}')

Get all genes

for gene in db.features_of_type('gene'): print(f'{gene.id}: {gene.seqid}:{gene.start}-{gene.end}')

Get gene by ID

gene = db['ENSG00000141510'] # TP53 print(f'{gene.attributes["Name"][0]}: {gene.seqid}:{gene.start}-{gene.end}')

Get children (transcripts, exons)

for transcript in db.children(gene, featuretype='mRNA'): print(f' Transcript: {transcript.id}') for exon in db.children(transcript, featuretype='exon'): print(f' Exon: {exon.start}-{exon.end}')

Get Introns

import gffutils

db = gffutils.FeatureDB('annotation.db')

Get introns for a transcript

transcript = db['ENST00000269305'] introns = list(db.interfeatures(db.children(transcript, featuretype='exon'), new_featuretype='intron')) for intron in introns: print(f'Intron: {intron.start}-{intron.end}')

Convert Formats with gffread (CLI)

Installation

conda install -c bioconda gffread

GTF to GFF3

gffread annotation.gtf -o annotation.gff3

GFF3 to GTF

gffread annotation.gff3 -T -o annotation.gtf

Extract Sequences

Extract transcript sequences

gffread -w transcripts.fa -g genome.fa annotation.gtf

Extract CDS sequences

gffread -x cds.fa -g genome.fa annotation.gtf

Extract protein sequences

gffread -y proteins.fa -g genome.fa annotation.gtf

Filter Features

Keep only protein-coding genes

gffread annotation.gtf -C -o coding.gtf

Keep specific gene types

gffread annotation.gtf --keep-genes=protein_coding -o coding.gtf

Extract Regions with bedtools

Get Promoters

Extract TSS (transcript start sites)

awk '$3 == "transcript"' annotation.gtf |
awk -v OFS='\t' '{ if ($7 == "+") print $1, $4-1, $4, ".", ".", $7; else print $1, $5-1, $5, ".", ".", $7; }' > tss.bed

Get promoter regions (2kb upstream of TSS)

bedtools flank -i tss.bed -g genome.txt -l 2000 -r 0 -s > promoters.bed

Get Gene Bodies

Extract gene coordinates to BED

awk '$3 == "gene"' annotation.gtf |
awk -v OFS='\t' '{ split($0, a, "gene_id ""); split(a[2], b, """); print $1, $4-1, $5, b[1], ".", $7; }' > genes.bed

Get Exons

Extract unique exons

awk '$3 == "exon"' annotation.gtf |
awk -v OFS='\t' '{print $1, $4-1, $5, ".", ".", $7}' |
sort -k1,1 -k2,2n | uniq > exons.bed

Python: GTF to BED Conversion

import gtfparse import pandas as pd

def gtf_to_bed(gtf_path, feature_type='gene', output_path=None): '''Convert GTF features to BED format.''' df = gtfparse.read_gtf(gtf_path) features = df[df['feature'] == feature_type].copy()

# Convert to 0-based coordinates
bed = pd.DataFrame({
    'chrom': features['seqname'],
    'start': features['start'] - 1,
    'end': features['end'],
    'name': features.get('gene_name', features.get('gene_id', '.')),
    'score': 0,
    'strand': features['strand']
})

if output_path:
    bed.to_csv(output_path, sep='\t', header=False, index=False)
return bed

Usage

genes_bed = gtf_to_bed('annotation.gtf', 'gene', 'genes.bed') exons_bed = gtf_to_bed('annotation.gtf', 'exon', 'exons.bed')

Validate GTF/GFF

Check GTF format

gffread -E annotation.gtf

Check GFF3 format

gffread -E annotation.gff3

Detailed validation

gt gff3validator annotation.gff3 # requires genometools

Common Attributes

GTF Attributes

Attribute Description

gene_id Ensembl gene ID

gene_name Gene symbol

gene_biotype protein_coding, lncRNA, etc.

transcript_id Ensembl transcript ID

transcript_name Transcript symbol

exon_number Exon position in transcript

exon_id Ensembl exon ID

GFF3 Attributes

Attribute Description

ID Unique feature identifier

Name Display name

Parent Parent feature ID

Dbxref Database cross-references

gene_biotype Gene type

Memory-Efficient Processing

import gtfparse

Process large files in chunks (gtfparse loads all into memory)

For very large files, use gffutils database approach

Or filter during parsing

df = gtfparse.read_gtf('annotation.gtf', features=['gene', 'exon']) # Only load specific features

Related Skills

  • bed-file-basics - BED format and conversion

  • interval-arithmetic - Gene/exon overlap analysis

  • proximity-operations - TSS proximity analysis

  • differential-expression/de-results - Gene coordinate mapping

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

bio-read-qc-fastp-workflow

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

bio-workflow-management-snakemake-workflows

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

bio-workflow-management-cwl-workflows

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

bio-workflows-rnaseq-to-de

No summary provided by upstream source.

Repository SourceNeeds Review