Overview
OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the ov.alignment module. This skill covers:
- SRA data acquisition: prefetch and fqdump (fasterq-dump wrapper)
- Quality control: fastp for adapter trimming and QC reports
- RNA-seq alignment: STAR aligner with auto-index building
- Gene quantification: featureCount (subread featureCounts wrapper)
- Single-cell path: ref and count via kb-python (kallisto/bustools)
- Parallel SRA download: parallel_fastq_dump
All functions share a common CLI infrastructure (_cli_utils.py) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.
Instructions
Environment setup
- Bioinformatics tools are resolved automatically from PATH or the active conda environment.
- If auto_install=True (default), missing tools are installed via mamba/conda on demand.
- Supported tools: prefetch, vdb-validate, fasterq-dump, fastp, STAR, samtools, featureCounts, pigz, gzip.
- For the single-cell path, ensure kb-python is installed: pip install kb-python.
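Before a long run it can help to see which of the supported tools are already on PATH. The following is a minimal standalone sketch using only the standard library; it is not part of OmicVerse, and the helper names find_tools and missing_tools are hypothetical.

```python
import shutil

# Tool names copied from the supported-tools list above.
TOOLS = ["prefetch", "vdb-validate", "fasterq-dump", "fastp", "STAR",
         "samtools", "featureCounts", "pigz", "gzip"]


def find_tools(names=TOOLS):
    """Map each tool name to its resolved path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=TOOLS):
    """Return the tools that are not currently on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:15s} {path or 'NOT FOUND'}")
```

Anything reported as missing here would be a candidate for on-demand installation when auto_install=True.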
SRA data download (ov.alignment.prefetch + ov.alignment.fqdump)
- Use prefetch first for reliable downloads with integrity validation (vdb-validate).
- Then convert to FASTQ with fqdump. It auto-detects single-end vs paired-end.
- fqdump can also work directly from SRR accessions without prefetch.
- Both support retry with exponential backoff for network errors.
    import omicverse as ov
    # Step 1: Prefetch SRA files (optional but recommended)
    pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)
    # Step 2: Convert to FASTQ
    fq = ov.alignment.fqdump(
        ['SRR1234567', 'SRR1234568'],
        output_dir='fastq',
        sra_dir='prefetch',
        gzip=True,
        threads=8,
        jobs=4,
    )
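The retry behaviour mentioned above can be illustrated generically. This is a sketch of exponential backoff in plain Python, not OmicVerse's actual implementation; the function name, the default delays, and the choice of OSError as the retryable error type are all assumptions made for illustration.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, factor=2.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    The wait before retry k (1-based) is base_delay * factor ** (k - 1),
    so the defaults wait 1 s, 2 s, then 4 s.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * factor ** (attempt - 1))
```

The sleep argument is injectable so the schedule can be checked without actually waiting; ConnectionError and TimeoutError are both subclasses of OSError, so they are retried under the defaults.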
FASTQ quality control (ov.alignment.fastp)
- Runs fastp for adapter trimming, quality filtering, and QC reporting.
- Supports single-end and paired-end reads.
- Produces per-sample JSON and HTML QC reports.
- Sample format: tuple of (sample_name, fq1_path, fq2_path_or_None).
    samples = [
        ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
        ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
    ]
    clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
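With many samples, typing the tuples by hand becomes error-prone. The sketch below builds them from a directory tree. It assumes the naming seen in the example above (<name>_1.fastq.gz and <name>_2.fastq.gz for paired reads) and treats a file with neither suffix as single-end; collect_samples is a hypothetical helper, not an OmicVerse function.

```python
from pathlib import Path


def collect_samples(fastq_dir, suffix=".fastq.gz"):
    """Build (sample_name, fq1, fq2_or_None) tuples from files under fastq_dir.

    Assumes paired files are named <name>_1<suffix> and <name>_2<suffix>,
    and single-end files <name><suffix>.
    """
    mates = {}
    for path in sorted(Path(fastq_dir).rglob(f"*{suffix}")):
        stem = path.name[: -len(suffix)]
        if stem.endswith("_1"):
            mates.setdefault(stem[:-2], {})["fq1"] = str(path)
        elif stem.endswith("_2"):
            mates.setdefault(stem[:-2], {})["fq2"] = str(path)
        else:
            mates.setdefault(stem, {})["fq1"] = str(path)

    samples = []
    for name, files in sorted(mates.items()):
        if "fq1" not in files:
            raise ValueError(f"{name}: found mate 2 without mate 1")
        # fq2 is None for single-end samples
        samples.append((name, files["fq1"], files.get("fq2")))
    return samples
```

The returned list has the same shape as the samples list above, so it could be passed in its place, for example samples = collect_samples('fastq').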
STAR alignment (ov.alignment.STAR)
- Aligns FASTQ reads using the STAR aligner.
- Auto-index building: set auto_index=True (default) with genome_fasta_files and gtf to build the index automatically if it is missing.
- Produces coordinate-sorted BAM files.
- Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
- Use strict=False (default) for graceful error handling per sample.
    # Prepare samples from fastp output
    star_samples = [
        ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
        ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
    ]
    bams = ov.alignment.STAR(
        star_samples,
        genome_dir='star_index',
        output_dir='star_out',
        gtf='genes.gtf',
        genome_fasta_files=['genome.fa'],
        threads=8,
        memory='50G',
    )
Gene quantification (ov.alignment.featureCount)
- Counts aligned reads per gene using featureCounts (subread).
- Auto-detects paired-end from BAM headers (via pysam or samtools).
- auto_fix=True (default) retries with the corrected paired-end flag on error.
- gene_mapping=True maps gene_id to gene_name from the GTF.
- merge_matrix=True produces a combined count matrix across all samples.
    bam_items = [
        ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
        ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
    ]
    counts = ov.alignment.featureCount(
        bam_items,
        gtf='genes.gtf',
        output_dir='counts',
        gene_mapping=True,
        merge_matrix=True,
        threads=8,
    )
    # counts is a pandas DataFrame (gene_id x samples)
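To make the gene_mapping step concrete, here is a rough sketch of how a gene_id to gene_name lookup can be derived from GTF attribute columns. It is an independent illustration of the idea and not the code OmicVerse runs; the function name and the fallback to gene_id when gene_name is absent are assumptions.

```python
import re

# Matches GTF attribute pairs such as: gene_id "ENSG000001";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def gene_id_to_name(gtf_lines):
    """Return {gene_id: gene_name} from GTF lines.

    Records without a gene_name map the gene_id to itself.
    """
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = dict(_ATTR.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        name = attrs.get("gene_name")
        if name:
            mapping[gene_id] = name
        else:
            mapping.setdefault(gene_id, gene_id)
    return mapping
```

The function accepts any iterable of lines, so an open file handle works directly; for a gzip-compressed GTF, gzip.open(path, 'rt') from the standard library yields text lines in the same way.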
Single-cell path (ov.alignment.ref + ov.alignment.count)
- Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
- ref() builds a kallisto index and transcript-to-gene mapping.
- count() quantifies single-cell data with barcode/UMI handling.
- Supports technologies: 10XV2, 10XV3, BULK, and custom.
- Output formats: h5ad, loom, cellranger MTX.
    # Build reference index
    ref_result = ov.alignment.ref(
        index_path='kb_ref/index.idx',
        t2g_path='kb_ref/t2g.txt',
        fasta_paths=['genome.fa'],
        gtf_paths=['genes.gtf'],
        threads=8,
    )
    # Quantify 10x v3 data
    count_result = ov.alignment.count(
        index_path='kb_ref/index.idx',
        t2g_path='kb_ref/t2g.txt',
        technology='10XV3',
        fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
        output_path='kb_out',
        h5ad=True,
        filter_barcodes=True,
        threads=8,
    )
Wiring fastp output into STAR input
- fastp output is a list of dicts with keys: sample, clean1, clean2, json, html.
- Convert to STAR sample tuples:
    star_samples = [
        (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
        for r in (clean if isinstance(clean, list) else [clean])
    ]
Wiring STAR output into featureCount input
- STAR output is a list of dicts with keys: sample, bam (or error).
- Convert to featureCount items:
    bam_items = [
        (r['sample'], r['bam'])
        for r in (bams if isinstance(bams, list) else [bams])
        if 'bam' in r
    ]
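The comprehension above silently drops samples that failed. When it is useful to report those failures as well, something along the following lines could be used. It relies only on the dict keys described above (sample, bam, error); split_star_results is a hypothetical helper, and the placeholder message for a result with neither bam nor error is an assumption.

```python
def split_star_results(results):
    """Separate successful and failed per-sample results.

    Accepts a single result dict or a list of dicts. Returns
    (bam_items, failures): bam_items is a list of (sample, bam_path)
    tuples, failures maps sample name to its error value.
    """
    if isinstance(results, dict):
        results = [results]
    bam_items, failures = [], {}
    for r in results:
        if r.get("bam"):
            bam_items.append((r["sample"], r["bam"]))
        else:
            failures[r["sample"]] = r.get("error", "no BAM reported")
    return bam_items, failures
```

A typical use would be bam_items, failures = split_star_results(bams), followed by logging failures before passing bam_items on to featureCount.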
Skipping completed steps
- All functions check for existing outputs and skip the step when overwrite=False (default).
- Set overwrite=True to force re-execution.
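The skip logic can be pictured with a small sketch. How OmicVerse decides that a step is complete is not documented here, so the rule below (every expected output file exists and is non-empty) is an assumption, and should_run is a hypothetical name.

```python
from pathlib import Path


def should_run(outputs, overwrite=False):
    """Return True when a step needs to run.

    The step runs if overwrite is requested, if no outputs are listed,
    or if any expected output file is missing or empty.
    """
    outputs = list(outputs)
    if overwrite or not outputs:
        return True
    return not all(
        Path(p).is_file() and Path(p).stat().st_size > 0 for p in outputs
    )
```

Treating an empty file as incomplete guards against a previous run that was interrupted right after creating its output file.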
Troubleshooting
- If a tool is not found, check that auto_install=True is set and that conda/mamba is accessible.
- For STAR index errors, ensure genome_fasta_files points to uncompressed or gzip-compressed FASTA files.
- For featureCounts paired-end detection errors, auto_fix=True handles most cases automatically.
- GTF files can be gzip-compressed; they are auto-decompressed as needed.
Critical API Reference
Sample Format Convention
All alignment functions use a consistent sample tuple format:
- FASTQ samples: (sample_name, fq1_path, fq2_path_or_None)
- BAM items: (sample_name, bam_path) or (sample_name, bam_path, is_paired_bool)
- Single samples can be passed as a single tuple; multiple as a list of tuples.
- When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.
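The single-in/single-out convention and the two BAM item shapes can be mirrored in user code with a pair of small helpers. This is an illustrative sketch only; both function names are hypothetical, and using None to mean "is_paired not given" is an assumption.

```python
def run_per_sample(samples, worker):
    """Apply worker to one sample tuple or a list of sample tuples.

    A single tuple in gives a single result out; a list in gives a list out.
    """
    single = isinstance(samples, tuple)
    items = [samples] if single else list(samples)
    results = [worker(item) for item in items]
    return results[0] if single else results


def normalize_bam_item(item):
    """Expand (name, bam) to (name, bam, None); keep an explicit is_paired."""
    if len(item) == 2:
        name, bam = item
        return name, bam, None  # None: pairing left to auto-detection
    name, bam, is_paired = item
    return name, bam, bool(is_paired)
```

For example, run_per_sample(('S1', 'a.bam'), normalize_bam_item) returns one 3-tuple, while passing a list of items returns a list of 3-tuples.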
Auto-installation
All functions support these parameters:
    auto_install=True   # Auto-install missing tools via conda/mamba
    overwrite=False     # Skip if outputs already exist
    threads=8           # Per-tool thread count
    jobs=None           # Concurrent job count (auto-detected from CPU count)
Examples
- Bulk RNA-seq from SRA: prefetch -> fqdump -> fastp -> STAR -> featureCount -> pandas DataFrame
- Single-cell 10x v3: ref -> count with technology='10XV3' -> h5ad AnnData
- Local FASTQ files: Skip download steps, start directly with fastp -> STAR -> featureCount
References
- See reference.md for copy-paste-ready code templates.