fastq-analysis-pipeline

OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the ov.alignment module.

Install skill "fastq-analysis-pipeline" with this command: npx skills add starlitnightly/omicverse/starlitnightly-omicverse-fastq-analysis-pipeline

Overview

OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the ov.alignment module. This skill covers:

  • SRA data acquisition: prefetch and fqdump (fasterq-dump wrapper)

  • Quality control: fastp for adapter trimming and QC reports

  • RNA-seq alignment: STAR aligner with auto-index building

  • Gene quantification: featureCount (subread featureCounts wrapper)

  • Single-cell path: ref and count via kb-python (kallisto/bustools)

  • Parallel SRA download: parallel_fastq_dump

All functions share a common CLI infrastructure (_cli_utils.py) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.

Instructions

Environment setup

  • Bioinformatics tools are resolved automatically from PATH or the active conda environment.

  • If auto_install=True (default), missing tools are installed via mamba/conda on demand.

  • Supported tools: prefetch, vdb-validate, fasterq-dump, fastp, STAR, samtools, featureCounts, pigz, gzip.

  • For the single-cell path, ensure kb-python is installed: pip install kb-python
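
Before relying on auto-installation, it can help to see which tools are already resolvable. The sketch below uses only the standard library's shutil.which. The helper name missing_tools is illustrative and not part of OmicVerse, and it only checks PATH; it does not reproduce OmicVerse's own resolution logic.

```python
import shutil

# Tool names taken from the supported-tools list above.
PIPELINE_TOOLS = [
    "prefetch", "vdb-validate", "fasterq-dump", "fastp",
    "STAR", "samtools", "featureCounts", "pigz", "gzip",
]


def missing_tools(tools=PIPELINE_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not on PATH:", ", ".join(missing))
else:
    print("All pipeline tools are available on PATH.")
```

Anything reported as missing here is a candidate for on-demand installation when auto_install=True.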

SRA data download (ov.alignment.prefetch + ov.alignment.fqdump)

  • Use prefetch first for reliable downloads with integrity validation (vdb-validate ).

  • Then convert to FASTQ with fqdump. It auto-detects single-end vs paired-end.

  • fqdump can also work directly from SRR accessions without prefetch.

  • Both support retry with exponential backoff for network errors.

    import omicverse as ov

    # Step 1: Prefetch SRA files (optional but recommended)
    pre = ov.alignment.prefetch(
        ['SRR1234567', 'SRR1234568'],
        output_dir='prefetch',
        jobs=4,
    )

    # Step 2: Convert to FASTQ
    fq = ov.alignment.fqdump(
        ['SRR1234567', 'SRR1234568'],
        output_dir='fastq',
        sra_dir='prefetch',
        gzip=True,
        threads=8,
        jobs=4,
    )
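
The retry behaviour mentioned above follows a common pattern: wait, double the delay, try again. The sketch below illustrates that general technique only. The function name, its parameters, and the choice to retry on OSError are assumptions for illustration, not OmicVerse's internal implementation.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call func(); after a retryable failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # delays grow as 1x, 2x, 4x, ...
```

Passing sleep as a parameter keeps the helper easy to test without real waiting; in normal use the default time.sleep applies.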

FASTQ quality control (ov.alignment.fastp)

  • Runs fastp for adapter trimming, quality filtering, and QC reporting.

  • Supports single-end and paired-end reads.

  • Produces per-sample JSON and HTML QC reports.

  • Sample format: tuple of (sample_name, fq1_path, fq2_path_or_None).

    samples = [
        ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
        ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
    ]

    clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
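
With many runs, writing the sample tuples by hand gets tedious. The sketch below builds them from a directory laid out like the paths in the example above (one sub-directory per run holding <run>_1.fastq.gz and <run>_2.fastq.gz). The helper name collect_samples is hypothetical, and the single-end file name <run>.fastq.gz is an assumption based on the usual fasterq-dump naming; adjust it if your files differ.

```python
from pathlib import Path


def collect_samples(fastq_dir):
    """Build (sample_name, fq1, fq2_or_None) tuples from <dir>/<run>/ sub-directories."""
    samples = []
    for run_dir in sorted(Path(fastq_dir).iterdir()):
        if not run_dir.is_dir():
            continue
        name = run_dir.name
        fq1 = run_dir / f"{name}_1.fastq.gz"
        fq2 = run_dir / f"{name}_2.fastq.gz"
        single = run_dir / f"{name}.fastq.gz"  # assumed single-end naming
        if fq1.exists():
            samples.append((name, str(fq1), str(fq2) if fq2.exists() else None))
        elif single.exists():
            samples.append((name, str(single), None))
    return samples
```

The returned list has the same shape as the hand-written samples list, so it can be passed as the first argument in its place.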

STAR alignment (ov.alignment.STAR)

  • Aligns FASTQ reads using the STAR aligner.

  • Auto-index building: set auto_index=True (default) with genome_fasta_files and gtf to build index automatically if missing.

  • Produces coordinate-sorted BAM files.

  • Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).

  • Use strict=False (default) for graceful error handling per sample.

    # Prepare samples from fastp output
    star_samples = [
        ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
        ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
    ]

    bams = ov.alignment.STAR(
        star_samples,
        genome_dir='star_index',
        output_dir='star_out',
        gtf='genes.gtf',
        genome_fasta_files=['genome.fa'],
        threads=8,
        memory='50G',
    )

Gene quantification (ov.alignment.featureCount)

  • Counts aligned reads per gene using featureCounts (subread).

  • Auto-detects paired-end from BAM headers (via pysam or samtools).

  • auto_fix=True (default) retries with corrected paired-end flag on error.

  • gene_mapping=True maps gene_id to gene_name from the GTF.

  • merge_matrix=True produces a combined count matrix across all samples.

    bam_items = [
        ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
        ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
    ]

    counts = ov.alignment.featureCount(
        bam_items,
        gtf='genes.gtf',
        output_dir='counts',
        gene_mapping=True,
        merge_matrix=True,
        threads=8,
    )
    # counts is a pandas DataFrame (gene_id x samples)

Single-cell path (ov.alignment.ref + ov.alignment.count)

  • Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.

  • ref() builds a kallisto index and transcript-to-gene mapping.

  • count() quantifies single-cell data with barcode/UMI handling.

  • Supports technologies: 10XV2, 10XV3, BULK, and custom.

  • Output formats: h5ad, loom, cellranger MTX.

    # Build reference index
    ref_result = ov.alignment.ref(
        index_path='kb_ref/index.idx',
        t2g_path='kb_ref/t2g.txt',
        fasta_paths=['genome.fa'],
        gtf_paths=['genes.gtf'],
        threads=8,
    )

    # Quantify 10x v3 data
    count_result = ov.alignment.count(
        index_path='kb_ref/index.idx',
        t2g_path='kb_ref/t2g.txt',
        technology='10XV3',
        fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
        output_path='kb_out',
        h5ad=True,
        filter_barcodes=True,
        threads=8,
    )
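
When a sample was sequenced across several lanes, the R1/R2 files are usually listed pair by pair (R1, R2, R1, R2, ...), which is the order kb count expects for paired barcode/cDNA reads. The sketch below builds such a list. It assumes that ov.alignment.count forwards fastq_paths to kb-python in the order given and that R1 and R2 file names sort into matching lane order; the helper name interleave_lanes is hypothetical.

```python
def interleave_lanes(r1_files, r2_files):
    """Return [R1_a, R2_a, R1_b, R2_b, ...] from parallel R1 and R2 lists."""
    if len(r1_files) != len(r2_files):
        raise ValueError("R1 and R2 lists must have the same length")
    paired = []
    # Sorting assumes lane identifiers (e.g. L001, L002) appear in both names.
    for r1, r2 in zip(sorted(r1_files), sorted(r2_files)):
        paired.extend([r1, r2])
    return paired
```

The result can then be supplied as fastq_paths in place of the two-file list shown above.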

Wiring fastp output into STAR input

  • fastp output is a list of dicts with keys: sample, clean1, clean2, json, html.

  • Convert to STAR sample tuples:

    star_samples = [
        (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
        for r in (clean if isinstance(clean, list) else [clean])
    ]

Wiring STAR output into featureCount input

  • STAR output is a list of dicts with keys: sample, bam (or error).

  • Convert to featureCount items:

    bam_items = [
        (r['sample'], r['bam'])
        for r in (bams if isinstance(bams, list) else [bams])
        if 'bam' in r
    ]
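
Because strict=False lets individual samples fail without stopping the run, it is worth recording which samples were dropped rather than silently filtering them out. The sketch below separates successes from failures using only the documented keys (sample, bam, error). The helper name split_star_results and the fallback message are illustrative.

```python
def split_star_results(results):
    """Separate STAR results into featureCount items and a dict of failed samples."""
    if isinstance(results, dict):  # a single sample returns a single dict
        results = [results]
    bam_items, failures = [], {}
    for r in results:
        if r.get("bam"):
            bam_items.append((r["sample"], r["bam"]))
        else:
            failures[r["sample"]] = r.get("error", "unknown error")
    return bam_items, failures
```

A typical use is to log the failures dict and continue with the BAM items that did succeed.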

Skipping completed steps

  • All functions check for existing outputs and skip if overwrite=False (default).

  • Set overwrite=True to force re-execution.
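
The skip logic can be pictured as a simple existence check on the expected output files. The sketch below shows that idea with the standard library. It is not OmicVerse's actual check, which may inspect different files or conditions, and the name should_run is hypothetical.

```python
from pathlib import Path


def should_run(expected_outputs, overwrite=False):
    """Return True if a step must run: overwrite requested, or an output is missing or empty."""
    if overwrite:
        return True
    for path in expected_outputs:
        path = Path(path)
        if not path.is_file() or path.stat().st_size == 0:
            return True
    return False
```

Treating empty files as missing guards against outputs left behind by an interrupted run.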

Troubleshooting

  • If a tool is not found, check auto_install=True and that conda/mamba is accessible.

  • For STAR index errors, ensure genome_fasta_files points to uncompressed or gzip-compressed FASTA files.

  • For featureCounts paired-end detection errors, auto_fix=True handles most cases automatically.

  • GTF files can be gzip-compressed; they are auto-decompressed as needed.

Critical API Reference

Sample Format Convention

All alignment functions use a consistent sample tuple format:

  • FASTQ samples: (sample_name, fq1_path, fq2_path_or_None)

  • BAM items: (sample_name, bam_path) or (sample_name, bam_path, is_paired_bool)

  • Single samples can be passed as a single tuple; multiple as a list of tuples.

  • When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.
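
The single-versus-list convention above can be handled in calling code with a small normalizer. The sketch below validates FASTQ sample tuples and remembers whether the input was a single tuple, so a wrapper can mirror the same return shape. The function name and error message are illustrative and not part of OmicVerse.

```python
def normalize_fastq_samples(samples):
    """Return (list_of_sample_tuples, was_single) for a tuple or a list of tuples."""
    was_single = isinstance(samples, tuple)
    items = [samples] if was_single else list(samples)
    normalized = []
    for item in items:
        if len(item) != 3:
            raise ValueError(
                f"expected (sample_name, fq1_path, fq2_path_or_None), got {item!r}"
            )
        name, fq1, fq2 = item
        # An empty or missing second read file means single-end.
        normalized.append((str(name), str(fq1), str(fq2) if fq2 else None))
    return normalized, was_single
```

A wrapper would then return results[0] when was_single is true and the full list otherwise.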

Auto-installation

All functions support these parameters:

    auto_install=True   # Auto-install missing tools via conda/mamba
    overwrite=False     # Skip if outputs already exist
    threads=8           # Per-tool thread count
    jobs=None           # Concurrent job count (auto-detected from CPU count)
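
Since threads applies per tool invocation and jobs sets how many invocations run at once, the two multiply. The sketch below shows one plausible way to pick jobs so that jobs x threads stays within the available cores. The heuristic and the name plan_parallelism are assumptions for illustration; the document does not state how OmicVerse derives jobs from the CPU count.

```python
import os


def plan_parallelism(n_samples, threads=8, jobs=None, cpu_count=None):
    """Pick a job count so that jobs * threads roughly fits the available CPUs."""
    cpus = cpu_count or os.cpu_count() or 1
    if jobs is None:
        # Assumed heuristic: divide the cores among per-tool thread pools.
        jobs = max(1, cpus // max(1, threads))
    # Never schedule more concurrent jobs than there are samples.
    return max(1, min(jobs, n_samples))
```

On a 32-core machine with threads=8 this yields 4 concurrent jobs for four or more samples; an explicit jobs value is only capped by the sample count.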

Examples

  • Bulk RNA-seq from SRA: prefetch -> fqdump -> fastp -> STAR -> featureCount -> pandas DataFrame

  • Single-cell 10x v3: ref -> count with technology='10XV3' -> h5ad AnnData

  • Local FASTQ files: Skip download steps, start directly with fastp -> STAR -> featureCount

References

  • See reference.md for copy-paste-ready code templates.
