bio-population-genetics-scikit-allel-analysis

scikit-allel Analysis

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "bio-population-genetics-scikit-allel-analysis" with this command: npx skills add gptomics/bioskills/gptomics-bioskills-bio-population-genetics-scikit-allel-analysis

scikit-allel Analysis

Python library for population genetics analysis with efficient array data structures.

Installation

pip install scikit-allel

Optional: zarr for chunked storage

pip install zarr

Reading VCF Files

Load VCF

import allel

callset = allel.read_vcf('data.vcf.gz')

print(callset.keys())

dict_keys(['samples', 'calldata/GT', 'variants/CHROM', 'variants/POS', 'variants/REF', 'variants/ALT', ...])

samples = callset['samples'] genotypes = callset['calldata/GT'] positions = callset['variants/POS'] chroms = callset['variants/CHROM']

Specify Fields

callset = allel.read_vcf('data.vcf.gz', fields=['samples', 'calldata/GT', 'variants/POS', 'variants/CHROM', 'variants/QUAL'])

callset = allel.read_vcf('data.vcf.gz', fields='*') # All fields

callset = allel.read_vcf('data.vcf.gz', region='chr1:1000000-2000000', samples=['sample1', 'sample2'])

Large Files (Chunked)

import zarr

allel.vcf_to_zarr('large.vcf.gz', 'data.zarr', fields='*', overwrite=True) callset = zarr.open('data.zarr', mode='r') gt = allel.GenotypeArray(callset['calldata/GT'])

Genotype Arrays

GenotypeArray

gt = allel.GenotypeArray(callset['calldata/GT'])

print(gt.shape) # (n_variants, n_samples, ploidy) print(gt.n_variants) print(gt.n_samples)

print(gt[0]) # Genotypes at first variant print(gt[:, 0]) # All variants for first sample

Basic Operations

ac = gt.count_alleles() print(ac.shape) # (n_variants, n_alleles)

af = ac.to_frequencies() is_segregating = ac.is_segregating() gt_filtered = gt.compress(is_segregating, axis=0)

Missing Data

is_called = gt.is_called() is_missing = gt.is_missing()

miss_per_variant = (~is_called).sum(axis=1) miss_per_sample = (~is_called).sum(axis=0)

call_rate_variant = is_called.mean(axis=1) call_rate_sample = is_called.mean(axis=0)

Allele Counts and Frequencies

ac = gt.count_alleles() ac_ref = ac[:, 0] ac_alt = ac[:, 1]

af = ac.to_frequencies() maf = af.min(axis=1)

n_singletons = (ac[:, 1] == 1).sum() n_doubletons = (ac[:, 1] == 2).sum()

By Population

subpops = { 'pop1': [0, 1, 2, 3, 4], 'pop2': [5, 6, 7, 8, 9] }

ac_subpops = gt.count_alleles_subpops(subpops)

ac_pop1 = ac_subpops['pop1'] ac_pop2 = ac_subpops['pop2']

Haplotype Arrays

h = gt.to_haplotypes() print(h.shape) # (n_variants, n_haplotypes) print(h.n_haplotypes)

ac_hap = h.count_alleles()

PCA

import allel import numpy as np

gn = gt.to_n_alt(fill=-1) gn_filtered = gn[is_segregating] gn_imputed = np.where(gn_filtered < 0, 0, gn_filtered)

coords, model = allel.pca(gn_imputed, n_components=10, scaler='patterson') print(coords.shape) # (n_samples, n_components)

Plot PCA

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6)) plt.scatter(coords[:, 0], coords[:, 1], c=population_labels) plt.xlabel('PC1') plt.ylabel('PC2') plt.savefig('pca.png')

Diversity Statistics

Heterozygosity

ho = allel.heterozygosity_observed(gt) he = allel.heterozygosity_expected(ac, ploidy=2)

mean_ho = np.mean(ho) mean_he = np.mean(he)

Nucleotide Diversity (Pi)

pi = allel.sequence_diversity(positions, ac) print(f'Pi = {pi:.6f}')

windows = allel.moving_statistic(positions, statistic=lambda x: allel.sequence_diversity(x, ac), size=10000, step=5000)

Watterson's Theta

theta_w = allel.watterson_theta(positions, ac) print(f'Theta_W = {theta_w:.6f}')

Site Frequency Spectrum

sfs = allel.sfs(ac[:, 1])

plt.figure(figsize=(10, 5)) allel.plot_sfs(sfs) plt.savefig('sfs.png')

Folded SFS

sfs_folded = allel.sfs_folded(ac)

plt.figure(figsize=(10, 5)) allel.plot_sfs_folded(sfs_folded) plt.savefig('sfs_folded.png')

Windowed Statistics

pos = np.array(positions) windows = np.arange(0, pos.max(), 100000)

pi_windowed, windows_used, n_bases, counts = allel.windowed_diversity(pos, ac, size=100000, step=50000)

plt.figure(figsize=(14, 4)) plt.plot(windows_used[:, 0], pi_windowed) plt.xlabel('Position') plt.ylabel('Pi') plt.savefig('pi_windows.png')

Sample Subsetting

pop1_idx = np.array([0, 1, 2, 3, 4]) pop2_idx = np.array([5, 6, 7, 8, 9])

gt_pop1 = gt.take(pop1_idx, axis=1) gt_pop2 = gt.take(pop2_idx, axis=1)

ac_pop1 = gt_pop1.count_alleles() ac_pop2 = gt_pop2.count_alleles()

Filter Variants

is_snp = callset['variants/is_snp'] is_biallelic = ac.max_allele() == 1 is_segregating = ac.is_segregating() qual = callset['variants/QUAL'] is_high_qual = qual > 30

flt = is_snp & is_biallelic & is_segregating & is_high_qual

gt_filtered = gt.compress(flt, axis=0) pos_filtered = positions[flt]

Complete Workflow Example

import allel import numpy as np

callset = allel.read_vcf('data.vcf.gz', fields=['samples', 'calldata/GT', 'variants/POS']) gt = allel.GenotypeArray(callset['calldata/GT']) pos = callset['variants/POS'] samples = callset['samples']

ac = gt.count_alleles() flt = ac.is_segregating() & (ac.max_allele() == 1) gt = gt.compress(flt, axis=0) pos = pos[flt] ac = gt.count_alleles()

print(f'Variants after filtering: {gt.n_variants}') print(f'Samples: {gt.n_samples}') print(f'Nucleotide diversity: {allel.sequence_diversity(pos, ac):.6f}') print(f'Mean Het observed: {allel.heterozygosity_observed(gt).mean():.4f}')

gn = gt.to_n_alt(fill=-1) gn = np.where(gn < 0, 0, gn) coords, model = allel.pca(gn, n_components=10, scaler='patterson')

Related Skills

  • selection-statistics - Fst, Tajima's D, iHS with scikit-allel

  • linkage-disequilibrium - LD calculations in Python

  • variant-calling/vcf-basics - VCF format and bcftools

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Research

bio-microbiome-diversity-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

bio-proteomics-dia-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

bio-restriction-fragment-analysis

No summary provided by upstream source.

Repository SourceNeeds Review