popout

GPU-accelerated local ancestry inference at biobank scale — no reference panel required.

Feed it phased WGS from a large cohort and ancestry structure falls out of the joint distribution.

How it works

With 500K+ samples, the data is the reference panel. See docs/THEORY.md for the full mathematical treatment. The pipeline:

SEED — Randomized SVD on a SNP subset projects all haplotypes into PCA space. GMM assigns soft ancestry labels. Number of ancestries auto-detected via recursive hierarchical splitting (BIC-based binary splits on sub-PCA projections) or eigenvalue gap heuristic.
INIT — Allele frequencies per ancestry computed from soft GMM assignments via weighted GEMM. Window-based refinement handles admixed haplotypes.
REFINE — Two refinement backends, selectable via --method:
- HMM (default) — Forward-backward HMM with A states (not K reference haplotypes — just A ancestries). State space is tiny: 8 floats per haplotype. All haplotypes run simultaneously on GPU via gradient-checkpointed scan (O(√T) memory). EM iteration: M-step updates allele frequencies (with optional rare-variant smoothing), ancestry proportions μ, and generations since admixture T (global or per-haplotype). Converges in 2-3 iterations with large samples.
- CNN / CNN-CRF — 1D dilated convolutional network that processes all sites in parallel and learns multi-site LD patterns via self-supervised pseudo-label refinement. Optional CRF output layer adds learned transition modeling. See docs/CNN.md.
DECODE — Final posteriors from either backend. Argmax gives hard ancestry calls.

Key design choices

A-state HMM, not K-state. FLARE showed that composite reference haplotypes reduce effective K. We go further: with self-derived allele frequencies from 500K samples, we don't need individual reference haplotypes at all. The emission model is P(allele | ancestry, site) = allele frequency. Optionally, --block-emissions uses k-SNP haplotype pattern matching for LD-aware inference.
Sequential over sites, parallel over haplotypes. The forward algorithm must be sequential across T sites. But all H haplotypes are independent. An A100 can hold 1M+ haplotypes' forward state (32 MB) simultaneously. Gradient checkpointing reduces stored forward states from T to √T.
Memory-bandwidth bound. The A×A transition matrix product is trivially cheap. The bottleneck is reading/writing the forward state vector at each step. This means the GPU is at <1% ALU utilization — block emissions exploit this headroom for richer per-step computation with fewer total steps.
Pluggable refinement backend. The M-step operates on (H, T, A) posterior tensors regardless of how they were produced. The CNN backend is a drop-in alternative to the HMM: same spectral seed, same M-step, same decode and output — only the inference engine changes. This lets you choose the right tool: HMM for interpretability and small cohorts, CNN for LD-aware inference and incremental sample addition.

Usage

# From PGEN (recommended for biobank-scale data)
popout --pgen /path/to/per_chrom_pgens/ \
       --map plink.GRCh38.map \
       --out results/cohort \
       --thin-cm 0.02

# From VCF
popout --vcf cohort.phased.vcf.gz \
       --map plink.GRCh38.map \
       --out results/cohort \
       --n-ancestries 6

# With per-haplotype admixture time and block emissions
popout --pgen /path/to/pgens/ \
       --map plink.GRCh38.map \
       --out results/cohort \
       --thin-cm 0.02 \
       --per-hap-T \
       --block-emissions

# CNN backend (LD-aware, fully parallel over sites)
popout --pgen /path/to/pgens/ \
       --map plink.GRCh38.map \
       --out results/cohort \
       --thin-cm 0.02 \
       --method cnn

# CNN-CRF backend (adds learned transition smoothing)
popout --pgen /path/to/pgens/ \
       --map plink.GRCh38.map \
       --out results/cohort \
       --thin-cm 0.02 \
       --method cnn-crf

# Export a reference panel for downstream tools (FLARE, RFMix)
popout --pgen /path/to/pgens/ \
       --map plink.GRCh38.map \
       --out results/cohort \
       --export-panel \
       --panel-threshold 0.99

# Generate QC report from a completed run
popout report --stats results/cohort.summary.json --out report/

Preparing PGEN input from VCF

for chr in {1..22}; do
  plink2 --vcf chr${chr}.phased.vcf.gz --make-pgen phased-list --out pgen_dir/chr${chr}
done

The phased-list flag preserves phase information. Point --pgen at the directory.

Installation

# GPU (recommended)
pip install -e ".[dev]"

# CPU (slow but works)
pip install -e ".[dev]" --no-deps
pip install jax numpy pysam Pgenlib matplotlib pytest

Docker

docker build -t popout .
docker run --gpus all -v /data:/data popout \
    --pgen /data/pgens/ --map /data/map.txt --out /data/results

Output files

File	Description
`{prefix}.global.tsv`	Per-sample global ancestry proportions
`{prefix}.tracts.tsv.gz`	Local ancestry tracts (BED-like intervals per haplotype)
`{prefix}.model`	Fitted model parameters (n_ancestries, mu, T)
`{prefix}.stats.jsonl`	Timestamped runtime metrics (for live monitoring)
`{prefix}.summary.json`	Aggregated QC stats (consumed by `popout report`)
`{prefix}.panel.haplotypes.tsv`	Single-ancestry whole haplotypes (with `--export-panel`)
`{prefix}.panel.segments.tsv.gz`	High-confidence single-ancestry segments (with `--export-panel`)
`{prefix}.panel.frequencies.tsv.gz`	Per-ancestry allele frequencies (with `--export-panel`)
`{prefix}.panel.proportions.tsv`	Per-haplotype ancestry proportions (with `--export-panel`)

See docs/PANEL.md for output formats, thresholds, and a worked example of the two-stage popout → FLARE pipeline.

Tract format

#chrom  start_bp   end_bp     sample    haplotype  ancestry  n_sites
chr1    100000     5200000    SAMPLE_0  0          0         45
chr1    5250000    12800000   SAMPLE_0  0          2         62

Monitoring

# Live monitoring with Weights & Biases
popout --pgen data/ --map map.txt --out results --monitor wandb

# Live monitoring with TensorBoard
popout --pgen data/ --map map.txt --out results --monitor tensorboard

# Disable stats files entirely
popout --pgen data/ --map map.txt --out results --no-stats

Architecture

pgen_io.py     PGEN reader (biobank-scale, chunked, with site thinning)
vcf_io.py      VCF/BCF reader (pysam, for smaller datasets)
gmap.py        Genetic map loading and chromosome normalization
spectral.py    Randomized SVD + GMM + hierarchical ancestry detection
hmm.py         Forward-backward HMM in JAX with gradient checkpointing
em.py          EM loop: seed → init → iterate → decode, with freq smoothing
blocks.py      Block-level haplotype pattern encoding for LD-aware emissions
cnn/           CNN-CRF refinement backend (--method cnn or cnn-crf)
  model.py       Dilated residual 1D CNN architecture (pure JAX)
  crf.py         Linear-chain CRF output layer
  train.py       Adam optimizer, KL loss, self-training loop
  features.py    Input feature construction (allele + freq + distance channels)
  refine.py      Pipeline entry points (run_cnn, run_cnn_genome)
output.py      Tract TSV, global ancestry TSV, model file writers, panel export
stats.py       Runtime metrics collection (JSONL + W&B/TensorBoard)
report.py      QC report generation (matplotlib plots from summary JSON)
cli.py         Command-line interface
datatypes.py   Core data structures (ChromData, AncestryModel, etc.)
simulate.py    Simulated admixed data + accuracy evaluation
demo.py        Standalone demo on simulated data

Performance estimates

Samples	Device	Wall clock (22 chroms)
1K	CPU	~1 minute
10K	CPU	~10 minutes
10K	A100	~30 seconds
100K	A100	~5 minutes
500K	A100	~15 minutes
500K	8×A100	~3 minutes

Options

Flag	Default	Description
`--method`	`hmm`	Refinement backend: `hmm`, `cnn`, or `cnn-crf`
`--ancestry-detection`	`recursive`	Auto-detection method: `recursive` (hierarchical BIC splitting) or `eigenvalue-gap`
`--per-hap-T`	off	Estimate per-haplotype admixture time instead of a single global T
`--n-T-buckets`	20	Number of transition-matrix buckets for per-haplotype T
`--block-emissions`	off	Use k-SNP haplotype pattern matching instead of single-site Bernoulli
`--block-size`	8	SNPs per block when using block emissions
`--smooth-bandwidth-cm`	0.05	Gaussian kernel bandwidth (cM) for rare-variant frequency smoothing (0 = disabled)
`--smooth-maf-threshold`	0.05	MAF threshold below which frequencies are smoothed
`--thin-cm`	none	Minimum cM spacing for site thinning (recommended: 0.02 for WGS)
`--export-panel`	off	Export reference panel files for downstream tools
`--panel-threshold`	0.95	Min posterior for whole-haplotype extraction
`--panel-segment-threshold`	0.99	Min per-site posterior for segment extraction
`--panel-min-segment-cm`	1.0	Min segment length in cM
`--panel-max-per-ancestry`	all	Cap on haplotypes per ancestry (whole-haplotype output)
`--cnn-layers`	12	Number of dilated conv layers (`--method cnn`/`cnn-crf`)
`--cnn-channels`	64	Hidden channel dimension (`--method cnn`/`cnn-crf`)
`--cnn-epochs`	5	Training epochs per pseudo-label round
`--cnn-pseudo-rounds`	2	Number of self-training rounds
`--cnn-lr`	1e-3	CNN learning rate
`--cnn-batch-size`	512	Haplotypes per CNN training/inference batch

CNN refinement backend

With --method cnn, popout replaces the HMM forward-backward with a 1D dilated CNN that processes all sites in parallel and learns multi-site LD patterns via self-supervised pseudo-label refinement. Add --method cnn-crf for an additional CRF output layer that learns transition penalties for smoother tract boundaries. The CNN is particularly useful for fine-scale ancestry (closely related populations) and for incremental inference on new samples without retraining.

See docs/CNN.md for architecture details, training procedure, and guidance on when to use CNN vs HMM.

Reference panel export

With --export-panel, popout extracts single-ancestry haplotypes and segments from its posteriors, producing a ready-made reference panel for downstream LD-aware tools like FLARE or RFMix. No manual curation is required — in a 500K biobank, this typically yields ~850K single-ancestry haplotypes, an order of magnitude larger than any curated panel.

See docs/PANEL.md for full details and the two-stage popout → FLARE pipeline.

Limitations

No multi-GPU support yet. Use --chromosomes to manually partition across GPUs.
LD-aware block boundaries (using r² decay or recombination maps) are not yet implemented; blocks are fixed-width.
Rare-ancestry detection (clusters with <1% of samples) needs further work.

What's next

LD-aware variable-width blocks (r² decay threshold or external block maps)
jax.pmap multi-GPU parallelism
Rare-ancestry detection (clusters with <1% of samples)
Benchmark CNN vs HMM on simulated data across F_ST regimes
Benchmark against FLARE on simulated data (two-stage pipeline; see docs/PANEL.md)

References

Browning SR, Waples RK, Browning BL. Fast, accurate local ancestry inference with FLARE. AJHG 2023.
Li N, Stephens M. Modeling linkage disequilibrium. Genetics 2003.
Halko N, Martinsson PG, Tropp JA. Finding structure with randomness. SIAM Review 2011.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
docs		docs
popout		popout
tests		tests
vcf2pgen		vcf2pgen
.dockerignore		.dockerignore
.dockstore.yml		.dockstore.yml
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

popout

How it works

Key design choices

Usage

Preparing PGEN input from VCF

Installation

Docker

Output files

Tract format

Monitoring

Architecture

Performance estimates

Options

CNN refinement backend

Reference panel export

Limitations

What's next

References

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

popout

How it works

Key design choices

Usage

Preparing PGEN input from VCF

Installation

Docker

Output files

Tract format

Monitoring

Architecture

Performance estimates

Options

CNN refinement backend

Reference panel export

Limitations

What's next

References

About

Resources

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages