CMU x NVIDIA Hackathon, January 7-9, 2026

OmniGenome: Pangenome and Genomic Cluster Modeling

Problem

Pangenomes are genetic maps that capture DNA variation across many individuals instead of relying on a single reference genome. They promise a more accurate representation of human genetic diversity, yet current state-of-the-art pangenome graphs are built from only tens of individuals and so miss much of the variation across populations. Meanwhile, major biobanks such as the UK Biobank and All of Us have amassed high-quality long-read sequencing data from thousands of participants, a largely untapped resource that could make pangenomes far more comprehensive and diverse. A critical technical gap remains: no existing method lets separate biobanks collaborate and integrate their data into a unified pangenome graph. Each dataset therefore stays siloed, and population-scale pangenomes that reflect global genetic diversity remain unrealized, which limits both the research and the clinical applications that depend on understanding variation across all human populations.

Intro and aim

This project aims to:

Perform federated pangenome graph construction (using HPRC data as a proof of principle)
Perform federated genomic background hashing for phenotype association at the APOE locus (using 1000 Genomes data)

Contributors

Rob Loughnan
Adam Kehl
Jedrzej Kubica
Kumar Koushik Telaprolu
Jeff Winchell
Sanjnaa Sridhar

Quick Start

Step 1: Find locus of interest by filtering for low p-values from a GWAS

Find the locus using the GWAS Jupyter notebook (download summary statistics from the GWAS Catalog).

Step 2: Build the pangenome graph and Giraffe indexes

# 1. Start Docker daemon
systemctl start docker

# 2. Build containers
docker compose build

# 3. Run PGGB to build graph
docker compose run pggb pggb \
  -i /data/input.fa.gz \
  -o /output/my_graph \
  -n 12 -t 8 -p 90 -s 10000

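# 4. Convert to Giraffe format
# (the *.smooth.final.gfa glob is expanded by the host shell, not inside the container;
#  if it does not resolve on the host, pass the exact GFA file name instead)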
docker compose run vg autoindex --workflow giraffe \
  -g /data/my_graph/*.smooth.final.gfa \
  -p /data/my_graph/giraffe_index

Methods

1) Federated pangenome graph construction

Flowchart

a) Download Data from HPRC

Download the HPRC assemblies used to build the graphs:

download_path=/path/to/download/destination/
python ./HPRC_download_prep/download_hprc.py \
./HPRC_download_prep/assemblies.tsv \
"$download_path"

Extract Chromosome 19 and 22 from HPRC samples

Because pangenome graph construction is computationally intensive, we run the process on chromosome 19 and chromosome 22 only, as a more tractable dataset.

Install Entrez Direct using conda

The contig names in the HPRC FASTA files are NCBI accessions, which must be looked up with the Entrez Direct (edirect) tools to convert them to conventional chromosome names (e.g. CM102454.1 -> chr22). Entrez Direct can be installed with conda:

conda install bioconda::entrez-direct

With Entrez Direct installed, the following Python script extracts chr19 and chr22 from the HPRC assembly FASTA files:

# This should be the path to the assemblies downloaded above
download_path='/path/to/download/destination/'
python ./HPRC_download_prep/make_hrcp_chr22_fasta.py \
/space/ceph/1/ABCD/users/rloughna/pangenome_construction/hprc_chr22_pansn_full.fa.gz \
--output-chr19 /space/ceph/1/ABCD/users/rloughna/pangenome_construction/hprc_chr19_pansn_full.fa.gz \
--n-individuals 20 \
--bgzip
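The accession-to-chromosome renaming described above can be sketched as follows. This is a minimal illustration and not the logic of make_hrcp_chr22_fasta.py: the function name `pansn_header`, the sample name, and the hard-coded lookup table are hypothetical, and in practice the table would be filled from the Entrez Direct lookups. It assumes the PanSN naming convention (sample#haplotype#contig) that the output file names suggest.

```python
def pansn_header(fasta_header, sample, haplotype, accession_to_chrom):
    """Return a PanSN-style name (sample#haplotype#contig) for a FASTA header.

    Returns None when the header is empty or its accession is not in the table.
    """
    # The accession is the first whitespace-delimited token after '>'.
    tokens = fasta_header.lstrip(">").split()
    if not tokens:
        return None
    chrom = accession_to_chrom.get(tokens[0])
    if chrom is None:
        return None
    return f"{sample}#{haplotype}#{chrom}"


# Hypothetical lookup table; only the CM102454.1 -> chr22 pair comes from the text.
accession_to_chrom = {"CM102454.1": "chr22"}

# "SAMPLE1" is a placeholder sample name.
print(pansn_header(">CM102454.1 some description", "SAMPLE1", 1, accession_to_chrom))
```

Contigs that return None are neither chr19 nor chr22 under this table and would simply be skipped when writing the output FASTA.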

b) Create Partitions of Data to Simulate Biobank Cohorts

WIP
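Since this step is still in progress, the following is only one possible approach, not the project's implementation. It splits sample IDs reproducibly into a chosen number of simulated cohorts, assigning each individual to exactly one cohort so that both haplotypes of a sample stay together. The function name `partition_samples`, the seed, and the placeholder sample names are assumptions made for illustration.

```python
import random


def partition_samples(sample_ids, n_cohorts, seed=0):
    """Shuffle sample IDs reproducibly and deal them round-robin into cohorts."""
    if n_cohorts < 1:
        raise ValueError("n_cohorts must be at least 1")
    # De-duplicate and sort first so the seed alone determines the result.
    ids = sorted(set(sample_ids))
    random.Random(seed).shuffle(ids)
    # Slice with a stride so cohort sizes differ by at most one.
    return [ids[i::n_cohorts] for i in range(n_cohorts)]


# Placeholder sample names standing in for HPRC individuals.
samples = [f"sample{i:02d}" for i in range(20)]
cohorts = partition_samples(samples, n_cohorts=3)
for index, cohort in enumerate(cohorts):
    print(f"cohort {index}: {len(cohort)} samples")
```

Each cohort's sample list can then be used to subset the multi-sample FASTA before running graph construction separately per cohort.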

c) Graph Construction

Prerequisites

Docker CE with Compose plugin (v5.0+)
At least 8 GB RAM recommended

PGGB (Pangenome Graph Builder)

Builds pangenome graphs from multi-sample FASTA files.

docker compose run pggb pggb \
  -i /data/<input.fa.gz> \
  -o /output/<run_name> \
  -n <num_haplotypes> \
  -t 8 -p 90 -s 10000
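The value passed to -n should match the number of haplotypes in the input FASTA. As a hedged sketch, the helper below counts them from the sequence headers, assuming PanSN-style names (sample#haplotype#contig); the function name `count_haplotypes` and the demo file contents are hypothetical.

```python
import gzip
import os
import tempfile


def count_haplotypes(fasta_path):
    """Count distinct sample#haplotype prefixes among the FASTA headers."""
    haplotypes = set()
    # gzip.open also reads bgzip-compressed files.
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            tokens = line[1:].split()
            if not tokens:
                continue
            parts = tokens[0].split("#")
            if len(parts) >= 3:
                haplotypes.add((parts[0], parts[1]))
            else:
                # Not PanSN: treat the whole name as its own haplotype.
                haplotypes.add((tokens[0],))
    return len(haplotypes)


# Demo on a tiny throwaway file with placeholder names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fa.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(">S1#1#chr22\nACGT\n>S1#2#chr22\nACGT\n>S2#1#chr22\nACGT\n")
    demo_count = count_haplotypes(demo_path)
    print(demo_count)
```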

VG (Variation Graph Toolkit)

Converts GFA graphs to Giraffe-compatible formats (GBZ, dist, min).

docker compose run vg autoindex --workflow giraffe \
  -g /data/<graph.gfa> \
  -p /data/<output_prefix>

Graph Output

d) Aggregate Graphs from Individual Cohorts

WIP: vg combine from the vg toolkit looks promising.

2) Genomic background hashing for phenotype association of APOE locus

We extract the locus around the APOE gene and use it as the region for localized pangenome graph mapping. Pangenome graph mapping may then provide a genomic background to contextualize the high-risk APOE alleles. This genomic background may capture trans expression effects, and we aim to encode it using genomic hashing: each haploblock is represented by an anonymized hash that can be shared in a federated manner across studies.
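One way the hashing idea could be coded is sketched below. It is an assumption-laden illustration rather than the project's method: it supposes that haploblock boundaries are already defined, that each haplotype within a block is available as an ordered list of alleles, and that participating sites share a secret key. The function name `haploblock_hash`, the block ID, and the key are hypothetical.

```python
import hashlib
import hmac


def haploblock_hash(alleles, block_id, key, digest_chars=16):
    """Return a keyed, truncated SHA-256 digest for one haplotype in one block."""
    # Join alleles with a separator so ["A", "CG"] and ["AC", "G"] differ.
    message = (block_id + "|" + ",".join(alleles)).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:digest_chars]


# Placeholder key and block ID; real values would be agreed between sites.
shared_key = b"placeholder-shared-secret"
block = "block_0001"

site_a = haploblock_hash(["A", "CG", "T"], block, shared_key)
site_b = haploblock_hash(["A", "CG", "T"], block, shared_key)
other = haploblock_hash(["A", "C", "GT"], block, shared_key)

print(site_a == site_b)  # identical haplotypes map to the same code
print(site_a == other)   # different haplotypes map to different codes
```

A keyed hash (HMAC) is used here instead of a plain hash so that the codes cannot be recomputed from candidate sequences by anyone who lacks the key. Sites holding the same key produce matching codes for matching haplotypes, which is what allows the codes to be compared across studies.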

Focus on APOE Locus

Summary statistics were downloaded from the GWAS Catalog:

wget https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST013001-GCST014000/GCST013197/GCST013197.tsv.gz
gunzip GCST013197.tsv.gz
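The filtering for low p-values around a locus can be sketched with the standard library as follows. The column names (`chromosome`, `base_pair_location`, `p_value`), the coordinate window, and the toy rows are assumptions for illustration only: check the header of the downloaded file and pass the matching names, and replace the window with the coordinates appropriate to the genome build of the summary statistics. The default threshold is the conventional genome-wide significance level of 5e-8.

```python
import csv
import io


def significant_hits(rows, chrom, start, end, p_threshold=5e-8,
                     chrom_col="chromosome", pos_col="base_pair_location",
                     p_col="p_value"):
    """Yield rows inside chrom:start-end whose p-value is below the threshold."""
    for row in rows:
        row_chrom = row[chrom_col]
        if row_chrom.startswith("chr"):
            row_chrom = row_chrom[3:]
        if row_chrom != str(chrom):
            continue
        try:
            position = int(row[pos_col])
            p_value = float(row[p_col])
        except ValueError:
            continue  # skip rows with missing or non-numeric values
        if start <= position <= end and p_value < p_threshold:
            yield row


# Toy table standing in for the real file; all values are made up.
toy_tsv = (
    "chromosome\tbase_pair_location\tp_value\n"
    "19\t45000000\t1e-30\n"
    "19\t45000100\t0.2\n"
    "1\t45000000\t1e-12\n"
    "19\t50000000\t1e-9\n"
    "19\t45000200\tNA\n"
)
reader = csv.DictReader(io.StringIO(toy_tsv), delimiter="\t")
# Illustrative window only, not the coordinates used by the project.
hits = list(significant_hits(reader, chrom=19, start=44_000_000, end=46_000_000))
print(len(hits))
```

For the real file, open GCST013197.tsv in text mode and pass `csv.DictReader(handle, delimiter="\t")` in place of the toy reader.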

Extract Locus from Pre-constructed Pangenome Graph

WIP: odgi extract will be used; the details are still being worked out.

Results

Conclusions

References

Wightman DP et al. A genome-wide association study with 1,126,563 individuals identifies new risk loci for Alzheimer’s disease. Nature Genetics, 2021.
Garrison E et al. Building pangenome graphs. Nature Methods, 2024.
Garrison E et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology, 2018.
Guarracino A, Heumos S, Nahnsen S, Prins P, Garrison E. ODGI: understanding pangenome graphs. Bioinformatics, 2022.
Sirén J et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science, 2021.
The Gulbenkian Training Programme in Bioinformatics. GTPB/CPANG18: Computational Pangenomics (2018). Zenodo, 2020.
Heumos S et al. Cluster-efficient pangenome graph construction with nf-core/pangenome. Bioinformatics, 2024.
Liao W-W et al. A draft human pangenome reference. Nature, 2023.
Belloy ME, Napolioni V, Greicius MD. A quarter century of APOE and Alzheimer’s disease: Progress to date and the path forward. Neuron, 2019.
Byrska-Bishop M et al. High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. Cell, 2022.
Li H, Feng X, Chu C. The design and construction of reference pangenome graphs with minigraph. Genome Biology, 2020.
