Practical Haplotype Graph v2.4+ Analysis Scripts

This repository contains scripts to assist with analyzing outputs from the Practical Haplotype Graph (PHG) version 2.4 and above.

The PHG, developed by the Buckler Lab, is a computational tool for creating, storing, and using a pangenome for genomic imputation. For more information and full documentation, please visit the official PHG website.

While this repository provides custom tools and workflows for PHG-based analysis, it is not officially affiliated with the PHG software or its authors. These scripts were developed for specific use cases but are shared here in the hope that they may be useful for others working with PHG data.

Script: `imputed_parents_merged_vcf.py`

This script combines a PHG-derived merged parent VCF with a directory of imputed parents files. It produces a final imputed VCF in which, for each reference range, every sample carries the genotype calls of a single parent: the one imputed for that range.
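The idea can be illustrated with a minimal Python sketch. It is not the script's actual implementation: the function names, the in-memory record layout, and the `imputed_parents` dictionary (sample name to a mapping of reference range to parent name) are all assumptions made for illustration. It assumes reference ranges come from a BED file (0-based, half-open) and that VCF positions are 1-based.

```python
from bisect import bisect_right
from collections import defaultdict


def index_ranges(ref_ranges):
    """Group BED-style (chrom, start, end) ranges by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end in ref_ranges:
        by_chrom[chrom].append((start, end))
    for ranges in by_chrom.values():
        ranges.sort()
    return by_chrom


def find_range(by_chrom, chrom, pos):
    """Return the (chrom, start, end) range containing 1-based position pos, or None."""
    ranges = by_chrom.get(chrom, [])
    # BED is 0-based half-open, so a 1-based pos is inside when start < pos <= end.
    i = bisect_right(ranges, (pos - 1, float("inf"))) - 1
    if i >= 0:
        start, end = ranges[i]
        if start < pos <= end:
            return (chrom, start, end)
    return None


def project_calls(record, by_chrom, imputed_parents):
    """Give each imputed sample the call of the parent assigned to it in this range."""
    key = find_range(by_chrom, record["chrom"], record["pos"])
    calls = {}
    for sample, parent_by_range in imputed_parents.items():
        parent = parent_by_range.get(key)
        # No parent assigned, or site outside every range: emit a missing call.
        calls[sample] = record["calls"].get(parent, ".")
    return calls


# Hypothetical toy data: two ranges, two parents, one imputed sample.
by_chrom = index_ranges([("chr1", 0, 100), ("chr1", 100, 200)])
imputed_parents = {"S1": {("chr1", 0, 100): "P1", ("chr1", 100, 200): "P2"}}
record = {"chrom": "chr1", "pos": 150, "calls": {"P1": "0", "P2": "1"}}
print(project_calls(record, by_chrom, imputed_parents))
```

In this toy example, sample S1 was imputed as parent P2 in the second range, so it inherits P2's call at position 150.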

Step 1: Generate the Merged Parent VCF

Before running `imputed_parents_merged_vcf.py`, use the PHG `merge-gvcfs` command to combine all founder GVCFs into a single VCF.

This merged VCF includes records for SNVs, indels (represented as <INS> or <DEL> for insertions and deletions), and missing genotypes. Each site contains genotype calls where:

0 represents the reference allele
1, 2, ... (up to the number of ALT alleles) represent the alternate alleles in the order they are listed in the ALT column (these can be SNVs, <INS>, or <DEL>)
. indicates a missing genotype call
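As an illustration of this encoding, the following sketch translates a genotype index back into the allele it refers to. The function name `decode_genotype` is hypothetical, and the sketch assumes a single haploid index per call (for example `0`, `2`, or `.`) with any other FORMAT subfields already removed.

```python
def decode_genotype(gt, ref, alts):
    """Translate one haploid GT value into the allele it refers to.

    gt   -- GT string from a sample column, e.g. "0", "2", or "."
    ref  -- value of the REF column
    alts -- list of ALT column values, e.g. ["T", "<DEL>"]
    """
    if gt == ".":
        return None  # missing genotype call
    index = int(gt)
    if index == 0:
        return ref  # 0 is the reference allele
    return alts[index - 1]  # ALT alleles are numbered from 1


ref, alts = "A", ["T", "<DEL>"]
for gt in ["0", "1", "2", "."]:
    print(gt, "->", decode_genotype(gt, ref, alts))
```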

Note: The reference genome can be used during imputation by adding its hVCF to your vcf_files directory, but the reference gVCF should be excluded from merging for this script. See script options below.

export JAVA_OPTS="-Xmx12g"

phg merge-gvcfs \
    --input-dir ~/phg_v2/output/vcf_files \
    --output-file ~/phg_v2/merged_parents/merged_parents.vcf

Step 2 (Optional): Filter Variants with `bcftools`

Filtering can improve downstream analysis performance and focus results on biallelic, informative SNPs.

Keep only:

Biallelic SNPs
Sites where the ALT allele count is ≥2 and ≤36 (this equals the number of samples carrying ALT when each sample has a single haploid call)
Sites without missing data

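# Input is the Step 1 VCF, compressed here (e.g. with bgzip); bcftools can also read the plain .vcf
bcftools view \
    -m2 -M2 \
    --types snps \
    -c 2 -C 36 \
    -g ^miss \
    -Oz -o ~/phg_v2/merged_parents/non_miss_filtered_merged_parents.vcf.gz \
    ~/phg_v2/merged_parents/merged_parents.vcf.gz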
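To make the filter criteria concrete, here is a simplified Python sketch of the same per-site test. It is only an approximation of what `bcftools view` does (its `--types snps` handling is more general), and the function name `passes_filter` and its parameters are hypothetical. It assumes one haploid call per sample.

```python
BASES = {"A", "C", "G", "T"}


def passes_filter(ref, alts, genotypes, min_alt=2, max_alt=36):
    """Return True for a biallelic SNP with no missing calls and an
    ALT allele count between min_alt and max_alt (inclusive)."""
    # Biallelic: exactly one ALT allele (compare -m2 -M2).
    if len(alts) != 1:
        return False
    # SNP: REF and ALT are both single bases (compare --types snps);
    # symbolic alleles such as <INS> or <DEL> fail this check.
    if ref.upper() not in BASES or alts[0].upper() not in BASES:
        return False
    # No missing genotype calls (compare -g ^miss).
    if any(gt == "." for gt in genotypes):
        return False
    # ALT allele count within bounds (compare -c 2 -C 36).
    alt_count = sum(1 for gt in genotypes if gt == "1")
    return min_alt <= alt_count <= max_alt


print(passes_filter("A", ["T"], ["0", "1", "1"]))
print(passes_filter("A", ["<DEL>"], ["0", "1", "1"]))
```

With 37 parents, the upper bound of 36 drops sites where every parent carries the ALT allele, since those sites cannot distinguish between parents.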

Step 3: Run the Script

# Move into the cloned repository
cd phg_v2_analysis

# Run the script
python3 scripts/imputed_parents_merged_vcf.py \
    --ref_ranges_file ~/phg_v2/output/ref_ranges.bed \
    --merged_parents_vcf_path ~/phg_v2/merged_parents/non_miss_filtered_merged_parents.vcf.gz \
    --out_parents_dir ~/phg_v2/output/read_mappings/vcf_files

Optional arguments (append to the command above, adding a trailing \ to the preceding line):

    --reference_sample_name B73 \
    --merged_imputed_vcf_path /path/to/output/merged_imputed.vcf

Script Features

Designed for homozygous/haploid imputation output from phg find-paths, using the --path-type haploid option and specifying --out-parents-dir.
If a reference sample was used during imputation, provide the reference sample name as it appears in the imputed parents files using the optional --reference_sample_name argument.

Performance Notes

Tested on a merged parent VCF of 37 parents, both unfiltered (184 million variants) and filtered (16 million variants), with 2,000 imputed samples.
Maximum RAM usage: < 32 GB.
Runtime: approximately 53 hours (unfiltered) and 6 hours (filtered) on a high-memory server.

Dependencies

Python 3.8+
Standard libraries: gzip, os, argparse, glob
bcftools for optional pre-processing
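Since only standard-library modules are needed, a VCF can be read one line at a time with `gzip`, which keeps memory use low. The sketch below shows one way to do this. It is not taken from the script itself, and the names `open_vcf` and `iter_vcf` are hypothetical. Python's `gzip` module can also read bgzip-compressed files, since these are valid multi-member gzip files.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_vcf(path):
    """Yield (chrom, pos, ref, alts, calls) for each record, one at a time.

    calls maps each sample name to its GT value (the first FORMAT subfield).
    """
    samples = []
    with open_vcf(path) as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # skip meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample names follow the 9 fixed columns
                continue
            chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
            gts = (field.split(":")[0] for field in fields[9:])
            yield chrom, pos, ref, alt.split(","), dict(zip(samples, gts))
```

A caller would loop over it with `for chrom, pos, ref, alts, calls in iter_vcf(path):`, processing one record at a time rather than loading the whole file.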

Future Scripts

Step 4 (Upcoming): Calculate Founder Contribution

Planned features:

Quantify founder haplotype contributions for each sample.
Generate summary tables and genome-wide visualizations.
Tools for specific founder panels (e.g., Zea mays).

Installation

Clone the repository:

git clone https://github.com/mtkelleher/phg_v2_analysis

The scripts can be run directly using python3.

Acknowledgements

This project was developed by Micah Kelleher as part of work in the Baxter Lab at the Donald Danforth Plant Science Center. The code reflects workflows tailored to Zea mays analysis and PHG-based imputation.

Special thanks to:

Dr. Ivan Baxter & Lab – For providing the environment, resources, and insights critical to this project.
Dr. Sherry Flint-Garcia – For her guidance and for providing the Zea mays population used in development and testing.
Dr. Jeff Ross-Ibarra – For his help with performance benchmarking and interpreting PHG outputs.
The Buckler Lab – For creating and maintaining the PHG, and for helpful guidance on usage.
Tim Kosfeld – For foundational discussions and early support for this work.
