This repository contains scripts to assist with analyzing outputs from the Practical Haplotype Graph (PHG) version 2.4 and above.
The PHG, developed by the Buckler Lab, is a computational tool for creating, storing, and using a pangenome for genomic imputation. For more information and full documentation, please visit the official PHG website.
While this repository provides custom tools and workflows for PHG-based analysis, it is not officially affiliated with the PHG software or its authors. These scripts were developed for specific use cases but are shared here in the hope that they may be useful for others working with PHG data.
- scripts/imputed_parents_merged_vcf.py: Generates a merged VCF of imputed genotypes using a merged parent VCF and the imputed parents files output from phg find-paths.
- More scripts to come.
This script combines a PHG-derived merged parent VCF with a directory of imputed parents files to produce a final imputed VCF in which, for each reference range, every sample carries the genotype calls of the parent imputed for that range.
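The per-range assignment can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the function names and the in-memory dictionaries are hypothetical, a single chromosome is assumed, and the real imputed parents file format is not modeled. The only format detail relied on is that BED ranges use a 0-based start and an exclusive end, while VCF positions are 1-based.

```python
from bisect import bisect_left


def find_range(ranges, ends, pos):
    """Return the index of the BED range containing 1-based position pos, or None.

    ranges: sorted, non-overlapping (start, end) tuples in BED coordinates,
            so a range covers VCF positions start+1 through end.
    ends:   the end coordinate of each range, in the same order.
    """
    i = bisect_left(ends, pos)  # first range whose end is >= pos
    if i < len(ranges) and ranges[i][0] < pos:
        return i
    return None


def impute_calls(ranges, parent_calls, parent_per_range, positions):
    """Copy, for each position, the call of the parent assigned to its range.

    parent_calls:     {parent_name: {pos: allele_index_string}} (hypothetical layout)
    parent_per_range: {range_index: parent_name} for one imputed sample
    Positions outside every range, or without an assigned parent, become ".".
    """
    ends = [end for _, end in ranges]
    calls = {}
    for pos in positions:
        idx = find_range(ranges, ends, pos)
        parent = parent_per_range.get(idx)
        calls[pos] = parent_calls.get(parent, {}).get(pos, ".")
    return calls
```

For example, with two ranges (0, 100) and (100, 200) assigned to different parents, a variant at position 50 takes its call from the first parent and a variant at position 150 from the second.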
Before running imputed_parents_merged_vcf.py, first use the PHG tool merge-gvcfs to combine all founder GVCFs into a single VCF.
This merged VCF includes records for SNVs, indels (represented as <INS> or <DEL> for insertions and deletions), and missing genotypes. Each site contains genotype calls where:
- 0 represents the reference allele
- 1, 2, ..., up to the number of alternate alleles (ALTs) represent the alternate variants (which can be SNVs, <INS>, or <DEL>)
- . indicates a missing genotype call
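As a small illustration of this encoding, the sketch below translates a GT value into allele strings using the REF and ALT columns of the same record. The function names are hypothetical and are not part of the repository's script; the input is assumed to be the GT subfield only, with missing alleles returned as None.

```python
def decode_allele(index, ref, alts):
    """Translate one allele index from a GT value into its allele string."""
    if index == ".":
        return None  # missing call
    i = int(index)
    return ref if i == 0 else alts[i - 1]


def decode_genotype(gt, ref, alt_field):
    """Decode a GT value such as "1", "0/1", or "2|2" into a list of alleles.

    alt_field is the raw ALT column, e.g. "T,<DEL>" ("." means no ALT allele).
    """
    alts = [] if alt_field == "." else alt_field.split(",")
    indices = gt.replace("|", "/").split("/")
    return [decode_allele(i, ref, alts) for i in indices]
```

With REF A and ALT T,<DEL>, a GT of 2 decodes to <DEL> and a GT of . decodes to a missing call.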
Note: The reference genome can be used during imputation by adding its hVCF to your vcf_files directory, but the reference gVCF should be excluded from merging for this script. See script options below.
export JAVA_OPTS="-Xmx12g"
phg merge-gvcfs \
--input-dir ~/phg_v2/output/vcf_files \
--output-file ~/phg_v2/merged_parents/merged_parents.vcf
Filtering can improve downstream analysis performance and focus results on biallelic, informative SNPs. The example below keeps only:
- Biallelic SNPs
- Sites with ALT allele in ≥2 and ≤36 samples
- Sites without missing data
bcftools view \
-m2 -M2 \
--types snps \
-c 2 -C 36 \
-g ^miss \
-Oz -o ~/phg_v2/merged_parents/non_miss_filtered_merged_parents.vcf.gz \
~/phg_v2/merged_parents/merged_parents.vcf.gz
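To make the filter criteria concrete, here is a rough Python equivalent applied to a single VCF record. It is a sketch for understanding and spot-checking only, not a replacement for bcftools: the function name is hypothetical, and SNPs are approximated as single A/C/G/T bases in both REF and ALT. Note that the allele-count options count non-reference alleles rather than samples, so the count equals the number of samples only when each sample contributes one allele (haploid calls).

```python
def passes_filters(ref, alt_field, genotypes, min_ac=2, max_ac=36):
    """Return True if a record would pass the filters described above.

    ref:       REF column
    alt_field: raw ALT column (comma-separated)
    genotypes: list of GT values, one per sample, e.g. ["0", "1", "1"]
    """
    alts = alt_field.split(",")
    # Biallelic: exactly one ALT allele (mirrors -m2 -M2).
    if len(alts) != 1 or alts[0] in ("", "."):
        return False
    # SNP: single-base REF and ALT (approximates --types snps).
    bases = set("ACGT")
    if ref.upper() not in bases or alts[0].upper() not in bases:
        return False
    # Collect every allele index across samples.
    indices = []
    for gt in genotypes:
        indices.extend(gt.replace("|", "/").split("/"))
    # No missing calls (mirrors -g ^miss).
    if "." in indices:
        return False
    # Non-reference allele count within bounds (mirrors -c 2 -C 36).
    alt_count = sum(1 for i in indices if i != "0")
    return min_ac <= alt_count <= max_ac
```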
# Move into the cloned repository
cd phg_v2_analysis
# Run the script
python3 scripts/imputed_parents_merged_vcf.py \
--ref_ranges_file ~/phg_v2/output/ref_ranges.bed \
--merged_parents_vcf_path ~/phg_v2/merged_parents/non_miss_filtered_merged_parents.vcf.gz \
--out_parents_dir ~/phg_v2/output/read_mappings/vcf_files
Optional:
--reference_sample_name B73 \
--merged_imputed_vcf_path /path/to/output/merged_imputed.vcf
- Designed for homozygous/haploid imputation output from phg find-paths, using the --path-type haploid option and specifying --out-parents-dir.
- If a reference sample was used during imputation, provide the reference sample name as it appears in the imputed parents files using the optional --reference_sample_name argument.
- Tested on a merged parent VCF of 37 parents, both unfiltered (184 million variants) and filtered (16 million variants), with 2,000 imputed samples.
- Maximum RAM usage: < 32 GB.
- Runtime: approximately 53 hours (unfiltered) and 6 hours (filtered) on a high-memory server.
- Python 3.8+
- Standard libraries: gzip, os, argparse, glob
- bcftools for optional pre-processing
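A quick way to confirm these requirements before a long run is sketched below. The helper name is hypothetical and is not part of this repository; it checks only the Python version and whether bcftools is on the PATH.

```python
import shutil
import sys


def check_environment(min_version=(3, 8)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_version:
        found = ".".join(str(n) for n in sys.version_info[:3])
        needed = ".".join(str(n) for n in min_version)
        problems.append(f"Python {needed}+ required, found {found}")
    if shutil.which("bcftools") is None:
        problems.append("bcftools not found on PATH (needed only for optional pre-processing)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```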
Planned features:
- Quantify founder haplotype contributions for each sample.
- Generate summary tables and genome-wide visualizations.
- Tools for specific founder panels (e.g., Zea mays).
Clone the repository:
git clone https://github.com/mtkelleher/phg_v2_analysis
The scripts can be run directly using python3.
This project was developed by Micah Kelleher as part of work in the Baxter Lab at the Donald Danforth Plant Science Center. The code reflects workflows tailored to Zea mays analysis and PHG-based imputation.
Special thanks to:
- Dr. Ivan Baxter & Lab – For providing the environment, resources, and insights critical to this project.
- Dr. Sherry Flint-Garcia – For her guidance and for providing the Zea mays population used in development and testing.
- Dr. Jeff Ross-Ibarra – For his help with performance benchmarking and interpreting PHG outputs.
- The Buckler Lab – For creating and maintaining the PHG, and for helpful guidance on usage.
- Tim Kosfeld – For foundational discussions and early support for this work.