svsiegel/vivax-mhaps

Lineage-informative microhaplotypes for spatio-temporal surveillance of Plasmodium vivax malaria parasites

This is the companion repository for the paper Lineage-informative microhaplotypes for spatio-temporal surveillance of Plasmodium vivax malaria parasites. Please use DOI 10.5281/zenodo.12622789 for citations.

The main aim of this repository is to provide details on the methods used for the discovery, selection, and exploration/optimisation of candidate microhaplotype panels for P. vivax. In addition, it provides a customisable template framework that can be adapted to different panel use cases, with requirements such as the number of panel markers, diversity criteria, geographical region, and selection process. The GitHub Pages website, from which all the code can be executed directly, is available at https://svsiegel.github.io/vivax-mhaps/

Here we provide two interactive notebooks that go step-by-step through the selection process and exploration of marker benchmarking, together with a number of accessory files to speed up future execution.

  1. In the first notebook, we scan the P. vivax genome in partially overlapping sliding windows and calculate a number of summary statistics for each window (in this example: cardinality, heterozygosity, entropy). Each window represents a potential microhaplotype marker, provided it satisfies a number of customisable selection criteria (e.g. diversity, number of variants). The selection criteria we use are based on a previous study that used an in silico approach to determine optimal criteria for capturing sufficient data from marker panels to detect identity-by-descent (IBD), or relatedness between parasite lineages (Taylor et al, Genetics 2019). However, this is only one of many potential use cases that could be explored with this framework.
  2. The second notebook analyses all the windows together, explores panel optimisation, and then selects a candidate panel. The selection process is a challenging mathematical optimisation problem, and here we provide two complementary and effective ways to perform the task. Although we show selection methods here, subsequent manual curation is often required because of constraints, downstream requirements, or other considerations (proximity to other markers, reduction of gaps across the genome, low/high diversity regions, individual assay/panel performance during experimental validation, etc.). The codebase is also modular and can be extended to use different optimisation algorithms.
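
The two steps above can be illustrated with a minimal, self-contained sketch. It is not the code from the notebooks: the function names (`sliding_windows`, `window_stats`, `greedy_select`), the input layout, and the exact estimators are assumptions made for illustration. Heterozygosity is taken here as one minus the sum of squared haplotype frequencies, entropy as Shannon entropy in bits, and the selection step is a simple greedy pass with a minimum spacing constraint, which is not necessarily either of the two methods used in the second notebook.

```python
import math
from collections import Counter


def sliding_windows(start, end, size, step):
    """Yield (window_start, window_end) pairs; step < size gives overlap."""
    for w_start in range(start, end - size + 1, step):
        yield w_start, w_start + size


def window_stats(haplotypes):
    """Return (cardinality, heterozygosity, entropy) for one window.

    `haplotypes` is a list of strings, one per sample, each holding the
    alleles observed at the variant positions inside the window.
    """
    counts = Counter(haplotypes)
    n = len(haplotypes)
    freqs = [c / n for c in counts.values()]
    cardinality = len(counts)  # number of distinct microhaplotypes
    # Assumed estimator: probability that two random draws differ.
    heterozygosity = 1.0 - sum(f * f for f in freqs)
    # Shannon entropy of the haplotype frequencies, in bits.
    entropy = -sum(f * math.log2(f) for f in freqs)
    return cardinality, heterozygosity, entropy


def greedy_select(windows, panel_size, min_gap):
    """Greedy illustration only: take windows in descending heterozygosity,
    skipping any that start within `min_gap` bp of an already chosen
    window on the same chromosome.

    `windows` is a list of dicts with keys "chrom", "start" and "het".
    """
    chosen = []
    for w in sorted(windows, key=lambda w: w["het"], reverse=True):
        if len(chosen) >= panel_size:
            break
        far_enough = all(
            w["chrom"] != c["chrom"] or abs(w["start"] - c["start"]) >= min_gap
            for c in chosen
        )
        if far_enough:
            chosen.append(w)
    return chosen
```

For instance, `window_stats(["ACG", "ACG", "ATG", "TTG"])` describes a window with three distinct haplotypes at frequencies 0.5, 0.25 and 0.25. A greedy pass like this is easy to reason about but can miss better panels, which is one reason the selection is treated as an optimisation problem and followed by manual curation.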

We used data from a subset of high-quality samples that are part of the open MalariaGEN Pv4 dataset, which contains genome variation data on nearly two thousand worldwide samples of natural Plasmodium vivax infections. Details on this project, the methods used, and all contributing partners can be found in the key publication: MalariaGEN et al, Wellcome Open Research 2022, 7:136 https://doi.org/10.12688/wellcomeopenres.17795.1. The dataset can be accessed in a number of ways; here we used the malariagen_data Python package, which allows the data to be used directly from the cloud without first downloading them locally. The Pv4 user guide provides all the information on how to use the package, as well as some examples to get started.

The notebooks can be run from any computer, including via MyBinder or Google Colab, two free interactive computing services that run in a cloud environment. Note that the first notebook requires navigating through hundreds of thousands of genetic variants in thousands of samples and, while the malariagen_data Python package provides an efficient way to access the data directly in the cloud, the process can still take hours (or days!) depending on the available computing infrastructure. To jump-start the selection process described in the second notebook, we have also provided a number of pre-calculated statistics.

The code contained here has been developed by Roberto Amato, Kathryn Murie, and Sasha Siegel.
