- This Jupyter notebook was created for my
CS35final project (Summer 2023) and uses machine learning to analyze genomic data.
- This project began with a focus on Genome Wide Association Studies (
GWAS), which are used to find associations between genes (genotypes) and particular traits (phenotypes). Datasets like the1000 Genomesdataset contain the genetic information for many people and are used to support these studies. - However, there is one main complication: each human genome is approximately 3 billion base pairs long, so using the
1000 Genomesdataset requires handling a lot of data!
In order to work with the 1000 Genomes dataset, I used the Hail library to:
- Access a subset of the dataset through distributed processing
- Run a
GWAS(analyze trait associations, generate a Manhattan Plot, etc.)
Then, to extend beyond Hail tutorials, I shifted my focus to predictive modeling of DNA sample quality (that is, which samples are good quality and which may be unreliable).
-
I decided to build a
Random Forestclassifier to predict which DNA samples should be filtered out. -
Also, instead of using all
1000 Genomes, I decided to build my model using a single person’s genomic data, which contained10,000DNA samples.
-
Hailprovides a built-in filtering method that allows you to identify 'ground truth' DNA sample quality labels (a form of quality control). -
Hailalso provides a lot of information about each DNA sample, so I decided to use a lot of that ‘messy’ data as my features. -
Then, with the ‘small’ size of 1 genome i.e.,
10,000DNA samples, I could clean the genomic data withPandas/Numpyand construct aRandom Forestclassifier.
- In the end, the
RFwas quite successful (~90% accuracy) at predicting the true quality of the DNA samples (without relying on theHailbuilt-in method itself!). - We also get to visualize the feature importances, which is always a good idea for Random Forests. : )
Repository URL: https://github.com/erikamoore/genomic-predictive-modeling.git


