This repository is designed to perform imputation on SNP-array data in order to fill gaps in single-cell genomic data. This is particularly important for addressing the challenges posed by Whole Genome Amplification (WGA), a common technique in single-cell genomics that can introduce significant background noise and result in missing genetic information.
- Description - Overview of the project's purpose and goals
- Getting started - Instructions on how to begin with this project
- Bioinformatic parameters - Explanation and details of the bioinformatic parameters used throughout the pipeline
- Repository structure - A layout of the repository's architecture, describing the purpose of each file or directory
- References - Tools used in the project
- Authors - List of contributors to the project
- Acknowledgments - Credits and thanks to those who helped with the project
The repository reports statistics on SNP-array data to assess whether imputation can recover the information lost in single-cell data. The analysis begins with the creation of a bulk reference from five different bulk datasets, followed by data pre-processing. Next, similarity coefficients and recall are computed to compare the situation before and after imputation. Finally, descriptive statistics are computed and plots are created to show the results.
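The before/after comparison described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the choice of the Jaccard index as the similarity coefficient, the representation of calls as `(chromosome, position, genotype)` tuples, and the function names `jaccard_similarity` and `recall` are all assumptions made for the example.

```python
def jaccard_similarity(calls_a, calls_b):
    """Jaccard index between two collections of (chrom, pos, genotype) calls."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    # Two empty call sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def recall(reference_calls, sample_calls):
    """Fraction of reference (bulk) calls that the sample reproduces exactly."""
    ref = set(reference_calls)
    if not ref:
        return 0.0
    return len(ref & set(sample_calls)) / len(ref)


# Toy data (hypothetical): a bulk reference and one single cell
# before and after imputation.
bulk = {
    ("chr1", 100, "0/1"),
    ("chr1", 200, "1/1"),
    ("chr1", 300, "0/0"),
    ("chr2", 50, "0/1"),
}
sc_before = {("chr1", 100, "0/1"), ("chr1", 200, "0/1")}
sc_after = {("chr1", 100, "0/1"), ("chr1", 200, "1/1"), ("chr1", 300, "0/0")}

for label, cell in (("before", sc_before), ("after", sc_after)):
    print(label, jaccard_similarity(bulk, cell), recall(bulk, cell))
```

In this toy case a call only counts as matching when chromosome, position, and genotype all agree, so a wrongly called genotype lowers both metrics.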
To reproduce this analysis, it is essential to set up a Conda environment containing all the necessary libraries (specified in the requirements.txt file). After setting up the environment, it is important to run the following scripts in the specified order.
- Use the functions from `get_gdna_consensus.py` to manipulate and analyze genomic DNA (gDNA) data: they perform operations ranging from data concatenation, filtering, cleaning, and analysis to visualization and data transformation.
- Use the functions from `get_references_map.py` to download large genomic data files: they automate downloading, unzipping, and organizing genomic data files into specified directories.
- Use the functions from `data_processing_pre_imputation.py` to process, filter, and analyze genomic data, with a focus on single-cell (SC) and consensus genomic DNA (gDNA) data.
- Use the functions from `get_positions_to_exclude.py`.
- Use the functions from `imputation.py` to perform genetic imputation for each chromosome.
- Use the functions from `data_processing_post_imputation.py`.
- Use the functions from `creating_statistics.py`.
- Use the functions from `creating_plots.py`.
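As an illustration of the per-chromosome imputation step, the sketch below builds one Beagle command line per autosome. It is not taken from `imputation.py`: the helper name `beagle_command`, all file paths, and the `chr1` to `chr22` naming are hypothetical. Only the `gt=`, `ref=`, `map=`, `out=`, and `chrom=` arguments are standard Beagle parameters.

```python
BEAGLE_JAR = "beagle.22Jul22.46e.jar"  # JAR listed in the repository structure


def beagle_command(chrom, gt_vcf, ref_vcf, out_prefix, genetic_map=None, jar=BEAGLE_JAR):
    """Build the argument list for one Beagle run restricted to one chromosome."""
    cmd = [
        "java", "-jar", jar,
        f"gt={gt_vcf}",      # target genotypes (single-cell VCF)
        f"ref={ref_vcf}",    # phased reference panel
        f"out={out_prefix}", # output file prefix
        f"chrom={chrom}",    # restrict the analysis to this chromosome
    ]
    if genetic_map is not None:
        cmd.append(f"map={genetic_map}")  # PLINK-format genetic map
    return cmd


# Hypothetical paths and chromosome names; adapt them to your own data layout.
chromosomes = [f"chr{i}" for i in range(1, 23)]
commands = [
    beagle_command(
        chrom,
        gt_vcf="data/raw/single_cell.vcf.gz",
        ref_vcf=f"data/reference/{chrom}.vcf.gz",
        out_prefix=f"data/imputed/{chrom}",
        genetic_map=f"data/map/plink.{chrom}.map",
    )
    for chrom in chromosomes
]

print(" ".join(commands[0]))
# Each command could then be executed with subprocess.run(cmd, check=True).
```

Building the commands as lists and deferring execution makes it easy to inspect them, or to run chromosomes in parallel, before committing to a long imputation run.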
| File | Description |
|---|---|
| data/ | This folder must contain a subfolder called "raw" holding your own input data, including the single-cell and bulk VCF files |
| requirements.txt | File listing the names and versions of the packages installed in the virtual environment used to run the imputation |
| beagle.22Jul22.46e.jar | Beagle imputation tool to perform the imputation |
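
Before running the pipeline, it can help to confirm that the input layout described in the table is in place. The following is a small sketch under stated assumptions: the helper name `list_raw_vcfs` is hypothetical, and the check assumes input files end in `.vcf` or `.vcf.gz`.

```python
from pathlib import Path


def list_raw_vcfs(project_root="."):
    """Return the VCF files found in <project_root>/data/raw, sorted by path."""
    raw_dir = Path(project_root) / "data" / "raw"
    if not raw_dir.is_dir():
        raise FileNotFoundError(f"Expected input folder not found: {raw_dir}")
    # Assumption: inputs are plain or gzip-compressed VCF files.
    return sorted(
        p for p in raw_dir.iterdir()
        if p.is_file() and p.name.endswith((".vcf", ".vcf.gz"))
    )
```

Calling `list_raw_vcfs()` from the repository root returns the detected single-cell and bulk VCF files, or raises `FileNotFoundError` if `data/raw` is missing.
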
Contact me at marcor@dtu.dk for more details or explanations.
I would like to extend my heartfelt gratitude to KU and CCS (Center for Chromosome Stability) for providing the essential resources and support that have been fundamental to the development and success of the Eva Hoffmann group's projects.