Pipelines to extract the unfolded site frequency spectrum (SFS) from 1000GP VCF files and DGN alignments.
Each pipeline returns a tab-delimited file containing the unfolded site frequency spectrum, the analyzable sites for the largest transcript, and the divergence, reported by gene, population and MKT functional class (0-fold: selected class; 4-fold: neutral class). Both pipelines allow the analysis of Drosophila melanogaster and human proteins through the iMKT web service and the iMKT R package.
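As an illustration of what an unfolded SFS is, here is a minimal Python sketch. It is not the code used by the pipelines: the function name unfolded_sfs and its inputs are hypothetical, and it assumes the derived allele count of every site is already known. The SFS is returned as raw counts per derived allele count; the binning the pipelines write to their output files may differ.

```python
from collections import Counter


def unfolded_sfs(derived_counts, sample_size):
    """Return the unfolded SFS as a list indexed by derived allele count.

    Entry i (1 <= i <= sample_size - 1) is the number of segregating sites
    where the derived allele is carried by exactly i sampled chromosomes.
    Index 0 is unused and left at 0 so that list indices match allele counts.
    Sites with a count of 0 or sample_size are not segregating and are skipped.
    """
    tally = Counter(derived_counts)
    sfs = [0] * sample_size
    for count in range(1, sample_size):
        sfs[count] = tally[count]
    return sfs


if __name__ == "__main__":
    # Six sampled chromosomes; one derived allele count per site (toy data).
    print(unfolded_sfs([1, 1, 2, 5, 0, 6], sample_size=6))
```

In this toy input the sites with counts 0 and 6 are dropped because the derived allele is absent or fixed in the sample.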
This repository only includes the raw code needed to obtain the main results. The notebooks/ folder includes two main Jupyter notebooks, running on a Python 3.6 kernel, that execute the pipeline step by step. The src/ folder contains the raw scripts needed to execute the pipeline. Please note that multiple steps can be parallelized; in that case, write your own custom bash scripts or run the steps manually on your server.
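If you prefer to stay in Python rather than write bash scripts, independent steps can be launched concurrently along the following lines. This is only a sketch under assumptions: the helper names run_command and run_in_parallel are hypothetical, and the commands shown are placeholders, not scripts from src/. Each thread simply waits on one external process, so the work itself runs in separate processes.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one command and return (command, exit code, captured output)."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # text mode; also works on Python 3.6
    )
    return command, result.returncode, result.stdout


def run_in_parallel(commands, max_workers=4):
    """Run independent commands concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_command, commands))


if __name__ == "__main__":
    # Placeholder commands: replace them with the real per-chromosome calls.
    chromosomes = ["2L", "2R", "3L", "3R", "X"]
    commands = [
        [sys.executable, "-c", "print('processing chromosome {}')".format(chrom)]
        for chrom in chromosomes
    ]
    for command, code, output in run_in_parallel(commands):
        print(code, output.strip())
```

Keep max_workers at or below the number of available CPUs, and check the exit codes before using any output.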
The pipelines were developed in the conda environment defined in imktData.yml, on a local server with 100 GB of RAM and 16 Intel(R) Xeon(R) CPUs.
In addition, structure.sh, deposited in src/, creates the folders we used to complete the whole process. If you decide to execute it, overwrite notebooks/ and src/ with the same folders deposited in this repository.
Pipeline execution requires downloading the following files. The file paths used in the code will need to be changed too.
Variation data generated by the Drosophila Genome Nexus, together with divergence data between D. melanogaster and D. simulans, were retrieved from PopFly (Hervás et al. 2017) in FASTA format (also available on the DGN website). Recombination data are from Comeron et al. 2012.
Genome variation data and information on the ancestral state of the variants generated by the 1000GP Phase III (1000 Genomes Project Consortium 2015), together with divergence between humans and chimpanzees, were retrieved from PopHuman (Casillas et al. 2018) in Variant Call Format (VCF). Recombination data are from Bhérer et al. 2017. The pilot mask used to exclude low-quality variants was downloaded from the 1000GP FTP site.
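The ancestral state is what makes the spectrum unfolded: it tells which allele of each variant is the derived one. The sketch below shows one way to polarize a biallelic SNP from a VCF record. It is an illustration, not the pipeline code: the function names are hypothetical, and it assumes INFO fields laid out as in the 1000GP Phase III files (AC, AN, and AA with the ancestral allele before the first "|"). It also assumes that lower-case (low-confidence) ancestral calls are accepted; a stricter analysis could discard them.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AC=3;AN=10;AA=C|||' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value
    return info


def derived_allele_count(ref, alt, info_field):
    """Return the derived allele count of a biallelic SNP, or None.

    None means the site cannot be polarized: it is multiallelic, AC or AN is
    missing, or the ancestral allele is unknown or matches neither REF nor ALT.
    """
    if "," in alt:
        return None  # multiallelic record
    info = parse_info(info_field)
    try:
        alt_count = int(info["AC"])
        total = int(info["AN"])
    except (KeyError, ValueError):
        return None
    # Assumption: lower-case (low-confidence) calls are treated as valid.
    ancestral = info.get("AA", "").split("|")[0].upper()
    if ancestral == ref.upper():
        return alt_count  # ALT is the derived allele
    if ancestral == alt.upper():
        return total - alt_count  # REF is the derived allele
    return None


if __name__ == "__main__":
    print(derived_allele_count("C", "T", "AC=3;AN=10;AA=C|||"))
    print(derived_allele_count("C", "T", "AC=3;AN=10;AA=t|||"))
```

Sites for which the function returns None would be left out of the unfolded SFS.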