Code repository for machine learning and computational analysis of large-scale microbial genomics data (https://rdcu.be/9rHj)

erolkavvas/microbial_AMR_ML

Machine learning of microbial pan-genomes

Computational platform applied to a large-scale M. tuberculosis antimicrobial resistance (AMR) dataset, as described in:

ES. Kavvas, E. Catoiu, N. Mih, JT. Yurkovich, Y. Seif, N. Dillon, D. Heckmann, A. Anand, L. Yang, V. Nizet, JM. Monk, BO. Palsson. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance. Nature Communications (2018) 9:4306.

Installation

git clone https://github.com/erolkavvas/microbial_AMR_ML.git

Primary scripts

  • 01_pairwise_tests.ipynb
    • Determines pairwise associations between pan-genome alleles and labeled phenotypes.
    • Generates Supplementary Data File 1
  • 02_ML_ensemble_SVM.ipynb
    • Trains an ensemble of support vector machines (SVMs) to select groups of alleles that are predictive of the labeled phenotypes.
    • Generates Supplementary Data File 2, Supplementary Data File 3, and svm_ensemble_data
  • 03_epistatic_analysis.ipynb
    • Uses the data generated by 02_ML_ensemble_SVM.ipynb to select an initial set of gene-gene pairs, then fits a logistic regression model to each pair to identify statistically significant genetic interactions.
    • Generates cooccurence_table_excel, cooccurence_table_figures, and Supplementary Data File 4
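
To make the idea of a pairwise association concrete, here is a minimal sketch of testing one allele against one resistance phenotype. It uses a two-sided Fisher exact test on a 2x2 contingency table, which is an assumption chosen for illustration; 01_pairwise_tests.ipynb may use a different statistic. The function names and the toy vectors are hypothetical and do not come from the notebooks.

```python
from math import comb


def contingency(allele, resistant):
    """Return the 2x2 counts (a, b, c, d) for allele presence vs. resistance.

    a: allele present and resistant,  b: allele present and susceptible
    c: allele absent and resistant,   d: allele absent and susceptible
    """
    a = b = c = d = 0
    for has_allele, is_resistant in zip(allele, resistant):
        if has_allele and is_resistant:
            a += 1
        elif has_allele:
            b += 1
        elif is_resistant:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for a 2x2 table (hypergeometric model)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Probability of seeing x genomes that are both allele-positive and
        # resistant, given fixed row and column totals.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= observed * (1 + 1e-9)
    )


# Toy data (made up): one allele column and one drug label per genome.
allele = [1, 1, 1, 1, 0, 0, 0, 0]      # 1 = genome carries the allele
resistant = [1, 1, 1, 0, 0, 0, 0, 1]   # 1 = "R", 0 = "S"

table = contingency(allele, resistant)
p_value = fisher_exact_two_sided(*table)
```

In a full analysis this test would be repeated for every allele column and every drug, so the resulting p-values would need a multiple-testing correction before interpretation.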

Primary data structures

The following dataframes are required inputs for the computational platform.

  • cluster_info.csv
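| | clust_to_rv | gene_name | ortho | cog | product | refseq | count | score | name_to_rv | pan |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cluster 0 | Rv2048c | pks12 | 653045.Strvi_4160 | Q | Polyketide synthase | AN47_01827 | 1590 | 7958.6 | 0 | Core |
| Cluster 1 | Rv3344c | PE_PGRS49 | 0 | 0 | PE-PGRS family protein | X171_03503 | 794 | 0.0 | 0 | Acces |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |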
  • pangen_allele_df.csv
Genome ID ... Cluster0_16 Cluster0_17 ...
1010834_3 ... 1 ...
1010835_3 ... 1 ...
1010836_3 ... 1 ...
... ... ... ... ...
  • pangen_cluster_df.csv
| Genome ID | Cluster 0 | Cluster 1 | Cluster 2 | ... |
| --- | --- | --- | --- | --- |
| 1438838_3 | 1 | 1 | 0 | ... |
| 1408941_4 | 1 | 1 | 0 | ... |
| 1422035_3 | 1 | 0 | 0 | ... |
| ... | ... | ... | ... | ... |
  • resistance_data.csv
| genome_id | isoniazid | rifampicin | ethambutol | ... |
| --- | --- | --- | --- | --- |
| 1295764_3 | R | R | R | ... |
| 1423468_3 | R | R | S | ... |
| ... | ... | ... | ... | ... |
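
The tables above share a genome identifier, so they can be joined into a feature matrix and a label vector for a given drug. The following is a minimal sketch of that alignment using only the Python standard library. The inline CSV contents, the genome IDs `g1` to `g5`, and the helper names `read_indexed` and `build_xy` are all hypothetical; only the column layout (`Genome ID`, `genome_id`, `Cluster N`, and `R`/`S` labels) follows the examples shown here. The notebooks themselves may load and align the data differently.

```python
import csv
import io

# Toy stand-ins for pangen_cluster_df.csv and resistance_data.csv.
# Values are made up; only the column layout mirrors the README examples.
CLUSTER_CSV = """Genome ID,Cluster 0,Cluster 1,Cluster 2
g1,1,1,0
g2,1,1,0
g3,1,0,0
g4,1,0,1
"""

RESISTANCE_CSV = """genome_id,isoniazid,rifampicin
g1,R,R
g2,R,S
g3,S,S
g5,S,S
"""


def read_indexed(text, id_column):
    """Parse CSV text into {genome_id: {column: value}}."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        genome_id = row.pop(id_column)
        rows[genome_id] = row
    return rows


def build_xy(clusters, resistance, drug):
    """Align genomes found in both tables; return (ids, X, y, feature names)."""
    shared = sorted(set(clusters) & set(resistance))
    features = list(next(iter(clusters.values())).keys())
    ids, X, y = [], [], []
    for gid in shared:
        label = resistance[gid].get(drug, "")
        if label not in ("R", "S"):
            continue  # skip genomes with no usable label for this drug
        ids.append(gid)
        X.append([int(clusters[gid][f]) for f in features])
        y.append(1 if label == "R" else 0)
    return ids, X, y, features


clusters = read_indexed(CLUSTER_CSV, "Genome ID")
resistance = read_indexed(RESISTANCE_CSV, "genome_id")
ids, X, y, features = build_xy(clusters, resistance, "isoniazid")
```

Genomes that appear in only one table (`g4` and `g5` in this toy data) are dropped, so every row of `X` has a matching label in `y`.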

External packages of note
