First you will need to install pyenv and pipx:
curl https://pyenv.run | bash
Or use Homebrew:
brew install pyenv
Then you can install pipx:
brew install pipx
pipx ensurepath
Or, if you are on Linux:
sudo apt update
sudo apt install pipx
pipx ensurepath
Then you can install poetry:
pipx install poetry
Finally you can install the environment:
poetry install
conda env create --file environment.yml
conda activate grape
For running the analysis you can download the data from Zenodo: TODO
TODO
You may choose whichever model you want from the src/models folder. If you want, you can also implement your own model. The available models are:
Before running the script, you should change the variable max_evals in the run_decision_tree.py file to 100 or more.
The first thing to do is to download LOTUS from Zenodo:
wget https://zenodo.org/record/7534071/files/230106_frozen_metadata.csv.gz
mv ./230106_frozen_metadata.csv.gz ./data/molecules/230106_frozen_metadata.csv.gz
wget http://classyfire.wishartlab.com/system/downloads/1_0/chemont/ChemOnt_2_1.obo.zip
mv ./ChemOnt_2_1.obo.zip ./data/molecules/ChemOnt_2_1.obo.zip
# unzip the file
unzip ./data/molecules/ChemOnt_2_1.obo.zip
mv ./ChemOnt_2_1.obo ./data/molecules/ChemOnt_2_1.obo
wget https://raw.githubusercontent.com/mwang87/NP-Classifier/master/Classifier/dict/index_v1.json
mv ./index_v1.json ./data/molecules/NPClassifier_index.json
First run the following commands to prepare LOTUS, the molecules and the molecule ontology:
python prepare_data/prepare_lotus.py
python prepare_data/prepare_mol_to_chemont.py
python prepare_data/prepare_NPClassifier.py
The species edge list will take a bit longer to prepare (2-3 minutes). We are downloading the entire taxonomy from Wikidata using this query.
python prepare_data/prepare_species.py
Now we need to prepare the data to be suitable for the grape library.
python prepare_data/prepare_graph.py
Finally we need to merge the data from NCBI and LOTUS.
python prepare_data/prepare_merge_ncbi.py
Once this is done, the data folder should have the following structure, with the following graphs available:
.
├── full_graph_with_ncbi_clean_edges.csv
├── full_graph_with_ncbi_clean_nodes.csv
├── full_graph_with_ncbi_edges.csv
├── full_graph_with_ncbi_nodes.csv
├── full_wd_taxonomy_with_molecules_in_lotus_clean_edges.csv
├── full_wd_taxonomy_with_molecules_in_lotus_clean_nodes.csv
├── full_wd_taxonomy_with_molecules_in_lotus_edges.csv
├── full_wd_taxonomy_with_molecules_in_lotus_nodes.csv
├── lotus
│ ├── lotus_edges.csv
│ └── lotus_nodes.csv
├── molecules
│ ├── 230106_frozen_metadata.csv.gz
│ ├── ChemOnt_2_1.obo
│ ├── ChemOnt_2_1.obo.zip
│ ├── NPClassifier_index.json
│ ├── chemont_edges.csv
│ ├── chemont_nodes.csv
│ ├── mol_to_chemont_edges.csv
│ ├── mol_to_chemont_nodes.csv
│ ├── mol_to_np_edges.csv
│ └── mol_to_np_nodes.csv
└── species
  ├── full_wikidata_taxonomy_edges.csv
  └── full_wikidata_taxonomy_nodes.csv
You can choose which graph you want to use for the analysis. Here is an explanation of the different graphs:
- full_graph_with_ncbi_clean: This graph contains all the data from LOTUS, the taxonomy from Wikidata and the taxonomy from NCBI. It is cleaned, meaning that there are no disconnected components in the graph.
- full_graph_with_ncbi: This graph contains all the data from LOTUS, the taxonomy from Wikidata and the taxonomy from NCBI. It is not cleaned, meaning that there are some disconnected components in the graph.
- full_wd_taxonomy_with_molecules_in_lotus_clean: This graph contains the taxonomy from Wikidata and the molecules from LOTUS (with the classification of the molecules). It is cleaned, meaning that there are no disconnected components in the graph.
- full_wd_taxonomy_with_molecules_in_lotus: This graph has the same content as the previous one, but it is not cleaned, meaning that there are some disconnected components in the graph.
- lotus/lotus: This graph is only a bipartite graph with the species and the molecules from LOTUS.
- molecules/chemont: This graph contains only the different classes of molecules from ClassyFire.
- molecules/mol_to_chemont: This graph contains the molecules and the classes of molecules from ClassyFire.
- molecules/mol_to_np: This graph contains the molecules and the classes of molecules from NPClassifier.
- species/full_wikidata_taxonomy: This graph contains the entire taxonomy of the species on Earth from Wikidata.
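To make the "cleaned" versus "not cleaned" distinction concrete, here is a minimal sketch that counts connected components in an edge list. It is not the cleaning code used by the preparation scripts, and the node names are hypothetical. It assumes an undirected graph and only sees nodes that appear in at least one edge.

```python
from collections import defaultdict


def count_connected_components(edges):
    """Count connected components of an undirected graph given as (source, destination) pairs."""
    neighbours = defaultdict(set)
    for source, destination in edges:
        neighbours[source].add(destination)
        neighbours[destination].add(source)

    seen = set()
    components = 0
    for start in neighbours:
        if start in seen:
            continue
        components += 1
        # Iterative depth-first traversal from a node that has not been visited yet.
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            for neighbour in neighbours[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


# Toy edge lists with hypothetical node names (not taken from the real data).
connected = [("species_a", "molecule_1"), ("molecule_1", "class_x")]
disconnected = connected + [("species_b", "molecule_2")]

print(count_connected_components(connected))     # 1
print(count_connected_components(disconnected))  # 2
```

In these terms, a graph where this count is 1 corresponds to what the list above calls cleaned.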
In our case we will use either full_graph_with_ncbi_clean or full_wd_taxonomy_with_molecules_in_lotus_clean. Further tests are needed to see which one gives the best predictions.
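Before starting an analysis, it may help to confirm that both CSV files of the chosen graph exist. The sketch below assumes only the `<name>_nodes.csv` / `<name>_edges.csv` naming visible in the folder structure above. The helper names `graph_files` and `missing_graph_files` are hypothetical and not part of the repository.

```python
from pathlib import Path


def graph_files(data_dir, graph_name):
    """Return the (nodes, edges) CSV paths for a graph name such as 'lotus/lotus'."""
    base = Path(data_dir) / graph_name
    nodes = base.with_name(base.name + "_nodes.csv")
    edges = base.with_name(base.name + "_edges.csv")
    return nodes, edges


def missing_graph_files(data_dir, graph_name):
    """List the expected CSV files that are not present on disk."""
    return [path for path in graph_files(data_dir, graph_name) if not path.is_file()]


# Example: check one of the two graphs considered for the predictions.
missing = missing_graph_files("data", "full_graph_with_ncbi_clean")
if missing:
    print("Missing files:", ", ".join(str(path) for path in missing))
else:
    print("Both files are present.")
```

An empty list means both files were found. Otherwise the corresponding preparation script probably needs to be run again.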
TODO :
- add to Zotero graph of mol to mol similarity (with explanations)
- Add a wget to get that graph and avoid users running stuff for hours.
- merge current graph with graph of mol to mol similarity and make it the default graph for the analysis.
- add explanation of this new graph above.
This is not possible at the moment because the ensmallen module from grape does not support HyperSketching yet. Once it is available, we recommend first running the run_model_dummy.py script with max_eval=1. This will create the sketches for the different training and testing holdouts. Then you can run the script with max_eval=100 or more to find the best parameters of the model.
Once the best parameters are found, you can train the model using the train_model.py script. You should manually change the parameters and the model in the script according to the best parameters found.
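As a rough illustration of why a larger evaluation budget matters, here is a toy random search. It is not the search used by the repository's scripts. The search space, the objective and the function names are all hypothetical, and only the idea of a budget of evaluations is taken from the text above.

```python
import random


def random_search(objective, space, max_evals, seed=0):
    """Sample `max_evals` parameter sets at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(max_evals):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; the real scripts define their own.
space = {"max_depth": [2, 4, 8, 16], "min_samples_leaf": [1, 5, 10]}


def toy_objective(params):
    """Stand-in for a validation score; higher is better."""
    return -abs(params["max_depth"] - 8) - abs(params["min_samples_leaf"] - 5)


_, score_1 = random_search(toy_objective, space, max_evals=1)
best_params, score_100 = random_search(toy_objective, space, max_evals=100)

print("best score with 1 evaluation:", score_1)
print("best score with 100 evaluations:", score_100)
print("parameters to copy into the training step:", best_params)
```

Both runs use the same seed, so the 100-evaluation run starts with the same draw as the single-evaluation run and can only match or improve on it. The printed best parameters play the role of the values that are copied by hand into train_model.py.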
Here are some new molecules from :
- CCC\C=C\C=C\C=C\C(O)CC1(O)OC(CC(O)C(O)C(O)C2OC(=O)C(C)C2O)C(O)C(O)C1O from https://pubs.acs.org/doi/10.1021/acs.jnatprod.3c01043
- CC(C)CCC(C)C(=O)NCCCNC(=N)N from https://pubs.acs.org/doi/10.1021/acs.jnatprod.3c01186
- CC2=CC(=O)CC3C(C)C(C)(COC1CC(O)C(O)C(C)O1)CCC23C from https://pubs.acs.org/doi/10.1021/acs.jnatprod.3c00752
- CC1CCc2c(CCC=C(C)C)cnc3c(C)cc(O)c1c23 from https://pubs.acs.org/doi/10.1021/acs.jnatprod.3c01072
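If you want to use these molecules from Python, note that the first SMILES contains backslashes (they encode double-bond geometry), so it is safest to write it as a raw string. The sketch below only stores the molecules with their sources. The variable name `new_molecules` is hypothetical, and how the strings are passed to a model is left to the repository's scripts.

```python
# Raw strings (r"...") keep the backslashes in the SMILES exactly as written.
new_molecules = [
    (
        r"CCC\C=C\C=C\C=C\C(O)CC1(O)OC(CC(O)C(O)C(O)C2OC(=O)C(C)C2O)C(O)C(O)C1O",
        "https://pubs.acs.org/doi/10.1021/acs.jnatprod.3c01043",
    ),
    (
        r"CC(C)CCC(C)C(=O)NCCCNC(=N)N",
        "https://pubs.acs.org/doi/10.1021/acs.jnatprod.3c01186",
    ),
    (
        r"CC2=CC(=O)CC3C(C)C(C)(COC1CC(O)C(O)C(C)O1)CCC23C",
        "https://pubs.acs.org/doi/10.1021/acs.jnatprod.3c00752",
    ),
    (
        r"CC1CCc2c(CCC=C(C)C)cnc3c(C)cc(O)c1c23",
        "https://pubs.acs.org/doi/10.1021/acs.jnatprod.3c01072",
    ),
]

for smiles, source in new_molecules:
    print(f"{smiles}\t{source}")
```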