Critical Insights into Data Curation and Label Noise for Accurate Prediction of Aerobic Biodegradability of Organic Chemicals

This is the code base for the paper "Critical Insights into Data Curation and Label Noise for Accurate Prediction of Aerobic Biodegradability of Organic Chemicals" written by Paulina Körner, Dr. Juliane Glüge, Dr. Stefan Glüge and Prof. Dr. Martin Scheringer.

Abstract

The focus of the current study is to enhance state-of-the-art Machine Learning (ML) models that can predict the aerobic biodegradability of organic chemicals through a data-centric approach. To do that, an existing dataset that was previously used to train ML models was analyzed for mismatching chemical identifiers and for data leakage between the test and training sets, and the detected errors were corrected. Chemicals with high variance between study results were removed. An XGBoost model was trained on the dataset and compared to an XGBoost model that was trained on a dataset where certain substances were excluded. The results show that despite comprehensive data curation, only marginal improvement was observed in the classification model’s performance. This was attributed to three potential reasons: 1) a significant number of data labels were noisy, 2) the features could not sufficiently represent the chemicals, and/or 3) the model struggled to learn and generalize effectively. All three potential reasons were examined, but only removing data points with possibly noisy labels, by performing label noise filtering using other predictive models, increased the classification model’s balanced accuracy from 80.9% to 94.2%. While no indications were found that label noise filtering removed difficult-to-learn substances, this possibility cannot be entirely ruled out.

Setup

To run the code, Python 3 is required. Three virtual environments are needed to run all scripts. The environment used to run most of the files can be installed like this:

python3 -m venv main_venv
source main_venv/bin/activate
pip install -r requirements.txt

To create the second environment, which is required to run the downloaded models from Huang and Zhang [2022], run the following code:

python3 -m venv Huang_venv
source Huang_venv/bin/activate
pip install -r requirements_huang_zhang_replication.txt

The third environment is only required if one wants to use MolGpKa to add pKa and $\alpha$ values to a data frame:

python3 -m venv molgpka_venv
source molgpka_venv/bin/activate
pip install -r requirements_molgpka.txt

Applying the models to make predictions on new substances

If you want to use the provided classifiers to predict the ready biodegradability of organic substances, you can use our Biodegradability prediction app or the apply_models_example.ipynb file.

Overview of the scripts

processing_functions.py

Contains general helper functions used by the other scripts.

ml_functions.py

Contains functions for creating, validating, and testing machine learning models.

data_processing.py

This file carries out the steps described in the SMILES-Retrieval-Pipeline to create the $\text{Curated}_\text{S}$ and the $\text{Curated}_\text{SCS}$ datasets. It uses the dataset iuclid_echa.csv, which includes information on biodegradation screening tests from REACH. The code for retrieving this data will be published separately.

add_pka_values.py

To run this file, the molgpka_venv needs to be activated. It can be used to add pKa and $\alpha$ values to datasets.

Huang_Zhang_replicated.py

To run this file, the Huang_venv needs to be activated. The purpose of this file is to run the models presented by Huang and Zhang [2022], replicate the models, and test the models on the additional testing set.

creating_datasets.py

This file creates all curated datasets and saves them to the dataframes folder. This file needs to be run before running the following scripts.
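The abstract mentions that the original dataset was checked for data leakage between the test and training sets. The following is a minimal sketch of such a check, not the code used in this repository. The function name `remove_leakage` and the example identifiers are hypothetical, and the sketch assumes that the identifiers (here SMILES strings) have already been canonicalized so that identical substances compare equal.

```python
def remove_leakage(train_ids, test_ids):
    """Drop every training identifier that also occurs in the test set.

    Returns the cleaned training list (original order kept) and the
    sorted list of identifiers that were found in both sets.
    """
    test_set = set(test_ids)
    leaked = sorted(set(train_ids) & test_set)
    cleaned = [identifier for identifier in train_ids if identifier not in test_set]
    return cleaned, leaked


# Illustrative identifiers only, not taken from the curated datasets.
train_ids = ["CCO", "c1ccccc1", "CC(=O)O"]
test_ids = ["CC(=O)O", "CCN"]

cleaned_train, leaked = remove_leakage(train_ids, test_ids)
print("Leaked substances:", leaked)
print("Training set after removal:", cleaned_train)
```

Removing the overlap from the training set rather than the test set keeps the test set unchanged, which is one possible design choice when results should stay comparable across models.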

curated_data.py

This file trains XGBoost models on the curated datasets. The train and test sets can be selected.
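The abstract reports model performance as balanced accuracy. As a reminder of what that metric measures, here is a small pure-Python sketch for binary labels. It is an illustration only and not the evaluation code of this repository; the function name and the example labels are made up, and label 1 is assumed to mean "readily biodegradable".

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("Both classes must be present in y_true.")
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Illustrative labels only, not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)
print(f"Balanced accuracy: {score:.3f}")
```

Because both classes contribute equally, the metric is less sensitive to class imbalance than plain accuracy.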

curated_data_analysis.py

This file carries out the analysis of the curated datasets after label validation with the BIOWIN™️ models.
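To illustrate the idea of validating labels with other predictive models, the sketch below splits records by whether their label agrees with external predictions. The rule used here (a record is kept only if every external prediction matches its label) is an assumption for illustration and may differ from the rule applied in the paper. The function name, the keys `model_1` and `model_2`, and the records are hypothetical.

```python
def split_by_label_agreement(records, model_keys):
    """Separate records whose label matches all external predictions.

    Returns two lists: records where every prediction equals the label,
    and records where at least one prediction disagrees.
    """
    agreed, disputed = [], []
    for record in records:
        if all(record[key] == record["label"] for key in model_keys):
            agreed.append(record)
        else:
            disputed.append(record)
    return agreed, disputed


# Made-up records; "model_1" and "model_2" stand in for external predictions.
records = [
    {"id": "substance_a", "label": 1, "model_1": 1, "model_2": 1},
    {"id": "substance_b", "label": 0, "model_1": 1, "model_2": 1},
    {"id": "substance_c", "label": 0, "model_1": 0, "model_2": 1},
]

agreed, disputed = split_by_label_agreement(records, ["model_1", "model_2"])
print("Kept:", [r["id"] for r in agreed])
print("Possibly noisy:", [r["id"] for r in disputed])
```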

other_features.py

This file is used to test the impact of other feature creation methods (RDK fingerprints, Morgan fingerprints, and features created using the pretrained model Molformer).

improved_models.py

This file is used to run LazyPredict and to carry out the hyperparameter tuning.

applicability_domain.py

This file can be run to define the applicability domain of the models trained on the curated datasets.
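This README does not state how the applicability domain is defined. One common approach, shown here purely as an assumption, is to treat a query substance as inside the domain if its Tanimoto similarity to at least one training substance reaches a threshold. In this sketch, fingerprints are represented as sets of on-bit indices; the function names, the threshold of 0.5, and the fingerprints are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    if not union:
        return 0.0  # two empty fingerprints: treat as dissimilar
    return len(fp_a & fp_b) / len(union)


def in_domain(query_fp, training_fps, threshold=0.5):
    """True if the most similar training fingerprint reaches the threshold."""
    best = max((tanimoto(query_fp, fp) for fp in training_fps), default=0.0)
    return best >= threshold


# Made-up fingerprints, not derived from real substances.
training_fps = [{1, 2, 3, 4}, {2, 3, 5}]
query_close = {1, 2, 3}
query_far = {7, 8, 9}

print("Close query in domain:", in_domain(query_close, training_fps))
print("Far query in domain:", in_domain(query_far, training_fps))
```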

