saviong/molecular-property-predictor

Molecular Property Predictor

This is my first cheminformatics project. It predicts the aqueous solubility (LogS) of a molecule from its SMILES string using a machine learning model.

Table of Contents

  1. Project Description
  2. Key Concepts & Terminology
  3. Project Structure
  4. Getting Started
  5. The Machine Learning Pipeline
  6. The Web Application
  7. Future Improvements
  8. License
  9. Acknowledgements

1. Project Description

This project is a web application that predicts a key physicochemical property of a molecule—its aqueous solubility (LogS)—based on its chemical structure. Users can input a molecule's structure using its SMILES (Simplified Molecular-Input Line-Entry System) string. The application then uses a pre-trained Random Forest Regressor model to predict the LogS value and displays it to the user, along with a 2D visualization of the molecule.

The entire machine learning pipeline is demonstrated, from data acquisition and featurization to model training and deployment via a user-friendly web interface.

Core Technologies Used:

  • Python: The programming language used for the entire project.
  • RDKit: An open-source cheminformatics toolkit used for processing chemical structures and generating molecular descriptors.
  • Scikit-learn: A machine learning library used to train and evaluate the Random Forest model.
  • Pandas: A data manipulation library used for handling the chemical dataset.
  • Streamlit: A Python framework used to build and serve the interactive web application.
  • Docker: A containerization platform used to package the application and its dependencies for consistent and reproducible deployment.

2. Key Concepts & Terminology

  • SMILES (Simplified Molecular-Input Line-Entry System): A standard method for representing a 2D chemical structure using a short string of ASCII characters. For example, the SMILES for ethanol is CCO. Please refer to Wikipedia for further information.
  • LogS (Aqueous Solubility): The base-10 logarithm of a compound's molar solubility in water (in mol/L). It measures how much of a substance can dissolve in water: a higher LogS value means the compound is more soluble, and each unit of LogS corresponds to a tenfold change in solubility.
  • Cheminformatics: A field of science that uses computational and informational techniques to solve problems in chemistry. This project is a classic example of a cheminformatics application.
  • Featurization (Molecular Descriptors): The process of converting a chemical structure (like a SMILES string) into a numerical representation that a machine learning model can understand. RDKit can calculate around 200 such values, called molecular descriptors (e.g., molecular weight, number of hydrogen bond donors, polar surface area), which are used as features for the model.
  • Random Forest Regressor: A type of machine learning algorithm known as an ensemble model. It builds multiple decision trees during training and outputs the average prediction of the individual trees, making it robust and effective for regression tasks like this one.
  • Virtual Environment: An isolated Python environment that allows you to manage dependencies for a specific project separately from other projects. This prevents conflicts between package versions.
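
To make the LogS definition above concrete, here is a minimal sketch that converts between LogS and a concentration. The function names are illustrative and not part of this repository, and the molar mass in the example is a hypothetical value chosen only to show the arithmetic.

```python
import math


def logs_to_molar(log_s: float) -> float:
    """Convert LogS (base-10 log of molar solubility) to mol/L."""
    return 10.0 ** log_s


def molar_to_logs(mol_per_litre: float) -> float:
    """Convert a molar solubility in mol/L to LogS."""
    if mol_per_litre <= 0:
        raise ValueError("solubility must be positive")
    return math.log10(mol_per_litre)


def logs_to_grams_per_litre(log_s: float, molar_mass: float) -> float:
    """Convert LogS to g/L, given the molar mass in g/mol."""
    return logs_to_molar(log_s) * molar_mass


if __name__ == "__main__":
    # Each unit of LogS is a tenfold change in molar solubility.
    print(logs_to_molar(-2.0))
    print(logs_to_molar(-3.0))
    # Hypothetical compound with a molar mass of 180 g/mol.
    print(logs_to_grams_per_litre(-2.0, 180.0))
```

A predicted LogS of -2 therefore means roughly 0.01 mol/L, and a prediction of -3 means a solubility ten times lower.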

3. Project Structure

molecular_predictor/
├── data/
│   └── delaney_solubility.csv      # The dataset used for training
├── venv/                           # The Python virtual environment (created locally)
├── app.py                          # The Streamlit web application script
├── Dockerfile                      # Instructions to build the Docker container
├── features.json                   # List of features used by the model (generated by train.py)
├── model.pkl                       # The saved, trained Scikit-learn model (generated by train.py)
├── README.md                       # This file
├── requirements.txt                # A list of all required Python packages
└── train.py                        # The script to train the machine learning model

4. Getting Started

Follow these instructions to get the project running on your local machine.

4.1. Prerequisites

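Make sure you have the following software installed on your system:

  • Python 3 with pip (for the local installation)
  • Git (to clone the repository)
  • Docker Desktop (only needed for the Docker usage in section 4.3)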

4.2. Local Installation & Usage

This method runs the application directly on your host machine using a Python virtual environment.

  1. Clone the repository:

    git clone <your-repository-url>
    cd molecular_predictor
  2. Create and activate a virtual environment:

    # Create the virtual environment
    python -m venv venv
    
    # Activate it (on Windows Command Prompt)
    venv\Scripts\activate
    # On macOS/Linux, use this instead: source venv/bin/activate
  3. Install the required packages:

    pip install -r requirements.txt
  4. Train the model: This script will read the dataset and generate the model.pkl and features.json files.

    python train.py
  5. Run the Streamlit application:

    streamlit run app.py

    Your web browser will automatically open with the application running.

4.3. Docker Usage

This method runs the application inside an isolated Docker container, which is the recommended way to test for deployment.

  1. Ensure Docker Desktop is running.

  2. Generate the model artifacts first: The Docker image needs the trained model files (model.pkl and features.json) to exist before it is built. Run the local training script first (see steps 2, 3, and 4 of the local installation guide).

    # Activate local venv
    venv\Scripts\activate
    # Run training
    python train.py
    # Deactivate venv (optional)
    deactivate
  3. Build the Docker image: From the project's root directory, run the following command. This will build an image named molecular-predictor.

    docker build -t molecular-predictor .

  4. Run the Docker container: This command starts a container from the image you just built.

    docker run -d -p 8501:8501 --name molecular-app molecular-predictor
    • -d runs the container in the background.
    • -p 8501:8501 maps your local port 8501 to the container's port 8501.
    • --name molecular-app gives the container a convenient name.
  5. Access the application: Open your web browser and navigate to: http://localhost:8501
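
Because the container runs in the background, it can be useful to confirm that something is listening on the mapped port before opening the browser. The sketch below uses only the Python standard library. The helper name port_is_open is hypothetical and not part of this repository, and it only checks that a TCP connection succeeds, not that the app itself is healthy.

```python
import socket


def port_is_open(host="localhost", port=8501, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    print("app reachable:", port_is_open())
```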

5. The Machine Learning Pipeline

The train.py script automates the entire machine learning pipeline:

  1. Load Data: Reads the delaney_solubility.csv file into a pandas DataFrame.
  2. Featurization: For each SMILES string, it uses RDKit to calculate ~200 molecular descriptors.
  3. Train Model: It splits the data into training and testing sets and trains a RandomForestRegressor model on the training data.
  4. Evaluate Model: It evaluates the model's performance on the unseen test set and prints metrics like Mean Squared Error (MSE) and R-squared (R²).
  5. Save Artifacts: It saves the trained model (model.pkl) and the list of feature names (features.json) to disk for the web application to use.
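
The five steps above can be sketched roughly as follows. This is not the repository's train.py but a minimal illustration under several assumptions: the column names smiles and logS, the six-descriptor subset (the real script uses about 200), the 80/20 split, the hyperparameters, and the use of pickle for saving are all placeholders chosen for brevity.

```python
import json
import pickle

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Illustrative subset of RDKit descriptors (assumption); the full list of
# (name, function) pairs is available as Descriptors.descList.
DESCRIPTORS = {
    "MolWt": Descriptors.MolWt,
    "MolLogP": Descriptors.MolLogP,
    "NumHDonors": Descriptors.NumHDonors,
    "NumHAcceptors": Descriptors.NumHAcceptors,
    "TPSA": Descriptors.TPSA,
    "NumRotatableBonds": Descriptors.NumRotatableBonds,
}


def featurize(smiles):
    """Return a {descriptor name: value} dict, or None for invalid SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {name: float(fn(mol)) for name, fn in DESCRIPTORS.items()}


def train(df, smiles_col="smiles", target_col="logS",
          model_path="model.pkl", features_path="features.json"):
    """Featurize, split, train, evaluate, and save the two artifacts."""
    rows = [(featurize(s), y) for s, y in zip(df[smiles_col], df[target_col])]
    rows = [(f, y) for f, y in rows if f is not None]  # drop unparsable SMILES
    X = pd.DataFrame([f for f, _ in rows])
    y = pd.Series([y for _, y in rows])

    # Hold out 20% of the molecules as an unseen test set (assumed ratio).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    metrics = {"mse": mean_squared_error(y_test, y_pred),
               "r2": r2_score(y_test, y_pred)}

    # Persist the model and the exact feature order for the web app.
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    with open(features_path, "w") as fh:
        json.dump(list(X.columns), fh)
    return model, metrics
```

To try it on the dataset, load data/delaney_solubility.csv with pd.read_csv and pass the frame to train, setting smiles_col and target_col to whatever the CSV header actually uses.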

6. The Web Application

The app.py script uses Streamlit to create the user interface. Its workflow is as follows:

  1. Load Artifacts: At startup, it loads the saved model.pkl and features.json files.
  2. User Input: It provides an interface for the user to enter a SMILES string, assisted by a searchable dropdown of all molecules from the training set.
  3. Featurize Input: When the user submits a SMILES string, the app uses the exact same RDKit featurization process as the training script.
  4. Predict: It feeds the generated features into the loaded model to get a LogS prediction.
  5. Display Result: It displays the predicted LogS value and a 2D image of the molecule's structure.
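
The key point in this workflow is that the input is featurized with the same descriptors, in the same column order, as during training. The sketch below illustrates that idea without the Streamlit layer. It is not the repository's app.py: the helper names are hypothetical, and it assumes that features.json holds a list of RDKit descriptor names and that model.pkl was written with pickle.

```python
import json
import pickle

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Draw

# Map every RDKit descriptor name to the function that computes it.
DESCRIPTOR_FUNCS = dict(Descriptors.descList)


def load_artifacts(model_path="model.pkl", features_path="features.json"):
    """Load the trained model and the ordered list of feature names."""
    with open(model_path, "rb") as fh:
        model = pickle.load(fh)
    with open(features_path) as fh:
        feature_names = json.load(fh)
    return model, feature_names


def predict_logs(smiles, model, feature_names):
    """Return the predicted LogS, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Single-row frame whose columns follow the order saved at training time.
    row = {name: float(DESCRIPTOR_FUNCS[name](mol)) for name in feature_names}
    X = pd.DataFrame([row], columns=feature_names)
    return float(model.predict(X)[0])


def draw_molecule(smiles, size=(300, 300)):
    """Return a PIL image of the 2D structure, or None for invalid SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Draw.MolToImage(mol, size=size)
```

In a Streamlit script, functions like these would typically sit behind widgets such as st.text_input for the SMILES string and st.image for the rendered structure.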

7. Future Improvements

This project serves as a strong foundation. Potential future improvements include:

  • Deployment: Deploying the application to a cloud service like Azure App Service using the provided Docker container.
  • Alternative Models: Experimenting with other regression models, such as Gradient Boosting or simple Neural Networks, to compare performance.
  • Different Properties: Modifying the pipeline to predict other properties, such as lipophilicity (LogP) or boiling point, using different datasets.
  • Advanced Featurization: Using molecular fingerprints (e.g., Morgan fingerprints) as an alternative to descriptors.
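
As a rough idea of what the fingerprint-based featurization mentioned above could look like, the sketch below turns SMILES strings into Morgan fingerprint bit vectors with RDKit. Nothing here exists in the repository yet: the function names are hypothetical, and the radius of 2 and the 2048-bit size are common choices used as assumptions.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Radius 2 and 2048 bits are assumed values, not settings from this project.
_GENERATOR = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def morgan_features(smiles):
    """Return a 0/1 NumPy vector for a SMILES string, or None if invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return _GENERATOR.GetFingerprintAsNumPy(mol)


def featurize_many(smiles_list):
    """Stack fingerprints into a 2D matrix, skipping invalid SMILES."""
    vectors = [morgan_features(s) for s in smiles_list]
    return np.vstack([v for v in vectors if v is not None])
```

The resulting matrix could be passed to a regressor in place of the descriptor table, which would make it easy to compare the two featurizations on the same train/test split.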

8. License

This project is licensed under the MIT License. See the LICENSE file for details.

9. Acknowledgements

  • This project uses the Delaney (ESOL) dataset for aqueous solubility, originally published in:
    • Delaney, J. S. (2004). ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3), 1000-1005.
