This is my first foundational cheminformatics project that predicts the aqueous solubility (LogS) of a molecule from its SMILES string using a machine learning model.
- Project Description
- Key Concepts & Terminology
- Project Structure
- Getting Started
- The Machine Learning Pipeline
- The Web Application
- Future Improvements
- License
- Acknowledgements
This project is a web application that predicts a key physicochemical property of a molecule, its aqueous solubility (LogS), from its chemical structure. Users enter a molecule's structure as a SMILES (Simplified Molecular-Input Line-Entry System) string. The application then uses a pre-trained Random Forest Regressor model to predict the LogS value and displays it alongside a 2D visualization of the molecule.
The entire machine learning pipeline is demonstrated, from data acquisition and featurization to model training and deployment via a user-friendly web interface.
Core Technologies Used:
- Python: The programming language used for the entire project.
- RDKit: An open-source cheminformatics toolkit used for processing chemical structures and generating molecular descriptors.
- Scikit-learn: A machine learning library used to train and evaluate the Random Forest model.
- Pandas: A data manipulation library used for handling the chemical dataset.
- Streamlit: A Python framework used to build and serve the interactive web application.
- Docker: A containerization platform used to package the application and its dependencies for consistent and reproducible deployment.
- SMILES (Simplified Molecular-Input Line-Entry System): A standard method for representing a 2D chemical structure using a short string of ASCII characters. For example, the SMILES for ethanol is `CCO`. Please refer to Wikipedia for further information.
- LogS (Aqueous Solubility): The base-10 logarithm of the molar solubility (mol/L) of a compound in water. It is a measure of how much of a substance can dissolve in water. A higher LogS value means the compound is more soluble.
- Cheminformatics: A field of science that uses computational and informational techniques to solve problems in chemistry. This project is a classic example of a cheminformatics application.
- Featurization (or Molecular Descriptors): The process of converting a chemical structure (like a SMILES string) into a numerical representation that a machine learning model can understand. RDKit calculates hundreds of these numerical values (e.g., molecular weight, number of hydrogen bond donors, polar surface area), which are used as features for the model.
- Random Forest Regressor: A type of machine learning algorithm known as an ensemble model. It builds multiple decision trees during training and outputs the average prediction of the individual trees, making it robust and effective for regression tasks like this one.
- Virtual Environment: An isolated Python environment that allows you to manage dependencies for a specific project separately from other projects. This prevents conflicts between package versions.
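Because LogS is a base-10 logarithm, a predicted value can be converted back into a concentration. The following is a minimal sketch of that arithmetic; the function names are illustrative and are not part of this project's code.

```python
import math


def logs_to_molar(log_s: float) -> float:
    """Convert LogS (log10 of solubility in mol/L) to solubility in mol/L."""
    return 10.0 ** log_s


def molar_to_logs(solubility_mol_per_l: float) -> float:
    """Convert a solubility in mol/L to LogS; the solubility must be positive."""
    if solubility_mol_per_l <= 0:
        raise ValueError("solubility must be positive")
    return math.log10(solubility_mol_per_l)


def logs_to_mg_per_ml(log_s: float, mol_weight_g_per_mol: float) -> float:
    """mol/L times g/mol gives g/L, which is numerically equal to mg/mL."""
    return logs_to_molar(log_s) * mol_weight_g_per_mol


if __name__ == "__main__":
    # Each step of -1 in LogS is a tenfold drop in molar solubility.
    for log_s in (0.0, -2.0, -4.0):
        print(f"LogS {log_s:+.1f} -> {logs_to_molar(log_s):.4g} mol/L")
```

A difference of one LogS unit therefore corresponds to a tenfold difference in solubility, which is worth keeping in mind when reading prediction errors.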
molecular_predictor/
├── data/
│ └── delaney_solubility.csv # The dataset used for training
├── venv/ # The Python virtual environment (created locally)
├── app.py # The Streamlit web application script
├── Dockerfile # Instructions to build the Docker container
├── features.json # List of features used by the model (generated by train.py)
├── model.pkl # The saved, trained Scikit-learn model (generated by train.py)
├── README.md # This file
├── requirements.txt # A list of all required Python packages
└── train.py # The script to train the machine learning model
Follow these instructions to get the project running on your local machine.
Make sure you have the following software installed on your system:
- Python 3.11
- Git
- Docker Desktop (for Docker usage)
This method runs the application directly on your host machine using a Python virtual environment.
1. Clone the repository:
    git clone <your-repository-url>
    cd molecular_predictor
2. Create and activate a virtual environment:
    # Create the virtual environment
    python -m venv venv
    # Activate it (on Windows Command Prompt)
    venv\Scripts\activate
3. Install the required packages:
    pip install -r requirements.txt
4. Train the model: This script reads the dataset and generates the `model.pkl` and `features.json` files.
    python train.py
5. Run the Streamlit application:
    streamlit run app.py
Your web browser will automatically open with the application running.
This method runs the application inside an isolated Docker container, which is the recommended way to test for deployment.
1. Ensure Docker Desktop is running.
2. Generate the model artifacts first: The Docker image needs the trained model to be available when it is built. Run the local training script first (see steps 2, 3, and 4 of the local installation guide).
    # Activate the local venv
    venv\Scripts\activate
    # Run training
    python train.py
    # Deactivate the venv (optional)
    deactivate
3. Build the Docker image: From the project's root directory, run the following command. This builds an image named `molecular-predictor`.
    docker build -t molecular-predictor .
4. Run the Docker container: This command starts a container from the image you just built.
    docker run -d -p 8501:8501 --name molecular-app molecular-predictor
    - `-d` runs the container in the background.
    - `-p 8501:8501` maps your local port 8501 to the container's port 8501.
    - `--name molecular-app` gives the container a convenient name.
5. Access the application: Open your web browser and navigate to:
    http://localhost:8501
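Since the Docker build depends on the artifacts produced by `train.py`, a small pre-flight check can catch a missing file before `docker build` is run. This is a hedged sketch and not part of the repository; the helper name `missing_artifacts` is hypothetical, and it assumes the two artifacts live in the project root as shown in the project structure.

```python
from pathlib import Path

# Artifact names taken from the project structure; both are generated by train.py.
REQUIRED_ARTIFACTS = ("model.pkl", "features.json")


def missing_artifacts(project_root: str = ".") -> list[str]:
    """Return the required artifact files that are absent from project_root."""
    root = Path(project_root)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts()
    if missing:
        print(f"Missing: {', '.join(missing)}. Run 'python train.py' first.")
    else:
        print("All artifacts present; ready for 'docker build'.")
```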
The `train.py` script automates the entire machine learning pipeline:
- Load Data: Reads the `delaney_solubility.csv` file into a pandas DataFrame.
- Featurization: For each SMILES string, it uses RDKit to calculate ~200 molecular descriptors.
- Train Model: It splits the data into training and testing sets and trains a `RandomForestRegressor` model on the training data.
- Evaluate Model: It evaluates the model's performance on the unseen test set and prints metrics like Mean Squared Error (MSE) and R-squared (R2).
- Save Artifacts: It saves the trained model (`model.pkl`) and the list of feature names (`features.json`) to disk for the web application to use.
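To make the two evaluation metrics concrete, here is a minimal pure-Python sketch of how MSE and R2 are defined. It is an illustration only: scikit-learn provides `mean_squared_error` and `r2_score` in `sklearn.metrics` for this purpose, and the function names and LogS values below are made up for the example.

```python
def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean Squared Error: the average of the squared prediction errors."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    """R2 = 1 - SS_res / SS_tot, the fraction of variance explained."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up LogS values, used only to exercise the formulas.
    measured = [-1.0, -2.0, -3.0, -4.0]
    predicted = [-1.5, -2.0, -2.5, -4.0]
    print(f"MSE = {mse(measured, predicted):.3f}")
    print(f"R2  = {r_squared(measured, predicted):.3f}")
```

MSE is expressed in squared LogS units, so lower is better, while R2 approaches 1 as predictions explain more of the variance in the measured values.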
The `app.py` script uses Streamlit to create the user interface. Its workflow is as follows:
- Load Artifacts: At startup, it loads the saved `model.pkl` and `features.json` files.
- User Input: It provides an interface for the user to enter a SMILES string, assisted by a searchable dropdown of all molecules from the training set.
- Featurize Input: When the user submits a SMILES string, the app uses the exact same RDKit featurization process as the training script.
- Predict: It feeds the generated features into the loaded model to get a LogS prediction.
- Display Result: It displays the predicted LogS value and a 2D image of the molecule's structure.
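The reason for saving feature names alongside the model is that the features passed at prediction time must appear in the same order as during training. The sketch below shows one way this alignment could be done. It assumes `features.json` holds a plain JSON array of descriptor names, and the helper names are hypothetical rather than taken from `app.py`.

```python
import json


def load_feature_names(path: str) -> list[str]:
    """Load the ordered feature names (assumed to be a JSON array of strings)."""
    with open(path, encoding="utf-8") as fh:
        names = json.load(fh)
    if not isinstance(names, list):
        raise ValueError("expected a JSON array of feature names")
    return names


def to_feature_row(descriptors: dict[str, float], feature_names: list[str]) -> list[float]:
    """Order computed descriptors to match the training-time feature order."""
    missing = [name for name in feature_names if name not in descriptors]
    if missing:
        raise KeyError(f"missing descriptors: {missing}")
    # Descriptors not listed in feature_names are ignored on purpose.
    return [float(descriptors[name]) for name in feature_names]


if __name__ == "__main__":
    # Illustrative names and values; the real list comes from features.json.
    names = ["MolWt", "NumHDonors", "TPSA"]
    computed = {"TPSA": 20.23, "MolWt": 46.07, "NumHDonors": 1}
    print(to_feature_row(computed, names))
```

Failing loudly on a missing descriptor is a deliberate choice here: silently filling in a default would let the model predict from inputs it was never trained on.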
This project serves as a strong foundation. Potential future improvements include:
- Deployment: Deploying the application to a cloud service like Azure App Service using the provided Docker container.
- Alternative Models: Experimenting with other regression models, such as Gradient Boosting or simple Neural Networks, to compare performance.
- Different Properties: Modifying the pipeline to predict other properties, such as lipophilicity (LogP) or boiling point, using different datasets.
- Advanced Featurization: Using molecular fingerprints (e.g., Morgan fingerprints) as an alternative to descriptors.
This project is licensed under the MIT License. See the `LICENSE` file for details.
- This project uses the Delaney (ESOL) dataset for aqueous solubility, originally published in:
- Delaney, J. S. (2004). ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3), 1000-1005.