This repository contains the complete computational pipeline for QSAR (Quantitative Structure-Activity Relationship) modeling and virtual screening of anti-dengue compounds derived from Colombian medicinal flora. The study integrates ethnobotanical knowledge with machine learning to identify promising drug candidates.
- Curated Dataset: 358 Colombian medicinal plants with antiviral activity
- Machine Learning: XGBoost-based QSAR model (MCC = 0.583)
- Virtual Screening: 3,267 phytochemicals screened
- Comprehensive Analysis: Chemical diversity, applicability domain, and SAR analysis
- Bayesian Optimization: Automated hyperparameter tuning using Optuna
If you use this code or data, please cite:
[citation will be added here once published]
QSAR-DENV/
├── README.md
├── requirements.txt
├── environment.yml
├── LICENSE
├── scripts/
│ ├── 01_exploratory_data_analysis.py
│ ├── 02_chemical_diversity_analysis.py
│ ├── 03_data_preparation_baseline_models.py
│ ├── 04_advanced_model_optimization.py
│ └── 05_qsar_prediction_tool.py
├── data/
│ └── example_compounds.csv
├── models/
│ └── README.md
├── notebooks/
│ └── example_usage.ipynb
└── docs/
  └── user_guide.md
- Python 3.8 or higher
- pip or conda package manager
# Clone the repository
git clone https://github.com/Sergio111999/QSAR-DENV.git
cd QSAR-DENV
# Install dependencies
pip install -r requirements.txt

# Alternative setup with conda: clone the repository
git clone https://github.com/Sergio111999/QSAR-DENV.git
cd QSAR-DENV
# Create conda environment
conda env create -f environment.yml
conda activate qsar-denv

python scripts/01_exploratory_data_analysis.py
Performs comprehensive EDA on the anti-dengue bioactivity dataset, including:
- Activity distribution analysis
- Physicochemical property correlations
- Molecular weight and drug-likeness statistics
python scripts/02_chemical_diversity_analysis.py
Analyzes chemical diversity through:
- Molecular scaffold identification
- Chemical clustering
- Fingerprint similarity analysis
- Pharmacophore feature extraction
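As an illustration of the fingerprint similarity step, the sketch below computes the Tanimoto coefficient between two fingerprints represented as sets of on-bit indices. The set representation, the function name `tanimoto`, and the toy bit indices are assumptions made for this example; the analysis script may rely on RDKit's own fingerprint and similarity routines instead.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Toy fingerprints (hypothetical bit indices, not real molecules)
fp1 = {1, 5, 9, 12}
fp2 = {1, 5, 7, 12, 20}

similarity = tanimoto(fp1, fp2)
print(f"Tanimoto similarity: {similarity:.2f}")
```

Values close to 1 indicate near-identical fingerprints, and values close to 0 indicate structurally unrelated compounds.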
python scripts/04_advanced_model_optimization.py
Trains and optimizes QSAR models using:
- Multiple machine learning algorithms (XGBoost, LightGBM, ExtraTrees)
- Bayesian hyperparameter optimization (Optuna)
- 5-fold cross-validation
- Comprehensive performance metrics
python scripts/05_qsar_prediction_tool.py --input your_compounds.csv
Screens new compounds with:
- Activity prediction
- Drug-likeness assessment (QED)
- Applicability domain evaluation
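This README does not state which applicability domain method the prediction tool uses. As one common and simple possibility, the sketch below flags a compound as in-domain when each of its descriptors lies within the range seen in the training set (a bounding-box check). The function names, the two descriptors, and all values are hypothetical.

```python
def fit_descriptor_ranges(training_rows):
    """Return per-descriptor (min, max) ranges observed in the training set."""
    columns = list(zip(*training_rows))
    return [(min(col), max(col)) for col in columns]


def in_domain(row, ranges):
    """True if every descriptor value falls inside the training range."""
    return all(low <= value <= high for value, (low, high) in zip(row, ranges))


# Hypothetical descriptor rows: (molecular weight, logP)
training = [(180.2, 1.2), (320.4, 3.1), (410.5, 4.0)]
ranges = fit_descriptor_ranges(training)

queries = {"inside": (300.0, 2.5), "outside": (650.8, 2.5)}
flags = {name: in_domain(row, ranges) for name, row in queries.items()}
print(flags)
```

Predictions for compounds flagged as out-of-domain are extrapolations and should be treated with more caution.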
Main dependencies (see requirements.txt for complete list):
- rdkit >= 2022.09.1
- pandas >= 1.5.0
- numpy >= 1.23.0
- scikit-learn >= 1.1.0
- xgboost >= 1.7.0
- lightgbm >= 3.3.0
- optuna >= 3.0.0
- matplotlib >= 3.6.0
- seaborn >= 0.12.0
The curated anti-dengue bioactivity dataset from ChEMBL is available in the supplementary materials of the associated publication.
The complete library of 3,267 compounds from 358 Colombian medicinal plants is available in the supplementary materials.
Pre-trained models are available for download:
- Download from Zenodo (will be added)
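import pickle

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Load trained model
with open('models/xgboost_final_model.pkl', 'rb') as f:
    model = pickle.load(f)
# Load your compounds
df = pd.read_csv('your_compounds.csv')  # Must have 'SMILES' column
# Build the feature matrix X. It must match the features used to train the model.
# ASSUMPTION: Morgan fingerprints (radius 2, 2048 bits) are a placeholder only;
# replace this step with the featurization used in the training scripts.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
# MolFromSmiles returns None for invalid SMILES; remove such rows beforehand.
mols = [Chem.MolFromSmiles(smiles) for smiles in df['SMILES']]
X = np.array([generator.GetFingerprintAsNumPy(mol) for mol in mols])
# Make predictions
predictions = model.predict_proba(X)[:, 1]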
print(f"Predicted anti-dengue activity: {predictions[0]:.3f}")
The final optimized XGBoost model achieved:
- MCC: 0.583
Performance was evaluated using 5-fold cross-validation on a balanced dataset.
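For readers unfamiliar with the metric, the Matthews correlation coefficient (MCC) can be computed from confusion-matrix counts as sketched below. The function name `mcc_from_counts` and the counts are hypothetical and are not the study's results.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for illustration only
mcc = mcc_from_counts(tp=40, tn=40, fp=10, fn=10)
print(f"MCC = {mcc:.3f}")
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).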
- Total compounds screened: 3,267
- High-activity predictions (>0.7): 276
- High drug-likeness (QED >0.5): 20
- Top priority candidates: 14 novel compounds
The top priority compounds identified combine:
- High predicted anti-dengue activity
- Favorable drug-likeness properties (QED)
- Location within model's applicability domain
- Novel chemical scaffolds not present in training data
See supplementary materials for complete results.
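The prioritization criteria listed above can be expressed as a simple filter. The sketch below applies the reported cutoffs (predicted activity > 0.7, QED > 0.5) together with the domain and scaffold-novelty criteria. The record fields, compound identifiers, and values are hypothetical, and the actual selection procedure may differ.

```python
# Hypothetical screening records; field names are assumptions for this sketch.
candidates = [
    {"id": "cpd_001", "p_active": 0.91, "qed": 0.62, "in_domain": True, "novel_scaffold": True},
    {"id": "cpd_002", "p_active": 0.85, "qed": 0.31, "in_domain": True, "novel_scaffold": True},
    {"id": "cpd_003", "p_active": 0.55, "qed": 0.70, "in_domain": True, "novel_scaffold": False},
    {"id": "cpd_004", "p_active": 0.78, "qed": 0.58, "in_domain": False, "novel_scaffold": True},
]

ACTIVITY_CUTOFF = 0.7  # predicted probability of activity
QED_CUTOFF = 0.5       # drug-likeness score


def is_priority(compound):
    """True if a compound passes all four prioritization criteria."""
    return (
        compound["p_active"] > ACTIVITY_CUTOFF
        and compound["qed"] > QED_CUTOFF
        and compound["in_domain"]
        and compound["novel_scaffold"]
    )


# Keep passing compounds, ranked by predicted activity (highest first)
priority = sorted(
    (c for c in candidates if is_priority(c)),
    key=lambda c: c["p_active"],
    reverse=True,
)
priority_ids = [c["id"] for c in priority]
print(priority_ids)
```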
- User Guide (docs/user_guide.md): detailed usage instructions
- Supplementary Materials - Complete data and results
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.