QSAR-DENV: Anti-Dengue Drug Discovery from Colombian Medicinal Flora

Overview

This repository contains the complete computational pipeline for QSAR (Quantitative Structure-Activity Relationship) modeling and virtual screening of anti-dengue compounds derived from Colombian medicinal flora. The study integrates ethnobotanical knowledge with machine learning to identify promising drug candidates.

Key Features

Curated Dataset: 358 Colombian medicinal plants with antiviral activity
Machine Learning: XGBoost-based QSAR model (MCC = 0.583)
Virtual Screening: 3,267 phytochemicals screened
Comprehensive Analysis: Chemical diversity, applicability domain, and SAR analysis
Bayesian Optimization: Automated hyperparameter tuning using Optuna

Citation

If you use this code or data, please cite:

[citation will be added here once published]

Repository Structure

QSAR-DENV/
├── README.md                          
├── requirements.txt                   
├── environment.yml                   
├── LICENSE                           
├── scripts/                          
│   ├── 01_exploratory_data_analysis.py
│   ├── 02_chemical_diversity_analysis.py
│   ├── 03_data_preparation_baseline_models.py
│   ├── 04_advanced_model_optimization.py
│   └── 05_qsar_prediction_tool.py
├── data/                            
│   └── example_compounds.csv
├── models/                          
│   └── README.md
├── notebooks/                       
│   └── example_usage.ipynb
└── docs/                           
    └── user_guide.md

Installation

Prerequisites

Python 3.8 or higher
pip or conda package manager

Option 1: Using pip

# Clone the repository
git clone https://github.com/Sergio111999/QSAR-DENV.git
cd QSAR-DENV

# Install dependencies
pip install -r requirements.txt

Option 2: Using conda

# Clone the repository
git clone https://github.com/Sergio111999/QSAR-DENV.git
cd QSAR-DENV

# Create conda environment
conda env create -f environment.yml
conda activate qsar-denv

Quick Start

1. Exploratory Data Analysis

python scripts/01_exploratory_data_analysis.py

Performs exploratory data analysis (EDA) on the anti-dengue bioactivity dataset, including:

Activity distribution analysis
Physicochemical property correlations
Molecular weight and drug-likeness statistics
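As a rough illustration of the kind of analysis listed above, the following sketch computes a few RDKit descriptors and summarizes them with pandas. It is not the code in scripts/01_exploratory_data_analysis.py; the toy molecules, the column names (`SMILES`, `active`), and the helper `describe_molecule` are assumptions made for this example.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

# Toy input; the real dataset and its column names may differ (assumption).
df = pd.DataFrame({
    "SMILES": ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1"],
    "active": [0, 1, 0],
})

def describe_molecule(smiles):
    """Return a few physicochemical descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES
        return {"MolWt": None, "LogP": None, "HBD": None, "HBA": None, "QED": None}
    return {
        "MolWt": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "HBD": Descriptors.NumHDonors(mol),
        "HBA": Descriptors.NumHAcceptors(mol),
        "QED": QED.qed(mol),
    }

props = pd.DataFrame([describe_molecule(s) for s in df["SMILES"]])
eda = pd.concat([df, props], axis=1)

print(eda["active"].value_counts())                  # activity distribution
print(eda[["MolWt", "LogP", "QED"]].describe())      # summary statistics
print(eda[["MolWt", "LogP", "HBD", "HBA"]].corr())   # property correlations
```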

2. Chemical Diversity Analysis

python scripts/02_chemical_diversity_analysis.py

Analyzes chemical diversity through:

Molecular scaffold identification
Chemical clustering
Fingerprint similarity analysis
Pharmacophore feature extraction
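Two of the steps above, scaffold identification and fingerprint similarity, can be sketched with RDKit as follows. This is a minimal example under stated assumptions: the three toy molecules are placeholders, and Morgan fingerprints with radius 2 and 2048 bits are a common default, not necessarily the settings used in scripts/02_chemical_diversity_analysis.py.

```python
from collections import defaultdict

from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.Scaffolds import MurckoScaffold

# Toy molecules (assumption: any set of valid SMILES works here).
smiles = {
    "toluene": "Cc1ccccc1",
    "phenol": "Oc1ccccc1",
    "pyridine": "c1ccncc1",
}
mols = {name: Chem.MolFromSmiles(s) for name, s in smiles.items()}

# Group molecules by their Bemis-Murcko scaffold (as a SMILES string).
by_scaffold = defaultdict(list)
for name, mol in mols.items():
    by_scaffold[MurckoScaffold.MurckoScaffoldSmiles(mol=mol)].append(name)

# Pairwise Tanimoto similarity on Morgan fingerprints (radius 2, 2048 bits).
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fps = {name: gen.GetFingerprint(mol) for name, mol in mols.items()}
names = list(fps)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        sim = DataStructs.TanimotoSimilarity(fps[a], fps[b])
        print(f"{a} vs {b}: {sim:.2f}")

print(dict(by_scaffold))
```

Toluene and phenol share the benzene scaffold, so they land in the same group, while pyridine forms its own.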

3. Model Training and Optimization

python scripts/04_advanced_model_optimization.py

Trains and optimizes QSAR models using:

Multiple machine learning algorithms (XGBoost, LightGBM, ExtraTrees)
Bayesian hyperparameter optimization (Optuna)
5-fold cross-validation
Comprehensive performance metrics

4. Virtual Screening

python scripts/05_qsar_prediction_tool.py --input your_compounds.csv

Screens new compounds with:

Activity prediction
Drug-likeness assessment (QED)
Applicability domain evaluation
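This README does not state how the applicability domain is defined. One common approach, shown here purely as an assumption, is to call a compound in-domain when its nearest training-set neighbour exceeds a Tanimoto similarity threshold. The helper names `tanimoto` and `in_domain`, the 0.3 threshold, and the six-bit toy fingerprints are all hypothetical.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprint vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def in_domain(query_fp, train_fps, threshold=0.3):
    """Flag a query as in-domain if its nearest training neighbour is similar enough."""
    nearest = max(tanimoto(query_fp, fp) for fp in train_fps)
    return nearest >= threshold, nearest

# Toy six-bit fingerprints standing in for real training-set fingerprints.
train_fps = [[1, 1, 0, 0, 1, 0], [0, 1, 1, 0, 1, 0]]
similar = [1, 1, 0, 0, 0, 0]      # shares bits with the first training vector
dissimilar = [0, 0, 0, 1, 0, 1]   # shares no bits with either training vector

print(in_domain(similar, train_fps))
print(in_domain(dissimilar, train_fps))
```

Predictions for compounds flagged as out-of-domain are extrapolations and are usually treated with less confidence.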

Dependencies

Main dependencies (see requirements.txt for complete list):

rdkit >= 2022.09.1
pandas >= 1.5.0
numpy >= 1.23.0
scikit-learn >= 1.1.0
xgboost >= 1.7.0
lightgbm >= 3.3.0
optuna >= 3.0.0
matplotlib >= 3.6.0
seaborn >= 0.12.0

Data Availability

Training Dataset

The curated anti-dengue bioactivity dataset from ChEMBL is available in the supplementary materials of the associated publication.

Colombian Phytochemical Library

The complete library of 3,267 compounds from 358 Colombian medicinal plants is available in the supplementary materials.

Trained Models

Pre-trained models will be available for download from Zenodo (link to be added).

Usage Example

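import pickle

import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Load trained model (only unpickle files from a source you trust)
with open('models/xgboost_final_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Load your compounds
df = pd.read_csv('your_compounds.csv')  # Must have 'SMILES' column

# Build the feature matrix X from the SMILES strings.
# ASSUMPTION: 2048-bit Morgan fingerprints (radius 2) are a placeholder here;
# X must use the same features, in the same order, as the model was trained on.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
mols = [Chem.MolFromSmiles(s) for s in df['SMILES']]  # assumes every SMILES parses
X = np.array([list(gen.GetFingerprint(m)) for m in mols])

# Predicted probability of the active class for each compound
predictions = model.predict_proba(X)[:, 1]

print(f"Predicted anti-dengue activity: {predictions[0]:.3f}")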

Model Performance

The final optimized XGBoost model achieved:

Matthews correlation coefficient (MCC): 0.583

Performance was evaluated using 5-fold cross-validation on a balanced dataset.
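For readers unfamiliar with the metric, MCC is computed from the four cells of a binary confusion matrix and ranges from -1 (total disagreement) through 0 (chance level) to +1 (perfect prediction). The sketch below shows the standard formula; the counts are made up for illustration and are not the confusion matrix of the reported model.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:  # common convention: report 0 when a row or column is empty
        return 0.0
    return (tp * tn - fp * fn) / denom

print(mcc(tp=50, tn=50, fp=0, fn=0))      # perfect classifier
print(mcc(tp=25, tn=25, fp=25, fn=25))    # chance-level classifier
print(round(mcc(tp=40, tn=39, fp=11, fn=10), 3))
```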

Results

Virtual Screening Results

Total compounds screened: 3,267
High-activity predictions (>0.7): 276
High drug-likeness (QED >0.5): 20
Top priority candidates: 14 novel compounds

Top Identified Compounds

The top priority compounds identified combine:

High predicted anti-dengue activity
Favorable drug-likeness properties (QED)
Location within the model's applicability domain
Novel chemical scaffolds not present in training data

See supplementary materials for complete results.
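The four prioritization criteria can be combined into a single filter over a table of scored compounds. The sketch below uses the thresholds quoted under Virtual Screening Results (predicted activity > 0.7, QED > 0.5); the column names, the helper `prioritize`, and the four example rows are hypothetical and do not come from the actual screening output.

```python
import pandas as pd

# Made-up scores for illustration only; column names are hypothetical.
scored = pd.DataFrame({
    "compound": ["cpd_A", "cpd_B", "cpd_C", "cpd_D"],
    "p_active": [0.91, 0.75, 0.82, 0.40],
    "qed": [0.62, 0.35, 0.58, 0.80],
    "in_domain": [True, True, False, True],
    "scaffold_in_training": [False, False, False, True],
})

def prioritize(df, p_min=0.7, qed_min=0.5):
    """Keep compounds passing all four criteria, sorted by predicted activity."""
    mask = (
        (df["p_active"] > p_min)
        & (df["qed"] > qed_min)
        & df["in_domain"]
        & ~df["scaffold_in_training"]
    )
    return df[mask].sort_values("p_active", ascending=False)

top = prioritize(scored)
print(top["compound"].tolist())
```

In this toy table only `cpd_A` passes every criterion: `cpd_B` fails the QED cutoff, `cpd_C` is outside the applicability domain, and `cpd_D` has low predicted activity and a known scaffold.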

Documentation

User Guide (docs/user_guide.md) - Detailed usage instructions
Supplementary Materials - Complete data and results

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
