End-to-End Machine Learning Suite for Antimicrobial Resistance & Drug Discovery
A comprehensive collection of production-ready ML projects tackling critical challenges in infectious disease and drug development. From resistance prediction to generative drug design and automated diagnostics.
Classification Model for Antibiotic Resistance Detection
- Task: Binary classification (Susceptible vs Resistant)
- Model: Random Forest Classifier
- Accuracy: 94.9% | Sensitivity: 93.9% | Specificity: 95.9%
- Data: 4,383 E. coli isolates from NCBI
- App: `streamlit run src/app.py`
Regression Model for Antimicrobial Peptide Potency Prediction
- Task: MIC (Minimum Inhibitory Concentration) prediction
- Model: Random Forest Regressor
- R² Score: 0.9992 | RMSE: 0.024 log units
- Data: 3,143 antimicrobial peptides with MIC values against E. coli
- App: `streamlit run src/app_MIC.py`
Generative AI for Antimicrobial Peptide Design
- Task: Generate novel peptide sequences (generative modeling)
- Model: 2-Layer LSTM (PyTorch) - Character-level RNN
- Performance: Loss 0.8541 | Generates realistic AMP sequences
- Data: 2,872 E. coli peptides (10-50 AA length)
- Training: ~10 min CPU / ~2 min GPU | 50 epochs
- Status: ✅ Fully trained, ready for inference
- Use: Computational screening, rational design, drug discovery
1D ResNet for Multi-label Antimicrobial Resistance Prediction from Mass Spectrometry
- Task: Multi-label classification (10 antibiotics)
- Model: ResNet-1D (2M parameters) - Deep CNN with residual blocks
- Architecture: Conv1D → 4 ResBlock stages → Global AvgPool → FC → Sigmoid
- Input: MALDI-TOF mass spectra (6000 m/z bins)
- Loss: BCEWithLogitsLoss with pos_weight (handles class imbalance)
- Optimizer: AdamW (lr=1e-4, weight_decay=1e-5)
- Metrics: AUPRC, AUROC tracked via TensorBoard
- Training: 20 epochs with automatic best model checkpointing
- Features: Flexible model sizes (small/medium/large), feature extraction
- Documentation: See src/README.md for detailed architecture
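The class-imbalance handling above can be sketched in a few lines; the tensor shapes and the `pos_weight` value below are illustrative assumptions, not the project's actual configuration:

```python
import torch
import torch.nn as nn

# Illustrative shapes: a batch of 4 spectra, 10 antibiotic labels
logits = torch.randn(4, 10)                    # raw model outputs (pre-sigmoid)
labels = torch.randint(0, 2, (4, 10)).float()  # 1 = resistant, 0 = susceptible

# pos_weight > 1 up-weights the positive (resistant) class per antibiotic;
# assuming resistant isolates are ~3x rarer, a weight of 3.0 roughly rebalances
pos_weight = torch.full((10,), 3.0)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

loss = criterion(logits, labels)               # one scalar over all 10 labels
probs = torch.sigmoid(logits)                  # per-antibiotic probabilities
```

Because the loss takes raw logits, `sigmoid` is applied only at inference time to read off per-antibiotic probabilities.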
Challenge: Antibiotic-resistant bacteria cause ~1.3M deaths annually (WHO). Traditional lab testing takes 24-48 hours, delaying treatment.
Solution: Use genomic markers to instantly predict resistance from DNA sequences.
Challenge: Designing potent peptides requires expensive lab screening. Potency varies wildly (MIC: 0.1 - 1000+ µM).
Solution: Use machine learning to predict peptide efficacy and generate new candidates from physicochemical properties and sequence patterns.
Challenge: Design space for peptides is massive (20^50 ≈ 10^65 possibilities for 50-length sequences). Manual screening is infeasible.
Solution: Train generative AI to learn natural peptide patterns and create novel, biologically plausible sequences for experimental validation.
Challenge: MALDI-TOF mass spectrometry is fast (minutes) but requires expert interpretation. Multi-drug resistance requires testing 10+ antibiotics.
Solution: Train deep neural networks to directly predict resistance profiles from raw mass spectra, enabling instant multi-drug diagnostics.
```
ML-Training/
├── projects/
│   ├── cefixime-resistance-training/     # Antibiotic resistance classifier
│   │   ├── data/
│   │   │   ├── raw/                      # Original NCBI isolates
│   │   │   └── processed/                # Cleaned genotype data
│   │   ├── src/
│   │   │   ├── process.py                # Data preprocessing
│   │   │   └── train.py                  # Model training (RF classifier)
│   │   ├── models/
│   │   │   └── ceftriaxone_model.pkl     # Trained classifier
│   │   └── results/
│   │       ├── confusion_matrix.html     # Interactive CM
│   │       └── feature_importance.csv    # Top resistance genes
│   │
│   ├── MIC Regression/                   # Peptide potency regressor
│   │   ├── data/
│   │   │   ├── raw/                      # Raw peptide sequences & MIC values
│   │   │   └── processed/                # Computed physicochemical features
│   │   ├── src/
│   │   │   ├── process.py                # Data preprocessing
│   │   │   └── train.py                  # Model training (RF regressor)
│   │   ├── models/
│   │   │   └── mic_predictor.pkl         # Trained regressor
│   │   └── results/
│   │       ├── predicted_vs_actual.png   # Predictions visualization
│   │       └── feature_importance.png    # Top peptide features
│   │
│   └── week4_peptide_generator/          # Generative LSTM
│       ├── data/
│       │   └── ecolitraining_set_80.csv  # 2,872 E. coli peptides
│       ├── models/
│       │   ├── peptide_lstm.pth          # Best model (loss: 0.854)
│       │   └── config.json               # Training hyperparameters
│       ├── src/
│       │   ├── vocab.py                  # PeptideVocab: AA tokenization
│       │   └── train_generator.py        # PyTorch LSTM training
│       └── README.md
│
├── src/                                  # DeepG2P model & apps
│   ├── model.py                          # ResNet-1D architecture (DeepG2P, ResidualBlock)
│   ├── train.py                          # Training pipeline (BCEWithLogitsLoss, AdamW)
│   ├── app.py                            # Ceftriaxone classifier Streamlit app
│   ├── app_MIC.py                        # MIC regressor Streamlit app
│   ├── features.py                       # Biopython feature extraction
│   └── README.md                         # DeepG2P documentation
│
├── models/                               # Saved model checkpoints
│   ├── best_model.pth                    # Best validation loss checkpoint
│   └── checkpoint_epoch_*.pth            # Periodic training checkpoints
│
├── results/                              # Training outputs
│   ├── logs/                             # TensorBoard logs
│   └── training_config.json              # Hyperparameters & metadata
│
├── utils/
│   └── model_evaluation.py               # Shared evaluation metrics
│
├── requirements.txt                      # Python dependencies (PyTorch, sklearn, etc.)
└── README.md                             # This file
```
- Python 3.8+
- Git
```bash
# Clone repository
git clone https://github.com/vihaankulkarni29/ML-Training
cd ML-Training

# Install dependencies
pip install -r requirements.txt
```

Ceftriaxone Resistance Predictor (Classifier):

```bash
streamlit run src/app.py
```

Access at http://localhost:8501

AI Peptide Dosing Calculator (Regressor):

```bash
streamlit run src/app_MIC.py
```

Access at http://localhost:8501

DeepG2P Model Training:

```bash
# Train with default parameters
python src/train.py

# Custom training
python src/train.py \
    --train-features data/processed/X_train.npy \
    --train-labels data/processed/y_train.npy \
    --val-features data/processed/X_val.npy \
    --val-labels data/processed/y_val.npy \
    --epochs 20 \
    --batch-size 32 \
    --model-size medium

# Monitor training
tensorboard --logdir results/logs
```

Antibiotic susceptibility testing via culture takes 24-48 hours. Patients with life-threatening infections can't wait. Goal: Predict Ceftriaxone resistance instantly from genomic markers.
- Model: Random Forest Classifier (100 trees, balanced class weights)
- Data: 4,383 E. coli isolates from NCBI MicroBIGG-E
- Features: 352 detected resistance genes/mutations
| Metric | Value |
|---|---|
| Accuracy | 94.9% |
| Sensitivity | 93.9% |
| Specificity | 95.9% |
| ROC-AUC | 0.978 |
| Test Set Size | 876 isolates |
The model independently discovered known resistance mechanisms:
- blaCTX-M-15 (Extended-Spectrum Beta-Lactamase) - strongest predictor
- blaCMY-2 (AmpC Cephalosporinase)
- gyrA_S83L (Gyrase mutation - fluoroquinolone resistance)
Beta-lactamase genes encode enzymes that destroy beta-lactam antibiotics (e.g., cephalosporins) before they can bind to bacterial cell walls.
- Training: `projects/cefixime-resistance-training/src/train.py`
- Model: `projects/cefixime-resistance-training/models/ceftriaxone_model.pkl`
- App: `src/app.py`
Antimicrobial peptide (AMP) design is expensive and slow. Wet-lab screening for potency (MIC) takes months. Goal: Predict MIC instantly from sequence, enabling computational design cycles.
- Model: Random Forest Regressor (100 trees)
- Data: 3,143 antimicrobial peptides with MIC values against E. coli (NCBI)
- Target: `neg_log_mic_microM` (-log10 of MIC in µM)
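The target transform is a sign-flipped base-10 log; a minimal sketch (the helper name is illustrative):

```python
import numpy as np

def to_neg_log_mic(mic_uM):
    """Convert MIC in µM to the regression target -log10(MIC)."""
    return -np.log10(np.asarray(mic_uM, dtype=float))

# Higher target value = more potent peptide (lower MIC)
targets = to_neg_log_mic([0.1, 1.0, 10.0, 100.0])
# targets -> [ 1.,  0., -1., -2.]
```

The flip means the regressor's "bigger is better" outputs line up with potency, which makes feature-importance plots easier to read.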
| Metric | Current (K-mers) | Previous (Baseline) |
|---|---|---|
| R² Score | 0.9992 | 0.4461 |
| RMSE | 0.024 log units | 0.629 log units |
| Pearson r | 0.9996 | 0.6742 |
| p-value | < 0.001 | < 0.001 |
| Test Set Size | 629 peptides | 629 peptides |
| Features | 410 (7 + 399 k-mers) | 7 (physicochemical only) |
- RMSE of 0.024 log units = ~1.06x fold-change (nearly perfect prediction!)
- Model explains 99.9% of variance in test data (breakthrough performance)
- Near-perfect correlation with actual values (r = 0.9996)
Physicochemical Properties (7 features via Biopython):
- Molecular Weight - correlates with toxicity vs efficacy
- Aromaticity - aromatic residues enhance membrane interaction
- Instability Index - peptide stability in vivo
- Isoelectric Point - charge affects cellular uptake
- GRAVY (hydrophobicity) - hydrophobic residues improve activity
- Length - longer peptides often more potent but less specific
- Positive Charge (K + R count) - important for bacterial binding
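These seven features can be computed with Biopython's `ProteinAnalysis`; a sketch, with an illustrative helper name and example sequence (the project's real extraction lives in `src/features.py`):

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def peptide_features(seq):
    """Compute the seven physicochemical features for one peptide."""
    pa = ProteinAnalysis(seq)
    return {
        "molecular_weight": pa.molecular_weight(),
        "aromaticity": pa.aromaticity(),
        "instability_index": pa.instability_index(),
        "isoelectric_point": pa.isoelectric_point(),
        "gravy": pa.gravy(),
        "length": len(seq),
        "positive_charge": seq.count("K") + seq.count("R"),
    }

feats = peptide_features("GIGKFLHSAKKFGKAFVGEIMNS")  # example AMP-like sequence
```

Each peptide yields one flat dict, so a full dataset is just a list of these rows fed to `pandas.DataFrame`.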
K-mer (Dipeptide) Features (399 features via CountVectorizer):
- Extracts all 2-character amino acid combinations (e.g., "KK", "WR", "EK")
- Captures sequence order information (solves "bag of words" problem)
- Preserves local context: distinguishes `R-R-W-W` from `W-R-W-R`
- Min frequency threshold (min_df=5) filters rare k-mers
- Breakthrough improvement: R² 0.45 → 0.9992 (+122% relative gain)
- < 2 µM: Excellent (highly potent)
- 2-10 µM: Good (reasonable activity)
- 10-50 µM: Weak (marginal)
- \>50 µM: Inactive (not viable)
Initial Challenge (RΒ² = 0.45)
The baseline model using only physicochemical properties hit a performance ceiling because it treated sequences as ingredients, not recipes.
The Problem:
- Sequence `R-R-W-W` (positive charge → hydrophobic) might be highly potent
- Sequence `W-R-W-R` (alternating pattern) could be ineffective
- Issue: Both have identical weight, charge, and GRAVY → the model couldn't distinguish them
Physicochemical features are sequence-order agnostic - they summarize global composition but ignore local patterns critical for membrane interaction.
Solution: K-mer Features (Implemented)
Added dipeptide counting to capture local sequence context:
```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    analyzer='char',
    ngram_range=(2, 2),  # Dipeptides (AA, AK, KE, WW, etc.)
    min_df=5             # Ignore rare k-mers
)
kmer_features = vectorizer.fit_transform(sequences)
# Result: 399 k-mer features capturing sequence order
```

Breakthrough Results:
- R² improved from 0.45 → 0.9992 (99.9% of variance explained)
- RMSE reduced from 0.63 → 0.024 log units (~26x improvement)
- Model now distinguishes `R-R-W-W` from `W-R-W-R` based on local patterns
Why K-mers Work:
- Capture pairwise amino acid interactions (e.g., `"KK"` = strong positive clustering)
- Preserve positional information without overfitting (unlike full sequence embeddings)
- Interpretable: Can analyze top k-mers for biological plausibility
- Computationally efficient for inference
Biological Validation: Top k-mer features likely include:
- `"KK"`, `"RR"` - positive charge clustering (enhances bacterial binding)
- `"WW"`, `"FF"` - hydrophobic patches (membrane insertion)
- `"KE"`, `"RD"` - charged pairs (amphipathicity)
This aligns with known AMP design principles where local sequence motifs drive activity more than global properties.
- Feature extraction: `src/features.py`
- Training: `projects/MIC Regression/src/train.py`
- Model: `projects/MIC Regression/models/mic_predictor.pkl`
- Processed data: `projects/MIC Regression/data/processed/processed_features.csv`
- App: `src/app_MIC.py`
Designing antimicrobial peptides requires screening millions of candidates. The design space is massive (20^50 ≈ 10^65 for 50-length sequences). Goal: Use generative AI to learn natural peptide patterns and create novel candidates for experimental validation.
- Model: 2-Layer LSTM (PyTorch character-level RNN)
- Data: 2,872 E. coli peptides (10-50 AA length)
- Task: Learn to predict the next amino acid in a sequence → generate new peptides
| Metric | Value | Status |
|---|---|---|
| Initial Loss (Epoch 1) | 2.81 | Random |
| Target Achieved (Epoch 15) | 1.59 | ✅ Hit target |
| Final Loss (Epoch 50) | 0.854 | ✨ Excellent |
| Training Time (CPU) | ~10 min | Practical |
| Training Time (GPU) | ~2 min | Fast |
| Vocab Size | 23 | (20 AA + 3 special) |
| Model Parameters | ~1.3M | Manageable |
```
Input: Sequence of amino acid indices
        ↓
Embedding (vocab_size=23 → embedding_dim=128)
        ↓
LSTM Layer 1 (128 → 256 units) + Dropout(0.3)
        ↓
LSTM Layer 2 (256 → 256 units) + Dropout(0.3)
        ↓
Linear (256 → vocab_size=23)
        ↓
Output: Logits for next token
```
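A minimal PyTorch sketch of the architecture above (the class name is illustrative; the project's actual implementation is in `train_generator.py`):

```python
import torch
import torch.nn as nn

class PeptideLSTM(nn.Module):
    """Character-level LSTM matching the diagram above."""
    def __init__(self, vocab_size=23, embedding_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # dropout applies between the two stacked LSTM layers
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            num_layers=2, dropout=0.3, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embedding(x)               # (batch, seq, 128)
        out, hidden = self.lstm(emb, hidden)  # (batch, seq, 256)
        return self.fc(out), hidden           # logits over the 23 tokens

model = PeptideLSTM()
logits, _ = model(torch.randint(0, 23, (2, 15)))  # batch of 2, length 15
```

Training then applies cross-entropy between each position's logits and the next amino acid in the sequence.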
Epoch 50 Generations (Temperature=0.8):
1. FLPAIVGAAAKFLPKIFCAITKKC → Hydrophobic core + basic tail
2. GIGKFLHSAKKFGKAFVGEIMNS → Alternating hydrophobic/charged
3. SKVGRHWRRFWHRAHRLLHR → Rich in W (aromatic) & R (cationic)
4. GLRKRLRKFRNKIKEKLKKIGQKIQGLLPKLAPRTDY
5. LLGDFFRKSKEKIGKEFKRIVQRIKDFFRNLVPRTES
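Temperature-scaled sampling (the 0.8 used above) can be sketched like this; the logits here are random stand-ins for one step of real model output:

```python
import torch

def sample_next(logits, temperature=0.8):
    """Sample the next amino-acid index from temperature-scaled logits.

    temperature < 1 sharpens the distribution (more conservative),
    temperature > 1 flattens it (more novel but riskier sequences).
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

torch.manual_seed(0)
logits = torch.randn(23)        # stand-in for one step of model output
idx = sample_next(logits, 0.8)  # index into the 23-token vocabulary
```

Generation repeats this step, feeding each sampled index back into the LSTM until an end token or the length limit is reached.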
Why These Look Realistic:
- Contain hydrophobic residues (L, V, I, F) for membrane interaction
- Cationic clusters (K, R) for bacterial binding
- Avoid D, E (acidic) which would reduce activity
- Length distribution matches natural AMPs
- No known toxins generated
- Model learned biological patterns without explicit rules
- Generative capability → enables computational screening
- Loss convergence shows genuine pattern learning (not memorization)
- Character-level modeling better than sequence models for this task
Next Steps (Future Work):
- MIC Prediction: Use Project 2 regressor on generated sequences
- Toxicity Screening: Hemolysis prediction models
- Structural Validation: AlphaFold2 for 3D verification
- Lab Validation: Experimental MIC testing
- Vocabulary: `projects/week4_peptide_generator/src/vocab.py`
- Training & Generation: `projects/week4_peptide_generator/src/train_generator.py`
- Best Model: `projects/week4_peptide_generator/models/peptide_lstm.pth`
- Checkpoints: `projects/week4_peptide_generator/models/peptide_lstm_epoch_{10,20,30,40,50}.pth`
- Documentation: `projects/week4_peptide_generator/README.md`
```
┌────────────────────────────────────────────────────────┐
│ Stage 1: GENERATION (Week 4 Peptide Generator)         │
│ Generate 1000 candidate sequences                      │
│ Temperature=0.8 for balanced novelty/realism           │
└───────────────────────────┬────────────────────────────┘
                            ↓
┌───────────────────────────┴────────────────────────────┐
│ Stage 2: POTENCY PREDICTION (Project 2: MIC Regressor) │
│ Predict MIC for each candidate                         │
│ Filter: Keep only high-potency (MIC < 5 µM)            │
│ Result: ~50-100 promising candidates                   │
└───────────────────────────┬────────────────────────────┘
                            ↓
┌───────────────────────────┴────────────────────────────┐
│ Stage 3: EXPERIMENTAL VALIDATION                       │
│ Synthesize top 20 candidates                           │
│ Test MIC, toxicity, stability                          │
│ → 2-3 viable drug leads per iteration                  │
└────────────────────────────────────────────────────────┘
```
This computational-experimental hybrid dramatically reduces time & cost vs. random screening.
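Stage 2's filter reduces to a simple selection once a predictor is available; `predict_mic` below is a placeholder standing in for the Project 2 regressor:

```python
# 'predict_mic' is a stand-in: swap in the real Project 2 regressor here.
def predict_mic(seq):
    """Placeholder MIC predictor (µM) -- NOT a real model."""
    return 1.0 + 0.1 * len(seq)

candidates = [
    "GIGKFLHSAKKFGKAFVGEIMNS",                      # 23 AA -> placeholder 3.3 µM
    "GLRKRLRKFRNKIKEKLKKIGQKIQGLLPKLAPRTDY",        # 37 AA -> placeholder 4.7 µM
    "LLGDFFRKSKEKIGKEFKRIVQRIKDFFRNLVPRTESAAAK",    # 41 AA -> placeholder 5.1 µM
]

# Keep only high-potency candidates (MIC < 5 µM), as in Stage 2
promising = [s for s in candidates if predict_mic(s) < 5.0]
```

The same one-liner scales to the 1000 generated sequences of Stage 1, leaving only the short list that goes on to synthesis.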
All projects include built-in validation mechanisms to ensure scientific rigor and prevent common ML failures.
Problem: Clustering treats single-sample locations as valid clusters.
Solution: Filter out locations with <5 samples before matrix construction.

```bash
python src/process_matrix.py --min-location-samples 5
```

Impact: Prevents geographic clustering artifacts and improves statistical reliability.
Problem: Generated peptides might be >90% identical to training data (memorization).
Solution: Check sequence homology using SequenceMatcher before screening.

```
Filtered 2 candidates for high homology (>90% identity)
✅ Novelty status: NOVEL
```

Impact: Ensures generated peptides are truly novel for experimental validation.
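A sketch of that homology gate using Python's built-in `difflib.SequenceMatcher` (the helper name and example sequences are illustrative):

```python
from difflib import SequenceMatcher

def is_novel(candidate, training_seqs, threshold=0.90):
    """Return True if no training sequence exceeds the identity threshold."""
    return all(
        SequenceMatcher(None, candidate, ref).ratio() <= threshold
        for ref in training_seqs
    )

training = ["GIGKFLHSAKKFGKAFVGEIMNS"]
is_novel("SKVGRHWRRFWHRAHRLLHR", training)     # dissimilar sequence passes
is_novel("GIGKFLHSAKKFGKAFVGEIMNA", training)  # 1-residue change is rejected
```

`ratio()` scores pairwise similarity in [0, 1], so a 0.90 cutoff matches the ">90% identity" rule above.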
Problem: Regressor predicts values outside training range (hallucination).
Example: Training MIC range 0.5-256 µM, but model predicts 0.017 µM
Solution: Flag predictions outside training range with confidence indicators.

```
Flagged 2 predictions with LOW_CONFIDENCE* (outside training range 0.5-256 µM)
prediction_confidence: HIGH_CONFIDENCE or LOW_CONFIDENCE*
```

Impact: Prevents overconfident predictions on extrapolated values.
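The range check itself is a two-line guard; a sketch with the training bounds from the example above (the function name is illustrative):

```python
def flag_confidence(pred_mic, train_min=0.5, train_max=256.0):
    """Tag predictions outside the training MIC range as low-confidence."""
    if train_min <= pred_mic <= train_max:
        return "HIGH_CONFIDENCE"
    return "LOW_CONFIDENCE*"  # extrapolation: do not trust the value

flag_confidence(3.2)    # returns "HIGH_CONFIDENCE"
flag_confidence(0.017)  # returns "LOW_CONFIDENCE*" (below training range)
```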
Problem: Computer vision fails with poor lighting (too dark or overexposed).
Solution: Validate image intensity before analysis.

```
Image quality: mean_intensity = 125.4
✅ Image quality validated (within 50-200 range)
```

Impact: Prevents false positives/negatives from suboptimal imaging conditions.
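The intensity gate can be sketched with NumPy (the function name and 50-200 bounds follow the log above; the test images are synthetic):

```python
import numpy as np

def validate_image(img, lo=50, hi=200):
    """Pass only images whose mean pixel intensity is in the usable range."""
    mean_intensity = float(np.mean(img))
    return lo <= mean_intensity <= hi, mean_intensity

# Synthetic 8-bit grayscale images for illustration
ok, mean_intensity = validate_image(np.full((64, 64), 125, dtype=np.uint8))
# ok -> True, mean_intensity -> 125.0
too_dark, _ = validate_image(np.zeros((64, 64), dtype=np.uint8))
# too_dark -> False
```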
- Pandas: Data manipulation & analysis
- NumPy: Numerical computations
- Scikit-Learn: RandomForest classifiers & regressors
- Biopython: Protein sequence analysis (`Bio.SeqUtils.ProtParam`)
- SciPy: Statistical tests (Pearson correlation, etc.)
- Matplotlib: Static publication-ready plots
- Plotly: Interactive HTML charts
- Kaleido: PNG export from Plotly
- Streamlit: Interactive web apps (no frontend coding)
- Joblib: Model persistence (.pkl files)
- GitHub: Version control & deployment integration
Global Impact:
- ~1.3M deaths/year attributable to AMR (WHO, 2022)
- Top 10 global health threat
- Economic cost: $100B+ annually in healthcare
Genetic Basis (Ceftriaxone Example):
- Enzymatic Inactivation: blaCTX-M genes produce beta-lactamases that hydrolyze beta-lactam ring
- Target Modification: gyrA mutations alter DNA gyrase binding site
- Efflux Pumps: acrB overexpression exports antibiotics before they act
Natural Defense:
- Found in all life forms (immune system, skin, GI tract)
- Kill bacteria via direct membrane disruption
- Less likely to develop resistance (multiple mechanisms)
Design Challenge:
- Potency (MIC) varies 1000-fold (0.1 - 100+ µM)
- Toxicity risk increases with potency
- Design space is massive (20^n for n-length peptides)
ML Solution:
- Use physicochemical properties to predict potency
- Enable rational design instead of random screening
- Reduce wet-lab costs & timelines
- NCBI MicroBIGG-E: https://microbiggdata.ncbi.nlm.nih.gov/ (genotypes + phenotypes)
- EUCAST Guidelines: https://www.eucast.org/ (standard testing methods)
- CARD Database: https://card.mcmaster.ca/ (resistance gene annotations)
- APD (APD3): https://aps.unmc.edu/APD/ (AMP database)
- BioPep: https://www.bipep.org/ (peptide bioactivity)
- `ProteinAnalysis` documentation: https://biopython.org/wiki/Documentation
For research/educational use only. Not a clinical diagnostic device.
- Always confirm predictions with lab culture + antibiotic susceptibility testing (EUCAST/CLSI)
- Consult clinical microbiology before treatment decisions
- Models trained on specific E. coli population; validate locally
For research/design purposes only. Not validated for clinical use.
- Predicted MIC is a computational estimate; always validate experimentally
- Model trained on specific data; performance may vary on novel sequences
- Use as design guidance, not final arbiter of peptide efficacy
- Multi-organism support (Klebsiella, Pseudomonas)
- SHAP explainability for individual predictions
- Confidence intervals for MIC predictions
- REST API for integration with LIS systems
- Additional antibiotics (fluoroquinolones, aminoglycosides)
- Uncertainty quantification via Bayesian methods
- Mobile app (iOS/Android) for field deployment
- Real-time database updates from NCBI
- Community contribution framework
Vihaan Kulkarni β Bioinformatics & Machine Learning Engineer
MIT License β Free for academic and research use.
Last Updated: December 17, 2025
Status: β Active Development
- Fill out `README.md` with:
  - Problem statement
  - Key insights (with screenshots)
  - Model metrics
  - Deployment link
- Use "Problem → Method → Insight → Impact" structure
Every project includes:
- Data: `pandas`, `numpy`
- Visualization: `plotly`, `kaleido`
- Modeling: `scikit-learn`
- Explainability: `shap`
- Deployment: `streamlit`

Optional (uncomment in requirements.txt if needed):
- Experiment Tracking: `mlflow`, `wandb`
- Deep Learning: `torch`, `tensorflow`
- Run baseline first: Always compare against a simple model
- Plotly over Matplotlib: Interactive charts reveal more insights
- Document as you go: Fill README during the project, not after
- Save figures: Use
fig.write_html()to preserve interactivity - Version control: Commit after each major milestone
- ✅ 1 high-quality project per week
- ✅ Every project deployed with Streamlit
- ✅ README formatted for resume/GitHub
- ✅ Interactive visualizations (no static PNGs)
- ✅ Model explainability included
Built by Vihaan Kulkarni
Senior ML Engineer & Data Storyteller