vihaankulkarni29/ML-Training

🧬 Bioinformatics ML Repository

End-to-End Machine Learning Suite for Antimicrobial Resistance & Drug Discovery

A comprehensive collection of production-ready ML projects tackling critical challenges in infectious disease and drug development, from resistance prediction to generative drug design and automated diagnostics.

📋 Projects

1. 🛡️ Ceftriaxone Resistance Predictor

Classification Model for Antibiotic Resistance Detection

  • Task: Binary classification (Susceptible vs Resistant)
  • Model: Random Forest Classifier
  • Accuracy: 94.9% | Sensitivity: 93.9% | Specificity: 95.9%
  • Data: 4,383 E. coli isolates from NCBI
  • App: streamlit run src/app.py

2. 💊 AI Peptide Dosing Calculator

Regression Model for Antimicrobial Peptide Potency Prediction

  • Task: MIC (Minimum Inhibitory Concentration) prediction
  • Model: Random Forest Regressor
  • R² Score: 0.9992 | RMSE: 0.024 log units
  • Data: 3,143 peptides with E. coli MIC values
  • App: streamlit run src/app_MIC.py

3. 🧬 Week 4: Peptide Sequence Generator

Generative AI for Antimicrobial Peptide Design

  • Task: Generate novel peptide sequences (generative modeling)
  • Model: 2-Layer LSTM (PyTorch) - Character-level RNN
  • Performance: Loss 0.8541 | Generates realistic AMP sequences
  • Data: 2,872 E. coli peptides (10-50 AA length)
  • Training: ~10 min CPU / ~2 min GPU | 50 epochs
  • Status: ✅ Fully trained, ready for inference
  • Use: Computational screening, rational design, drug discovery

4. 🧠 DeepG2P - Deep Resistance Predictor ⭐ NEW

1D ResNet for Multi-label Antimicrobial Resistance Prediction from Mass Spectrometry

  • Task: Multi-label classification (10 antibiotics)
  • Model: ResNet-1D (2M parameters) - Deep CNN with residual blocks
  • Architecture: Conv1D → 4 ResBlock stages → Global AvgPool → FC → Sigmoid
  • Input: MALDI-TOF mass spectra (6000 m/z bins)
  • Loss: BCEWithLogitsLoss with pos_weight (handles class imbalance)
  • Optimizer: AdamW (lr=1e-4, weight_decay=1e-5)
  • Metrics: AUPRC, AUROC tracked via TensorBoard
  • Training: 20 epochs with automatic best model checkpointing
  • Features: Flexible model sizes (small/medium/large), feature extraction
  • Documentation: See src/README.md for detailed architecture
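
The architecture and loss listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the code in src/model.py: the class names `ResidualBlock1D` and `TinyResNet1D`, the channel widths, and the kernel sizes are assumptions chosen to keep the example small. The sketch returns raw logits because `BCEWithLogitsLoss` applies the sigmoid internally; the sigmoid is applied only when probabilities are needed.

```python
import torch
import torch.nn as nn


class ResidualBlock1D(nn.Module):
    """Two Conv1d layers plus a skip connection (illustrative layout)."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm1d(out_ch)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm1d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Project the shortcut when channels or length change so the sum is valid.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm1d(out_ch),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))


class TinyResNet1D(nn.Module):
    """Conv1D stem -> 4 residual stages -> global average pool -> FC (logits)."""

    def __init__(self, n_labels: int = 10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm1d(16),
            nn.ReLU(inplace=True),
        )
        self.stages = nn.Sequential(
            ResidualBlock1D(16, 16),
            ResidualBlock1D(16, 32, stride=2),
            ResidualBlock1D(32, 64, stride=2),
            ResidualBlock1D(64, 128, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(128, n_labels)

    def forward(self, x):
        x = self.stages(self.stem(x))
        return self.fc(self.pool(x).squeeze(-1))


torch.manual_seed(0)
x = torch.randn(8, 1, 6000)                # 8 synthetic spectra, 6000 m/z bins
y = (torch.rand(8, 10) < 0.2).float()      # synthetic multi-label targets

# One common choice for pos_weight: negatives / positives per label.
pos = y.sum(dim=0)
pos_weight = (y.shape[0] - pos) / pos.clamp(min=1.0)

model = TinyResNet1D(n_labels=10)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)

logits = model(x)
loss = criterion(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

probs = torch.sigmoid(logits.detach())     # per-antibiotic probabilities
print(logits.shape, float(loss))
```

The negatives-to-positives ratio used for `pos_weight` is an assumption; the repository may weight classes differently.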

🏥 Biological Context

Antimicrobial Resistance (AMR)

Challenge: Antibiotic-resistant bacteria cause ~1.3M deaths annually (WHO). Traditional lab testing takes 24-48 hours, delaying treatment.

Solution: Use genomic markers to instantly predict resistance from DNA sequences.

Antimicrobial Peptides (AMPs)

Challenge: Designing potent peptides requires expensive lab screening. Potency varies wildly (MIC: 0.1 - 1000+ µM).

Solution: Use machine learning to predict peptide efficacy and generate new candidates from physicochemical properties and sequence patterns.

Peptide Generation

Challenge: Design space for peptides is massive (20^50 ≈ 10^65 possibilities for 50-length sequences). Manual screening is infeasible.

Solution: Train generative AI to learn natural peptide patterns and create novel, biologically plausible sequences for experimental validation.

Deep Learning for Mass Spectrometry (NEW)

Challenge: MALDI-TOF mass spectrometry is fast (minutes) but requires expert interpretation. Multi-drug resistance requires testing 10+ antibiotics.

Solution: Train deep neural networks to directly predict resistance profiles from raw mass spectra, enabling instant multi-drug diagnostics.


Repository Structure

ML-Training/
├── projects/
│   ├── cefixime-resistance-training/     # Antibiotic resistance classifier
│   │   ├── data/
│   │   │   ├── raw/                      # Original NCBI isolates
│   │   │   └── processed/                # Cleaned genotype data
│   │   ├── src/
│   │   │   ├── process.py                # Data preprocessing
│   │   │   └── train.py                  # Model training (RF classifier)
│   │   ├── models/
│   │   │   └── ceftriaxone_model.pkl     # Trained classifier
│   │   └── results/
│   │       ├── confusion_matrix.html     # Interactive CM
│   │       └── feature_importance.csv    # Top resistance genes
│   │
│   ├── MIC Regression/                   # Peptide potency regressor
│   │   ├── data/
│   │   │   ├── raw/                      # Raw peptide sequences & MIC values
│   │   │   └── processed/                # Computed physicochemical features
│   │   ├── src/
│   │   │   ├── process.py                # Data preprocessing
│   │   │   └── train.py                  # Model training (RF regressor)
│   │   ├── models/
│   │   │   └── mic_predictor.pkl         # Trained regressor
│   │   └── results/
│   │       ├── predicted_vs_actual.png   # Predictions visualization
│   │       └── feature_importance.png    # Top peptide features
│   │
│   └── week4_peptide_generator/          # Generative LSTM
│       ├── data/
│       │   └── ecolitraining_set_80.csv  # 2,872 E. coli peptides
│       ├── models/
│       │   ├── peptide_lstm.pth          # Best model (loss: 0.854)
│       │   └── config.json               # Training hyperparameters
│       ├── src/
│       │   ├── vocab.py                  # PeptideVocab: AA tokenization
│       │   └── train_generator.py        # PyTorch LSTM training
│       └── README.md
│
├── src/                                  # 🆕 DeepG2P Model & Apps
│   ├── model.py                          # ResNet-1D architecture (DeepG2P, ResidualBlock)
│   ├── train.py                          # Training pipeline (BCEWithLogitsLoss, AdamW)
│   ├── app.py                            # Ceftriaxone classifier Streamlit app
│   ├── app_MIC.py                        # MIC regressor Streamlit app
│   ├── features.py                       # Biopython feature extraction
│   └── README.md                         # DeepG2P documentation
│
├── models/                               # 🆕 Saved model checkpoints
│   ├── best_model.pth                    # Best validation loss checkpoint
│   └── checkpoint_epoch_*.pth            # Periodic training checkpoints
│
├── results/                              # 🆕 Training outputs
│   ├── logs/                             # TensorBoard logs
│   └── training_config.json              # Hyperparameters & metadata
│
├── utils/
│   └── model_evaluation.py               # Shared evaluation metrics
│
├── requirements.txt                      # Python dependencies (PyTorch, sklearn, etc.)
└── README.md                             # This file

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • Git

Installation

# Clone repository
git clone https://github.com/vihaankulkarni29/ML-Training
cd ML-Training

# Install dependencies
pip install -r requirements.txt

Run Applications

Ceftriaxone Resistance Predictor (Classifier):

streamlit run src/app.py

Access at http://localhost:8501

AI Peptide Dosing Calculator (Regressor):

streamlit run src/app_MIC.py

Access at http://localhost:8501

DeepG2P Model Training:

# Train with default parameters
python src/train.py

# Custom training
python src/train.py \
  --train-features data/processed/X_train.npy \
  --train-labels data/processed/y_train.npy \
  --val-features data/processed/X_val.npy \
  --val-labels data/processed/y_val.npy \
  --epochs 20 \
  --batch-size 32 \
  --model-size medium

# Monitor training
tensorboard --logdir results/logs

📊 Project 1: Ceftriaxone Resistance Predictor

Problem Statement

Antibiotic susceptibility testing via culture takes 24-48 hours. Patients with life-threatening infections can't wait. Goal: Predict Ceftriaxone resistance instantly from genomic markers.

Solution

  • Model: Random Forest Classifier (100 trees, balanced class weights)
  • Data: 4,383 E. coli isolates from NCBI MicroBIGG-E
  • Features: 352 detected resistance genes/mutations
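
A minimal sketch of the classifier configuration described above, using scikit-learn on synthetic gene presence/absence data. The data, the 20-column feature matrix, and the rule that column 0 decides resistance are all invented for illustration; only `n_estimators=100` and `class_weight="balanced"` come from the description. The 20% test split is an assumption that happens to match 876 of 4,383 isolates.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the genotype matrix: 1 = gene/mutation detected.
X = rng.integers(0, 2, size=(500, 20))
# Toy rule (assumption): the marker in column 0 alone determines resistance.
y = X[:, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=100,         # 100 trees
    class_weight="balanced",  # reweight classes inversely to their frequency
    random_state=0,
)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
top_feature = int(np.argmax(clf.feature_importances_))
print(f"accuracy={accuracy:.3f}, most important column={top_feature}")
```

On real data the importance ranking is what surfaces markers such as those listed under Key Insights.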

Performance Metrics

| Metric | Value |
|---|---|
| Accuracy | 94.9% |
| Sensitivity | 93.9% |
| Specificity | 95.9% |
| ROC-AUC | 0.978 |
| Test Set Size | 876 isolates |
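
For reference, the three rate metrics in the table are derived from confusion-matrix counts as sketched below. The counts in the usage lines are hypothetical and are not the repository's confusion matrix.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion-matrix counts.

    Here "positive" means resistant and "negative" means susceptible.
    """
    return {
        "sensitivity": tp / (tp + fn),   # resistant isolates correctly flagged
        "specificity": tn / (tn + fp),   # susceptible isolates correctly cleared
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=90, fn=10, tn=95, fp=5)
print(metrics)
```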

Key Insights

The model independently discovered known resistance mechanisms:

  • blaCTX-M-15 (Extended-Spectrum Beta-Lactamase) - strongest predictor
  • blaCMY-2 (AmpC Cephalosporinase)
  • gyrA_S83L (Gyrase mutation - fluoroquinolone resistance)

Biological Mechanism

Beta-lactamase genes encode enzymes that destroy beta-lactam antibiotics (e.g., cephalosporins) before they can bind their targets, the penicillin-binding proteins that build the bacterial cell wall.

Files

  • Training: projects/cefixime-resistance-training/src/train.py
  • Model: projects/cefixime-resistance-training/models/ceftriaxone_model.pkl
  • App: src/app.py

💊 Project 2: AI Peptide Dosing Calculator

Problem Statement

Antimicrobial peptide (AMP) design is expensive and slow. Wet-lab screening for potency (MIC) takes months. Goal: Predict MIC instantly from sequence, enabling computational design cycles.

Solution

  • Model: Random Forest Regressor (100 trees)
  • Data: 3,143 peptides with E. coli MIC values (NCBI)
  • Target: neg_log_mic_microM (-log10 of MIC in µM)
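
The target transform can be sketched as below. The helper names `mic_to_target` and `target_to_mic` are hypothetical; only the definition (-log10 of MIC in µM) comes from the description above. Higher target values mean more potent peptides.

```python
import math


def mic_to_target(mic_um: float) -> float:
    """Convert an MIC in µM to the regression target, -log10(MIC)."""
    if mic_um <= 0:
        raise ValueError("MIC must be positive")
    return -math.log10(mic_um)


def target_to_mic(target: float) -> float:
    """Invert the transform to report a prediction back in µM."""
    return 10 ** (-target)


print(mic_to_target(10.0))   # weaker peptide, lower target
print(mic_to_target(0.1))    # stronger peptide, higher target
print(target_to_mic(mic_to_target(4.0)))
```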

Performance Metrics

| Metric | Current (K-mers) | Previous (Baseline) |
|---|---|---|
| R² Score | 0.9992 | 0.4461 |
| RMSE | 0.024 log units | 0.629 log units |
| Pearson r | 0.9996 | 0.6742 |
| p-value | < 0.001 | < 0.001 |
| Test Set Size | 629 peptides | 629 peptides |
| Features | 406 (7 + 399 k-mers) | 7 (physicochemical only) |

Interpretation

  • RMSE of 0.024 log units = ~1.06x fold-change (nearly perfect prediction!)
  • Model explains 99.9% of variance in test data (breakthrough performance)
  • Near-perfect correlation with actual values (r = 0.9996)
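
The fold-change reading of RMSE follows from the log10 target: an error of e log units corresponds to a multiplicative factor of 10^e in µM. A short check of the arithmetic, using the two RMSE values from the table (the function name is illustrative):

```python
def fold_error(rmse_log10: float) -> float:
    """Typical multiplicative error in µM implied by an RMSE in log10 units."""
    return 10 ** rmse_log10


current = fold_error(0.024)    # k-mer model
baseline = fold_error(0.629)   # physicochemical-only baseline
print(f"current: ~{current:.2f}x, baseline: ~{baseline:.2f}x")
```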

Feature Engineering

Physicochemical Properties (7 features via Biopython):

  1. Molecular Weight - correlates with toxicity vs efficacy
  2. Aromaticity - aromatic residues enhance membrane interaction
  3. Instability Index - peptide stability in vivo
  4. Isoelectric Point - charge affects cellular uptake
  5. GRAVY (hydrophobicity) - hydrophobic residues improve activity
  6. Length - longer peptides often more potent but less specific
  7. Positive Charge - (K + R count) - important for bacterial binding

K-mer (Dipeptide) Features (399 features via CountVectorizer):

  • Extracts all 2-character amino acid combinations (e.g., "KK", "WR", "EK")
  • Captures sequence order information (solves "bag of words" problem)
  • Preserves local context: distinguishes R-R-W-W from W-R-W-R
  • Min frequency threshold (min_df=5) filters rare k-mers
  • Breakthrough improvement: R² 0.45 → 0.9992 (+122% relative gain)

Potency Categories

  • < 2 µM: 💎 Excellent (highly potent)
  • 2-10 µM: ✅ Good (reasonable activity)
  • 10-50 µM: ⚠️ Weak (marginal)
  • > 50 µM: ❌ Inactive (not viable)
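
A small sketch of how a predicted MIC could be mapped onto these categories. The function name and the handling of the exact boundary values (2, 10 and 50 µM) are assumptions, since the list does not say which side each boundary belongs to.

```python
def potency_category(mic_um: float) -> str:
    """Map an MIC in µM to a potency label (boundary handling is assumed)."""
    if mic_um < 0:
        raise ValueError("MIC cannot be negative")
    if mic_um < 2:
        return "Excellent"
    if mic_um < 10:
        return "Good"
    if mic_um <= 50:
        return "Weak"
    return "Inactive"


labels = [potency_category(m) for m in (0.5, 4.0, 25.0, 120.0)]
print(labels)
```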

Model Evolution: Solving the "Bag of Words" Problem

Initial Challenge (R² = 0.45)

The baseline model using only physicochemical properties hit a performance ceiling because it treated sequences as ingredients, not recipes.

The Problem:

  • Sequence R-R-W-W (positive charge → hydrophobic) might be highly potent
  • Sequence W-R-W-R (alternating pattern) could be ineffective
  • Issue: Both have identical weight, charge, GRAVY → model couldn't distinguish them

Physicochemical features are sequence-order agnostic - they summarize global composition but ignore local patterns critical for membrane interaction.
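
The point can be checked directly with a few composition-based features computed in plain Python. This is an illustrative subset, not src/features.py (which uses Biopython): length, K + R count, aromatic fraction (F, W, Y) and GRAVY on the Kyte-Doolittle scale. Because every feature depends only on residue counts, the two sequences are indistinguishable.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values, used for GRAVY.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition_features(seq: str) -> dict:
    """Features that depend only on which residues occur, not on their order."""
    counts = Counter(seq)
    hydropathy_sum = sum(KYTE_DOOLITTLE[aa] * n for aa, n in sorted(counts.items()))
    return {
        "length": len(seq),
        "positive_charge": counts["K"] + counts["R"],
        "aromaticity": (counts["F"] + counts["W"] + counts["Y"]) / len(seq),
        "gravy": hydropathy_sum / len(seq),
    }


rrww = composition_features("RRWW")
wrwr = composition_features("WRWR")
print(rrww)
print("identical features:", rrww == wrwr)
```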

Solution: K-mer Features (Implemented)

Added dipeptide counting to capture local sequence context:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    analyzer='char',
    ngram_range=(2, 2),  # Dipeptides (AA, AK, KE, WW, etc.)
    min_df=5              # Ignore rare k-mers
)
kmer_features = vectorizer.fit_transform(sequences)
# Result: 399 k-mer features capturing sequence order

Breakthrough Results:

  • R² improved from 0.45 → 0.9992 (99.9% variance explained)
  • RMSE reduced from 0.63 → 0.024 log units (~26x improvement)
  • Model now distinguishes R-R-W-W from W-R-W-R based on local patterns

Why K-mers Work:

  • Capture pairwise amino acid interactions (e.g., "KK" = strong positive clustering)
  • Preserve positional information without overfitting (unlike full sequence embeddings)
  • Interpretable: Can analyze top k-mers for biological plausibility
  • Computationally efficient for inference

Biological Validation: Top k-mer features likely include:

  • "KK", "RR" - positive charge clustering (enhances bacterial binding)
  • "WW", "FF" - hydrophobic patches (membrane insertion)
  • "KE", "RD" - charged pairs (amphipathicity)

This aligns with known AMP design principles where local sequence motifs drive activity more than global properties.

Files

  • Feature extraction: src/features.py
  • Training: projects/MIC Regression/src/train.py
  • Model: projects/MIC Regression/models/mic_predictor.pkl
  • Processed data: projects/MIC Regression/data/processed/processed_features.csv
  • App: src/app_MIC.py

🧬 Project 3: Week 4 Peptide Sequence Generator ⭐ NEW

Problem Statement

Designing antimicrobial peptides requires screening millions of candidates. The design space is massive (20^50 ≈ 10^65 for 50-length sequences). Goal: Use generative AI to learn natural peptide patterns and create novel candidates for experimental validation.
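
The size of the design space is quick to verify: with 20 amino acids, a length-n peptide has 20^n variants, so the exponent in base 10 is n·log10(20). The sketch below also sums over the 10-50 residue range used for training; the function name is illustrative.

```python
import math


def log10_design_space(length: int, alphabet_size: int = 20) -> float:
    """Base-10 exponent of the number of sequences of a given length."""
    return length * math.log10(alphabet_size)


exponent_50 = log10_design_space(50)

# Exact count over all lengths from 10 to 50 (Python integers do not overflow).
total_10_to_50 = sum(20 ** n for n in range(10, 51))
order_of_magnitude = len(str(total_10_to_50)) - 1

print(f"20^50 is about 10^{exponent_50:.2f}")
print(f"lengths 10-50 combined: about 10^{order_of_magnitude}")
```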

Solution

  • Model: 2-Layer LSTM (PyTorch character-level RNN)
  • Data: 2,872 E. coli peptides (10-50 AA length)
  • Task: Learn to predict next amino acid in sequence → generate new peptides

Training Results

| Metric | Value | Status |
|---|---|---|
| Initial Loss (Epoch 1) | 2.81 | Random |
| Target Achieved (Epoch 15) | 1.59 | ✅ Hit target |
| Final Loss (Epoch 50) | 0.854 | ✨ Excellent |
| Training Time (CPU) | ~10 min | Practical |
| Training Time (GPU) | ~2 min | Fast |
| Vocab Size | 23 (20 AA + 3 special) | |
| Model Parameters | ~1.3M | Manageable |

Architecture

Input: Sequence of amino acid indices
    ↓
Embedding (vocab_size=23 → embedding_dim=128)
    ↓
LSTM Layer 1 (128 → 256 units) + Dropout(0.3)
    ↓
LSTM Layer 2 (256 → 256 units) + Dropout(0.3)
    ↓
Linear (256 → vocab_size=23)
    ↓
Output: Logits for next token
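
The diagram above maps onto PyTorch roughly as follows. This is a sketch under assumptions, not the code in train_generator.py: the class name `PeptideLSTM` is hypothetical, and because `nn.LSTM`'s `dropout` argument only acts between stacked layers, an explicit `nn.Dropout` is added after the last layer to mirror the second Dropout(0.3) in the diagram. The training step uses the standard next-token setup, where the input is the sequence without its last token and the target is the sequence shifted by one.

```python
import torch
import torch.nn as nn


class PeptideLSTM(nn.Module):
    def __init__(self, vocab_size=23, embedding_dim=128, hidden_dim=256,
                 num_layers=2, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                            dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        embedded = self.embedding(tokens)            # (batch, seq, 128)
        output, state = self.lstm(embedded, state)   # (batch, seq, 256)
        return self.fc(self.dropout(output)), state  # logits: (batch, seq, 23)


torch.manual_seed(0)
model = PeptideLSTM()

# Synthetic batch of token indices standing in for encoded peptides.
batch = torch.randint(0, 23, (4, 12))
inputs, targets = batch[:, :-1], batch[:, 1:]        # predict token t+1 from tokens <= t

logits, _ = model(inputs)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 23), targets.reshape(-1))
loss.backward()
print(logits.shape, float(loss))
```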

Sample Generated Sequences

Epoch 50 Generations (Temperature=0.8):

1. FLPAIVGAAAKFLPKIFCAITKKC     ← Hydrophobic core + basic tail
2. GIGKFLHSAKKFGKAFVGEIMNS      ← Alternating hydrophobic/charged
3. SKVGRHWRRFWHRAHRLLHR         ← Rich in W (aromatic) & R (cationic)
4. GLRKRLRKFRNKIKEKLKKIGQKIQGLLPKLAPRTDY
5. LLGDFFRKSKEKIGKEFKRIVQRIKDFFRNLVPRTES
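
Temperature controls how adventurous the sampling is: logits are divided by the temperature before the softmax, so values below 1.0 sharpen the distribution toward the most likely residue and values above 1.0 flatten it. A dependency-free sketch (function names and the toy logits are illustrative, not taken from train_generator.py):

```python
import math
import random


def temperature_softmax(logits, temperature=0.8):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=0.8, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]
p_cool = temperature_softmax(toy_logits, 0.8)
p_neutral = temperature_softmax(toy_logits, 1.0)
token = sample_token(toy_logits, 0.8, rng=random.Random(0))

print([round(p, 3) for p in p_cool])
print([round(p, 3) for p in p_neutral])
```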

Why These Look Realistic:

  • Contain hydrophobic residues (L, V, I, F) for membrane interaction
  • Cationic clusters (K, R) for bacterial binding
  • Relatively few acidic residues (D, E), which tend to lower antimicrobial activity
  • Length distribution matches natural AMPs
  • No known toxins generated

Key Insights

  1. Model learned biological patterns without explicit rules
  2. Generative capability β†’ enables computational screening
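  3. Steadily decreasing loss suggests pattern learning; memorization is checked separately by the homology filter described under Scientific Validation Framework
  4. Character-level modeling (one token per amino acid) fits this task well, since the vocabulary is only 23 tokens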

Biological Potential

Next Steps (Future Work):

  • ✅ MIC Prediction: Use Project 2 regressor on generated sequences
  • ✅ Toxicity Screening: Hemolysis prediction models
  • ✅ Structural Validation: AlphaFold2 for 3D verification
  • ✅ Lab Validation: Experimental MIC testing

Files

  • Vocabulary: projects/week4_peptide_generator/src/vocab.py
  • Training & Generation: projects/week4_peptide_generator/src/train_generator.py
  • Best Model: projects/week4_peptide_generator/models/peptide_lstm.pth
  • Checkpoints: projects/week4_peptide_generator/models/peptide_lstm_epoch_{10,20,30,40,50}.pth
  • Documentation: projects/week4_peptide_generator/README.md

Use Case: Multi-Stage Screening Pipeline

┌────────────────────────────────────────────────────────────┐
│ Stage 1: GENERATION (Week 4 Peptide Generator)             │
│ Generate 1000 candidate sequences                          │
│ Temperature=0.8 for balanced novelty/realism               │
└─────────────────────────┬──────────────────────────────────┘
                          │
┌─────────────────────────▼──────────────────────────────────┐
│ Stage 2: POTENCY PREDICTION (Project 2: MIC Regressor)     │
│ Predict MIC for each candidate                             │
│ Filter: Keep only high-potency (MIC < 5 µM)                │
│ Result: ~50-100 promising candidates                       │
└─────────────────────────┬──────────────────────────────────┘
                          │
┌─────────────────────────▼──────────────────────────────────┐
│ Stage 3: EXPERIMENTAL VALIDATION                           │
│ Synthesize top 20 candidates                               │
│ Test MIC, toxicity, stability                              │
│ → 2-3 viable drug leads per iteration                      │
└────────────────────────────────────────────────────────────┘

This computational-experimental hybrid dramatically reduces time & cost vs. random screening.
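
The filtering logic of Stages 2 and 3 can be sketched as a small function. Everything here is hypothetical glue code: `screen` is not a function in this repository, and the lookup table of scores is made up to stand in for the MIC regressor. Only the 5 µM cutoff and the top-20 selection come from the diagram.

```python
from typing import Callable, Iterable, List, Tuple


def screen(candidates: Iterable[str],
           predict_mic: Callable[[str], float],
           mic_cutoff_um: float = 5.0,
           top_n: int = 20) -> List[Tuple[str, float]]:
    """Score candidates, keep those under the cutoff, return the most potent."""
    unique = list(dict.fromkeys(candidates))             # drop duplicates, keep order
    scored = [(seq, predict_mic(seq)) for seq in unique]
    kept = [(seq, mic) for seq, mic in scored if mic < mic_cutoff_um]
    return sorted(kept, key=lambda pair: pair[1])[:top_n]


# Made-up predictions standing in for the trained regressor.
toy_scores = {"AAAA": 12.0, "KKLL": 3.0, "RRWW": 1.5}
shortlist = screen(["AAAA", "KKLL", "RRWW", "KKLL"], toy_scores.__getitem__)
print(shortlist)
```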


🔬 Scientific Validation Framework

All projects include built-in validation mechanisms to ensure scientific rigor and prevent common ML failures.

1️⃣ Sparse Data Bias Mitigation (Week 3: Outbreak Detective)

Problem: Clustering treats single-sample locations as valid clusters.

Solution: Filter locations with <5 samples before matrix construction.

python src/process_matrix.py --min-location-samples 5

Impact: Prevents geographic clustering artifacts, improves statistical reliability.
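
The filtering rule itself is simple to express. A minimal sketch, assuming records are (location, sample_id) pairs; the function name and record layout are illustrative and not taken from src/process_matrix.py:

```python
from collections import Counter


def filter_sparse_locations(records, min_samples=5):
    """Keep only records whose location has at least `min_samples` samples."""
    per_location = Counter(location for location, _ in records)
    return [(loc, sid) for loc, sid in records if per_location[loc] >= min_samples]


# Toy data: one well-sampled location and one single-sample location.
records = [("Mumbai", f"S{i}") for i in range(5)] + [("Pune", "S99")]
kept = filter_sparse_locations(records, min_samples=5)
print(len(kept), {loc for loc, _ in kept})
```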

2️⃣ Plagiarism Detection (Week 4: Peptide Generator)

Problem: Generated peptides might be >90% identical to training data (memorization).

Solution: Check sequence homology using SequenceMatcher before screening.

Filtered 2 candidates for high homology (>90% identity)
✓ Novelty status: NOVEL

Impact: Ensures generated peptides are truly novel for experimental validation.
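
A minimal version of such a check with the standard library's `difflib.SequenceMatcher` is sketched below. Note that `ratio()` is a similarity score based on matching blocks, not an alignment-based percent identity, so treating it as "identity" is an approximation. The function names and the toy sequences are illustrative.

```python
from difflib import SequenceMatcher


def max_similarity(candidate, training_set):
    """Highest SequenceMatcher ratio between a candidate and any training peptide."""
    return max(
        SequenceMatcher(None, candidate, reference, autojunk=False).ratio()
        for reference in training_set
    )


def novelty_filter(candidates, training_set, threshold=0.90):
    """Split candidates into novel ones and near-copies of training data."""
    novel, rejected = [], []
    for candidate in candidates:
        if max_similarity(candidate, training_set) > threshold:
            rejected.append(candidate)
        else:
            novel.append(candidate)
    return novel, rejected


# Toy data: the first candidate is an exact copy of a training sequence.
training_set = ["KKLLKKLLKKLL"]
novel, rejected = novelty_filter(["KKLLKKLLKKLL", "GIGAVLKVLTTG"], training_set)
print("novel:", novel, "rejected:", rejected)
```

The pairwise comparison is quadratic in the number of sequences, which is fine for a few thousand training peptides but would need indexing for much larger sets.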

3️⃣ Extrapolation Detection (Week 2 & 4: MIC Prediction)

Problem: Regressor predicts values outside training range (hallucination).

Example: Training MIC range 0.5-256 µM, but model predicts 0.017 µM

Solution: Flag predictions outside training range with confidence indicators.

Flagged 2 predictions with LOW_CONFIDENCE* (outside training range 0.5-256 µM)
prediction_confidence: HIGH_CONFIDENCE or LOW_CONFIDENCE*

Impact: Prevents overconfident predictions on extrapolated values.
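
The flagging rule amounts to a range check against the training data. A sketch using the example range and labels shown above (the function name is hypothetical, and in practice the bounds would be read from the training set rather than hard-coded):

```python
def confidence_flag(predicted_mic_um, train_min_um=0.5, train_max_um=256.0):
    """Label a prediction by whether it falls inside the training MIC range."""
    if train_min_um <= predicted_mic_um <= train_max_um:
        return "HIGH_CONFIDENCE"
    return "LOW_CONFIDENCE*"


flags = {mic: confidence_flag(mic) for mic in (0.017, 8.0, 512.0)}
print(flags)
```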

4️⃣ Image Quality Gating (Week 5: Auto AST)

Problem: Computer vision fails with poor lighting (too dark or overexposed).

Solution: Validate image intensity before analysis.

Image quality: mean_intensity = 125.4
✓ Image quality validated (within 50-200 range)

Impact: Prevents false positives/negatives from suboptimal imaging conditions.
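
A minimal sketch of the intensity gate with NumPy, assuming an 8-bit grayscale image stored as a 2D array. The function name is illustrative; the 50-200 window is the one shown in the log above.

```python
import numpy as np


def image_quality_ok(gray_image, low=50.0, high=200.0):
    """Return (passes, mean_intensity) for an 8-bit grayscale image."""
    mean_intensity = float(np.mean(gray_image))
    return low <= mean_intensity <= high, mean_intensity


# Synthetic images: well exposed, too dark, overexposed.
well_exposed = np.full((4, 4), 125, dtype=np.uint8)
too_dark = np.zeros((4, 4), dtype=np.uint8)
overexposed = np.full((4, 4), 250, dtype=np.uint8)

for name, img in [("ok", well_exposed), ("dark", too_dark), ("bright", overexposed)]:
    passed, mean_value = image_quality_ok(img)
    print(f"{name}: mean_intensity={mean_value:.1f}, passed={passed}")
```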


🔬 Technical Stack

Data Science

  • Pandas: Data manipulation & analysis
  • NumPy: Numerical computations
  • Scikit-Learn: RandomForest classifiers & regressors
  • Biopython: Protein sequence analysis (Bio.SeqUtils.ProtParam)
  • SciPy: Statistical tests (Pearson correlation, etc.)

Visualization

  • Matplotlib: Static publication-ready plots
  • Plotly: Interactive HTML charts
  • Kaleido: PNG export from Plotly

Deployment

  • Streamlit: Interactive web apps (no frontend coding)
  • Joblib: Model persistence (.pkl files)
  • GitHub: Version control & deployment integration

🏥 Biological Background

Antimicrobial Resistance (AMR)

Global Impact:

  • ~1.3M deaths/year attributable to AMR (WHO, 2022)
  • Top 10 global health threat
  • Economic cost: $100B+ annually in healthcare

Genetic Basis (Ceftriaxone Example):

  1. Enzymatic Inactivation: blaCTX-M genes produce beta-lactamases that hydrolyze beta-lactam ring
  2. Target Modification: gyrA mutations alter DNA gyrase binding site
  3. Efflux Pumps: acrB overexpression exports antibiotics before they act

Antimicrobial Peptides (AMPs)

Natural Defense:

  • Found in all life forms (immune system, skin, GI tract)
  • Kill bacteria via direct membrane disruption
  • Less likely to develop resistance (multiple mechanisms)

Design Challenge:

  • Potency (MIC) varies 1000-fold (0.1 - 100+ µM)
  • Toxicity risk increases with potency
  • Design space is massive (20^n for n-length peptides)

ML Solution:

  • Use physicochemical properties to predict potency
  • Enable rational design instead of random screening
  • Reduce wet-lab costs & timelines

⚠️ Disclaimers

Ceftriaxone Predictor

For research/educational use only. Not a clinical diagnostic device.

  • Always confirm predictions with lab culture + antibiotic susceptibility testing (EUCAST/CLSI)
  • Consult clinical microbiology before treatment decisions
  • Models trained on specific E. coli population; validate locally

MIC Calculator

For research/design purposes only. Not validated for clinical use.

  • Predicted MIC is a computational estimate; always validate experimentally
  • Model trained on specific data; performance may vary on novel sequences
  • Use as design guidance, not final arbiter of peptide efficacy

🎯 Roadmap

Q1 2025

  • Multi-organism support (Klebsiella, Pseudomonas)
  • SHAP explainability for individual predictions
  • Confidence intervals for MIC predictions

Q2 2025

  • REST API for integration with laboratory information systems (LIS)
  • Additional antibiotics (fluoroquinolones, aminoglycosides)
  • Uncertainty quantification via Bayesian methods

Q3 2025

  • Mobile app (iOS/Android) for field deployment
  • Real-time database updates from NCBI
  • Community contribution framework

👤 Author

Vihaan Kulkarni, Bioinformatics & Machine Learning Engineer


📄 License

MIT License. Free for academic and research use.


Last Updated: December 17, 2025

Status: ✅ Active Development

Phase 6: Documentation

  1. Fill out README.md with:
    • Problem statement
    • Key insights (with screenshots)
    • Model metrics
    • Deployment link
  2. Use "Problem → Method → Insight → Impact" structure

📦 Standard Dependencies

Every project includes:

  • Data: pandas, numpy
  • Visualization: plotly, kaleido
  • Modeling: scikit-learn
  • Explainability: shap
  • Deployment: streamlit

Optional (uncomment in requirements.txt if needed):

  • Experiment Tracking: mlflow, wandb
  • Deep Learning: torch, tensorflow

💡 Pro Tips

  1. Run baseline first: Always compare against a simple model
  2. Plotly over Matplotlib: Interactive charts reveal more insights
  3. Document as you go: Fill README during the project, not after
  4. Save figures: Use fig.write_html() to preserve interactivity
  5. Version control: Commit after each major milestone

📊 Portfolio Goals

  • ✅ 1 high-quality project per week
  • ✅ Every project deployed with Streamlit
  • ✅ README formatted for resume/GitHub
  • ✅ Interactive visualizations (no static PNGs)
  • ✅ Model explainability included

Built by Vihaan Kulkarni
Senior ML Engineer & Data Storyteller
