Abu-Sameer-66/OLMO-ESOL-Regression


🧬 Project Mission

This repository fine-tunes OLMo-7B for molecular solubility regression (ESOL dataset) using two novel techniques working together:

  1. RDKit-Augmented Prompts: injects molecular weight (MW), LogP, TPSA, H-bond donors (HBD), H-bond acceptors (HBA), ring count (Rings), and rotatable bonds (RotB) directly into the LLM context, giving OLMo topological awareness it otherwise lacks from 1D SMILES alone
  2. Regression Head on Hidden States: instead of parsing generated text for numbers, extracts OLMo's 4096-dim last hidden state and feeds it through a learned regression head, so each prediction is a continuous value rather than a string that must be parsed back into a number

🔬 Novel Approach — RDKit-Augmented Prompts

Standard LLM prompt (what everyone else does):

Molecule: CC(=O)O
Solubility:

RDKit-Augmented prompt (this work):

Molecule: CC(=O)O
MW: 60.05 | LogP: -0.17 | HBD: 1 | HBA: 2 | Rings: 0 | TPSA: 37.30 | RotB: 1
Solubility:

Why this matters: LogP is the single strongest predictor of aqueous solubility. TPSA and MW encode polar surface area and size — key drivers of solvation. By injecting these into the prompt, OLMo receives both 1D SMILES sequence grammar AND 2D molecular descriptor context simultaneously. This bridges the gap between text and topology without any architectural changes.
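The prompt construction above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function names `compute_descriptors` and `format_prompt` are hypothetical, and the choice of RDKit functions for each descriptor is an assumption, so the values RDKit returns may differ slightly from the example prompt shown above.

```python
def compute_descriptors(smiles: str) -> dict:
    """Compute the seven prompt descriptors with RDKit.

    RDKit is imported lazily so format_prompt() can be used without it.
    Which RDKit function backs each descriptor is an assumption.
    """
    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors, Lipinski, rdMolDescriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return {
        "MW": Descriptors.MolWt(mol),
        "LogP": Crippen.MolLogP(mol),
        "HBD": Lipinski.NumHDonors(mol),
        "HBA": Lipinski.NumHAcceptors(mol),
        "Rings": rdMolDescriptors.CalcNumRings(mol),
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
        "RotB": Lipinski.NumRotatableBonds(mol),
    }


def format_prompt(smiles: str, d: dict) -> str:
    """Render the augmented prompt in the layout shown above."""
    return (
        f"Molecule: {smiles}\n"
        f"MW: {d['MW']:.2f} | LogP: {d['LogP']:.2f} | HBD: {d['HBD']} | "
        f"HBA: {d['HBA']} | Rings: {d['Rings']} | TPSA: {d['TPSA']:.2f} | "
        f"RotB: {d['RotB']}\n"
        "Solubility:"
    )


# Descriptor values copied from the example prompt above (acetic acid).
example = {"MW": 60.05, "LogP": -0.17, "HBD": 1, "HBA": 2,
           "Rings": 0, "TPSA": 37.30, "RotB": 1}
prompt = format_prompt("CC(=O)O", example)
print(prompt)
```

With RDKit installed, `format_prompt(smiles, compute_descriptors(smiles))` builds the prompt for any valid SMILES string.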


🏗️ Architecture — Two-Stage Design

SMILES → RDKit Descriptors → Augmented Prompt
                                    ↓
                            OLMo-7B (4-bit NF4, frozen)
                                    ↓
                        Last Hidden State (4096-dim)
                                    ↓
                        Linear(4096 → 256) → GELU → Dropout(0.1)
                                    ↓
                            Linear(256 → 1)  ← Regression Head
                                    ↓
                        Predicted log solubility

Key insight: The regression head attaches to OLMo's final hidden state — not to generated text. This eliminates text parsing entirely and allows direct gradient flow through the regression objective.
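A minimal PyTorch sketch of the head in the diagram is given below. The layer sizes and dropout rate come from the diagram; everything else is an assumption. In particular, the README does not say which token's hidden state is used, so `pool_last_token` (a hypothetical helper) assumes right padding and takes the last non-padding token. PyTorch is imported lazily so the parameter-count helper runs without it.

```python
def head_param_count(hidden_size: int = 4096, mid: int = 256) -> int:
    """Weights plus biases of Linear(hidden_size, mid) and Linear(mid, 1)."""
    return (hidden_size * mid + mid) + (mid * 1 + 1)


def build_regression_head(hidden_size: int = 4096, mid: int = 256,
                          p_drop: float = 0.1):
    """Linear -> GELU -> Dropout -> Linear, matching the diagram above."""
    import torch.nn as nn

    return nn.Sequential(
        nn.Linear(hidden_size, mid),
        nn.GELU(),
        nn.Dropout(p_drop),
        nn.Linear(mid, 1),
    )


def pool_last_token(hidden_states, attention_mask):
    """Select one vector per sequence (assumption: last non-padding token).

    hidden_states:  (batch, seq_len, hidden) tensor from the final layer
    attention_mask: (batch, seq_len) tensor of 0/1, right-padded
    returns:        (batch, hidden) tensor
    """
    import torch

    last_idx = attention_mask.long().sum(dim=1) - 1
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]


# The head itself is small next to the 7B backbone.
print(f"Regression head parameters: {head_param_count():,}")
```

With a Hugging Face causal LM, the final-layer states are typically obtained by calling the model with `output_hidden_states=True` and reading `outputs.hidden_states[-1]`; the pooled vector is then passed through the head and trained with a regression loss such as MSE.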


📊 Complete Results — All Experiments

Phase 1 — Text Generation Approach

| Epochs | Strategy | RMSE | MAE |
|---|---|---|---|
| 3 | QLoRA + text generation | 1.2169 | 0.8147 |
| 8 | QLoRA + text generation | 1.1274 | 0.7286 |
| 13 | QLoRA + text generation | 1.1630 | 0.7748 |

Finding: Text generation plateaus at ~1.13 RMSE (best 1.1274 at 8 epochs; 13 epochs is worse at 1.1630). The model predicts directionally correct values, but precision is limited by parsing numbers out of generated text.

Phase 2 — Regression Head on Hidden States ✅ Final

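| Epochs | RMSE | MAE | Status |
|---|---|---|---|
| 1 | 1.0945 | not reported | Learning |
| 3 | 0.9387 | not reported | Improving ↑ |
| 4 | 0.8764 | not reported | Improving ↑ |
| 5 | 0.8802 | 0.6817 | Slight plateau |
| 7 | 0.8582 | 0.6644 | BEST |
| 8 | 0.8831 | not reported | Overfitting ↓ |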

Best RMSE: 0.8582 — Regression Head, Epoch 7, Scaffold Split

Improvement from regression head: −0.27 RMSE vs text generation
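The best-epoch choice and the −0.27 figure can be reproduced directly from the numbers in the two tables above. The sketch below only restates that arithmetic; the variable names are illustrative, and the README does not say whether these RMSE values were logged on the validation or the test split.

```python
# RMSE per logged epoch for the regression head (Phase 2 table).
rmse_by_epoch = {1: 1.0945, 3: 0.9387, 4: 0.8764, 5: 0.8802, 7: 0.8582, 8: 0.8831}

# RMSE for the three text-generation runs (Phase 1 table), keyed by epochs.
text_gen_rmse = {3: 1.2169, 8: 1.1274, 13: 1.1630}

# Keep the checkpoint with the lowest RMSE; later epochs that get worse
# (epoch 8 here) are treated as overfitting and discarded.
best_epoch = min(rmse_by_epoch, key=rmse_by_epoch.get)
best_rmse = rmse_by_epoch[best_epoch]

# Compare against the best text-generation run, not the average.
improvement = min(text_gen_rmse.values()) - best_rmse

print(f"best epoch: {best_epoch}, RMSE {best_rmse:.4f}")
print(f"improvement over text generation: {improvement:.4f}")
```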

Sample Predictions (Best Model)

| True | Predicted | Error |
|---|---|---|
| -1.99 | -1.95 | 0.04 |
| -8.49 | -7.56 | 0.93 |
| -4.63 | -4.45 | 0.18 |
| -2.56 | -2.80 | 0.24 |
| -1.57 | -1.80 | 0.23 |
| -3.59 | (best) | 0.15 |
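As a worked example of the two metrics used throughout, the sketch below computes the absolute error, MAE, and RMSE for the five rows above that list both a true and a predicted value. The last row is left out because its predicted value is not given. These five samples are a small illustration, not the test set, so the resulting figures are not comparable to the reported 0.8582.

```python
import math

# (true, predicted) log-solubility pairs copied from the table above.
pairs = [
    (-1.99, -1.95),
    (-8.49, -7.56),
    (-4.63, -4.45),
    (-2.56, -2.80),
    (-1.57, -1.80),
]


def mae(samples):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in samples) / len(samples)


def rmse(samples):
    """Root mean squared error: penalises large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in samples) / len(samples))


abs_errors = [round(abs(t - p), 2) for t, p in pairs]
sample_mae = mae(pairs)
sample_rmse = rmse(pairs)

print("absolute errors:", abs_errors)
print(f"MAE  = {sample_mae:.3f}")
print(f"RMSE = {sample_rmse:.3f}")
```

The single large miss on the least soluble molecule (-8.49) dominates the RMSE, which is why RMSE sits well above MAE on this sample.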

🔧 Hardware & Setup

Dataset:   ESOL (Delaney) — 1128 molecules
Target:    measured log solubility in mols per litre
Range:     -11.60 to +1.58
Mean:      -3.05

Model:     allenai/OLMo-7B-hf
Quant:     4-bit NF4 (BitsAndBytes)
LoRA:      r=8, lora_alpha=32, target q_proj + v_proj
Trainable: 4,194,304 / 6,892,290,048 = 0.06%
Hardware:  Kaggle T4 (15.6 GB VRAM)
VRAM used: 2.11 GB (model load)
Fix:       transformers==4.40.2 + device_map="auto"

Split:     Scaffold — Train 904 | Valid 113 | Test 111
Sanitization: 0 molecules dropped (ESOL is clean)
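The quantization and LoRA settings listed above map onto the `transformers` and `peft` APIs roughly as sketched below. This is an illustration under assumptions, not the notebook's code: the compute dtype, the use of `prepare_model_for_kbit_training`, and the function names are not stated in this README. The arithmetic helper shows where the trainable count comes from, assuming 32 transformer layers with 4096 x 4096 `q_proj` and `v_proj` matrices. The result matches the quoted 4,194,304, which suggests that figure counts the LoRA adapters only and not the regression head.

```python
def lora_trainable_params(hidden_size: int = 4096, num_layers: int = 32,
                          rank: int = 8, targets_per_layer: int = 2) -> int:
    """LoRA adds A (rank x hidden) and B (hidden x rank) per adapted matrix."""
    per_module = rank * hidden_size + hidden_size * rank
    return per_module * targets_per_layer * num_layers


def load_qlora_model(model_id: str = "allenai/OLMo-7B-hf"):
    """Load the 4-bit NF4 base model and attach LoRA adapters.

    Heavy imports are lazy; calling this downloads the full checkpoint.
    """
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute on a T4
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)  # assumption: not stated above
    lora_config = LoraConfig(r=8, lora_alpha=32,
                             target_modules=["q_proj", "v_proj"])
    return get_peft_model(model, lora_config)


trainable = lora_trainable_params()
total = 6_892_290_048  # total parameter count quoted above
print(f"{trainable:,} / {total:,} = {100 * trainable / total:.2f}%")
```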

🚀 Key Scientific Contributions

| Challenge | Solution | Impact |
|---|---|---|
| 1D SMILES lacks topology | RDKit descriptors (MW, LogP, TPSA) in prompt | LLM gains 2D molecular context |
| Text generation imprecise | Regression head on 4096-dim hidden state | −0.27 RMSE improvement |
| Scaffold generalization | DeepChem ScaffoldSplitter | True out-of-distribution evaluation |
| Memory | 4-bit NF4, only 2.11 GB VRAM | Runs on free Kaggle T4 |
| LR decay | Cosine scheduler with warmup | Stable convergence across 10 epochs |
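The cosine schedule with warmup named in the last row can be sketched as a learning-rate multiplier: a linear ramp from 0 to 1 over the warmup steps, followed by a half-cosine decay back to 0. This follows the same shape as `get_cosine_schedule_with_warmup` in `transformers`; the step counts used below are illustrative assumptions, since the README does not give the warmup length or batch size.

```python
import math


def lr_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup: 0 -> 1 over the first warmup_steps steps.
        return step / max(1, warmup_steps)
    # Cosine decay: 1 -> 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Illustrative values only: 10 warmup steps out of 100 total.
WARMUP, TOTAL = 10, 100
for step in (0, 5, 10, 55, 100):
    print(step, round(lr_multiplier(step, WARMUP, TOTAL), 4))
```

The warmup avoids large early updates to the freshly initialised regression head, and the smooth decay lets later epochs make progressively smaller adjustments.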

📂 File Index

| File | Description |
|---|---|
| olmo-esol-regression.ipynb | Complete notebook (both phases, best RMSE 0.8582) |
| requirements.txt | Dependencies |

💻 How to Run

pip install transformers==4.40.2 accelerate bitsandbytes peft==0.11.1 rdkit

# Full notebook on Kaggle
# kaggle.com/sameernadeem66/olmo-esol-regression

🔗 Part of DeepChem GSoC 2026 Research

| Task | Model | Result | Repo |
|---|---|---|---|
| BACE Classification | Mistral-7B QLoRA | 0.8371 ROC-AUC | BACE |
| BBBP Classification | Mistral-7B QLoRA | 0.7141 ROC-AUC | BBBP |
| ClinTox Classification | Mistral-7B QLoRA | 0.9913 ROC-AUC | ClinTox |
| Tox21 Multi-Task | OLMo-7B QLoRA | 0.7225 Mean ROC-AUC | Tox21 |
| ESOL Regression | OLMo-7B + Reg Head | RMSE 0.8582 | This Repo |
| SMILES Generation | OLMo-7B + RDKit TSM | 20/20 = 100% valid | Generation |

About

OLMo-7B QLoRA fine-tuned on ESOL solubility regression. RDKit-Augmented Prompts + Regression Head on 4096-dim hidden states. Best RMSE: 0.8582. DeepChem GSoC 2026.
