This repository fine-tunes OLMo-7B for molecular solubility regression (ESOL dataset) using two novel techniques working together:
- RDKit-Augmented Prompts: injects MW, LogP, TPSA, HBD, HBA, Rings, and RotB directly into the LLM context, giving OLMo molecular descriptor information that it cannot easily infer from 1D SMILES alone
- Regression Head on Hidden States: instead of parsing generated text for numbers, extracts OLMo's 4096-dim last hidden state and feeds it through a learned regression head, which removes parse failures and lowered the best RMSE from 1.1274 to 0.8582
Standard LLM prompt (what everyone else does):
Molecule: CC(=O)O
Solubility:
RDKit-Augmented prompt (this work):
Molecule: CC(=O)O
MW: 60.05 | LogP: -0.17 | HBD: 1 | HBA: 2 | Rings: 0 | TPSA: 37.30 | RotB: 1
Solubility:
Why this matters: LogP is the single strongest predictor of aqueous solubility. TPSA and MW encode polar surface area and size — key drivers of solvation. By injecting these into the prompt, OLMo receives both 1D SMILES sequence grammar AND 2D molecular descriptor context simultaneously. This bridges the gap between text and topology without any architectural changes.
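The prompt construction described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `format_prompt` and `rdkit_descriptors` are hypothetical helper names, and the choice of RDKit functions (for example `Descriptors.MolLogP` for LogP) is an assumption, so computed values may differ slightly from the ones shown in the example prompt.

```python
def format_prompt(smiles: str, desc: dict) -> str:
    """Render the three-line augmented prompt from precomputed descriptors."""
    descriptor_line = (
        f"MW: {desc['MW']:.2f} | LogP: {desc['LogP']:.2f} | "
        f"HBD: {desc['HBD']} | HBA: {desc['HBA']} | Rings: {desc['Rings']} | "
        f"TPSA: {desc['TPSA']:.2f} | RotB: {desc['RotB']}"
    )
    return f"Molecule: {smiles}\n{descriptor_line}\nSolubility:"


def rdkit_descriptors(smiles: str) -> dict:
    """Compute the seven descriptors with RDKit (imported lazily).

    Assumption: these RDKit functions approximate what the notebook uses.
    """
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return {
        "MW": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "HBD": Descriptors.NumHDonors(mol),
        "HBA": Descriptors.NumHAcceptors(mol),
        "Rings": Descriptors.RingCount(mol),
        "TPSA": Descriptors.TPSA(mol),
        "RotB": Descriptors.NumRotatableBonds(mol),
    }


# Descriptor values copied from the acetic acid example prompt above.
example = {"MW": 60.05, "LogP": -0.17, "HBD": 1, "HBA": 2,
           "Rings": 0, "TPSA": 37.30, "RotB": 1}
prompt = format_prompt("CC(=O)O", example)
print(prompt)
```

With RDKit installed, `format_prompt(smi, rdkit_descriptors(smi))` builds the prompt for any parseable SMILES string.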
SMILES → RDKit Descriptors → Augmented Prompt
↓
OLMo-7B (4-bit NF4, frozen)
↓
Last Hidden State (4096-dim)
↓
Linear(4096 → 256) → GELU → Dropout(0.1)
↓
Linear(256 → 1) ← Regression Head
↓
Predicted log solubility
Key insight: The regression head attaches to OLMo's final hidden state — not to generated text. This eliminates text parsing entirely and allows direct gradient flow through the regression objective.
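A head with the layer sizes from the diagram could look like the sketch below. Two details are assumptions, since the README does not state them: the "last hidden state" is taken to be the final-layer vector of the last non-padding token (with right padding), and the loss is plain MSE. The random tensor stands in for OLMo's output so the snippet runs without loading the model.

```python
import torch
import torch.nn as nn


class RegressionHead(nn.Module):
    """Linear(4096 -> 256) -> GELU -> Dropout(0.1) -> Linear(256 -> 1)."""

    def __init__(self, hidden_size: int = 4096, inner: int = 256, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, inner),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(inner, 1),
        )

    def forward(self, hidden_states, attention_mask):
        # Index of the last non-padding token per sequence (assumes right padding).
        last = attention_mask.sum(dim=1) - 1
        batch = torch.arange(hidden_states.size(0), device=hidden_states.device)
        pooled = hidden_states[batch, last]           # (batch, hidden_size)
        return self.net(pooled.float()).squeeze(-1)   # (batch,)


head = RegressionHead()
hidden = torch.randn(2, 5, 4096)   # stand-in for OLMo's final-layer hidden states
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pred = head(hidden, mask)

# Assumed objective: MSE against the log-solubility targets.
loss = nn.functional.mse_loss(pred, torch.tensor([-1.99, -4.63]))
loss.backward()
```

In a real run, `hidden` would come from the model called with `output_hidden_states=True`, taking the last entry of the returned `hidden_states` tuple.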
Phase 1: QLoRA + Text Generation
| Epochs | Strategy | RMSE | MAE |
|---|---|---|---|
| 3 | QLoRA + text generation | 1.2169 | 0.8147 |
| 8 | QLoRA + text generation | 1.1274 | 0.7286 |
| 13 | QLoRA + text generation | 1.1630 | 0.7748 |
Finding: Text generation plateaus at ~1.13 RMSE (best 1.1274 at 8 epochs, slightly worse again at 13). The model predicts directionally correct values, but precision is limited by having to parse numbers out of generated text.
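To make the parsing limitation concrete, here is a small sketch of how a number might be pulled out of generated text. The notebook's real parsing code is not shown in this README, so `parse_solubility` and the sample outputs are hypothetical.

```python
import re

# Matches an optionally signed integer or decimal number.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_solubility(generated: str):
    """Return the first number found in the generated text, or None."""
    match = _NUMBER.search(generated)
    return float(match.group()) if match else None


# Hypothetical model outputs: clean, wordy, truncated, and unparseable.
outputs = [" -3.05", " about -2.1 log mol/L", " -4", " insoluble"]
parsed = [parse_solubility(o) for o in outputs]
print(parsed)
```

Even when parsing succeeds, the prediction is rounded to however many digits the model happened to emit, and an unparseable output yields no prediction at all. A regression head sidesteps both problems.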
Phase 2 — Regression Head on Hidden States ✅ Final
| Epochs | RMSE | MAE | Status |
|---|---|---|---|
| 1 | 1.0945 | — | Learning |
| 3 | 0.9387 | — | Improving ↑ |
| 4 | 0.8764 | — | Improving ↑ |
| 5 | 0.8802 | 0.6817 | Slight plateau |
| 7 | 0.8582 | 0.6644 | BEST ✅ |
| 8 | 0.8831 | — | Overfitting ↓ |
Best RMSE: 0.8582 — Regression Head, Epoch 7, Scaffold Split
Improvement from regression head: −0.27 RMSE vs text generation
| True | Predicted | Error |
|---|---|---|
| -1.99 | -1.95 | 0.04 |
| -8.49 | -7.56 | 0.93 |
| -4.63 | -4.45 | 0.18 |
| -2.56 | -2.80 | 0.24 |
| -1.57 | -1.80 | 0.23 |
| -3.59 | (best) | 0.15 |
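For reference, RMSE and MAE can be recomputed from the sample rows with a few lines of Python. The sketch below uses only the five rows of the table above that list both a true and a predicted value, so the result describes this small sample and not the full test set.

```python
import math

# (true, predicted) pairs copied from the five complete rows of the table above.
pairs = [(-1.99, -1.95), (-8.49, -7.56), (-4.63, -4.45),
         (-2.56, -2.80), (-1.57, -1.80)]


def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)


sample_rmse = rmse(pairs)
sample_mae = mae(pairs)
print(f"sample RMSE = {sample_rmse:.3f}, sample MAE = {sample_mae:.3f}")
```

The -8.49 row accounts for most of the squared error in this sample, which shows how strongly RMSE is driven by the least soluble molecules.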
Dataset: ESOL (Delaney) — 1128 molecules
Target: measured log solubility in mols per litre
Range: -11.60 to +1.58
Mean: -3.05
Model: allenai/OLMo-7B-hf
Quant: 4-bit NF4 (BitsAndBytes)
LoRA: r=8, lora_alpha=32, target q_proj + v_proj
Trainable: 4,194,304 / 6,892,290,048 = 0.06%
Hardware: Kaggle T4 (15.6 GB VRAM)
VRAM used: 2.11 GB (model load)
Fix: transformers==4.40.2 + device_map="auto"
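The trainable-parameter figure above can be checked by hand. The sketch assumes 32 decoder layers and square 4096 x 4096 `q_proj`/`v_proj` matrices for OLMo-7B. Neither is stated in this README, but together they reproduce the reported count exactly.

```python
hidden_size = 4096       # hidden width stated in this README
num_layers = 32          # assumption: OLMo-7B decoder depth
rank = 8                 # LoRA r
targets_per_layer = 2    # q_proj and v_proj

# Each adapted projection adds A (rank x hidden) and B (hidden x rank).
params_per_module = rank * (hidden_size + hidden_size)
trainable = params_per_module * targets_per_layer * num_layers

total = 6_892_290_048    # total parameter count reported above
fraction = trainable / total
print(f"{trainable:,} / {total:,} = {fraction:.2%}")
```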
Split: Scaffold — Train 904 | Valid 113 | Test 111
Sanitization: 0 molecules dropped (ESOL is clean)
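The idea behind a scaffold split can be sketched in plain Python: group molecules by Bemis-Murcko scaffold, then assign whole groups to train, valid, and test, largest groups first. This is an illustrative re-implementation of the greedy strategy, not the DeepChem code path the notebook uses, and it will not necessarily reproduce the 904/113/111 counts. `scaffold_of` and `scaffold_split` are hypothetical names.

```python
from collections import defaultdict


def scaffold_of(smiles: str) -> str:
    """Bemis-Murcko scaffold SMILES via RDKit (imported lazily)."""
    from rdkit.Chem.Scaffolds import MurckoScaffold
    return MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles, includeChirality=False)


def scaffold_split(smiles_list, scaffold_fn=scaffold_of, frac_train=0.8, frac_valid=0.1):
    """Greedy split that never places one scaffold in two subsets."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[scaffold_fn(smi)].append(idx)

    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(smiles_list)
    train_cut = frac_train * n
    valid_cut = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cut:
            train += group
        elif len(train) + len(valid) + len(group) <= valid_cut:
            valid += group
        else:
            test += group
    return train, valid, test


# Toy demo: letters stand in for scaffolds so the snippet runs without RDKit.
labels = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train, valid, test = scaffold_split(labels, scaffold_fn=lambda s: s)
print(len(train), len(valid), len(test))
```

Because rare scaffolds end up in the valid and test sets, the test molecules are structurally different from the training molecules, which is what makes this a harder evaluation than a random split.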
| Challenge | Solution | Impact |
|---|---|---|
| 1D SMILES lacks topology | RDKit descriptors (MW, LogP, TPSA) in prompt | LLM gains 2D molecular context |
| Text generation imprecise | Regression head on 4096-dim hidden state | −0.27 RMSE improvement |
| Scaffold generalization | DeepChem ScaffoldSplitter | True out-of-distribution evaluation |
| Memory | 4-bit NF4 quantization, 2.11 GB VRAM at model load | Runs on free Kaggle T4 |
| LR decay | Cosine scheduler with warmup | Stable convergence across 10 epochs |
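The shape of a cosine schedule with linear warmup is shown in the sketch below. The base learning rate, warmup length, and step count are placeholder values, since the README does not list them. The formula is the standard half-cosine decay.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Placeholder hyperparameters (assumptions, not taken from the notebook).
BASE_LR, TOTAL, WARMUP = 1e-4, 100, 10
schedule = [lr_at(s, TOTAL, WARMUP, BASE_LR) for s in range(TOTAL + 1)]
print(schedule[0], schedule[WARMUP], schedule[55], schedule[TOTAL])
```

The learning rate starts at zero, peaks at the end of warmup, passes half of the peak at the midpoint of the decay, and returns to zero on the final step.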
| File | Description |
|---|---|
| olmo-esol-regression.ipynb | Complete notebook: both phases, best RMSE 0.8582 |
| requirements.txt | Dependencies |
pip install transformers==4.40.2 accelerate bitsandbytes peft==0.11.1 rdkit
# Full notebook on Kaggle
# kaggle.com/sameernadeem66/olmo-esol-regression

| Task | Model | Result | Repo |
|---|---|---|---|
| BACE Classification | Mistral-7B QLoRA | 0.8371 ROC-AUC | BACE |
| BBBP Classification | Mistral-7B QLoRA | 0.7141 ROC-AUC | BBBP |
| ClinTox Classification | Mistral-7B QLoRA | 0.9913 ROC-AUC | ClinTox |
| Tox21 Multi-Task | OLMo-7B QLoRA | 0.7225 Mean ROC-AUC | Tox21 |
| ESOL Regression | OLMo-7B + Reg Head | RMSE 0.8582 | This Repo |
| SMILES Generation | OLMo-7B + RDKit TSM | 20/20 = 100% valid | Generation |