GitHub - Abu-Sameer-66/OLMO-ESOL-Regression: OLMo-7B QLoRA fine-tuned on ESOL solubility regression. RDKit-Augmented Prompts + Regression Head on 4096-dim hidden states. Best RMSE: 0.8582. DeepChem GSoC 2026.

🧬 Project Mission

This repository fine-tunes OLMo-7B for molecular solubility regression (ESOL dataset) using two novel techniques working together:

RDKit-Augmented Prompts: injects MW, LogP, TPSA, HBD, HBA, Rings, and RotB directly into the LLM context, giving OLMo topological awareness it otherwise lacks from 1D SMILES alone
Regression Head on Hidden States: instead of parsing generated text for numbers, extracts OLMo's 4096-dim last hidden state and feeds it through a learned regression head, so the prediction is a continuous value rather than a string that has to be parsed

🔬 Novel Approach — RDKit-Augmented Prompts

Standard LLM prompt (what everyone else does):

Molecule: CC(=O)O
Solubility:

RDKit-Augmented prompt (this work):

Molecule: CC(=O)O
MW: 60.05 | LogP: -0.17 | HBD: 1 | HBA: 2 | Rings: 0 | TPSA: 37.30 | RotB: 1
Solubility:

Why this matters: LogP is one of the strongest single predictors of aqueous solubility, and TPSA and MW encode polar surface area and molecular size, both key drivers of solvation. By injecting these into the prompt, OLMo receives the 1D SMILES sequence and the 2D molecular descriptor context at the same time. This bridges the gap between text and topology without any architectural changes.
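A minimal sketch of how such a prompt could be assembled. The function names `rdkit_descriptors` and `build_prompt` are illustrative and not taken from the notebook, and the mapping of each field to a specific RDKit function is an assumption. The values RDKit returns for a given molecule may differ from those in the example above, depending on which descriptor implementation the notebook used.

```python
def rdkit_descriptors(smiles: str) -> dict:
    """Compute the seven prompt descriptors with RDKit.

    RDKit is imported lazily so that build_prompt() can be used on
    precomputed values without RDKit installed.
    """
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    # Assumed mapping from prompt field to RDKit descriptor function.
    return {
        "MW": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "HBD": Descriptors.NumHDonors(mol),
        "HBA": Descriptors.NumHAcceptors(mol),
        "Rings": Descriptors.RingCount(mol),
        "TPSA": Descriptors.TPSA(mol),
        "RotB": Descriptors.NumRotatableBonds(mol),
    }


def build_prompt(smiles: str, desc: dict) -> str:
    """Format a SMILES string and its descriptors in the layout shown above."""
    descriptor_line = (
        f"MW: {desc['MW']:.2f} | LogP: {desc['LogP']:.2f} | "
        f"HBD: {desc['HBD']} | HBA: {desc['HBA']} | Rings: {desc['Rings']} | "
        f"TPSA: {desc['TPSA']:.2f} | RotB: {desc['RotB']}"
    )
    return f"Molecule: {smiles}\n{descriptor_line}\nSolubility:"


# Reproduce the README example from its stated descriptor values.
example = {"MW": 60.05, "LogP": -0.17, "HBD": 1, "HBA": 2,
           "Rings": 0, "TPSA": 37.30, "RotB": 1}
print(build_prompt("CC(=O)O", example))
```

With RDKit installed, `build_prompt(smiles, rdkit_descriptors(smiles))` would produce the same layout for any parseable SMILES.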

🏗️ Architecture — Two-Stage Design

SMILES → RDKit Descriptors → Augmented Prompt
                                    ↓
                            OLMo-7B (4-bit NF4, frozen)
                                    ↓
                        Last Hidden State (4096-dim)
                                    ↓
                        Linear(4096 → 256) → GELU → Dropout(0.1)
                                    ↓
                            Linear(256 → 1)  ← Regression Head
                                    ↓
                        Predicted log solubility

Key insight: The regression head attaches to OLMo's final hidden state, not to generated text. This eliminates text parsing entirely and lets the regression loss be optimized directly on a continuous output.
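The head in the diagram can be sketched in PyTorch as below. The layer sizes, GELU, and dropout rate come from the diagram; the class name `RegressionHead` and the pooling choice (taking the hidden state of the last non-padding token, assuming right padding) are assumptions, since the README does not say which token position is used.

```python
import torch
import torch.nn as nn


class RegressionHead(nn.Module):
    """Linear(4096 -> 256) -> GELU -> Dropout(0.1) -> Linear(256 -> 1)."""

    def __init__(self, hidden_size: int = 4096, proj_size: int = 256,
                 dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, proj_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(proj_size, 1),
        )

    def forward(self, hidden_states: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size), final-layer output
        # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
        # Assumption: right padding, so the last real token sits at sum(mask) - 1.
        last_idx = attention_mask.long().sum(dim=1) - 1
        batch_idx = torch.arange(hidden_states.size(0),
                                 device=hidden_states.device)
        pooled = hidden_states[batch_idx, last_idx]       # (batch, hidden_size)
        # Cast to float32 in case the backbone emits half precision.
        return self.net(pooled.float()).squeeze(-1)       # (batch,)


# Shape check with random tensors standing in for OLMo's hidden states.
head = RegressionHead()
fake_hidden = torch.randn(2, 5, 4096)
fake_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pred = head(fake_hidden, fake_mask)
loss = nn.functional.mse_loss(pred, torch.tensor([-3.05, -1.99]))
```

In a full pipeline, `hidden_states` would come from the backbone called with `output_hidden_states=True`, and the MSE loss would replace the language-modelling loss used in the text-generation phase.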

📊 Complete Results — All Experiments

Phase 1 — Text Generation Approach

Epochs	Strategy	RMSE	MAE
3	QLoRA + text generation	1.2169	0.8147
8	QLoRA + text generation	1.1274	0.7286
13	QLoRA + text generation	1.1630	0.7748

Finding: Text generation plateaus at ~1.13 RMSE (best 1.1274 at 8 epochs; 13 epochs is worse at 1.1630). The model predicts directionally correct values, but precision is limited by text parsing.
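To illustrate why parsing limits precision, here is a hedged sketch of the kind of post-processing a text-generation approach needs. The helper `parse_solubility` and its regular expression are hypothetical; the notebook's actual parsing logic is not shown in this README.

```python
import re
from typing import Optional

# First signed integer or decimal number in the generated continuation.
_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_solubility(generated: str) -> Optional[float]:
    """Return the first number found in the model output, or None."""
    match = _NUMBER.search(generated)
    return float(match.group()) if match else None


print(parse_solubility(" -3.05\nMolecule:"))  # a clean continuation
print(parse_solubility("-3."))                # truncated decimals are lost
print(parse_solubility("highly soluble"))     # no number at all
```

Every prediction is rounded to however many digits the model happens to emit, and outputs with no number have to be dropped or imputed. A regression head avoids both failure modes.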

Phase 2 — Regression Head on Hidden States ✅ Final

Epochs	RMSE	MAE	Status
1	1.0945	—	Learning
3	0.9387	—	Improving ↑
4	0.8764	—	Improving ↑
5	0.8802	0.6817	Slight plateau
7	0.8582	0.6644	BEST ✅
8	0.8831	—	Overfitting ↓

Best RMSE: 0.8582 — Regression Head, Epoch 7, Scaffold Split

Improvement from regression head: −0.27 RMSE vs text generation (1.1274 → 0.8582)
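The headline numbers can be re-derived from the two result tables. This is only arithmetic on the reported values; the dictionary and function names are illustrative.

```python
# RMSE per epoch, copied from the Phase 1 and Phase 2 tables.
phase1_rmse = {3: 1.2169, 8: 1.1274, 13: 1.1630}
phase2_rmse = {1: 1.0945, 3: 0.9387, 4: 0.8764, 5: 0.8802, 7: 0.8582, 8: 0.8831}


def best_epoch(results: dict) -> tuple:
    """Return (epoch, rmse) for the lowest RMSE in a results table."""
    epoch = min(results, key=results.get)
    return epoch, results[epoch]


text_epoch, text_rmse = best_epoch(phase1_rmse)
head_epoch, head_rmse = best_epoch(phase2_rmse)
improvement = text_rmse - head_rmse
relative = improvement / text_rmse

print(f"Best text generation: epoch {text_epoch}, RMSE {text_rmse:.4f}")
print(f"Best regression head: epoch {head_epoch}, RMSE {head_rmse:.4f}")
print(f"Absolute improvement: {improvement:.2f} RMSE ({relative:.1%} relative)")
```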

Sample Predictions (Best Model)

True	Predicted	Error
-1.99	-1.95	0.04
-8.49	-7.56	0.93
-4.63	-4.45	0.18
-2.56	-2.80	0.24
-1.57	-1.80	0.23
-3.59	(best)	0.15
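As a worked example of the two metrics, the snippet below computes MAE and RMSE over the five rows above that list both a true and a predicted value (the last row is skipped because its prediction is not shown). These five-sample figures are not the test-set metrics reported earlier; they only show how the single large miss at -8.49 affects RMSE more than MAE.

```python
import math

# (true, predicted) pairs from the sample table.
samples = [
    (-1.99, -1.95),
    (-8.49, -7.56),
    (-4.63, -4.45),
    (-2.56, -2.80),
    (-1.57, -1.80),
]

errors = [pred - true for true, pred in samples]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE  over {len(samples)} samples: {mae:.3f}")
print(f"RMSE over {len(samples)} samples: {rmse:.3f}")
```

RMSE squares each error before averaging, so the 0.93 miss dominates it, while MAE weights all five errors equally.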

🔧 Hardware & Setup

Dataset:   ESOL (Delaney) — 1128 molecules
Target:    measured log solubility in mols per litre
Range:     -11.60 to +1.58
Mean:      -3.05

Model:     allenai/OLMo-7B-hf
Quant:     4-bit NF4 (BitsAndBytes)
LoRA:      r=8, lora_alpha=32, target q_proj + v_proj
Trainable: 4,194,304 / 6,892,290,048 = 0.06%
Hardware:  Kaggle T4 (15.6 GB VRAM)
VRAM used: 2.11 GB (model load)
Fix:       transformers==4.40.2 + device_map="auto"

Split:     Scaffold — Train 904 | Valid 113 | Test 111
Sanitization: 0 molecules dropped (ESOL is clean)
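The trainable-parameter count above can be checked with a short calculation. It assumes OLMo-7B has 32 transformer layers with hidden size 4096, that `q_proj` and `v_proj` are square 4096 × 4096 projections, and that LoRA adds two bias-free matrices (r × in and out × r) per target module; the variable names are illustrative.

```python
# Assumed OLMo-7B dimensions (not stated in the setup block above).
num_layers = 32
hidden_size = 4096

# LoRA settings from the setup block.
rank = 8
target_modules = ["q_proj", "v_proj"]

# Each adapted module gets A (rank x in_features) and B (out_features x rank).
params_per_module = rank * (hidden_size + hidden_size)
lora_params = num_layers * len(target_modules) * params_per_module

total_params = 6_892_290_048  # total reported in the setup block
percent = 100 * lora_params / total_params

print(f"LoRA trainable parameters: {lora_params:,}")
print(f"Share of all parameters:   {percent:.2f}%")
```

The result matches the reported 4,194,304 trainable parameters, which suggests that figure counts the LoRA adapters only and not the regression head.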

🚀 Key Scientific Contributions

Challenge	Solution	Impact
1D SMILES lacks topology	RDKit descriptors (MW, LogP, TPSA) in prompt	LLM gains 2D molecular context
Text generation imprecise	Regression head on 4096-dim hidden state	−0.27 RMSE improvement
Scaffold generalization	DeepChem ScaffoldSplitter	True out-of-distribution evaluation
Memory	4-bit NF4, only 2.11 GB VRAM	Runs on free Kaggle T4
LR decay	Cosine scheduler with warmup	Stable convergence across 10 epochs
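The cosine schedule with warmup in the last row can be sketched as a plain function. This follows the common formulation (linear warmup, then half-cosine decay to zero); the function name and the warmup and step counts in the usage lines are hypothetical, as the README does not give the actual values.

```python
import math


def lr_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half-cosine decay from 1 to 0.
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 100 total steps with 10 warmup steps.
for step in (0, 5, 10, 55, 100):
    print(step, round(lr_multiplier(step, warmup_steps=10, total_steps=100), 3))
```

The warmup phase avoids large early updates to freshly initialized weights, and the smooth decay reduces the step size as training approaches its final epochs.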

📂 File Index

File	Description
`olmo-esol-regression.ipynb`	Complete notebook — both phases, best RMSE 0.8582
`requirements.txt`	Dependencies

💻 How to Run

pip install transformers==4.40.2 accelerate bitsandbytes peft==0.11.1 rdkit

# Full notebook on Kaggle
# kaggle.com/sameernadeem66/olmo-esol-regression

🔗 Part of DeepChem GSoC 2026 Research

Task	Model	Result	Repo
BACE Classification	Mistral-7B QLoRA	0.8371 ROC-AUC	BACE
BBBP Classification	Mistral-7B QLoRA	0.7141 ROC-AUC	BBBP
ClinTox Classification	Mistral-7B QLoRA	0.9913 ROC-AUC	ClinTox
Tox21 Multi-Task	OLMo-7B QLoRA	0.7225 Mean ROC-AUC	Tox21
ESOL Regression	OLMo-7B + Reg Head	RMSE 0.8582	This Repo
SMILES Generation	OLMo-7B + RDKit TSM	20/20 = 100% valid	Generation
