A research-only QSAR validation prototype for retrospective analysis of public bioactivity data. The current DAT-centered path is deliberately narrow: fixed ChEMBL snapshots, assay-context metadata, scaffold and external splits, uncertainty, applicability domain, and reproducible CLI/API surfaces that a scientific reviewer can inspect without GPU access.
This repository is not intended to recommend, rank, synthesize, optimize, dose, or support human use of psychoactive substances or therapeutic candidates. Predictions are exploratory software outputs and must not be interpreted as prospective potency, safety, medical, regulatory, or manufacturing evidence.
The project started as a DAT activity predictor and now demonstrates a modular research pipeline for:
- multi-target pIC50 and pKi modeling across DAT, 5-HT2A, CB1, CB2, and opioid receptors
- RDKit descriptors, ECFP4/MACCS fingerprints, SMARTS flags, and graph features
- elastic-looped Transformer regression as a third deep-learning path after descriptor Transformer and GNN baselines
- ETKDG 3D conformer generation with geometry descriptors
- ADMET and developability descriptors for non-decisional research review
- synthetic accessibility scoring interfaces kept separate from QSAR claims
- reaction-planning interfaces kept separate from public potency reporting
- molecule image features for multimodal image + structure experiments
- optional Prefect/Airflow-style automation hooks
- future AlphaFold3 and docking simulation integration contracts
See docs/research_only_qsar_validation_plan.md for the current safety boundary, MVP gates, BioRender figure plan, and GitHub/Hugging Face release checklist.
| Field | Current public evidence |
|---|---|
| Model surface | Transformer pIC50 workflow, optional GNN adapters, elastic-looped Transformer path, ensemble hooks, uncertainty reporting, and no-model compound assessment paths |
| Dataset surface | ChEMBL-backed pIC50/pKi target activity workflows for DAT, 5-HT2A, CB1, CB2, and opioid receptors, with assay-context lineage and SMILES inputs treated as retrospective research queries |
| Feature engineering | RDKit descriptors, ECFP4/MACCS fingerprints, SMARTS flags, ETKDGv3 3D descriptors, graph features, ADMET proxies, and synthetic accessibility scoring |
| Repro command | uv sync then uv run python cli.py train --target CHEMBL238 --optimize and uv run python cli.py assess --smiles "CCN(CC)CC" |
| Metrics to inspect | Unit/integration tests cover model, pipeline, discovery extension, and structure integration contracts; promote benchmark tables here when a calibrated public run is available |
| Limitations | Research-only retrospective analysis; pIC50/pKi, ADMET, docking, synthesis, and route outputs require calibration, provenance review, and expert approval before any real-world interpretation |
The CPU Ridge prediction for methylphenidate is directionally useful but falls about
1.33 log units below the literature mean. A natural next model to try is ELT, based on
zapabob/elastic-looped-transformer: a Transformer block is shared across a
selectable number of loop iterations, so the same checkpoint can trade latency
for iterative refinement. In this repo, the idea is adapted from causal language
modeling to pIC50 regression over molecular descriptor tokens.
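The weight-sharing idea can be sketched without any deep-learning dependency. The toy below is not the repository's ElasticLoopedPIC50Model: `shared_block`, `predict`, `WEIGHTS`, and `HEAD` are hypothetical stand-ins for a Transformer block and a regression head. It only shows that one fixed set of parameters yields a different prediction for each loop count.

```python
import math

# Hypothetical stand-ins for one shared Transformer block and a regression head.
WEIGHTS = [[0.5, -0.2], [0.1, 0.3]]
HEAD = [1.0, 0.5]


def shared_block(state):
    """One residual update; the same WEIGHTS are reused on every loop."""
    update = [math.tanh(sum(w * s for w, s in zip(row, state))) for row in WEIGHTS]
    return [s + u for s, u in zip(state, update)]


def predict(tokens, loop_count):
    """Apply the shared block loop_count times, then read out a scalar."""
    state = list(tokens)
    for _ in range(loop_count):
        state = shared_block(state)
    return sum(h * s for h, s in zip(HEAD, state))


# Same parameters, four compute budgets.
trajectory = [predict([0.4, 1.2], loop_count) for loop_count in (1, 2, 3, 4)]
```

Because no parameters are added per loop, the loop count can be chosen at evaluation time, which is what allows one checkpoint to be scored under several schedules.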
Current implementation:
- src/models/elastic_looped_transformer.py adds ElasticLoopedPIC50Model and LitElasticLoopedPIC50.
- MultimodalElasticLoopedPIC50Model extends that path with ViT-style molecule image patches plus descriptor tokens, leaving room for graph summary tokens.
- train-elt exposes the model from the CLI.
- deep-cv compares compact GNN and multimodal ELT runs on the same stable scaffold folds.
- The same checkpoint can be evaluated with shorter or longer loop schedules, making it a practical candidate for uncertainty and budget-sensitive pharma triage.
uv run python -B cli.py train-elt --target CHEMBL238 --loop-count 4 --epochs 20
CHEMBL238 CPU smoke run on the frozen snapshot:
uv run python -B scripts/run_elt_chembl238_smoke.py
The checked smoke report is
artifacts/elt_chembl238_smoke_report.json. In the 5-epoch CPU run, the ELT
path is not yet globally stronger than Ridge (external R2 = -0.0213, RMSE = 1.1566), but the methylphenidate loop trajectory is useful: pIC50 moves from
4.7812 at L=1 to 6.3530 at L=4. That is still 1.0189 log units weaker than
the literature mean, but it is about 0.3130 pIC50 closer than the Ridge baseline
and roughly halves the methylphenidate fold error from about 21x to about 10x.
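The fold-error figures follow from the log-unit gaps, because a pIC50 difference of d corresponds to a factor of 10^d in IC50. A minimal check, where `fold_error` is a hypothetical helper and the inputs are the values quoted in this README:

```python
def fold_error(delta_log_units: float) -> float:
    """Convert a pIC50 gap into a multiplicative IC50 error."""
    return 10 ** abs(delta_log_units)


literature_mean = 7.3719             # literature pIC50 mean for methylphenidate
ridge_gap = 6.0400 - literature_mean  # CPU Ridge prediction minus literature
elt_gap = 6.3530 - literature_mean    # ELT at L=4 minus literature

ridge_fold = fold_error(ridge_gap)    # roughly 21x
elt_fold = fold_error(elt_gap)        # roughly 10x
```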
Reference implementation: https://github.com/zapabob/elastic-looped-transformer
Multimodal ELT/GNN cross-validation smoke run:
uv run python -B cli.py deep-cv --folds 3 --epochs 2 --max-rows 240
The checked report is artifacts/deep_cv_chembl238_report.json. This run uses
the frozen CHEMBL238 snapshot, holds external rows out of CV, and evaluates 240
CPU-sampled non-external rows with stable scaffold-hash folds. It is still a
small smoke comparison, not a production benchmark: multimodal ELT averaged
R2 = -0.0342, RMSE = 1.1735, MAE = 0.9881, MSE loss = 1.3785, while
compact GNN averaged R2 = -0.1055, RMSE = 1.2146, MAE = 1.0161,
MSE loss = 1.4854. The value is that the same
fold policy now compares graph and ViT-style looped-Transformer candidates.
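One way to obtain scaffold folds that stay stable across runs is to hash each scaffold string with a cryptographic digest, since Python's built-in `hash()` is salted per process. The sketch below assumes Murcko scaffold SMILES have already been computed; `scaffold_fold` and the example rows are hypothetical, and the repository's own fold policy may differ.

```python
import hashlib


def scaffold_fold(scaffold_smiles: str, n_folds: int = 3) -> int:
    """Map a scaffold string to a fold index that is identical on every run."""
    digest = hashlib.sha256(scaffold_smiles.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


# Hypothetical rows: (compound id, precomputed scaffold SMILES).
rows = [
    ("cpd-1", "c1ccccc1"),
    ("cpd-2", "c1ccccc1"),
    ("cpd-3", "C1CCNCC1"),
]
folds = {compound_id: scaffold_fold(scaffold) for compound_id, scaffold in rows}
```

Compounds that share a scaffold always land in the same fold, so no scaffold is split between training and validation.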
Category-expanded scaffold CV run:
uv run python -B cli.py build-chembl-snapshot --targets CHEMBL224,CHEMBL218,CHEMBL253,CHEMBL233,CHEMBL236,CHEMBL238 --output data/chembl_category_pic50_snapshot.csv --manifest artifacts/chembl_category_pic50_snapshot.manifest.json --max-rows-per-target 300
uv run python -B cli.py deep-cv --snapshot data/chembl_category_pic50_snapshot.csv --output artifacts/deep_cv_category_report.json --target ALL --folds 3 --epochs 2 --max-rows 0
This expands the CV evidence to psychedelic (CHEMBL224), cannabinoid
(CHEMBL218, CHEMBL253), opioid (CHEMBL233, CHEMBL236), and
phenethylamine-like structure labels. The checked category report uses 1,800
ChEMBL rows, excludes 262 external rows from CV, and evaluates 1,538 rows with
stable scaffold folds.
| Model | Scope | n | R2 | RMSE | MAE | MSE loss |
|---|---|---|---|---|---|---|
| multimodal ELT | overall | 1,538 | 0.1413 | 1.2627 | 1.0510 | 1.5952 |
| GNN | overall | 1,538 | 0.0118 | 1.3560 | 1.1249 | 1.8410 |
| multimodal ELT | psychedelic | 272 | -0.5130 | 1.2497 | 0.9878 | 1.5616 |
| GNN | psychedelic | 272 | -0.5412 | 1.2612 | 1.0163 | 1.5907 |
| multimodal ELT | cannabinoid | 488 | 0.1243 | 1.3665 | 1.1450 | 1.8672 |
| GNN | cannabinoid | 488 | 0.0032 | 1.4579 | 1.2177 | 2.1256 |
| multimodal ELT | opioid | 519 | 0.0141 | 1.2454 | 1.0875 | 1.5511 |
| GNN | opioid | 519 | -0.1291 | 1.3328 | 1.1560 | 1.7764 |
| multimodal ELT | phenethylamine | 1,066 | 0.1576 | 1.3040 | 1.1024 | 1.7005 |
| GNN | phenethylamine | 1,066 | 0.0022 | 1.4193 | 1.1932 | 2.0143 |
The opioid slice includes checked mu-opioid (CHEMBL233) and delta-opioid
(CHEMBL236) rows. Kappa-opioid (CHEMBL237) remains mapped in code, but the
local ChEMBL fetch timed out before producing a checked snapshot.
This README is written for four reviewers at once:
| Audience | What to inspect | Why it matters |
|---|---|---|
| Pharmacology / cheminformatics review | Fixed CHEMBL238 snapshot, methylphenidate literature check, target-level R2/RMSE/MAE, context of use | Shows the model is framed as retrospective research software, with evidence separated from regulatory or therapeutic claims |
| MLOps | Dataset manifest, split policy, checksum, JSON model artifact, CPU reproducibility, /health endpoint | Makes data lineage, reproducibility, deployment shape, and lifecycle hooks visible |
| LLMOps | Structured API outputs, model version, uncertainty, applicability-domain status, research-only language | Lets an LLM copilot quote bounded evidence instead of inventing model confidence or use claims |
| AI engineering | RDKit descriptors, scikit-learn CPU baseline, FastAPI, tests, Docker CPU service | Gives a small but complete reference path from data to model to service |
Current CHEMBL238 CPU benchmark:
| Split | n | R2 | RMSE | MAE |
|---|---|---|---|---|
| train | 1,762 | 0.2450 | 1.0474 | 0.8553 |
| scaffold_test | 359 | 0.3263 | 0.8699 | 0.7090 |
| external | 261 | 0.2062 | 1.0197 | 0.8295 |
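For readers comparing the tables in this README, the sketch below shows how R2, RMSE, and MAE are conventionally computed, and why R2 can be negative: it falls below zero whenever the model's squared error exceeds that of predicting the observed mean. `regression_metrics` is a hypothetical helper and the numbers are illustrative, not repository data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, and MAE for paired observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


observed = [5.0, 6.0, 7.0, 8.0]
# Predicting the observed mean everywhere gives R2 = 0 by construction.
mean_guess = regression_metrics(observed, [6.5] * 4)
# A constant guess away from the mean gives a negative R2.
off_centre = regression_metrics(observed, [7.5] * 4)
```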
Methylphenidate activity check against CHEMBL238 DAT literature values:
| Statistic | Value |
|---|---|
| Literature IC50 values | 17.0, 19.9, 79.0, 121.7 nM |
| Literature pIC50 mean | 7.3719 |
| Literature pIC50 95% CI | 6.6917 to 8.0521 |
| Geometric mean IC50 | 42.4673 nM |
| CPU model prediction | pIC50 6.0400, IC50 912.0108 nM |
| Model uncertainty / applicability domain | 0.8700, in-domain |
| Model minus literature mean | -1.3319 log units |
| One-sample t-test vs literature mean | t(3) = -6.2317, two-sided p = 0.008333 |
| Effect size | Cohen dz = -3.1159 |
| Observed power | 0.9754 at alpha = 0.05, two-sided |
| Inactive-rule result | 0 methylphenidate rows marked inactive under IQL / qualitative inactive or >=1000 uM rule |
Interpretation: methylphenidate is literature-active on CHEMBL238, while the small CPU Ridge baseline underpredicts potency by about 1.33 log units. That is useful MVP evidence because it exposes the full evaluation loop, not because it claims production-grade accuracy. The next pharma evaluation step is a governed multi-target ChEMBL or sponsor snapshot with locked data lineage, stronger models, calibration, drift monitoring, and lifecycle change control.
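Most rows of the methylphenidate table can be re-derived from the four literature IC50 values with the standard library. The sketch assumes IC50 in nM (so pIC50 = 9 - log10(IC50)), a sample standard deviation, and the sign convention "model minus literature"; the critical t value for 3 degrees of freedom is hard-coded. The p-value and observed power are not recomputed here because they need a t-distribution implementation.

```python
import math
import statistics

ic50_nm = [17.0, 19.9, 79.0, 121.7]   # literature IC50 values from the table
predicted_pic50 = 6.0400              # CPU model prediction from the table
T_CRIT_DF3 = 3.182446                 # two-sided 95% Student t quantile, df = 3

pic50 = [9.0 - math.log10(value) for value in ic50_nm]
mean = statistics.mean(pic50)
sd = statistics.stdev(pic50)          # sample SD (n - 1 denominator)
sem = sd / math.sqrt(len(pic50))

ci_low = mean - T_CRIT_DF3 * sem
ci_high = mean + T_CRIT_DF3 * sem
geometric_mean_nm = 10 ** (9.0 - mean)

t_stat = (predicted_pic50 - mean) / sem   # one-sample t statistic
cohen_dz = (predicted_pic50 - mean) / sd  # standardized effect size
```

With n = 4 the interval is wide and the test is sensitive to each literature value, which is one reason the result is framed as a sanity check and not as validation.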
Endpoint-aware psychopharmacology standard panel:
uv run python -B cli.py psychopharm-check
The standard panel now separates IC50 -> pIC50, Ki -> pKi, and keeps
EC50 -> pEC50 as literature-only context instead of pooling all rows into a
single potency mean. The reference file covers 10 compounds: LSD, bkMDMA
(methylone), MDMA, Adderall as a d-amphetamine proxy, methylphenidate, morphine,
tramadol, delta-9-THC, CBD, and CBN. The JSON report includes per-endpoint n,
mean, median, SD, SEM, ChEMBL document IDs, DOI/PubMed IDs, RDKit descriptor
features, endpoint predictions, uncertainty, and applicability-domain status.
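The per-endpoint summary can be pictured as a group-by over (compound, endpoint) instead of one pooled mean. The rows and field names below are hypothetical and do not reflect the actual JSON schema of the report.

```python
import math
import statistics
from collections import defaultdict

# Hypothetical reference rows: (compound, endpoint, transformed value).
rows = [
    ("cpd-A", "pIC50", 7.1),
    ("cpd-A", "pIC50", 7.5),
    ("cpd-A", "pKi", 6.9),
    ("cpd-A", "pKi", 7.3),
    ("cpd-A", "pEC50", 6.0),  # kept as literature-only context
]

grouped = defaultdict(list)
for compound, endpoint, value in rows:
    grouped[(compound, endpoint)].append(value)

summary = {}
for key, values in grouped.items():
    n = len(values)
    sd = statistics.stdev(values) if n > 1 else None  # SD undefined for n = 1
    summary[key] = {
        "n": n,
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": sd,
        "sem": sd / math.sqrt(n) if sd is not None else None,
    }
```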
The checked endpoint ChEMBL training snapshot has 2,103 rows. It is deliberately
small enough for CPU runs, but still keeps endpoint and scaffold/external split
lineage visible through
artifacts/chembl_endpoint_activity_snapshot.manifest.json.
| Endpoint | Target | Train n | Scaffold R2 | Scaffold RMSE | External R2 | External RMSE |
|---|---|---|---|---|---|---|
| pIC50 | CB1 (CHEMBL218) | 157 | 0.3335 | 0.8558 | 0.4064 | 0.4242 |
| pIC50 | 5HT2A (CHEMBL224) | 169 | 0.2938 | 1.2142 | 0.1282 | 1.1508 |
| pIC50 | mu-opioid (CHEMBL233) | 147 | 0.2784 | 1.0827 | -0.1845 | 1.0193 |
| pIC50 | delta-opioid (CHEMBL236) | 144 | 0.3506 | 1.2635 | 0.0536 | 1.2991 |
| pIC50 | DAT (CHEMBL238) | 100 | -0.3838 | 1.0344 | 0.0796 | 0.9879 |
| pIC50 | CB2 (CHEMBL253) | 148 | -0.1090 | 1.2118 | -0.0165 | 0.6824 |
| pKi | CB1 (CHEMBL218) | 117 | 0.3322 | 0.9891 | -0.4526 | 1.0583 |
| pKi | 5HT2A (CHEMBL224) | 120 | 0.2592 | 0.9822 | -0.5980 | 1.0443 |
| pKi | mu-opioid (CHEMBL233) | 157 | -0.0086 | 1.3639 | -0.1398 | 1.2508 |
| pKi | delta-opioid (CHEMBL236) | 130 | -1.6863 | 1.8044 | -0.3634 | 1.7561 |
| pKi | DAT (CHEMBL238) | 109 | 0.0559 | 1.0929 | 0.3718 | 1.0212 |
| pKi | CB2 (CHEMBL253) | 131 | 0.2822 | 0.8283 | -0.7024 | 1.0760 |
Standard-panel prediction check:
| Compound | Target | Lit pIC50 mean | Pred pIC50 | Delta pIC50 | Lit pKi mean | Pred pKi | Delta pKi | Domain pIC50 / pKi |
|---|---|---|---|---|---|---|---|---|
| Adderall / d-amphetamine | DAT | 6.5400 | 7.5960 | 1.0560 | 6.9600 | 6.4620 | -0.4980 | out / out |
| CBD / Cannabidiol | CB1 | 5.4200 | 5.5680 | 0.1480 | 5.6067 | 8.2240 | 2.6173 | out / in |
| CBD / Cannabidiol | CB2 | | 8.2550 | | 5.6567 | 8.5450 | 2.8883 | out / in |
| CBN / Cannabinol | CB1 | | 5.8190 | | 7.1333 | 7.3900 | 0.2567 | out / in |
| CBN / Cannabinol | CB2 | | 7.3560 | | 7.2400 | 7.4960 | 0.2560 | out / in |
| delta-9-THC / Dronabinol | CB1 | 8.5500 | 5.8190 | -2.7310 | 7.8480 | 7.9410 | 0.0930 | out / in |
| delta-9-THC / Dronabinol | CB2 | | 8.7800 | | 7.6800 | 8.6080 | 0.9280 | out / in |
| LSD / Lysergide | 5HT2A | | 7.3450 | | 8.4920 | 6.8160 | -1.6760 | out / out |
| MDMA / Midomafetamine | DAT | 5.7300 | 7.5300 | 1.8000 | 6.0500 | 6.7300 | 0.6800 | in / out |
| Methylphenidate / Methylphenidate | DAT | 7.4350 | 7.2420 | -0.1930 | 7.2150 | 6.5590 | -0.6560 | in / out |
| Morphine / Morphine | mu-opioid | | 7.5160 | | 8.6300 | 7.2600 | -1.3700 | in / in |
| Morphine / Morphine | delta-opioid | | 7.5510 | | 6.6850 | 6.9630 | 0.2780 | in / out |
| Tramadol / Tramadol | mu-opioid | 5.1200 | 6.8910 | 1.7710 | 5.7500 | 7.1660 | 1.4160 | in / in |
| Tramadol / Tramadol | delta-opioid | | 7.6360 | | 8.0300 | 7.6900 | -0.3400 | out / out |
| bkMDMA / methylone | DAT | | 6.8790 | | | 6.6830 | | in / out |
This is portfolio-grade evidence, not a validated QSAR claim. It is useful because it shows endpoint separation, chemical-family coverage, descriptor-level applicability-domain reporting, and honest weak spots before an LLM or reviewer quotes the result.
The graph and README statistics are regenerated from local JSON evidence:
uv run python -B scripts/build_pharma_mvp_readme_assets.py
This repository can be read as a research-only software MVP for DAT-centered retrospective QSAR validation. The present focus is not to recommend compounds for use, synthesis, optimization, or progression. It is to make the data, model assumptions, uncertainty, and assay-context limitations visible enough for a cheminformatics, pharmacology, or ML reviewer to criticize and improve the workflow.
The current DAT-centered path uses CHEMBL238 as a reproducible target case. The
modeling goal is endpoint-aware prediction for pIC50 and pKi, with special
attention to methylphenidate, amphetamine-like reference compounds, cocaine-like
DAT pharmacology, phenethylamine scaffolds, aminorex-family structures,
Betanamin/pemoline, 4-MAR, 4,4-DMAR, and a 4B-MAR candidate structure. These
compounds are handled as research reference structures and validation stress
tests, not as development recommendations.
Recent MVP additions make the workflow more suitable for scientific review:
- CHEMBL238 endpoint snapshots can be fetched without the former small row cap.
- Assay metadata is retained so assay type, species, cell system, tissue, and binding-vs-uptake modality can be separated instead of silently pooled.
- Repeated measurements for the same compound and assay context are aggregated by median or robust mean.
- Endpoint values keep IC50 -> pIC50 and Ki -> pKi separate.
- Inactive activity handling supports the project rule that values at or above 1000 uM are inactive for research triage.
- dIQR-style outlier flags are tracked in the dataset manifest.
- The candidate panel now exposes descriptor features, SMILES token sequences, and RDKit molecular node graphs as explicit input representations.
- CUDA-backed compact Transformer, GNN, and elastic-looped Transformer models can be evaluated after the CPU baseline.
- Optuna can run after the first baseline pass, with 50-trial MVP settings now used for the CHEMBL238 candidate workflow.
- Consensus output reports median, mean, variance, range, and member-model predictions instead of hiding disagreement behind one scalar value.
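Two of the items above, median aggregation per compound and assay context and the 1000 uM inactive rule, can be sketched as follows. The measurement tuples, context labels, and output keys are hypothetical; IC50 is assumed to be in nM, so the 1000 uM threshold is 1,000,000 nM.

```python
import math
import statistics
from collections import defaultdict

INACTIVE_THRESHOLD_NM = 1000.0 * 1000.0  # 1000 uM expressed in nM

# Hypothetical measurements: (compound, assay context, IC50 in nM).
measurements = [
    ("cpd-A", "binding/human", 20.0),
    ("cpd-A", "binding/human", 80.0),
    ("cpd-A", "binding/human", 40.0),
    ("cpd-A", "uptake/rat", 300.0),
    ("cpd-B", "binding/human", 2_000_000.0),
]

by_context = defaultdict(list)
for compound, context, ic50_nm in measurements:
    by_context[(compound, context)].append(ic50_nm)

aggregated = {}
for key, values in by_context.items():
    median_nm = statistics.median(values)  # one value per compound and context
    aggregated[key] = {
        "n_measurements": len(values),
        "pic50": 9.0 - math.log10(median_nm),
        "inactive": median_nm >= INACTIVE_THRESHOLD_NM,
    }
```

Keying on the assay context keeps binding and uptake rows apart, so repeated measurements are collapsed without pooling different modalities.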
The most recent CHEMBL238 full-endpoint run produced 4,374 aggregated rows from
4,769 measurements. The CPU Ridge baseline remains intentionally modest:
external RMSE was 0.9289 for pIC50 and 1.0937 for pKi. That is useful as a
baseline, but it is not strong enough to call the CPU-only model a validated
QSAR system.
The 4B-MAR CUDA deep50 run is a stronger engineering stress test. With 50
Optuna trials and 50 epochs per compact model, the 4B-MAR consensus median was
6.1702 for pIC50 and 5.7715 for pKi. The pIC50 members were comparatively
stable (SD = 0.2760), while pKi had high model disagreement (SD = 1.2856):
ELT predicted 4.2293 and GNN predicted 7.3700. The GNN pKi refit reached RMSE
0.7943, close to the provisional scaffold/external target range, but the member
disagreement means this result should be treated as a hypothesis for model
improvement. The correct reporting action for the current pKi result is to
disclose disagreement and withhold decisive numeric interpretation.
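A consensus report that exposes disagreement might look like the sketch below. `consensus` and the `max_range` threshold are hypothetical. Only the two pKi members quoted above are used, so the output is not expected to reproduce the reported consensus median or SD, which come from the full member set.

```python
import statistics


def consensus(member_predictions: dict, max_range: float = 1.0) -> dict:
    """Summarize member predictions and flag wide disagreement."""
    values = list(member_predictions.values())
    spread = max(values) - min(values)
    return {
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "variance": statistics.variance(values) if len(values) > 1 else 0.0,
        "range": spread,
        "members": dict(member_predictions),
        # max_range is a hypothetical disagreement threshold in log units.
        "withhold_interpretation": spread > max_range,
    }


report = consensus({"ELT": 4.2293, "GNN": 7.3700})
```

A gap of more than three log units between two members is the kind of case where the flag, not the median, is the informative output.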
The main scientific weaknesses are now explicit:
- The CPU baseline is too weak for confident scaffold extrapolation.
- Binding and uptake assays should be trained and reported as separate model contexts, not only separated in the dataset summary.
- Species, tissue, and cell-system effects need stronger filtering and per-context performance tables.
- Local aminorex and phenethylamine chemical neighborhoods remain sparse.
- Candidate scoring should train each endpoint/context model once and score candidate batches, rather than refitting deep models per candidate.
- Uncertainty calibration is still residual-RMSE based and should be replaced or supplemented by conformal, ensemble, or repeated-split calibration.
- Mechanistic interpretation should remain separate from QSAR prediction until supported by curated assay and structure evidence.
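As one possible shape for the conformal option mentioned above, split conformal prediction turns held-out absolute residuals into an interval half-width with a finite-sample rank correction. This is a generic sketch, not code from the repository; `conformal_half_width` and the residuals are illustrative.

```python
import math


def conformal_half_width(calibration_residuals, alpha: float = 0.1) -> float:
    """Split-conformal interval half-width from held-out residuals."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


# Illustrative residuals (observed minus predicted pIC50), not repository data.
residuals = [-1.2, 0.3, 0.8, -0.4, 0.1, 1.5, -0.6, 0.9, -0.2]
half_width = conformal_half_width(residuals, alpha=0.2)
interval = (6.04 - half_width, 6.04 + half_width)
```

The coverage guarantee only holds when calibration and query compounds are exchangeable, which scaffold and external splits deliberately strain, so per-split calibration checks would still be needed.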
Use the following prompt when asking a stronger reviewer model for critique:
You are reviewing a research-only QSAR validation software MVP for retrospective
DAT bioactivity analysis and assay-context-aware reporting. The project
uses ChEMBL CHEMBL238 data, endpoint-aware pIC50/pKi modeling, assay-context
metadata, RDKit descriptors, SMILES token sequences, molecular node graphs,
CPU Ridge, compact Transformer, GNN, and elastic-looped Transformer models.
Please critique the README and technical direction as a medicinal chemistry,
psychopharmacology, QSAR validation, and software-engineering reviewer.
Important safety boundary: do not provide synthesis routes, dosing advice,
human-use recommendations, or guidance for creating or optimizing controlled
psychoactive substances. Keep the review focused on data governance, assay
stratification, model validation, uncertainty calibration, software design,
MLOps, LLM-assisted reporting, and bounded research-only reporting.
Current evidence:
- Full CHEMBL238 endpoint snapshot: 4,374 aggregated rows from 4,769
measurements.
- CPU Ridge external RMSE: pIC50 0.9289, pKi 1.0937.
- 4B-MAR CUDA deep50 consensus median: pIC50 6.1702, pKi 5.7715.
- 4B-MAR pKi member disagreement is high: ELT 4.2293 vs GNN 7.3700.
- Best 4B-MAR pKi GNN refit RMSE: 0.7943.
Please identify:
1. The top scientific validity risks.
2. The top software-engineering risks.
3. What should be implemented before calling this a credible QSAR MVP.
4. What should be removed or reworded to avoid overclaiming.
5. How to structure the next milestone for binding-vs-uptake, species, and
local chemical-series validation.
6. What figures or tables should be added for a BioRender-style scientific
summary figure.
.
|-- cli.py # Command-line entry point
|-- dat_predictor.py # Legacy DAT predictor and GUI logic
|-- pyproject.toml # UV-managed project dependencies
|-- uv.lock # Reproducible dependency lockfile
|-- src/
| |-- admet/ # ADMET and developability profiling
| |-- active_learning/ # Compound selection helpers
| |-- data/ # ChEMBL loading and dataset splitting
| |-- features/ # Molecular, graph, and 3D featurizers
| |-- integrations/ # AlphaFold3 and docking job contracts
| |-- models/ # Transformer, GNN, ensemble, geometry GNNs
| |-- multimodal/ # Molecule image feature extraction
| |-- pipeline/ # Integrated compound assessment workflows
| |-- reactions/ # Retro/forward reaction planning baseline
| `-- synthesis/ # Synthetic accessibility scoring
|-- tests/ # Unit and integration tests
|-- docs/ # Design and environment notes
`-- scripts/ # Environment smoke checks and utilities
UV is the preferred environment manager for this repository.
uv sync
The default environment installs the core scientific stack:
- RDKit
- NumPy, pandas, SciPy, scikit-learn
- PyTorch
- pytest and development tools
- pIC50, ADMET, 3D conformer, synthesis, and image-feature dependencies
Optional extras are available for heavier workflows:
# Prefect orchestration
uv sync --extra workflow
# PyTorch Geometric model adapters
uv sync --extra gnn
# Protein structure and docking file helpers
uv sync --extra structure
# GUI dependencies
uv sync --extra gui
# API and production runtime dependencies
uv sync --extra prod
Airflow is intentionally managed separately because it should be installed with the official Airflow constraints file:
uv pip install "apache-airflow==3.2.1" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-3.2.1/constraints-3.12.txt"
See docs/uv_environment.md for more details.
The repository includes a small CPU-only demo model for portfolio and stakeholder walkthroughs. It uses a fixed descriptor benchmark, a scikit-learn Ridge model, and checked-in JSON artifacts, so it does not require CUDA or a GPU.
Build or refresh the demo artifacts:
uv run python -B scripts/build_demo_cpu_model.py
Run a CPU-only prediction:
uv run python -B cli.py predict \
--model models/demo_cpu_pic50_model.json \
--target CHEMBL238 \
--smiles "CC(=O)OC1=CC=CC=C1C(=O)O" \
  --uncertainty
Start the FastAPI demo service:
uv run uvicorn src.api.app:app --host 127.0.0.1 --port 8000
Available endpoints:
- GET /health
- POST /predict
- POST /assess
When PIC50_MODEL_PATH=models/chembl_endpoint_cpu_model.json, /predict
accepts {"endpoints": ["pIC50", "pKi"]} and returns endpoint-keyed
predictions with uncertainty and applicability-domain payloads.
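A client call could be assembled with the standard library as sketched below. The "endpoints" key follows the description above; the "smiles" field name, the `build_predict_request` helper, and the response shape are assumptions, so check the FastAPI schema (for example the generated /docs page) before relying on them.

```python
import json
import urllib.request


def build_predict_request(smiles: str, base_url: str = "http://127.0.0.1:8000"):
    """Build a POST /predict request without sending it."""
    # "endpoints" follows the README; the "smiles" field name is an assumption.
    payload = {"smiles": smiles, "endpoints": ["pIC50", "pKi"]}
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("COC(=O)C(c1ccccc1)C1CCCCN1")
# With the service running, send it with:
#   urllib.request.urlopen(request, timeout=10)
```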
The CPU demo is intentionally scoped as research triage and portfolio evidence.
It is not a clinical, regulatory, manufacturing, or patient-care decision
system. Replace data/demo_pic50_benchmark.csv with a governed ChEMBL or sponsor
snapshot before using the benchmark for scientific claims.
Freeze a ChEMBL pIC50 evaluation snapshot for a more serious research review:
uv run python -B cli.py build-chembl-snapshot \
--targets CHEMBL238,CHEMBL224 \
--output data/chembl_pic50_snapshot.csv \
  --manifest artifacts/chembl_pic50_snapshot.manifest.json
The snapshot command writes a CSV plus a JSON manifest containing filters, split policy, per-target counts, and a SHA-256 checksum. Large generated ChEMBL snapshots are local evaluation artifacts by default; the small endpoint MVP snapshot is explicitly whitelisted for review reproducibility.
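A reviewer can verify that a snapshot CSV still matches its manifest by recomputing the digest. In this sketch the manifest key "sha256" and both helper names are hypothetical; inspect the actual manifest for the real field name.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large snapshots fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_matches_manifest(csv_path: Path, manifest_path: Path) -> bool:
    """Compare the recomputed checksum with the one stored in the manifest."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    # "sha256" is a hypothetical key name.
    return manifest.get("sha256") == sha256_of_file(csv_path)
```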
Build the endpoint-aware pIC50/pKi snapshot and CPU model used for the standard psychopharmacology panel:
uv run python -B cli.py build-chembl-endpoint-snapshot \
--targets CHEMBL238,CHEMBL224,CHEMBL218,CHEMBL253,CHEMBL233,CHEMBL236 \
--endpoints pIC50,pKi \
--output data/chembl_endpoint_activity_snapshot.csv \
--manifest artifacts/chembl_endpoint_activity_snapshot.manifest.json \
--max-rows-per-target-endpoint 200
uv run python -B cli.py build-endpoint-cpu-model \
--dataset data/chembl_endpoint_activity_snapshot.csv \
--output models/chembl_endpoint_cpu_model.json \
  --report artifacts/chembl_endpoint_cpu_benchmark.json
Predict both endpoints for an input SMILES:
uv run python -B cli.py predict \
--model models/chembl_endpoint_cpu_model.json \
--target CHEMBL238 \
--endpoints pIC50,pKi \
--smiles "COC(=O)C(c1ccccc1)C1CCCCN1" \
  --uncertainty
Run a no-model compound assessment. This produces ADMET and synthesis outputs, and can optionally include 3D, reaction, and image features.
uv run python cli.py assess --smiles "CC(=O)OC1=CC=CC=C1C(=O)O"
Write the assessment to JSON:
uv run python cli.py assess \
--smiles "CC(=O)OC1=CC=CC=C1C(=O)O" \
  --output artifacts/assessment.json
Assess a file of SMILES strings:
uv run python cli.py assess --input compounds.smi --output artifacts/assessment.csv
Include rendered image features:
uv run python cli.py assess \
--input compounds.smi \
--include-image \
  --output artifacts/multimodal_assessment.json
Use a trained pIC50 model when available:
uv run python cli.py assess \
--model models/dat_transformer_model.pt \
--target CHEMBL238 \
  --smiles "CCN(CC)CC"
The existing pIC50 workflow supports ChEMBL-based target data retrieval, feature calculation, Transformer training, and prediction with optional uncertainty reporting.
uv run python cli.py train --target CHEMBL238 --optimize
uv run python cli.py predict --model models/dat_transformer_model.pt --smiles "CCN(CC)CC"
Supported target examples:
| Target | ChEMBL ID | Description |
|---|---|---|
| DAT | CHEMBL238 | Dopamine transporter |
| 5-HT2A | CHEMBL224 | Serotonin 2A receptor |
| CB1 | CHEMBL218 | Cannabinoid receptor 1 |
| CB2 | CHEMBL253 / CHEMBL1861 | Cannabinoid receptor 2 |
| mu opioid | CHEMBL233 | Mu opioid receptor |
| delta opioid | CHEMBL236 | Delta opioid receptor |
| kappa opioid | CHEMBL237 | Kappa opioid receptor |
src/features/structure3d.py generates RDKit ETKDGv3 conformers, optimizes
them with MMFF or UFF, and returns 3D descriptors such as radius of gyration,
asphericity, eccentricity, principal moments of inertia, and spherocity index.
These descriptors are available through the integrated assess command.
src/models/geometry_gnn.py defines a factory adapter for SchNet and DimeNet++.
Install the GNN extra before using these models:
uv sync --extra gnn
Some PyTorch Geometric operations may require compiled extensions such as
torch-scatter or torch-sparse. On Windows, install those from the PyG wheel
index that matches your local Torch and CUDA build.
src/admet/predictor.py provides a lightweight rule-based ADMET profile using
RDKit descriptors:
- molecular weight
- LogP
- TPSA
- HBD/HBA
- rotatable bonds
- formal charge
- QED
- permeability and solubility proxies
- developability proxy score
This is a triage layer. Replace or ensemble it with calibrated ADMET models for production-grade prediction.
src/synthesis/scores.py provides:
- SA score proxy
- SCScore-style proxy
- synthetic feasibility score
- complexity drivers such as stereocenters, ring count, graph complexity, spiro atoms, bridgehead atoms, and flexibility
Use these outputs as non-decisional descriptors for retrospective review. Do not use them to rank compounds for synthesis planning, progression, dosing, or human-use decisions.
src/reactions/planner.py provides a conservative baseline interface for:
- retrosynthetic template disconnections
- forward reaction templates
- route serialization through ReactionRoute and ReactionStep
The current templates are intentionally simple. The interface is ready for AiZynthFinder, ASKCOS, IBM RXN, or an in-house reaction transformer.
src/multimodal/image_featurizer.py renders molecule images with RDKit and
converts them into compact image-derived features. These can be combined with
graph, descriptor, or 3D features for image + structure experiments.
src/pipeline/workflows.py includes:
- batch assessment runner
- JSON/CSV result writing
- optional Prefect flow factory
- optional Airflow DAG factory
Run Prefect workflows after installing:
uv sync --extra workflow
src/integrations/structure_pipeline.py defines project-level contracts for:
- protein target metadata
- local AlphaFold3-style protein-ligand JSON payloads
- docking job specifications
- command-line docking runners
The AlphaFold3 contract exports payloads containing sequences, ligand SMILES,
modelSeeds, dialect, and version. AlphaFold Server has non-commercial and
ligand restrictions, so this project keeps the integration focused on local or
managed AlphaFold3 deployments.
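For orientation, a payload with the fields listed above could be assembled as below. The layout follows the publicly documented input JSON of the local AlphaFold3 release as understood here; the repository's own contract may differ. `build_af3_payload`, the chain IDs, the job name, and the short sequence are placeholders, and the sequence is not a real target.

```python
import json


def build_af3_payload(name, protein_sequence, ligand_smiles, seeds=(1,)):
    """Assemble a protein-ligand job description as a plain dictionary."""
    return {
        "name": name,
        "modelSeeds": list(seeds),
        "sequences": [
            {"protein": {"id": "A", "sequence": protein_sequence}},
            {"ligand": {"id": "L", "smiles": ligand_smiles}},
        ],
        "dialect": "alphafold3",
        "version": 1,
    }


# Placeholder sequence and ligand for illustration only.
payload = build_af3_payload("example_job", "MKTAYIAKQR", "CCN(CC)CC")
payload_json = json.dumps(payload, indent=2)
```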
Docking support is intentionally backend-neutral. The current runner can build command lines for tools such as Vina, Gnina, or site-specific docking wrappers.
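A backend-neutral job description can be turned into an argument list for one specific tool. The sketch below targets the AutoDock Vina command line; the `DockingJob` dataclass, `vina_command`, and all file names are hypothetical and are not the repository's runner.

```python
from dataclasses import dataclass


@dataclass
class DockingJob:
    receptor: str          # prepared receptor file, e.g. PDBQT
    ligand: str            # prepared ligand file
    center: tuple          # search-box center (x, y, z) in angstroms
    size: tuple            # search-box size (x, y, z) in angstroms
    output: str            # path for docked poses
    exhaustiveness: int = 8


def vina_command(job: DockingJob, executable: str = "vina") -> list:
    """Build an argument list suitable for subprocess.run."""
    cx, cy, cz = job.center
    sx, sy, sz = job.size
    return [
        executable,
        "--receptor", job.receptor,
        "--ligand", job.ligand,
        "--center_x", str(cx), "--center_y", str(cy), "--center_z", str(cz),
        "--size_x", str(sx), "--size_y", str(sy), "--size_z", str(sz),
        "--exhaustiveness", str(job.exhaustiveness),
        "--out", job.output,
    ]


job = DockingJob("receptor.pdbqt", "ligand.pdbqt", (0.0, 0.0, 0.0), (20, 20, 20), "poses.pdbqt")
command = vina_command(job)
```

Returning a list instead of a shell string avoids quoting problems and makes it straightforward to add a second builder for another backend.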
Run the environment smoke check:
uv run python -B scripts/smoke_uv_env.py
Run the discovery extension tests:
uv run python -B -m pytest tests/test_discovery_extensions.py tests/test_structure_integration_contracts.py -q
Run Ruff checks:
uv run ruff check . --preview
uv run ruff format . --check
The current codebase still contains legacy Ruff issues outside the newly added discovery modules. Treat a full-project Ruff cleanup as a separate refactoring task.
The current repo is suitable as a research-only QSAR validation prototype, not as a validated QSAR product. It has the minimum pieces a serious reviewer expects to see:
- Clear context of use: retrospective research analysis only; not clinical, regulatory, manufacturing, patient-care, therapeutic, synthesis, dosing, or compound-optimization decision-making.
- Fixed data path: demo fixture plus a CHEMBL238 ChEMBL snapshot with manifest, split policy, row counts, and checksum.
- Risk-based performance report: target-level R2, RMSE, and MAE for train, scaffold-test, and external splits.
- Statistical sanity checks: methylphenidate literature IC50 comparison with error bars, p-value, effect size, and observed power, plus a pIC50/pKi standard-panel report for 10 psychopharmacology reference compounds.
- Applicability domain and uncertainty: every CPU prediction returns domain status and uncertainty for review-time triage.
- API contract: /health, /predict, and /assess expose model status, prediction, and compound assessment through FastAPI.
- MLOps shape: JSON model artifact, benchmark JSON, reproducible CLI commands, Docker CPU service, tests, and explicit residual risks.
- LLMOps shape: structured model outputs and bounded research-use language make the service safer to wrap with an LLM assistant or report generator, provided the assistant withholds unsupported potency interpretation.
What this MVP proves:
- The evidence loop is implemented end to end on CPU.
- The evaluation is honest about scaffold/external generalization.
- A known active compound can be compared to literature with statistical outputs rather than a single anecdotal prediction.
- Model limitations are visible: the current CPU baseline underpredicts methylphenidate, so the demo invites model improvement instead of hiding it.
What remains before credible QSAR technical diligence:
- Governed ChEMBL snapshot with frozen version, license review, and data-quality gates.
- Stronger baselines and calibrated uncertainty across multiple target families.
- External validation by target, assay context, species, and chemical series, including binding-vs-uptake stratification.
- Model registry, approval workflow, drift monitoring, rollback, audit logging, and lifecycle management.
- Mechanistic interpretation where feasible, aligned with QSAR validation expectations.
See docs/pharma_mvp_cpu_demo.md and docs/research_only_qsar_validation_plan.md for model-card-style details, release gates, and CPU/GPU validation commands.
The production compose stack includes:
- application service
- PostgreSQL
- Redis
- Ollama for TxGemma
- Nginx
- Prometheus
- Grafana
Typical production deployment:
cp .env.example .env
cp config/config.yaml.example config/config.yaml
./scripts/deploy.sh production deploy
Service defaults:
- application: http://localhost:8000
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- Use scaffold splits when evaluating medicinal chemistry generalization.
- Keep pIC50, ADMET, synthesis, docking, and AlphaFold-derived evidence separate until calibration data and expert review justify any combined interpretation.
- Track uncertainty and applicability domain for every prediction.
- Validate reaction routes with a chemist and a dedicated retrosynthesis engine before synthesis decisions.
- Treat docking and AlphaFold3 outputs as structural hypotheses, not binding truth.
This project is licensed under the MIT License.
