MolMiner is a multi-property-conditioned, geometry-aware transformer that builds molecules fragment by fragment while taking their 3D structure into account.
It supports:
- Order-agnostic roll-outs with symmetry-aware attachment handling
- Up to 12 simultaneous property conditions (logP, QED, …)
- End-to-end scripts for vocab extraction, preprocessing, GMM, starter model and full MolMiner training

git clone https://github.com/raulorteg/molminer.git
cd molminer
pip install -r requirements.txt

Pre-trained model checkpoints for MolMiner are available via Zenodo.
- Zenodo Record: https://zenodo.org/records/15496963
- Direct Download (Files Archive): https://zenodo.org/api/records/15496963/files-archive
These checkpoints contain trained weights for the MolMiner model, the GMM (Gaussian Mixture Model), and the Fragment-Starter model.
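For scripted setups, the archive can also be fetched from Python. The following is a minimal sketch using only the standard library; it assumes the files-archive endpoint returns a zip file, and the function name `fetch_checkpoints`, the local file name and the `checkpoints` output directory are illustrative choices, not part of MolMiner.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

# Direct-download URL from the Zenodo record listed above.
ARCHIVE_URL = "https://zenodo.org/api/records/15496963/files-archive"


def fetch_checkpoints(url: str = ARCHIVE_URL, out_dir: str = "checkpoints") -> list:
    """Download the archive at `url` and unpack it into `out_dir`.

    Assumes the archive is a zip file. Returns the names of the extracted members.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive_path = target / "files-archive.zip"  # hypothetical local file name

    # Stream the response to disk rather than holding it all in memory.
    with urllib.request.urlopen(url) as response, open(archive_path, "wb") as fh:
        shutil.copyfileobj(response, fh)

    # Unpack next to the downloaded archive and report what was extracted.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()
```

Calling `fetch_checkpoints()` with no arguments downloads the full archive, which may be large; inspect the returned names to see which checkpoint files were unpacked before pointing the `--ckpt_*` options at them.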
Below is the minimal happy path from a raw CSV to a trained MolMiner model. Every script accepts --help for full CLI details.

python extract_vocabulary.py \
--dataset ../data/test/example.csv
# -> vocab_anchors.csv, vocab_attachments.csv, vocab_fragments.csv, stats.json

python dataset_split.py \
--dataset ../data/test/example.csv
# -> train.csv, valid.csv, test.csv

python preprocess_starter.py \
--data_dir ../data/test
# -> train_starter.pkl, valid_starter.pkl, test_starter.pkl

python preprocess_molminer.py \
--data_dir ../data/test \
--total_epochs 10 \
--max_workers 2
# -> steps/test, steps/valid, steps/{epoch}, ...

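The four preprocessing commands above can be chained from a small driver so that a failure in one step stops the rest. This is a sketch, not a script shipped with the repository: the function names `preprocessing_commands` and `run_all` are hypothetical, and it assumes the driver is started from the directory that contains the MolMiner scripts. Only flags shown in the commands above are used.

```python
import subprocess
import sys


def preprocessing_commands(dataset, data_dir, total_epochs=10, max_workers=2):
    """Return the four preprocessing commands in the order listed above."""
    py = sys.executable  # reuse the interpreter that runs this driver
    return [
        [py, "extract_vocabulary.py", "--dataset", dataset],
        [py, "dataset_split.py", "--dataset", dataset],
        [py, "preprocess_starter.py", "--data_dir", data_dir],
        [py, "preprocess_molminer.py", "--data_dir", data_dir,
         "--total_epochs", str(total_epochs),
         "--max_workers", str(max_workers)],
    ]


def run_all(commands, cwd=None):
    """Run each command in turn; check=True raises on the first failure."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True, cwd=cwd)
```

A typical call would be `run_all(preprocessing_commands("../data/test/example.csv", "../data/test"))`. Passing argument lists instead of a shell string avoids quoting problems with paths that contain spaces.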
| stage | command |
|---|---|
| GMM | python train_gmm.py --data_dir ../data/test --model_out ../checkpoints/test_gmm_model.pkl |
| Fragment-Starter | python train_starter.py --data_dir ../data/test --ckpt_dir ../checkpoints |
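Because every training stage reads from the data directory, a quick check that the preprocessing outputs exist can save a failed run. The sketch below uses the file names from the `# ->` comments of the preprocessing steps; it assumes all of them are written into the same data directory, which is not confirmed here, and `missing_outputs` is a hypothetical helper rather than part of MolMiner.

```python
from pathlib import Path

# File names taken from the "# ->" comments of the preprocessing steps.
# Assumption: all of them end up in the same data directory.
EXPECTED_FILES = [
    "vocab_anchors.csv", "vocab_attachments.csv", "vocab_fragments.csv",
    "stats.json",
    "train.csv", "valid.csv", "test.csv",
    "train_starter.pkl", "valid_starter.pkl", "test_starter.pkl",
]


def missing_outputs(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `data_dir`."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]
```

For example, `missing_outputs("../data/test")` returns an empty list once all preprocessing steps have completed; any names it does return indicate which step to rerun.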

# Remove --fixedrollout to use adaptive roll-outs.
python train_molminer.py \
--data_dir ../data/test \
--ckpt_dir ../checkpoints \
--fixedrollout \
--total_epochs 10
# -> ckpt_dir/best_molminer.pth, ckpt_dir/last_molminer.pth

python postprocess_calibration.py \
--samples=10 \
--ckpt_molminer='../checkpoints/best_molminer.pth' \
--ckpt_starter='../checkpoints/best_starter.pth' \
--ckpt_gmm='../checkpoints/gmm_model.pkl' \
--stats_path='../data/zinc/stats.json' \
--vocab_fragments='../data/zinc/vocab_fragments.csv' \
--vocab_attachments='../data/zinc/vocab_attachments.csv' \
--vocab_anchors='../data/zinc/vocab_anchors.csv' \
--device=cpu \
--weighted
# -> data/calibration/{prop}_calibration.txt

python generate_random.py \
--samples=10 \
--ckpt_molminer='../checkpoints/best_molminer.pth' \
--ckpt_starter='../checkpoints/best_starter.pth' \
--ckpt_gmm='../checkpoints/gmm_model.pkl' \
--stats_path='../data/zinc/stats.json' \
--vocab_fragments='../data/zinc/vocab_fragments.csv' \
--vocab_attachments='../data/zinc/vocab_attachments.csv' \
--vocab_anchors='../data/zinc/vocab_anchors.csv' \
--device=cpu \
--weighted
# -> data/generated.txt

python postprocess_generated_statistics.py

python postprocess_calibration_plot.py \
--calibration_dir='../data/calibration' \
--stats_path='../data/zinc/stats.json' \
--figure_savepath='../figures/calibration.png'
# -> figures/calibration.png

If you use MolMiner in academic work, please cite:

@misc{ortegaochoa2025molminercontrollable3dawarefragmentbased,
title={MolMiner: Towards Controllable, 3D-Aware, Fragment-Based Molecular Design},
author={Raul Ortega-Ochoa and Tejs Vegge and Jes Frellsen},
year={2025},
eprint={2411.06608},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.06608},
}

MolMiner is released under the Apache 2.0 License – see LICENSE for details. Contributions are welcome via pull requests or issues!
