This repository accompanies the paper *Chemical Language Models for Natural Products: A State-Space Model Approach* (arXiv:2602.13958 [cs.LG]). It includes the major components of all experiments carried out in the study. This README describes the repository in detail and explains how to run each experiment outlined below.
🤗 Pre-trained models: All pre-trained models from the paper are available on Hugging Face: Link to the Models
💬 Key Terms
NPs: Natural Products
CLMs: Chemical Language Models
SMILES: Simplified Molecular Input Line Entry System
BPE: Byte-Pair Encoding
NPBPE: Natural Product Byte-Pair Encoding
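To make the tokenizer granularity concrete, here is a toy illustration (not the repository's `tokenisers.py`) of how the same SMILES string splits at character level versus after a few BPE-style merges; the merge list is invented for the example:

```python
def char_tokenize(smiles):
    """Character-level tokenization: one token per character."""
    return list(smiles)

def bpe_tokenize(smiles, merges):
    """Greedy BPE-style tokenization: repeatedly apply the first
    learned merge whose pair appears adjacently in the token list."""
    tokens = list(smiles)
    merged = True
    while merged:
        merged = False
        for pair in merges:
            for i in range(len(tokens) - 1):
                if tokens[i] + tokens[i + 1] == pair:
                    tokens[i:i + 2] = [pair]
                    merged = True
                    break
            if merged:
                break
    return tokens

smiles = "CC(=O)O"               # acetic acid
merges = ["(=", "(=O", "(=O)"]   # invented merges for illustration
chars = char_tokenize(smiles)        # one token per character
bpe = bpe_tokenize(smiles, merges)   # multi-character tokens emerge
```

Larger NPBPE vocabularies (e.g. NPBPE30k vs NPBPE60) correspond to more learned merges and therefore longer, more NP-specific tokens.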
- Data includes the pre-training data of over 1 million NPs and 4 downstream property prediction datasets
- Tokenizer implementation and vocab.json files for Character-level, Atom-in-SMILES (AIS), BPE (from DeepChem), and five NPBPE tokenizers of different vocabulary sizes
- Pre-training code for 48 CLMs (3 model types (GPT, Mamba, Mamba-2) × 8 tokenizers × 2 data splits (scaffold and random splits)) on the curated 1M NP dataset
- Hyperparameter search code to optimize the 48 model-tokenizer pair configurations
- Fine-tuning code for NP-relevant property prediction tasks (excluding Tox21):
- Peptide membrane permeability
- Taste classification
- Anti-cancer activity prediction
- Fine-tuning scripts for benchmark models: MolFormer and ChemBERTa-2 (MLM and MTR versions)
- Molecule generation script using autoregressive sampling
- Generated pseudo-NP molecules using each of the 48 models
- A dockerfile is provided to set up the environment to run all experiments
- Experiment launcher script: a main shell script (run_experiments.sh) is provided to run all major experiments
🔑 WandB API Key
A Weights & Biases (wandb) API key is required for some tasks, such as pretraining. It must be passed to the job script as a command-line argument via the HTCondor submit file.
To do this, set the `arguments` field in your submit file like this: `arguments = YOUR_WANDB_KEY`
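For reference, the key is wired through roughly like this (a hedged sketch; the exact contents of the submit file and script may differ):

```shell
# In run_experiments.sub (HTCondor submit file):
#   executable = run_experiments.sh
#   arguments  = YOUR_WANDB_KEY
#
# In run_experiments.sh, the key then arrives as the first positional argument:
python3 /CLMs-for-NPs/main.py --task pretrain --wandb_key "$1"
```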
📁 Example Usage
The `run_experiments.sh` script provides examples for running all major tasks (molecule generation, hyperparameter search, pretraining, and fine-tuning). Uncomment the relevant blocks to execute them.
All tasks are orchestrated via main.py and can be launched with minimal configuration using run_experiments.sh.
data/ # Contains pre-training 1M NPs and downstream task data files
├── 1M_NPs/ # Random and Scaffold Split 1M NPs pre-training data
└── downstream_task_data/ # Random and Scaffold Split 5x5 CV Downstream Task Datasets
generated_pseudo_NPs/ # Generated NP strings from each of the 48 models
images/ # Images of workflow and downstream application overview
lsv_cluster_files/ # Contains cluster-related setup
├── mamba.dockerfile # Dockerfile for Mamba training environment
├── run_experiments.sh # Shell script to run experiments using main.py
└── run_experiments.sub # Cluster job submission script
vocab_files/ # Contains vocab.json files for all custom tokenizers
ChemBERTa2_MLM_Finetune_on_1M_NPs.py # Fine-tune ChemBERTa models on 1M NPs
ChemBERTa2_finetuned_model.pth # Model checkpoint of the fine-tuned ChemBERTa-2 (MLM) on 1M NPs
ChemBERTa2_finetuning.py # Fine-tune ChemBERTa models on property prediction tasks
MoLFormer_finetuned_model.pth # Model checkpoint of the fine-tuned MolFormer on 1M NPs
MolFormer_Finetuning_on_1M_NPs.py # Fine-tune MolFormer on 1M NPs
MolFormer_Finetuning.py # Fine-tune MolFormer on property prediction tasks
Readme.md # Project overview and usage instructions
finetuning.py # Fine-tuning the 48 NP-pretrained models on property prediction tasks
hpsearch.py # Pre-training hyperparameter search for the 48 model-tokenizer combinations
main.py # Entry point
mol_generation.py # Pseudo-NP molecule generation
pretraining.py # Model pre-training for the 48 model-tokenizer combinations
requirements.txt # Python dependencies required to run the project
sam.py # SAM implementation from UU-Mamba (arXiv:2402.03394)
tokenisers.py # Custom tokenizers implementation
1. Pull the pre-built image from GitHub Container Registry:

   ```shell
   docker pull ghcr.io/rozariwang/npclms:latest
   ```

2. (Optional) Retag it for your own use:

   ```shell
   docker tag ghcr.io/rozariwang/npclms [your_preferred_name]:[your_tag]
   ```
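A typical way to start a container from the image (a suggestion, not from the repository docs; adjust the GPU flag and mount path to your setup):

```shell
# Run interactively with GPU access and the current directory mounted
# at the path the example commands below assume (/CLMs-for-NPs)
docker run --gpus all -it --rm \
    -v "$(pwd)":/CLMs-for-NPs \
    ghcr.io/rozariwang/npclms:latest /bin/bash
```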
All tasks can be executed via main.py through the run_experiments.sh bash script, which serves as an example of how to run every task listed above; task-specific configuration details are described below. Uncomment the block in run_experiments.sh corresponding to the task you want to run.
The molecule generation logic loads a pretrained GPT or Mamba model and its corresponding tokenizer to generate pseudo-NP SMILES strings via autoregressive sampling. It infers configuration details from the model name, generates the sequences, computes a token-sum log-likelihood for each sequence, and writes the results to a CSV file.
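As an illustration of the sampling step (a minimal sketch over a toy vocabulary, not the repository's actual implementation), temperature-scaled autoregressive sampling with an accumulated token-sum log-likelihood might look like:

```python
import math
import random

def sample_sequence(step_logits, temperature=1.0, max_length=512, eos_id=0):
    """Toy autoregressive sampler: step_logits(prefix) returns one logit
    per vocabulary id. Returns the sampled token ids and the sum of
    log-probabilities of the chosen tokens (the per-molecule score)."""
    tokens, log_likelihood = [], 0.0
    for _ in range(max_length):
        # Temperature scaling: <1 sharpens, >1 flattens the distribution
        logits = [l / temperature for l in step_logits(tokens)]
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        tok = random.choices(range(len(probs)), weights=probs, k=1)[0]
        log_likelihood += math.log(probs[tok])
        tokens.append(tok)
        if tok == eos_id:  # stop at end-of-sequence token
            break
    return tokens, log_likelihood

# Hypothetical toy model: prefers token 1, then the EOS token (id 0)
def toy_logits(prefix):
    return [5.0, 1.0, 1.0] if len(prefix) >= 3 else [0.0, 5.0, 1.0]

random.seed(0)
toks, ll = sample_sequence(toy_logits, temperature=1.0)
```

In the repository, `step_logits` would be replaced by a forward pass of the pretrained GPT/Mamba model over the tokenized prefix.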
Configuration Options:
- task: "generate"
- num_mols: default=32 (set the number of molecules you want to generate)
- temperature: default=1.0 (control sampling randomness (lower = more deterministic, higher = more random), 1.0 means no adjustment to the model’s predicted probabilities)
- max_length: default=512 (set the max length of the generated molecules)
- model_names: model names follow the format `rozariwang/[MODEL]-[TOKENIZER]-[SPLIT]`, where:
  - [MODEL]: `GPT`, `M1`, or `M2`
  - [TOKENIZER]: `Char`, `AIS`, `BPE`, `npbpe60`, `npbpe100`, `npbpe1000`, `npbpe7924`, or `npbpe30k`
  - [SPLIT]: `rds` (random split) or `sfs` (scaffold split)
python3 /CLMs-for-NPs/main.py \
--task generate \
--num_mols 1000 \
--temperature 1 \
--max_length 512 \
--model_names rozariwang/M2-NPBPE1000-rds

Hyperparameter (random) search covers half of the entire search space for GPT and Mamba-based models, determining the best hyperparameter set for pre-training each model-tokenizer combination. Each configuration is trained for 5 epochs, and the best hyperparameters are selected based on the lowest validation loss in the final epoch.
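The selection logic can be sketched as follows; the search space and the `train_one_config` stub are invented placeholders, not the repository's actual grid:

```python
import itertools
import random

# Hypothetical search space; the actual grid in hpsearch.py may differ.
space = {
    "n_embd": [128, 256, 512],
    "n_layer": [4, 8, 12],
    "lr": [1e-3, 1e-4, 1e-5],
}

def train_one_config(cfg, epochs=5):
    """Placeholder: would train for `epochs` epochs and return the
    validation loss of the last epoch. Here it is a deterministic stub."""
    return cfg["lr"] * 100 + cfg["n_layer"] / cfg["n_embd"]

# Full grid, then a random half of it (random search over half the space)
grid = [dict(zip(space, vals)) for vals in itertools.product(*space.values())]
random.seed(0)
half = random.sample(grid, k=len(grid) // 2)

# Pick the configuration with the lowest last-epoch validation loss
best = min(half, key=train_one_config)
```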
Configuration Options:
- task: "hpsearch"
- hp_model: "GPT", "Mamba1", or "Mamba2"
- hp_tokenizer: "Char", "AIS", "BPE", "NPBPE60", "NPBPE100", "NPBPE1000", "NPBPE7924", or "NPBPE30k"
- hp_split: "random" or "scaffold"
python3 /CLMs-for-NPs/main.py \
--task hpsearch \
--hp_model GPT \
--hp_tokenizer AIS \
--hp_split random

Pre-training covers 48 model variants on the 1M NPs dataset: 3 model types (GPT, Mamba, or Mamba2) × 8 tokenizers (Char, BPE, AIS, NPBPE60, NPBPE100, NPBPE1000, NPBPE7924, NPBPE30k) × 2 data split methods (random, scaffold).
Configuration Options:
- task: "pretrain"
- wandb_key: set via the `arguments` field in your submit file
- pt_model: "GPT", "Mamba1", or "Mamba2"
- pt_tokenizer: "Char", "AIS", "BPE", "NPBPE60", "NPBPE100", "NPBPE1000", "NPBPE7924", or "NPBPE30k"
- pt_split: "random" or "scaffold"
- pt_n_embd: default=256 (set from hyperparameter search result)
- pt_n_layer: default=8 (set from hyperparameter search result)
- pt_lr: default=1e-4 (set from hyperparameter search result)
- pt_n_head: default=None (set from hyperparameter search result, only needed for transformer models)
python3 /CLMs-for-NPs/main.py \
--task pretrain \
--wandb_key "$1" \
--pt_model GPT \
--pt_tokenizer NPBPE1000 \
--pt_split random \
--pt_n_embd 256 \
--pt_n_layer 8 \
--pt_lr 1e-4 \
--pt_n_head 4

Use --pt_n_head None for non-GPT models.
Fine-tuning and evaluation for 3 downstream classification tasks using 48 NP pretrained models (3 model types (GPT, Mamba, or Mamba2) * 8 tokenizers * 2 data split methods)
Configuration Options:
- task: "finetune"
- sub_task: "anti_cancer", "peptides", or "tastes"
- model_split: "sfs" or "rds" (how the pre-training 1M NPs data is split)
- data_split: "sf" or "rd"
python3 /CLMs-for-NPs/main.py \
--task finetune \
--sub_task peptides \
--model_split sfs \
--data_split sf

Fine-tuning ChemBERTa-2 MLM on 1M NPs:
python3 /CLMs-for-NPs/ChemBERTa2_MLM_Finetune_on_1M_NPs.py

Fine-tuning ChemBERTa-2 on property prediction tasks:
Configuration Options:
- task: "chemberta"
- chemberta_model_type: "mlm" (original model), "mtr" (original model), or "mlm-finetuned" (fine-tuned on 1M NPs)
- sub_task: "anti_cancer" or "peptides"
- data_split: "rd" or "sf"
python3 /CLMs-for-NPs/main.py \
--task chemberta \
--chemberta_model_type mtr \
--sub_task anti_cancer \
--data_split sf

Fine-tuning MolFormer on 1M NPs (requires a WandB key):
python3 /CLMs-for-NPs/main.py \
--task molformer_1M_NPs \
--wandb_key "$1"

Fine-tuning MolFormer on property prediction tasks:
Configuration Options:
- task: "molformer"
- molformer_variant: "molformer" (original model) or "molformer-finetuned" (fine-tuned on 1M NPs)
- sub_task: "anti_cancer" or "peptides"
- data_split: "rd" or "sf"
python3 /CLMs-for-NPs/main.py \
--task molformer \
--molformer_variant molformer \
--sub_task peptides \
--data_split rd

If you use this repository or the related pre-trained models on Hugging Face, please cite the paper:
@misc{wang2026chemicallanguagemodelsnatural,
title={Chemical Language Models for Natural Products: A State-Space Model Approach},
author={Ho-Hsuan Wang and Afnan Sultan and Andrea Volkamer and Dietrich Klakow},
year={2026},
eprint={2602.13958},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.13958},
}
