This repository accompanies the paper *Chemical Language Models for Natural Products: A State-Space Model Approach* (arXiv:2602.13958 [cs.LG]). It includes the major components of all experiments carried out in the study. This README describes the repository in detail and explains how to run each experiment outlined below.
🤗 Pre-trained models: All pre-trained models from the paper are available on Hugging Face: Link to the Models
💬 Key Terms
NPs: Natural Products
CLMs: Chemical Language Models
SMILES: Simplified Molecular Input Line Entry System
BPE: Byte-Pair Encoding
NPBPE: Natural Product Byte-Pair Encoding
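To make the tokenizer granularity concrete, here is a toy illustration (not the repository's `tokenisers.py`) of how the same SMILES string splits at character level versus after a few BPE-style merges; the merge list is invented for the example:

```python
def char_tokenize(smiles):
    """Character-level tokenization: one token per character."""
    return list(smiles)

def bpe_tokenize(smiles, merges):
    """Greedy BPE-style tokenization: repeatedly apply the first
    learned merge whose pair appears adjacently in the token list."""
    tokens = list(smiles)
    merged = True
    while merged:
        merged = False
        for pair in merges:
            for i in range(len(tokens) - 1):
                if tokens[i] + tokens[i + 1] == pair:
                    tokens[i:i + 2] = [pair]
                    merged = True
                    break
            if merged:
                break
    return tokens

smiles = "CC(=O)O"               # acetic acid
merges = ["(=", "(=O", "(=O)"]   # invented merges for illustration
chars = char_tokenize(smiles)        # one token per character
bpe = bpe_tokenize(smiles, merges)   # multi-character tokens emerge
```

Larger NPBPE vocabularies (e.g. NPBPE30k vs NPBPE60) correspond to more learned merges and therefore longer, more NP-specific tokens.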
- Data includes the pre-training data of over 1 million NPs and 4 downstream property prediction datasets
- Tokenizer implementation and vocab.json files for Character-level, Atom-in-SMILES (AIS), BPE (from DeepChem), and five NPBPE tokenizers of different vocabulary sizes
- Pre-training code for 48 CLMs (3 model types (GPT, Mamba, Mamba-2) × 8 tokenizers × 2 data splits (scaffold and random splits)) on the curated 1M NP dataset
- Hyperparameter search code to optimize the 48 model-tokenizer pair configurations
- Fine-tuning code for NP-relevant property prediction tasks (excluding Tox21):
- Peptide membrane permeability
- Taste classification
- Anti-cancer activity prediction
- Fine-tuning scripts for benchmark models: MolFormer and ChemBERTa-2 (MLM and MTR versions)
- Molecule generation script using autoregressive sampling
- Generated pseudo-NP molecules using each of the 48 models
- A dockerfile is provided to set up the environment to run all experiments
- Experiment launcher script: a main shell script (run_experiments.sh) is provided to run all major experiments
🔑 WandB API Key
A Weights & Biases (wandb) API key is required for some tasks, such as pretraining. It must be passed to the job script as a command-line argument via the HTCondor submit file.
To do this, set the `arguments` field in your submit file like this: `arguments = YOUR_WANDB_KEY`
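For reference, the key is wired through roughly like this (a hedged sketch; the exact contents of the submit file and script may differ):

```shell
# In run_experiments.sub (HTCondor submit file):
#   executable = run_experiments.sh
#   arguments  = YOUR_WANDB_KEY
#
# In run_experiments.sh, the key then arrives as the first positional argument:
python3 /CLMs-for-NPs/main.py --task pretrain --wandb_key "$1"
```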
📁 Example Usage
The `run_experiments.sh` script provides examples for running all major tasks (molecule generation, hyperparameter search, pretraining, and fine-tuning). Uncomment the relevant blocks to execute them.
All tasks are orchestrated via main.py and can be launched with minimal configuration using run_experiments.sh.
data/ # Contains pre-training 1M NPs and downstream task data files
├── 1M_NPs/ # Random and Scaffold Split 1M NPs pre-training data
└── downstream_task_data/ # Random and Scaffold Split 5x5 CV Downstream Task Datasets
generated_pseudo_NPs/ # Generated NP strings from each of the 48 models
images/ # Images of workflow and downstream application overview
lsv_cluster_files/ # Contains cluster-related setup
├── mamba.dockerfile # Dockerfile for Mamba training environment
├── run_experiments.sh # Shell script to run experiments using main.py
└── run_experiments.sub # Cluster job submission script
vocab_files/ # Contains vocab.json files for all custom tokenizers
ChemBERTa2_MLM_Finetune_on_1M_NPs.py # Fine-tune ChemBERTa models on 1M NPs
ChemBERTa2_finetuned_model.pth # Model checkpoint of the fine-tuned ChemBERTa-2 (MLM) on 1M NPs
ChemBERTa2_finetuning.py # Fine-tune ChemBERTa models on property prediction tasks
MoLFormer_finetuned_model.pth # Model checkpoint of the fine-tuned MolFormer on 1M NPs
MolFormer_Finetuning_on_1M_NPs.py # Fine-tune MolFormer on 1M NPs
MolFormer_Finetuning.py # Fine-tune MolFormer on property prediction tasks
Readme.md # Project overview and usage instructions
finetuning.py # Fine-tuning the 48 NP-pretrained models on property prediction tasks
hpsearch.py # Pre-training hyperparameter search for the 48 model-tokenizer combinations
main.py # Entry point
mol_generation.py # Pseudo-NP molecule generation
pretraining.py # Model pre-training for the 48 model-tokenizer combinations
requirements.txt # Python dependencies required to run the project
sam.py # SAM implementation from UU-Mamba (arXiv:2402.03394)
tokenisers.py # Custom tokenizers implementation
1. Pull the pre-built image from GitHub Container Registry:

   ```shell
   docker pull ghcr.io/rozariwang/npclms:latest
   ```

2. (Optional) Retag it for your own use:

   ```shell
   docker tag ghcr.io/rozariwang/npclms [your_preferred_name]:[your_tag]
   ```
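A typical way to start a container from the image (a suggestion, not from the repository docs; adjust the GPU flag and mount path to your setup):

```shell
# Run interactively with GPU access and the current directory mounted
# at the path the example commands below assume (/CLMs-for-NPs)
docker run --gpus all -it --rm \
    -v "$(pwd)":/CLMs-for-NPs \
    ghcr.io/rozariwang/npclms:latest /bin/bash
```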
All tasks can be executed via main.py through the run_experiments.sh bash script, which serves as an example of how to run every task listed above; task-specific configuration details are described below. Uncomment the block in run_experiments.sh corresponding to the task you want to run.
The molecule generation logic loads a pretrained GPT or Mamba model and its corresponding tokenizer to generate pseudo-NP SMILES strings via autoregressive sampling. It infers configuration details from the model name, generates the sequences, computes a token-sum log-likelihood for each sequence, and writes the results to a CSV file.
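As an illustration of the sampling step (a minimal sketch over a toy vocabulary, not the repository's actual implementation), temperature-scaled autoregressive sampling with an accumulated token-sum log-likelihood might look like:

```python
import math
import random

def sample_sequence(step_logits, temperature=1.0, max_length=512, eos_id=0):
    """Toy autoregressive sampler: step_logits(prefix) returns one logit
    per vocabulary id. Returns the sampled token ids and the sum of
    log-probabilities of the chosen tokens (the per-molecule score)."""
    tokens, log_likelihood = [], 0.0
    for _ in range(max_length):
        # Temperature scaling: <1 sharpens, >1 flattens the distribution
        logits = [l / temperature for l in step_logits(tokens)]
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        tok = random.choices(range(len(probs)), weights=probs, k=1)[0]
        log_likelihood += math.log(probs[tok])
        tokens.append(tok)
        if tok == eos_id:  # stop at end-of-sequence token
            break
    return tokens, log_likelihood

# Hypothetical toy model: prefers token 1, then the EOS token (id 0)
def toy_logits(prefix):
    return [5.0, 1.0, 1.0] if len(prefix) >= 3 else [0.0, 5.0, 1.0]

random.seed(0)
toks, ll = sample_sequence(toy_logits, temperature=1.0)
```

In the repository, `step_logits` would be replaced by a forward pass of the pretrained GPT/Mamba model over the tokenized prefix.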
Configuration Options:
- task: "generate"
- num_mols: default=32 (set the number of molecules you want to generate)
- temperature: default=1.0 (control sampling randomness (lower = more deterministic, higher = more random), 1.0 means no adjustment to the model’s predicted probabilities)
- max_length: default=512 (set the max length of the generated molecules)
- model_names: model names follow the format `rozariwang/[MODEL]-[TOKENIZER]-[SPLIT]`, where:
  - [MODEL]: `GPT`, `M1`, or `M2`
  - [TOKENIZER]: `Char`, `AIS`, `BPE`, `npbpe60`, `npbpe100`, `npbpe1000`, `npbpe7924`, or `npbpe30k`
  - [SPLIT]: `rds` (random split) or `sfs` (scaffold split)
python3 /CLMs-for-NPs/main.py \
--task generate \
--num_mols 1000 \
--temperature 1 \
--max_length 512 \
--model_names rozariwang/M2-NPBPE1000-rds

Hyperparameter (random) search covers half of the entire search space for GPT and Mamba-based models, determining the best hyperparameter set for pre-training each model-tokenizer combination. Each configuration is trained for 5 epochs, and the best hyperparameters are selected based on the lowest validation loss in the final epoch.
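The selection logic can be sketched as follows; the search space and the `train_one_config` stub are invented placeholders, not the repository's actual grid:

```python
import itertools
import random

# Hypothetical search space; the actual grid in hpsearch.py may differ.
space = {
    "n_embd": [128, 256, 512],
    "n_layer": [4, 8, 12],
    "lr": [1e-3, 1e-4, 1e-5],
}

def train_one_config(cfg, epochs=5):
    """Placeholder: would train for `epochs` epochs and return the
    validation loss of the last epoch. Here it is a deterministic stub."""
    return cfg["lr"] * 100 + cfg["n_layer"] / cfg["n_embd"]

# Full grid, then a random half of it (random search over half the space)
grid = [dict(zip(space, vals)) for vals in itertools.product(*space.values())]
random.seed(0)
half = random.sample(grid, k=len(grid) // 2)

# Pick the configuration with the lowest last-epoch validation loss
best = min(half, key=train_one_config)
```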
Configuration Options:
- task: "hpsearch"
- hp_model: "GPT", "Mamba1", or "Mamba2"
- hp_tokenizer: "Char", "AIS", "BPE", "NPBPE60", "NPBPE100", "NPBPE1000", "NPBPE7924", or "NPBPE30k"
- hp_split: "random" or "scaffold"
python3 /CLMs-for-NPs/main.py \
--task hpsearch \
--hp_model GPT \
--hp_tokenizer AIS \
--hp_split random

Pre-training covers 48 model variants on the 1M NPs dataset: 3 model types (GPT, Mamba, or Mamba2) × 8 tokenizers (Char, BPE, AIS, NPBPE60, NPBPE100, NPBPE1000, NPBPE7924, NPBPE30k) × 2 data split methods (random, scaffold).
Configuration Options:
- task: "pretrain"
- wandb_key: set via the `arguments` field in your submit file
- pt_model: "GPT", "Mamba1", or "Mamba2"
- pt_tokenizer: "Char", "AIS", "BPE", "NPBPE60", "NPBPE100", "NPBPE1000", "NPBPE7924", or "NPBPE30k"
- pt_split: "random" or "scaffold"
- pt_n_embd: default=256 (set from hyperparameter search result)
- pt_n_layer: default=8 (set from hyperparameter search result)
- pt_lr: default=1e-4 (set from hyperparameter search result)
- pt_n_head: default=None (set from hyperparameter search result, only needed for transformer models)
python3 /CLMs-for-NPs/main.py \
--task pretrain \
--wandb_key "$1" \
--pt_model GPT \
--pt_tokenizer NPBPE1000 \
--pt_split random \
--pt_n_embd 256 \
--pt_n_layer 8 \
--pt_lr 1e-4 \
--pt_n_head 4

Use --pt_n_head None for non-GPT models.
Fine-tuning and evaluation for 3 downstream classification tasks using 48 NP pretrained models (3 model types (GPT, Mamba, or Mamba2) * 8 tokenizers * 2 data split methods)
Configuration Options:
- task: "finetune"
- sub_task: "anti_cancer", "peptides", or "tastes"
- model_split: "sfs" or "rds" (how the pre-training 1M NPs data is split)
- data_split: "sf" or "rd"
python3 /CLMs-for-NPs/main.py \
--task finetune \
--sub_task peptides \
--model_split sfs \
--data_split sf

Fine-tuning ChemBERTa-2 MLM on 1M NPs:
python3 /CLMs-for-NPs/ChemBERTa2_MLM_Finetune_on_1M_NPs.py

Fine-tuning ChemBERTa-2 on property prediction tasks:
Configuration Options:
- task: "chemberta"
- chemberta_model_type: "mlm" (original model), "mtr" (original model), or "mlm-finetuned" (fine-tuned on 1M NPs)
- sub_task: "anti_cancer" or "peptides"
- data_split: "rd" or "sf"
python3 /CLMs-for-NPs/main.py \
--task chemberta \
--chemberta_model_type mtr \
--sub_task anti_cancer \
--data_split sf

Fine-tuning MolFormer on 1M NPs (requires a WandB key):
python3 /CLMs-for-NPs/main.py \
--task molformer_1M_NPs \
--wandb_key "$1"

Fine-tuning MolFormer on property prediction tasks:
Configuration Options:
- task: "molformer"
- molformer_variant: "molformer" (original model) or "molformer-finetuned" (fine-tuned on 1M NPs)
- sub_task: "anti_cancer" or "peptides"
- data_split: "rd" or "sf"
python3 /CLMs-for-NPs/main.py \
--task molformer \
--molformer_variant molformer \
--sub_task peptides \
--data_split rd

If you use this repository or the related pre-trained models on Hugging Face, please cite the paper:
@misc{wang2026chemicallanguagemodelsnatural,
title={Chemical Language Models for Natural Products: A State-Space Model Approach},
author={Ho-Hsuan Wang and Afnan Sultan and Andrea Volkamer and Dietrich Klakow},
year={2026},
eprint={2602.13958},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.13958},
}
