🧪 CLMs for Natural Products

This repository accompanies the paper Chemical Language Models for Natural Products: A State-Space Model Approach (arXiv:2602.13958 [cs.LG]). It contains the major components of all experiments carried out in the study. This README describes the contents of the repository and explains how to run each experiment outlined below.

🤗 Pre-trained models: All pre-trained models from the paper are available on Hugging Face: Link to the Models

💬 Key Terms
NPs: Natural Products
CLMs: Chemical Language Models
SMILES: Simplified Molecular Input Line Entry System
BPE: Byte-Pair Encoding
NPBPE: Natural Product Byte-Pair Encoding

✅ What's Included

  • Data: the pre-training dataset of over 1 million NPs and 4 downstream property prediction datasets
  • Tokenizer implementation and vocab.json files for Character-level, Atom-in-SMILES (AIS), BPE (from DeepChem), and five NPBPE tokenizers of different vocabulary sizes
  • Pre-training code for 48 CLMs (3 model types (GPT, Mamba, Mamba-2) × 8 tokenizers × 2 data splits (scaffold and random splits)) on the curated 1M NP dataset
  • Hyperparameter search code to optimize the 48 model-tokenizer pair configurations
  • Fine-tuning code for NP-relevant property prediction tasks (excluding Tox21):
    • Peptide membrane permeability
    • Taste classification
    • Anti-cancer activity prediction
  • Fine-tuning scripts for benchmark models: MolFormer and ChemBERTa-2 (MLM and MTR versions)
  • Molecule generation script using autoregressive sampling
  • Generated pseudo-NP molecules using each of the 48 models
  • A Dockerfile is provided to set up the environment for running all experiments
  • Experiment launcher script: A main shell script (run_experiments.sh) is provided to run all major experiments

🔑 WandB API Key
A Weights & Biases (wandb) API key is required for some tasks, such as pretraining. It must be passed to the job script as a command-line argument via the HTCondor submit file.
To do this, set the arguments field in your submit file like this: arguments = YOUR_WANDB_KEY
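For reference, a minimal submit-file sketch is shown below. The field names are standard HTCondor, but the executable, resource request, and log file names are assumptions; adapt them to your own run_experiments.sub and cluster setup.

# Minimal HTCondor submit-file sketch (hypothetical values)
executable   = run_experiments.sh
arguments    = YOUR_WANDB_KEY    # received by the job script as $1
request_gpus = 1
output       = job.out
error        = job.err
log          = job.log
queue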

📁 Example Usage
The run_experiments.sh script provides examples for running all major tasks (molecule generation, hyperparameter search, pretraining, and fine-tuning). Uncomment the relevant blocks to execute.

All tasks are orchestrated via main.py and can be launched with minimal configuration using run_experiments.sh.
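For example, on a cluster set up as described above, the job can be submitted with HTCondor, or the script can be run directly on a machine with the environment installed. The paths below follow the directory structure listed later and are assumptions; the WandB key argument is only needed for tasks that require it.

# Submit as an HTCondor job (the submit file passes the WandB key to the script)
condor_submit lsv_cluster_files/run_experiments.sub

# Or run the script directly, passing the WandB key as $1
bash lsv_cluster_files/run_experiments.sh YOUR_WANDB_KEY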

🖼️ Workflow & Downstream Application Overview

(Figures: workflow overview and downstream application overview; see the images/ directory.)

🗂️ Directory Structure

data/                                  # Contains pre-training 1M NPs and downstream task data files
  ├── 1M_NPs/                          # Random and Scaffold Split 1M NPs pre-training data
  └── downstream_task_data/            # Random and Scaffold Split 5x5 CV Downstream Task Datasets
generated_pseudo_NPs/                  # Generated NP strings from each of the 48 models 
images/                                # Images of workflow and downstream application overview 
lsv_cluster_files/                     # Contains cluster-related setup
  ├── mamba.dockerfile                 # Dockerfile for Mamba training environment
  ├── run_experiments.sh               # Shell script to run experiments using main.py 
  └── run_experiments.sub              # Cluster job submission script
vocab_files/                           # Contains vocab.json files for all custom tokenizers 
ChemBERTa2_MLM_Finetune_on_1M_NPs.py   # Fine-tune ChemBERTa models on 1M NPs
ChemBERTa2_finetuned_model.pth         # Model checkpoint of the fine-tuned ChemBERTa-2 (MLM) on 1M NPs 
ChemBERTa2_finetuning.py               # Fine-tune ChemBERTa models on property prediction tasks
MoLFormer_finetuned_model.pth          # Model checkpoint of the fine-tuned MolFormer on 1M NPs 
MolFormer_Finetuning_on_1M_NPs.py      # Fine-tune MolFormer on 1M NPs
MolFormer_Finetuning.py                # Fine-tune MolFormer on property prediction tasks
Readme.md                              # Project overview and usage instructions
finetuning.py                          # Fine-tuning the 48 NP-pretrained models on property prediction tasks
hpsearch.py                            # Pre-training hyperparameter search for the 48 model-tokenizer combinations 
main.py                                # Entry point
mol_generation.py                      # Pseudo-NP molecule generation 
pretraining.py                         # Model pre-training for the 48 model-tokenizer combinations
requirements.txt                       # Python dependencies required to run the project
sam.py                                 # SAM implementation from UU-Mamba (arXiv:2402.03394)
tokenisers.py                          # Custom tokenizers implementation 

📦 Environment Setup

🔧 Docker Image

Link to the Docker Image

  1. Pull the pre-built image from GitHub Container Registry: docker pull ghcr.io/rozariwang/npclms:latest

  2. (Optional) Retag for your own use: docker tag ghcr.io/rozariwang/npclms [your_preferred_name]:[your_tag]
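A container can then be started. The command below is one possible setup, assuming a GPU host and that you want to mount a local checkout of the repository at /CLMs-for-NPs (the path used by the commands in this README); adjust it to your environment.

# Start an interactive container with GPU access and the repository mounted
docker run --gpus all -it --rm \
  -v "$(pwd)":/CLMs-for-NPs \
  ghcr.io/rozariwang/npclms:latest \
  /bin/bash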

⚙️ How to Run Tasks

All tasks can be executed via main.py using the run_experiments.sh bash script, which provides example invocations for every task listed above; the task-specific configuration options are described below. In run_experiments.sh, uncomment the block corresponding to the task you want to run.

1. Molecule Generation

The molecule generation logic loads a pre-trained GPT or Mamba model and its corresponding tokenizer and generates pseudo-NP SMILES strings using autoregressive sampling. It infers configuration details from the model name, generates the sequences, computes the token-sum log-likelihood of each sequence, and writes the results to a CSV file.

Configuration Options:

  • task: "generate"
  • num_mols: default=32 (number of molecules to generate)
  • temperature: default=1.0 (controls sampling randomness: lower values are more deterministic, higher values more random; 1.0 leaves the model's predicted probabilities unchanged)
  • max_length: default=512 (maximum length of the generated molecules)
  • model_names: model names follow the format rozariwang/[MODEL]-[TOKENIZER]-[SPLIT], where:
    • [MODEL]: GPT, M1, or M2
    • [TOKENIZER]: Char, AIS, BPE, npbpe60, npbpe100, npbpe1000, npbpe7924, or npbpe30k
    • [SPLIT]: rds (random split) or sfs (scaffold split)
python3 /CLMs-for-NPs/main.py \
  --task generate \
  --num_mols 1000 \
  --temperature 1 \
  --max_length 512 \
  --model_names rozariwang/M2-NPBPE1000-rds
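To sample from several checkpoints in one run, the call can be wrapped in a simple shell loop. The model names below are illustrative; use the exact repository names from the Hugging Face page, following the naming scheme above.

# Generate 1000 molecules from each of several pre-trained checkpoints (illustrative names)
for model in rozariwang/GPT-Char-rds rozariwang/M1-AIS-sfs rozariwang/M2-NPBPE1000-rds; do
  python3 /CLMs-for-NPs/main.py \
    --task generate \
    --num_mols 1000 \
    --temperature 1 \
    --max_length 512 \
    --model_names "$model"
done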

2. Hyperparameter Search

A random hyperparameter search over half of the full search space for the GPT- and Mamba-based models, used to determine the best hyperparameter set for pre-training each model-tokenizer combination. Each configuration is trained for 5 epochs, and the best hyperparameters are selected based on the lowest validation loss in the final epoch.

Configuration Options:

  • task: "hpsearch"
  • hp_model: "GPT", "Mamba1", or "Mamba2"
  • hp_tokenizer: "Char", "AIS", "BPE", "NPBPE60", "NPBPE100", "NPBPE1000", "NPBPE7924", or "NPBPE30k"
  • hp_split: "random" or "scaffold"
python3 /CLMs-for-NPs/main.py \
  --task hpsearch \
  --hp_model GPT \
  --hp_tokenizer AIS \
  --hp_split random
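Each call searches one model-tokenizer-split combination, so a full sweep over all 48 configurations can be scripted with nested loops over the documented option values:

# Sweep all model-tokenizer-split combinations (48 runs)
for model in GPT Mamba1 Mamba2; do
  for tok in Char AIS BPE NPBPE60 NPBPE100 NPBPE1000 NPBPE7924 NPBPE30k; do
    for split in random scaffold; do
      python3 /CLMs-for-NPs/main.py --task hpsearch --hp_model "$model" --hp_tokenizer "$tok" --hp_split "$split"
    done
  done
done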

3. Pretraining (requires WandB key)

Pre-trains the 48 model variants on the 1M NP dataset: 3 model types (GPT, Mamba, Mamba-2) × 8 tokenizers (Char, BPE, AIS, NPBPE60, NPBPE100, NPBPE1000, NPBPE7924, NPBPE30k) × 2 data split methods (random, scaffold).

Configuration Options:

  • task: "pretrain"
  • wandb_key: set via the arguments field in your submit file
  • pt_model: "GPT", "Mamba1", or "Mamba2"
  • pt_tokenizer: "Char", "AIS", "BPE", "NPBPE60", "NPBPE100", "NPBPE1000", "NPBPE7924", or "NPBPE30k"
  • pt_split: "random" or "scaffold"
  • pt_n_embd: default=256 (set from hyperparameter search result)
  • pt_n_layer: default=8 (set from hyperparameter search result)
  • pt_lr: default=1e-4 (set from hyperparameter search result)
  • pt_n_head: default=None (set from hyperparameter search result, only needed for transformer models)
python3 /CLMs-for-NPs/main.py \
  --task pretrain \
  --wandb_key "$1" \
  --pt_model GPT \
  --pt_tokenizer NPBPE1000 \
  --pt_split random \
  --pt_n_embd 256 \
  --pt_n_layer 8 \
  --pt_lr 1e-4 \
  --pt_n_head 4

Use --pt_n_head None for non-GPT models.
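As an example, both splits of one model-tokenizer pair can be pre-trained back to back; the hyperparameter values below are placeholders and should be replaced with the best values found in the hyperparameter search.

# Pre-train one model-tokenizer pair on both splits (placeholder hyperparameters)
for split in random scaffold; do
  python3 /CLMs-for-NPs/main.py \
    --task pretrain \
    --wandb_key "$1" \
    --pt_model Mamba2 \
    --pt_tokenizer NPBPE1000 \
    --pt_split "$split" \
    --pt_n_embd 256 \
    --pt_n_layer 8 \
    --pt_lr 1e-4 \
    --pt_n_head None
done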


4. Fine-tuning

Fine-tuning and evaluation on 3 downstream classification tasks using the 48 NP-pretrained models (3 model types (GPT, Mamba, Mamba-2) × 8 tokenizers × 2 data split methods).

Configuration Options:

  • task: "finetune"
  • sub_task: "anti_cancer", "peptides", or "tastes"
  • model_split: "sfs" or "rds" (how the pre-training 1M NP data was split: scaffold or random)
  • data_split: "sf" or "rd" (how the downstream task data is split: scaffold or random)
python3 /CLMs-for-NPs/main.py \
  --task finetune \
  --sub_task peptides \
  --model_split sfs \
  --data_split sf
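A full sweep over the three downstream tasks and both downstream data splits can be run the same way (here with the scaffold-split pre-trained models):

# Evaluate the scaffold-split pre-trained models on all tasks and downstream splits
for sub_task in anti_cancer peptides tastes; do
  for data_split in rd sf; do
    python3 /CLMs-for-NPs/main.py --task finetune --sub_task "$sub_task" --model_split sfs --data_split "$data_split"
  done
done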

5. Fine-tuning ChemBERTa-2

Fine-tuning ChemBERTa-2 MLM on 1M NPs

python3 /CLMs-for-NPs/ChemBERTa2_MLM_Finetune_on_1M_NPs.py 

Fine-tuning ChemBERTa-2 on property prediction tasks
Configuration Options:

  • task: "chemberta"
  • chemberta_model_type: "mlm" (original model), "mtr" (original model), or "mlm-finetuned" (fine-tuned on 1M NPs)
  • sub_task: "anti_cancer" or "peptides"
  • data_split: "rd" or "sf"
python3 /CLMs-for-NPs/main.py \
  --task chemberta \
  --chemberta_model_type mtr \
  --sub_task anti_cancer \
  --data_split sf
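To compare all three ChemBERTa-2 variants on the same task and data split, the model type can be looped over:

# Compare the original MLM, MTR, and the NP-finetuned MLM on one task
for variant in mlm mtr mlm-finetuned; do
  python3 /CLMs-for-NPs/main.py --task chemberta --chemberta_model_type "$variant" --sub_task anti_cancer --data_split sf
done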

6. Fine-tuning MolFormer

Fine-tuning MolFormer on 1M NPs (requires WandB key)

python3 /CLMs-for-NPs/main.py \
  --task molformer_1M_NPs \
  --wandb_key "$1"

Fine-tuning MolFormer on property prediction tasks
Configuration Options:

  • task: "molformer" (original model) or "molformer-finetuned" (fine-tuned on 1M NPs)
  • molformer_variant: "molformer" or "molformer-finetuned"
  • sub_task: "anti_cancer" or "peptides"
  • data_split: "rd" or "sf"
python3 /CLMs-for-NPs/main.py \
  --task molformer \
  --molformer_variant molformer \
  --sub_task peptides \
  --data_split rd
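Likewise, the original MolFormer can be evaluated on both downstream tasks and both data splits in one go:

# Evaluate the original MolFormer on both downstream tasks and splits
for sub_task in anti_cancer peptides; do
  for data_split in rd sf; do
    python3 /CLMs-for-NPs/main.py --task molformer --molformer_variant molformer --sub_task "$sub_task" --data_split "$data_split"
  done
done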

Citation

If you use this repository or the associated pre-trained models on Hugging Face, please cite the paper:

@misc{wang2026chemicallanguagemodelsnatural,
  title={Chemical Language Models for Natural Products: A State-Space Model Approach}, 
  author={Ho-Hsuan Wang and Afnan Sultan and Andrea Volkamer and Dietrich Klakow},
  year={2026},
  eprint={2602.13958},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.13958}, 
}

About

NP-specific chemical language models (NPCLMs) for molecule generation and property prediction using state-space and transformer architectures.
