oxpig/cond-semla

Interpolation-Based Conditioning of Flow Matching Models for Bioisosteric Ligand Design

This work introduces two training-free, inference-time conditioning strategies, Interpolate-Integrate and Replacement Guidance, that provide fine-grained control over a fast E(3)-equivariant flow-matching model for 3D molecular generation. These methods enable several key tasks in ligand-based drug design without costly retraining.

This codebase is built upon the original SemlaFlow repository by Irwin et al.

Key Features

Interpolate-Integrate:

A seed-guided resampling method for generating molecules with controlled similarity to a reference.

Replacement Guidance:

A flexible method for bioisosteric fragment merging with user-controlled relaxation.
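For intuition only, the idea suggested by the name Interpolate-Integrate can be sketched on one-dimensional toy data: blend a seed with noise at an intermediate time t, then integrate a vector field from t to 1. Everything in this sketch is an assumption made for illustration. The two-mode "data distribution", the hand-written velocity field standing in for the trained model, and all function names are hypothetical, and the repository's actual implementation may differ.

```python
def interpolate(seed, noise, t):
    """Linear interpolant x_t = (1 - t) * noise + t * seed, per coordinate."""
    return [(1.0 - t) * n + t * s for s, n in zip(seed, noise)]


def nearest_mode(value, modes):
    """Toy stand-in for the data distribution: pick the closest mode."""
    return min(modes, key=lambda m: abs(m - value))


def integrate(x, t_start, steps, modes):
    """Euler-integrate a toy velocity field from t_start to 1.

    The field points from the current state to its nearest mode and is
    scaled by 1 / (1 - t), so the state reaches that mode at t = 1.
    """
    dt = (1.0 - t_start) / steps
    for k in range(steps):
        t = t_start + k * dt
        x = [xi + dt * (nearest_mode(xi, modes) - xi) / (1.0 - t) for xi in x]
    return x


modes = [-2.0, 2.0]          # hypothetical "valid molecules"
seed, noise = [2.0], [-3.0]  # the seed sits on the +2 mode

# Start late (t = 0.9): little noise is mixed in.
close = integrate(interpolate(seed, noise, 0.9), 0.9, 20, modes)
# Start early (t = 0.2): the noise dominates the starting point.
far = integrate(interpolate(seed, noise, 0.2), 0.2, 20, modes)
print(close, far)
```

In this toy setting, a start time near 1 brings the sample back to the seed's mode, while an early start time lets the noise carry it to the other mode. That is the sense in which the start time acts as a similarity control here.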

Installation

The provided cond-semla.yml file contains all the necessary Python packages.

Create and activate the conda environment:

conda env create -f cond-semla.yml
conda activate cond-semla

Dependencies for the Evaluation Pipeline

To run the full evaluation pipeline as described in the paper, one external dependency must be installed manually. Other key packages for evaluation are already included in the environment file.

  1. xTB for ShEPhERD Score (Manual Installation Required): The semi-empirical quantum chemistry program xTB is used for geometry relaxation.

    • Download the binary from the official GitHub releases: https://github.com/grimme-lab/xtb/releases

    • Unpack the archive (e.g., tar -xf xtb-version-x.x.x.tar.xz).

    • Make the xtb binary executable and ensure its location is added to your system's $PATH.

  2. Docking & Scoring Tools (Included in cond-semla.yml): No separate installation is needed for the main evaluation tools. ShEPhERD Score, AutoDock Vina, Meeko, and PoseBusters are installed automatically when you create the cond-semla environment.
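Before running the evaluation notebooks, it can help to confirm that the manually installed xtb binary is actually visible on your $PATH. The snippet below is a minimal sketch using only the Python standard library; the helper name find_executable is hypothetical and not part of this repository.

```python
import shutil


def find_executable(name="xtb"):
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable("xtb")
    if path is None:
        print("xtb not found on PATH; add its directory to $PATH")
    else:
        print(f"xtb found at {path}")
```

Run it from inside the activated cond-semla environment so that the check sees the same $PATH the notebooks will use.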

Usage

This guide explains how to use the code to generate new molecules from your own data, as well as how to reproduce the specific results from our paper.

Workflow for New Data

The primary workflow is managed through the provided Jupyter notebooks. Follow these steps to use the code with your own fragments or seed molecules.

Step 1: Prepare the Input File

The required input is a single .sdf file containing all of your 3D fragments or seed molecules. It is crucial that all molecules in this file are correctly positioned and oriented relative to each other, as this spatial information is used to guide the generation process.
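If your fragments currently live in several .sdf files, they can be combined into the single file described above by concatenating their records. The sketch below relies only on the fact that each SDF record ends with a `$$$$` line; the function names are hypothetical and not part of this repository. Concatenation leaves coordinates untouched, so the relative placement of fragments is preserved only if the input files already share one coordinate frame.

```python
def split_sdf_records(text):
    """Split SDF text into records; each record ends with a '$$$$' line.

    Trailing lines that are not closed by '$$$$' are dropped.
    """
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def merge_sdf_texts(texts):
    """Concatenate the records of several SDF texts into one SDF text."""
    merged = []
    for text in texts:
        merged.extend(split_sdf_records(text))
    return "".join(merged)


# Tiny placeholder records (not valid molecules) to show the record handling.
a = "frag_a\n\n\nM  END\n$$$$\n"
b = "frag_b\n\n\nM  END\n$$$$\nfrag_c\n\n\nM  END\n$$$$\n"
merged = merge_sdf_texts([a, b])
print(len(split_sdf_records(merged)), "records")
```

To use this on real files, read each file's text, pass the list of texts to merge_sdf_texts, and write the result to a new .sdf file.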

Step 2: Convert Data to .smol Format

Use the notebooks/01_Data_Preparation.ipynb notebook to convert your .sdf file into the model's internal .smol format. This notebook contains all the necessary code to load and process your input data.

Step 3: Generate New Molecules

Use the notebooks/02_Molecular_Generation.ipynb notebook to run the generative models on the .smol file you created. In this notebook, you can select which of our two conditioning strategies to use:

Interpolate-Integrate: Best for generating analogues with controlled similarity to a single seed molecule.

Parameters:

  • data_path: Path to the dataset folder containing .smol files (e.g., ./geom_data/smol/).
  • ckpt_path: Path to the pre-trained model checkpoint (e.g., ./trained_models/geom-drugs/200epochs.ckpt).
  • dataset: Name of the dataset (e.g., geom-drugs).
  • n_molecules: Number of reference molecules to generate similar molecules for.
  • batch_cost: Number of similar molecules to generate for each reference molecule.
  • sigma: Standard deviation of random noise added to coordinates during sampling.
  • denoise_steps: Number of integration steps used by the ODE during sampling.
  • save_dir: Directory where the output will be saved.
  • save_file: File name for the saved results.

python semlaflow/interpolate_integrate.py \
  --data_path ./data/NP_analogues_2500/ref_0/sampled_atoms.smol \
  --ckpt_path ./trained_models/geom-drugs/200epochs.ckpt \
  --dataset geom-drugs \
  --n_molecules 1000 \
  --batch_cost 100 \
  --sigma 0.1 \
  --denoise_steps 72 \
  --save_dir ./predictions/ \
  --save_file merged_molecules.sdf

Replacement Guidance: Best for bioisosteric merging of multiple fragments.

Parameters:

  • data_path: Path to the dataset folder containing .smol files (e.g., ./geom_data/difflinker_data/).
  • ckpt_path: Path to the pre-trained model checkpoint (e.g., ./trained_models/geom-drugs/200epochs.ckpt).
  • dataset: Name of the dataset (e.g., geom-drugs).
  • n_molecules: Number of molecules to process.
  • batch_cost: Number of fragments merged for each molecule.
  • save_dir: Directory where the output will be saved.
  • save_file: File name for the merged results (e.g., link_replacement_no_h.sdf).

python semlaflow/replacment_guidance.py \
  --data_path ./data/NP_analogues_2500/ref_0/sampled_atoms.smol \
  --ckpt_path ./trained_models/geom-drugs/200epochs.ckpt \
  --dataset geom-drugs \
  --n_molecules 1000 \
  --batch_cost 100 \
  --save_dir ./predictions/ \
  --save_file merged_molecules.sdf
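When launching these scripts from Python, for example to sweep over settings, the command line can be assembled from a dictionary of options. This is a minimal sketch that uses only the flags and paths shown above; the helper build_command is hypothetical and not part of this repository.

```python
import shlex


def build_command(script, options):
    """Turn a dict of option names and values into an argv list: --name value."""
    argv = ["python", script]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


options = {
    "data_path": "./data/NP_analogues_2500/ref_0/sampled_atoms.smol",
    "ckpt_path": "./trained_models/geom-drugs/200epochs.ckpt",
    "dataset": "geom-drugs",
    "n_molecules": 1000,
    "batch_cost": 100,
    "save_dir": "./predictions/",
    "save_file": "merged_molecules.sdf",
}
argv = build_command("semlaflow/replacment_guidance.py", options)

# Print the equivalent shell command for inspection.
print(shlex.join(argv))
# To actually launch it: subprocess.run(argv, check=True)
```

Passing the list form to subprocess.run avoids shell quoting issues with paths. For the Interpolate-Integrate script, add sigma and denoise_steps to the dictionary and change the script path.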

You can configure all necessary parameters, such as the input path, the number of molecules to generate, and method-specific hyperparameters, directly within the notebook.

Step 4: Evaluate Generated Molecules

Use the notebooks/03_Molecular_Evaluation_Analysis.ipynb and notebooks/04_PoseBusters_Molecular_Evaluation.ipynb notebooks to perform a full evaluation of the molecules you generated. This pipeline runs the analyses described in the paper.

Data

The pre-processed input data and all generated molecules from the paper are available on Zenodo.

Pre-processed data for our experiments is already provided in the data/ directory. To reproduce the paper results, download the data and directly run the generation and evaluation notebooks (02, 03, and 04) using the provided datasets, skipping Steps 1 and 2.
