This work introduces two training-free, inference-time conditioning strategies, Interpolate-Integrate and Replacement Guidance, that provide fine-grained control over a fast E(3)-equivariant flow-matching model for 3D molecular generation. These methods enable several key tasks in ligand-based drug design without costly retraining.
This codebase is built upon the original SemlaFlow repository by Irwin et al.
- A seed-guided resampling method for generating molecules with controlled similarity to a reference.
- A flexible method for bioisosteric fragment merging with user-controlled relaxation.
The provided cond-semla.yml file contains all the necessary Python packages.
Create and activate the conda environment:
conda env create -f cond-semla.yml
conda activate cond-semla
To run the full evaluation pipeline as described in the paper, one external dependency (xTB) must be installed manually. Other key packages for evaluation are already included in the environment file.
- xTB for ShEPhERD Score (manual installation required): The semi-empirical quantum chemistry program xTB is used for geometry relaxation.
  - Download the binary from the official GitHub releases: https://github.com/grimme-lab/xtb/releases
  - Unpack the archive (e.g., tar -xf xtb-version-x.x.x.tar.xz).
  - Make the xtb binary executable and ensure its location is added to your system's $PATH.
- Docking & Scoring Tools (included in cond-semla.yml): No separate installation is needed for the main evaluation tools. Packages including ShEPhERD Score, AutoDock Vina, Meeko, and PoseBusters are installed automatically when you create the cond-semla environment.
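Before running the evaluation notebooks, it can help to confirm that the xtb binary is reachable from the active environment. The following is a minimal sketch using only the Python standard library; the helper name find_xtb is hypothetical and not part of this repository.

```python
import shutil


def find_xtb(executable="xtb"):
    """Return the absolute path of the executable if it is on PATH, else None."""
    # shutil.which performs the same lookup the shell does when resolving a command.
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_xtb()
    if path is None:
        print("xtb not found on PATH; check the manual installation steps.")
    else:
        print(f"xtb found at {path}")
```

If the lookup returns None after installation, the directory holding the xtb binary has most likely not been added to $PATH in the shell that launched Python or Jupyter.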
This guide explains how to use the code to generate new molecules from your own data, as well as how to reproduce the specific results from our paper.
Workflow for New Data
The primary workflow is managed through the provided Jupyter notebooks. Follow these steps to use the code with your own fragments or seed molecules.
The required input is a single .sdf file containing all of your 3D fragments or seed molecules. It is crucial that all molecules in this file are correctly positioned and oriented relative to each other, as this spatial information is used to guide the generation process.
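A quick sanity check on the input file can catch common problems, such as 2D depictions saved as .sdf, before conversion. The sketch below is an assumption-laden illustration rather than part of this repository: the function names are hypothetical, it relies only on the standard library, and it handles V2000 molblocks only. It counts the records in the file and flags any record whose atoms all have z = 0. It cannot verify that the molecules are correctly aligned relative to each other; that still has to be checked visually or by other means.

```python
from pathlib import Path


def split_records(text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    # Keep a trailing record that lacks the final '$$$$' terminator.
    if any(line.strip() for line in current):
        records.append(current)
    return records


def check_sdf_is_3d(path):
    """Return (number_of_records, indices_of_records_whose_atoms_all_have_z_zero)."""
    records = split_records(Path(path).read_text())
    flat = []
    for index, lines in enumerate(records):
        counts = lines[3]  # the counts line follows the three header lines
        if "V2000" not in counts:
            raise ValueError(f"record {index}: only V2000 molblocks are handled")
        n_atoms = int(counts[0:3])
        # In V2000 atom lines, x, y and z occupy three fixed 10-character fields.
        z_values = [float(line[20:30]) for line in lines[4:4 + n_atoms]]
        if all(z == 0.0 for z in z_values):
            flat.append(index)
    return len(records), flat
```

Calling check_sdf_is_3d on your input file returns the number of molecules and the indices of any records that look planar in z; a non-empty second element suggests those records should be inspected before running the data preparation notebook.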
Use the notebooks/01_Data_Preparation.ipynb notebook to convert your .sdf file into the model's internal .smol format. This notebook contains all the necessary code to load and process your input data.
Use the notebooks/02_Molecular_Generation.ipynb notebook to run the generative models on the .smol file you created. In this notebook, you can select which of our two conditioning strategies to use:
Interpolate-Integrate: Best for generating analogues with controlled similarity to a single seed molecule.
Parameters:
- data_path: Path to the dataset folder containing .smol files (e.g., ./geom_data/smol/).
- ckpt_path: Path to the pre-trained model checkpoint (e.g., ./trained_models/geom-drugs/200epochs.ckpt).
- dataset: Name of the dataset (e.g., geom-drugs).
- n_molecules: Number of reference molecules to generate similar molecules for.
- batch_cost: Number of similar molecules to generate for each reference molecule.
- sigma: Standard deviation of random noise added to coordinates during sampling.
- denoise_steps: Number of integration steps used by the ODE during sampling.
- save_dir: Directory where the output will be saved.
- save_file: File name for the saved results.
python semlaflow/interpolate_integrate.py \
--data_path ./data/NP_analogues_2500/ref_0/sampled_atoms.smol \
--ckpt_path ./trained_models/geom-drugs/200epochs.ckpt \
--dataset geom-drugs \
--n_molecules 1000 \
--batch_cost 100 \
--sigma 0.1 \
--denoise_steps 72 \
--save_dir ./predictions/ \
--save_file merged_molecules.sdf
Replacement Guidance: Best for bioisosteric merging of multiple fragments.
Parameters:
- data_path: Path to the dataset folder containing .smol files (e.g., ./geom_data/difflinker_data/).
- ckpt_path: Path to the pre-trained model checkpoint (e.g., ./trained_models/geom-drugs/200epochs.ckpt).
- dataset: Name of the dataset (e.g., geom-drugs).
- n_molecules: Number of molecules to process.
- batch_cost: Number of fragments merged for each molecule.
- save_dir: Directory where the output will be saved.
- save_file: File name for the merged results (e.g., link_replacement_no_h.sdf).
python semlaflow/replacment_guidance.py \
--data_path ./data/NP_analogues_2500/ref_0/sampled_atoms.smol \
--ckpt_path ./trained_models/geom-drugs/200epochs.ckpt \
--dataset geom-drugs \
--n_molecules 1000 \
--batch_cost 100 \
--save_dir ./predictions/ \
--save_file merged_molecules.sdf
You can configure all necessary parameters, such as the input path, the number of molecules to generate, and method-specific hyperparameters, directly within the notebook.
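For scripted runs outside the notebook, the command-line calls shown above can also be assembled in Python. This is a minimal sketch: build_command is a hypothetical helper, not part of this repository, and it uses only the script path and flags that appear in the Interpolate-Integrate example command.

```python
import shlex
import subprocess
import sys


def build_command(script, **options):
    """Build an argv list of the form [python, script, '--name', 'value', ...]."""
    command = [sys.executable, script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# Values copied from the Interpolate-Integrate example command.
interpolate_cmd = build_command(
    "semlaflow/interpolate_integrate.py",
    data_path="./data/NP_analogues_2500/ref_0/sampled_atoms.smol",
    ckpt_path="./trained_models/geom-drugs/200epochs.ckpt",
    dataset="geom-drugs",
    n_molecules=1000,
    batch_cost=100,
    sigma=0.1,
    denoise_steps=72,
    save_dir="./predictions/",
    save_file="merged_molecules.sdf",
)

if __name__ == "__main__":
    # Print the command for inspection; the run itself is left commented out.
    print(shlex.join(interpolate_cmd))
    # Uncomment to launch the run from the repository root:
    # subprocess.run(interpolate_cmd, check=True)
```

The same helper can be pointed at the Replacement Guidance script by passing its path and omitting sigma and denoise_steps, which are listed only for Interpolate-Integrate.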
Use the notebooks/03_Molecular_Evaluation_Analysis.ipynb and notebooks/04_PoseBusters_Molecular_Evaluation.ipynb notebooks to perform a full evaluation of the molecules you generated. This pipeline runs all the analyses described in the paper, including:
- 3D Similarity Scoring: Using ShEPhERD Score.
- Validity Assessment: Using PoseBusters.
The pre-processed input data and all generated molecules from the paper are available on Zenodo.
Pre-processed data for our experiments is already provided in the data/ directory. To reproduce the paper results, download the data and run the generation and evaluation notebooks (02, 03, and 04) directly on the provided datasets, skipping Steps 1 and 2 (preparing the input .sdf file and converting it to .smol).
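To see which pre-processed inputs are available before opening the generation notebook, the data/ directory can be scanned for .smol files. The helper below is a hypothetical sketch, not part of this repository; it assumes only that the inputs use the .smol extension, as in the example paths above.

```python
from pathlib import Path


def find_smol_files(root="data"):
    """Return all .smol files below root, sorted for stable output."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    return sorted(root_path.rglob("*.smol"))


if __name__ == "__main__":
    for smol_file in find_smol_files():
        print(smol_file)
```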