Struct2Query

Structure-based composite query generation for virtual screening.

Struct2Query generates composite, multi-molecule ROCS queries for virtual screening by exploiting the structural similarity of protein binding sites. Given a target protein structure, it searches a database of protein-ligand complexes for proteins with similar pockets, retrieves their co-crystal ligands, and constructs a multi-molecule ROCS query that recapitulates binding modes across structurally similar binding sites.

Citation

If you use Struct2Query in your research, please cite:

[Citation details to be added upon publication]

Overview

Struct2Query bridges structure-based and ligand-based virtual screening by converting protein structural information into ligand-based queries suitable for high-throughput screening.

Workflow

Input: Protein structure with defined binding pocket (the current implementation uses a bound ligand to define the pocket)
Pocket Search: SiteHopper identifies structurally similar binding site patches in a curated database of protein-ligand complexes
Ligand Retrieval: Co-crystallized ligands from similar pockets are retrieved and energy-minimized
Filtering: Ligands are filtered by pose quality (RMSD and Chemgauss4 score)
Query Construction: Retained ligands are combined into a composite ROCS query
Virtual Screening: Compound libraries are scored against the query using shape and color overlap

Installation

Prerequisites

Python 3.11
OpenEye toolkits license

Set the OE_LICENSE environment variable to point to your license file:

export OE_LICENSE=/path/to/oe_license.txt
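Before running anything, it can help to confirm that the variable is visible to Python and points to a real file. The snippet below is a minimal sketch using only the standard library; the helper name oe_license_path is hypothetical and not part of Struct2Query or the OpenEye toolkits, and it only checks that the file exists, not that the license is valid.

```python
import os
from pathlib import Path


def oe_license_path(env=None):
    """Return the path in OE_LICENSE if it points to an existing file, else None.

    Hypothetical helper for illustration; it does not validate the license itself.
    """
    env = os.environ if env is None else env
    value = env.get("OE_LICENSE")
    if not value:
        return None
    path = Path(value).expanduser()
    return path if path.is_file() else None


if __name__ == "__main__":
    found = oe_license_path()
    if found:
        print(f"OE_LICENSE points to {found}")
    else:
        print("OE_LICENSE is unset or does not point to a file")
```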

Using uv (recommended)

# Create virtual environment and install dependencies from lockfile
uv sync --extra dev --frozen

# Activate the environment
source .venv/bin/activate

The --frozen flag ensures exact versions from uv.lock are installed for reproducibility.

Data Setup

The data required to run Struct2Query is available on Zenodo. Download the archive and extract it in the project root directory.

Download: https://doi.org/10.5281/zenodo.18021612

The archive contains a data/ directory with:

sitehopper_databases/Struct2Query.shdb - Curated SiteHopper database of 78,806 protein-ligand complexes derived from PLINDER
dekois/ - Prepared DEKOIS 2.0 benchmark dataset (81 targets)
dudez/ - Prepared DUDE-Z benchmark dataset (43 targets)

Quick Start

Single-target SiteHopper workflow

Run Struct2Query on a single target from the DEKOIS benchmark:

python src/struct2query/run.py \
    experiment=sitehopper_dekois \
    data.target_idx=0 \
    paths.data_dir=./data

This will:

Load target 0 from DEKOIS with its protein structure
Search for similar binding sites (retrieving up to 75 hits with patch score >= 1.5)
Filter candidates by RMSD (<= 2.0 Å) and docking score (Chemgauss4 <= 0)
Construct a composite ROCS query from filtered ligands
Score active and decoy compounds using FitTversky (alpha=0.05)
Output results to the outputs/ directory
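The retrieval and filtering thresholds listed above can be pictured with a small, self-contained sketch. The Hit record, the filter_hits function, and the order of operations (rank by patch score, cap at max_hits, then apply the pose filters) are assumptions made for illustration; the actual implementation in src/struct2query/ may differ.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one ligand retrieved from a similar pocket."""
    ligand_id: str
    patch_score: float  # binding site patch similarity
    rmsd: float         # pose RMSD after optimization, in Angstroms
    chemgauss4: float   # docking score (lower is better)


def filter_hits(hits, max_hits=75, min_patch_score=1.5,
                max_rmsd=2.0, score_threshold=0.0):
    """Keep the best-scoring pockets, then apply the pose-quality filters."""
    # Assumed order: rank by patch score and cap the number of retrieved hits.
    ranked = sorted(hits, key=lambda h: h.patch_score, reverse=True)
    retrieved = [h for h in ranked if h.patch_score >= min_patch_score][:max_hits]
    # Pose-quality filters: both thresholds are inclusive upper bounds.
    return [h for h in retrieved
            if h.rmsd <= max_rmsd and h.chemgauss4 <= score_threshold]


example_hits = [
    Hit("a", patch_score=2.1, rmsd=0.8, chemgauss4=-6.2),
    Hit("b", patch_score=1.2, rmsd=0.5, chemgauss4=-7.0),  # patch score too low
    Hit("c", patch_score=1.9, rmsd=3.4, chemgauss4=-5.1),  # RMSD too high
    Hit("d", patch_score=1.6, rmsd=1.1, chemgauss4=0.7),   # positive Chemgauss4
    Hit("e", patch_score=1.5, rmsd=2.0, chemgauss4=0.0),   # exactly on the limits
]
kept_ids = [h.ligand_id for h in filter_hits(example_hits)]
print(kept_ids)
```

Only the ligands that survive every filter would be combined into the composite query.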

ROCS-only workflow

For a simpler workflow using only the crystal ligand (without SiteHopper pocket search):

python src/struct2query/run.py \
    experiment=rocs_dekois \
    data.target_idx=0 \
    paths.data_dir=./data

Configuration

Struct2Query uses Hydra for configuration management. Configuration files are located in src/struct2query/configs/.

Key Parameters

The default parameters are based on ablation studies across DEKOIS 2.0 and DUDE-Z benchmarks (see manuscript for details).

Parameter	Config Path	Default	Description
`max_hits`	`generation.max_hits`	75	Maximum ligands retrieved from SiteHopper (N_SHDB in manuscript)
`min_patch_score`	`generation.filter_params.min_patch_score`	1.5	Minimum binding site patch similarity (PS_min)
`max_rmsd`	`generation.filter_params.max_rmsd`	2.0	Maximum pose RMSD after optimization in Angstroms (RMSD_MAX)
`score_threshold`	`generation.filter_params.score_threshold`	0	Maximum Chemgauss4 score (Chemgauss4_MAX)
`scoring_type`	`scorer.scoring_type`	fit_tversky	ROCS scoring function for composite queries

Scoring Functions

Struct2Query supports three ROCS scoring functions:

tanimoto (scorer=rocs): Symmetric Tanimoto scoring that weights the query and fit molecules equally. Standard choice for single-ligand queries.
fit_tversky (scorer=rocs_fit_tversky): Asymmetric Tversky scoring with alpha=0.05. Prioritizes fit molecule coverage. Recommended for composite queries because it rewards molecules that are well-covered by any conformer in the multi-conformer query.
ref_tversky (scorer=rocs_ref_tversky): Asymmetric Tversky scoring with alpha=0.95. Prioritizes reference/query coverage.

FitTversky is the default for SiteHopper experiments because composite queries have a much larger self-overlap than single-molecule queries. Using Tanimoto would bias against certain scaffold classes (e.g., bridged polycyclics).
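The effect of query self-overlap on the three scores can be seen with a few lines of arithmetic. The sketch below assumes the usual overlap-based definitions, Tanimoto = O_rf / (O_rr + O_ff - O_rf) and Tversky = O_rf / (alpha * O_rr + beta * O_ff) with beta = 1 - alpha, where r is the reference (query) and f is the fit molecule. The function names and overlap values are illustrative only and are not taken from the Struct2Query code.

```python
def tanimoto(o_rf, o_rr, o_ff):
    """Symmetric Tanimoto from the cross overlap and the two self-overlaps."""
    return o_rf / (o_rr + o_ff - o_rf)


def tversky(o_rf, o_rr, o_ff, alpha):
    """Asymmetric Tversky; alpha weights the reference self-overlap, (1 - alpha) the fit."""
    return o_rf / (alpha * o_rr + (1.0 - alpha) * o_ff)


# Illustrative numbers: a large composite query (o_rr) and a small fit molecule (o_ff)
# that is almost entirely covered by the query (o_rf close to o_ff).
o_rr, o_ff, o_rf = 1000.0, 100.0, 90.0

tanimoto_score = tanimoto(o_rf, o_rr, o_ff)              # 90 / 1010
fit_tversky_score = tversky(o_rf, o_rr, o_ff, alpha=0.05)  # 90 / 145
ref_tversky_score = tversky(o_rf, o_rr, o_ff, alpha=0.95)  # 90 / 955

print(f"Tanimoto:   {tanimoto_score:.3f}")
print(f"FitTversky: {fit_tversky_score:.3f}")
print(f"RefTversky: {ref_tversky_score:.3f}")
```

With these numbers, the well-covered fit molecule scores poorly under Tanimoto and RefTversky because the query self-overlap dominates the denominator, while FitTversky stays close to the fraction of the fit molecule that is covered.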

Configuration Overrides

Override any parameter via command line:

python src/struct2query/run.py \
    experiment=sitehopper_dekois \
    data.target_idx=0 \
    paths.data_dir=./data \
    generation.max_hits=100 \
    generation.filter_params.min_patch_score=2.0
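Each key=value override addresses a nested configuration entry by its dotted path. The sketch below mimics that idea on a plain dictionary using only the standard library. It is not how Hydra is implemented, the apply_overrides helper is hypothetical, and it ignores config-group selections such as experiment=sitehopper_dekois.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with dotted key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        try:
            value = ast.literal_eval(raw)  # parses numbers such as 100 or 2.0
        except (ValueError, SyntaxError):
            value = raw                    # anything else stays a string
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Defaults taken from the Key Parameters table.
defaults = {
    "generation": {
        "max_hits": 75,
        "filter_params": {"min_patch_score": 1.5, "max_rmsd": 2.0, "score_threshold": 0},
    }
}

updated = apply_overrides(defaults, [
    "generation.max_hits=100",
    "generation.filter_params.min_patch_score=2.0",
    "paths.data_dir=./data",
])
print(updated["generation"]["max_hits"])
print(updated["generation"]["filter_params"]["min_patch_score"])
```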

Available Experiment Presets

Preset	Dataset	Method	Description
`sitehopper_dekois`	DEKOIS 2.0	SiteHopper	Composite query with pocket similarity search
`sitehopper_dudez`	DUDE-Z	SiteHopper	Composite query with pocket similarity search
`rocs_dekois`	DEKOIS 2.0	ROCS	Single-ligand query from crystal ligand
`rocs_dudez`	DUDE-Z	ROCS	Single-ligand query from crystal ligand

Running on Benchmark Datasets

DEKOIS 2.0 (81 targets)

Run the SiteHopper workflow on all DEKOIS targets:

python src/struct2query/run.py -m \
    experiment=sitehopper_dekois \
    data.target_idx="range(0,81)" \
    paths.data_dir=./data \
    paths.output_dir=./results/dekois_sitehopper

The -m flag enables Hydra's multirun mode, executing the workflow for each target sequentially.
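The range(0,81) sweep covers target indices 0 through 80, with the end value excluded as in Python's range. As a rough picture of what the sweep amounts to, the sketch below builds the equivalent list of single-target commands. The multirun_commands helper is hypothetical; it only constructs argument lists and does not launch anything.

```python
def multirun_commands(experiment, n_targets, data_dir="./data"):
    """Build one single-target command (as an argument list) per target index."""
    return [
        [
            "python", "src/struct2query/run.py",
            f"experiment={experiment}",
            f"data.target_idx={idx}",
            f"paths.data_dir={data_dir}",
        ]
        for idx in range(0, n_targets)  # end-exclusive: 0 .. n_targets - 1
    ]


commands = multirun_commands("sitehopper_dekois", 81)
print(len(commands))
print(" ".join(commands[0]))
print(" ".join(commands[-1]))
```

Each list could be passed to subprocess.run if per-target control over execution is needed, for example to distribute targets across machines.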

DUDE-Z (43 targets)

Run the SiteHopper workflow on all DUDE-Z targets:

python src/struct2query/run.py -m \
    experiment=sitehopper_dudez \
    data.target_idx="range(0,43)" \
    paths.data_dir=./data \
    paths.output_dir=./results/dudez_sitehopper

Custom Datasets

To use your own dataset, create a data loader class following the interface in src/struct2query/data/ and a corresponding config file in src/struct2query/configs/data/.

Output Files

Each run produces the following files in the output directory:

File	Description
`scores_<target_idx>.csv`	Scoring results with SMILES, similarity metrics, and active/decoy labels
`scores_<target_idx>.sdf`	Generated candidate ligands used to construct the query
`scores_<target_idx>_generator_stats.csv`	Statistics on filtered ligands (RMSD, scores, atom counts)

Score CSV Columns

The main output CSV contains:

smiles: SMILES string of the scored molecule
shape_score: Shape Tanimoto/Tversky score
color_score: Color Tanimoto/Tversky score
combo_score: Combined shape + color score
is_active: Boolean indicating active (1) or decoy (0)
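As an example of working with these columns, the sketch below reads a score CSV and computes a ROC AUC from combo_score and is_active using only the standard library. The helper names, the inline sample rows, and the assumption that is_active is written as 1/0 or true/false are illustrative; adjust the parsing to match the files your run actually produces.

```python
import csv
import io


def load_scores(handle, score_column="combo_score"):
    """Read scores and active/decoy labels from an open score CSV."""
    scores, labels = [], []
    for row in csv.DictReader(handle):
        scores.append(float(row[score_column]))
        # Assumed encodings for the boolean column: 1/0 or true/false.
        labels.append(row["is_active"].strip().lower() in {"1", "true"})
    return scores, labels


def roc_auc(scores, labels):
    """ROC AUC as the fraction of (active, decoy) pairs ranked correctly; ties count 0.5."""
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    decoys = [s for s, is_active in zip(scores, labels) if not is_active]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))


# Made-up rows in the documented column layout, standing in for scores_<target_idx>.csv.
sample_csv = io.StringIO(
    "smiles,shape_score,color_score,combo_score,is_active\n"
    "CCO,0.80,0.60,1.40,1\n"
    "c1ccccc1,0.70,0.20,0.90,0\n"
    "CCN,0.65,0.40,1.05,1\n"
    "CCC,0.50,0.10,0.60,0\n"
    "CO,0.75,0.35,1.10,0\n"
)
scores, labels = load_scores(sample_csv)
auc = roc_auc(scores, labels)
print(f"ROC AUC: {auc:.3f}")
```

To analyze a real run, replace the StringIO object with an open file handle for the score CSV in the output directory.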

Running Tests

uv run pytest

License

This project is licensed under the MIT License - see the LICENSE.txt file for details.
