Structure-based composite query generation for virtual screening.
Struct2Query generates composite molecule ROCS queries for virtual screening by leveraging structural similarity of protein binding sites. Given a target protein structure, it searches a database of protein-ligand complexes for proteins with similar pockets, retrieves their co-crystal ligands, and constructs a multi-molecule ROCS query that recapitulates binding modes across structurally similar binding sites.
If you use Struct2Query in your research, please cite:
[Citation details to be added upon publication]
Struct2Query bridges structure-based and ligand-based virtual screening by converting protein structural information into ligand-based queries suitable for high-throughput screening.
- Input: Protein structure with defined binding pocket (the current implementation uses a bound ligand to define the pocket)
- Pocket Search: SiteHopper identifies structurally similar binding site patches in a curated database of protein-ligand complexes
- Ligand Retrieval: Co-crystallized ligands from similar pockets are retrieved and energy-minimized
- Filtering: Ligands are filtered by pose quality (RMSD and Chemgauss4 score)
- Query Construction: Retained ligands are combined into a composite ROCS query
- Virtual Screening: Compound libraries are scored against the query using shape and color overlap
- Python 3.11
- OpenEye toolkits license
Set the OE_LICENSE environment variable to point to your license file:
export OE_LICENSE=/path/to/oe_license.txt

# Create virtual environment and install dependencies from lockfile
uv sync --extra dev --frozen
# Activate the environment
source .venv/bin/activate

The --frozen flag ensures exact versions from uv.lock are installed for reproducibility.
The data required to run Struct2Query is available on Zenodo. Download the archive and extract it in the project root directory.
Download: https://doi.org/10.5281/zenodo.18021612
The archive contains a data/ directory with:
- sitehopper_databases/Struct2Query.shdb: Curated SiteHopper database of 78,806 protein-ligand complexes derived from PLINDER
- dekois/: Prepared DEKOIS 2.0 benchmark dataset (81 targets)
- dudez/: Prepared DUDE-Z benchmark dataset (43 targets)
Run Struct2Query on a single target from the DEKOIS benchmark:
python src/struct2query/run.py \
experiment=sitehopper_dekois \
data.target_idx=0 \
paths.data_dir=./data

This will:
- Load target 0 from DEKOIS with its protein structure
- Search for similar binding sites (retrieving up to 75 hits with patch score >= 1.5)
- Filter candidates by RMSD (<= 2.0 Å) and docking score (Chemgauss4 <= 0)
- Construct a composite ROCS query from filtered ligands
- Score active and decoy compounds using FitTversky (alpha=0.05)
- Output results to the outputs/ directory
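The retrieval and filtering steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the `Candidate` record, the `filter_candidates` function, and the order in which the cutoffs are applied are all assumptions. Only the threshold values come from the defaults listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one ligand retrieved from a similar pocket."""
    ligand_id: str
    patch_score: float  # SiteHopper patch similarity
    rmsd: float         # pose RMSD after optimization, in Angstroms
    chemgauss4: float   # docking score (lower is better)


def filter_candidates(candidates, max_hits=75, min_patch_score=1.5,
                      max_rmsd=2.0, score_threshold=0.0):
    """Apply the documented default thresholds to a list of candidates."""
    # Keep at most `max_hits` pocket matches that reach the minimum patch
    # score, best matches first (the ordering is an assumption).
    hits = sorted(
        (c for c in candidates if c.patch_score >= min_patch_score),
        key=lambda c: c.patch_score,
        reverse=True,
    )[:max_hits]
    # Then keep only ligands whose pose passes both quality checks.
    return [c for c in hits
            if c.rmsd <= max_rmsd and c.chemgauss4 <= score_threshold]


# Made-up candidates, chosen so that each filter rejects one ligand.
candidates = [
    Candidate("lig_a", patch_score=2.1, rmsd=0.8, chemgauss4=-9.5),
    Candidate("lig_b", patch_score=1.7, rmsd=2.6, chemgauss4=-7.0),  # RMSD too high
    Candidate("lig_c", patch_score=1.2, rmsd=0.5, chemgauss4=-8.0),  # patch score too low
    Candidate("lig_d", patch_score=1.9, rmsd=1.4, chemgauss4=0.3),   # score above threshold
]
kept = filter_candidates(candidates)
print([c.ligand_id for c in kept])
```

In this toy input only `lig_a` passes all three cutoffs, so the composite query would be built from that single ligand.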
For a simpler workflow using only the crystal ligand (without SiteHopper pocket search):
python src/struct2query/run.py \
experiment=rocs_dekois \
data.target_idx=0 \
paths.data_dir=./data

Struct2Query uses Hydra for configuration management. Configuration files are located in src/struct2query/configs/.
The default parameters are based on ablation studies across DEKOIS 2.0 and DUDE-Z benchmarks (see manuscript for details).
| Parameter | Config Path | Default | Description |
|---|---|---|---|
| max_hits | generation.max_hits | 75 | Maximum ligands retrieved from SiteHopper (N_SHDB in manuscript) |
| min_patch_score | generation.filter_params.min_patch_score | 1.5 | Minimum binding site patch similarity (PS_min) |
| max_rmsd | generation.filter_params.max_rmsd | 2.0 | Maximum pose RMSD after optimization in Angstroms (RMSD_MAX) |
| score_threshold | generation.filter_params.score_threshold | 0 | Maximum Chemgauss4 score (Chemgauss4_MAX) |
| scoring_type | scorer.scoring_type | fit_tversky | ROCS scoring function for composite queries |
Struct2Query supports three ROCS scoring functions:
- tanimoto (scorer=rocs): Symmetric Tanimoto scoring, weighs query and fit molecule equally. Standard choice for single-ligand queries.
- fit_tversky (scorer=rocs_fit_tversky): Asymmetric Tversky scoring with alpha=0.05. Prioritizes fit molecule coverage. Recommended for composite queries because it rewards molecules that are well-covered by any conformer in the multi-conformer query.
- ref_tversky (scorer=rocs_ref_tversky): Asymmetric Tversky scoring with alpha=0.95. Prioritizes reference/query coverage.
FitTversky is the default for SiteHopper experiments because composite queries have much larger self-overlap compared to single-molecule queries. Using Tanimoto would bias against certain scaffold classes (e.g., bridged polycyclics).
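A small numerical sketch may help show why the choice of scoring function matters for composite queries. It assumes the common overlap-based definitions, with alpha weighting the query self-overlap and beta = 1 - alpha weighting the fit self-overlap. That convention is inferred from the descriptions above, and the OpenEye toolkits may name or order these parameters differently. The overlap values are made-up illustrative numbers, not measurements.

```python
def tanimoto(overlap, self_query, self_fit):
    """Symmetric Tanimoto: overlap divided by the union of both self-overlaps."""
    return overlap / (self_query + self_fit - overlap)


def tversky(overlap, self_query, self_fit, alpha):
    """Asymmetric Tversky, with alpha on the query self-overlap (assumed convention)."""
    beta = 1.0 - alpha
    return overlap / (alpha * self_query + beta * self_fit)


# Illustrative volumes: a composite query has a much larger self-overlap
# than the single fit molecule being scored against it.
overlap, self_query, self_fit = 900.0, 5000.0, 1000.0

print("tanimoto   ", round(tanimoto(overlap, self_query, self_fit), 3))
print("fit_tversky", round(tversky(overlap, self_query, self_fit, alpha=0.05), 3))
print("ref_tversky", round(tversky(overlap, self_query, self_fit, alpha=0.95), 3))
```

With these numbers the fit molecule is 90% covered by the query, yet Tanimoto and the alpha=0.95 Tversky variant both come out low because the large query self-overlap dominates the denominator. The alpha=0.05 variant mostly normalizes by the fit molecule's own self-overlap, which is the behavior the composite-query default relies on.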
Override any parameter via command line:
python src/struct2query/run.py \
experiment=sitehopper_dekois \
data.target_idx=0 \
paths.data_dir=./data \
generation.max_hits=100 \
generation.filter_params.min_patch_score=2.0

| Preset | Dataset | Method | Description |
|---|---|---|---|
| sitehopper_dekois | DEKOIS 2.0 | SiteHopper | Composite query with pocket similarity search |
| sitehopper_dudez | DUDE-Z | SiteHopper | Composite query with pocket similarity search |
| rocs_dekois | DEKOIS 2.0 | ROCS | Single-ligand query from crystal ligand |
| rocs_dudez | DUDE-Z | ROCS | Single-ligand query from crystal ligand |
Run SiteHopper workflow on all DEKOIS targets:
python src/struct2query/run.py -m \
experiment=sitehopper_dekois \
data.target_idx="range(0,81)" \
paths.data_dir=./data \
paths.output_dir=./results/dekois_sitehopper

The -m flag enables Hydra's multirun mode, executing the workflow for each target sequentially.
Run SiteHopper workflow on all DUDE-Z targets:
python src/struct2query/run.py -m \
experiment=sitehopper_dudez \
data.target_idx="range(0,43)" \
paths.data_dir=./data \
paths.output_dir=./results/dudez_sitehopper

To use your own dataset, create a data loader class following the interface in src/struct2query/data/ and a corresponding config file in src/struct2query/configs/data/.
Each run produces the following files in the output directory:
| File | Description |
|---|---|
| scores_<target_idx>.csv | Scoring results with SMILES, similarity metrics, and active/decoy labels |
| scores_<target_idx>.sdf | Generated candidate ligands used to construct the query |
| scores_<target_idx>_generator_stats.csv | Statistics on filtered ligands (RMSD, scores, atom counts) |
The main output CSV contains:
- smiles: SMILES string of the scored molecule
- shape_score: Shape Tanimoto/Tversky score
- color_score: Color Tanimoto/Tversky score
- combo_score: Combined shape + color score
- is_active: Boolean indicating active (1) or decoy (0)
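As a rough sketch of how these columns might be consumed, the snippet below parses a small stand-in for scores_<target_idx>.csv and computes a rank-based ROC AUC from combo_score. The CSV content is synthetic, the helper names are hypothetical, and ROC AUC is used here only as a familiar example metric. The evaluation metrics Struct2Query actually reports are not described in this README.

```python
import csv
import io

# Synthetic stand-in for scores_<target_idx>.csv; values are illustrative only.
EXAMPLE_CSV = """smiles,shape_score,color_score,combo_score,is_active
CCO,0.82,0.61,1.43,1
c1ccccc1,0.85,0.35,1.20,0
CC(=O)O,0.60,0.55,1.15,1
CCN,0.40,0.10,0.50,0
CCCC,0.35,0.05,0.40,0
"""


def load_scores(handle):
    """Read the documented columns; accepts 1/0 or true/false for is_active."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "smiles": row["smiles"],
            "combo_score": float(row["combo_score"]),
            "is_active": row["is_active"].strip().lower() in {"1", "true"},
        })
    return rows


def roc_auc(rows):
    """Probability that a random active outscores a random decoy (ties count half)."""
    actives = [r["combo_score"] for r in rows if r["is_active"]]
    decoys = [r["combo_score"] for r in rows if not r["is_active"]]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))


rows = load_scores(io.StringIO(EXAMPLE_CSV))
auc = roc_auc(rows)
print(f"actives={sum(r['is_active'] for r in rows)} auc={auc:.3f}")
```

To apply this to a real run, replace the `io.StringIO` handle with an opened output file from the run's output directory.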
uv run pytest

This project is licensed under the MIT License - see the LICENSE.txt file for details.
