Genentech/struct2query


Struct2Query

Structure-based composite query generation for virtual screening.

[Figure: Struct2Query workflow]

Struct2Query generates composite molecule ROCS queries for virtual screening by leveraging structural similarity of protein binding sites. Given a target protein structure, it searches a database of protein-ligand complexes for proteins with similar pockets, retrieves their co-crystal ligands, and constructs a multi-molecule ROCS query that recapitulates binding modes across structurally similar binding sites.

Citation

If you use Struct2Query in your research, please cite:

[Citation details to be added upon publication]

Overview

Struct2Query bridges structure-based and ligand-based virtual screening by converting protein structural information into ligand-based queries suitable for high-throughput screening.

Workflow

  1. Input: Protein structure with defined binding pocket (the current implementation uses a bound ligand to define the pocket)
  2. Pocket Search: SiteHopper identifies structurally similar binding site patches in a curated database of protein-ligand complexes
  3. Ligand Retrieval: Co-crystallized ligands from similar pockets are retrieved and energy-minimized
  4. Filtering: Ligands are filtered by pose quality (RMSD and Chemgauss4 score)
  5. Query Construction: Retained ligands are combined into a composite ROCS query
  6. Virtual Screening: Compound libraries are scored against the query using shape and color overlap

Installation

Prerequisites

  • Python 3.11
  • OpenEye toolkits license

Set the OE_LICENSE environment variable to point to your license file:

export OE_LICENSE=/path/to/oe_license.txt
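Before launching a long run, it can help to confirm that the variable is set and points to a real file. The following is a minimal sketch using only the Python standard library; the helper name check_oe_license is hypothetical and not part of Struct2Query, and the check does not verify that the license itself is valid.

```python
import os
from pathlib import Path


def check_oe_license(env=os.environ):
    """Return an error message if OE_LICENSE looks misconfigured, else None.

    Hypothetical helper: it only checks that the variable is set and that
    the file exists, not that the OpenEye toolkits accept the license.
    """
    value = env.get("OE_LICENSE")
    if not value:
        return "OE_LICENSE is not set"
    if not Path(value).is_file():
        return f"OE_LICENSE points to a missing file: {value}"
    return None


problem = check_oe_license()
print(problem or "OE_LICENSE is set and the file exists")
```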

Using uv (recommended)

# Create virtual environment and install dependencies from lockfile
uv sync --extra dev --frozen

# Activate the environment
source .venv/bin/activate

The --frozen flag ensures exact versions from uv.lock are installed for reproducibility.

Data Setup

The data required to run Struct2Query is available on Zenodo. Download the archive and extract it in the project root directory.

Download: https://doi.org/10.5281/zenodo.18021612

The archive contains a data/ directory with:

  • sitehopper_databases/Struct2Query.shdb - Curated SiteHopper database of 78,806 protein-ligand complexes derived from PLINDER
  • dekois/ - Prepared DEKOIS 2.0 benchmark dataset (81 targets)
  • dudez/ - Prepared DUDE-Z benchmark dataset (43 targets)

Quick Start

Single-target SiteHopper workflow

Run Struct2Query on a single target from the DEKOIS benchmark:

python src/struct2query/run.py \
    experiment=sitehopper_dekois \
    data.target_idx=0 \
    paths.data_dir=./data

This will:

  1. Load target 0 from DEKOIS with its protein structure
  2. Search for similar binding sites (retrieving up to 75 hits with patch score >= 1.5)
  3. Filter candidates by RMSD (<= 2.0 Å) and docking score (Chemgauss4 <= 0)
  4. Construct a composite ROCS query from filtered ligands
  5. Score active and decoy compounds using FitTversky (alpha=0.05)
  6. Output results to the outputs/ directory
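The retrieval and filtering thresholds in steps 2 and 3 can be illustrated with a small, self-contained sketch. The Candidate record, the filter_candidates function, and the order of operations (rank by patch score, cap at max_hits, then apply the pose filters) are assumptions made for illustration; they are not the project's own classes or its exact logic.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one retrieved ligand (not a Struct2Query type)."""
    ligand_id: str
    patch_score: float  # SiteHopper patch similarity (higher is more similar)
    rmsd: float         # pose RMSD after minimization, in Angstroms
    chemgauss4: float   # Chemgauss4 docking score (lower is better)


def filter_candidates(candidates, max_hits=75, min_patch_score=1.5,
                      max_rmsd=2.0, score_threshold=0.0):
    """Apply the default thresholds listed in the steps above."""
    # Keep the best-scoring pockets, at most max_hits of them.
    hits = sorted(
        (c for c in candidates if c.patch_score >= min_patch_score),
        key=lambda c: c.patch_score,
        reverse=True,
    )[:max_hits]
    # Drop ligands whose minimized pose drifted or scores poorly.
    return [c for c in hits
            if c.rmsd <= max_rmsd and c.chemgauss4 <= score_threshold]


# Made-up values chosen to trigger each filter once.
candidates = [
    Candidate("lig_a", 2.1, 0.8, -6.2),  # passes every filter
    Candidate("lig_b", 1.2, 0.5, -7.0),  # patch score below 1.5
    Candidate("lig_c", 1.8, 2.6, -5.0),  # RMSD above 2.0
    Candidate("lig_d", 1.6, 1.1, 0.4),   # Chemgauss4 above 0
    Candidate("lig_e", 1.5, 2.0, 0.0),   # exactly on the thresholds, kept
]
kept = [c.ligand_id for c in filter_candidates(candidates)]
print(kept)
```

The ligands that survive this kind of filter are the ones that would go into the composite query in step 4.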

ROCS-only workflow

For a simpler workflow using only the crystal ligand (without SiteHopper pocket search):

python src/struct2query/run.py \
    experiment=rocs_dekois \
    data.target_idx=0 \
    paths.data_dir=./data

Configuration

Struct2Query uses Hydra for configuration management. Configuration files are located in src/struct2query/configs/.

Key Parameters

The default parameters are based on ablation studies across the DEKOIS 2.0 and DUDE-Z benchmarks (see the manuscript for details).

Parameter | Config Path | Default | Description
max_hits | generation.max_hits | 75 | Maximum ligands retrieved from SiteHopper (N_SHDB in manuscript)
min_patch_score | generation.filter_params.min_patch_score | 1.5 | Minimum binding site patch similarity (PS_min)
max_rmsd | generation.filter_params.max_rmsd | 2.0 | Maximum pose RMSD after optimization, in Å (RMSD_MAX)
score_threshold | generation.filter_params.score_threshold | 0 | Maximum Chemgauss4 score (Chemgauss4_MAX)
scoring_type | scorer.scoring_type | fit_tversky | ROCS scoring function for composite queries

Scoring Functions

Struct2Query supports three ROCS scoring functions:

  • tanimoto (scorer=rocs): Symmetric Tanimoto scoring, weighs query and fit molecule equally. Standard choice for single-ligand queries.

  • fit_tversky (scorer=rocs_fit_tversky): Asymmetric Tversky scoring with alpha=0.05. Prioritizes fit molecule coverage. Recommended for composite queries because it rewards molecules that are well-covered by any conformer in the multi-conformer query.

  • ref_tversky (scorer=rocs_ref_tversky): Asymmetric Tversky scoring with alpha=0.95. Prioritizes reference/query coverage.

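FitTversky is the default for SiteHopper experiments because a composite query has a much larger self-overlap than a single-molecule query. Tanimoto includes the query self-overlap in its denominator, so it would penalize fit molecules that cover only part of the composite query and bias against certain scaffold classes (e.g., bridged polycyclics).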
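The effect of query size on the two scores can be seen with a few lines of arithmetic. This sketch assumes the standard overlap-based definitions, with Tversky weights alpha on the reference (query) self-overlap and 1 - alpha on the fit self-overlap; the overlap values are invented for illustration and the fit overlap is held constant for simplicity.

```python
def tanimoto(overlap, self_ref, self_fit):
    """Symmetric Tanimoto: overlap relative to the union of both volumes."""
    return overlap / (self_ref + self_fit - overlap)


def tversky(overlap, self_ref, self_fit, alpha):
    """Tversky with weight alpha on the reference and 1 - alpha on the fit.

    Assumed convention: alpha=0.05 behaves like fit_tversky and
    alpha=0.95 like ref_tversky.
    """
    return overlap / (alpha * self_ref + (1.0 - alpha) * self_fit)


# Invented overlap volumes (arbitrary units).
fit_self = 400.0        # self-overlap of the database molecule
overlap = 380.0         # overlap between query and database molecule
single_self = 500.0     # self-overlap of a single-ligand query
composite_self = 2000.0 # self-overlap of a multi-ligand composite query

for name, ref_self in [("single", single_self), ("composite", composite_self)]:
    t = tanimoto(overlap, ref_self, fit_self)
    ft = tversky(overlap, ref_self, fit_self, alpha=0.05)
    print(f"{name:9s} tanimoto={t:.3f} fit_tversky={ft:.3f}")
```

With these made-up numbers, Tanimoto falls from about 0.73 to about 0.19 when the query grows into a composite, while FitTversky drops far less (about 0.94 to 0.79), because the query self-overlap carries only a 0.05 weight.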

Configuration Overrides

Override any parameter via command line:

python src/struct2query/run.py \
    experiment=sitehopper_dekois \
    data.target_idx=0 \
    paths.data_dir=./data \
    generation.max_hits=100 \
    generation.filter_params.min_patch_score=2.0

Available Experiment Presets

Preset | Dataset | Method | Description
sitehopper_dekois | DEKOIS 2.0 | SiteHopper | Composite query with pocket similarity search
sitehopper_dudez | DUDE-Z | SiteHopper | Composite query with pocket similarity search
rocs_dekois | DEKOIS 2.0 | ROCS | Single-ligand query from crystal ligand
rocs_dudez | DUDE-Z | ROCS | Single-ligand query from crystal ligand

Running on Benchmark Datasets

DEKOIS 2.0 (81 targets)

Run SiteHopper workflow on all DEKOIS targets:

python src/struct2query/run.py -m \
    experiment=sitehopper_dekois \
    data.target_idx="range(0,81)" \
    paths.data_dir=./data \
    paths.output_dir=./results/dekois_sitehopper

The -m flag enables Hydra's multirun mode, executing the workflow for each target sequentially.

DUDE-Z (43 targets)

Run SiteHopper workflow on all DUDE-Z targets:

python src/struct2query/run.py -m \
    experiment=sitehopper_dudez \
    data.target_idx="range(0,43)" \
    paths.data_dir=./data \
    paths.output_dir=./results/dudez_sitehopper
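As an alternative to multirun, the same runs can be driven one target at a time from a short script, which may be convenient for rerunning selected targets. The sketch below only builds and prints the command lines; build_command and iter_commands are hypothetical helpers, and the target counts, preset names, and override keys are taken from the commands shown above.

```python
import shlex

# Number of targets per benchmark, as stated in this README.
TARGET_COUNTS = {"dekois": 81, "dudez": 43}


def build_command(dataset, target_idx, method="sitehopper",
                  data_dir="./data", output_root="./results"):
    """Build the argument list for one single-target run (hypothetical helper)."""
    if not 0 <= target_idx < TARGET_COUNTS[dataset]:
        raise IndexError(f"{dataset} has no target {target_idx}")
    return [
        "python", "src/struct2query/run.py",
        f"experiment={method}_{dataset}",
        f"data.target_idx={target_idx}",
        f"paths.data_dir={data_dir}",
        f"paths.output_dir={output_root}/{dataset}_{method}",
    ]


def iter_commands(dataset, **kwargs):
    """Yield one command per target in the chosen benchmark."""
    for idx in range(TARGET_COUNTS[dataset]):
        yield build_command(dataset, idx, **kwargs)


# Dry run: print the first three DUDE-Z commands instead of executing them.
for cmd in list(iter_commands("dudez"))[:3]:
    print(shlex.join(cmd))
```

To execute instead of print, each argument list can be passed to subprocess.run(cmd, check=True) from the project root.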

Custom Datasets

To use your own dataset, create a data loader class following the interface in src/struct2query/data/ and a corresponding config file in src/struct2query/configs/data/.

Output Files

Each run produces the following files in the output directory:

File | Description
scores_<target_idx>.csv | Scoring results with SMILES, similarity metrics, and active/decoy labels
scores_<target_idx>.sdf | Generated candidate ligands used to construct the query
scores_<target_idx>_generator_stats.csv | Statistics on filtered ligands (RMSD, scores, atom counts)

Score CSV Columns

The main output CSV contains:

  • smiles: SMILES string of the scored molecule
  • shape_score: Shape Tanimoto/Tversky score
  • color_score: Color Tanimoto/Tversky score
  • combo_score: Combined shape + color score
  • is_active: Boolean label, 1 for an active and 0 for a decoy
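These columns are enough to compute a simple retrieval metric. The sketch below reads a score CSV with the standard library and computes ROC AUC from combo_score and is_active; the helper names are hypothetical, the accepted label spellings ("1", "1.0", "true") are an assumption about how the file is serialized, and the inline data is invented so the example runs without any output files.

```python
import csv
import io


def load_scores(handle, score_col="combo_score", label_col="is_active"):
    """Read (score, is_active) pairs from an open score CSV."""
    rows = []
    for row in csv.DictReader(handle):
        # Assumed label encodings; adjust if the file uses something else.
        active = row[label_col].strip().lower() in {"1", "1.0", "true"}
        rows.append((float(row[score_col]), active))
    return rows


def roc_auc(rows):
    """Probability that a random active outscores a random decoy.

    Ties count as half a win. Quadratic in the number of molecules,
    which is acceptable for a sketch.
    """
    actives = [score for score, active in rows if active]
    decoys = [score for score, active in rows if not active]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))


# Invented example data with the documented column names.
example_csv = io.StringIO(
    "smiles,shape_score,color_score,combo_score,is_active\n"
    "CCO,0.9,0.7,1.6,1\n"
    "CCN,0.6,0.5,1.1,1\n"
    "CCC,0.8,0.5,1.3,0\n"
    "CCCl,0.5,0.2,0.7,0\n"
    "CCBr,0.3,0.1,0.4,0\n"
)
auc = roc_auc(load_scores(example_csv))
print(f"ROC AUC = {auc:.3f}")
```

For a real run, replace the in-memory example with an open file handle to the relevant scores_<target_idx>.csv in the output directory.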

Running Tests

uv run pytest

License

This project is licensed under the MIT License - see the LICENSE.txt file for details.
