This repository contains pipelines used in the Structural NANOBODY® VHH and Antibody Complex Database (SNAC-DB) project. The goal of SNAC-DB is to extract all possible antibody (Ab) and NANOBODY® VHH (Nb) complexes from structural files (e.g., PDBs or CIFs), and curate and analyze them for downstream applications.
Specifically, the repository supports:
- SNAC-DB Pipeline: Extract and curate non-redundant Ab/Nb complexes from structural files.
- Test Dataset Pipeline: Create a structurally novel test dataset for benchmarking docking models.
- Finding Hits Pipeline: Identify structures in a SNAC-DB-curated dataset that are structurally similar to a specified query structure.
All pipelines are orchestrated via bash scripts, but underlying Python modules provide granular control for customization and debugging.
The curated, ready-to-use SNAC-DB dataset is available at: https://zenodo.org/records/16226208
- Overview
- Repository Structure
- Getting Started
- Usage
- Filling in Unresolved Residues
- Naming Convention of Complexes
- Pipeline Details
- Unit Testing
- Dependencies
- Acknowledgements
- Citations
```
.
├── src/                                      # Python source code
│   ├── snacdb/
│   │   ├── utils/
│   │   │   ├── parallelize.py
│   │   │   ├── pdb_utils_clean_parse.py
│   │   │   ├── residue_constants.py
│   │   │   ├── sequence_utils.py
│   │   │   ├── structure_utils.py
│   │   ├── curation_identify_complexes.py
│   │   ├── curation_process_PDBs.py
│   │   ├── curation_redundant.py
│   │   ├── curation_filter_complexes.py
│   │   ├── patch.py
│   ├── analysis_fill_in_unresolved_residues.py
│   ├── analysis_finding_hits.py
│   ├── curation_SNAC_DB_Pipeline.py
│   ├── testdata_classificaiton.py
│   ├── testdata_setup.py
│   ├── testdata_summary.py
├── unit_tests/                               # Unit testing scripts
│   ├── data_curation_pipeline.py
│   ├── pdb_files_test                        # Sample PDB files
├── data_curation.sh                          # SNAC-DB pipeline launcher
├── test_dataset.sh                           # Test dataset curation launcher
├── finding_hits.sh                           # Hit-finding pipeline launcher
├── environment.yml                           # Conda environment setup
├── requirements.txt                          # pip requirements
├── setup.py                                  # Setup file
├── pyproject.toml
├── README.md
```
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd <repository-name>
  ```

- Set up the environment:

  - Create a conda environment:

    ```bash
    conda create -n snacdb_env python=3.10 -y
    conda activate snacdb_env
    ```

  - Install pip dependencies from PyPI:

    ```bash
    pip install -r requirements.txt
    ```

  - Install the pipeline package:

    ```bash
    pip install .
    ```

- Install ANARCI from its GitHub repository: https://github.com/oxpig/ANARCI

  Please note that the original ANARCI tool does not support ambiguous residues represented by the amino acid `'X'`. To accommodate structural data containing unknown residues, we provide a simple patch that modifies the installed ANARCI source so it handles `'X'` residues gracefully during sequence parsing and annotation. Run this patch (`src/snacdb/patch.py`) using the command:

  ```bash
  snacdb-patch-anarci
  ```
- General Note: All three pipelines can be run via bash scripts. For deeper customization, you may directly execute specific Python modules.
- SNAC-DB Pipeline

  This pipeline lets you curate a folder containing `.pdb` or `.cif` structure files and find/extract antibody and NANOBODY® VHH complexes. This is what we used to curate all the structures deposited in the Protein Data Bank (PDB) and create the SNAC-DB database.

  Run via bash:

  ```bash
  bash data_curation.sh path/to/<input_directory>
  ```

  Optional parameters:
  - Show intermediate outputs: `True`
  - Remove redundant complexes: `True`

  Example with both options:

  ```bash
  bash data_curation.sh path/to/<input_directory> True True
  ```
- Advanced Control

  You can run individual Python scripts from the `src/` directory:
  - The scripts must be run in sequential order. For example, to run `curation_identify_complexes.py` you must first run `curation_process_PDBs.py`.
  - Logs and summaries are generated at each step (a hedged example of a step-by-step run follows below).
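  A hypothetical step-by-step invocation is sketched below; the argument style is an assumption (the real modules may expect different flags), so consult each script's source in `src/snacdb/` for the actual interface:

  ```bash
  # Assumed ordering of the curation steps; the arguments are placeholders,
  # not the scripts' documented CLI.
  python src/snacdb/curation_process_PDBs.py       path/to/<input_directory>
  python src/snacdb/curation_identify_complexes.py path/to/<input_directory>
  python src/snacdb/curation_filter_complexes.py   path/to/<input_directory>
  python src/snacdb/curation_redundant.py          path/to/<input_directory>
  ```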
- Pipeline to Create a Test/Benchmarking Dataset

  This optional pipeline can be used to create a structurally distinct benchmarking dataset for model evaluation.

  First install FoldSeek: https://github.com/steineggerlab/foldseek.

  Create a structurally novel test set:

  ```bash
  bash test_dataset.sh path/to/<query_directory> path/to/<target_directory>
  ```

  Outputs include:
  - `Test_Data/`: curated non-redundant dataset
  - `Test_Data_Summary.csv`: where each complex passed and failed, along with the closest matches and their TM-scores (a quick way to inspect it is sketched below)
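  A minimal sketch for a first look at the summary with pandas; no specific column names are assumed, we just print whatever is there:

  ```python
  import pandas as pd

  # Load the benchmarking summary produced by test_dataset.sh and inspect it.
  summary = pd.read_csv("Test_Data_Summary.csv")
  print(summary.columns.tolist())  # which fields are reported
  print(summary.head())            # first few complexes and their closest matches
  ```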
- Pipeline to Find Hits

  This optional pipeline can be used to search for structurally similar complexes using FoldSeek multimer-search.

  ```bash
  bash finding_hits.sh <reference_dir> <chain_type> [<structure_of_interest> <is_curated>]
  ```

  Required:
  - `reference_dir`: curated dataset (output from the SNAC-DB pipeline)
  - `chain_type`: chain type to compare (`ligand`, `antigen`, or `complex`)

  Optional:
  - `structure_of_interest`: structure file to query against the reference
  - `is_curated`: set to `True` if the query is SNAC-DB formatted

  Behavior:
  - If no structure is provided, a clustering operation is performed on the dataset.
  - If a query is provided, matching structures are identified and ranked by TM-score.

  Outputs appear in `<reference_dir>_<chain_type>_cluster/`, which includes match directories and summary CSV files (see the inventory sketch below).
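  A hypothetical way to take stock of that output directory; nothing about its layout is assumed beyond "subdirectories plus CSV summaries":

  ```python
  from pathlib import Path

  # Point this at your actual <reference_dir>_<chain_type>_cluster/ folder.
  out_dir = Path("my_reference_antigen_cluster")
  for sub in sorted(p for p in out_dir.iterdir() if p.is_dir()):
      print(sub.name + "/")                    # per-match directories
  for csv_file in sorted(out_dir.rglob("*.csv")):
      print(csv_file.relative_to(out_dir))     # summary CSV files
  ```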
In addition to the three main pipelines, this repository includes a helpful script for improving the quality of curated complexes: `analysis_fill_in_unresolved_residues.py`.

This script attempts to resolve unknown residues (typically marked as `'X'`) by searching the SwissProt and UniRef90 databases using MMseqs2. The goal is to reduce ambiguity in antigen chains, which can improve downstream structural and sequence-based analyses.
Script purpose:
- Scans curated complexes for unresolved residues.
- Searches SwissProt and/or UniRef90 for high-identity matches.
- Fills in missing residues based on sequence alignment and similarity.
- Writes corrected FASTA files for the complexes where a correction was possible.
Usage:

Install MMseqs2:

```bash
conda install bioconda::mmseqs2
```

Run:

```bash
python src/analysis_fill_in_unresolved_residues.py \
    --input_dir path/to/curated_complexes \
    --swissprot /path/to/swissprot_db \
    --uniref90 /path/to/uniref90_db
```
Arguments:
- `--input_dir` (str, required): path to a directory containing complexes curated by the SNAC-DB pipeline.
- `--swissprot` (str, optional): path to a SwissProt MMseqs2 database. If not provided or the path doesn't exist, the database is downloaded automatically to the parent of the input directory. ⚠️ Do not set this to `None`; SwissProt is required.
- `--uniref90` (str, optional): path to a UniRef90 MMseqs2 database. If not provided or the path doesn't exist, the database is downloaded. If explicitly set to `"None"`, UniRef90 will be skipped.

If you prefer to prepare the databases ahead of time instead of letting the script download them, see the sketch below.
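A sketch using MMseqs2's built-in `databases` downloader; the target paths are placeholders, and you would then point `--swissprot`/`--uniref90` at them:

```bash
# Pre-download the sequence databases (placeholder paths).
mmseqs databases UniProtKB/Swiss-Prot /path/to/swissprot_db /path/to/tmp
mmseqs databases UniRef90 /path/to/uniref90_db /path/to/tmp
```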
Output:
- `<input_name>_cleaning_complexes/`: directory with corrected sequences, intermediate results, logs, temporary databases, and alignment files.

Not all unresolved residues will be filled; only those that match known entries sufficiently well are corrected. To track processing steps and potential issues, we recommend logging the output:

```bash
python src/analysis_fill_in_unresolved_residues.py \
    --input_dir path/to/my_dir > path/to/my_dir_cleaning_complexes/fill_in_unresolved_residues.log 2>&1
```

Note: you can use the FASTA files with the corrected sequences to update the PDB files corresponding to those complexes. Given the intricate nature of these corrections, we do not automatically create the updated PDB files.
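As a starting point for such an update, a hedged sketch that reads the corrected FASTA files with Biopython and reports any positions still left as `'X'` (the file names and layout inside `<input_name>_cleaning_complexes/` are assumptions):

```python
from pathlib import Path
from Bio import SeqIO

# Scan corrected FASTA files for residues that are still unresolved ('X').
corrected_dir = Path("path/to/my_dir_cleaning_complexes")
for fasta in sorted(corrected_dir.rglob("*.fasta")):
    for record in SeqIO.parse(fasta, "fasta"):
        remaining = [i + 1 for i, aa in enumerate(str(record.seq)) if aa == "X"]
        if remaining:
            print(f"{fasta.name} | {record.id}: 'X' remains at positions {remaining}")
        else:
            print(f"{fasta.name} | {record.id}: fully resolved")
```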
Each PDB and NPY file follows a specific naming convention.
- Base Name: The identification name of the structure. This can be the 4-character PDB ID from the RCSB database or another unique name.
- Ex: 8FSL in 8FSL-ASU1-VHH_B-Ag_C
- Bioassembly Number: The bioassembly number of the structure. If no bioassembly number is provided in the original structure, a default value of 0 is assigned. A value of 0 usually refers to the asymmetric unit.
- Ex: 1 in 7zmr-ASU1-VHH_K-Ag_A
- Frame (Optional): When structures are determined by experimental techniques such as NMR, the experimentalists may deposit multiple frames (models) of the structure in the structure file. Since these frames differ slightly, the pipeline preserves them as different complexes.
- Ex: 0 in 7y7m-ASU1-frame0-VH_F-VL_G-Ag_C
- Structure Information: This section describes the type of structure and the chains associated with it. The possible structure types are antibodies (VH_VL) and NANOBODY® VHH domains (VHH).
- Ex: VHH_G-Ag_B_C in 8k46-ASU0-VHH_G-Ag_B_C
- Replicate (Optional): This occurs when multiple ligand chains share the same chain ID in the structure file. In that case, the replicate keyword is added so that complexes binding the same antigen do not overwrite one another and the naming scheme remains unique.
- Ex: 0 in 6ul6-ASU0-VHH_B-Ag_A-replicate0
As mentioned above, the Frame and Replicate keywords are optional and only apply in the cases described above. Note that, unlike the other keywords, the integer attached to the replicate keyword is not guaranteed to stay the same if you rerun the pipeline.
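To make the convention concrete, here is a minimal, hypothetical parser for these names; the helper and the fields it returns are illustrative and not part of the package:

```python
def parse_complex_name(name: str) -> dict:
    """Parse a SNAC-DB complex name such as '7y7m-ASU1-frame0-VH_F-VL_G-Ag_C'."""
    tokens = name.split("-")
    info = {"base_name": tokens[0], "bioassembly": None, "frame": None,
            "replicate": None, "ligand_chains": {}, "antigen_chains": []}
    for tok in tokens[1:]:
        if tok.startswith("ASU"):
            info["bioassembly"] = int(tok[len("ASU"):])
        elif tok.startswith("frame"):
            info["frame"] = int(tok[len("frame"):])
        elif tok.startswith("replicate"):
            info["replicate"] = int(tok[len("replicate"):])
        elif tok.startswith("Ag_"):
            info["antigen_chains"] = tok.split("_")[1:]
        elif tok.startswith(("VHH_", "VH_", "VL_")):
            kind, chain = tok.split("_", 1)
            info["ligand_chains"][kind] = chain
        else:
            raise ValueError(f"Unrecognized token {tok!r} in {name!r}")
    return info

# {'base_name': '7y7m', 'bioassembly': 1, 'frame': 0, 'replicate': None,
#  'ligand_chains': {'VH': 'F', 'VL': 'G'}, 'antigen_chains': ['C']}
print(parse_complex_name("7y7m-ASU1-frame0-VH_F-VL_G-Ag_C"))
```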
- Process PDBs

  Aim: Clean input structures, fill in missing residues, and identify VH/VL/Ag chains.

  Output:
  - Cleaned PDB files
  - `.npy` files with annotations
  - `<input_dir>__parsed_file_chains.csv`
- Identify Complexes

  Aim: Extract antibody and NANOBODY® VHH complexes from processed structures.

  Output:
  - PDB files of isolated complexes
  - `.npy` annotation files
  - `<input_dir>_complexes_curated.csv`
- Filter Complexes

  Aim: Apply more stringent filtering to exclude non-interacting chains.

  Output:
  - Filtered complex structures
  - Summary CSV: `<input_dir>_outputs_multichain_filter.csv`
- Remove Redundancies

  Aim: Eliminate duplicate complexes based on contact-map comparison (the general idea is sketched below).

  Output:
  - Updated complex directory and summary CSV
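  To illustrate the general idea only (not the pipeline's exact implementation), a minimal inter-chain contact-map comparison on Cα coordinates; the 8 Å cutoff and 0.9 overlap threshold are illustrative assumptions:

  ```python
  import numpy as np
  from scipy.spatial.distance import cdist

  def contact_map(coords_a: np.ndarray, coords_b: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
      """Boolean inter-chain contact map from two (N, 3) Cα coordinate arrays."""
      return cdist(coords_a, coords_b) < cutoff

  def looks_redundant(map1: np.ndarray, map2: np.ndarray, tol: float = 0.9) -> bool:
      """Call two complexes redundant if their contact maps overlap heavily (Jaccard >= tol)."""
      if map1.shape != map2.shape:
          return False
      intersection = np.logical_and(map1, map2).sum()
      union = np.logical_or(map1, map2).sum()
      return union > 0 and intersection / union >= tol
  ```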
- Novel Antigens

  Aim: Identify complexes with structurally dissimilar antigens.

  Output:
  - Passed and failed directories based on TM-score
- Novel Epitopes

  Aim: Compare full complex structures to detect epitope novelty.

  Output:
  - Passed and failed directories based on TM-score
- Novel Conformations

  Aim: Detect small but meaningful differences in binding conformation.

  Output:
  - Passed and failed directories based on multi-chain TM-score
- Unique Complexes

  Aim: Ensure the dataset is internally structurally diverse.

  Output:
  - Final dataset
  - `Test_Data_Summary.csv`
- Create Proper Reference Directory

  Aim: Extract chains (ligand/antigen) from curated complexes for clustering or comparison, unless the analysis uses the `complex` chain type (in which case all chains in a complex are considered).

  Output:
  - Subset directories with only the relevant chains
- Finding Hits

  Aim: Cluster structures or compare them to a structure of interest using FoldSeek.

  Output:
  - Clustering summary or match summary (if no structure of interest is specified)
  - Match files and summaries per structure of interest (if specified)
Run the SNAC-DB pipeline unit test:
```bash
pytest unit_tests/data_curation_pipeline.py
```
This checks:
- Expected outputs are created
- Naming conventions are preserved
- Python 3.10.16 (PSF License)
- Biopython (BSD 3-Clause License)
- FoldSeek (GNU General Public License ver. 3 (GPLv3))
- ANARCI (BSD 3-Clause License)
- SciPy (BSD 3-Clause License)
- TQDM (Mozilla Public License (MPL) v. 2.0)
- Pytest (MIT License)
- networkx (BSD 3-Clause License)
- Pandas (BSD 3-Clause License)
- MMseqs2 (MIT License)
We would like to acknowledge the developers of the following tools and libraries that are integral to the functionality of this pipeline:
- FoldSeek: Used for fast and sensitive structural alignment and clustering in the hit-finding and test dataset pipelines. FoldSeek enables high-throughput identification of structural matches, which is essential for both redundancy reduction and benchmarking tasks.
- Reference: van Kempen, M., Kim, S.S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C.L.M., Söding, J., and Steinegger, M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
- License: GNU General Public License ver. 3 (GPLv3)
- ANARCI: Used for antibody and NANOBODY® VHH sequence annotation and CDR identification.
- Reference: Dunbar, J., & Deane, C. (2015). ANARCI: antigen receptor numbering and receptor classification. Bioinformatics, 32(2), 298–300.
- License: BSD 3-Clause License
- MMseqs2: Used for fast and sensitive many-against-many sequence searching and clustering. MMseqs2 enables us to fill in missing residues based on hits against the UniRef and SwissProt databases.
- Reference: Steinegger, M. and Söding, J., (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 35(11), pp.1026-1028.
- License: MIT License
This work has been accepted at the DataWorld workshop at the International Conference on Machine Learning (ICML). We also plan to submit this work for publication. For now, please use the following citation when referencing our work:
```bibtex
@inproceedings{gupta2025snacdb,
  title={{SNAC}-{DB}: The Hitchhiker{\textquoteright}s Guide to Building Better Predictive Models of Antibody \& {NANOBODY}{\textregistered} {VHH}{\textendash}Antigen Complexes},
  author={Abhinav Gupta and Bryan Munoz Rivero and Jorge Roel-Touris and Ruijiang Li and Norbert Furtmann and Yves Fomekong Nanfack and Maria Wendt and Yu Qiu},
  booktitle={DataWorld: Unifying Data Curation Frameworks Across Domains, Workshop at the 42nd International Conference on Machine Learning (ICML 2025)},
  year={2025},
  address={Vancouver, Canada},
  url={https://openreview.net/forum?id=68DcIpDaHK}
}
```