Data Curation Pipeline for Antibodies and Nanobodies

Overview

This repository contains pipelines used in the Structural NANOBODY® VHH and Antibody Complex Database (SNAC-DB) project. The goal of SNAC-DB is to extract all possible antibody (Ab) and NANOBODY® VHH (Nb) complexes from structural files (e.g., PDBs or CIFs), and curate and analyze them for downstream applications.

Specifically, the repository supports:

SNAC-DB Pipeline: Extract and curate non-redundant Ab/Nb complexes from structural files.
Test Dataset Pipeline: Create a structurally novel test dataset for benchmarking docking models.
Finding Hits Pipeline: Identify structures in a SNAC-DB-curated dataset that are structurally similar to a specified query structure.

All pipelines are orchestrated via bash scripts, but underlying Python modules provide granular control for customization and debugging.

The curated version of the ready-to-use SNAC-DB dataset is available at: https://zenodo.org/records/16226208

Repository Structure

.
├── src/                        # Python source code
│   ├── snacdb/
│   │    ├──utils/
│   │    │  ├── parallelize.py
│   │    │  ├── pdb_utils_clean_parse.py
│   │    │  ├── residue_constants.py
│   │    │  ├── sequence_utils.py
│   │    │  ├── structure_utils.py
│   │    ├── curation_identify_complexes.py 
│   │    ├── curation_process_PDBs.py 
│   │    ├── curation_redundant.py
│   │    ├── curation_filter_complexes.py
│   │    ├── patch.py
│   ├── analysis_fill_in_unresolved_residues.py  
│   ├── analysis_finding_hits.py  
│   ├── curation_SNAC_DB_Pipeline.py
│   ├── testdata_classificaiton.py 
│   ├── testdata_setup.py 
│   ├── testdata_summary.py 
├── unit_tests/                # Unit testing scripts
│   ├── data_curation_pipeline.py
│   ├── pdb_files_test         # Sample pdb files
├── data_curation.sh           # SNAC-DB pipeline launcher
├── test_dataset.sh            # Test dataset curation launcher
├── finding_hits.sh            # Hit-finding pipeline launcher
├── environment.yml            # Conda environment setup
├── requirements.txt           # pip requirements
├── setup.py                   # Setup file
├── pyproject.toml
├── README.md

Getting Started

Clone the Repository

git clone <repository-url>
cd <repository-name>

Set Up the Environment
- Create a conda environment:
```
conda create -n snacdb_env python=3.10 -y
conda activate snacdb_env
```
- Install pip dependencies from PyPI:
```
pip install -r requirements.txt
```
- Install the pipeline package:
```
pip install .
```
- Install ANARCI from its GitHub repository: https://github.com/oxpig/ANARCI
  
  Please note that the original ANARCI tool does not support ambiguous residues represented by the amino acid 'X'. To accommodate structural data containing unknown residues, we provide a simple patch that modifies the installed ANARCI source so that it handles 'X' residues gracefully during sequence parsing and annotation.
  
  Run this patch (src/snacdb/patch.py) using the command:
```
snacdb-patch-anarci
```

Usage

General Note: All three pipelines can be run via bash scripts. For deeper customization, you may directly execute specific Python modules.

SNAC-DB Pipeline

This pipeline curates a folder of .pdb or .cif structure files and finds and extracts antibody and NANOBODY® VHH complexes. We used it to curate all the structures deposited in the Protein Data Bank (PDB) and create the SNAC-DB database.
- Run via Bash
```
bash data_curation.sh path/to/<input_directory>
```
  Optional parameters (passed after the input directory):
  - Show intermediate outputs: True
  - Remove redundant complexes: True

  Example with both options:
```
bash data_curation.sh path/to/<input_directory> True True
```
- Advanced Control
  
  You can run individual Python scripts from the src/ directory:
  - The scripts must be run in sequential order. For example, curation_process_PDBs.py must be run before curation_identify_complexes.py.
  - Logs and summaries are generated at each step
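
If you prefer to drive the launcher from Python (for example inside a larger workflow), the call above can be wrapped with `subprocess`. This is a minimal sketch, not part of the repository: the helper names `build_curation_command` and `run_curation` are hypothetical, and passing the literal string `False` for a disabled option is an assumption, since the README only documents `True`.

```python
import subprocess
from pathlib import Path


def build_curation_command(input_dir, show_intermediate=False,
                           remove_redundant=False,
                           launcher="data_curation.sh"):
    """Return the argv list for the SNAC-DB launcher script (hypothetical helper)."""
    cmd = ["bash", launcher, str(Path(input_dir))]
    # The optional flags are positional, so both are passed whenever either
    # one is enabled. Passing "False" is an assumption about the script.
    if show_intermediate or remove_redundant:
        cmd += [str(show_intermediate), str(remove_redundant)]
    return cmd


def run_curation(input_dir, **options):
    """Run the launcher and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_curation_command(input_dir, **options), check=True)


# Only builds and prints the command; nothing is executed here.
print(build_curation_command("path/to/input_directory", True, True))
```

Building the argument list separately from running it makes the command easy to log or inspect before a long curation run starts.
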
Pipeline to Create Test/Benchmarking Dataset

This is an optional pipeline that can be used to create a structurally distinct benchmarking dataset for model evaluation.
- First install FoldSeek: https://github.com/steineggerlab/foldseek.
- Create a structurally novel test set:
```
bash test_dataset.sh path/to/<query_directory> path/to/<target_directory>
```
  Outputs include:
  - Test_Data/: curated non-redundant dataset
  - Test_Data_Summary.csv: records the stage at which each complex passed or failed, along with its closest matches and their TM-scores
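
Because this pipeline depends on an external FoldSeek installation, a quick pre-flight check can save a failed run. The sketch below assumes the executable is named `foldseek` and is discoverable on `PATH`; the helper names are hypothetical and not part of the repository.

```python
import shutil


def find_tool(name="foldseek"):
    """Return the absolute path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def require_tool(name="foldseek"):
    """Return the tool's path, or raise a readable error if it is not installed."""
    path = find_tool(name)
    if path is None:
        raise RuntimeError(
            f"{name} was not found on PATH; install it before running the pipeline"
        )
    return path


# Report the status without raising, so this is safe to run anywhere.
print("foldseek:", find_tool() or "not found")
```
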
Pipeline to Find Hits

This is an optional pipeline that can be used to search for structurally similar complexes using FoldSeek multimer-search.
```
bash finding_hits.sh <reference_dir> <chain_type> [<structure_of_interest> <is_curated>]
```
- Required:
  - reference_dir: Curated dataset (output from SNAC-DB pipeline)
  - chain_type: Chain type to compare (ligand, antigen, or complex)
- Optional:
  - structure_of_interest: Structure file to query against the reference
  - is_curated: Set to True if the query is SNAC-DB formatted
- Behavior:
  - If no structure is provided, a clustering operation is performed on the dataset.
  - If a query is provided, matching structures will be identified and ranked by TM-score.
- Outputs appear in:
```
<reference_dir>_<chain_type>_cluster/
```
  Includes match directories and summary CSV files.
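
The argument rules and the output location described above can be sketched in Python. This is illustrative only: the helper names are hypothetical, the output path is assumed to be the reference directory path with `_<chain_type>_cluster` appended, and passing `False` for `is_curated` is an assumption.

```python
from pathlib import Path

VALID_CHAIN_TYPES = ("ligand", "antigen", "complex")


def expected_output_dir(reference_dir, chain_type):
    """Return <reference_dir>_<chain_type>_cluster as a Path."""
    if chain_type not in VALID_CHAIN_TYPES:
        raise ValueError(
            f"chain_type must be one of {VALID_CHAIN_TYPES}, got {chain_type!r}"
        )
    ref = Path(reference_dir)
    # Path drops a trailing slash, so "db/" and "db" give the same result.
    return ref.with_name(f"{ref.name}_{chain_type}_cluster")


def build_hits_command(reference_dir, chain_type, structure=None, is_curated=False):
    """Return the argv list for finding_hits.sh (hypothetical helper)."""
    expected_output_dir(reference_dir, chain_type)  # validates chain_type
    cmd = ["bash", "finding_hits.sh", str(reference_dir), chain_type]
    # Without a query structure the pipeline clusters the reference dataset.
    if structure is not None:
        cmd += [str(structure), str(is_curated)]
    return cmd


print(build_hits_command("data/curated", "antigen"))
print(expected_output_dir("data/curated", "antigen"))
```
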

Filling in Unresolved Residues

In addition to the three main pipelines, this repository includes a helpful script for improving the quality of curated complexes:
analysis_fill_in_unresolved_residues.py.

This script attempts to resolve unknown residues (typically marked as 'X') by searching the SwissProt and UniRef90 databases using MMseqs2. The goal is to reduce ambiguity in antigen chains, which can improve downstream structural and sequence-based analyses.

Script purpose:
- Scans curated complexes for unresolved residues.
- Searches SwissProt and/or UniRef90 to find high-identity matches.
- Fills in missing residues based on sequence alignment and similarity.
- Writes corrected FASTA files for the complexes where a correction was possible.

Usage:

Install MMseqs2: conda install bioconda::mmseqs2

Run:

python src/analysis_fill_in_unresolved_residues.py \
  --input_dir path/to/curated_complexes \
  --swissprot /path/to/swissprot_db \
  --uniref90 /path/to/uniref90_db

Arguments:
- --input_dir (str, required):
  Path to a directory containing complexes curated by the SNAC-DB pipeline.
- --swissprot (str, optional):
  Path to a SwissProt MMseqs2 database.
  If not provided or the path doesn’t exist, it will be downloaded automatically to the parent of the input directory.
  ⚠️ Do not set this to None — SwissProt is required.
- --uniref90 (str, optional):
  Path to a UniRef90 MMseqs2 database.
  If not provided or doesn't exist, it will be downloaded.
  If explicitly set to "None", UniRef90 will be skipped.
Output:
- <input_name>_cleaning_complexes/:
  Directory with corrected sequences, intermediate results, logs, temporary databases, and alignment files.

Not all unresolved residues will be filled; only those with a sufficiently close match to known entries are. To track processing steps and potential issues, we recommend logging the output:

python src/analysis_fill_in_unresolved_residues.py \
  --input_dir path/to/my_dir > path/to/my_dir_cleaning_complexes/fill_in_unresolved_residues.log 2>&1

Note: You can use the FASTA files with the corrected sequences to update the PDB files corresponding to those complexes. Given the intricate nature of these corrections, we do not automatically create the updated PDB files.
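
To see how many unresolved residues remain before and after correction, the 'X' characters in a FASTA file can simply be counted. The sketch below uses only the standard library. The record names and sequences are toy data invented for illustration, and the layout of the script's real FASTA output is not assumed beyond standard FASTA syntax.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines; join them.
            records[header] += line
    return records


def count_unresolved(records):
    """Count 'X' residues in each sequence."""
    return {name: seq.upper().count("X") for name, seq in records.items()}


# Toy example (hypothetical names and sequences).
example = """>chain_A_original
MKTXXAYIAKQR
>chain_A_corrected
MKTLLAYIAKQR
"""
print(count_unresolved(read_fasta(example)))
```

To apply it to a real file, pass `Path(fasta_path).read_text()` to `read_fasta`.
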

Naming Convention of Complexes

Each PDB and NPY file follows a specific naming convention.

Base Name: The identifier of the structure. It can be a 4-character PDB ID from the RCSB database or any other unique name.
- Ex: 8FSL in 8FSL-ASU1-VHH_B-Ag_C
Bioassembly Number: The bioassembly number of the structure. If the original structure provides no bioassembly number, a default value of 0 is assigned. A value of 0 usually refers to the asymmetric unit.
- Ex: 1 in 7zmr-ASU1-VHH_K-Ag_A
Frame (Optional): Structures determined by techniques such as NMR can contain multiple frames in a single structure file. Because the frames differ slightly from one another, the pipeline preserves them as separate complexes.
- Ex: 0 in 7y7m-ASU1-frame0-VH_F-VL_G-Ag_C
Structure Information: Describes the type of structure and the chains associated with it. The possible structure types are antibodies (VH_VL) and nanobodies (VHH).
- Ex: VHH_G-Ag_B_C in 8k46-ASU0-VHH_G-Ag_B_C
Replicate (Optional): Added when multiple ligand chains share the same chain ID in the structure file. If such chains bind the same antigen, their complexes would otherwise receive identical names and overwrite one another, so the replicate keyword keeps each name unique.
- Ex: 0 in 6ul6-ASU0-VHH_B-Ag_A-replicate0

As mentioned above, Frame and Replicate are optional keywords in the naming scheme and apply only in the cases specified above. Note that, unlike the other keywords, the integer assigned to the replicate keyword is not guaranteed to be the same if you rerun the pipeline.
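
The naming scheme can be parsed with a regular expression. The following is a sketch inferred only from the examples shown in this section, not the pipeline's own parser. It assumes every name has either a VHH chain or a VH/VL pair followed by at least one antigen chain, and that chain IDs contain no hyphens or underscores. Names outside those assumptions are rejected.

```python
import re

NAME_RE = re.compile(
    r"^(?P<base>.+?)"
    r"-ASU(?P<assembly>\d+)"
    r"(?:-frame(?P<frame>\d+))?"
    r"-(?:VHH_(?P<vhh>[^-_]+)|VH_(?P<vh>[^-_]+)-VL_(?P<vl>[^-_]+))"
    r"-Ag_(?P<antigen>[^-]+)"
    r"(?:-replicate(?P<replicate>\d+))?$"
)


def parse_complex_name(name):
    """Split a complex name (with or without .pdb/.npy) into its fields."""
    stem = name.rsplit(".", 1)[0] if name.endswith((".pdb", ".npy")) else name
    match = NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unrecognized complex name: {name!r}")
    g = match.groupdict()
    return {
        "base": g["base"],
        "assembly": int(g["assembly"]),
        "frame": None if g["frame"] is None else int(g["frame"]),
        "kind": "VHH" if g["vhh"] else "VH_VL",
        "ligand_chains": [g["vhh"]] if g["vhh"] else [g["vh"], g["vl"]],
        "antigen_chains": g["antigen"].split("_"),
        "replicate": None if g["replicate"] is None else int(g["replicate"]),
    }


for example in ("8k46-ASU0-VHH_G-Ag_B_C", "7y7m-ASU1-frame0-VH_F-VL_G-Ag_C"):
    print(parse_complex_name(example))
```

Because the replicate integer can change between pipeline runs, comparisons across runs should ignore the `replicate` field.
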

Pipeline Details

SNAC-DB Pipeline

Process PDBs

Aim: Clean input structures, fill in missing residues, and identify VH/VL/Ag chains.

Output:
- Cleaned PDB files
- .npy files with annotations
- <input_dir>__parsed_file_chains.csv
Identify Complexes

Aim: Extracts antibody and NANOBODY® VHH complexes from processed structures.

Output:
- PDB files of isolated complexes
- .npy annotation files
- <input_dir>_complexes_curated.csv
Filter Complexes

Aim: Applies more stringent filtering to exclude non-interacting chains.

Output:
- Filtered complex structures
- Summary CSV: <input_dir>_outputs_multichain_filter.csv
Remove Redundancies

Aim: Eliminates duplicate complexes based on contact map comparison.

Output:
- Updated complex directory and summary CSV
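
To make the idea of a contact map comparison concrete, the sketch below builds the set of ligand-antigen residue pairs within a distance cutoff and compares two complexes by Jaccard overlap. This is only an illustration and not the repository's implementation: the 8.0 Å cutoff, the one-coordinate-per-residue representation, the Jaccard measure, and all names and coordinates are assumptions, and the actual criterion in curation_redundant.py may differ.

```python
from math import dist


def contact_pairs(ligand, antigen, cutoff=8.0):
    """Return the (ligand_residue, antigen_residue) pairs within `cutoff`.

    `ligand` and `antigen` map residue ids to (x, y, z) coordinates,
    for example one representative atom per residue (an assumption).
    """
    return {
        (lig_id, ag_id)
        for lig_id, lig_xyz in ligand.items()
        for ag_id, ag_xyz in antigen.items()
        if dist(lig_xyz, ag_xyz) <= cutoff
    }


def jaccard(a, b):
    """Overlap of two contact sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy coordinates (hypothetical): same ligand, two antigen placements.
ligand = {"H:100": (0.0, 0.0, 0.0), "H:101": (3.8, 0.0, 0.0)}
antigen_a = {"A:10": (0.0, 5.0, 0.0), "A:11": (20.0, 0.0, 0.0)}
antigen_b = {"A:10": (0.0, 5.0, 0.0), "A:11": (3.8, 6.0, 0.0)}

contacts_a = contact_pairs(ligand, antigen_a)
contacts_b = contact_pairs(ligand, antigen_b)
print(jaccard(contacts_a, contacts_a), jaccard(contacts_a, contacts_b))
```
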

Pipeline to Create Test/Benchmarking Dataset

Novel Antigens

Aim: Identify complexes with structurally dissimilar antigens.

Output:
- Passed and failed directories based on TM-score
Novel Epitopes

Aim: Compare full complex structures to detect epitope novelty.

Output:
- Passed and failed directories based on TM-score
Novel Conformations

Aim: Detect small but meaningful differences in binding conformation.

Output:
- Passed and failed directories based on multi-chain TM-score
Unique Complexes

Aim: Ensure dataset is structurally diverse internally.

Output:
- Final dataset
- Test_Data_Summary.csv
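
Each stage above sorts complexes into passed and failed groups according to a TM-score. The sketch below shows that kind of split in isolation. The threshold of 0.5, the rule that a lower best-match score means more novel, and the example scores are all assumptions for illustration. The pipeline's actual thresholds and stage ordering are defined in the testdata scripts.

```python
def split_by_novelty(best_tm_scores, threshold=0.5):
    """Partition complexes by their best TM-score against the target set.

    A complex passes (is treated as novel) when its closest match scores
    below `threshold`; the default value is an assumed, illustrative cutoff.
    """
    passed, failed = [], []
    for name, score in sorted(best_tm_scores.items()):
        (passed if score < threshold else failed).append(name)
    return passed, failed


# Hypothetical scores for names that follow the SNAC-DB naming scheme.
scores = {
    "8FSL-ASU1-VHH_B-Ag_C": 0.31,
    "7zmr-ASU1-VHH_K-Ag_A": 0.87,
    "8k46-ASU0-VHH_G-Ag_B_C": 0.50,
}
passed, failed = split_by_novelty(scores)
print("passed:", passed)
print("failed:", failed)
```
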

Pipeline to Find Hits

Create Proper Reference Directory

Aim: Extract the relevant chains (ligand or antigen) from curated complexes for clustering or comparison. When chain_type is complex, no extraction is needed because all chains in each complex are used.

Output:
- Subset directories with only relevant chains
Finding Hits

Aim: Cluster structures or compare to a structure of interest using FoldSeek.

Output:
- Clustering summary or match summary (if no structure of interest is specified)
- Match files and summaries per structure of interest (if specified)

Unit Testing

Run the SNAC-DB pipeline unit test:

pytest unit_tests/data_curation_pipeline.py

This checks that:

- Expected outputs are created
- Naming conventions are preserved

Dependencies

- Python 3.10.16 (PSF License)
- Biopython (BSD 3-Clause License)
- FoldSeek (GNU General Public License ver. 3 (GPLv3))
- ANARCI (BSD 3-Clause License)
- SciPy (BSD 3-Clause License)
- TQDM (Mozilla Public License (MPL) v. 2.0)
- Pytest (MIT License)
- networkx (BSD 3-Clause License)
- Pandas (BSD 3-Clause License)
- MMseqs2 (MIT License)

Acknowledgements

We would like to acknowledge the developers of the following tools and libraries that are integral to the functionality of this pipeline:

FoldSeek: Used for fast and sensitive structural alignment and clustering in the hit-finding and test dataset pipelines. FoldSeek enables high-throughput identification of structural matches, which is essential for both redundancy reduction and benchmarking tasks.
- Reference: van Kempen, M., Kim, S.S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C.L.M., Söding, J., and Steinegger, M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
- License: GNU General Public License ver. 3 (GPLv3)
ANARCI: Used for antibody and NANOBODY® VHH sequence annotation and CDR identification.
- Reference: Dunbar, J., & Deane, C. (2015). ANARCI: antigen receptor numbering and receptor classification. Bioinformatics, 32(2), 298–300.
- License: BSD 3-Clause License
MMseqs2: Used for fast and sensitive many-against-many sequence searching and clustering. MMseqs2 enables us to fill in missing residues based on hits against the UniRef and SwissProt databases.
- Reference: Steinegger, M. and Söding, J., (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 35(11), pp.1026-1028.
- License: MIT License

Citations

This work has been accepted at the DataWorld workshop at the International Conference on Machine Learning (ICML 2025). We plan to submit it for publication as well. For now, please use this citation when referencing our work:

@inproceedings{
  gupta2025snacdb,
  title={{SNAC}-{DB}: The Hitchhiker{\textquoteright}s Guide to Building Better Predictive Models of Antibody \& {NANOBODY}{\textregistered} {VHH}{\textendash}Antigen Complexes},
  author={Abhinav Gupta and Bryan Munoz Rivero and Jorge Roel-Touris and Ruijiang Li and Norbert Furtmann and Yves Fomekong Nanfack and Maria Wendt and Yu Qiu},
  booktitle={DataWorld: Unifying Data Curation Frameworks Across Domains, Workshop at the 42nd International Conference on Machine Learning (ICML 2025)},
  year={2025},
  address={Vancouver, Canada},
  url={https://openreview.net/forum?id=68DcIpDaHK}
}
