aposteriori is a library for the voxelization of protein structures for protein design. It uses conventional PDB files to create fixed, discretized regions of space called "frames". The side-chain atoms of each residue are removed so that a deep learning classifier can determine the identity of each frame based solely on the protein backbone structure.
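To make the idea of a frame concrete, the sketch below maps an atom coordinate to a voxel index in a cubic grid. It is an illustration only, not aposteriori's implementation: the function name `coord_to_voxel` is hypothetical, coordinates are assumed to be already expressed in Å relative to the frame centre, and the default values mirror the CLI defaults documented in the options table (12.0 Å edge, 21 voxels per side).

```python
import math
from typing import Optional, Sequence, Tuple


def coord_to_voxel(
    coord: Sequence[float],
    frame_edge_length: float = 12.0,  # Å, same value as the CLI default
    voxels_per_side: int = 21,        # same value as the CLI default
) -> Optional[Tuple[int, ...]]:
    """Return the voxel index containing `coord`, or None if it lies outside the frame.

    Hypothetical helper: assumes `coord` is given in Å relative to the frame centre.
    """
    half_edge = frame_edge_length / 2.0
    indices = []
    for value in coord:
        # Shift so the frame spans [0, edge), then scale to voxel units.
        index = math.floor((value + half_edge) * voxels_per_side / frame_edge_length)
        if not 0 <= index < voxels_per_side:
            return None
        indices.append(index)
    return tuple(indices)


print(coord_to_voxel((0.0, 0.0, 0.0)))  # centre of the frame -> (10, 10, 10)
print(coord_to_voxel((7.0, 0.0, 0.0)))  # outside a 12 Å frame -> None
```

With these defaults each voxel is 12.0 / 21 ≈ 0.57 Å wide, so an odd number of voxels per side places one voxel exactly at the frame centre.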
pip install aposteriori

Clone the repository and install manually:

git clone https://github.com/wells-wood-research/aposteriori.git
cd aposteriori/
pip install .

You can create a dataset using aposteriori in two ways:
- Python API: aposteriori.make_frame_dataset
- Command-Line Interface (CLI): make-frame-dataset

make-frame-dataset /path/to/pdb_folder

If you want to try out an example, run:

make-frame-dataset tests/testing_files/pdb_files/

To view all options:

make-frame-dataset --help

To predict the identity of a frame, you can use several models from TIMED-Design.
The resulting dataset is stored in an HDF5 file and follows this structure:
├── [PDB Code]  # Each protein structure
│   └── [Chain ID]  # Each chain in the structure
│       └── [Residue ID]  # Each residue as a voxelized 3D frame
│           ├── Voxel Data (NxNxNxC array)
│           ├── .attrs['label']  # Residue three-letter code
│           └── .attrs['encoded_residue']  # One-hot encoded residue identity
├── .attrs['make_frame_dataset_ver']  # Version info
├── .attrs['frame_dims']  # Voxel grid dimensions
├── .attrs['atom_encoder']  # Atom encoding scheme
├── .attrs['frame_edge_length']  # Frame size in Å
└── .attrs['voxels_as_gaussian']  # Whether voxels store Gaussian density maps
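The layout above can be walked with three nested loops. The following is a minimal sketch, not part of aposteriori: `iter_frames` and `summarise` are hypothetical helper names, the file name `frame_dataset.hdf5` is the CLI default, and the metadata attributes are assumed to be stored on the file root. `iter_frames` only relies on `.items()`, so it works on an open `h5py.File` as well as on plain nested dictionaries.

```python
from typing import Any, Iterator, Mapping, Tuple


def iter_frames(dataset: Mapping[str, Any]) -> Iterator[Tuple[str, str, str, Any]]:
    """Yield (pdb_code, chain_id, residue_id, frame) for every residue frame."""
    for pdb_code, chains in dataset.items():
        for chain_id, residues in chains.items():
            for residue_id, frame in residues.items():
                yield pdb_code, chain_id, residue_id, frame


def summarise(path: str = "frame_dataset.hdf5", limit: int = 5) -> None:
    """Print the metadata attributes and the first few residue frames."""
    import h5py  # third-party; imported here so iter_frames works without it

    with h5py.File(path, "r") as dataset:
        # Assumption: metadata such as 'frame_dims' lives on the file root.
        for key, value in dataset.attrs.items():
            print(f"{key}: {value}")
        for count, (pdb, chain, residue, frame) in enumerate(iter_frames(dataset)):
            if count >= limit:
                break
            # frame is an h5py dataset: its shape is (N, N, N, C).
            print(pdb, chain, residue, frame.shape, frame.attrs["label"])


# Quick check on a nested dict that mimics the layout (no HDF5 file needed).
toy = {"1CTF": {"A": {"58": "frame-58", "59": "frame-59"}}}
print(list(iter_frames(toy)))
```

Calling `summarise()` on a real dataset requires `h5py` and a file produced by `make-frame-dataset`.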
import h5py
with h5py.File("frame_dataset.hdf5", "r") as dataset:
    frame = dataset["1CTF"]["A"]["58"][:]  # Get voxelized frame

make-frame-dataset [OPTIONS] STRUCTURE_FILE_FOLDER

| Option | Description |
|---|---|
| -o, --output-folder PATH | Output directory (default: current) |
| -n, --name TEXT | Dataset name (default: frame_dataset.hdf5) |
| -e, --extension TEXT | File extension to process (default: .pdb) |
| --frame-edge-length FLOAT | Frame size in Å (default: 12.0) |
| --voxels-per-side INTEGER | Voxel grid size (default: 21) |
| -p, --processes INTEGER | Number of parallel processes (default: 1) |
| -g, --voxels_as_gaussian BOOLEAN | Store as Gaussian densities instead of binary (default: False) |
| -b, --blacklist_csv PATH | Exclude structures in a CSV file |
| -d, --download_file PATH | Download PDB structures from a CSV list |
| -r, --recursive | Include subdirectories |
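When datasets are built from a script rather than a shell, the options above can be assembled into an argument list and passed to `subprocess`. This is a sketch under assumptions: `build_command` and `run_command` are hypothetical helpers, not part of aposteriori, and they only use flags that appear in the table.

```python
import subprocess
from typing import List, Optional


def build_command(
    structure_folder: str,
    *,
    output_folder: Optional[str] = None,
    name: Optional[str] = None,
    frame_edge_length: Optional[float] = None,
    voxels_per_side: Optional[int] = None,
    processes: Optional[int] = None,
    recursive: bool = False,
) -> List[str]:
    """Assemble a make-frame-dataset call; unset options fall back to CLI defaults."""
    options = {
        "--output-folder": output_folder,
        "--name": name,
        "--frame-edge-length": frame_edge_length,
        "--voxels-per-side": voxels_per_side,
        "--processes": processes,
    }
    command = ["make-frame-dataset"]
    for flag, value in options.items():
        if value is not None:
            command += [flag, str(value)]
    if recursive:
        command.append("--recursive")
    # The structure folder goes last, matching the usage line.
    command.append(structure_folder)
    return command


def run_command(command: List[str]) -> None:
    """Run the assembled command; requires aposteriori to be installed."""
    subprocess.run(command, check=True)


command = build_command("tests/testing_files/pdb_files/", voxels_per_side=21, processes=4)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.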
make-frame-dataset tests/testing_files/pdb_files/

Biological Units are functionally relevant protein structures that avoid artifacts like solvent-exposed hydrophobic residues.

make-frame-dataset /path/to/biounits/ -r

In this case, the recursive flag -r tells aposteriori to look in subfolders.
Download the datasets from:
- European Bioinformatics Institute
- Alternative Sources
For more details, see:
PISCES provides curated protein subsets based on resolution, identity, and quality.
To voxelize structures from a PISCES file:
make-frame-dataset /path/to/biounits/ --pieces-filter-file path/to/pisces/cullpdb_pc90_res1.6_R0.25_d190114_chains8082

The easiest way to install a development version of aposteriori is using Conda:
Create and activate a development environment:
conda create -n aposteriori python=3.8
conda activate aposteriori

Clone and install dependencies:
git clone https://github.com/wells-wood-research/aposteriori.git
cd aposteriori/
pip install -r dev-requirements.txt
pip install .

Run tests:
pytest tests/
If you use aposteriori in your research, please cite it appropriately.
@article{timed,
author = {Castorina, Leonardo V and Ünal, Suleyman Mert and Subr, Kartic and Wood, Christopher W},
title = "{TIMED-Design: Flexible and Accessible Protein Sequence Design with Convolutional Neural Networks}",
journal = {Protein Engineering, Design and Selection},
pages = {gzae002},
year = {2024},
month = {01},
abstract = "{Sequence design is a crucial step in the process of designing or engineering proteins. Traditionally, physics-based methods have been used to solve for optimal sequences, with the main disadvantages being that they are computationally intensive for the end user. Deep learning based methods offer an attractive alternative, outperforming physics-based methods at a significantly lower computational cost. In this paper, we explore the application of Convolutional Neural Networks (CNNs) for sequence design. We describe the development and benchmarking of a range of networks, as well as reimplementations of previously described CNNs. We demonstrate the flexibility of representing proteins in a three-dimensional voxel grid by encoding additional design constraints into the input data. Finally, we describe TIMED-Design, a web application and command line tool for exploring and applying the models described in this paper. The User Interface (UI) will be available at the URL: https://pragmaticproteindesign.bio.ed.ac.uk/timed. The source code for TIMED-Design is available at https://github.com/wells-wood-research/timed-design. chris.wood@ed.ac.uk. Supplementary data are available at Journal Name online.}",
issn = {1741-0126},
doi = {10.1093/protein/gzae002},
url = {https://doi.org/10.1093/protein/gzae002},
eprint = {https://academic.oup.com/peds/advance-article-pdf/doi/10.1093/protein/gzae002/56453873/gzae002.pdf},
}

