protein-scatter

A map of proteins for exploration and discovery, designed for exploring many local parts of proteins all at once. It is accompanied by a paper that explains everything in depth.

[Demo video: Screen.Recording.2024-03-11.at.3.33.39.PM.mov]

This code uses Foldseek's 3Di representation instead of amino acids to train a sequence model. The embeddings from the sequence model are then fed into UMAP for a global visualization.
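As an illustration of the first step, here is a minimal sketch of how a 3Di string could be turned into integer token IDs for a sequence model. It is not taken from the training code: the special tokens, the ID ordering, and the `encode_3di` helper are all assumptions made for this example. The only fact relied on is that 3Di strings are written with 20 state letters, the same letters used for amino acids.

```python
# Hypothetical tokenizer sketch; the real training code may differ.
# 3Di strings use 20 state letters (the same letters as the amino-acid alphabet).
THREE_DI_LETTERS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed special tokens and ordering, chosen only for this example.
PAD, UNK = "<pad>", "<unk>"
VOCAB = {tok: i for i, tok in enumerate([PAD, UNK, *THREE_DI_LETTERS])}


def encode_3di(seq, max_len=None):
    """Map a 3Di string to a list of token IDs, optionally padded or truncated."""
    # Upper-case first so lower-case 3Di output is handled as well.
    ids = [VOCAB.get(ch, VOCAB[UNK]) for ch in seq.upper()]
    if max_len is not None:
        # Truncate long sequences, then right-pad short ones with the PAD id.
        ids = ids[:max_len]
        ids = ids + [VOCAB[PAD]] * (max_len - len(ids))
    return ids


if __name__ == "__main__":
    print(encode_3di("DVVLVV", max_len=8))
```

The resulting ID lists are what a sequence model would consume; the per-protein embeddings it produces are then what gets passed to UMAP.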

What makes this system different? Here I explicitly model each protein as the interactions of its internal 3D structure. I then compare across many different proteins to build a global visualization.

Models and Datasets

If you want to reproduce these results, check the training code in the training/ directory.

Note that the UMAP transformation was done in Python notebooks, not in the Python code.

The weights are saved as checkpoint-large-3.pt in this Google Drive, along with additional training data.

Code References

See the paper, protein-scatter.pdf, for further references beyond the code references.