AI-for-Bio/biollms at main · dlopezyse/AI-for-Bio · GitHub

Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md

README.md

Large Language Models in Biology 🤖

Name	Description
BioBERT	BioBERT, a biomedical language representation model designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. Please refer to the paper BioBERT: a pre-trained biomedical language representation model for biomedical text mining for more details.
BioGPT	The BioGPT model was proposed in BioGPT: generative pre-trained transformer for biomedical text generation and mining by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. BioGPT follows the Transformer language model backbone, and is pre-trained on 15M PubMed abstracts from scratch.
BioinspiredLLM	BioinspiredLLM was finetuned with a corpus of over a thousand peer-reviewed articles in the field of structural biological and bio-inspired materials and can be prompted to recall information, assist with research tasks, and function as an engine for creativity. The model has proven that it is able to accurately recall information about biological materials and is further enhanced with enhanced reasoning ability, as well as with retrieval-augmented generation to incorporate new data during generation that can also help to traceback sources, update the knowledge base, and connect knowledge domains.
BioMedGPT	BioMedGPT is the first commercial-friendly multimodal biomedical foundation model jointly released by PharMolix and the Institute of AI Industry Research (AIR). It aligns the language of life (molecular structures and protein sequences) with human natural language, performing on par with human experts on biomedical QA benchmarks and demonstrating powerful performance in cross-modal molecule and protein question-answering tasks.
Chemformer	The Chemformer project aimed to pre-train a BART transformer language model [3] on molecular SMILES strings [4] by optimising a de-noising objective. We hypothesized that pre-training would lead to improved generalisation, performance, training speed and validity on downstream fine-tuned tasks. The pre-trained model was tested on downstream tasks such as reaction prediction, retrosynthetic prediction, molecular optimisation and molecular property prediction
ChemLLM	ChemLLM is a comprehensive framework that features the first LLM dedicated to chemistry. It also includes ChemData, a dataset specifically designed for instruction tuning, and ChemBench, a robust benchmark covering nine essential chemistry tasks. ChemLLM is adept at performing various tasks across chemical disciplines with fluid dialogue interaction.
LlaSMol	LlaSMol (large language models for small molecules) is a series of LLMs built for conducting various chemistry tasks. Specifically, we use Galactica, Llama 2, Code Llama, and Mistral as the base models, and conduct instruction tuning with LoRA on our SMolInstruct dataset.
Mol-BERT	We present a novel end-to-end deep learning framework, named Mol-BERT, that combines an effective molecular representation with pretrained BERT model tailored for molecular property prediction. Specifically, a large-scale prediction BERT model is pretrained to generate the embedding of molecular substructures, by using four million unlabeled drug SMILES (i.e., ZINC 15 and ChEMBL 27). Then, the pretrained BERT model can be fine-tuned on various molecular property prediction tasks.
Mol-GPT	The representation of molecules in SMILES notation as a string of characters enables the usage of state of the art models in natural language processing, such as Transformers, for molecular design in general. Inspired by generative pre-training (GPT) models that have been shown to be successful in generating meaningful text, we train a transformer-decoder on the next token prediction task using masked self-attention for the generation of druglike molecules in this study. We show that our model, MolGPT, performs on par with other previously proposed modern machine learning frameworks for molecular generation in terms of generating valid, unique, and novel molecules. Furthermore, we demonstrate that the model can be trained conditionally to control multiple properties of the generated molecules.
OpenBioLLM	OpenBioLLM is an advanced open-source language model designed specifically for the biomedical domain. Developed by Saama AI Labs, this model leverages cutting-edge techniques to achieve state-of-the-art performance on a wide range of biomedical tasks.
ProGen	Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science. We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations. We train a 1.2B-parameter language model, ProGen, on ~280M protein sequences conditioned on taxonomic and keyword tags such as molecular function and cellular component. This provides ProGen with an unprecedented range of evolutionary sequence diversity and allows it to generate with fine-grained control as demonstrated by metrics based on primary sequence similarity, secondary structure accuracy, and conformational energy.
ProteinBERT	ProteinBERT is a protein language model pretrained on ~106M proteins from UniRef90. The pretrained model can be fine-tuned on any protein-related task in a matter of minutes. ProteinBERT achieves state-of-the-art performance on a wide range of benchmarks. ProteinBERT is built on Keras/TensorFlow. ProteinBERT's deep-learning architecture is inspired by BERT, but contains several innovations such as global-attention layers that have linear complexity for sequence length (compared to self-attention's quadratic/n^2 growth). As a result, the model can process protein sequences of almost any length, including extremely long protein sequences (of over tens of thousands of amino acids). The model takes protein sequences as inputs, and can also take protein GO annotations as additional inputs (to help the model infer about the function of the input protein and update its internal representations and outputs accordingly).
ProtGPT2	ProtGPT2 is based on the GPT2 Transformer architecture and contains 36 layers with a model dimensionality of 1280, totalling 738 million parameters. ProtGPT2 is a decoder-only transformer model pre-trained on the protein space, database UniRef50 (version 2021_04). The pre-training was done on the raw sequences without FASTA headers. ProtGPT2 was trained in a self-supervised fashion, i.e., the raw sequence data was used during training without including the annotation of sequences. In particular, ProtGPT2 was trained using a causal modelling objective, in which the model is trained to predict the next token (or, in this case, oligomer) in the sequence. By doing so, the model learns an internal representation of proteins and is able to speak the protein language.
SaProt	SaProt is a large-scale general-purpose PLM trained on an extensive dataset comprising approximately 40 million protein sequences and structures. We propose a structure-aware vocabulary for protein language modeling. The vocabulary is constructed by encoding the protein structure into discrete 3D tokens by using the foldseek. We combine the residue tokens and the structure tokens to form a structure-aware sequence. Through large-scale pre-training, our model, i.e. SaProt, can learn the relationship between the structure and the sequence.
SELFormer	One approach to efficiently learning molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. Most of the methods proposed so far utilize SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact, and expressive notation, SELFIES, as input in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks.
Text2Mol	We propose a new task, Text2Mol, to retrieve molecules using natural language descriptions as queries. Natural language and molecules encode information in very different ways, which leads to the exciting but challenging problem of integrating these two very different modalities. Although some work has been done on text-based retrieval and structure-based retrieval, this new task requires integrating molecules and natural language more directly. Moreover, this can be viewed as an especially challenging cross-lingual retrieval problem by considering the molecules as a language with a very unique grammar. We construct a paired dataset of molecules and their corresponding text descriptions, which we use to learn an aligned common semantic embedding space for retrieval. We extend this to create a cross-modal attention-based model for explainability and reranking by interpreting the attentions as association rules.
Uni-Mol	Uni-Mol is a universal 3D molecular pretraining framework that significantly expands representation capacity and application scope in drug design. The framework comprises two models: a molecular pretraining model that has been trained using 209M molecular 3D conformations, and a pocket pretraining model that has been trained using 3M candidate protein pocket data. These two models can be used independently for different tasks and are combined for protein-ligand binding tasks. Uni-Mol has demonstrated superior performance compared to the state-of-the-art (SOTA) in 14 out of 15 molecular property prediction tasks. Moreover, Uni-Mol has achieved exceptional accuracy in 3D spatial tasks, such as protein-ligand binding pose prediction and molecular conformation generation.