A hub for InstaDeep's cutting-edge deep learning models and research for genomics, originating from the Nucleotide Transformer and its evolutions.
Welcome to the InstaDeep AI for Genomics repository! This is where we feature our collection of transformer-based genomic language models and innovative downstream applications. Our work in the genomics space began with The Nucleotide Transformer, developed in collaboration with NVIDIA and TUM and trained on Cambridge-1, and has expanded to include projects like the Agro Nucleotide Transformer (in collaboration with Google, trained on TPU-v4 accelerators), SegmentNT, and ChatNT.
Our mission is to provide the scientific community with powerful, reproducible, and accessible tools to unlock new insights from biological sequences. This repository serves as the central place for sharing our models, inference code, pre-trained weights, and research contributions in the genomics domain, with explorations into future areas like single-cell transcriptomics.
We are thrilled to open-source these works and provide the community with access to the code and pre-trained weights for our diverse set of genomics language models and segmentation models.
This section highlights the key models and research directions from our team. Each entry provides a brief overview and links to detailed documentation, publications, and resources. (Detailed code examples, setup for specific models, and in-depth figures are located in their respective documentation pages within the `./docs` folder.)
Our foundational language models leverage DNA sequences from over 3,200 diverse human genomes and 850 genomes from a wide range of species. They deliver highly accurate molecular phenotype predictions, outperforming existing methods. This family includes multiple variants (e.g., 500M_human_ref, 2B5_1000G, NT-v2 series), which are detailed further in the specific documentation.
- Keywords: Foundational Model, Genomics, DNA/RNA, Pre-trained, Sequence Embeddings, Phenotype Prediction
- ➡️ Model Details, Variants & Usage
- 📜 Read the Paper (Nature Methods 2025)
- 🤗 Hugging Face Collection
- 🚀 Fine-tuning Notebooks (HF): (LoRA and regular)
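For intuition on how sequence embeddings are typically derived from a model like this, here is a minimal, hypothetical sketch of masked mean pooling, which turns per-token hidden states into one fixed-size sequence embedding. The helper name `mean_pool` and the array shapes are illustrative assumptions, not the repository's API; see the model documentation for the actual usage.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over real (non-padding) positions.

    token_embeddings: (seq_len, hidden_dim) hidden states from the model.
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)                 # (hidden_dim,)
    count = mask.sum()                                             # number of real tokens
    return summed / np.maximum(count, 1.0)

# Toy example: 4 tokens (last one is padding), hidden size 3.
emb = np.array([[1.0, 0.0, 2.0],
                [3.0, 0.0, 2.0],
                [2.0, 0.0, 2.0],
                [9.0, 9.0, 9.0]])  # padding row, excluded by the mask
mask = np.array([1, 1, 1, 0])
print(mean_pool(emb, mask))  # → [2. 0. 2.]
```

Masking before averaging matters: without it, padding tokens would bias the embedding of every short sequence in a batch.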
A novel foundational large language model trained on reference genomes from 48 plant species, with a predominant focus on crop species. AgroNT demonstrates state-of-the-art performance across several prediction tasks spanning regulatory features, RNA processing, and gene expression in plants.
- Keywords: Plant Genomics, Foundational Model, Crop Science, Gene Expression, Agriculture AI
- ➡️ Model Details & Usage
- 📜 Read the Paper (Communications Biology 2024)
- 🤗 Hugging Face Collection
Segmentation models using transformer backbones (Nucleotide Transformers, Enformer, Borzoi) for predicting genomic elements at single-nucleotide resolution. SegmentNT, for instance, predicts 14 different classes of human genomic elements in sequences up to 30 kb (generalizing to 50 kb) and demonstrates superior performance.
- Keywords: Genome Segmentation, Single-Nucleotide Resolution, Genomic Elements, U-Net, Enformer, Borzoi
- ➡️ Model Details & Usage (Covers SegmentNT, SegmentEnformer, SegmentBorzoi)
- 📜 Read the Paper (bioRxiv preprint)
- 🤗 Hugging Face Collection
- 🚀 SegmentNT Inference Notebook (HF)
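To illustrate the kind of post-processing that single-nucleotide predictions invite, here is a hedged sketch in plain Python (not SegmentNT's actual code) that thresholds per-nucleotide probabilities for one element class and merges consecutive positive positions into genomic intervals. The function name and the 0.5 threshold are illustrative assumptions.

```python
def probs_to_intervals(probs, threshold=0.5):
    """Merge runs of positions whose probability is >= `threshold`
    into half-open (start, end) intervals, 0-based."""
    intervals = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # a run of positive positions opens here
        elif p < threshold and start is not None:
            intervals.append((start, i))   # run closed just before position i
            start = None
    if start is not None:                  # run extends to the end of the sequence
        intervals.append((start, len(probs)))
    return intervals

print(probs_to_intervals([0.1, 0.9, 0.8, 0.2, 0.7]))  # → [(1, 3), (4, 5)]
```

In practice one would apply this per class over the model's (sequence_length, num_classes) probability map, then convert interval coordinates to the genome via the input window's offset.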
A multimodal conversational agent designed with a deep understanding of DNA biological sequences, enabling interactive exploration and analysis of genomic data through natural language.
- Keywords: Conversational AI, Multimodal, DNA Analysis, Genomics Chatbot, Interactive Biology
- ➡️ Model Details & Usage
- 📜 Read the Paper (Nature Machine Intelligence 2025)
- 🤗 ChatNT on Hugging Face
- 🚀 ChatNT Inference Notebook (Jax)
A Nucleotide Transformer model variant trained on 3-mers (codons). This work investigates alternative tokenization strategies for genomic language models and their impact on downstream performance and interpretability.
- Keywords: Genomics, Language Model, Codon, Tokenization, 3-mers, Nucleotide Transformer Variant
- ➡️ Model Details & Usage
- 📜 Read the Paper (Bioinformatics 2024)
- 🤗 Hugging Face Link
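For intuition, here is a minimal sketch of non-overlapping 3-mer (codon) tokenization, the strategy this variant explores. The helper name and the handling of trailing bases are illustrative assumptions, not the model's actual tokenizer.

```python
def tokenize_codons(seq: str) -> list[str]:
    """Split a DNA sequence into non-overlapping 3-mers (codons).
    A trailing remainder shorter than 3 nt is kept as its own token."""
    seq = seq.upper()
    cut = len(seq) - len(seq) % 3          # last index that completes a full codon
    tokens = [seq[i:i + 3] for i in range(0, cut, 3)]
    if seq[cut:]:                          # leftover 1-2 nt, if any
        tokens.append(seq[cut:])
    return tokens

print(tokenize_codons("ATGGCCTAA"))  # → ['ATG', 'GCC', 'TAA']
print(tokenize_codons("ATGGC"))      # → ['ATG', 'GC']
```

Compared with single-nucleotide tokens, 3-mer tokens shorten sequences threefold at the cost of a larger vocabulary (64 codons plus special tokens), which is one of the trade-offs the paper examines.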
A model designed for learning isoform-aware embeddings directly from RNA-seq data, enabling a deeper understanding of transcript-specific expression and regulation.
- Keywords: RNA-seq, Transcriptomics, Isoforms, Gene Expression, Embeddings
- ➡️ Model Details & Usage
- 📜 Read the Paper (NeurIPS 2024)
- 🤗 Hugging Face Link
Our foundational transformer model for single-cell and spatial transcriptomics data. sCT aims to learn rich representations from complex, high-dimensional single-cell datasets to improve various downstream analytical tasks.
- Keywords: Single-cell RNA-seq, Spatial Transcriptomics, Foundational Model, Transformer, Gene Expression
- ➡️ Model Details & Usage
- 📜 Read the Paper (OpenReview preprint)
- 🤗 Hugging Face Link
- 🚀 sCT Inference Notebook (HF)
- Built on Strong Foundations: Leveraging large-scale pre-training and diverse genomic datasets.
- Cutting-Edge Research: Incorporating the latest advancements in deep learning for biological sequence analysis.
- High Performance: Designed and validated to achieve state-of-the-art results on challenging genomic tasks.
- Open and Accessible: We provide pre-trained weights, usage examples, and aim for easy integration into research workflows.
- Collaborative Spirit: Developed with leading academic and industry partners.
- Focused Expertise: Created by a dedicated team specializing in AI for genomics at InstaDeep.
To begin using models from this repository:
- Clone the repository:
  ```bash
  git clone https://github.com/instadeepai/nucleotide-transformer.git
  cd nucleotide-transformer
  ```
- Set up your environment (virtual environment recommended):
  ```bash
  python -m venv .venv
  source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
  ```
- Install the package and dependencies:
  ```bash
  pip install .  # Installs the local package
  # Or, if a general requirements file is provided:
  # pip install -r requirements.txt
  ```
For detailed instructions on individual models, including specific dependencies, downloading pre-trained weights, and Python usage examples, please refer to their dedicated documentation pages linked in the "Featured Models & Research Evolutions" section above (e.g., `./docs/nucleotide_transformer.md`).
- Questions & Bug Reports: Please use the GitHub Issues page.
- Discussions: For broader discussions or questions, please use the GitHub Discussions tab (if enabled).
- Stay Updated: Follow InstaDeep's official channels for announcements on new model releases and research updates.