instadeepai/nucleotide-transformer

AI Foundation Models for Genomics

A hub for InstaDeep's cutting-edge deep learning models and research for genomics, originating from the Nucleotide Transformer and its evolutions.

License: CC BY-NC-SA 4.0 | Python 3.8 | Jax 0.3.25+ | Hugging Face Models


🎯 Our Focus: Advancing Genomics with AI

Welcome to the InstaDeep AI for Genomics repository! This is where we feature our collection of transformer-based genomic language models and innovative downstream applications. Our work in the genomics space began with The Nucleotide Transformer, developed in collaboration with Nvidia and TUM and trained on Cambridge-1, and has expanded to include projects like the Agro Nucleotide Transformer (in collaboration with Google, trained on TPU-v4 accelerators), SegmentNT, and ChatNT.

Our mission is to provide the scientific community with powerful, reproducible, and accessible tools to unlock new insights from biological sequences. This repository serves as the central place for sharing our models, inference code, pre-trained weights, and research contributions in the genomics domain, with explorations into future areas like single-cell transcriptomics.

We are thrilled to open-source these works and provide the community with access to the code and pre-trained weights for our diverse set of genomics language models and segmentation models.

✨ Featured Models & Research Evolutions

This section highlights the key models and research directions from our team. Each entry provides a brief overview and links to detailed documentation, publications, and resources. (Detailed code examples, setup for specific models, and in-depth figures are now located in their respective documentation pages within the ./docs folder.)


🧬 The Nucleotide Transformer (NT)

Our foundational language models leverage DNA sequences from over 3,200 diverse human genomes and 850 genomes from a wide range of species. These models provide highly accurate molecular phenotype prediction relative to existing methods. The family includes multiple variants (e.g., 500M_human_ref, 2B5_1000G, and the NT-v2 series), which are detailed in the model-specific documentation.


🌾 Agro Nucleotide Transformer (AgroNT)

A novel foundational large language model trained on reference genomes from 48 plant species, with a predominant focus on crop species. AgroNT demonstrates state-of-the-art performance across several prediction tasks in plants, spanning regulatory features, RNA processing, and gene expression.


🧩 SegmentNT (& family: SegmentEnformer, SegmentBorzoi)

Segmentation models using transformer backbones (Nucleotide Transformers, Enformer, Borzoi) for predicting genomic elements at single-nucleotide resolution. SegmentNT, for instance, predicts 14 different classes of human genomic elements in sequences up to 30 kbp (generalizing to 50 kbp) and demonstrates superior performance.
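
To make "single-nucleotide resolution" concrete, here is a minimal sketch of how per-position probabilities for one element class could be turned into genomic intervals. It is not taken from the SegmentNT code: the function name `probabilities_to_segments`, the 0.5 threshold, and the example probabilities are all hypothetical and chosen for illustration only.

```python
from typing import List, Tuple


def probabilities_to_segments(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals where probs >= threshold.

    `probs` holds one probability per nucleotide for a single element
    class (hypothetical input format, assumed for this sketch).
    """
    segments: List[Tuple[int, int]] = []
    start = None  # start index of the currently open segment, if any
    for pos, p in enumerate(probs):
        if p >= threshold and start is None:
            start = pos  # open a new segment
        elif p < threshold and start is not None:
            segments.append((start, pos))  # close the open segment
            start = None
    if start is not None:
        # The sequence ended while a segment was still open.
        segments.append((start, len(probs)))
    return segments


# Made-up per-nucleotide probabilities for one class.
example_probs = [0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.95]
print(probabilities_to_segments(example_probs))  # [(1, 4), (6, 8)]
```

With several classes, the same function would simply be applied once per class, since each nucleotide can in principle belong to more than one element type.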


💬 ChatNT

A multimodal conversational agent designed with a deep understanding of DNA biological sequences, enabling interactive exploration and analysis of genomic data through natural language.


3️⃣ Codon-NT (Exploring 3-mer Tokenization)

A Nucleotide Transformer model variant trained on 3-mers (codons). This work investigates alternative tokenization strategies for genomic language models and their impact on downstream performance and interpretability.
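
As an illustration of what 3-mer tokenization means, the sketch below splits a DNA sequence into non-overlapping 3-mers. It is a simplified stand-in, not the Codon-NT tokenizer: the function name `tokenize_3mers`, the `<unk>` token, and the handling of leftover nucleotides are assumptions made for this example.

```python
from typing import List


def tokenize_3mers(sequence: str, unknown_token: str = "<unk>") -> List[str]:
    """Split a DNA sequence into non-overlapping 3-mer tokens.

    Assumptions of this sketch (not confirmed by the repository):
    - any 3-mer containing a character outside A/C/G/T maps to `unknown_token`;
    - trailing nucleotides that do not fill a 3-mer become single-base tokens.
    """
    seq = sequence.upper()
    valid = set("ACGT")
    tokens: List[str] = []
    full = len(seq) - len(seq) % 3  # length covered by complete 3-mers
    for i in range(0, full, 3):
        kmer = seq[i:i + 3]
        tokens.append(kmer if set(kmer) <= valid else unknown_token)
    for base in seq[full:]:
        tokens.append(base if base in valid else unknown_token)
    return tokens


print(tokenize_3mers("ATGGCCTGA"))  # ['ATG', 'GCC', 'TGA']
print(tokenize_3mers("ATGNCCTG"))   # ['ATG', '<unk>', 'T', 'G']
```

When the sequence starts at a coding-frame boundary, each token corresponds to one codon, which is what makes this tokenization interesting for interpretability.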


🧬 Isoformer

A model designed for learning isoform-aware embeddings directly from RNA-seq data, enabling a deeper understanding of transcript-specific expression and regulation.


🔬 sCT (single-Cell Transformer)

Our foundational transformer model for single-cell and spatial transcriptomics data. sCT aims to learn rich representations from complex, high-dimensional single-cell datasets to improve various downstream analytical tasks.


💡 Why Choose InstaDeep's Genomic Models?

  • Built on Strong Foundations: Leveraging large-scale pre-training and diverse genomic datasets.
  • Cutting-Edge Research: Incorporating the latest advancements in deep learning for biological sequence analysis.
  • High Performance: Designed and validated to achieve state-of-the-art results on challenging genomic tasks.
  • Open and Accessible: We provide pre-trained weights, usage examples, and aim for easy integration into research workflows.
  • Collaborative Spirit: Developed with leading academic and industry partners.
  • Focused Expertise: Created by a dedicated team specializing in AI for genomics at InstaDeep.

🚀 Getting Started

To begin using models from this repository:

  1. Clone the repository:
    git clone https://github.com/instadeepai/nucleotide-transformer.git
    cd nucleotide-transformer
  2. Set up your environment (virtual environment recommended):
    python -m venv .venv
    source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
  3. Install the package and dependencies:
    pip install . # Installs the local package
    # Or, for a general requirements file if you have one:
    # pip install -r requirements.txt 
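
After the steps above, a quick sanity check of the environment can save time before loading any model. The sketch below uses only the Python standard library. The module name `nucleotide_transformer` is an assumption about what `pip install .` exposes, and the Python 3.8 minimum is taken from the badge at the top of this page; adjust both if your setup differs.

```python
import importlib.util
import sys
from typing import Dict, Sequence, Tuple


def check_environment(
    modules: Sequence[str] = ("jax", "nucleotide_transformer"),
    min_python: Tuple[int, int] = (3, 8),
) -> Dict[str, bool]:
    """Report whether the interpreter and the given modules look usable.

    The default module names are assumptions for this sketch; nothing is
    imported, the modules are only looked up on the import path.
    """
    report = {"python": sys.version_info[:2] >= tuple(min_python)}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print(f"{item}: {'ok' if ok else 'missing'}")
```

If any entry is reported as missing, make sure the virtual environment is activated and rerun the install step.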

For detailed instructions on individual models, including specific dependencies, downloading pre-trained weights, and Python usage examples, please refer to their dedicated documentation pages linked in the "Featured Models & Research Evolutions" section above (e.g., ./docs/nucleotide_transformer.md).

🤝 Community & Support

  • Questions & Bug Reports: Please use the GitHub Issues page.
  • Discussions: For broader discussions or questions, please use the GitHub Discussions tab (if enabled).
  • Stay Updated: Follow InstaDeep's official channels for announcements on new model releases and research updates.