
Training on large language models for genomics

Overview

This repository provides a training course on large language models (LLMs) for genomics. The training comprises a short lecture and several lab classes.

Lecture notes

You can download the lecture notes here.

Video of the lecture

Lab classes

Data to pretrain the model

The data can be found in the file:

  • data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz

The file contains 100,000 non-overlapping DNA sequences of 200 bases each, corresponding to around 1% of the human genome.
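The idea behind such a file can be sketched in a few lines: split a genome string into non-overlapping 200-base windows. The real file was derived from the hg38 assembly; the sequence below is random stand-in data, not actual genomic sequence.

```python
import random

random.seed(0)
# Stand-in "genome": 1,000 random bases in place of hg38.
genome = "".join(random.choice("ACGT") for _ in range(1000))

# Cut the genome into non-overlapping windows of 200 bases,
# keeping only full-length windows.
window = 200
sequences = [genome[i:i + window]
             for i in range(0, len(genome), window)
             if len(genome[i:i + window]) == window]

print(len(sequences))     # number of 200-base windows
print(len(sequences[0]))  # 200
```

Because the windows are non-overlapping, concatenating them reconstructs the covered part of the genome.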

Pretraining of an LLM on DNA sequences

We will pretrain an LLM from scratch (a simplified Mistral model; see folder data/models/Mixtral-8x7B-v0.1/) on the 100,000 DNA sequences from the human genome. The LLM is pretrained with causal language modeling on 200-base DNA sequences from the human genome assembly hg38.
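Causal language modeling trains the model to predict each token from the tokens before it. The lab does this with a simplified Mistral model via Hugging Face; as a framework-free sketch of the same objective, here is a bigram next-base predictor estimated by counting, trained on a short made-up sequence (a toy stand-in for the 100,000 hg38 sequences):

```python
import math
from collections import Counter, defaultdict

# Toy training sequence standing in for the hg38 data.
train = "ACGTACGTAACCGGTTACGT"

# Causal objective at its simplest: model P(next base | previous base)
# by counting bigram transitions.
counts = defaultdict(Counter)
for prev, nxt in zip(train, train[1:]):
    counts[prev][nxt] += 1

def next_base_probs(prev):
    total = sum(counts[prev].values())
    return {b: counts[prev][b] / total for b in "ACGT"}

# Average negative log-likelihood of the next base: this is the
# causal language modeling training loss.
nll = -sum(math.log(max(next_base_probs(p)[n], 1e-12))
           for p, n in zip(train, train[1:])) / (len(train) - 1)
print(f"average NLL: {nll:.3f}")
```

A real LLM replaces the count table with a transformer, but the loss it minimizes is exactly this per-token negative log-likelihood.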

Script on Google Colab

Commented script

Video of the tutorial

Medium article

Finetuning of an LLM for DNA sequence classification

We will use a pretrained LLM from Hugging Face (https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-17M-hg38) and finetune it for DNA sequence classification. The aim is to classify a DNA sequence according to whether it binds a protein (a transcription factor), whether a histone mark is present, or whether a promoter is active.
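Finetuning attaches a classification head to the pretrained model and trains it on labeled sequences. As a dependency-free stand-in for the Hugging Face finetuning done in the lab, here is a logistic-regression "head" on 3-mer count features; the data and the binding motif are made up for illustration:

```python
import math
import random
from itertools import product

random.seed(1)
KMERS = ["".join(p) for p in product("ACGT", repeat=3)]  # 64 features
IDX = {k: i for i, k in enumerate(KMERS)}

def features(seq):
    # Stand-in "embedding": normalized 3-mer counts (the lab uses the
    # pretrained model's hidden states instead).
    v = [0.0] * len(KMERS)
    for i in range(len(seq) - 2):
        v[IDX[seq[i:i + 3]]] += 1.0
    n = len(seq) - 2
    return [x / n for x in v]

def make_seq(with_motif):
    # Made-up data: positive sequences carry a hypothetical binding motif.
    s = "".join(random.choice("ACGT") for _ in range(50))
    return s[:20] + "TATAATTATA" + s[30:] if with_motif else s

data = [(features(make_seq(y)), y) for y in [1, 0] * 50]

def predict(x, w, b):
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# A one-layer classification head trained with plain SGD.
w, b, lr = [0.0] * len(KMERS), 0.0, 1.0
for _ in range(100):
    for x, y in data:
        g = predict(x, w, b) - y  # gradient of the logistic loss
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

acc = sum((predict(x, w, b) > 0.5) == bool(y) for x, y in data) / len(data)
print(f"training accuracy: {acc:.2f}")
```

In the lab, the features come from the pretrained LLM rather than raw k-mer counts, which is what makes finetuning so much more powerful than this sketch.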

Script on Google Colab

Commented script

Video of the tutorial

Medium article

Zero-shot prediction of mutation effect

We will use a pretrained LLM from Hugging Face (https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-17M-hg38) to predict the impact of mutations with zero-shot learning (using the pretrained model directly, without finetuning). We compute the embedding of the wild-type sequence and the embedding of the mutated sequence, then take the L2 distance between the two embeddings. The higher the distance, the larger the expected mutation effect.
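The distance computation itself is simple. In this sketch the "embedding" is a 4-mer count vector, a hypothetical stand-in for the pretrained model's hidden states, and the sequences are made up:

```python
import math
from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
IDX = {k: i for i, k in enumerate(KMERS)}

def embed(seq):
    # Stand-in embedding: 4-mer count vector. The lab uses the pretrained
    # model's hidden states instead.
    v = [0.0] * len(KMERS)
    for i in range(len(seq) - 3):
        v[IDX[seq[i:i + 4]]] += 1.0
    return v

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

wild_type = "ACGTACGTACGTACGTACGT"
mutant    = "ACGTACGTACTTACGTACGT"  # single substitution G->T at position 10

effect = l2(embed(wild_type), embed(mutant))
print(f"L2 distance (predicted mutation effect): {effect:.3f}")
```

A mutation that changes the local sequence context shifts the embedding, so the distance is positive; a sequence compared with itself gives distance zero.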

Script on Google Colab

Commented script

Video of the tutorial

Medium article

Synthetic DNA sequence generation

We will use a pretrained LLM from Hugging Face (https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-138M-yeast) to generate artificial yeast DNA sequences.
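Generation is autoregressive: sample the next base from the model's predicted distribution, append it, and repeat. A minimal sketch, using a hand-written (made-up) transition table in place of the pretrained yeast model's softmax output:

```python
import random

random.seed(42)

# Made-up next-base probabilities standing in for the model's output;
# the lab samples from the pretrained Mistral-DNA model instead.
P = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "C": {"A": 0.3, "C": 0.2, "G": 0.1, "T": 0.4},
    "G": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "T": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def generate(length, start="A"):
    # Autoregressive sampling: each new base depends on the previous one.
    seq = start
    while len(seq) < length:
        probs = P[seq[-1]]
        seq += random.choices(list(probs), weights=list(probs.values()))[0]
    return seq

synthetic = generate(60)
print(synthetic)
```

An LLM conditions on the whole prefix rather than just the last base, but the sampling loop has the same shape.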

Script on Google Colab

Commented script

Video of the tutorial

Medium article

DNA sequence optimization

We will use an LLM finetuned for promoter activity or transcription factor binding to optimize DNA sequences toward higher predicted scores.
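A common way to do this is greedy in-silico mutagenesis: score every single-base mutation with the finetuned model, keep the best one, and repeat. The sketch below uses GC content as a made-up scoring function in place of the finetuned model's predicted probability:

```python
import random

random.seed(7)

def score(seq):
    # Hypothetical stand-in for the finetuned model's predicted
    # probability (e.g. of promoter activity): here, just GC content.
    return (seq.count("G") + seq.count("C")) / len(seq)

def optimize(seq, steps=5):
    seq = list(seq)
    for _ in range(steps):
        current = score("".join(seq))
        best_gain, best_mut = 0.0, None
        # Try every single-base substitution and keep the best one.
        for i in range(len(seq)):
            for base in "ACGT":
                if base == seq[i]:
                    continue
                cand = seq[:i] + [base] + seq[i + 1:]
                gain = score("".join(cand)) - current
                if gain > best_gain:
                    best_gain, best_mut = gain, (i, base)
        if best_mut is None:  # no single mutation improves the score
            break
        i, base = best_mut
        seq[i] = base
    return "".join(seq)

start = "".join(random.choice("ACGT") for _ in range(30))
optimized = optimize(start)
print(f"{score(start):.3f} -> {score(optimized):.3f}")
```

With the real finetuned classifier as the scorer, the same loop steers sequences toward stronger predicted promoter activity or binding.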

Script on Google Colab

Video of the tutorial
