Organism | Tokenized dataset | Language model |
---|---|---|
Homo sapiens | Human_DNA_v0_DNABert6tokenized_stride1 | DNADebertaK6b |
Mus musculus | Mouse_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Mouse |
Danio rerio | Zebrafish_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Zebrafish |
Drosophila melanogaster | Worm_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Fruitfly |
Caenorhabditis elegans | Worm_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Worm |
Arabidopsis thaliana | Arabidopsis_thaliana_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Arabidopsis |
All models in active use are based on the k-mer (k=6) DNA_bert_6 tokenizer with stride 1.
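A minimal sketch of this preprocessing, assuming the tokenizer is pulled from the HuggingFace Hub; the repo id `zhihan1996/DNA_bert_6`, the helper `seq_to_kmers`, and the toy sequence are illustrative, not necessarily what our scripts use:

```python
from transformers import AutoTokenizer

# DNA_bert_6 expects the raw sequence pre-split into space-separated 6-mers
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6")

def seq_to_kmers(seq, k=6, stride=1):
    # slide a window of length k over the sequence with the given stride
    return " ".join(seq[i:i + k] for i in range(0, len(seq) - k + 1, stride))

seq = "ATGCGTACGTTAGC"                      # toy input; real inputs are 10kb pieces
encoded = tokenizer(seq_to_kmers(seq), truncation=True, max_length=512)
print(encoded["input_ids"])
```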
- How universal is the language of DNA: a DebertaSmall model is trained on training sets of the same size for several organisms and the resulting models are compared.
- What is the best architecture: several MaskedLM architectures are trained on the human genome and the results are compared.
- Loss on different types of DNA sequences: the LM performs better on low-complexity sequences
- Optimal k-mer and stride: comparison of k-mer tokenizers on one downstream task (prediction of human promoters), with K from 3 to 9 and stride either 1 or K
- Comparison on genomic benchmarks: this script evaluates the chosen model on a set of genomic benchmarks and reports the metrics
- Custom DataCollator: with stride 1, each individually masked token could be predicted from its overlapping neighbors, so tokens must be masked in blocks (see the sketch after this list)
- Search for hyperparameters: experimenting with training parameters
- Other tokenizers: like BPE... (t.b.d.)
- Comparison to DNABert (t.b.d.)
- Training more on parts of the genome that are harder to predict
- Experimenting with DNAPerceiver, RoBERTa (t.b.d.)
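The block-masking idea behind the custom DataCollator can be sketched as follows; the class name, the `block_size` parameter, and the simplified masking (no 80/10/10 split) are assumptions for illustration, not the project's actual implementation:

```python
import torch
from transformers import DataCollatorForLanguageModeling

class DataCollatorForBlockMasking(DataCollatorForLanguageModeling):
    """Mask blocks of consecutive tokens instead of isolated positions (sketch)."""

    def __init__(self, tokenizer, mlm_probability=0.15, block_size=6):
        super().__init__(tokenizer=tokenizer, mlm=True, mlm_probability=mlm_probability)
        self.block_size = block_size

    def torch_mask_tokens(self, inputs, special_tokens_mask=None):
        labels = inputs.clone()
        if special_tokens_mask is None:
            special_tokens_mask = torch.tensor(
                [self.tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
                 for ids in labels.tolist()],
                dtype=torch.bool,
            )
        else:
            special_tokens_mask = special_tokens_mask.bool()

        # sample block starts so that roughly mlm_probability of tokens end up masked
        starts = torch.bernoulli(
            torch.full(labels.shape, self.mlm_probability / self.block_size)
        ).bool()
        mask = torch.zeros_like(starts)
        for offset in range(self.block_size):
            shifted = torch.zeros_like(starts)
            shifted[:, offset:] = starts[:, : labels.shape[1] - offset]
            mask |= shifted
        mask &= ~special_tokens_mask

        labels[~mask] = -100                            # loss only on masked positions
        inputs[mask] = self.tokenizer.mask_token_id     # simplified: always use [MASK]
        return inputs, labels
```

Such a collator would be passed to the `Trainer` via `data_collator=DataCollatorForBlockMasking(tokenizer)`.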
- Human_DNA_small: DeBERTa small model trained on the Human_DNA_v0 dataset (10 epochs)
- DNA data: Reshaping the human genome (DNA) into an HF dataset; there is also a version with stride 1
- Custom tokenizer: finding a way to create a k-mer tokenizer for K > 6
- DNA data configurable: Configurable script for downloading, processing, and uploading DNA data from fasta files to HuggingFace (HF) datasets (see the sketch after this list)
- Architecture pretraining: Script for pretraining various architectures on human DNA
- Human_DNA_Deberta: training of the (full) DeBERTa model; the learning rate was too small
- Training_with_cDNA: Current training script demonstrated on the BERT architecture and the cDNA dataset, not very useful
- env_init: Internal script for installation needed on our virtual machines (E-INFRA HUB)
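A rough sketch of what the configurable DNA-data script does, assuming Biopython for fasta parsing; the input file name, the `seq` column name, and the target repo id are placeholders, not the script's actual configuration:

```python
from Bio import SeqIO
from datasets import Dataset

def fasta_to_dataset(path, chunk_size=10_000):
    """Split each fasta record into fixed-size pieces and wrap them in an HF dataset."""
    chunks = []
    for record in SeqIO.parse(path, "fasta"):
        seq = str(record.seq).upper()
        chunks.extend(seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size))
    return Dataset.from_dict({"seq": chunks})

ds = fasta_to_dataset("Homo_sapiens.GRCh38.dna.fa")   # placeholder local file
ds.push_to_hub("simecek/Human_DNA_v0")                # requires being logged in to the Hub
```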
- Human_DNA_v0: DNA split into 10 kb pieces
- Human_DNA_v0_DNABert6tokenized: DNA tokenized and ready for language model training (tensors of 512 tokens)
- simecek/Human_DNA_v0_DNABert6tokenized_stride1: same as Human_DNA_v0_DNABert6tokenized but with stride 1 instead of 6; there are also versions simecek/Human_DNA_v0_K7tokenized_stride1 and simecek/Human_DNA_v0_K8tokenized_stride1 for K = 7 and 8, respectively
- Human_cdna: Homo_sapiens.GRCh38.cdna.abinitio.fa.gz reshaped into an HF dataset
- Other organisms: HF datasets of other organisms can be found here (mouse, fruit fly, roundworm, zebrafish, arabidopsis)
- simecek/Human_DNA_v0_Perceiver1tokenized: Human_DNA_v0 tokenized for Perceiver model (1 token = 1 bp)
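The tokenized datasets can be loaded directly from the Hub, for example as below; that the dataset is public, exposes a `train` split, and keeps the usual `input_ids`/`attention_mask` columns are assumptions:

```python
from datasets import load_dataset

# 512-token examples ready for masked-LM pretraining
ds = load_dataset("simecek/Human_DNA_v0_DNABert6tokenized_stride1", split="train")
print(ds.column_names)           # expected: input_ids, attention_mask, ...
print(len(ds[0]["input_ids"]))   # 512
```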
- DNADebertaSmall2: currently the best model; DebertaSmall pretrained on Human_DNA_v0 for 30 epochs
- DNADebertaSmall: DebertaSmall pretrained with the Human_DNA_small notebook on Human_DNA_v0 for 10 epochs
- DNAMobileBert: MobileBERT, pretrained on Human_DNA_v0 for 10 epochs
- Other organisms: naming scheme {Organism}DNADeberta; DebertaSmall architecture, 25,000 steps (~3 epochs of the mouse genome)
- Other architectures: naming scheme humandna_{architecture}_1epoch
- cDNABERT_v0: the output of the Training_with_cDNA script, not a very useful model
- DNA_bert_6: we are currently using this tokenizer (the sequence needs to be preprocessed into overlapping 6-mers before using it)
- for the k-mer tokenizers with K = 7 and 8 (stride 1), we currently use DNA_bert_6 with added tokens (see the sketch below)
- Setting up the INFRA hub environment: David's original notebook, currently not used
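A hedged sketch of the "added tokens" approach for K = 7 mentioned above; the repo id is an assumption and the actual notebook may do this differently:

```python
from itertools import product
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6")   # repo id is an assumption

# add all 4**7 = 16384 possible 7-mers as new vocabulary entries
new_kmers = ["".join(p) for p in product("ACGT", repeat=7)]
print(tokenizer.add_tokens(new_kmers))   # number of tokens actually added

# the embedding matrix of any model using this tokenizer must then be resized:
# model.resize_token_embeddings(len(tokenizer))
```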