Organism | Tokenized dataset | Language model |
---|---|---|
Homo sapiens | Human_DNA_v0_DNABert6tokenized_stride1 | DNADebertaK6b |
Mus musculus | Mouse_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Mouse |
Danio rerio | Zebrafish_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Zebrafish |
Drosophila melanogaster | Worm_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Fruitfly |
Caenorhabditis elegans | Worm_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Worm |
Arabidopsis thaliana | Arabidopsis_thaliana_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Arabidopsis |
All models in active use are based on the k-mer (k=6) DNA_bert_6 tokenizer with stride 1.
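A minimal sketch of this preprocessing, assuming the tokenizer is pulled from the HuggingFace Hub; the repo id `zhihan1996/DNA_bert_6`, the helper `seq_to_kmers`, and the toy sequence are illustrative, not necessarily what our scripts use:

```python
from transformers import AutoTokenizer

# DNA_bert_6 expects the raw sequence pre-split into space-separated 6-mers
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6")

def seq_to_kmers(seq, k=6, stride=1):
    # slide a window of length k over the sequence with the given stride
    return " ".join(seq[i:i + k] for i in range(0, len(seq) - k + 1, stride))

seq = "ATGCGTACGTTAGC"                      # toy input; real inputs are 10kb pieces
encoded = tokenizer(seq_to_kmers(seq), truncation=True, max_length=512)
print(encoded["input_ids"])
```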
- How universal is the language of DNA: a DebertaSmall model is trained on training sets of the same size for several organisms and the resulting models are compared.
- What is the best architecture: several MaskedLM architectures are trained on the human genome and the results are compared.
- Loss on different types of DNA sequences: the LM performs better on low-complexity sequences
- Optimal k-mer and stride: comparison of k-mer tokenizers on one downstream task (prediction of human promoters), with K from 3 to 9 and stride either 1 or K
- Comparison on genomic benchmarks: this script evaluates the chosen model on a set of genomic benchmarks and reports the metrics
- Custom DataCollator: with stride 1, each individually masked token could be predicted from its overlapping neighbors, so tokens must be masked in blocks (see the sketch after this list)
- Search for hyperparameters: experimenting with training parameters
- Other tokenizers: like BPE... (t.b.d.)
- Comparison to DNABert (t.b.d.)
- Training more on parts of the genome that are harder to predict
- Experimenting with DNAPerceiver, RoBERTa (t.b.d.)
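The block-masking idea behind the custom DataCollator can be sketched as follows; the class name, the `block_size` parameter, and the simplified masking (no 80/10/10 split) are assumptions for illustration, not the project's actual implementation:

```python
import torch
from transformers import DataCollatorForLanguageModeling

class DataCollatorForBlockMasking(DataCollatorForLanguageModeling):
    """Mask blocks of consecutive tokens instead of isolated positions (sketch)."""

    def __init__(self, tokenizer, mlm_probability=0.15, block_size=6):
        super().__init__(tokenizer=tokenizer, mlm=True, mlm_probability=mlm_probability)
        self.block_size = block_size

    def torch_mask_tokens(self, inputs, special_tokens_mask=None):
        labels = inputs.clone()
        if special_tokens_mask is None:
            special_tokens_mask = torch.tensor(
                [self.tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
                 for ids in labels.tolist()],
                dtype=torch.bool,
            )
        else:
            special_tokens_mask = special_tokens_mask.bool()

        # sample block starts so that roughly mlm_probability of tokens end up masked
        starts = torch.bernoulli(
            torch.full(labels.shape, self.mlm_probability / self.block_size)
        ).bool()
        mask = torch.zeros_like(starts)
        for offset in range(self.block_size):
            shifted = torch.zeros_like(starts)
            shifted[:, offset:] = starts[:, : labels.shape[1] - offset]
            mask |= shifted
        mask &= ~special_tokens_mask

        labels[~mask] = -100                            # loss only on masked positions
        inputs[mask] = self.tokenizer.mask_token_id     # simplified: always use [MASK]
        return inputs, labels
```

Such a collator would be passed to the `Trainer` via `data_collator=DataCollatorForBlockMasking(tokenizer)`.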
- Human_DNA_small: DeBERTa small model trained on the Human_DNA_v0 dataset (10 epochs)
- DNA data: Reshaping the human genome (DNA) into an HF dataset; there is also a version with stride 1
- Custom tokenizer: finding a way to create a k-mer tokenizer for K > 6
- DNA data configurable: Configurable script for downloading, processing, and uploading DNA data from fasta files to HuggingFace (HF) datasets (see the sketch after this list)
- Architecture pretraining: Script for pretraining various architectures on human DNA
- Human_DNA_Deberta: training of the (full) DeBERTa model; the learning rate was too small
- Training_with_cDNA: Current training script demonstrated on the BERT architecture and the cDNA dataset, not very useful
- env_init: Internal script for installation needed on our virtual machines (E-INFRA HUB)
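A rough sketch of what the configurable DNA-data script does, assuming Biopython for fasta parsing; the input file name, the `seq` column name, and the target repo id are placeholders, not the script's actual configuration:

```python
from Bio import SeqIO
from datasets import Dataset

def fasta_to_dataset(path, chunk_size=10_000):
    """Split each fasta record into fixed-size pieces and wrap them in an HF dataset."""
    chunks = []
    for record in SeqIO.parse(path, "fasta"):
        seq = str(record.seq).upper()
        chunks.extend(seq[i:i + chunk_size] for i in range(0, len(seq), chunk_size))
    return Dataset.from_dict({"seq": chunks})

ds = fasta_to_dataset("Homo_sapiens.GRCh38.dna.fa")   # placeholder local file
ds.push_to_hub("simecek/Human_DNA_v0")                # requires being logged in to the Hub
```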
- Human_DNA_v0: DNA split into 10 kb pieces
- Human_DNA_v0_DNABert6tokenized: DNA tokenized and ready for language model training (tensors of 512 tokens)
- simecek/Human_DNA_v0_DNABert6tokenized_stride1: same as Human_DNA_v0_DNABert6tokenized but with stride 1 instead of 6; there are also versions simecek/Human_DNA_v0_K7tokenized_stride1 and simecek/Human_DNA_v0_K8tokenized_stride1 for K = 7 and 8, respectively
- Human_cdna: Homo_sapiens.GRCh38.cdna.abinitio.fa.gz reshaped into an HF dataset
- Other organisms: HF datasets of other organisms can be found here (mouse, fruit fly, roundworm, zebrafish, arabidopsis)
- simecek/Human_DNA_v0_Perceiver1tokenized: Human_DNA_v0 tokenized for Perceiver model (1 token = 1 bp)
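The tokenized datasets can be loaded directly from the Hub, for example as below; that the dataset is public, exposes a `train` split, and keeps the usual `input_ids`/`attention_mask` columns are assumptions:

```python
from datasets import load_dataset

# 512-token examples ready for masked-LM pretraining
ds = load_dataset("simecek/Human_DNA_v0_DNABert6tokenized_stride1", split="train")
print(ds.column_names)           # expected: input_ids, attention_mask, ...
print(len(ds[0]["input_ids"]))   # 512
```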
- DNADebertaSmall2: currently the best model; DebertaSmall pretrained on Human_DNA_v0 for 30 epochs
- DNADebertaSmall: DebertaSmall pretrained with the Human_DNA_small notebook on Human_DNA_v0 for 10 epochs
- DNAMobileBert: MobileBERT, pretrained on Human_DNA_v0 for 10 epochs
- Other organisms: naming scheme {Organism}DNADeberta; DebertaSmall architecture, 25,000 steps (~3 epochs of the mouse genome)
- Other architectures: naming scheme humandna_{architecture}_1epoch
- cDNABERT_v0: the output of the Training_with_cDNA script, not a very useful model
- DNA_bert_6: we are currently using this tokenizer (the sequence needs to be preprocessed into overlapping 6-mers before using it)
- for the k-mer tokenizers with K = 7 and 8 (stride 1), we currently use DNA_bert_6 with added tokens (see the sketch below)
- Setting up the INFRA hub environment: David's original notebook, currently not used
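A hedged sketch of the "added tokens" approach for K = 7 mentioned above; the repo id is an assumption and the actual notebook may do this differently:

```python
from itertools import product
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNA_bert_6")   # repo id is an assumption

# add all 4**7 = 16384 possible 7-mers as new vocabulary entries
new_kmers = ["".join(p) for p in product("ACGT", repeat=7)]
print(tokenizer.add_tokens(new_kmers))   # number of tokens actually added

# the embedding matrix of any model using this tokenizer must then be resized:
# model.resize_token_embeddings(len(tokenizer))
```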