ML-Bioinfo-CEITEC/DNA-pretraining

🤗 Models For Genomic Sequences

| Organism | Tokenized dataset | Language model |
| --- | --- | --- |
| Homo sapiens | Human_DNA_v0_DNABert6tokenized_stride1 | DNADebertaK6b |
| Mus musculus | Mouse_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Mouse |
| Danio rerio | Zebrafish_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Zebrafish |
| Drosophila melanogaster | Worm_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Fruitfly |
| C. elegans | Worm_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Worm |
| Arabidopsis thaliana | Arabidopsis_thaliana_DNA_v0_tokenized_kmer6_stride1 | DNADebertaK6_Arabidopsis |

All models in active use rely on the k-mer (k=6) DNA_bert_6 tokenizer with stride 1.

Experiments & plans

Notebooks

Datasets

Models

Tokenizers

  • DNA_bert_6: the tokenizer we currently use (the sequence needs to be preprocessed before it is passed to the tokenizer)
  • KMER tokenizer with K equal to 7 and 8 (stride 1): for these we currently use DNA_bert_6 with added token
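
The preprocessing step mentioned above is not spelled out in this README. A minimal sketch, assuming the DNABERT-style convention in which a raw sequence is split into overlapping k-mers separated by single spaces, might look like the following. The function name `seq_to_kmers` and the uppercasing step are illustrative assumptions, not part of this repository.

```python
def seq_to_kmers(sequence: str, k: int = 6, stride: int = 1) -> str:
    """Split a DNA sequence into overlapping k-mers joined by single spaces.

    Hypothetical helper: assumes the tokenizer expects whitespace-separated
    k-mers, as in DNABERT-style preprocessing.
    """
    sequence = sequence.strip().upper()
    if len(sequence) < k:
        # Too short to form even one k-mer.
        return ""
    # With stride 1, a sequence of length n yields n - k + 1 k-mers.
    kmers = [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]
    return " ".join(kmers)


example = seq_to_kmers("ATGCGTACGT")
print(example)  # ATGCGT TGCGTA GCGTAC CGTACG GTACGT
```

Under that assumption, the resulting string would be what is passed to the tokenizer. The same helper with `k=7` or `k=8` produces the longer k-mers mentioned in the second bullet.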

Other(s)
