drostlab/deepclust-data

Scripts to reproduce the results of Buchfink et al., "Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust".

Data files

Data files should be placed in the root of the repository.

  • arch80_all.tsv: File mapping NCBI accessions to Pfam domain architectures.
  • nr_accessions.tsv.zst: List of all accessions in the NR database used for the main benchmark.
  • reps_1M_ge5.tsv.zst: List of accessions of randomly sampled representatives from the big ~19bn clustering run, restricted to clusters of size >= 5.
  • clan2acc.tsv: File mapping Pfam accessions to clans.
  • reps_10k.ge2.faa: Sequences of 10k randomly sampled representatives from the big ~19bn clustering run, restricted to clusters of size >= 2.
  • reps_10k.ge3.faa: Sequences of 10k randomly sampled representatives from the big ~19bn clustering run, restricted to clusters of size >= 3.
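
Before running the scripts, it may help to confirm that all of the files listed above are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_data_files` and the constant `EXPECTED_FILES` are hypothetical, and the only assumption taken from this README is that the files sit directly in the repository root under the names given above.

```python
from pathlib import Path

# File names exactly as listed in this README.
EXPECTED_FILES = [
    "arch80_all.tsv",
    "nr_accessions.tsv.zst",
    "reps_1M_ge5.tsv.zst",
    "clan2acc.tsv",
    "reps_10k.ge2.faa",
    "reps_10k.ge3.faa",
]


def missing_data_files(repo_root="."):
    """Return the expected data files that are not present in repo_root.

    Hypothetical helper for illustration; it only checks that each file
    exists as a regular file and does not inspect its contents.
    """
    root = Path(repo_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing data files:", ", ".join(missing))
    else:
        print("All data files found.")
```

Run from the repository root, this reports any file that still needs to be downloaded or moved into place.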
