Releases: huggingface/datatrove

v0.9.0

04 Mar 13:50
87f7bad

Full Changelog: v0.8.0...v0.9.0

v0.8.0

19 Jan 16:03

Full Changelog: v0.7.0...v0.8.0

v0.7.0

19 Jan 14:19

Full Changelog: v0.6.0...v0.7.0

v0.6.0

07 Aug 19:03
a0bda98

Full Changelog: v0.5.0...v0.6.0

v0.5.0

01 May 14:52
99206aa

Full Changelog: v0.4.0...v0.5.0

v0.4.0

06 Dec 18:43
842b241

What's Changed

  • Readme nits by @hynky1999 in #280
  • Fixed a bug in the reader pipeline where the document count was always lower than the actual number of documents by the number of files, by @lyuwen in #286
  • Fix languages listify bug by @BramVanroy in #294
  • [Fixbug] Ensure only one task will be launched for each srun cmd by @silverriver in #296
  • [fixbug]: Fixed the issue in MinhashBuildIndex where get_datafolder w… by @Youggls in #307
  • FineWeb-2: multilingual, numpy 2.0, minhash improvements by @guipenedo and @hynky1999 in #285:
    • upgrades to support numpy 2.0
    • added additional word tokenizers and revamped word tokenizer assignment mechanism
    • MinHash optimizations + new Rust tool to speed up step3
    • MinHash cluster sizes feature
    • fixed memory leaks from some word tokenizers
    • updated url blocklists
    • added caching to some word tokenization calls
    • GlotLID support
    • general bugfixes

Full Changelog: v0.3.0...v0.4.0

v0.3.0

28 Aug 15:47
d95e0ee

Full Changelog: v0.2.0...v0.3.0

v0.2.0

22 Apr 17:18
6d06210

Full Changelog: v0.0.1...v0.2.0

v0.0.1

07 Feb 15:10
bd3c89a

First release