Releases: huggingface/datatrove
v0.9.0
What's Changed
- Fix CI test hangs in inference pipeline by @JoelNiklaus in #420
- Remove 'file_path' from document metadata when checkpointing by @lewtun in #412
- Speed up Benchmark Submission by @JoelNiklaus in #422
- Standardize Paths by @JoelNiklaus in #423
- Add benchmark mode by @JoelNiklaus in #424
- Truncate context by @JoelNiklaus in #425
- Mention optimized-parquet in readme by @lhoestq in #421
- Fix Dataset name in test by @JoelNiklaus in #427
- Readme and Dependencies by @JoelNiklaus in #428
- Fix vLLM cache corruption on shared filesystems by @JoelNiklaus in #426
- Miscellaneous changes to Inference Benchmarking by @JoelNiklaus in #429
- Speed up Benchmark Submission by @JoelNiklaus in #430
- Benchmark analysis improvements by @JoelNiklaus in #431
- Fix ModuleNotFoundError when unpickling pipeline in SLURM jobs by @JoelNiklaus in #432
- Extend Benchmarking Framework by @JoelNiklaus in #433
- Simplify analysis by @JoelNiklaus in #434
- Add vLLM server metrics to benchmark analysis by @JoelNiklaus in #435
- Improve Benchmark Reliability and Add Features by @JoelNiklaus in #436
- Optimize Benchmarking Infrastructure by @JoelNiklaus in #437
- Fix memory unit mismatch in distributed Ray helpers by @JoelNiklaus in #439
- Cleanup dependencies by @JoelNiklaus in #440
- Benchmark quality of life by @JoelNiklaus in #441
- Track inference time by @JoelNiklaus in #442
- Add token estimation script for large HF datasets by @JoelNiklaus in #444
- Finalize benchmark by @JoelNiklaus in #445
- Add smol_data example for 100B dataset workflows by @JoelNiklaus in #449
- Add skip_bad_requests option to InferenceRunner by @JoelNiklaus in #450
- Make inference server startup policy configurable by @JoelNiklaus in #451
- Handle HF Hub commit-race retries in DiskWriter by @JoelNiklaus in #448
- Stabilize tokenizer and manager teardown for CI test shutdown by @JoelNiklaus in #452
- Apply max_examples globally across parallel tasks by @JoelNiklaus in #453
- Add hyperlinks to model and dataset in dataset card template by @JoelNiklaus in #454
- Add lfs-verify to retryable HF Hub upload errors by @JoelNiklaus in #455
- Glob all parquet files by @lewtun in #456
- Fix SLURM CPU binding error in inference jobs by @JoelNiklaus in #457
- Use public sharding API for streaming datasets by @JoelNiklaus in #458
- Pin huggingface-hub to transformers-compatible range by @JoelNiklaus in #459
- Feat/support dataset configs by @JoelNiklaus in #447
- Refactor inference dataset card update flow by @JoelNiklaus in #460
- Split full and changed-file style targets by @JoelNiklaus in #461
- Add standalone FinePhrase inference example by @JoelNiklaus in #462
- Isolate Xet cache per Slurm task by @JoelNiklaus in #465
- Preserve checkpoint progress for skipped bad requests by @JoelNiklaus in #464
- Harden HF writer retry behavior by @JoelNiklaus in #463
- bump version and adapt authors by @JoelNiklaus in #466
Full Changelog: v0.8.0...v0.9.0
v0.8.0
What's Changed
- Finepdfs by @hynky1999 in #402
- Optimized-parquet by @lhoestq in #414
- Add inference dataset card generator by @JoelNiklaus in #415
- Enable Boolean Kwargs by @JoelNiklaus in #417
- Add Progress Monitoring and Example Script by @JoelNiklaus in #416
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #411
- Add Inference Benchmark Tools by @JoelNiklaus in #418
- Fix issues with the CI/CD by @JoelNiklaus in #419
- Add slots=True to dataclasses by @Rolv-Arild in #405
New Contributors
- @lhoestq made their first contribution in #414
- @salmanmkc made their first contribution in #411
- @Rolv-Arild made their first contribution in #405
Full Changelog: v0.7.0...v0.8.0
v0.7.0
What's Changed
- Fix typos by @omahs in #380
- filters: fix C4BadWordsFilter _get_badwords lang shadowing (fixes #377) by @dipampaul17 in #379
- chore: fixed a broken link in the documentation summary_stats by @Olexandr88 in #382
- [BUG Fix] Launching dependent `LocalPipelineExecutor`s with `skip_completed=False` lead to waiting by @silverriver in #300
- Allow the postprocess_fn to take self as a parameter by @JoelNiklaus in #391
- fix: typos by @DeVikingMark in #386
- docs: fixed a broken link in the documentation stats by @Olexandr88 in #383
- Add support to load HF dataset from disk by @iamgroot42 in #385
- ensure folder_path has consistent usage by @hynky1999 in #366
- fixes #388 by @zinccat in #389
- Fix sentinel condition in rust mh3 by @jordane95 in #394
- bugfixes + warnings + callback option for inference runner by @guipenedo in #395
- general bugfixes by @guipenedo in #396
- add additional verification after minhash step1 by @guipenedo in #404
- Inference runner refactoring: rollouts, gen params, etc by @guipenedo in #398
- Fix missing param by @shallyan in #407
- Multi-Node Distributed Inference Support by @hynky1999 in #406
- ray nits by @hynky1999 in #403
- fix parquet last batch + slurm srun option by @guipenedo in #409
New Contributors
- @omahs made their first contribution in #380
- @dipampaul17 made their first contribution in #379
- @Olexandr88 made their first contribution in #382
- @DeVikingMark made their first contribution in #386
- @iamgroot42 made their first contribution in #385
- @zinccat made their first contribution in #389
- @shallyan made their first contribution in #407
Full Changelog: v0.6.0...v0.7.0
v0.6.0
What's Changed
- Fixed type annotations for `stop_chars` and `exclusion_writer` in `FineWebQualityFilter` by @LeMoussel in #369
- update ray doc by @Tavish9 in #371
- Adds inference with vllm and sglang by @guipenedo in #378
New Contributors
- @LeMoussel made their first contribution in #369
- @Tavish9 made their first contribution in #371
Full Changelog: v0.5.0...v0.6.0
v0.5.0
What's Changed
- fix by @kylematoba in #314
- Changed FTFY defaults by @guipenedo in #319
- Adding Megatron Tokenization pipeline by @TJ-Solergibert in #304
- Add `job_id_position` Parameter to `launch_slurm_job` Method by @StephenRebel in #282
- load_tokenizer can now load local hf folder by @ceferisbarov in #306
- Add glob pattern for hash index by @jordane95 in #313
- fix(utils): Enhance the dependencies check to include pip distribution by @aiqwe in #317
- Update README.md by @saforem2 in #323
- Fix issues with URL Deduplication when using the Index by @muzzynine in #327
- Add customization for fetching SLURM job id by @BramVanroy in #320
- fixes stopwors implementation by @guipenedo in #329
- Allow custom parquet schema by @BramVanroy in #330
- [draft] Add chunking option to DocumentTokenizer by @craffel in #342
- Revert "[draft] Add chunking option to DocumentTokenizer" by @guipenedo in #343
- fix: root condition for SENTINEL by @jordane95 in #349
- correct metadata parsing for finemath by @VivienCabannes in #355
- add oom score + shorter polling by @hynky1999 in #361
- Resolve issue 308 by @habanoz in #309
- [draft] Add chunking option to DocumentTokenizer by @craffel in #344
- Add RayPipelineExecutor by @nelson-liu in #331
- Bump ring from 0.17.8 to 0.17.14 in /src/datatrove/tools/fast_mh3 by @dependabot in #363
- Bump tokio from 1.41.1 to 1.43.1 in /src/datatrove/tools/fast_mh3 by @dependabot in #362
- Fix signatures priority queue initialization in MinhashBuildIndex by @nelson-liu in #334
- Shuffle by chunks support in DocumentTokenizerMerger by @guipenedo in #364
- return positions based on .index if return_positions=True in the data… by @guipenedo in #356
New Contributors
- @kylematoba made their first contribution in #314
- @StephenRebel made their first contribution in #282
- @ceferisbarov made their first contribution in #306
- @saforem2 made their first contribution in #323
- @muzzynine made their first contribution in #327
- @craffel made their first contribution in #342
- @VivienCabannes made their first contribution in #355
- @habanoz made their first contribution in #309
- @nelson-liu made their first contribution in #331
- @dependabot made their first contribution in #363
Full Changelog: v0.4.0...v0.5.0
v0.4.0
What's Changed
- Readme nits by @hynky1999 in #280
- Fixed a bug that in the reader pipline, the document count is always less that the actual number of documents by the number of files. by @lyuwen in #286
- Fix languages listify bug by @BramVanroy in #294
- [Fixbug] Ensure only one task will be launched for each srun cmd by @silverriver in #296
- [fixbug]: Fixed the issue in MinhashBuildIndex where get_datafolder w… by @Youggls in #307
- FineWeb-2: multilingual, numpy 2.0, minhash improvements by @guipenedo and @hynky1999 in #285:
  - upgrades to support numpy 2.0
  - added additional word tokenizers and revamped word tokenizer assignment mechanism
  - MinHash optimizations + new rust tool to speed up step3
  - MinHash cluster sizes feature
  - fixed memory leaks from some word tokenizers
  - updated url blocklists
  - added caching to some word tokenization calls
  - glotlid support
  - general bugfixes
New Contributors
- @lyuwen made their first contribution in #286
- @BramVanroy made their first contribution in #294
- @silverriver made their first contribution in #296
- @Youggls made their first contribution in #307
Full Changelog: v0.3.0...v0.4.0
v0.3.0
What's Changed
- Added c4 badwords filter, added batch tokenization to tokenscounter by @guipenedo in #160
- Add a skip parameter to all readers (defaults to zero) by @rantav in #167
- Adds n-gram based decontamination by @guipenedo in #172
- Fix: Handle Non-dict Objects in to_dict Without Errors by @justHungryMan in #139
- Adds `tasks_per_job` to slurm executor by @guipenedo in #153
- Unsigned int tokenizer and srun args by @marianna13 in #154
- Enhance BaseReader to allow custom adapters access to instance variables by @justHungryMan in #169
- remove ListFilter from the process_common_crawl_dump example by @QasidSaleem in #181
- Hf dataset update by @hynky1999 in #170
- Optimize URLFilter and add option to disable integrated wordlists by @its5Q in #174
- Add progres for files by @hynky1999 in #176
- Make colorization configurable for both files and console output by @guipenedo in #185
- Migrate dedup to xxhash by @guipenedo in #179
- [WIP] Multi-Lingual Tokenization by @beme248 in #147
- Add more word tokenizers by @vsabolcec in #187
- Speed up CI with uv by @guipenedo in #188
- Url Index + missing hash_config struct inference by @hynky1999 in #191
- Migrate pipeline blocks to new word tokenizers by @guipenedo in #189
- Fix snapshot representation and numeric conversion in example Code (fineweb) by @justHungryMan in #192
- Extend randomize_start feature to local executor by @justHungryMan in #193
- Add description for randomize_start by @justHungryMan in #194
- Allow an integer parameter for 'randomize_start' in executor/base.py by @justHungryMan in #199
- Issues w/ DatatroveFolderDataset by @TJ-Solergibert in #203
- code consistency about radomize_start_duration by @justHungryMan in #207
- feat(ci): add trufflehog secrets detection by @McPatate in #211
- fix(ci): remove unnecessary permissions by @McPatate in #212
- Add label_only option to LanguageFilter by @justHungryMan in #210
- Fixes text normalization by @hynky1999 in #218
- Summary stats by @hynky1999 in #158
- Speedup json writer by @its5Q in #175
- add alternative fasttext lid models by @guipenedo in #226
- Adds paths_file to readers by @guipenedo in #228
- Add an example for filtering an HF dataset and push to hub by @loubnabnl in #201
- checks if min_num_sentences is disabled or not before computing the n… by @QasidSaleem in #232
- DocumentTokenizerContextShuffler fixes by @sippycoder in #229
- add dependencies lid.py, io.py #239 by @aiqwe in #241
- Add withdirs to extra_options only when not using glob_pattern by @olga1988olga in #244
- Add token and char count to histogram stats by @guipenedo in #251
- fix correct type inference for cached filesystems by @hynky1999 in #257
- Simple enhancement for readibility by @aiqwe in #253
- Fix `test_basic_article_trafilatura` test failure by @tylerjthomas9 in #264
- Update MinhashConfig with detailed settings and add default language … by @justHungryMan in #252
- Update README.md by @shizhediao in #276
- Implement zstd Compression Support for JSONL and Parquet Files by @justHungryMan in #230
- Update filter_hf_dataset.py by @shizhediao in #274
- Add expand_metadata Option to JsonlWriter by @justHungryMan in #268
- Add shuffle option on huggingface reader by @justHungryMan in #224
New Contributors
- @rantav made their first contribution in #167
- @QasidSaleem made their first contribution in #181
- @its5Q made their first contribution in #174
- @beme248 made their first contribution in #147
- @vsabolcec made their first contribution in #187
- @TJ-Solergibert made their first contribution in #203
- @McPatate made their first contribution in #211
- @loubnabnl made their first contribution in #201
- @sippycoder made their first contribution in #229
- @aiqwe made their first contribution in #241
- @olga1988olga made their first contribution in #244
- @tylerjthomas9 made their first contribution in #264
- @shizhediao made their first contribution in #276
Full Changelog: v0.2.0...v0.3.0
v0.2.0
What's Changed
- Adds multi node parallelism to local executor by @guipenedo in #85
- Changed fsx default filepath for logging output to user's home by @Anacheron51 in #86
- [`Docs`] Fix typos by @StandardAI in #91
- bugfix stats file not being saved to s3 by @guipenedo in #92
- Fix url stats by @thomwolf in #89
- Efficiency: np.fromiter instead of np.array by @giorgioangel in #88
- Adds language option for nltk by @guipenedo in #94
- Fix compression type by @jordane95 in #95
- Decoupled reading logic from DedupReader by @guipenedo in #98
- Support for arbitrary fasttext models by @guipenedo in #99
- Adds citation by @guipenedo in #101
- Adds parquet writer by @guipenedo in #103
- Utilities to efficiently parallelize the upload of dataset files to the HuggingFace hub by @guipenedo in #105
- Adding doc strings + adding a faster tokenized doc merger by @thomwolf in #90
- Add email on slurm and extend fasttext filter functionalities by @thomwolf in #111
- Add `jobs_status` command. by @lvwerra in #113
- Re-enable `datasets` test by @mariosasko in #114
- Update warc.py by @jordane95 in #115
- Bug fix: when file is empty by @jordane95 in #126
- Load tokenizer using `from_file` by @guipenedo in #122
- Adds `depends=` to LocalPipelineExecutor by @guipenedo in #100
- Improve C4 filter and dedup by @guipenedo in #124
- Adds option to shuffle input files in readers by @guipenedo in #128
- update Trafilatura version by @adbar in #130
- Changes to text normalization + FTFY and lines symbol formatters by @guipenedo in #133
- Minor Terminology and Documentation Updates for Local Tokenizer Loading by @justHungryMan in #134
- add requeue and QOS slurm options by @marianna13 in #144
- Fix substring dedup range by @jordane95 in #132
- Line dedup min remove words option by @guipenedo in #146
- New options for FastTextClassifierFilter: apply on sentence or paragraph (line) level by @guipenedo in #151
- Url deduplication by @hynky1999 in #145
- Fix race conditions during download/extraction by @hynky1999 in #155
- Adds PII removal by @guipenedo in #156
- Pypi Publish Action by @hynky1999 in #159
New Contributors
- @Anacheron51 made their first contribution in #86
- @StandardAI made their first contribution in #91
- @giorgioangel made their first contribution in #88
- @lvwerra made their first contribution in #113
- @adbar made their first contribution in #130
- @justHungryMan made their first contribution in #134
- @marianna13 made their first contribution in #144
- @hynky1999 made their first contribution in #145
Full Changelog: v0.0.1...v0.2.0