
fix(map): fix progress bar exceeding total when load_from_cache_file=False #8170

Open
Nitin-Rajasekar wants to merge 2 commits into huggingface:main from Nitin-Rajasekar:fix/map-progress-bar-load-from-cache-file

Conversation

@Nitin-Rajasekar

Summary

When load_from_cache_file=False, the progress bar in Dataset.map() displayed a count that exceeded the dataset size (e.g. "800 examples" for a 400-row dataset).

The bug: pbar_initial was computed from len(existing_cache_files), which counts cache files on disk regardless of whether they will be used. With load_from_cache_file=False, those files exist but are ignored, yet the bar was still initialized to the full dataset size, as if all the work had already been done. The actual processing then added to that initial value, pushing the counter past the total and causing tqdm to drop the percentage display entirely.

Fix: compute pbar_initial from num_shards - len(unprocessed_kwargs_per_job) instead. This is 0 when load_from_cache_file=False, so the bar starts at zero and finishes at the total.
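To make the arithmetic concrete, here is a minimal sketch of the counter before and after the change. It is a simplified model, not the actual Dataset.map() code: the function simulate_map_progress and its parameters are hypothetical, and it assumes rows are split evenly across shards.

```python
def simulate_map_progress(num_rows, num_shards, cached_shards,
                          load_from_cache_file, fixed):
    """Return (initial, final) progress counts for a simplified model of map().

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """
    # Shards that still need processing: all of them when the cache is ignored.
    if load_from_cache_file:
        unprocessed = num_shards - cached_shards
    else:
        unprocessed = num_shards

    if fixed:
        # Patched behaviour: count only shards that are actually skipped.
        done_shards = num_shards - unprocessed
    else:
        # Old behaviour: count every cache file found on disk.
        done_shards = cached_shards

    initial = num_rows * done_shards // num_shards
    # Processing the remaining shards advances the bar by their row count.
    final = initial + num_rows * unprocessed // num_shards
    return initial, final


# 400-row dataset, one shard, a stale cache file on disk, cache disabled.
buggy = simulate_map_progress(400, 1, 1, load_from_cache_file=False, fixed=False)
patched = simulate_map_progress(400, 1, 1, load_from_cache_file=False, fixed=True)
print(buggy)    # (400, 800)
print(patched)  # (0, 400)
```

In this model the old computation ends at 800 for a 400-row dataset, matching the symptom described in the summary, while the patched computation ends at 400. When the cache is actually used, both variants agree.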

Issue

Fixes #8167

Local verification

py -3.14 -m pytest tests/test_arrow_dataset.py -k "test_map_load_from_cache_file_false_progress_bar_starts_at_zero" -v

2 passed

Risk

Low risk: a one-line change to how pbar_initial is computed. No behaviour changes outside the progress bar display.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


Development

Successfully merging this pull request may close these issues.

[BUG] Abnormal progress bar in dataset.map when load_from_cache_file=False
