Fix IterableDataset state_dict shard_example_idx counting #7539

Harry-Yang0518 · 2025-04-27T20:41:18Z

Fix IterableDataset's state_dict shard_example_idx reporting

Description

This PR fixes issue #7475 where the shard_example_idx value in IterableDataset's state_dict() always equals the number of samples in a shard, even if only a few examples have been consumed.

The issue is in the _iter_arrow method of the ArrowExamplesIterable class, which increments shard_example_idx by the full length of the batch (len(pa_table)) even when only part of the batch has been consumed.

Changes

Modified the _iter_arrow method of ArrowExamplesIterable to:

Track the actual number of examples processed
Only increment the shard_example_idx by the number of examples actually yielded
Handle partial batches correctly
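
The difference between the two counting strategies can be sketched with a toy model. This is not the library code: the function names, the plain-list batches, and the dict-based state are all assumptions made for illustration, standing in for Arrow tables and the real state dict.

```python
from itertools import islice
from typing import Dict, Iterator, List


def iter_counting_per_batch(batches: List[List[int]], state: Dict[str, int]) -> Iterator[int]:
    """Hypothetical model of the reported behaviour: the counter jumps by the
    whole batch length before any example of that batch is yielded."""
    for batch in batches:
        state["shard_example_idx"] += len(batch)
        for example in batch:
            yield example


def iter_counting_per_example(batches: List[List[int]], state: Dict[str, int]) -> Iterator[int]:
    """Hypothetical model of the intended behaviour: the counter advances by
    one for each example that is actually yielded."""
    for batch in batches:
        for example in batch:
            state["shard_example_idx"] += 1
            yield example


batches = [[0, 1, 2, 3, 4, 5]]  # one shard holding a single batch of 6 examples

state_per_batch = {"shard_example_idx": 0}
consumed_a = list(islice(iter_counting_per_batch(batches, state_per_batch), 3))

state_per_example = {"shard_example_idx": 0}
consumed_b = list(islice(iter_counting_per_example(batches, state_per_example), 3))

print(consumed_a, state_per_batch)    # [0, 1, 2] {'shard_example_idx': 6}
print(consumed_b, state_per_example)  # [0, 1, 2] {'shard_example_idx': 3}
```

Both iterators hand out the same three examples; only the recorded position differs.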

How to Test

I've included a simple test case that demonstrates the fix:

from datasets import Dataset

# Create a test dataset
ds = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=1)

# Iterate through part of the dataset
for idx, example in enumerate(ds):
    print(example)
    if idx == 2:  # Stop after 3 examples (0, 1, 2)
        state_dict = ds.state_dict()
        print("Checkpoint state_dict:", state_dict)
        break

# Before the fix, the output would show shard_example_idx: 6
# After the fix, it shows shard_example_idx: 3, correctly reflecting the 3 processed examples

Implementation Details

Added logic to track the number of examples actually seen in the current shard
Modified the state update to only count examples actually yielded
Improved handling of partial batches and skipped examples

This fix ensures that checkpointing and resuming work correctly: a resumed iterator continues from the first unconsumed example instead of skipping ahead to the end of the batch.
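
To illustrate why the recorded index matters on resume, here is a minimal sketch under the same toy assumptions (plain lists instead of Arrow tables; `resume` is a hypothetical helper, not a datasets API). It skips the first shard_example_idx examples and yields the rest.

```python
from typing import Iterator, List


def resume(batches: List[List[int]], shard_example_idx: int) -> Iterator[int]:
    """Yield the examples that remain after skipping the first
    `shard_example_idx` examples of the shard (hypothetical helper)."""
    seen = 0
    for batch in batches:
        for example in batch:
            if seen >= shard_example_idx:
                yield example
            seen += 1


batches = [[0, 1, 2, 3, 4, 5]]

# Checkpoint taken after 3 examples, with the index counted per example
resumed_from_3 = list(resume(batches, 3))
# Same checkpoint, but with the index inflated to the full batch length
resumed_from_6 = list(resume(batches, 6))

print(resumed_from_3)  # [3, 4, 5]
print(resumed_from_6)  # []
```

With the inflated index, examples 3, 4 and 5 are never seen after resuming.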

HuggingFaceDocBuilderDev · 2025-04-28T11:32:10Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq

Cool ! I left some comments :)

Feel free to also update your branch to include CI fixes from main

lhoestq · 2025-04-28T13:25:32Z

src/datasets/iterable_dataset.py

@@ -317,16 +317,40 @@ def __iter__(self):

    def _iter_arrow(self):
        shard_idx_start = self._state_dict["shard_idx"] if self._state_dict else 0
-        for gen_kwags in islice(_split_gen_kwargs(self.kwargs, max_num_jobs=self.num_shards), shard_idx_start, None):
+        kwargs_with_shuffled_shards = (
+            _shuffle_gen_kwargs(self.generator, self.kwargs) if hasattr(self, "generator") else self.kwargs


is this expected ? I think ShuffledDataSourcesArrowExamplesIterable has its own _iter_arrow() implementation

lhoestq · 2025-04-28T13:35:30Z

src/datasets/iterable_dataset.py

            shard_example_idx_start = self._state_dict["shard_example_idx"] if self._state_dict else 0
            shard_example_idx = 0
+
+            examples_seen_in_current_shard = 0


how is it different from shard_example_idx ?

lhoestq · 2025-04-28T13:36:42Z

src/datasets/iterable_dataset.py

+                if shard_example_idx < shard_example_idx_start:
+                    offset = shard_example_idx_start - shard_example_idx
+                    pa_table = pa_table.slice(offset)
+                    examples_seen_in_current_shard = offset


is this needed ? we always yield full tables, so it's unlikely we end up with a shard_example_idx that doesn't land exactly on a table boundary (except if the dataset state is manually crafted maybe)

lhoestq · 2025-05-06T14:06:31Z

Hi ! FYI I made a PR to fix #7538 and it also fixed #7475, so if I'm not mistaken this PR is not needed anymore

Fix IterableDataset state_dict shard_example_idx counting

4802548

lhoestq reviewed Apr 28, 2025

Harry-Yang0518 closed this May 6, 2025
