Fix 'standardize_data_formats' when using iterable datasets #126

marcandrelarochelle · 2025-04-30T15:22:44Z

Adds support to Iterable datasets for the 'standardize_data_formats'

num_proc is not supported by IterableDataset and cannot be accessed via index

PS: Sorry had to recreated my fork, did a mistake on my side

danielhanchen · 2025-05-25T10:25:30Z

Much apologies on the delay @marcandrelarochelle ! Thanks for the PR!

danielhanchen

@Erland366 Could you take a final review and confirm if iterable datasets work fine? Appreciate it

danielhanchen · 2025-05-25T10:26:03Z

unsloth_zoo/dataset_utils.py

@@ -405,10 +405,10 @@ def standardize_data_formats(
    if "conversations" not in column_names:
        return dataset

-    convos = dataset[:10]["conversations"]
+    examples = itertools.islice(dataset, 10)


Is this specifically for iterable datasets? Ie itertools.islice works for iterable datasets whilst dataset[:10] does not?

Exactly, you can't use an index or in this case [:10] on iterators

danielhanchen · 2025-05-25T10:26:56Z

unsloth_zoo/dataset_utils.py

-    for convo in convos:
-        for message in convo:
+    for example in examples:
+        for message in example["conversations"]:


The main reason why I did "conversations" outside is to make it somewhat faster, but I guess since its 10 examples, no big deal

danielhanchen · 2025-05-25T10:27:39Z

unsloth_zoo/dataset_utils.py

+        return dataset.map(
+            _standardize_dataset,
+            batched = True,
+            batch_size = dataset._ex_iterable.batch_size,


Interesting on ._ex_iterable - I actually am not super interested with this

The reason I did this is to pass the batch_size, if you are using a iterable dataset, you might have a resource restrained system and by default it would use a batch size of 1000, but this way it will keep the same batch size it had before the _standardize_datasetformatting.

Fix standardize_data_formats when using iterable datasets

61494c3

danielhanchen reviewed May 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix 'standardize_data_formats' when using iterable datasets #126

Fix 'standardize_data_formats' when using iterable datasets #126

Uh oh!

marcandrelarochelle commented Apr 30, 2025

Uh oh!

danielhanchen commented May 25, 2025

Uh oh!

danielhanchen left a comment

Uh oh!

danielhanchen May 25, 2025

Uh oh!

marcandrelarochelle May 26, 2025

Uh oh!

danielhanchen May 25, 2025

Uh oh!

danielhanchen May 25, 2025

Uh oh!

marcandrelarochelle May 26, 2025

Uh oh!

Uh oh!

Fix 'standardize_data_formats' when using iterable datasets #126

Are you sure you want to change the base?

Fix 'standardize_data_formats' when using iterable datasets #126

Uh oh!

Conversation

marcandrelarochelle commented Apr 30, 2025

Uh oh!

danielhanchen commented May 25, 2025

Uh oh!

danielhanchen left a comment

Choose a reason for hiding this comment

Uh oh!

danielhanchen May 25, 2025

Choose a reason for hiding this comment

Uh oh!

marcandrelarochelle May 26, 2025

Choose a reason for hiding this comment

Uh oh!

danielhanchen May 25, 2025

Choose a reason for hiding this comment

Uh oh!

danielhanchen May 25, 2025

Choose a reason for hiding this comment

Uh oh!

marcandrelarochelle May 26, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!