add streaming datasets blogpost #3084
Conversation
Vaibhavs10
left a comment
Didn't complete the review but please do run a spell check on this, and overall convert the tense from present continuous to present.
We can make this a bit simpler and just add it as part of the original v3 blogpost.
| - user: aractingi | ||
| --- | ||
| **TL;DR** We introduce streaming mode for `LeRobotDataset`, allowing users to iterate over massive robotics datasets without ever having to download them. `StreamingLeRobotDataset` is a new dataset class fully integrated with `lerobot` enabling fast, random sampling and on-the-fly video decoding to deliver high throughput with a small memory footprint. We also add native support for time-window queries via `delta_timestamps`, powered by a custom backtrackable iterator that steps both backward and forward efficiently. All datasets currently released in `LeRobotDataset:v3.0` can be used in streaming mode, by simply using `StreamingLeRobotDataset`. |
this could be a list instead; a large chunk of paragraphs can be distracting (especially as a TL;DR)
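The backtrackable iterator mentioned in the TL;DR can be sketched in a few lines. This is a minimal illustration of the idea only (a bounded history that lets the consumer step backward as well as forward, e.g. to assemble a `delta_timestamps` window around the current frame), not lerobot's actual implementation; the class name and API here are hypothetical.

```python
from collections import deque

class BacktrackableIterator:
    """Sketch of a backtrackable iterator: wraps any iterable and keeps a
    bounded history of consumed items, so callers can step backward into
    already-seen items and forward again without re-reading the source."""

    def __init__(self, iterable, history_size=100):
        self._it = iter(iterable)
        self._history = deque(maxlen=history_size)  # already-consumed items
        self._pos = 0  # how many steps back from the frontier we currently are

    def forward(self):
        if self._pos > 0:
            # Replay a previously seen item instead of consuming a new one.
            self._pos -= 1
            return self._history[len(self._history) - 1 - self._pos]
        item = next(self._it)  # consume a fresh item from the source
        self._history.append(item)
        return item

    def backward(self):
        if self._pos + 1 >= len(self._history):
            raise IndexError("no more history to backtrack into")
        self._pos += 1
        return self._history[len(self._history) - 1 - self._pos]
```

Stepping backward only touches the in-memory history, so serving a time window around the current frame never triggers a second pass over the stream.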
| ## Installing `lerobot` | ||
| [`lerobot`](https://github.com/huggingface/lerobot) is the end-to-end robotics library developed at Hugging Face, supporting real-world robotics as well as state of the art robot learning algorithms. | ||
| The library allows to record datasets locally directly on real-world robots, and to store datasets on the Hugging Face Hub. |
| The library allows to record datasets locally directly on real-world robots, and to store datasets on the Hugging Face Hub. | |
| The library allows you to record datasets directly on real-world robots, and to store datasets on the Hugging Face Hub. |
| You can read more about the robots we currently support [here](https://huggingface.co/docs/lerobot/), and browse the thousands of datasets already contributed by the open-source community on the Hugging Face Hub [here 🤗](https://huggingface.co/datasets?modality=modality:timeseries&task_categories=task_categories:robotics&sort=trending). |
| You can read more about the robots we currently support [here](https://huggingface.co/docs/lerobot/), and browse the thousands of datasets already contributed by the open-source community on the Hugging Face Hub [here 🤗](https://huggingface.co/datasets?modality=modality:timeseries&task_categories=task_categories:robotics&sort=trending). | |
| You can read more about the robots we currently support [here](https://huggingface.co/docs/lerobot/), and browse the thousands of datasets already contributed by the open-source community on the Hugging Face Hub [here](https://huggingface.co/datasets?modality=modality:timeseries&task_categories=task_categories:robotics&sort=trending). |
| We [recently introduced](https://huggingface.co/blog/lerobot-datasets-v3) a new dataset format enabling streaming mode. Both functionalities will ship with `lerobot-v0.4.0`, and you can access them right now by building the library from source! You can find the installation instructions for lerobot [here](https://huggingface.co/docs/lerobot/en/installation). |
It'd be better if you put this all in one blogpost v3: https://huggingface.co/blog/lerobot-datasets-v3
| ## Why Streaming Datasets | ||
| Training robot learning algorithms using large-scale robotics datasets can mean having to process terabytes of multi-modal data. | ||
| For instance, a popular manipulation dataset like [DROID](https://huggingface.co/datasets/lerobot/droid_1.0.1/tree/main), containing 130K+ episodes amounting to a total of 26M+ frames results in 4TB of space: a disk and memory requirement which is simply unattainable for most institutions. |
| For instance, a popular manipulation dataset like [DROID](https://huggingface.co/datasets/lerobot/droid_1.0.1/tree/main), containing 130K+ episodes amounting to a total of 26M+ frames results in 4TB of space: a disk and memory requirement which is simply unattainable for most institutions. | |
| For instance, a popular manipulation dataset like [DROID](https://huggingface.co/datasets/lerobot/droid_1.0.1/tree/main) contains 130K+ episodes amounting to a total of 26M+ frames, resulting in 4TB of space: a disk and memory requirement that is simply unattainable for most institutions. |
| - On-the-fly video decoding using the [`torchcodec`](https://docs.pytorch.org/torchcodec/stable/generated_examples/decoding/file_like.html) library | ||
| These two factors allow to step through an iterable, retrieving frames on the fly and exclusively locally via a series of `.next()` calls, without ever loading the dataset into memory. |
| These two factors allow to step through an iterable, retrieving frames on the fly and exclusively locally via a series of `.next()` calls, without ever loading the dataset into memory. | |
| These two factors allow us to step through an iterable, retrieving frames on the fly and exclusively locally via a series of `.next()` calls, without ever loading the dataset into memory. |
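To make the consumption pattern concrete, here is a minimal sketch of stepping through frames lazily with `next()`. The `frame_stream` generator is a hypothetical stand-in for the real dataset iterable, which would fetch bytes and decode a single video frame per step; the point is that only one item is ever materialized at a time.

```python
def frame_stream(num_frames):
    """Hypothetical stand-in for a streaming dataset iterable: frames are
    produced lazily, one per next() call, so the full dataset is never
    loaded into memory."""
    for i in range(num_frames):
        # In the real setting this step would fetch data over the network
        # and decode a single video frame on the fly.
        yield {"index": i, "observation": f"frame-{i}"}

it = iter(frame_stream(26_000_000))  # a DROID-scale frame count
first = next(it)   # only one frame is held in memory at a time
second = next(it)
```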
| </center> | ||
| </p> | ||
| Indeed, we can measure the correlation coefficient of the streamed `index` and the `iteration_index` to measure the randomness of the streaming procedure, where high levels of randomness correspond to a low (absolute) correlation coefficient and low levels of randomness result in high (either positive or negative) correlation. |
maybe flesh this out a bit more - this would be incomprehensible to someone who isn't initiated in robotics
| Low randomness when streaming frames is very problematic in those use cases where datasets are processed for training purposes. | ||
| In such contexts, items typically need to be shuffled so as to mitigate the inherent inter-dependency between successive frames recorded via demonstrations. | ||
| Similarily to the `datasets 🤗` library, we solve this issue maintaining a buffer of frames in memory, typically much smaller than the original datasets (1000s of frames versus 100Ms or 1Bs). |
| Similarily to the `datasets 🤗` library, we solve this issue maintaining a buffer of frames in memory, typically much smaller than the original datasets (1000s of frames versus 100Ms or 1Bs). | |
| Similar to the `datasets 🤗` library, we solve this issue by maintaining a buffer of frames in memory, typically much smaller than the original datasets (1000s of frames versus 100Ms or 1Bs). |
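The buffer-based shuffling can be sketched as follows. This mirrors the shuffle-buffer idea used by `datasets 🤗` (keep a small pool in memory, emit a random element, refill the freed slot from the stream); it is an illustration of the technique, not lerobot's actual code, and the function name is hypothetical.

```python
import random

def shuffle_buffer(stream, buffer_size=1000, seed=0):
    """Sketch of a shuffle buffer: only `buffer_size` items are ever held
    in memory, yet the emitted order is approximately shuffled."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)   # initial fill (the startup overhead)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]         # emit a random buffered item...
        buffer[idx] = item        # ...and replace it with the fresh one
    rng.shuffle(buffer)
    yield from buffer             # drain whatever is left at the end
```

Larger buffers give better shuffling (lower index correlation) at the cost of more memory and a longer initial fill.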
|  | ||
| Because the `.next()` call for the dataset is now stacked on top of a process to fill in an intermediate buffer, an initialization overhead is introduced, to allow the buffer to be filled. |
| Because the `.next()` call for the dataset is now stacked on top of a process to fill in an intermediate buffer, an initialization overhead is introduced, to allow the buffer to be filled. | |
| Since the `.next()` call for the dataset is now stacked on top of a process that fills an intermediate buffer, an initialization overhead is introduced while the buffer is filled. |
| ``` | ||
| While we expect our randomness measurements to be robust across deployment scenarios, the sample throughput is likely going to vary depending on the connection speed. | ||
| ## Starting simple: Streaming Single Frames |
it'd be better to explain both variants of streaming visually as well
| Training robot learning algorithms using large-scale robotics datasets can mean having to process terabytes of multi-modal data. |
We updated L2D yesterday with our next release, R3, of 100K episodes in dataset v3 format. A rough size estimate is 20M * 6 frames (6 cameras) and 4.8TB. R3 works with StreamingLeRobotDataset. Shall we add a usage example here? @fracapuano
Actually, this is going to be merged with the other datasets blogpost, where we are already mentioning you guys, so I guess we should be fine :)) Wdyt?
Ah righto. Yes ofc :))
Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.
Preparing the Article
You're not quite done yet, though. Please make sure to follow this process (as documented here):
md file. You can also specify `guest` or `org` for the authors. Here is an example of a complete PR: #2382
Getting a Review
Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.
Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.