add streaming datasets blogpost #3084
Conversation
Vaibhavs10
left a comment
Didn't complete the review but please do run a spell check on this, and overall convert the tense from present continuous to present.
We can make this a bit simpler and just add it as part of the original v3 blogpost.
| - user: aractingi | ||
| --- | ||
| **TL;DR** We introduce streaming mode for `LeRobotDataset`, allowing users to iterate over massive robotics datasets without ever having to download them. `StreamingLeRobotDataset` is a new dataset class fully integrated with `lerobot` enabling fast, random sampling and on-the-fly video decoding to deliver high throughput with a small memory footprint. We also add native support for time-window queries via `delta_timestamps`, powered by a custom backtrackable iterator that steps both backward and forward efficiently. All datasets currently released in `LeRobotDataset:v3.0` can be used in streaming mode, by simply using `StreamingLeRobotDataset`. |
this could be a list instead; a large chunk of paragraphs can be distracting (especially as a TL;DR)
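The backtrackable iterator mentioned in the TL;DR can be sketched in a few lines. This is a minimal illustration of the idea only (a bounded history that lets the consumer step backward as well as forward, e.g. to assemble a `delta_timestamps` window around the current frame), not lerobot's actual implementation; the class name and API here are hypothetical.

```python
from collections import deque

class BacktrackableIterator:
    """Sketch of a backtrackable iterator: wraps any iterable and keeps a
    bounded history of consumed items, so callers can step backward into
    already-seen items and forward again without re-reading the source."""

    def __init__(self, iterable, history_size=100):
        self._it = iter(iterable)
        self._history = deque(maxlen=history_size)  # already-consumed items
        self._pos = 0  # how many steps back from the frontier we currently are

    def forward(self):
        if self._pos > 0:
            # Replay a previously seen item instead of consuming a new one.
            self._pos -= 1
            return self._history[len(self._history) - 1 - self._pos]
        item = next(self._it)  # consume a fresh item from the source
        self._history.append(item)
        return item

    def backward(self):
        if self._pos + 1 >= len(self._history):
            raise IndexError("no more history to backtrack into")
        self._pos += 1
        return self._history[len(self._history) - 1 - self._pos]
```

Stepping backward only touches the in-memory history, so serving a time window around the current frame never triggers a second pass over the stream.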
| ## Installing `lerobot` | ||
| [`lerobot`](https://github.com/huggingface/lerobot) is the end-to-end robotics library developed at Hugging Face, supporting real-world robotics as well as state of the art robot learning algorithms. | ||
| The library allows to record datasets locally directly on real-world robots, and to store datasets on the Hugging Face Hub. |
| The library allows to record datasets locally directly on real-world robots, and to store datasets on the Hugging Face Hub. | |
| The library allows you to record datasets directly on real-world robots, and to store datasets on the Hugging Face Hub. |
| You can read more about the robots we currently support [here](https://huggingface.co/docs/lerobot/), and browse the thousands of datasets already contributed by the open-source community on the Hugging Face Hub [here 🤗](https://huggingface.co/datasets?modality=modality:timeseries&task_categories=task_categories:robotics&sort=trending). |
| You can read more about the robots we currently support [here](https://huggingface.co/docs/lerobot/), and browse the thousands of datasets already contributed by the open-source community on the Hugging Face Hub [here 🤗](https://huggingface.co/datasets?modality=modality:timeseries&task_categories=task_categories:robotics&sort=trending). | |
| You can read more about the robots we currently support [here](https://huggingface.co/docs/lerobot/), and browse the thousands of datasets already contributed by the open-source community on the Hugging Face Hub [here](https://huggingface.co/datasets?modality=modality:timeseries&task_categories=task_categories:robotics&sort=trending). |
| We [recently introduced](https://huggingface.co/blog/lerobot-datasets-v3) a new dataset format enabling streaming mode. Both functionalities will ship with `lerobot-v0.4.0`, and you can access them right now by building the library from source! You can find the installation instructions for lerobot [here](https://huggingface.co/docs/lerobot/en/installation). |
It'd be better if you put this all in one blogpost v3: https://huggingface.co/blog/lerobot-datasets-v3
| ## Why Streaming Datasets | ||
| Training robot learning algorithms using large-scale robotics datasets can mean having to process terabytes of multi-modal data. | ||
| For instance, a popular manipulation dataset like [DROID](https://huggingface.co/datasets/lerobot/droid_1.0.1/tree/main), containing 130K+ episodes amounting to a total of 26M+ frames results in 4TB of space: a disk and memory requirement which is simply unattainable for most institutions. |
| For instance, a popular manipulation dataset like [DROID](https://huggingface.co/datasets/lerobot/droid_1.0.1/tree/main), containing 130K+ episodes amounting to a total of 26M+ frames results in 4TB of space: a disk and memory requirement which is simply unattainable for most institutions. | |
| For instance, a popular manipulation dataset like [DROID](https://huggingface.co/datasets/lerobot/droid_1.0.1/tree/main) contains 130K+ episodes amounting to a total of 26M+ frames, resulting in 4TB of space: a disk and memory requirement that is simply unattainable for most institutions. |
| - On-the-fly video decoding using the [`torchcodec`](https://docs.pytorch.org/torchcodec/stable/generated_examples/decoding/file_like.html) library | ||
| These two factors allow to step through an iterable, retrieving frames on the fly and exclusively locally via a series of `.next()` calls, without ever loading the dataset into memory. |
| These two factors allow to step through an iterable, retrieving frames on the fly and exclusively locally via a series of `.next()` calls, without ever loading the dataset into memory. | |
| These two factors allow us to step through an iterable, retrieving frames on the fly and exclusively locally via a series of `.next()` calls, without ever loading the dataset into memory. |
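To make the consumption pattern concrete, here is a minimal sketch of stepping through frames lazily with `next()`. The `frame_stream` generator is a hypothetical stand-in for the real dataset iterable, which would fetch bytes and decode a single video frame per step; the point is that only one item is ever materialized at a time.

```python
def frame_stream(num_frames):
    """Hypothetical stand-in for a streaming dataset iterable: frames are
    produced lazily, one per next() call, so the full dataset is never
    loaded into memory."""
    for i in range(num_frames):
        # In the real setting this step would fetch data over the network
        # and decode a single video frame on the fly.
        yield {"index": i, "observation": f"frame-{i}"}

it = iter(frame_stream(26_000_000))  # a DROID-scale frame count
first = next(it)   # only one frame is held in memory at a time
second = next(it)
```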
| </center> | ||
| </p> | ||
| Indeed, we can measure the correlation coefficient of the streamed `index` and the `iteration_index` to measure the randomness of the streaming procedure, where high levels of randomness correspond to a low (absolute) correlation coefficient and low levels of randomness result in high (either positive or negative) correlation. |
maybe flesh this out a bit more - this would be incomprehensible to someone who isn't initiated in robotics
| Low randomness when streaming frames is very problematic in those use cases where datasets are processed for training purposes. | ||
| In such contexts, items typically need to be shuffled so as to mitigate the inherent inter-dependency between successive frames recorded via demonstrations. | ||
| Similarily to the `datasets 🤗` library, we solve this issue maintaining a buffer of frames in memory, typically much smaller than the original datasets (1000s of frames versus 100Ms or 1Bs). |
| Similarily to the `datasets 🤗` library, we solve this issue maintaining a buffer of frames in memory, typically much smaller than the original datasets (1000s of frames versus 100Ms or 1Bs). | |
| Similar to the `datasets 🤗` library, we solve this issue by maintaining a buffer of frames in memory, typically much smaller than the original datasets (1000s of frames versus 100Ms or 1Bs). |
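The buffer-based shuffling can be sketched as follows. This mirrors the shuffle-buffer idea used by `datasets 🤗` (keep a small pool in memory, emit a random element, refill the freed slot from the stream); it is an illustration of the technique, not lerobot's actual code, and the function name is hypothetical.

```python
import random

def shuffle_buffer(stream, buffer_size=1000, seed=0):
    """Sketch of a shuffle buffer: only `buffer_size` items are ever held
    in memory, yet the emitted order is approximately shuffled."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)   # initial fill (the startup overhead)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]         # emit a random buffered item...
        buffer[idx] = item        # ...and replace it with the fresh one
    rng.shuffle(buffer)
    yield from buffer             # drain whatever is left at the end
```

Larger buffers give better shuffling (lower index correlation) at the cost of more memory and a longer initial fill.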
|  | ||
| Because the `.next()` call for the dataset is now stacked on top of a process to fill in an intermediate buffer, an initialization overhead is introduced, to allow the buffer to be filled. |
| Because the `.next()` call for the dataset is now stacked on top of a process to fill in an intermediate buffer, an initialization overhead is introduced, to allow the buffer to be filled. | |
| Since the `.next()` call for the dataset is now stacked on top of a process that fills an intermediate buffer, an initialization overhead is introduced while the buffer is filled. |
| ``` | ||
| While we expect our randomness measurements to be robust across deployment scenarios, the sample throughput is likely going to vary depending on the connection speed. | ||
| ## Starting simple: Streaming Single Frames |
it'd be better to explain both variants of streaming visually as well
| Training robot learning algorithms using large-scale robotics datasets can mean having to process terabytes of multi-modal data. |
We updated L2D yesterday with our next release, R3, of 100K episodes in dataset v3 format. A rough size estimate is 20M * 6 frames (6 cameras) and 4.8TB. R3 works with StreamingLeRobotDataset. Shall we add a usage example here? @fracapuano
Actually, this is going to be merged with the other datasets blogpost, where we are already mentioning you guys, so I guess we should be fine :)) Wdyt?
Ah righto. Yes ofc :))
Congratulations! You've made it this far! Once merged, the article will appear at https://huggingface.co/blog. Official articles
require additional reviews. Alternatively, you can write a community article following the process here.
Preparing the Article
You're not quite done yet, though. Please make sure to follow this process (as documented here):
md file. You can also specify `guest` or `org` for the authors. Here is an example of a complete PR: #2382
Getting a Review
Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.
Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.