Description
What I need help with / What I was wondering
I want to load a dataset containing these videos without the process being killed (Colab notebook for replicating).
In short: how can I edit my dataset loader to use less memory when encoding videos?
Background:
I am trying to load a custom dataset with a Video feature.
When I try to tfds.load() it, or even just download_and_prepare() it, RAM usage climbs very high and then the process gets killed.
For example, this notebook will crash if allowed to run, though with a High-RAM instance it may not.
It seems to use over 30 GB of memory to encode one or two 10 MB videos.
I would like to know how to edit/update this custom dataset so that it will not use so much memory.
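For context, the failing call is essentially the following. This is a minimal sketch; the package import and the registered name "dgs_corpus" are assumptions based on the linked dgs_corpus.py loader.

```python
import tensorflow_datasets as tfds

# Assumption: importing the dataset package registers the "dgs_corpus" builder,
# as in the linked dgs_corpus.py loader.
import sign_language_datasets.datasets  # noqa: F401

# Either of these runs the full download-and-prepare pipeline; it is during
# the prepare step (encoding + serializing examples) that RAM climbs until
# the process is killed on a standard Colab instance.
ds = tfds.load("dgs_corpus", split="train")

# Equivalent lower-level path:
builder = tfds.builder("dgs_corpus")
builder.download_and_prepare()
```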
I did a bunch of debugging and tracing of the problem with memray, etc. See this notebook and this issue for detailed analysis including a copy of the memray report.
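For anyone wanting to reproduce the trace: memray can be attached in-process. A minimal sketch of how a report like the attached one can be captured (the builder name and import are the same assumptions as above):

```python
import memray
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (naming assumption as above)

builder = tfds.builder("dgs_corpus")

# Record every allocation made while preparing the dataset; the resulting
# file can be rendered with `memray flamegraph memray_output_file.bin`.
with memray.Tracker("memray_output_file.bin"):
    builder.download_and_prepare()
```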
I tried various ideas in the notebook (sketched below), including loading just a slice, editing the buffer size, and switching from tfds.load() to download_and_prepare().
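The attempts looked roughly like this (sketch only; neither helped, because the slice and the example cap only change what gets read or written, while each example is still fully decoded and encoded in memory):

```python
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (naming assumption as above)

# 1) Loading just a slice: the slice only applies when reading, so
#    download_and_prepare still encodes every example first.
ds = tfds.load("dgs_corpus", split="train[:1%]")

# 2) Preparing explicitly instead of tfds.load(), capping the number of
#    generated examples via a DownloadConfig debugging knob. Even a single
#    example is enough to exhaust memory here.
builder = tfds.builder("dgs_corpus")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(max_examples_per_split=1)
)
# (The buffer-size tweak tried in the notebook is omitted from this sketch.)
```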
Finally I traced the problem to the serializing and encoding steps (see this comment), which were allocating many GiB of memory to encode even a single 10 MB video.
I discovered that even one 10 MB video was extracted into over 13,000 video frames, taking up nearly 5 GiB of space. Serializing those frames would then take 14-15 GiB of memory, encoding would take another 14-15 GiB, and so the process would be killed.
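A back-of-the-envelope calculation makes the blow-up unsurprising; the frame resolution below is an assumption for illustration (the memray report has the real numbers), but the shape of the problem is the same: raw uint8 frames cost height × width × 3 bytes each, so thousands of them dwarf a 10 MB compressed file.

```python
# Rough in-memory footprint of one fully decoded video.
# Assumption: ~13,000 frames at 720x576, 3 uint8 channels (1 byte per value).
frames = 13_000
height, width, channels = 576, 720, 3

bytes_per_frame = height * width * channels   # ~1.2 MiB per frame
total_bytes = frames * bytes_per_frame
print(f"{total_bytes / 2**30:.1f} GiB")       # -> 15.1 GiB

# Holding all frames once for serialization and again for encoding matches
# the 14-15 GiB + 14-15 GiB allocations seen in the memray report.
```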
Relevant items:
- The data loader in question, dgs_corpus.py
- The full memray report: memray_output_file.tar.gz
- Encoding path: the dataset uses a custom VideoFeature as well, defined here. The memray report shows that encode_example here ends up allocating 14.5 GiB.
- Serialization: the memray report shows that the other memory-heavy path is serialization: split_builder.py here, which calls writer.py's serialization.
It would be nice if...
- ...there were more examples of how to efficiently load video datasets, and explanations of why they are more efficient.
- ...there were a way to do this in some sort of streaming fashion that used less memory, e.g. loading in a batch of frames, using a sliding window, etc.
- ...there were some way to set a memory limit and just have it process more slowly within that limit (a crude workaround is sketched after this list).
- ...there were a way to separate the download and prepare processes: a download-only option, like --download_only in the CLI.
- ...there were a warning that the dataset was using a lot of memory during processing, before the OS kills the process.
- ...there were, to save disk space, a way to encode and serialize videos without extracting thousands of individual frames and ballooning the size from 10 MB to multiple GiB. Maybe there is and I just don't know.
- ...it was possible to download only part of a dataset. It's possible to load a slice, but only after download_and_prepare does its whole thing.
- ...there were more explanation of what serialization and encoding are for, maybe? What are they?
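On the memory-limit wish specifically: I am not aware of a throttling option in tfds, but as a crude stopgap on Linux the process can cap its own address space so that preparation fails with a Python MemoryError instead of being silently OOM-killed. The 8 GiB figure below is arbitrary, and native allocations may still abort rather than raise, so treat this as a rough guard only:

```python
import resource

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (naming assumption as above)

# Cap this process's address space at 8 GiB (Linux). Allocations beyond the
# cap fail instead of triggering the OS out-of-memory killer.
limit_bytes = 8 * 2**30
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

builder = tfds.builder("dgs_corpus")
try:
    builder.download_and_prepare()
except MemoryError:
    print("Hit the 8 GiB cap while preparing the dataset.")
```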
Environment information
I've tested it on Colab and a few other Ubuntu workstations. High-RAM Colab instances seem to have enough memory to get past this.