Description
What I need help with / What I was wondering
I want to load a dataset containing these videos without the process being killed (Colab notebook for replicating).
In short: how can I edit my dataset loader to use less memory when encoding videos?
Background:
I am trying to load a custom dataset with a Video feature.
When I try to tfds.load() it, or even just download_and_prepare() it, RAM usage climbs very high and then the process gets killed.
For example, this notebook will crash if allowed to run, though with a High-RAM instance it may not.
It seems to use over 30 GB of memory to encode one or two 10 MB videos.
I would like to know how to edit/update this custom dataset so that it will not use so much memory.
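For context, the failing call is essentially the following. This is a minimal sketch; the package import and the registered name "dgs_corpus" are assumptions based on the linked dgs_corpus.py loader.

```python
import tensorflow_datasets as tfds

# Assumption: importing the dataset package registers the "dgs_corpus" builder,
# as in the linked dgs_corpus.py loader.
import sign_language_datasets.datasets  # noqa: F401

# Either of these runs the full download-and-prepare pipeline; it is during
# the prepare step (encoding + serializing examples) that RAM climbs until
# the process is killed on a standard Colab instance.
ds = tfds.load("dgs_corpus", split="train")

# Equivalent lower-level path:
builder = tfds.builder("dgs_corpus")
builder.download_and_prepare()
```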
I did a bunch of debugging and tracing of the problem with memray, etc. See this notebook and this issue for detailed analysis including a copy of the memray report.
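For anyone wanting to reproduce the trace: memray can be attached in-process. A minimal sketch of how a report like the attached one can be captured (the builder name and import are the same assumptions as above):

```python
import memray
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (naming assumption as above)

builder = tfds.builder("dgs_corpus")

# Record every allocation made while preparing the dataset; the resulting
# file can be rendered with `memray flamegraph memray_output_file.bin`.
with memray.Tracker("memray_output_file.bin"):
    builder.download_and_prepare()
```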
I tried various ideas in the notebook (sketched below), including loading just a slice, editing the buffer size, and switching from tfds.load() to download_and_prepare().
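The attempts looked roughly like this (sketch only; neither helped, because the slice and the example cap only change what gets read or written, while each example is still fully decoded and encoded in memory):

```python
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (naming assumption as above)

# 1) Loading just a slice: the slice only applies when reading, so
#    download_and_prepare still encodes every example first.
ds = tfds.load("dgs_corpus", split="train[:1%]")

# 2) Preparing explicitly instead of tfds.load(), capping the number of
#    generated examples via a DownloadConfig debugging knob. Even a single
#    example is enough to exhaust memory here.
builder = tfds.builder("dgs_corpus")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(max_examples_per_split=1)
)
# (The buffer-size tweak tried in the notebook is omitted from this sketch.)
```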
Finally I traced the problem to the serializing and encoding steps (see this comment), which were allocating many GiB of memory to encode even a single 10 MB video.
I discovered that even one 10 MB video was extracted into over 13,000 video frames, taking up nearly 5 GiB of space. Serializing those frames would then take 14-15 GiB of memory, encoding would take another 14-15 GiB, and so the process would be killed.
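A back-of-the-envelope calculation makes the blow-up unsurprising; the frame resolution below is an assumption for illustration (the memray report has the real numbers), but the shape of the problem is the same: raw uint8 frames cost height × width × 3 bytes each, so thousands of them dwarf a 10 MB compressed file.

```python
# Rough in-memory footprint of one fully decoded video.
# Assumption: ~13,000 frames at 720x576, 3 uint8 channels (1 byte per value).
frames = 13_000
height, width, channels = 576, 720, 3

bytes_per_frame = height * width * channels   # ~1.2 MiB per frame
total_bytes = frames * bytes_per_frame
print(f"{total_bytes / 2**30:.1f} GiB")       # -> 15.1 GiB

# Holding all frames once for serialization and again for encoding matches
# the 14-15 GiB + 14-15 GiB allocations seen in the memray report.
```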
Relevant items:
- The data loader in question, dgs_corpus.py
- The full memray report: memray_output_file.tar.gz
- Encoding path: the dataset uses a custom VideoFeature as well, defined here. The memray report shows that encode_example here ends up allocating 14.5 GiB.
- Serialization: the memray report shows that the other memory-heavy path is serialization: split_builder.py here, which calls writer.py's serialization.
It would be nice if...
- ...there were more examples of how to efficiently load video datasets, and explanations of why they are more efficient.
- ...there were a way to do this in some sort of streaming fashion that used less memory, e.g. loading in a batch of frames, using a sliding window, etc.
- ...there were some way to set a memory limit and just have it process more slowly within that limit (a crude workaround is sketched after this list).
- ...there were a way to separate the download and prepare processes: a download-only option, like --download_only in the CLI.
- ...there were a warning that the dataset was using a lot of memory during processing, before the OS kills the process.
- ...there were, to save disk space, a way to encode and serialize videos without extracting thousands of individual frames and ballooning the size from 10 MB to multiple GiB. Maybe there is and I just don't know.
- ...it was possible to download only part of a dataset. It's possible to load a slice, but only after download_and_prepare does its whole thing.
- ...there were more explanation of what serialization and encoding are for, maybe? What are they?
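On the memory-limit wish specifically: I am not aware of a throttling option in tfds, but as a crude stopgap on Linux the process can cap its own address space so that preparation fails with a Python MemoryError instead of being silently OOM-killed. The 8 GiB figure below is arbitrary, and native allocations may still abort rather than raise, so treat this as a rough guard only:

```python
import resource

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (naming assumption as above)

# Cap this process's address space at 8 GiB (Linux). Allocations beyond the
# cap fail instead of triggering the OS out-of-memory killer.
limit_bytes = 8 * 2**30
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

builder = tfds.builder("dgs_corpus")
try:
    builder.download_and_prepare()
except MemoryError:
    print("Hit the 8 GiB cap while preparing the dataset.")
```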
Environment information
I've tested it on Colab and a few other Ubuntu workstations. High-RAM Colab instances seem to have enough memory to get past this.