
Custom video dataset encoding/serialize uses all memory, process killed. How to fix? #5499

Open
@cleong110

Description

What I need help with / What I was wondering

I want to load a dataset containing videos like these [screenshot of dataset contents] without this happening [screenshot of the process being killed] (Colab notebook for replicating).

...How can I edit my dataset loader to use less memory when encoding videos?

Background:
I am trying to load a custom dataset with a Video feature.
When I try to tfds.load() it, or even just run download_and_prepare(), RAM usage climbs very high and then the process gets killed.
For example, this notebook will crash if allowed to run, though it may not on a High-RAM instance.
It seems to be using over 30 GB of memory to encode one or two 10 MB videos.
I would like to know how to edit/update this custom dataset so that it will not use so much memory.
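For reference, the builder looks roughly like this. This is a simplified sketch: the class name, URL, shapes, and labels are placeholders, not the actual dataset.

```python
import tensorflow_datasets as tfds


class MyVideoDataset(tfds.core.GeneratorBasedBuilder):
  """Placeholder builder with a Video feature (simplified from the real one)."""

  VERSION = tfds.core.Version("1.0.0")

  def _info(self) -> tfds.core.DatasetInfo:
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            # Each example is one whole video; TFDS decodes it into frames.
            "video": tfds.features.Video(shape=(None, 480, 640, 3)),
            "label": tfds.features.ClassLabel(names=["class_a", "class_b"]),
        }),
    )

  def _split_generators(self, dl_manager):
    # Download and extract the archive; the path points at the local copy.
    path = dl_manager.download_and_extract("https://example.com/videos.zip")
    return {"train": self._generate_examples(path)}

  def _generate_examples(self, path):
    for i, video_file in enumerate(path.glob("*.mp4")):
      # Yield the file path; TFDS runs ffmpeg on it and encodes every frame.
      yield i, {"video": video_file, "label": "class_a"}
```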

What I've tried so far

I did a lot of debugging and traced the problem with memray and other tools. See this notebook and this issue for a detailed analysis, including a copy of the memray report.

I tried various ideas in the notebook, including loading just a slice, adjusting the buffer size, and switching from .load() to download_and_prepare().
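Concretely, those attempts looked roughly like this (the dataset name is a placeholder, and I may be misusing max_examples_per_split):

```python
import tensorflow_datasets as tfds

# Loading only a slice -- still crashes, because the slice is applied after
# the whole dataset has already been encoded and serialized.
ds = tfds.load("my_video_dataset", split="train[:1]")

# Calling download_and_prepare directly and limiting how many examples are
# generated per split. Fewer videos get processed, but each remaining video
# still goes through the same expensive encode step.
builder = tfds.builder("my_video_dataset")
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(max_examples_per_split=1)
)
```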

Finally, I traced the problem to the serializing and encoding steps (see this comment), which were allocating many GiB of memory to encode even one 10 MB video.

I discovered that even one 10 MB video was extracted into over 13k video frames, taking up nearly 5 GiB of space. Then serializing would take 14-15 GiB, encoding would take another 14-15 GiB, and so the process would be killed.
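For anyone who wants to reproduce the trace, this is roughly how I wrapped the prepare step in memray (the output file name is arbitrary):

```python
import memray
import tensorflow_datasets as tfds

# Record all allocations made while preparing the dataset; the resulting
# file can be rendered with `memray flamegraph memray_tfds.bin`.
with memray.Tracker("memray_tfds.bin"):
    builder = tfds.builder("my_video_dataset")
    builder.download_and_prepare()
```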

Relevant items:

It would be nice if...

  • ...there were more examples of how to efficiently load video datasets, and explanations of why they are more efficient.
  • ...there were a way to do this in some sort of streaming fashion that used less memory, e.g. loading in a batch of frames, using a sliding window, etc.
  • ...there were some way to set a memory limit, and just have it process more slowly within that limit.
  • ...there were a way to separate the download and prepare steps, e.g. a download_only option like --download_only in the CLI.
  • ...there were a warning that the dataset was using a lot of memory in processing, before the OS kills the process.
  • ...for saving disk space, there were a way to encode and serialize videos without extracting thousands of individual frames and ballooning the size from 10 MB to multiple GiB. Maybe there is one and I just don't know (a rough sketch of what I have in mind is after this list).
  • ...it was possible to download only part of a dataset. It's possible to load a slice, but only after download_and_prepare has processed the whole thing.
  • ...there were more explanation of what the serialization and encoding steps are for, maybe? What are they, exactly?
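On the frame-extraction point above, the only workaround I can think of is to skip the Video feature entirely and store the encoded bytes, deferring frame decoding to the input pipeline. This is just a sketch under my own assumptions (in particular, I'm not sure a scalar string Tensor feature is the intended way to store raw bytes), not an official recipe:

```python
import tensorflow as tf
import tensorflow_datasets as tfds


class MyVideoBytesDataset(tfds.core.GeneratorBasedBuilder):
  """Workaround sketch: store encoded video bytes, decode frames at read time."""

  VERSION = tfds.core.Version("1.0.0")

  def _info(self) -> tfds.core.DatasetInfo:
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            # Assumption: a scalar string Tensor can hold the raw mp4 bytes.
            "video_bytes": tfds.features.Tensor(shape=(), dtype=tf.string),
            "filename": tfds.features.Text(),
        }),
    )

  def _split_generators(self, dl_manager):
    path = dl_manager.download_and_extract("https://example.com/videos.zip")
    return {"train": self._generate_examples(path)}

  def _generate_examples(self, path):
    for i, video_file in enumerate(path.glob("*.mp4")):
      # Keep the compressed container as-is; a ~10 MB file stays ~10 MB on
      # disk instead of being expanded into thousands of per-frame images.
      yield i, {
          "video_bytes": video_file.read_bytes(),
          "filename": video_file.name,
      }
```

The trade-off is that every consumer has to decode frames itself (e.g. with ffmpeg) in the input pipeline, but the prepared dataset stays roughly the size of the source videos.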

Environment information
I've tested it on Colab and a few other Ubuntu workstations. High-RAM Colab instances seem to have enough memory to get past this.
