Skip to content

[BE] Unify buffer_size across datapipes #335

Open
@pmeier

Description

The buffer_size parameter is currently fairly inconsistent across datapipes:

name default buffer_size infinite buffer_size warn on infinite
Demultiplexer 1e3 -1 yes
Forker 1e3 -1 yes
Grouper 1e4 N/A N/A
Shuffler 1e4 N/A N/A
MaxTokenBucketizer 1e3 N/A N/A
UnZipper 1e3 -1 yes
IterKeyZipper 1e4 None no

Here are my suggestion on how to unify this:

  • Use the same default buffer_size everywhere. It makes little difference whether we use 1e3 or 1e4 given that it is tightly coupled with the data we know nothing about. Given today's hardware / datasets, I would go with 1e4, but no strong opinion.
  • Give every datapipe with buffer the ability for an infinite buffer. Otherwise users will just be annoyed and use a workaround. For example, torchvision simply uses INFINITE_BUFFER_SIZE = 1_000_000_000, which for all intents and purposes lives up to its name. Which sentinel we use, i.e. -1 or None, again makes little difference. I personally would use None to have a clear separation, but again no strong opinion other than being consistent.
  • Do not warn on infinite buffer sizes. Especially since infinite buffer is not the default behavior, the user is expected to know what they are doing when setting buffer_size=None. I'm all for having a warning like this in the documentation, but I'm strongly against a runtime warning. For example, torchvision datasets need to use an infinite buffer everywhere. Thus, by using the infinite buffer sentinel, users would always get runtime warnings although neither them nor we did anything wrong.

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions