
BeamWriter hits "Record exceeds maximum record size" in Dataflow with autosharding #10995

Open
@carlthome

Description

With TFDS 4.9.7 on Dataflow 2.60.0, I have a company-internal Dataflow job that fails. The input collection to train_write/GroupShards reports:

Elements added 332,090
Estimated size 1.74 TB

and its output collection reports:

Elements added 2
Estimated size 1.8 GB

before the stage fails on the next element with:

"E0123 207 recordwriter.cc:401] Record exceeds maximum record size (1096571470 > 1073741823)."

Workaround

By installing the TFDS prerelease after 3700745 and setting --num_shards=4096 (auto-detection chose 2048), the DatasetBuilder runs to completion on Dataflow. I'm curious why the auto-detection didn't choose more file shards, though, since all training examples should be roughly the same size in this DatasetBuilder.
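
The effect of the override is visible from the reported input size alone; this is just arithmetic on the 1.74 TB figure, not pipeline code:

```python
# Average serialized payload per shard, before any per-record or grouping overhead.
total_input = 1.74e12             # "Estimated size 1.74 TB" on the GroupShards input
record_limit = 1_073_741_823      # Dataflow's per-record cap from the error message

for num_shards in (2048, 4096):
    avg = total_input / num_shards
    print(f"{num_shards} shards -> ~{avg / 1e9:.2f} GB per shard "
          f"({avg / record_limit:.0%} of the record limit)")
# 2048 shards -> ~0.85 GB per shard (79% of the record limit)
# 4096 shards -> ~0.42 GB per shard (40% of the record limit)
```

At 2048 shards, even mild size skew or serialization overhead pushes individual shards past the cap, while 4096 leaves a comfortable margin.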

Suggested fix

Maybe this

max_shard_size = 0.9 * cls.max_shard_size

leaves too little headroom for the training examples. The FeaturesDict in this particular DatasetBuilder is large, and perhaps the per-key serialization overhead is unusually high. Should that factor be 0.8 instead? Or should the amount of headroom depend on how many keys the FeaturesDict contains?
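
For concreteness, here is my guess at the kind of calculation involved. The 1 GiB target shard size and the round-up-to-a-power-of-two step are assumptions on my part that happen to reproduce the observed 2048; they are not a claim about the actual implementation:

```python
import math

def guess_num_shards(total_size: float, max_shard_size: float, headroom: float) -> int:
    """Hypothetical auto-sharding: smallest power of two that keeps the average
    shard below headroom * max_shard_size."""
    min_shards = math.ceil(total_size / (headroom * max_shard_size))
    return 2 ** math.ceil(math.log2(min_shards))

total_size = 1.74e12       # reported size of the GroupShards input
max_shard_size = 2**30     # assumed 1 GiB target, in line with the record limit above

for headroom in (0.9, 0.8):
    n = guess_num_shards(total_size, max_shard_size, headroom)
    print(f"headroom={headroom}: {n} shards, ~{total_size / n / 1e9:.2f} GB per shard")
```

Under this particular guess at the rounding, 0.8 happens to land on 2048 as well, so whether adjusting the factor alone is enough depends on how the real implementation derives the shard count.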

Side remark

Surprisingly, the Dataflow limits documentation states:

Maximum size for a single element (except where stricter conditions apply, for example Streaming Engine). 2 GB

which doesn't seem to hold in practice, since the GroupBy fails at roughly 1 GiB according to the logged error.


Labels: bug