Description
For TFDS 4.9.7 on Dataflow 2.60.0, I have a company-internal Dataflow job that fails. The input collection to train_write/GroupShards reports:
Elements added: 332,090
Estimated size: 1.74 TB
while the output collection reports:
Elements added: 2
Estimated size: 1.8 GB
It then fails on the next element with
"E0123 207 recordwriter.cc:401] Record exceeds maximum record size (1096571470 > 1073741823)."
Workaround
By installing the TFDS prerelease after 3700745 and setting --num_shards=4096
(auto-detection chose 2048), the DatasetBuilder runs to completion on Dataflow. I'm curious, however, why the auto-detection didn't choose more file shards, as all training examples should be roughly the same size in this DatasetBuilder.
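A back-of-the-envelope check with the figures above (a rough sketch only: it treats TB as 10^12 bytes, takes the 1.74 TB estimate at face value, and assumes examples split evenly across shards, which roughly uniform example sizes should make reasonable):

```python
# Rough per-group sizing for train_write/GroupShards, assuming the 1.74 TB
# estimate (taken as 10**12-byte TB) is accurate and examples split evenly.
TOTAL_BYTES = 1.74e12
REJECTED_RECORD_BYTES = 1_096_571_470
MAX_RECORD_BYTES = 2**30 - 1            # cap from the logged error

for num_shards in (2048, 4096):
    per_group = TOTAL_BYTES / num_shards
    print(f"{num_shards} shards -> ~{per_group / 2**30:.2f} GiB per grouped shard "
          f"({per_group / MAX_RECORD_BYTES:.0%} of the record cap)")
# 2048 shards -> ~0.79 GiB per grouped shard (79% of the record cap)
# 4096 shards -> ~0.40 GiB per grouped shard (40% of the record cap)

# The rejected record is ~29% larger than the 2048-shard even-split estimate,
# so the serialized groups carry noticeably more overhead than the raw estimate:
print(f"observed / estimated: {REJECTED_RECORD_BYTES / (TOTAL_BYTES / 2048):.2f}x")
# observed / estimated: 1.29x
```

With 2048 shards a grouped shard already sits near the cap once that overhead is included, which matches the failure; 4096 leaves plenty of room.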
Suggested fix
Maybe this is too little headroom for the training examples. The FeatureDict in this particular DatasetBuilder is large, and perhaps the key overhead is unusually large. Should that number be 0.8 instead? Or should the headroom be larger when the FeatureDict contains many keys?
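For illustration only (a toy model, not the actual TFDS shard auto-detection code), a headroom factor on the ~1 GiB record cap translates into a minimum shard count like this:

```python
import math

# Toy model only -- not TFDS's auto-detection. It just shows how a headroom
# factor on the ~1 GiB record cap maps to a minimum shard count for this job.
MAX_RECORD_BYTES = 2**30 - 1
TOTAL_BYTES = 1.74e12          # estimated input size, TB taken as 10**12 bytes

def min_shards(headroom: float) -> int:
    """Smallest shard count keeping an evenly split group under headroom * cap."""
    return math.ceil(TOTAL_BYTES / (headroom * MAX_RECORD_BYTES))

for headroom in (0.9, 0.8):
    print(f"headroom {headroom}: at least {min_shards(headroom)} shards")
# headroom 0.9: at least 1801 shards
# headroom 0.8: at least 2026 shards
```

Under the raw 1.74 TB estimate both factors stay below the auto-detected 2048, yet the rejected record was ~29% above the even-split estimate, which seems to point at exactly the kind of per-key/serialization overhead asked about above.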
Side remark
Surprisingly, the Dataflow limits mention:
Maximum size for a single element (except where stricter conditions apply, for example Streaming Engine): 2 GB
which doesn't seem to hold in practice, since the GroupBy fails at ~1 GB per the logged error.