Description
For TFDS 4.9.7 on Dataflow 2.60.0, I have a company-internal Dataflow job that fails. The input collection to train_write/GroupShards reports:
Elements added: 332,090
Estimated size: 1.74 TB
while the output collection reports:
Elements added: 2
Estimated size: 1.8 GB
It then fails on the next element with
"E0123 207 recordwriter.cc:401] Record exceeds maximum record size (1096571470 > 1073741823)."
Workaround
By installing the TFDS prerelease after 3700745 and setting --num_shards=4096
(auto-detection chose 2048), the DatasetBuilder runs to completion on Dataflow. I'm curious, however, why the auto-detection didn't choose more file shards, as all training examples should be roughly the same size in this DatasetBuilder.
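A back-of-the-envelope check with the figures above (a rough sketch only: it treats TB as 10^12 bytes, takes the 1.74 TB estimate at face value, and assumes examples split evenly across shards, which roughly uniform example sizes should make reasonable):

```python
# Rough per-group sizing for train_write/GroupShards, assuming the 1.74 TB
# estimate (taken as 10**12-byte TB) is accurate and examples split evenly.
TOTAL_BYTES = 1.74e12
REJECTED_RECORD_BYTES = 1_096_571_470
MAX_RECORD_BYTES = 2**30 - 1            # cap from the logged error

for num_shards in (2048, 4096):
    per_group = TOTAL_BYTES / num_shards
    print(f"{num_shards} shards -> ~{per_group / 2**30:.2f} GiB per grouped shard "
          f"({per_group / MAX_RECORD_BYTES:.0%} of the record cap)")
# 2048 shards -> ~0.79 GiB per grouped shard (79% of the record cap)
# 4096 shards -> ~0.40 GiB per grouped shard (40% of the record cap)

# The rejected record is ~29% larger than the 2048-shard even-split estimate,
# so the serialized groups carry noticeably more overhead than the raw estimate:
print(f"observed / estimated: {REJECTED_RECORD_BYTES / (TOTAL_BYTES / 2048):.2f}x")
# observed / estimated: 1.29x
```

With 2048 shards a grouped shard already sits near the cap once that overhead is included, which matches the failure; 4096 leaves plenty of room.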
Suggested fix
Maybe this is too little headroom for the training examples. The FeatureDict in this particular DatasetBuilder is large, and perhaps the key overhead is unusually large. Should that number be 0.8 instead? Or should the headroom be larger when the FeatureDict contains many keys?
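For illustration only (a toy model, not the actual TFDS shard auto-detection code), a headroom factor on the ~1 GiB record cap translates into a minimum shard count like this:

```python
import math

# Toy model only -- not TFDS's auto-detection. It just shows how a headroom
# factor on the ~1 GiB record cap maps to a minimum shard count for this job.
MAX_RECORD_BYTES = 2**30 - 1
TOTAL_BYTES = 1.74e12          # estimated input size, TB taken as 10**12 bytes

def min_shards(headroom: float) -> int:
    """Smallest shard count keeping an evenly split group under headroom * cap."""
    return math.ceil(TOTAL_BYTES / (headroom * MAX_RECORD_BYTES))

for headroom in (0.9, 0.8):
    print(f"headroom {headroom}: at least {min_shards(headroom)} shards")
# headroom 0.9: at least 1801 shards
# headroom 0.8: at least 2026 shards
```

Under the raw 1.74 TB estimate both factors stay below the auto-detected 2048, yet the rejected record was ~29% above the even-split estimate, which seems to point at exactly the kind of per-key/serialization overhead asked about above.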
Side remark
Surprisingly, the Dataflow limits mention:
Maximum size for a single element (except where stricter conditions apply, for example Streaming Engine): 2 GB
which doesn't seem to hold in practice, since the GroupBy fails at ~1 GB per the logged error.