Describe the bug
The cw-ci-test integration test fails with a tensorstore Malformed StorageGeneration error when reading tokenized zarr3 data from R2 (S3-compatible) storage. The failure is in the quickstart-tests/tokenized_2fdf2156 step of the marin-itest pipeline.
This failure is unrelated to the PR under test (#4370, K8s stream_logs OOM fix); it appears to be a flaky interaction between tensorstore and R2.
To Reproduce
- Push any PR that triggers the cw-ci-test workflow.
- The marin-itest pipeline submits a quickstart tokenization job.
- A subsequent step tries to open the tokenized zarr3 array and fails with Malformed StorageGeneration.
Expected behavior
The tokenized zarr3 array should be readable after the tokenization step completes.
Additional context
Error from CI run:
ValueError: Error opening "zarr3" driver: Error reading
"temp/ci/marin-itest-22712054/quickstart-tests/tokenized-20383a/train/part-00000-of-00001/input_ids/offsets/zarr.json":
Malformed StorageGeneration
The tensorstore spec uses "recheck_cached_data": false, which tells tensorstore not to revalidate cached data against the store, so a stale cache entry may be served in place of freshly written data. The kvstore driver is s3 pointed at the R2 endpoint (74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com, bucket marin-na).
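For reference, a minimal sketch of what a spec like the one described above might look like, written as a plain Python dict. This is not the spec from the failing job: `build_spec` and `open_array` are hypothetical helper names, credentials and region settings are omitted, and the `https://` scheme on the endpoint is an assumption. The failing read is of `zarr.json`, which is metadata, and tensorstore has a separate `recheck_cached_metadata` option, so the sketch exposes both settings to make it easy to try a less aggressive value such as `"open"`.

```python
from typing import Union

# Values taken from the report above.
R2_ENDPOINT = "https://74981a43be0de7712369306c7b19133d.r2.cloudflarestorage.com"
BUCKET = "marin-na"


def build_spec(path: str, recheck: Union[bool, str] = "open") -> dict:
    """Build a zarr3-over-s3 spec dict (hypothetical helper, not from the repo).

    `recheck` is applied to both cached data and cached metadata. The report
    says the failing job used False for recheck_cached_data.
    """
    return {
        "driver": "zarr3",
        "kvstore": {
            "driver": "s3",
            "bucket": BUCKET,
            "endpoint": R2_ENDPOINT,
            "path": path,
        },
        "recheck_cached_data": recheck,
        "recheck_cached_metadata": recheck,
    }


def open_array(spec: dict):
    """Open an existing array read-only; requires tensorstore and R2 credentials."""
    import tensorstore as ts  # imported lazily so build_spec works without it

    return ts.open(spec, open=True, read=True).result()
```

Re-running the failing open with `build_spec(path, recheck="open")` versus `build_spec(path, recheck=False)` would be one way to check whether the cache settings are involved at all.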
Possible causes:
- R2 returning an unexpected ETag format that tensorstore cannot parse as a StorageGeneration.
- A race between the writer finishing and the reader opening the array when caching is aggressive (recheck_cached_data: false).
- Transient R2 API inconsistency during the CI window.
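If the cause is transient (the second or third possibility above), retrying the open should eventually succeed. The sketch below is a hypothetical diagnostic helper, not code from the pipeline: `open_with_retry` is an invented name, and it assumes the error surfaces as a `ValueError` whose message contains "Malformed StorageGeneration", as in the CI log. If the error persists across all attempts, the ETag-format explanation becomes more likely.

```python
import time


def open_with_retry(open_fn, attempts=3, delay_s=2.0, sleep=time.sleep):
    """Call open_fn(), retrying only on 'Malformed StorageGeneration' errors.

    open_fn is any zero-argument callable that opens the array.
    Other ValueErrors are re-raised immediately.
    """
    last_err = None
    for attempt in range(1, attempts + 1):
        try:
            return open_fn()
        except ValueError as err:
            if "Malformed StorageGeneration" not in str(err):
                raise
            last_err = err
            if attempt < attempts:
                # Linear backoff: delay_s, 2 * delay_s, ...
                sleep(delay_s * attempt)
    raise last_err
```

This is meant to help narrow down the cause; it would mask the underlying problem if left in place as a permanent fix.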