Use shared `ts.Transaction` for metadata consolidation #4105
Conversation
cc: @dlwh, it's not a large PR but quite a few changes. FYI I'm looking more into the details and testing this patch on my tokenize-nemotron job.
Replace per-shard `asyncio.run` + per-shard transactions in `consolidate_shard_caches` with a single shared `ts.Transaction` that coalesces all metadata writes. Delete the now-unused per-shard `_extend_cache_metadata_with_other`.

- O(num_shards) read-modify-write cycles → O(num_write_chunks)
- Use `info["ledger"].total_num_rows` instead of redundant `async_len()` reads
- Fix unsliced shapes write: `shapes[:source_num_rows]` vs bare `shapes`

Closes #4100

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
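The O(num_shards) → O(num_write_chunks) claim can be illustrated with a toy model. This is a pure-Python sketch: `ToyChunkedStore` is a hypothetical stand-in for the zarr3 metadata array, not the real tensorstore API. Each commit pays one read-modify-write cycle per chunk it touches, so N per-shard commits over the same chunks cost O(N) cycles, while one shared commit costs O(chunks touched).

```python
# Toy chunked store (hypothetical stand-in for the zarr3 metadata array).
CHUNK = 4  # rows per write chunk

class ToyChunkedStore:
    def __init__(self, num_rows):
        self.data = [0] * num_rows
        self.rmw_cycles = 0  # read-modify-write cycles performed

    def commit(self, writes):
        # One RMW cycle per distinct chunk touched by this commit.
        touched = {i // CHUNK for i, _ in writes}
        self.rmw_cycles += len(touched)
        for i, v in writes:
            self.data[i] = v

def consolidate_per_shard(store, shard_rows):
    # Old pattern: one transaction per shard over the same chunks.
    offset = 0
    for rows in shard_rows:
        store.commit([(offset + r, 1) for r in range(rows)])
        offset += rows

def consolidate_shared(store, shard_rows):
    # New pattern: accumulate all writes, commit once.
    writes, offset = [], 0
    for rows in shard_rows:
        writes += [(offset + r, 1) for r in range(rows)]
        offset += rows
    store.commit(writes)

shards = [2] * 8  # 8 shards of 2 rows each -> 16 rows, 4 chunks
a = ToyChunkedStore(16); consolidate_per_shard(a, shards)
b = ToyChunkedStore(16); consolidate_shared(b, shards)
print(a.rmw_cycles, b.rmw_cycles)  # 8 4
```

With 2755 real shards the gap is correspondingly larger: thousands of per-shard commits collapse into a handful of chunk writes.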
Force-pushed from b13c63b to 3da360e
**Empirical validation: 2755-shard tokenization job on R2**

Ran a full Nemotron CC tokenization job. Stats written to … Before/after: metadata consolidation …
**Measurement caveat**

We did not have precise timing instrumentation deployed on the cluster (the …). Both consolidation passes (tokenization phase and cache-copy phase) completed within their respective timestamp gaps, consistent with the single-transaction approach finishing in seconds rather than hours.

**Supporting log evidence**

Tokenization phase completion:
Cache-copy phase completion:
Previous run (before #4105):
**Other observations**
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3da360eb3e
```python
row = merged[info["row_offset"]]
assert row["input_ids"][0] == i, f"shard {i} data mismatch"
```

```python
return merged
```
**Keep merged store reads inside `TemporaryDirectory` scope**
`_build_and_consolidate` returns a live `TreeStore` handle from inside a `with tempfile.TemporaryDirectory(...)` block, so the backing directory is deleted before callers use it. In `test_consolidate_metadata_shaped`, `merged[0]` is read after return; this can fail or become flaky when TensorStore needs to fetch uncached data from disk. This makes the new test unreliable across environments and cache behavior.
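One way to address this lifetime issue is to turn the helper into a context manager so all reads happen while the temporary directory still exists. This is a sketch of the general pattern, not the repo's actual fix; `build_and_consolidate` and its file-based "store" are simplified stand-ins for the real test helper and `TreeStore`.

```python
# Sketch: keep the temp directory alive for the caller's reads by
# yielding from inside the `with` block instead of returning a handle.
import contextlib
import os
import tempfile

@contextlib.contextmanager
def build_and_consolidate():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "merged")
        with open(path, "w") as f:   # stand-in for building the store
            f.write("rows")
        yield path                    # caller reads while tmpdir exists

# Usage: all reads happen inside the context, so the backing files
# are guaranteed to still be on disk.
with build_and_consolidate() as merged_path:
    assert os.path.exists(merged_path)
leaked = merged_path  # after the block, the directory is gone
assert not os.path.exists(leaked)
```

The same shape works for the test: move the assertions on `merged` inside the context rather than returning a live handle out of it.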
## Summary

- Parallelizes `TreeStore.open` + `data_size` reads across shards using `ThreadPoolExecutor(max_workers=32)` in `consolidate_shard_caches`
- These reads are independent and dominated by remote-storage round-trip latency; parallelizing them removes the main bottleneck in the pre-copy phase
- Cumulative offset computation remains serial (order-dependent)

Stacked on #4105. Refs #4100.

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
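The fan-out/serial-reduce shape described above can be sketched with the standard library. Here `load_shard_size` is a hypothetical stand-in for the `TreeStore.open` + `data_size` read; the key points are that `executor.map` preserves input order and that the offset accumulation stays serial because each offset depends on all prior sizes.

```python
# Parallel, latency-bound reads via a thread pool; serial offset pass.
from concurrent.futures import ThreadPoolExecutor
import time

def load_shard_size(shard_id):
    time.sleep(0.01)          # simulate a remote-storage round trip
    return 10 * (shard_id + 1)

shard_ids = list(range(8))
with ThreadPoolExecutor(max_workers=32) as pool:
    # map() yields results in input order, so sizes line up with shards
    sizes = list(pool.map(load_shard_size, shard_ids))

offsets, total = [], 0
for size in sizes:            # serial: offset i depends on sizes[:i]
    offsets.append(total)
    total += size

print(offsets)  # [0, 10, 30, 60, 100, 150, 210, 280]
```

Threads (rather than processes) are the right fit here because the per-shard work is I/O-bound, so the GIL is released while each call waits on storage.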
## Summary

- Replace per-shard `asyncio.run` + per-shard `ts.Transaction` in `consolidate_shard_caches` with a single shared `ts.Transaction` wrapping all metadata writes.
- Delete dead `_extend_cache_metadata_with_other` and `_virtual_offset`.
- Use `info["ledger"].total_num_rows` (already in memory) instead of redundant `source.async_len()` R2 reads.
- Eager numpy reads for source offsets/shapes instead of lazy `ts.virtual_chunked` views (avoids weak-reference lifetime issues across loop iterations).
- Fix unsliced shapes write: `source_array.shapes[:source_num_rows]` instead of bare `source_array.shapes`.
- Preserve rate-limit retry with exponential backoff on the shared transaction.

**Why**: each of N shards committed its own transaction on the same zarr3 write chunks → O(N) read-modify-write cycles (tensorstore#202). With 2755 shards: 4-11 hours on R2. A single transaction coalesces to ~6 chunk writes.

## Test plan

- [x] `test_consolidate_metadata_flat` — round-trip with single rank-1 field (no shapes metadata)
- [x] `test_consolidate_metadata_shaped` — round-trip with multi-field exemplar including rank-2 leaf (exercises shapes metadata path)
- [x] Existing `test_tree_store.py` (17 tests) and `test_new_cache.py` (10 tests) pass
- [ ] Run on a real multi-shard tokenization job on R2 to validate wall-time improvement

Closes #4100

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
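The "rate-limit retry with exponential backoff" bullet can be sketched as follows. This is a hedged illustration: `RateLimitError` and the `commit` callable are hypothetical stand-ins for the storage layer's 429-style error and the real commit of the shared `ts.Transaction`.

```python
# Retry a commit callable with exponential backoff on rate-limit errors.
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for a storage-layer 429 / rate-limit error."""

def commit_with_backoff(commit, max_attempts=5, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return commit()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...

# Usage: a commit that is rate-limited twice before succeeding.
calls = {"n": 0}
def flaky_commit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "committed"

print(commit_with_backoff(flaky_commit))  # committed
```

Because there is now only one transaction, the backoff wraps a single commit instead of N per-shard commits, which also reduces how often the rate limiter is hit in the first place.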