[zephyr] Eliminate coordinator-side scatter manifest consolidation #4853
Open
ravwojdyla-agent wants to merge 1 commit into main from
Conversation
The coordinator no longer reads per-mapper .scatter_meta sidecars or writes a consolidated scatter_metadata manifest. Instead, the list of scatter data-file paths is passed directly to every reducer. Each reducer reads the sidecars itself via a 32-thread pool and filters for its own target shard.

This eliminates a scaling bottleneck: with thousands of mappers the coordinator spent significant time and memory reading all sidecars, serializing a multi-MB JSON manifest, and writing it to GCS, all while reducers waited. The coordinator was also a single point of failure during this phase (an OOM or preemption killed the manifest write and lost all scatter progress). Now the coordinator's only scatter-stage responsibility is collecting the list of data-file paths (already in memory from task results) and fanning them out. Sidecar I/O is distributed across all reducer workers in parallel.

Changes:

- execution.py: _regroup_result_refs passes scatter paths directly via MemChunk instead of writing a manifest
- shuffle.py: ScatterReader.from_sidecars() replaces from_manifest(); _read_sidecars_parallel() does the 32-thread fan-out; _write_scatter_manifest, _read_scatter_manifest, and _build_scatter_shard_from_manifest removed
- plan.py: reduce stage calls ScatterShard.from_sidecars()
- test_shuffle.py: tests use direct sidecar paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
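A minimal sketch of the reducer-side fan-out described above. Names like `read_sidecar`, `read_sidecars_parallel`, and the in-memory `FAKE_SIDECARS` table are illustrative stand-ins, not the PR's actual code; in the real pipeline the read would hit GCS and the metadata schema would come from shuffle.py.

```python
from concurrent.futures import ThreadPoolExecutor

# Fake in-memory sidecars stand in for .scatter_meta objects on GCS.
# Each mapper's sidecar records which target shard its data addresses.
FAKE_SIDECARS = {
    f"gs://bucket/map-{i}.data.scatter_meta": {"shard": i % 3, "rows": 10 * i}
    for i in range(6)
}

def read_sidecar(path):
    # Stand-in for a GCS fetch + deserialize of one sidecar.
    return FAKE_SIDECARS[path]

def read_sidecars_parallel(data_paths, target_shard, num_threads=32):
    """Fan sidecar reads out across a thread pool, then keep only the
    entries addressed to this reducer's own target shard."""
    sidecar_paths = [p + ".scatter_meta" for p in data_paths]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        metas = list(pool.map(read_sidecar, sidecar_paths))
    return [m for m in metas if m["shard"] == target_shard]

paths = [f"gs://bucket/map-{i}.data" for i in range(6)]
# Shard 0 appears for i in {0, 3}, so two sidecars survive the filter.
print(len(read_sidecars_parallel(paths, target_shard=0)))
```

The key property is that each reducer only needs the list of data-file paths (cheap to ship from the coordinator); the sidecar I/O itself runs on the reducer's own 32 threads.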
rjpower approved these changes on Apr 17, 2026
- # Per-worker caches for sidecar + manifest reads.
+ # Per-worker cache for sidecar reads.
Collaborator
Why do we need these caches? Doesn't this belong in the ScatterReader?
Collaborator
If we need a cache let's put it in the worker itself, e.g. maybe a scatter reader can be reused.
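A sketch of what the reviewer's suggestion could look like: cache at the reader level so a ScatterReader is reused across reduce tasks on the same worker, rather than caching raw sidecar reads in module scope. All names here are hypothetical; the real ScatterReader lives in shuffle.py and carries more state than shown.

```python
import functools

class ScatterReader:
    """Toy stand-in for the real reader; holds the sidecar path set."""
    def __init__(self, sidecar_paths):
        self.sidecar_paths = tuple(sidecar_paths)

@functools.lru_cache(maxsize=8)
def reader_for(sidecar_paths: tuple) -> ScatterReader:
    # One reader per distinct path set, constructed once and then
    # reused by every reduce task scheduled onto this worker process.
    return ScatterReader(sidecar_paths)

r1 = reader_for(("a.scatter_meta", "b.scatter_meta"))
r2 = reader_for(("a.scatter_meta", "b.scatter_meta"))
# Same object: the second call hits the cache instead of re-reading.
print(r1 is r2)
```

Keying the cache on the (hashable) path tuple means the cache invalidates itself naturally when a new shuffle produces a different path set.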
Contributor
For posterity: this change is a decision based on @rjpower's benchmark from #4782 (review). I'm re-running normalization of a nemotron split to confirm this change is not causing regressions at scale. Will update.
Summary

- Reducers read .scatter_meta sidecars directly via a 32-thread pool

Motivation
During nemotron-v1 normalization (1.3B records, 6307 output shards), the coordinator:

- read every per-mapper sidecar
- serialized a multi-MB JSON manifest
- wrote it to GCS, all while reducers waited
With the coordinator on preemptible VMs, this phase was the #1 cause of pipeline restarts: each preemption lost all scatter progress and forced a full redo.
Changes

- execution.py: _regroup_result_refs passes scatter paths directly via MemChunk instead of writing a manifest
- shuffle.py: ScatterReader.from_sidecars() replaces from_manifest(); _read_sidecars_parallel() does the 32-thread fan-out; removed _write_scatter_manifest, _read_scatter_manifest, _build_scatter_shard_from_manifest
- plan.py: reduce stage calls ScatterShard.from_sidecars()
- test_shuffle.py: tests use direct sidecar paths

Test plan
🤖 Generated with Claude Code