-
Notifications
You must be signed in to change notification settings - Fork 49
Open
Description
Distribute Gather Across Ranks & Make gather_result Consumers Partial-Aware
Summary
Implement distributed gather across all ranks and update every consumer of gather_result to correctly handle partial results as well as a merged aggregate. Maintain current output schema and downstream semantics.
Objectives
- Distribute peer fetching and validation evenly across ranks using deterministic partitioning without duplicates.
- Produce per-rank partial results and a canonical merged aggregate for global metrics and artifact emission.
- Enable outer update to process partial results incrementally under a bounded memory budget.
- Ensure all consumer paths operate correctly with either partials or an aggregate.
Scope
- Gather execution: partitioning, per-rank fetch/validate, partial outputs, merge/reduce, artifact emission on a single rank.
- Consumers: outer update, index-overlap checks, per-param norms, quality metrics (intended vs actual, success rate, skipped), logging.
- Dedupe policy for repeated UIDs across partitions or retries.
- Synchronization and small control flags for readiness and skip decisions.
Deliverables
-
Partitioning
- Deterministic mapping from peer list to ranks.
- Guardrails preventing duplicate downloads when using reserves or retries.
-
Per-Rank Partial Gather
- Fetch and validate assigned peers.
- Emit partial
gather_resultwithuids,skipped_uids,success_rate_part, and per-param payloads. - Compute lightweight per-rank index-overlap signatures.
-
Merge & Global Metrics
- All-gather of partials and merge on one rank.
- Compute global
uids,skipped_uids, success rate, intended vs actual, and overlap candidates. - Emit/upload the canonical aggregate artifact from a single rank.
-
Partial-Aware Consumers
- Outer update accepts a sequence of partial results and applies them incrementally in a deterministic order with a configurable memory budget.
- Index-overlap detection reduces per-rank signatures to global findings.
- Norms and quality metrics reduce across partials and match single-rank semantics.
-
Synchronization & Control
- Barriers at gather completion and pre-update.
- Minimal readiness/skip flags broadcast to all ranks.
Risks
- Duplicate work without strict partitioning and reserve handling.
- Metric drift if reductions do not union sets consistently.
- Double application if the same UID appears in multiple partials.
- Memory pressure during concurrent decompress/validate without chunking.
- Non-deterministic application order causing minor numerical divergence.
Acceptance Criteria
-
Gather+validate wall-clock time decreases with the number of ranks until network limits dominate.
-
Peak memory during outer update remains within the configured budget and does not scale with peer count.
-
For the same peer set, distributed flow produces outputs equivalent to the single-rank baseline for:
uids,skipped_uids, success rate, intended vs actual- Per-param norms and index-overlap results
- Applied model update (within expected floating-point tolerance)
-
No duplicate downloads or double applications; reserve backfills behave correctly.
Metadata
Metadata
Assignees
Labels
No labels