Distribute Gather Across All Ranks #598

@joellidin

Distribute Gather Across Ranks & Make gather_result Consumers Partial-Aware

Summary

Implement distributed gather across all ranks and update every consumer of gather_result to handle both per-rank partial results and a merged aggregate correctly. Preserve the current output schema and downstream semantics.

Objectives

  • Distribute peer fetching and validation evenly across ranks using deterministic partitioning without duplicates.
  • Produce per-rank partial results and a canonical merged aggregate for global metrics and artifact emission.
  • Enable outer update to process partial results incrementally under a bounded memory budget.
  • Ensure all consumer paths operate correctly with either partials or an aggregate.
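
To make the first objective concrete, here is a minimal sketch of one possible deterministic partitioning scheme: round-robin over the sorted, de-duplicated UID list. The function name `partition_peers` and the choice of round-robin are assumptions for illustration, not something this issue prescribes.

```python
from typing import Dict, List, Sequence


def partition_peers(uids: Sequence[int], world_size: int) -> Dict[int, List[int]]:
    """Assign every unique UID to exactly one rank.

    Hypothetical helper: sorting first makes the mapping independent of the
    input order, so every rank computes the same assignment locally.
    """
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    unique_sorted = sorted(set(uids))  # drop repeated UIDs before splitting
    # Rank r takes elements r, r + world_size, r + 2 * world_size, ...
    return {rank: unique_sorted[rank::world_size] for rank in range(world_size)}


assignment = partition_peers([5, 3, 9, 3, 1, 7], world_size=3)
```

Because the mapping depends only on the UID set and the rank count, no coordination is needed to agree on it, and rank sizes differ by at most one peer.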

Scope

  • Gather execution: partitioning, per-rank fetch/validate, partial outputs, merge/reduce, artifact emission on a single rank.
  • Consumers: outer update, index-overlap checks, per-param norms, quality metrics (intended vs actual, success rate, skipped), logging.
  • Dedupe policy for repeated UIDs across partitions or retries.
  • Synchronization and small control flags for readiness and skip decisions.

Deliverables

  1. Partitioning

    • Deterministic mapping from peer list to ranks.
    • Guardrails preventing duplicate downloads when using reserves or retries.
  2. Per-Rank Partial Gather

    • Fetch and validate assigned peers.
    • Emit partial gather_result with uids, skipped_uids, success_rate_part, and per-param payloads.
    • Compute lightweight per-rank index-overlap signatures.
  3. Merge & Global Metrics

    • All-gather of partials and merge on one rank.
    • Compute global uids, skipped_uids, success rate, intended vs actual, and overlap candidates.
    • Emit/upload the canonical aggregate artifact from a single rank.
  4. Partial-Aware Consumers

    • Outer update accepts a sequence of partial results and applies them incrementally in a deterministic order with a configurable memory budget.
    • Index-overlap detection reduces per-rank signatures to global findings.
    • Norms and quality metrics reduce across partials and match single-rank semantics.
  5. Synchronization & Control

    • Barriers at gather completion and pre-update.
    • Minimal readiness/skip flags broadcast to all ranks.
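
The merge step in deliverable 3 could look roughly like the sketch below. `PartialResult`, `merge_partials`, and the dictionary keys are hypothetical stand-ins for the real gather_result schema, and the dedupe policy shown (lowest rank wins; a UID that succeeded anywhere is not counted as skipped) is an assumption that the actual design would need to confirm.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class PartialResult:
    """Hypothetical per-rank partial; the real schema may differ."""
    rank: int
    uids: List[int]
    skipped_uids: List[int]
    payloads: Dict[int, Any] = field(default_factory=dict)


def merge_partials(partials: Sequence[PartialResult], intended: Sequence[int]) -> dict:
    """Reduce per-rank partials into one aggregate with global metrics."""
    first_payload: Dict[int, Any] = {}
    skipped = set()
    # Visit partials in rank order so the dedupe outcome is deterministic.
    for part in sorted(partials, key=lambda p: p.rank):
        for uid in part.uids:
            # Assumed policy: keep the first payload seen for a UID.
            first_payload.setdefault(uid, part.payloads.get(uid))
        skipped.update(part.skipped_uids)
    # Assumed policy: a retry that succeeded on any rank clears the skip.
    skipped.difference_update(first_payload)
    uids = sorted(first_payload)
    intended_set = set(intended)
    return {
        "uids": uids,
        "skipped_uids": sorted(skipped),
        "intended": len(intended_set),
        "actual": len(uids),
        "success_rate": len(uids) / len(intended_set) if intended_set else 0.0,
        "payloads": {uid: first_payload[uid] for uid in uids},
    }
```

Taking set unions before computing counts is what keeps the global success rate consistent with the single-rank result when a UID shows up in more than one partial.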

Risks

  • Duplicate work without strict partitioning and reserve handling.
  • Metric drift if reductions do not union sets consistently.
  • Double application if the same UID appears in multiple partials.
  • Memory pressure during concurrent decompress/validate without chunking.
  • Non-deterministic application order causing minor numerical divergence.
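
The double-application, memory-pressure, and ordering risks can be addressed together in the outer update. The sketch below is illustrative only: `apply_partials`, the `load` callback, the plain-list parameters, and the "at most `max_in_flight` payloads held at once" budget are all assumptions standing in for the real tensors and memory accounting.

```python
from typing import Callable, List, Sequence


def apply_partials(
    params: List[float],
    partial_uids: Sequence[Sequence[int]],
    load: Callable[[int, int], List[float]],
    lr: float,
    max_in_flight: int,
) -> List[int]:
    """Apply each UID's update once, in ascending UID order, in bounded batches.

    load(partial_index, uid) is a hypothetical callback that materializes one
    payload on demand, so only small (uid, index) pairs are held up front.
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be >= 1")
    # Sorting by (uid, partial index) fixes the order on every run; for a
    # repeated UID the lowest partial index comes first and wins.
    order = sorted((uid, idx) for idx, uids in enumerate(partial_uids) for uid in uids)
    applied = set()
    batch: List[List[float]] = []

    def flush() -> None:
        for delta in batch:
            for i, d in enumerate(delta):
                params[i] -= lr * d
        batch.clear()  # release payloads before loading more

    for uid, idx in order:
        if uid in applied:
            continue  # same UID in another partial: never load or apply it again
        applied.add(uid)
        batch.append(load(idx, uid))
        if len(batch) >= max_in_flight:
            flush()
    flush()
    return sorted(applied)
```

With this shape, peak payload memory is bounded by `max_in_flight` rather than by the number of peers, and a fixed application order removes one source of run-to-run numerical divergence.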

Acceptance Criteria

  • Gather+validate wall-clock time decreases as the number of ranks increases, until network limits dominate.

  • Peak memory during outer update remains within the configured budget and does not scale with peer count.

  • For the same peer set, distributed flow produces outputs equivalent to the single-rank baseline for:

    • uids, skipped_uids, success rate, intended vs actual
    • Per-param norms and index-overlap results
    • Applied model update (within expected floating-point tolerance)
  • No duplicate downloads or double applications; reserve backfills behave correctly.
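
One way to check the equivalence criterion in a test is sketched below. The dictionary keys, the `update` field, and the tolerance values are assumptions chosen for illustration; the issue only requires "expected floating-point tolerance" without fixing numbers.

```python
import math
from typing import Any, Dict


def results_equivalent(
    baseline: Dict[str, Any],
    distributed: Dict[str, Any],
    rel_tol: float = 1e-6,  # assumed tolerance, not specified by the issue
    abs_tol: float = 1e-8,  # assumed tolerance, not specified by the issue
) -> bool:
    """Compare a single-rank baseline against the distributed aggregate."""
    # Set-like fields must match exactly, regardless of ordering.
    for key in ("uids", "skipped_uids"):
        if sorted(baseline[key]) != sorted(distributed[key]):
            return False
    if not math.isclose(
        baseline["success_rate"], distributed["success_rate"],
        rel_tol=rel_tol, abs_tol=abs_tol,
    ):
        return False
    # The applied update only needs to agree within floating-point tolerance.
    a, b = baseline["update"], distributed["update"]
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )
```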
