Conversation

@minsii
Contributor

@minsii minsii commented Jan 31, 2026

Summary:
Implements native AVG support for the PAT (Parallel Aggregated Trees) algorithm in ReduceScatter.

Problem

Baseline NCCL does not support the AVG op with the PAT algorithm.

  • If NCCL_ALGO=reducescatter:pat is NOT SET: NCCL falls back to the Ring algorithm
  • If NCCL_ALGO=reducescatter:pat is SET: the call fails with ncclInvalidUsage - Error : no algorithm/protocol available for function

Solution

This diff enables native AVG with the PAT algorithm for ReduceScatter. It accumulates a plain sum and divides by the integer nRanks in the last step, when writing the final result into recvbuf.
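The sum-then-divide idea can be illustrated with a small host-side model. This is only a sketch under stated assumptions: it is plain C++ rather than the CUDA kernel code in this diff, it ignores the PAT communication schedule entirely, and the function name `reduceScatterAvgPostDiv` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only (hypothetical helper, not part of NCCL).
// Each inner vector is one rank's send buffer holding nRanks * count elements.
// Returns what `rank` would see in its recvbuf for an AVG ReduceScatter.
template <typename T>
std::vector<T> reduceScatterAvgPostDiv(const std::vector<std::vector<T>>& sendbufs,
                                       std::size_t rank, std::size_t count) {
  const std::size_t nRanks = sendbufs.size();
  std::vector<T> recvbuf(count, T(0));

  // Reduction: accumulate a plain sum of this rank's slice across all ranks.
  for (std::size_t r = 0; r < nRanks; ++r) {
    for (std::size_t i = 0; i < count; ++i) {
      recvbuf[i] += sendbufs[r][rank * count + i];
    }
  }

  // Final write: divide once by the integer rank count.
  for (std::size_t i = 0; i < count; ++i) {
    recvbuf[i] /= static_cast<T>(nRanks);
  }
  return recvbuf;
}
```

With two ranks sending `{1, 2, 3, 4}` and `{3, 4, 5, 6}` and `count = 2`, rank 0 would receive `{2, 3}` and rank 1 would receive `{4, 5}`. Dividing only once, at the end, keeps every intermediate step a plain sum.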

Documentation Added:

  • meta/collectives/docs/ReduceScatterPat.md - Comprehensive PAT algorithm
    documentation including 5-phase breakdown and 8-rank visualization
  • meta/collectives/docs/ReduceScatterPatAvg.md - PAT AVG design details,
    multi-chunk handling, and implementation notes

Key Implementation:

  • Add isFinalWrite flag to ncclPatStep struct (set in Phase 4) to correctly
    apply division for all chunks in multi-chunk transfers (fixes large message bug)
  • Add FuncPatAvg template that uses FuncSum for reduction and applies
    division as a postOp in final write step
  • Add ncclDevPatAvg enum for kernel dispatch
  • Update generate.py and def_build.bzl for PatAvg kernel generation
  • Enable via NCCL_ALGO=reducescatter:pat_postdiv
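As a rough illustration of how a per-step flag and a post-op could fit together, here is a host-side sketch. Everything in it is an assumption made for illustration: `PatStepModel`, `PatAvgModel`, and `applyStep` are hypothetical stand-ins, not the actual `ncclPatStep` or `FuncPatAvg` definitions from this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a PAT step descriptor; only the flag matters here.
struct PatStepModel {
  std::size_t offset;  // first element this step touches
  std::size_t count;   // number of elements in this step (one chunk)
  bool isFinalWrite;   // true when this step writes the final result
};

// Hypothetical stand-in for an AVG functor: reduce with sum, divide as a post-op.
template <typename T>
struct PatAvgModel {
  int nRanks;
  T reduce(T a, T b) const { return a + b; }
  T postOp(T x) const { return x / static_cast<T>(nRanks); }
};

// Combines the partial sum `acc` with `incoming` for one chunk.
// The division is applied only when the step is flagged as a final write.
template <typename T>
void applyStep(const PatAvgModel<T>& fn, const PatStepModel& step,
               const std::vector<T>& acc, const std::vector<T>& incoming,
               std::vector<T>& out) {
  for (std::size_t i = step.offset; i < step.offset + step.count; ++i) {
    const T v = fn.reduce(acc[i], incoming[i]);
    out[i] = step.isFinalWrite ? fn.postOp(v) : v;
  }
}
```

Because the flag travels with each step rather than being inferred from a chunk index, a message split into several chunks gets the division on every chunk whose step is marked as a final write, while intermediate steps still forward plain sums.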

Meta overlay pattern used to minimize upstream changes:

  • meta/device/FuncPatAvg.cuh: Full implementation (~120 lines)
  • meta/collectives/PatAvgAlgoHelper.h: Helper functions with lazy env detection
  • All src/ changes (~15 lines) are marked with [META:PAT_AVG] comments so they
    are easy to track across rebases
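One common way to implement lazy environment detection is a function-local static that reads the variable on first use. The sketch below is an assumption about the general pattern, not the contents of PatAvgAlgoHelper.h: both function names are hypothetical, and the substring match is a simplification of how NCCL_ALGO is actually parsed.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true when the NCCL_ALGO value asks for the
// post-division PAT variant. A substring match is a simplification.
inline bool requestsPatPostDiv(const char* algoEnv) {
  return algoEnv != nullptr &&
         std::strstr(algoEnv, "reducescatter:pat_postdiv") != nullptr;
}

// Hypothetical helper: reads NCCL_ALGO once, on the first call, and caches
// the answer. Function-local statics are initialized thread-safely in C++11+.
inline bool patAvgEnabled() {
  static const bool enabled = requestsPatPostDiv(std::getenv("NCCL_ALGO"));
  return enabled;
}
```

Deferring the lookup to the first call avoids reading the environment during static initialization, and callers on a hot path pay only for a cached boolean afterwards.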

Differential Revision: D91948601

@meta-cla meta-cla bot added the CLA Signed label Jan 31, 2026
@meta-codesync

meta-codesync bot commented Jan 31, 2026

@minsii has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91948601.

Labels

CLA Signed, fb-exported, meta-exported
