Contributor

@minsii minsii commented Feb 1, 2026

Summary:
Move CtranKernelAllGatherArgs from gpe/CtranGpeDev.h to algos/AllGather/Types.h
as ctran::allgather::KernelArgs. Part of KernelElem cleanup Phase 1.

Naming follows the convention:

  • Remove "Ctran" prefix
  • Keep "Kernel" prefix
  • Omit algorithm name since namespace provides context
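
As a rough illustration of the naming convention, the relocated type might look like the sketch below. The fields are placeholders (the summary does not list the real members), and `allGatherElemCount` is a hypothetical caller added only to show how the namespace carries the algorithm name.

```cpp
#include <cstddef>

// Hypothetical sketch of algos/AllGather/Types.h after the move.
// Field names are placeholders; the real members are not given in this summary.
namespace ctran {
namespace allgather {

struct KernelArgs {
  const void* sendbuff; // placeholder field
  void* recvbuff;       // placeholder field
  std::size_t count;    // placeholder field
};

} // namespace allgather
} // namespace ctran

// Hypothetical caller: the qualified name ctran::allgather::KernelArgs
// replaces the old CtranKernelAllGatherArgs identifier.
inline std::size_t allGatherElemCount(const ctran::allgather::KernelArgs& args) {
  return args.count;
}
```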

Differential Revision: D91983718

Summary:

Implements native AVG support for the PAT (Parallel Aggregated Trees) algorithm in ReduceScatter.

## Problem
Baseline NCCL doesn't support the PAT algorithm with the AVG op:
- If `NCCL_ALGO=reducescatter:pat` is not set, NCCL falls back to the Ring algorithm.
- If `NCCL_ALGO=reducescatter:pat` is set, NCCL fails with `ncclInvalidUsage - Error : no algorithm/protocol available for function`.

## Solution
This diff enables native AVG with the PAT algorithm for ReduceScatter. The division by the integer nRanks happens at the last step, when the final sum is written into recvbuf.
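
The arithmetic can be modeled on the host as a plain sum followed by a single division at write time. This is a minimal sketch, not the kernel code: `reduceThenPostDivide` and its container layout are assumptions made for illustration, with each inner vector standing in for one rank's contribution to the chunk that the local rank owns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only (hypothetical helper, not part of NCCL).
// contributions[r] is rank r's data for the chunk owned by the local rank.
template <typename T>
std::vector<T> reduceThenPostDivide(const std::vector<std::vector<T>>& contributions) {
  assert(!contributions.empty());
  const int nRanks = static_cast<int>(contributions.size());
  const std::size_t n = contributions[0].size();

  // Reduction phase: accumulate a plain sum, with no scaling yet.
  std::vector<T> acc(n, T(0));
  for (const auto& c : contributions) {
    assert(c.size() == n);
    for (std::size_t i = 0; i < n; ++i) {
      acc[i] += c[i];
    }
  }

  // Final write: divide by the integer rank count while storing to recvbuf.
  std::vector<T> recvbuf(n);
  for (std::size_t i = 0; i < n; ++i) {
    recvbuf[i] = acc[i] / static_cast<T>(nRanks);
  }
  return recvbuf;
}
```

With four ranks contributing `{1, 2}`, `{3, 4}`, `{5, 6}`, and `{7, 8}`, the sums are `{16, 20}` and the written result is `{4, 5}`.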

**Documentation Added:**
- `meta/collectives/docs/ReduceScatterPat.md` - Comprehensive PAT algorithm
  documentation including 5-phase breakdown and 8-rank visualization
- `meta/collectives/docs/ReduceScatterPatAvg.md` - PAT AVG design details,
  multi-chunk handling, and implementation notes

**Key Implementation:**
- Add `isFinalWrite` flag to `ncclPatStep` struct (set in Phase 4) to correctly
  apply division for all chunks in multi-chunk transfers (fixes large message bug)
- Add `FuncPatAvg<T>` template that uses `FuncSum` for reduction and applies
  division as a postOp in final write step
- Add `ncclDevPatAvg` enum for kernel dispatch
- Update `generate.py` and `def_build.bzl` for PatAvg kernel generation
- Enable via `NCCL_ALGO=reducescatter:pat_postdiv`
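
A simplified host-side model of how a per-step `isFinalWrite` flag and a sum-plus-postOp functor could fit together is sketched below. All `*Sketch` types and `applyStep` are hypothetical stand-ins: the real `ncclPatStep`, `FuncSum`, and `FuncPatAvg<T>` have interfaces that this summary does not show. The point is only that every chunk's final step carries the flag, so the division is applied to all chunks rather than just the first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the sum reduction functor.
template <typename T>
struct FuncSumSketch {
  T operator()(T a, T b) const { return a + b; }
};

// Hypothetical stand-in for the AVG functor: reduce with sum, divide in postOp.
template <typename T>
struct FuncPatAvgSketch {
  int nRanks;
  T reduce(T a, T b) const { return FuncSumSketch<T>{}(a, b); }
  T postOp(T x) const { return x / static_cast<T>(nRanks); }
};

// Hypothetical stand-in for a PAT step; only the fields this sketch needs.
struct PatStepSketch {
  std::size_t offset; // first element of the chunk
  std::size_t count;  // number of elements in the chunk
  bool isFinalWrite;  // true when this step writes the result to recvbuf
};

// Applies one step: always reduce, and apply postOp only on the final write.
template <typename T>
void applyStep(const PatStepSketch& step, const FuncPatAvgSketch<T>& fn,
               const std::vector<T>& incoming, std::vector<T>& acc,
               std::vector<T>& recvbuf) {
  assert(step.offset + step.count <= acc.size());
  assert(incoming.size() == acc.size() && recvbuf.size() == acc.size());
  for (std::size_t i = step.offset; i < step.offset + step.count; ++i) {
    acc[i] = fn.reduce(acc[i], incoming[i]);
    if (step.isFinalWrite) {
      recvbuf[i] = fn.postOp(acc[i]);
    }
  }
}
```

In this model, a message split into two chunks issues two steps, each with `isFinalWrite` set on its last step, so both halves of `recvbuf` end up divided by `nRanks`.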

**Meta overlay pattern used to minimize upstream changes:**
- `meta/device/FuncPatAvg.cuh`: Full implementation (~120 lines)
- `meta/collectives/PatAvgAlgoHelper.h`: Helper functions with lazy env detection
- All `src/` changes (~15 lines) are marked with `[META:PAT_AVG]` comments so
  they are easy to track across rebases
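
One way lazy env detection could be written is sketched below. Both function names are hypothetical, and the substring match is a deliberate simplification: it does not parse the full `NCCL_ALGO` syntax and is not taken from `PatAvgAlgoHelper.h`.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical pure helper: checks whether an NCCL_ALGO value asks for the
// post-division PAT variant. Simplified substring match, not a full parser.
inline bool algoRequestsPatPostDiv(const char* ncclAlgo) {
  if (ncclAlgo == nullptr) {
    return false;
  }
  return std::string(ncclAlgo).find("reducescatter:pat_postdiv") != std::string::npos;
}

// Hypothetical lazy accessor: the environment is read once, on first call,
// and the answer is cached in a function-local static.
inline bool patAvgEnabled() {
  static const bool enabled = algoRequestsPatPostDiv(std::getenv("NCCL_ALGO"));
  return enabled;
}
```

Keeping the parsing in a separate pure function lets it be checked without mutating the process environment, while the function-local static gives thread-safe one-time initialization in C++11 and later.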

Differential Revision: D91948601

Summary:
Add a Claude Code agent definition for reviewing ctran code changes. The agent:
- Reviews diffs for correctness (thread safety, test coverage, code abstraction)
- Performs performance review (benchmark requirements, roofline analysis)
- References CLAUDE.md as the authoritative source for coding standards
- Outputs structured feedback with clear recommendations (APPROVE/NEEDS_CHANGES/NEEDS_HUMAN_REVIEW)

This agent should be invoked after code changes are made to provide automated review feedback before human review.

Differential Revision: D91963243

Summary:
Move CtranKernelSendArgs, CtranKernelRecvArgs, and CtranKernelSendRecvArgs from
gpe/CtranGpeDev.h to algos/SendRecv/Types.h as ctran::sendrecv::KernelSendArgs,
KernelRecvArgs, and KernelSendRecvArgs respectively.

Part of KernelElem cleanup Phase 1.

Naming follows the convention:
- Remove "Ctran" prefix
- Keep "Kernel" prefix
- Keep "Send/Recv/SendRecv" suffix since they're distinct types

Differential Revision: D91983715
@meta-cla meta-cla bot added the CLA Signed label Feb 1, 2026
Labels

CLA Signed, fb-exported, meta-exported
