Commit edcb0c8

Fact-check fixes + version note + §4 API framing
- §4.5 proxy thread: ncclProxyService → ncclProxyProgress (data plane). Add a one-line note that ncclProxyService stays on the control plane (setup/connect) while ncclProxyProgress drives data progress.
- §5.0 P2P channel heuristic: scheduleP2pTasksToPlan → addP2pToPlan, citing the actual nChannels = std::min(nChannelsMin, divUp(...)) line.
- §1: add a one-line version baseline (NCCL master, 2026-04, v2.30).
- §4 intro: "공개 API ... 세 부류" ("public API ... three classes") → "통신 API ... 세 가지" ("communication API ... three kinds") with a two-sided P2P / one-sided RMA split, noting the official-docs nesting.
1 parent 67228a5 commit edcb0c8

2 files changed

Lines changed: 10 additions & 6 deletions

_posts/2026-04-21-nccl-collectives.ko.md

Lines changed: 5 additions & 3 deletions
@@ -20,6 +20,8 @@ So parallel computing [...] group-level communication patterns (collectives)
 
 This post is written against NCCL, but the vocabulary itself is compatible with MPI. Names such as AllReduce and AllGather are identical, and algorithm selection uses a similar cost model.
 
+> Code citations and function names follow NCCL master (as of 2026-04, v2.30).
+
 ## 2. MPI and NCCL
 
 | Criterion | MPI | NCCL |
@@ -67,7 +69,7 @@ AllReduce can be built as Reduce + Broadcast or as ReduceScatter + AllGather
 
 ## 4. NCCL Primitive Catalog
 
-NCCL's public API falls into three classes: collective, P2P, and 1-sided RMA.
+NCCL's communication API falls into three kinds: collective, two-sided P2P, and one-sided RMA. The official docs group the latter two under P2P, but they are distinct models that differ in whether rendezvous coupling is present, so this post treats the three separately.
 
 ### 4.1 Eight Collectives
 
@@ -197,7 +199,7 @@ GPU kernel ─→ GPU vidmem ─→ (CPU proxy thread) ─→ NIC ─→ wire
 └─→ RDMA write (IB/RoCE) or socket send
 ```
 
-- Once the GPU kernel fills the buffer, the **CPU proxy thread** (`ncclProxyService`, `src/proxy.cc`) posts the NIC's RDMA write or socket send. The CPU does not touch the data itself, but orchestrating NIC work is the host thread's job.
+- Once the GPU kernel fills the buffer, the CPU `ncclProxyProgress` thread (`src/proxy.cc`) posts the NIC's RDMA write or socket send. NCCL splits the proxy into two threads: control-plane messages such as setup / connect go to `ncclProxyService`, while actual data progress is handled by `ncclProxyProgress`. The CPU does not touch the data itself, but orchestrating NIC work is the host thread's job.
 - When **GPUDirect RDMA is available** (NIC and GPU under the same PCIe switch or multiple bridges within it, default `PATH_PXB`), the intermediate buffer lives in GPU vidmem and the NIC reads/writes GPU memory directly. The gate is decided by `ncclTopoCheckGdr` (`src/graph/paths.cc`) and can be overridden with the `NCCL_NET_GDR_LEVEL` environment variable.
 - When it is **unavailable**, data is staged in host pinned memory: GPU → host copy → NIC RDMA → peer host → GPU copy. That amounts to two more PCIe crossings.
 - A **rendezvous**, in which both sides agree on buffer readiness, precedes the data transfer.
@@ -237,7 +239,7 @@ ZeRO-3 / FSDP's communication design [...] the first decomposition (AR = RS + AG)
 - kernel grid: `dim3 grid = {(unsigned)nChannels, 1, 1};` (`src/enqueue.cc`). One channel = one CUDA block
 - input buffer: partitioned into per-channel disjoint contiguous regions
 - each channel runs its own ring (or tree) instance *independently*
-- if per-channel chunks become too small, the NIC FIFO fills less and network throughput drops. For small messages NCCL heuristically reduces `nChannels` (`enqueue.cc::scheduleP2pTasksToPlan`)
+- if per-channel chunks become too small, the NIC FIFO fills less and network throughput drops. For small messages NCCL heuristically reduces `nChannels` (the `nChannels[dir] = std::min<int>(nChannelsMin, divUp(bytes[dir], minPartSize));` line inside `enqueue.cc::addP2pToPlan`)
 
 The `runRing` of §5.2 is the ring execution *of one channel*, and within the same kernel launch nChannels blocks run the same code concurrently over different data segments. This structure does not contradict the single-kernel model of §7 Layer 2; it reinforces it. The kernel is launched once, and the grid inside it is nChannels wide.
 
_posts/2026-04-21-nccl-collectives.md

Lines changed: 5 additions & 3 deletions
@@ -20,6 +20,8 @@ So parallel computing exposes group-level communication patterns (collectives) a
 
 This post is NCCL-centric, but the vocabulary is MPI-compatible. Names like AllReduce, AllGather are identical, and the algorithm-selection logic uses a similar cost model.
 
+> Code references and function names follow NCCL master as of 2026-04 (v2.30).
+
 ## 2. MPI vs NCCL
 
 | Aspect | MPI | NCCL |
@@ -67,7 +69,7 @@ AllReduce can be implemented as Reduce + Broadcast or as ReduceScatter + AllGath
 
 ## 4. NCCL Primitive Catalog
 
-NCCL's public API splits into three groups: collectives, P2P, and one-sided RMA.
+NCCL's communication API splits into three paradigms: collectives, two-sided P2P, and one-sided RMA. The official docs nest the last two under 'P2P' as sub-categories; this post puts them at the same level because the rendezvous coupling differs between them.
 
 ### 4.1 Eight Collectives
 
@@ -197,7 +199,7 @@ GPU kernel → GPU vidmem → CPU proxy thread → NIC → wire → NIC → ...
 └→ RDMA write (IB/RoCE) or socket send
 ```
 
-- Once a GPU kernel fills a buffer, the **CPU proxy thread** (`ncclProxyService`, `src/proxy.cc`) posts the NIC's RDMA write or socket send. The CPU never touches the data itself, but orchestrating NIC operations is host-thread work.
+- Once a GPU kernel fills a buffer, the CPU's `ncclProxyProgress` thread (`src/proxy.cc`) posts the NIC's RDMA write or socket send. NCCL splits the proxy into two threads: `ncclProxyService` handles control-plane setup and connect messages, and `ncclProxyProgress` drives the data side. The CPU never touches the data itself, but orchestrating NIC operations is host-thread work.
 - **GPUDirect RDMA available** (NIC and GPU share a PCIe switch or sit within the same complex of bridges; gated by `ncclTopoCheckGdr` in `src/graph/paths.cc`) means the intermediate buffer lives in GPU vidmem and the NIC reads/writes GPU memory directly. `NCCL_NET_GDR_LEVEL` tunes the threshold.
 - **Unavailable** routes through host pinned memory: GPU → host copy → NIC RDMA → peer host → GPU copy. Two extra PCIe traversals.
 - A **rendezvous** where the two sides agree on buffer readiness precedes every data transfer.
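
Note on the hunk above: the two data paths described in the GPUDirect bullets can be written out as a small hop list. This is only a sketch for reading the diff. `netPath` and `extraStagingHops` are hypothetical helpers, not NCCL functions, and treating the sender and receiver decisions as two independent booleans is an assumption of the sketch rather than something the post states.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of NCCL): lists the buffers and devices a
// payload passes through between two GPUs on different nodes.
// gdrSend / gdrRecv stand in for the per-side outcome of the GPUDirect RDMA
// gate; modelling the two sides independently is an assumption of this sketch.
inline std::vector<std::string> netPath(bool gdrSend, bool gdrRecv) {
  std::vector<std::string> hops{"GPU vidmem (sender)"};
  if (!gdrSend) hops.push_back("host pinned memory (sender)");    // staging copy
  hops.push_back("NIC (sender)");
  hops.push_back("wire");
  hops.push_back("NIC (receiver)");
  if (!gdrRecv) hops.push_back("host pinned memory (receiver)");  // staging copy
  hops.push_back("GPU vidmem (receiver)");
  return hops;
}

// Number of staging hops added relative to the all-GPUDirect path.
inline int extraStagingHops(bool gdrSend, bool gdrRecv) {
  const auto staged = netPath(gdrSend, gdrRecv);
  const auto direct = netPath(true, true);
  assert(staged.size() >= direct.size());
  return static_cast<int>(staged.size() - direct.size());
}
```

With both flags false, `extraStagingHops(false, false)` returns 2, which matches the "two extra PCIe traversals" wording in the bullet.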
@@ -237,7 +239,7 @@ So far we've been drawing "ring" as a single path, but NCCL actually splits one
 - kernel grid: `dim3 grid = {(unsigned)nChannels, 1, 1};` (`src/enqueue.cc`). One channel = one CUDA block.
 - input buffer: partitioned into per-channel disjoint contiguous regions.
 - each channel runs its own ring (or tree) instance *independently*.
-- if per-channel chunks get too small, NIC FIFOs sit underfilled and network throughput tanks. For small messages NCCL heuristically reduces `nChannels` (`enqueue.cc::scheduleP2pTasksToPlan`).
+- if per-channel chunks get too small, NIC FIFOs sit underfilled and network throughput tanks. For small messages NCCL heuristically reduces `nChannels` (`enqueue.cc::addP2pToPlan`, the `nChannels[dir] = std::min<int>(nChannelsMin, divUp(bytes[dir], minPartSize));` line).
 
 So the `runRing` we'll meet in §5.2 is the ring run *for one channel*, and within the same kernel launch nChannels blocks run the same code over different data segments simultaneously. This doesn't contradict the single-kernel model of §7 Layer 2; it sharpens it. One kernel launch, with a grid of nChannels inside.
 
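
Note on the hunk above: the quoted `addP2pToPlan` line is easier to read with numbers plugged in. The sketch below reproduces only that one expression and adds an even per-channel split to illustrate "disjoint contiguous regions". `pickChannels`, `channelRegion`, and `Region` are hypothetical names, the even split is an assumption rather than NCCL's actual partitioning, and the parameter values in the usage note are illustrative, not NCCL defaults.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Ceiling division; plays the role of the divUp helper in the quoted line.
inline std::size_t divUp(std::size_t x, std::size_t y) {
  assert(y > 0);
  return (x + y - 1) / y;
}

// Mirrors the quoted expression: never more than nChannelsMin channels, and
// fewer when the message is too small to give each channel minPartSize bytes.
// A zero-byte message yields 0 channels in this sketch.
inline int pickChannels(std::size_t bytes, int nChannelsMin, std::size_t minPartSize) {
  return std::min<int>(nChannelsMin, static_cast<int>(divUp(bytes, minPartSize)));
}

struct Region {
  std::size_t offset;  // first byte owned by the channel
  std::size_t size;    // number of bytes owned by the channel
};

// Assumed even split into per-channel disjoint contiguous regions.
// The last channel may receive a shorter (possibly empty) tail.
inline Region channelRegion(std::size_t bytes, int nChannels, int channel) {
  assert(nChannels > 0 && channel >= 0 && channel < nChannels);
  const std::size_t part = divUp(bytes, static_cast<std::size_t>(nChannels));
  const std::size_t begin = std::min(bytes, part * static_cast<std::size_t>(channel));
  const std::size_t end = std::min(bytes, begin + part);
  return {begin, end - begin};
}
```

With the illustrative values `nChannelsMin = 4` and `minPartSize = 64 KiB`, a 16 KiB message gets 1 channel, a 100 KiB message gets 2, and a 1 MiB message gets 4, which is the "reduce `nChannels` for small messages" behavior the bullet describes.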