- Once a GPU kernel fills a buffer, the **CPU proxy thread** (`ncclProxyService`, `src/proxy.cc`) posts the NIC's RDMA write or socket send. The CPU never touches the data itself, but orchestrating NIC operations is host-thread work.
202
+
- Once a GPU kernel fills a buffer, the CPU's `ncclProxyProgress` thread (`src/proxy.cc`) posts the NIC's RDMA write or socket send. NCCL splits the proxy into two threads: `ncclProxyService` handles control-plane messages such as setup and connect, and `ncclProxyProgress` drives the actual data progress. The CPU never touches the data itself, but orchestrating NIC operations is host-thread work.
201
203
  - **GPUDirect RDMA available** (NIC and GPU sit under the same PCIe switch or within its multiple bridges, default `PATH_PXB`): the intermediate buffer lives in GPU vidmem and the NIC reads/writes GPU memory directly. The gate is decided by `ncclTopoCheckGdr` (`src/graph/paths.cc`) and can be overridden with the `NCCL_NET_GDR_LEVEL` environment variable.
202
204
  - **Unavailable**: data is staged through host pinned memory: GPU → host copy → NIC RDMA → peer host → GPU copy. That is two extra PCIe crossings.
203
205
- A **rendezvous**, in which both sides agree on buffer readiness, precedes the data transfer.
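
The GPUDirect RDMA gate described above can be pictured as a distance comparison. The following is a minimal sketch, not NCCL's code: the `PathType` enum is a simplified, hypothetical ordering of NIC-to-GPU path types (NCCL's real list has more entries and varies by version), and `chooseStagingBuffer` and `extraPcieCopies` are illustrative names. Only the `PATH_PXB` default and the two extra PCIe crossings come from the text.

```cpp
#include <cassert>

// Hypothetical, simplified ordering of NIC<->GPU path types, closest first.
// NCCL's real topology code defines more path types than these four.
enum class PathType { PIX = 0, PXB = 1, PHB = 2, SYS = 3 };

// Where the intermediate buffer lives for a network transfer.
enum class StagingBuffer { GpuVidmem, HostPinned };

// Sketch of the gate: GPUDirect RDMA is used only when the NIC<->GPU path is
// no farther than the threshold (default PXB, mirroring the text above).
inline StagingBuffer chooseStagingBuffer(PathType nicToGpu,
                                         PathType gdrLevel = PathType::PXB) {
  return static_cast<int>(nicToGpu) <= static_cast<int>(gdrLevel)
             ? StagingBuffer::GpuVidmem
             : StagingBuffer::HostPinned;
}

// Extra PCIe copies relative to the GPUDirect path: one GPU -> host copy on
// the sender and one host -> GPU copy on the receiver.
inline int extraPcieCopies(StagingBuffer buffer) {
  return buffer == StagingBuffer::HostPinned ? 2 : 0;
}
```

Passing a looser `gdrLevel` plays the role that overriding `NCCL_NET_GDR_LEVEL` plays in the text: a path that would otherwise fall back to host staging is allowed to use GPU memory directly.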
@@ -237,7 +239,7 @@ ZeRO-3 / FSDP communication design, the first decomposition (AR = RS + AG)
237
239
- kernel grid: `dim3 grid = {(unsigned)nChannels, 1, 1};` (`src/enqueue.cc`). One channel = one CUDA block.
238
240
- input buffer: partitioned into per-channel disjoint contiguous regions.
239
241
- each channel runs its own ring (or tree) instance *independently*.
240
-
- if per-channel chunks get too small, the NIC FIFO is underfilled and network throughput drops. For small messages NCCL heuristically reduces `nChannels` (`enqueue.cc::scheduleP2pTasksToPlan`).
242
+
- if per-channel chunks get too small, the NIC FIFO is underfilled and network throughput drops. For small messages NCCL heuristically reduces `nChannels` (`enqueue.cc::addP2pToPlan`, the `nChannels[dir] = std::min<int>(nChannelsMin, divUp(bytes[dir], minPartSize));` line).
241
243
242
244
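The `runRing` in §5.2 is likewise the ring run *for one channel*; within the same kernel launch, nChannels blocks execute the same code on different data segments concurrently. This structure does not contradict the single-kernel model of §7 Layer 2; it reinforces it. There is one kernel launch, and the grid inside it has nChannels blocks.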
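
The per-channel partitioning of the input buffer can be sketched as follows. This is an illustration only, assuming an even split with the remainder handed to the first channels; NCCL's actual chunking is more involved, and `Region` and `partitionByChannel` are hypothetical names that do not appear in NCCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous slice of the input buffer owned by one channel.
struct Region {
  std::size_t offset;  // first element index of the slice
  std::size_t count;   // number of elements in the slice
};

// Split `count` elements into `nChannels` disjoint contiguous regions.
// The remainder is spread over the first channels, one element each.
inline std::vector<Region> partitionByChannel(std::size_t count, int nChannels) {
  assert(nChannels > 0);
  const std::size_t n = static_cast<std::size_t>(nChannels);
  const std::size_t base = count / n;
  const std::size_t rem = count % n;
  std::vector<Region> regions;
  regions.reserve(n);
  std::size_t offset = 0;
  for (std::size_t c = 0; c < n; ++c) {
    const std::size_t len = base + (c < rem ? 1 : 0);
    regions.push_back({offset, len});
    offset += len;
  }
  return regions;
}
```

Each block of the grid would then work on `regions[blockIdx.x]`, which is what lets nChannels blocks run the same code over different segments without overlapping.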
_posts/2026-04-21-nccl-collectives.md
5 additions & 3 deletions
@@ -20,6 +20,8 @@ So parallel computing exposes group-level communication patterns (collectives) a
20
20
21
21
This post is NCCL-centric, but the vocabulary is MPI-compatible. Names like AllReduce and AllGather are identical, and the algorithm-selection logic uses a similar cost model.
22
22
23
+
> Code references and function names follow NCCL master as of 2026-04 (v2.30).
24
+
23
25
## 2. MPI vs NCCL
24
26
25
27
| Aspect | MPI | NCCL |
@@ -67,7 +69,7 @@ AllReduce can be implemented as Reduce + Broadcast or as ReduceScatter + AllGath
67
69
68
70
## 4. NCCL Primitive Catalog
69
71
70
-
NCCL's public API splits into three groups: collectives, P2P, and one-sided RMA.
72
+
NCCL's communication API splits into three paradigms: collectives, two-sided P2P, and one-sided RMA. The official docs nest the last two under 'P2P' as sub-categories; this post puts them at the same level because the rendezvous coupling differs between them.
71
73
72
74
### 4.1 Eight Collectives
73
75
@@ -197,7 +199,7 @@ GPU kernel → GPU vidmem → CPU proxy thread → NIC → wire → NIC → ...
197
199
└→ RDMA write (IB/RoCE) or socket send
198
200
```
199
201
200
-
- Once a GPU kernel fills a buffer, the **CPU proxy thread** (`ncclProxyService`, `src/proxy.cc`) posts the NIC's RDMA write or socket send. The CPU never touches the data itself, but orchestrating NIC operations is host-thread work.
202
+
- Once a GPU kernel fills a buffer, the CPU's `ncclProxyProgress` thread (`src/proxy.cc`) posts the NIC's RDMA write or socket send. NCCL splits the proxy into two threads: `ncclProxyService` handles control-plane setup and connect messages, and `ncclProxyProgress` drives the data side. The CPU never touches the data itself, but orchestrating NIC operations is host-thread work.
201
203
  - **GPUDirect RDMA available** (NIC and GPU share a PCIe switch or sit within the same complex of bridges; gated by `ncclTopoCheckGdr` in `src/graph/paths.cc`): the intermediate buffer lives in GPU vidmem and the NIC reads/writes GPU memory directly. `NCCL_NET_GDR_LEVEL` tunes the threshold.
202
204
  - **Unavailable**: data is staged through host pinned memory: GPU → host copy → NIC RDMA → peer host → GPU copy. Two extra PCIe traversals.
203
205
- A **rendezvous** where the two sides agree on buffer readiness precedes every data transfer.
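
The division of labor in the first bullet, where the GPU fills buffers and a host thread only posts NIC work, can be sketched as a single polling pass. This is a single-threaded stand-in under stated assumptions, not NCCL's `ncclProxyProgress`: `SendFifo` and `progressOnce` are hypothetical names, and pushing a step index into a vector stands in for posting an RDMA write or socket send.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical step counters shared between the GPU kernel and the host.
struct SendFifo {
  std::uint64_t filled = 0;  // steps the GPU kernel has finished writing
  std::uint64_t posted = 0;  // steps the proxy has handed to the NIC
};

// One progress pass: post every step that is filled but not yet posted.
// Only step indices are recorded; the payload is never read by the CPU.
inline int progressOnce(SendFifo& fifo, std::vector<std::uint64_t>& nicQueue) {
  int newlyPosted = 0;
  while (fifo.posted < fifo.filled) {
    nicQueue.push_back(fifo.posted);  // stands in for an RDMA write / socket send
    ++fifo.posted;
    ++newlyPosted;
  }
  return newlyPosted;
}
```

In a real proxy the `filled` counter would be advanced by the GPU kernel and the loop would run on its own host thread; the sketch keeps both in one thread so the ordering stays easy to follow.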
@@ -237,7 +239,7 @@ So far we've been drawing "ring" as a single path, but NCCL actually splits one
237
239
- kernel grid: `dim3 grid = {(unsigned)nChannels, 1, 1};` (`src/enqueue.cc`). One channel = one CUDA block.
238
240
- input buffer: partitioned into per-channel disjoint contiguous regions.
239
241
- each channel runs its own ring (or tree) instance *independently*.
240
-
- if per-channel chunks get too small, NIC FIFOs sit underfilled and network throughput tanks. For small messages NCCL heuristically reduces `nChannels` (`enqueue.cc::scheduleP2pTasksToPlan`).
242
+
- if per-channel chunks get too small, NIC FIFOs sit underfilled and network throughput tanks. For small messages NCCL heuristically reduces `nChannels` (`enqueue.cc::addP2pToPlan`, the `nChannels[dir] = std::min<int>(nChannelsMin, divUp(bytes[dir], minPartSize));` line).
241
243
242
244
So the `runRing` we'll meet in §5.2 is the ring run *for one channel*, and within the same kernel launch nChannels blocks run the same code over different data segments simultaneously. This doesn't contradict the single-kernel model of §7 Layer 2; it sharpens it. One kernel launch, with a grid of nChannels inside.
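
The small-message heuristic quoted in the list above boils down to a ceiling division capped by a channel count. Here is a minimal standalone sketch of that expression: `pickChannels` is a hypothetical wrapper name, `divUp` is re-implemented here rather than taken from NCCL, and the byte sizes in the usage note are made-up values chosen only to show the shape of the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Ceiling division, playing the same role as the divUp helper quoted above.
inline std::size_t divUp(std::size_t x, std::size_t y) {
  assert(y > 0);
  return (x + y - 1) / y;
}

// Mirrors the quoted expression: min(nChannelsMin, divUp(bytes, minPartSize)).
inline int pickChannels(std::size_t bytes, int nChannelsMin,
                        std::size_t minPartSize) {
  return std::min<int>(nChannelsMin,
                       static_cast<int>(divUp(bytes, minPartSize)));
}
```

With an illustrative `minPartSize` of 65536 bytes and `nChannelsMin` of 8, a 4096-byte message gets one channel, a 200000-byte message gets four, and a 1 MiB message is capped at eight, so each channel's share never drops far below `minPartSize`.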