Fact-check fixes + version note + §4 API framing

roycho96 · roycho96 · commit edcb0c8e8558 · 2026-04-28T12:18:13.000+09:00
- §4.5 proxy thread: ncclProxyService → ncclProxyProgress (data plane).
  Add a one-line note that ncclProxyService stays on control plane
  (setup/connect), ncclProxyProgress drives data progress.
- §5.0 P2P channel heuristic: scheduleP2pTasksToPlan → addP2pToPlan,
  citing the actual nChannels[dir] = std::min<int>(nChannelsMin, divUp(...)) line.
- §1: add a one-line version baseline (NCCL master, 2026-04, v2.30).
- §4 intro: "공개 API ... 세 부류" → "통신 API ... 세 가지" with
  two-sided P2P / one-sided RMA split, noting the official-docs nesting.
diff --git a/_posts/2026-04-21-nccl-collectives.ko.md b/_posts/2026-04-21-nccl-collectives.ko.md
@@ -20,6 +20,8 @@ Parallel computing 은 그래서 집단 단위 통신 패턴 (collective) 을 
 
 이 글은 NCCL 기준이지만 어휘 자체는 MPI 와 호환된다. AllReduce, AllGather 같은 이름이 똑같고, 알고리즘 선택도 비슷한 cost model 을 쓴다.
 
+> 코드 인용과 함수 이름은 NCCL master (2026-04 시점, v2.30) 기준.
+
 ## 2. MPI 와 NCCL
 
 | 기준 | MPI | NCCL |
@@ -67,7 +69,7 @@ AllReduce 는 Reduce + Broadcast 로 짜도 되고 ReduceScatter + AllGather 로
 
 ## 4. NCCL Primitive 카탈로그
 
-NCCL 의 공개 API 는 collective, P2P, 그리고 1-sided RMA 세 부류로 나뉜다.
+NCCL 의 통신 API 는 세 가지로 나뉜다. Collective, two-sided P2P, one-sided RMA. 공식 docs 는 뒤 둘을 P2P 하위로 묶지만, rendezvous 결합 유무가 다른 별개 모델이라 이 글은 셋을 따로 본다.
 
 ### 4.1 Collective 8 종
 
@@ -197,7 +199,7 @@ GPU kernel ─→ GPU vidmem ─→ (CPU proxy thread) ─→ NIC ─→ wire 
                                   └─→ RDMA write (IB/RoCE) 또는 socket send
 ```
 
-- GPU kernel 이 buffer 를 채우면, **CPU proxy thread** (`ncclProxyService`, `src/proxy.cc`) 가 NIC 의 RDMA write 또는 socket send 를 post 한다. CPU 가 데이터 자체를 만지지는 않지만 NIC 작업 orchestration 은 host thread 의 일.
+- GPU kernel 이 buffer 를 채우면, CPU 의 `ncclProxyProgress` thread (`src/proxy.cc`) 가 NIC 의 RDMA write 또는 socket send 를 post 한다. NCCL 은 proxy 를 두 thread 로 나눠 두는데, setup / connect 같은 control plane 메시지는 `ncclProxyService` 가, 실제 데이터 진행은 `ncclProxyProgress` 가 맡는다. CPU 가 데이터 자체를 만지지는 않지만 NIC 작업 orchestration 은 host thread 의 일.
 - **GPUDirect RDMA 가능** (NIC 와 GPU 가 같은 PCIe switch 또는 그 안의 multiple bridges, default `PATH_PXB`) 하면 intermediate buffer 가 GPU vidmem 에 올라가고 NIC 가 GPU memory 를 직접 read/write. 게이트는 `ncclTopoCheckGdr` (`src/graph/paths.cc`) 가 결정하고, `NCCL_NET_GDR_LEVEL` 환경변수로 override 가능.
 - **불가능**하면 host pinned memory 에 staging: GPU → host copy → NIC RDMA → 반대편 host → GPU copy. PCIe 를 두 번 더 건너는 셈.
 - 양쪽이 buffer readiness 를 합의하는 **rendezvous** 가 데이터 전송 앞에 깔린다.
@@ -237,7 +239,7 @@ ZeRO-3 / FSDP 의 통신 설계가 첫 번째 decomposition (AR = RS + AG) 을 
 - kernel grid: `dim3 grid = {(unsigned)nChannels, 1, 1};` (`src/enqueue.cc`). channel 1 개 = CUDA block 1 개
 - 입력 버퍼: channel 별 disjoint contiguous region 으로 partition
 - 각 channel 은 자기 ring (또는 tree) 인스턴스를 *독립적으로* 돌림
-- channel 별 chunk 가 너무 작아지면 NIC FIFO 가 덜 차서 network throughput 저하. 작은 메시지에서는 NCCL 이 휴리스틱으로 `nChannels` 를 줄임 (`enqueue.cc::scheduleP2pTasksToPlan`)
+- channel 별 chunk 가 너무 작아지면 NIC FIFO 가 덜 차서 network throughput 저하. 작은 메시지에서는 NCCL 이 휴리스틱으로 `nChannels` 를 줄임 (`enqueue.cc::addP2pToPlan` 안의 `nChannels[dir] = std::min<int>(nChannelsMin, divUp(bytes[dir], minPartSize));` 라인)
 
 §5.2 의 `runRing` 도 *한 channel 의* ring 실행이고, 같은 kernel launch 안에서 nChannels 개의 block 이 같은 코드를 다른 데이터 segment 에 대해 동시 실행한다. 이 구조는 §7 Layer 2 의 single-kernel 모델과 모순이 아니라 보강이다. kernel launch 는 1 회, 그 안의 grid 가 nChannels 만큼.
 
diff --git a/_posts/2026-04-21-nccl-collectives.md b/_posts/2026-04-21-nccl-collectives.md
@@ -20,6 +20,8 @@ So parallel computing exposes group-level communication patterns (collectives) a
 
 This post is NCCL-centric, but the vocabulary is MPI-compatible. Names like AllReduce, AllGather are identical, and the algorithm-selection logic uses a similar cost model.
 
+> Code references and function names follow NCCL master as of 2026-04 (v2.30).
+
 ## 2. MPI vs NCCL
 
 | Aspect | MPI | NCCL |
@@ -67,7 +69,7 @@ AllReduce can be implemented as Reduce + Broadcast or as ReduceScatter + AllGath
 
 ## 4. NCCL Primitive Catalog
 
-NCCL's public API splits into three groups: collectives, P2P, and one-sided RMA.
+NCCL's communication API splits into three paradigms: collectives, two-sided P2P, and one-sided RMA. The official docs nest the last two under 'P2P' as sub-categories; this post treats all three separately because the last two are distinct models that differ in whether a transfer is coupled to a rendezvous.
 
 ### 4.1 Eight Collectives
 
@@ -197,7 +199,7 @@ GPU kernel → GPU vidmem → CPU proxy thread → NIC → wire → NIC → ...
                                   └→ RDMA write (IB/RoCE) or socket send
 ```
 
-- Once a GPU kernel fills a buffer, the **CPU proxy thread** (`ncclProxyService`, `src/proxy.cc`) posts the NIC's RDMA write or socket send. The CPU never touches the data itself, but orchestrating NIC operations is host-thread work.
+- Once a GPU kernel fills a buffer, the CPU's `ncclProxyProgress` thread (`src/proxy.cc`) posts the NIC's RDMA write or socket send. NCCL splits the proxy into two threads: `ncclProxyService` handles control-plane messages such as setup and connect, and `ncclProxyProgress` drives data-plane progress. The CPU never touches the data itself, but orchestrating NIC operations is host-thread work.
 - **GPUDirect RDMA available** (NIC and GPU share a PCIe switch or sit within the same complex of bridges; gated by `ncclTopoCheckGdr` in `src/graph/paths.cc`) means the intermediate buffer lives in GPU vidmem and the NIC reads/writes GPU memory directly. `NCCL_NET_GDR_LEVEL` tunes the threshold.
 - **Unavailable** routes through host pinned memory: GPU → host copy → NIC RDMA → peer host → GPU copy. Two extra PCIe traversals.
 - A **rendezvous** where the two sides agree on buffer readiness precedes every data transfer.
@@ -237,7 +239,7 @@ So far we've been drawing "ring" as a single path, but NCCL actually splits one
 - kernel grid: `dim3 grid = {(unsigned)nChannels, 1, 1};` (`src/enqueue.cc`). One channel = one CUDA block.
 - input buffer: partitioned into per-channel disjoint contiguous regions.
 - each channel runs its own ring (or tree) instance *independently*.
-- if per-channel chunks get too small, NIC FIFOs sit underfilled and network throughput tanks. For small messages NCCL heuristically reduces `nChannels` (`enqueue.cc::scheduleP2pTasksToPlan`).
+- if per-channel chunks get too small, NIC FIFOs sit underfilled and network throughput tanks. For small messages NCCL heuristically reduces `nChannels` (`enqueue.cc::addP2pToPlan`, the `nChannels[dir] = std::min<int>(nChannelsMin, divUp(bytes[dir], minPartSize));` line).
 
 So the `runRing` we'll meet in §5.2 is the ring run *for one channel*, and within the same kernel launch nChannels blocks run the same code over different data segments simultaneously. This doesn't contradict the single-kernel model of §7 Layer 2; it sharpens it. One kernel launch, with a grid of nChannels inside.