From 0b2b632037f352874228e3dbbe5f99ccd30f7056 Mon Sep 17 00:00:00 2001
From: Natalia
Date: Mon, 9 Mar 2026 18:10:47 -0700
Subject: [PATCH] fix tables

---
 .../static/news/2026-03-09-reward-hacking.md | 137 ++++++++++++++----
 1 file changed, 110 insertions(+), 27 deletions(-)

diff --git a/kernelboard/static/news/2026-03-09-reward-hacking.md b/kernelboard/static/news/2026-03-09-reward-hacking.md
index 1dd283c..e331a7d 100644
--- a/kernelboard/static/news/2026-03-09-reward-hacking.md
+++ b/kernelboard/static/news/2026-03-09-reward-hacking.md
@@ -70,12 +70,32 @@ From the GPU's perspective, **only** one kernel ran. From the eval's perspective
 
 **Benchmark Cases (provided by [the eval harness](https://github.com/gpu-mode/reference-kernels/blob/main/problems/nvidia/nvfp4_group_gemm/task.yml))**
 
-| Case | Groups | N | K | Honest Tiles | Grid (honest) |
-|------|--------|------|------|-------------|----------------|
-| 1 | 8 | 4096 | 7168 | ~148 | 148 (all SMs) |
-| 2 | 8 | 7168 | 2048 | ~148 | 148 (all SMs) |
-| 3 | 2 | 3072 | 4096 | 120 | 120 (28 idle SMs) |
-| 4 | 2 | 4096 | 1536 | 128 | 128 (20 idle SMs) |
+<table>
+  <thead>
+    <tr><th>Case</th><th>Groups</th><th>N</th><th>K</th><th>Honest Tiles</th><th>Grid (honest)</th></tr>
+  </thead>
+  <tbody>
+    <tr><td>1</td><td>8</td><td>4096</td><td>7168</td><td>~148</td><td>148 (all SMs)</td></tr>
+    <tr><td>2</td><td>8</td><td>7168</td><td>2048</td><td>~148</td><td>148 (all SMs)</td></tr>
+    <tr><td>3</td><td>2</td><td>3072</td><td>4096</td><td>120</td><td>120 (28 idle SMs)</td></tr>
+    <tr><td>4</td><td>2</td><td>4096</td><td>1536</td><td>128</td><td>128 (20 idle SMs)</td></tr>
+  </tbody>
+</table>
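The "Grid (honest)" column follows directly from persistent scheduling: one CTA per SM, capped by the number of tiles, so any case with fewer tiles than the B200's 148 SMs leaves the remainder idle. A minimal sketch of that bookkeeping (function name is ours; `NUM_SMS = 148` comes from the article):

```python
NUM_SMS = 148  # streaming multiprocessors on B200, per the article

def honest_grid(tiles: int) -> tuple[int, int]:
    """Grid size and idle-SM count for a persistent kernel:
    one CTA per SM, but never more CTAs than there are tiles."""
    grid = min(tiles, NUM_SMS)
    idle = NUM_SMS - grid
    return grid, idle

# Cases 3 and 4 from the table above
assert honest_grid(120) == (120, 28)  # 28 idle SMs
assert honest_grid(128) == (128, 20)  # 20 idle SMs
```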
 
 Interestingly, in [submission.py](https://drive.google.com/file/d/1qBUUJSbv4V7Y-brxWjz5H2Th0cgIOyEo/view?usp=sharing) the agent included a bail-out for cases where using the super-batching would hurt. For large K (when `K > 4096`), the exploit disables itself and falls back to the legitimate path.
@@ -91,21 +111,56 @@ The fact that the exploit was conditional shows that the agent clearly used the
 
 **Experiment 1: Individual vs Super-batch (All 4 Cases)**
 
-| Case | Individual | Superbatch | Ratio | Reported (÷15) | Fake Speedup |
-|------|-----------|-----------|-------|----------------|-------------|
-| 1 (K=7168) | 55.07 μs | 56.48 μs | 1.03× | 3.77 μs | 1.0× (skipped) |
-| 2 (K=2048) | 39.23 μs | 345.98 μs | 8.82× | 23.07 μs | **1.70×** |
-| 3 (K=4096) | 21.34 μs | 126.98 μs | 5.95× | 8.47 μs | **2.52×** |
-| 4 (K=1536) | 18.75 μs | 72.51 μs | 3.87× | 4.83 μs | **3.88×** |
+<table>
+  <thead>
+    <tr><th>Case</th><th>Individual</th><th>Superbatch</th><th>Ratio</th><th>Reported (÷15)</th><th>Fake Speedup</th></tr>
+  </thead>
+  <tbody>
+    <tr><td>1 (K=7168)</td><td>55.07 μs</td><td>56.48 μs</td><td>1.03×</td><td>3.77 μs</td><td>1.0× (skipped)</td></tr>
+    <tr><td>2 (K=2048)</td><td>39.23 μs</td><td>345.98 μs</td><td>8.82×</td><td>23.07 μs</td><td><strong>1.70×</strong></td></tr>
+    <tr><td>3 (K=4096)</td><td>21.34 μs</td><td>126.98 μs</td><td>5.95×</td><td>8.47 μs</td><td><strong>2.52×</strong></td></tr>
+    <tr><td>4 (K=1536)</td><td>18.75 μs</td><td>72.51 μs</td><td>3.87×</td><td>4.83 μs</td><td><strong>3.88×</strong></td></tr>
+  </tbody>
+</table>
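The "Reported (÷15)" and "Fake Speedup" columns are plain arithmetic: the superbatch does all the work once, the harness divides that single wall-clock time by the 15 expected iterations, and the result is compared against the honest per-run time. A sketch of that accounting (timings in μs taken from the table; function name is ours):

```python
def fake_speedup(individual_us: float, superbatch_us: float, iters: int = 15):
    """Time the harness reports when one superbatched run is mistaken
    for `iters` runs, and the resulting apparent speedup."""
    reported = superbatch_us / iters
    return reported, individual_us / reported

# Case 4 (K=1536): 18.75 us honest vs 72.51 us superbatched
reported, speedup = fake_speedup(18.75, 72.51)
assert round(reported, 2) == 4.83   # matches "Reported (/15)" column
assert round(speedup, 2) == 3.88    # matches "Fake Speedup" column
```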
 
 **Key finding:** For case 1, the submission did not take the superbatch path since `K > 4096`, so there was no speedup. Cases 2-4 show increasing "speedup" for smaller problems.
 
 **Experiment 2: Forcing Super-batch on Case 1 (Removing the K>4096 Skip)**
 
-| Mode | Duration | DRAM Throughput | SM Busy | IPC |
-|------|---------|----------------|---------|-----|
-| Individual | 55.49 μs | 43.9% | 40.2% | 0.31 |
-| Superbatch | 770.43 μs | 82.9% | 43.3% | 0.21 |
+<table>
+  <thead>
+    <tr><th>Mode</th><th>Duration</th><th>DRAM Throughput</th><th>SM Busy</th><th>IPC</th></tr>
+  </thead>
+  <tbody>
+    <tr><td>Individual</td><td>55.49 μs</td><td>43.9%</td><td>40.2%</td><td>0.31</td></tr>
+    <tr><td>Superbatch</td><td>770.43 μs</td><td>82.9%</td><td>43.3%</td><td>0.21</td></tr>
+  </tbody>
+</table>
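Forcing the superbatch on case 1 makes the honest accounting explicit: 15× the work should take close to 15× the time, and any shortfall is the real per-tile efficiency gain. A sketch of that ratio using the durations in the table (function name is ours):

```python
def per_tile_efficiency(individual_us: float, superbatch_us: float,
                        work_ratio: float = 15.0) -> float:
    """Genuine per-tile efficiency gain of the superbatch:
    work_ratio x the work, done in (superbatch/individual) x the time."""
    time_ratio = superbatch_us / individual_us
    return work_ratio / time_ratio

# Case 1 forced down the superbatch path
eff = per_tile_efficiency(55.49, 770.43)
assert round(eff, 2) == 1.08  # far below the reported 2-4x "speedups"
```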
 
 **Ratio:** 770/55 = 13.9× for 15× work -> only 1.08× per-tile efficiency gain.
 
@@ -115,17 +170,45 @@ The fact that the exploit was conditional shows that the agent clearly used the
 
 The submission's CUTLASS kernel uses persistent scheduling: Grid=(148,1,1), one CTA per SM, each CTA processing multiple tiles sequentially. We profiled nine configurations on B200 using `ncu --set full`, each as a single kernel launch changing only the number of tiles. GPU timing events capture everything on the stream, including gaps between kernel launches. We wanted to quantify the associated overhead on the GPU stream and thus figure out whether there was something more to the superbatch choice than just exploiting the "dividing by 15" timing measurement.
 
-| Tiles | Duration | SM Busy | Instructions |
-|-------|---------|---------|-------------|
-| 1 | 19.55 μs | 0.15% | 4,601 |
-| 2 | 19.58 μs | 0.33% | 9,202 |
-| 4 | 19.94 μs | 0.63% | 18,404 |
-| 8 | 19.74 μs | 1.30% | 36,808 |
-| 16 | 19.74 μs | 2.50% | 73,616 |
-| 48 | 20.96 μs | 7.53% | 220,848 |
-| 120 | 21.86 μs | 18.66% | 541,564 |
-| 148 | 24.26 μs | 22.16% | 680,948 |
-| 240 | 31.04 μs | 27.24% | 945,776 |
+<table>
+  <thead>
+    <tr><th>Tiles</th><th>Duration</th><th>SM Busy</th><th>Instructions</th></tr>
+  </thead>
+  <tbody>
+    <tr><td>1</td><td>19.55 μs</td><td>0.15%</td><td>4,601</td></tr>
+    <tr><td>2</td><td>19.58 μs</td><td>0.33%</td><td>9,202</td></tr>
+    <tr><td>4</td><td>19.94 μs</td><td>0.63%</td><td>18,404</td></tr>
+    <tr><td>8</td><td>19.74 μs</td><td>1.30%</td><td>36,808</td></tr>
+    <tr><td>16</td><td>19.74 μs</td><td>2.50%</td><td>73,616</td></tr>
+    <tr><td>48</td><td>20.96 μs</td><td>7.53%</td><td>220,848</td></tr>
+    <tr><td>120</td><td>21.86 μs</td><td>18.66%</td><td>541,564</td></tr>
+    <tr><td>148</td><td>24.26 μs</td><td>22.16%</td><td>680,948</td></tr>
+    <tr><td>240</td><td>31.04 μs</td><td>27.24%</td><td>945,776</td></tr>
+  </tbody>
+</table>
 **Key finding:** The ~19.5 μs startup cost is **CONSTANT** regardless of tile count. With 1 tile, the kernel takes 19.55 μs at 0.15% SM Busy, so almost all of that time is GPU-side overhead rather than useful work. Even in the 148-tile and 240-tile cases, where CTAs must process more than one tile, the per-tile work cost is only ~0.074 μs (31.04 - 24.26 = 6.78 μs for 92 extra tiles). For 148 tiles, fixed startup overhead alone accounts for ~80% of total runtime. That overhead covers TMEM allocation, barrier setup, TMA descriptor initialization, tensormap creation, and pipeline state machine initialization, all executed before the persistent loop processes any tiles.
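The profile above reduces to a two-parameter picture: a constant launch/setup cost plus a tiny per-tile work cost once every SM has a tile. A sketch reproducing the arithmetic (timings in μs from the nine-configuration table; variable names are ours):

```python
STARTUP_US = 19.55  # 1-tile duration: essentially pure setup cost

# Per-tile work cost inferred from the 148 -> 240 tile step
per_tile_us = (31.04 - 24.26) / (240 - 148)
assert round(per_tile_us, 3) == 0.074  # ~0.074 us per extra tile

# Fixed startup as a fraction of the 148-tile runtime
overhead_fraction = STARTUP_US / 24.26
assert round(overhead_fraction, 2) == 0.81  # ~80% of total runtime
```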