Commit 8af0800

ruvnet committed
docs(adr): update ADR-258 with final measured decode speedups
Add decode performance table:
- CPU: 73.4ms → 62.3ms (-15%)
- CUDA: 48.9ms → 44.3ms (-9.4%)

Update build notes: CUDA 13.0 now supported natively with candle 0.9 + cudarc 0.19.

Co-Authored-By: claude-flow <ruv@ruv.net>
1 parent d774f42 commit 8af0800

1 file changed

Lines changed: 11 additions & 2 deletions

docs/adr/ADR-258-ruvllm-rdt-gpu-optimization.md

@@ -71,7 +71,7 @@ CUDA benches require `--features candle,cuda` with CUDA 12.8 (cudarc 0.13.9 does

 ## Consequences

-### Performance (RTX 5080, SM 12.0, CUDA 12.8)
+### Prefill performance (RTX 5080, SM 12.0, CUDA 12.8)

 | Model | Seq | CPU F32 | CUDA F32 | CUDA BF16 | GPU Speedup |
 |-------|-----|---------|----------|-----------|-------------|
@@ -98,9 +98,18 @@ On CPU, the vectorized path adds small overhead for tensor-operator dispatch ver

 ### Build notes

-- Requires CUDA 12.8 for the `cuda` feature on this workstation (cudarc 0.13.9 panics on CUDA 13.0; `/usr/local/cuda-12.8` is available).
+- After upgrading to candle 0.9 + cudarc 0.19 (see post-merge section), CUDA 13.0 is supported natively, with no `CUDA_HOME` workaround needed.
 - All 1582 tests pass under both `candle` and `candle,cuda` feature flags.

+### Decode performance (after post-merge optimizations)
+
+| Benchmark | Before | After | Δ |
+|-----------|--------|-------|---|
+| CPU decode prompt32_gen16 | 73.4 ms | 62.3 ms | **-15%** |
+| CUDA/BF16 decode prompt32_gen16 | 48.9 ms | 44.3 ms | **-9.4%** |
+
+Primary sources: KV cache pre-allocation via `scatter_set` (O(N²)→O(N) cat bandwidth + eliminate `cuMemAlloc` per step); on-device argmax (128KB→4B per greedy step); GPU top-k sort for sampling (128KB→320B per sampling step).
+
 ---

 ## Post-merge optimizations (main, 2026-06-18)
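
The sizes quoted in the "Primary sources" line can be sanity-checked with a small back-of-the-envelope model. The sketch below is not taken from the ruvllm code. All function names are hypothetical, and it assumes a 32,768-entry vocabulary of F32 logits (which would give 128 KB), a `u32` token id for the argmax result (4 B), and top-k with k = 40 returned as `(f32, u32)` pairs (which would give 320 B). It also counts KV cache traffic in token positions rather than bytes.

```rust
/// Token positions copied during decode when the KV cache is rebuilt by
/// concatenation: step i (1-based) copies the `prompt + i` entries that
/// make up the new cache. The total grows quadratically with `gen`.
fn cat_copies(prompt: u64, gen: u64) -> u64 {
    (1..=gen).map(|i| prompt + i).sum()
}

/// Token positions written during decode when the cache is pre-allocated
/// and each step writes only its own slot in place. Linear in `gen`.
fn prealloc_writes(gen: u64) -> u64 {
    gen
}

/// Bytes copied to the host per step if the full F32 logits row is read back.
fn full_logits_bytes(vocab: u64) -> u64 {
    vocab * 4
}

/// Bytes copied to the host per step if only the argmax index (u32) is read back.
fn argmax_bytes() -> u64 {
    4
}

/// Bytes copied to the host per step if only k (f32 logit, u32 index) pairs
/// are read back after an on-device top-k.
fn topk_bytes(k: u64) -> u64 {
    k * (4 + 4)
}
```

Under these assumptions, `cat_copies(32, 16)` gives 648 copied positions for the `prompt32_gen16` shape against 16 in-place writes, and the per-step readback sizes come out to 131,072 B, 4 B, and 320 B, matching the figures in the ADR text. The model says nothing about wall-clock time, which also depends on allocation and kernel launch overheads.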
