Attributing and optimizing LLM decode at the CUDA-kernel level on an RTX 4070 Laptop GPU.
On an RTX 4070 Laptop GPU (Ada AD106, 8 GB, ~250 GB/s measured DRAM ceiling), batch-1 LLM decode is dominated by weight-streaming GEMVs. Nsight shows ~81% of decode GPU-kernel time in cuBLAS GEMV kernels, with the largest MLP projections running at ~95% of the memory roofline. A custom fused INT4 dequant + GEMV kernel moves ~8.7× fewer bytes than a literal two-op dequant baseline and reaches ~223 GB/s in ncu, but the win is regime-specific: it is a batch-1 large-GEMV primitive, not a batched GEMM replacement.
Decode-phase inference at batch size 1 turns transformer linear layers into matrix-vector products. On a mobile RTX 4070, those GEMVs are not limited by FLOPs; they are limited by how fast weights can be streamed from DRAM.
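The memory-bound claim follows from arithmetic intensity, which the following minimal sketch computes. The 8960 × 1536 shape is an assumption (it matches the ~27.5 MB FP16 read reported for the MLP projections, but this README does not state the shape), and the function name is illustrative, not part of the repo.

```python
def gemv_arithmetic_intensity(rows: int, cols: int, batch: int = 1,
                              bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte of DRAM traffic for Y = W @ X with `batch` input vectors.

    Assumes weights are read once and FP16 activations are read/written once.
    """
    flops = 2 * rows * cols * batch               # one multiply + one add per weight per batch row
    weight_bytes = rows * cols * bytes_per_weight  # FP16 weights by default
    activation_bytes = 2 * batch * (rows + cols)   # FP16 input and output vectors
    return flops / (weight_bytes + activation_bytes)


# Assumed MLP projection shape (not stated in this README).
ROWS, COLS = 8960, 1536

for b in (1, 2, 16):
    print(f"batch={b:2d}  intensity={gemv_arithmetic_intensity(ROWS, COLS, b):.2f} FLOP/byte")
```

At batch 1 the intensity is roughly 1 FLOP per byte, so runtime is set by how fast the weights stream from DRAM. Intensity grows almost linearly with batch size because the same weights are reused across batch rows.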
This project asks a narrow systems question:
Can we attribute decode cost to individual CUDA kernels, prove the dominant kernels are memory-bound, and then reduce bytes moved with a fused dequantization kernel?
The answer is yes, with important caveats. The implementation follows four checkpoints:
- CUDA ramp: prove custom ops compile, run, and profile on Windows.
- Measurement: build a 4070 Laptop roofline from Nsight data.
- Kernel: implement a correctness-gated fused INT4 dequant + GEMV.
- Attribution: sweep regimes and show where the speedup holds and disappears.
Several production-grade INT4 kernels already exist. This project is not an attempt to beat them — it targets a different goal (kernel-level attribution on consumer silicon) and a different regime (batch-1 weight-streaming GEMV) than any of them optimize for.
| Project | What it is | Optimized regime | Relationship to this work |
|---|---|---|---|
| Marlin | FP16×INT4 mixed-precision GEMM kernel | Near-4× up to batch 16–32, datacenter GPUs (serving, speculative decoding) | Built to push weight-only quant past the batch-1 regime — explicitly the regime this kernel cedes to FP16 GEMM by B=2 |
| AWQ | Activation-aware quantization method (+ serving kernels) | Accuracy-preserving 4-bit weights; throughput serving | Orthogonal: a what-to-quantize algorithm, not a batch-1 GEMV attribution study |
| bitsandbytes | General 8-bit / 4-bit (NF4/FP4) quant library | Broad compatibility, QLoRA fine-tuning, HF integration | Convenience and coverage over decode-latency tuning; a black box for per-kernel analysis |
Why a from-scratch kernel, then? The contribution here is the measurement, not the primitive. The goal is to attribute decode cost to individual CUDA kernels, prove the dominant ones are memory-bound against a measured roofline, and show that cutting bytes moved helps in the pure weight-streaming regime — on a mobile RTX 4070, not an A100. A production library would obscure exactly the per-kernel attribution this project is about. And the honest result — the fused kernel wins only at B=1 (2.38×) and loses to FP16 GEMM by B=2 — lands squarely in the regime Marlin and friends are engineered to leave behind. That boundary is the finding.
| Result | Value |
|---|---|
| Target model / workload | Qwen2.5-1.5B, FP16, batch-1 decode |
| Hardware | RTX 4070 Laptop GPU, 8 GB, Ada AD106, 36 SMs |
| Measured DRAM ceiling | ~250 GB/s achievable (ncu theoretical peak ~259 GB/s) |
| Decode bottleneck | ~81% of GPU-kernel time in weight GEMVs |
| Dominant baseline GEMV | MLP gate/up/down: ~27.5 MB read, ~112 us, ~247 GB/s |
| Fused kernel | INT4 group dequant + GEMV, G=128, FP32 accumulation |
| Fused ncu result | ~223 GB/s, 86.38% of hardware peak, correctness checked before launch |
| Phase 2 speedup | 90.8× vs literal PyTorch two-op dequant -> GEMV baseline |
| Honest caveat | vs FP16 GEMM, fused MLP wins only at B=1 (2.38×) and loses by B=2 |
| Regime | Outcome | Interpretation |
|---|---|---|
| Batch 1, large MLP projection | Fused wins; byte reduction matters | Pure weight streaming, memory-bound |
| Batch 1, smaller attention q/o projection | Fused does not beat FP16 GEMM | Too small to saturate the roofline; launch/per-row overhead matters |
| Batch > 1 | Fused loses to FP16 GEMM | Weight reuse raises arithmetic intensity; GEMM is the right primitive |
| Literal two-op dequant baseline | Fused remains faster, but speedup shrinks | Baseline dequantizes once; fused rereads packed weights per batch row |
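The last row of the table can be illustrated with a simple byte-count model. This is a sketch under stated assumptions, not the accounting in docs/02_kernel_design.md: the 8960 × 1536 shape is assumed, activation traffic and cache reuse are ignored, and the two-op baseline is modeled as "read packed weights, write FP16 weights, read FP16 weights once". All function names are hypothetical.

```python
def packed_bytes(rows: int, cols: int, group: int = 128) -> int:
    """INT4 weights (half a byte each) plus one FP16 scale per group."""
    return rows * cols // 2 + (rows * cols // group) * 2


def fp16_bytes(rows: int, cols: int) -> int:
    return rows * cols * 2


def fused_traffic(rows: int, cols: int, batch: int) -> int:
    """Assumption: the fused kernel rereads the packed weights once per batch row."""
    return batch * packed_bytes(rows, cols)


def two_op_traffic(rows: int, cols: int) -> int:
    """Assumption: dequantize once (read packed, write FP16), then read FP16 once."""
    return packed_bytes(rows, cols) + 2 * fp16_bytes(rows, cols)


ROWS, COLS = 8960, 1536  # assumed MLP projection shape

for b in (1, 2, 4, 8):
    ratio = two_op_traffic(ROWS, COLS) / fused_traffic(ROWS, COLS, b)
    print(f"batch={b}  two-op bytes / fused bytes = {ratio:.2f}")
```

Under these assumptions the batch-1 ratio lands close to the ~8.7× byte reduction quoted above, and it shrinks roughly as 1/B. This model counts bytes only; it does not predict the measured speedups.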
Environment details are recorded in docs/00_environment.md:
- RTX 4070 Laptop GPU, 8 GB GDDR6, CC 8.9, 36 SMs
- Driver 581.95, CUDA toolkit 13.2, PyTorch 2.6.0+cu124
- Nsight Compute 2026.1.1, Nsight Systems 2025.6.3
- Measured bandwidth ceiling: ~250 GB/s achievable
The roofline uses measured bandwidth, not the nominal 256 GB/s datasheet figure.
The native batch-1 decode loop in profiling/nsys_decode.py profiles Qwen2.5-1.5B on the same Windows/CUDA environment used by the custom kernel.
Nsight Systems identifies cuBLAS gemvx as the dominant CUDA-kernel class. Nsight Compute then measures per-kernel bandwidth, SM throughput, L2 hit rate, and occupancy.
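As a sanity check on the profiler numbers, achieved bandwidth can be recomputed from bytes moved and kernel time. The sketch below uses the ~112 us and ~259 GB/s figures from the results table; the 8960 × 1536 shape is an assumption chosen to match the ~27.5 MB read, and the helper name is illustrative.

```python
def achieved_bandwidth_gbs(bytes_moved: int, seconds: float) -> float:
    """Achieved DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


# Assumed FP16 MLP projection: 8960 x 1536 weights at 2 bytes each (~27.5 MB).
weight_bytes = 8960 * 1536 * 2
kernel_time_s = 112e-6       # ~112 us, from the results table
peak_gbs = 259.0             # ncu theoretical peak, from the results table

bw = achieved_bandwidth_gbs(weight_bytes, kernel_time_s)
fraction_of_peak = bw / peak_gbs
print(f"{bw:.1f} GB/s, {fraction_of_peak:.1%} of peak")
```

With these rounded inputs the result falls within a couple of GB/s of the ~247 GB/s reported by ncu and at roughly 95% of peak, consistent with the roofline placement quoted above.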
The fused kernel in kernels/dequant_gemv.cu uses a simple symmetric INT4 group format:
```
scale = max(abs(W_group)) / 7
q     = clamp(round(W / scale), -7, 7)
nib   = q + 8
w_hat = (nib - 8) * scale
```
Packing is 8 nibbles per int32; scales are FP16 per row/group; accumulation is FP32. The Phase 2 design and traffic arithmetic are in docs/02_kernel_design.md.
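The format above can be prototyped on the CPU with NumPy. This is a reference sketch, not the code in kernels/dequant_gemv.cu: the function names are hypothetical, the nibble order within each int32 (lowest nibble first) is an assumption, and all-zero groups are given a scale of 1 to avoid dividing by zero.

```python
import numpy as np

GROUP = 128  # group size G=128, per the results table


def quantize_int4(w: np.ndarray):
    """Quantize (rows, cols) float32 weights; cols must be a multiple of GROUP.

    Returns packed int32 of shape (rows, cols // 8) and FP16 scales of
    shape (rows, cols // GROUP).
    """
    rows, cols = w.shape
    g = w.reshape(rows, cols // GROUP, GROUP)
    scale = (np.abs(g).max(axis=2) / 7.0).astype(np.float16)
    scale = np.where(scale == 0, np.float16(1.0), scale)      # guard all-zero groups
    q = np.clip(np.rint(g / scale.astype(np.float32)[:, :, None]), -7, 7)
    nib = (q.astype(np.int32) + 8).astype(np.uint32).reshape(rows, cols // 8, 8)
    shifts = 4 * np.arange(8, dtype=np.uint32)                # assumed order: nibble 0 in low bits
    packed = np.bitwise_or.reduce(nib << shifts, axis=2)
    return packed.view(np.int32), scale


def dequantize_int4(packed: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Invert quantize_int4: w_hat = (nib - 8) * scale."""
    rows = packed.shape[0]
    shifts = 4 * np.arange(8, dtype=np.uint32)
    nib = (packed.view(np.uint32)[:, :, None] >> shifts) & np.uint32(0xF)
    q = nib.astype(np.int32) - 8
    cols = packed.shape[1] * 8
    g = q.reshape(rows, cols // GROUP, GROUP).astype(np.float32)
    return (g * scale.astype(np.float32)[:, :, None]).reshape(rows, cols)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)
packed, scale = quantize_int4(w)
w_hat = dequantize_int4(packed, scale)
print(packed.shape, scale.shape, float(np.abs(w - w_hat).max()))
```

The round-trip error per element is bounded by about half the group scale, which is the kind of tolerance a correctness check against an FP16 reference would need to allow for.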
No timing is reported unless a correctness check passes in the same run:
- `kernels/tests/test_correctness.py` validates multiple Qwen-shaped projections and seeds.
- `kernels/tests/test_bench.py` correctness-checks before writing `bench/results/bench.csv`.
- `bench/profile_dequant_gemv.py` correctness-checks before the `ncu` profiled launch.
- `bench/sweep.py` correctness-checks every row before timing.
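The gating pattern described above can be sketched as follows. This is a generic CPU illustration with hypothetical names and tolerances, not the repo's test code; real GPU timing would also need device synchronization or CUDA events around each launch.

```python
import time

import numpy as np


def gated_benchmark(candidate, reference, args, rtol=1e-2, atol=1e-2, iters=50):
    """Time `candidate` only if it matches `reference` on the same inputs.

    Returns a list of per-iteration wall-clock times in seconds.
    Raises RuntimeError (and records no timing) if the check fails.
    """
    out = np.asarray(candidate(*args))
    ref = np.asarray(reference(*args))
    if not np.allclose(out, ref, rtol=rtol, atol=atol):
        raise RuntimeError("correctness check failed; refusing to report timing")

    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        candidate(*args)
        samples.append(time.perf_counter() - t0)
    return samples


def reference_gemv(w, x):
    return w @ x


w = np.eye(4, dtype=np.float32)
x = np.arange(4, dtype=np.float32)
timings = gated_benchmark(reference_gemv, reference_gemv, (w, x), iters=10)
print(len(timings), "timed iterations after passing the check")
```

Putting the check and the timing loop in the same function makes it impossible to report a number for a kernel that produced wrong output in that run.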
```bash
# CUDA-enabled torch first, then the rest
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# should print ENVIRONMENT: GREEN
python scripts/check_env.py
```

On native Windows, custom CUDA extensions require MSVC as the host compiler. The repo includes `scripts/with_msvc.bat`, which activates VS2022 Build Tools before running nvcc/ncu workflows.
```bash
bash scripts/reproduce.sh
```

This regenerates:

- `bench/results/bandwidth.csv`
- `bench/results/ncu_gemv.csv`
- `bench/results/ncu_kernels.csv`
- `bench/results/roofline.png`
- `bench/results/bench.csv`
- `bench/results/ncu_fused.csv`
- `bench/results/sweep.csv`
- `bench/results/sweep.png`
```bash
# build the fused CUDA extension
MSYS_NO_PATHCONV=1 cmd.exe /c "scripts\with_msvc.bat python kernels/load.py"
# correctness only
MSYS_NO_PATHCONV=1 cmd.exe /c "scripts\with_msvc.bat python -m pytest kernels\tests\test_correctness.py -v"
# correctness-gated benchmark
MSYS_NO_PATHCONV=1 cmd.exe /c "scripts\with_msvc.bat python -m pytest kernels\tests\test_bench.py -v -s"
# regime sweep
MSYS_NO_PATHCONV=1 cmd.exe /c "scripts\with_msvc.bat python bench\sweep.py"
```

If `make` is installed, the same workflows are exposed as `make env`, `make build`, `make test`, `make profile-ncu`, `make roofline`, `make bench`, `make sweep`, and `make reproduce`.
| File | What it contains |
|---|---|
| `docs/00_environment.md` | Exact hardware/software, Nsight counter permissions, Windows build recipe |
| `docs/01_roofline.md` | Decode-kernel roofline analysis and memory-bound conclusion |
| `docs/02_kernel_design.md` | INT4 format, traffic math, fused-kernel ncu result |
| `docs/03_results.md` | Batch/shape sweep and honest attribution |
| `bench/results/` | Checked-in CSVs and plots so the README renders without a GPU |
```
decode-roofline/
├── README.md
├── PROJECT_SPEC.md
├── requirements.txt
├── environment.yml
├── Makefile
│
├── docs/
│   ├── 00_environment.md
│   ├── 01_roofline.md
│   ├── 02_kernel_design.md
│   └── 03_results.md
│
├── profiling/
│   ├── measure_bandwidth.py
│   ├── nsys_decode.py
│   ├── ncu_kernels.py
│   ├── roofline.py
│   └── metrics.md
│
├── harness/
│   ├── load_model.py
│   ├── reference_gemv.py
│   └── decode_harness.py
│
├── kernels/
│   ├── dequant_gemv.cu
│   ├── dequant_gemv_bindings.cpp
│   ├── load.py
│   └── tests/
│       ├── test_correctness.py
│       └── test_bench.py
│
├── bench/
│   ├── profile_dequant_gemv.py
│   ├── sweep.py
│   └── results/
│       ├── roofline.png
│       ├── sweep.png
│       └── *.csv
│
└── scripts/
    ├── check_env.py
    ├── phase0_saxpy.py
    ├── reproduce.sh
    └── with_msvc.bat
```
- Correctness before speed. Every benchmark and profiler launch is gated by a reference check.
- No speedup without profiler context. Headline claims include shape, regime, achieved bandwidth, and roofline placement.
- Median + IQR, not best-case. The target is a mobile GPU; thermal and Windows jitter are part of the measurement.
- Measured roofline only. The project uses measured bandwidth from this machine, not desktop 4070 numbers.
- Small, verified results beat large, speculative ones. The final claim is intentionally regime-aware.
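The "median + IQR" principle above can be sketched in a few lines. The helper name and the sample values are illustrative only and are not taken from the repo or its results.

```python
import numpy as np


def median_iqr(samples):
    """Return (median, interquartile range) of a sequence of timings."""
    q1, med, q3 = np.percentile(samples, [25, 50, 75])
    return float(med), float(q3 - q1)


# Illustrative timings in microseconds, including one thermal/jitter outlier.
timings_us = [110.0, 112.0, 111.0, 113.0, 180.0]
med, iqr = median_iqr(timings_us)
print(f"median={med:.1f} us  IQR={iqr:.1f} us  best={min(timings_us):.1f} us")
```

A single outlier barely moves the median, while a mean would be pulled toward it and a best-case figure would hide the spread entirely.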
If you find this useful, cite the repository:
```bibtex
@misc{more2026decoderoofline,
  title  = {Decode Roofline: Kernel-Level Decode Profiling and Fused Dequant-GEMV on an RTX 4070 Laptop GPU},
  author = {More, Rishi},
  year   = {2026},
  url    = {https://github.com/rishi-more-2003/decode-roofline}
}
```

Built as a from-scratch CUDA/systems research project on consumer mobile silicon.

