Decode Roofline

Attributing and optimizing LLM decode at the CUDA-kernel level on an RTX 4070 Laptop GPU.

TL;DR · Key Results · Method · Reproduce · Project Structure

TL;DR

On an RTX 4070 Laptop GPU (Ada AD106, 8 GB, ~250 GB/s measured DRAM ceiling), batch-1 LLM decode is dominated by weight-streaming GEMVs. Nsight shows ~81% of decode GPU-kernel time in cuBLAS GEMV kernels, with the largest MLP projections running at ~95% of the memory roofline. A custom fused INT4 dequant + GEMV kernel moves ~8.7× fewer bytes than a literal two-op dequant baseline and reaches ~223 GB/s in ncu, but the win is honestly regime-specific: it is a batch-1 large-GEMV primitive, not a batched GEMM replacement.

Overview

Decode-phase inference at batch size 1 turns transformer linear layers into matrix-vector products. On a mobile RTX 4070, those GEMVs are not limited by FLOPs; they are limited by how fast weights can be streamed from DRAM.

This project asks a narrow systems question:

Can we attribute decode cost to individual CUDA kernels, prove the dominant kernels are memory-bound, and then reduce bytes moved with a fused dequantization kernel?

The answer is yes, with important caveats. The implementation follows four checkpoints:

CUDA ramp: prove custom ops compile, run, and profile on Windows.
Measurement: build a 4070 Laptop roofline from Nsight data.
Kernel: implement a correctness-gated fused INT4 dequant + GEMV.
Attribution: sweep regimes and show where the speedup holds and disappears.

Related Work

Several production-grade INT4 kernels already exist. This project is not an attempt to beat them — it targets a different goal (kernel-level attribution on consumer silicon) and a different regime (batch-1 weight-streaming GEMV) than any of them optimize for.

Project	What it is	Optimized regime	Relationship to this work
Marlin	FP16×INT4 mixed-precision GEMM kernel	Near-4× up to batch 16–32, datacenter GPUs (serving, speculative decoding)	Built to push weight-only quant past the batch-1 regime — explicitly the regime this kernel cedes to FP16 GEMM by B=2
AWQ	Activation-aware quantization method (+ serving kernels)	Accuracy-preserving 4-bit weights; throughput serving	Orthogonal: a what-to-quantize algorithm, not a batch-1 GEMV attribution study
bitsandbytes	General 8-bit / 4-bit (NF4/FP4) quant library	Broad compatibility, QLoRA fine-tuning, HF integration	Convenience and coverage over decode-latency tuning; a black box for per-kernel analysis

Why a from-scratch kernel, then? The contribution here is the measurement, not the primitive. The goal is to attribute decode cost to individual CUDA kernels, prove the dominant ones are memory-bound against a measured roofline, and show that cutting bytes moved helps in the pure weight-streaming regime — on a mobile RTX 4070, not an A100. A production library would obscure exactly the per-kernel attribution this project is about. And the honest result — the fused kernel wins only at B=1 (2.38×) and loses to FP16 GEMM by B=2 — lands squarely in the regime Marlin and friends are engineered to leave behind. That boundary is the finding.

Key Results


Roofline placement. Batch-1 decode GEMVs sit on the memory ceiling at arithmetic intensity ~1 FLOP/byte.	Regime sweep. The fused kernel wins in batch-1 large-GEMV mode, then loses to GEMM-style reuse as batch grows.

Headline Numbers

Result	Value
Target model / workload	Qwen2.5-1.5B, FP16, batch-1 decode
Hardware	RTX 4070 Laptop GPU, 8 GB, Ada AD106, 36 SMs
Measured DRAM ceiling	~250 GB/s achievable (`ncu` theoretical peak ~259 GB/s)
Decode bottleneck	~81% of GPU-kernel time in weight GEMVs
Dominant baseline GEMV	MLP gate/up/down: ~27.5 MB read, ~112 us, ~247 GB/s
Fused kernel	INT4 group dequant + GEMV, `G=128`, FP32 accumulation
Fused `ncu` result	~223 GB/s, 86.38% of hardware peak, correctness checked before launch
Phase 2 speedup	90.8× vs literal PyTorch two-op dequant -> GEMV baseline
Honest caveat	vs FP16 GEMM, fused MLP wins only at B=1 (2.38×) and loses by B=2

What Actually Happens

Regime	Outcome	Interpretation
Batch 1, large MLP projection	Fused wins; byte reduction matters	Pure weight streaming, memory-bound
Batch 1, smaller attention q/o projection	Fused does not beat FP16 GEMM	Too small to saturate the roofline; launch/per-row overhead matters
Batch > 1	Fused loses to FP16 GEMM	Weight reuse raises arithmetic intensity; GEMM is the right primitive
Literal two-op dequant baseline	Fused remains faster, but speedup shrinks	Baseline dequantizes once; fused rereads packed weights per batch row

Method

1. Measure the Machine, Not the Datasheet

Environment details are recorded in docs/00_environment.md:

RTX 4070 Laptop GPU, 8 GB GDDR6, CC 8.9, 36 SMs
Driver 581.95, CUDA toolkit 13.2, PyTorch 2.6.0+cu124
Nsight Compute 2026.1.1, Nsight Systems 2025.6.3
Measured bandwidth ceiling: ~250 GB/s achievable

The roofline uses measured bandwidth, not the nominal 256 GB/s datasheet figure.

2. Attribute Decode with Nsight

The native batch-1 decode loop in profiling/nsys_decode.py profiles Qwen2.5-1.5B on the same Windows/CUDA environment used by the custom kernel.

Nsight Systems identifies cuBLAS gemvx as the dominant CUDA-kernel class. Nsight Compute then measures per-kernel bandwidth, SM throughput, L2 hit rate, and occupancy.

3. Fuse Dequantization into GEMV

The fused kernel in kernels/dequant_gemv.cu uses a simple symmetric INT4 group format:

scale = max(abs(W_group)) / 7
q     = clamp(round(W / scale), -7, 7)
nib   = q + 8
w_hat = (nib - 8) * scale

Packing is 8 nibbles per int32; scales are FP16 per row/group; accumulation is FP32. The Phase 2 design and traffic arithmetic are in docs/02_kernel_design.md.

4. Correctness Gates Every Number

No timing is reported unless a correctness check passes in the same run:

kernels/tests/test_correctness.py validates multiple Qwen-shaped projections and seeds.
kernels/tests/test_bench.py correctness-checks before writing bench/results/bench.csv.
bench/profile_dequant_gemv.py correctness-checks before the ncu profiled launch.
bench/sweep.py correctness-checks every row before timing.

Reproduce

Quick Setup

# CUDA-enabled torch first, then the rest
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt

# should print ENVIRONMENT: GREEN
python scripts/check_env.py

On native Windows, custom CUDA extensions require MSVC as the host compiler. The repo includes scripts/with_msvc.bat, which activates VS2022 Build Tools before running nvcc/ncu workflows.

One-Command Reproduction

bash scripts/reproduce.sh

This regenerates:

bench/results/bandwidth.csv
bench/results/ncu_gemv.csv
bench/results/ncu_kernels.csv
bench/results/roofline.png
bench/results/bench.csv
bench/results/ncu_fused.csv
bench/results/sweep.csv
bench/results/sweep.png

Useful Individual Commands

# build the fused CUDA extension
MSYS_NO_PATHCONV=1 cmd.exe /c "scripts\with_msvc.bat python kernels/load.py"

# correctness only
MSYS_NO_PATHCONV=1 cmd.exe /c "scripts\with_msvc.bat python -m pytest kernels\tests\test_correctness.py -v"

# correctness-gated benchmark
MSYS_NO_PATHCONV=1 cmd.exe /c "scripts\with_msvc.bat python -m pytest kernels\tests\test_bench.py -v -s"

# regime sweep
MSYS_NO_PATHCONV=1 cmd.exe /c "scripts\with_msvc.bat python bench\sweep.py"

If make is installed, the same workflows are exposed as make env, make build, make test, make profile-ncu, make roofline, make bench, make sweep, and make reproduce.

Results Guide

File	What it contains
`docs/00_environment.md`	Exact hardware/software, Nsight counter permissions, Windows build recipe
`docs/01_roofline.md`	Decode-kernel roofline analysis and memory-bound conclusion
`docs/02_kernel_design.md`	INT4 format, traffic math, fused-kernel `ncu` result
`docs/03_results.md`	Batch/shape sweep and honest attribution
`bench/results/`	Checked-in CSVs and plots so the README renders without a GPU

Project Structure

Click to expand

decode-roofline/
├── README.md
├── PROJECT_SPEC.md
├── requirements.txt
├── environment.yml
├── Makefile
│
├── docs/
│   ├── 00_environment.md
│   ├── 01_roofline.md
│   ├── 02_kernel_design.md
│   └── 03_results.md
│
├── profiling/
│   ├── measure_bandwidth.py
│   ├── nsys_decode.py
│   ├── ncu_kernels.py
│   ├── roofline.py
│   └── metrics.md
│
├── harness/
│   ├── load_model.py
│   ├── reference_gemv.py
│   └── decode_harness.py
│
├── kernels/
│   ├── dequant_gemv.cu
│   ├── dequant_gemv_bindings.cpp
│   ├── load.py
│   └── tests/
│       ├── test_correctness.py
│       └── test_bench.py
│
├── bench/
│   ├── profile_dequant_gemv.py
│   ├── sweep.py
│   └── results/
│       ├── roofline.png
│       ├── sweep.png
│       └── *.csv
│
└── scripts/
    ├── check_env.py
    ├── phase0_saxpy.py
    ├── reproduce.sh
    └── with_msvc.bat

Engineering Rules

Correctness before speed. Every benchmark and profiler launch is gated by a reference check.
No speedup without profiler context. Headline claims include shape, regime, achieved bandwidth, and roofline placement.
Median + IQR, not best-case. The target is a mobile GPU; thermal and Windows jitter are part of the measurement.
Measured roofline only. The project uses measured bandwidth from this machine, not desktop 4070 numbers.
Small, verified results beat large, speculative ones. The final claim is intentionally regime-aware.

Citation

If you find this useful, cite the repository:

@misc{more2026decoderoofline,
  title  = {Decode Roofline: Kernel-Level Decode Profiling and Fused Dequant-GEMV on an RTX 4070 Laptop GPU},
  author = {More, Rishi},
  year   = {2026},
  url    = {https://github.com/rishi-more-2003/decode-roofline}
}

Built as a from-scratch CUDA/systems research project on consumer mobile silicon.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Decode Roofline

TL;DR

Overview

Related Work

Key Results

Headline Numbers

What Actually Happens

Method

1. Measure the Machine, Not the Datasheet

2. Attribute Decode with Nsight

3. Fuse Dequantization into GEMV

4. Correctness Gates Every Number

Reproduce

Quick Setup

One-Command Reproduction

Useful Individual Commands

Results Guide

Project Structure

Engineering Rules

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
bench		bench
docs		docs
harness		harness
kernels		kernels
profiling		profiling
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
environment.yml		environment.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Decode Roofline

TL;DR

Overview

Related Work

Key Results

Headline Numbers

What Actually Happens

Method

1. Measure the Machine, Not the Datasheet

2. Attribute Decode with Nsight

3. Fuse Dequantization into GEMV

4. Correctness Gates Every Number

Reproduce

Quick Setup

One-Command Reproduction

Useful Individual Commands

Results Guide

Project Structure

Engineering Rules

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages