
Add int4 paged KV support to main paths #3049

Draft

lesj0610 wants to merge 6 commits into flashinfer-ai:release-v0.6.7 from lesj0610:codex/int4-paged-kv-main-path

Conversation

@lesj0610

📌 Description

Builds on the int8 paged-KV work in #3048 to add int4 support.

torch.uint8 is already used in some paths as an FP4 container, so a plain uint8 input creates a semantic conflict. An explicit INT4Tensor wrapper is used to keep the contract unambiguous. Storage is packed uint8 with grouped fp16 scales (group_size=32).
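
For reference, a rough sketch of the storage layout described above; the class and function names here are assumptions for illustration, not the actual INT4Tensor API in this PR:

import torch

class Int4TensorSketch:
    # packed: uint8 tensor, two signed 4-bit values per byte (low nibble = even index)
    # scale:  fp16 tensor, one scale per group of group_size elements along the last dim
    def __init__(self, packed: torch.Tensor, scale: torch.Tensor, group_size: int = 32):
        self.packed = packed
        self.scale = scale
        self.group_size = group_size

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    # q holds int4 values in [-8, 7]; adjacent pairs along the last dim share one byte
    nibbles = (q & 0xF).to(torch.uint8)
    return nibbles[..., 0::2] | (nibbles[..., 1::2] << 4)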

The implementation goes through staged dequantization to fp16 before calling the existing kernels; a rough sketch of that path follows the list below. On Hopper, auto backend selection falls back to FA2 in the same way as in #3048. The following are not included in this PR:

  • CUDA graph: explicitly blocked, as the staging step requires temporary allocation
  • Native FA3, XQA, and TRTLLM-gen int4 paths
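
Roughly, that staged path could look like the following; helper and field names are illustrative assumptions (matching the sketch above), not the exact functions added in this PR:

def dequantize_int4_kv(kv):  # kv: an Int4TensorSketch as sketched above
    packed, scale, group_size = kv.packed, kv.scale, kv.group_size
    lo = (packed & 0xF).to(torch.int8)
    hi = (packed >> 4).to(torch.int8)
    # sign-extend the 4-bit two's-complement nibbles back to [-8, 7]
    lo = torch.where(lo > 7, lo - 16, lo)
    hi = torch.where(hi > 7, hi - 16, hi)
    q = torch.stack((lo, hi), dim=-1).flatten(-2)                   # undo the byte packing
    q = q.reshape(*q.shape[:-1], -1, group_size).to(torch.float16)
    x = q * scale                                                   # scale broadcasts per group
    return x.reshape(*packed.shape[:-1], -1)                        # fp16 KV for the existing kernels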

Until #3048 is merged, GitHub will also show the int8 commits in this diff because this branch is stacked on top of that work.

Tested on Ampere (A100) and Hopper (H100):

python -m pytest tests/attention/test_int4_paged_kv.py -v

51 tests passed on both architectures.

🔍 Related Issues

🚀 Pull Request Checklist

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Depends on #3048.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 13, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for INT4 and INT8 paged KV caches across the decode and prefill modules. Key changes include the implementation of int4_quantize and int4_dequantize functions, the addition of an INT4Tensor wrapper class, and the extension of paged KV cache management to handle these quantized formats. The PR also adds int8_t vector type support in the CUDA backend and ensures that quantized KV caches correctly fall back to the fa2 implementation. Review feedback suggests using itemsize for more robust 8-bit type detection when applying output scales and adjusting the INT4 quantization scaling factor to better utilize the available bit range.

Comment thread: flashinfer/prefill.py, lines +1374 to +1381
if out.dtype in (
    torch.int8,
    torch.float8_e4m3fn,
    torch.float8_e5m2,
):
    out = (out.to(float) * scale_v).to(out.dtype)
else:
    out *= scale_v
Contributor


medium

This logic for applying scale_v is slightly different from the one in flashinfer/decode.py:598, which uses if out.itemsize == 1:. The itemsize check is more robust as it covers all 8-bit types (like uint8) without needing to list them explicitly. Using itemsize would also make the implementation consistent across the codebase.

        if out.itemsize == 1:
            out = (out.to(float) * scale_v).to(out.dtype)
        else:
            out *= scale_v

Author


Already updated in the latest branch.

num_groups = hidden_dim // group_size
x_grouped = x_fp32.reshape(*x.shape[:-1], num_groups, group_size)
amax = x_grouped.abs().amax(dim=-1, keepdim=True)
scale = torch.where(amax > 0, amax / 7.0, torch.ones_like(amax))
Contributor


medium

The scaling factor amax / 7.0 does not seem to fully utilize the int4 quantization range of [-8, 7]. With this scaling, the values of x_grouped / scale will be within [-7.0, 7.0]. After torch.round, the quantized values will be in [-7, 7], meaning the value -8 is never used.

To better utilize the available quantization range, consider using 8.0 as the divisor. This would map the input range [-amax, amax] to [-8, 8], and after clamping to [-8, 7], it would make use of the full available range.

Suggested change:
- scale = torch.where(amax > 0, amax / 7.0, torch.ones_like(amax))
+ scale = torch.where(amax > 0, amax / 8.0, torch.ones_like(amax))

Author


Not changing this. amax / 7.0 is intentional for symmetric ±amax reconstruction.
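
A minimal numeric illustration of the trade-off being discussed (a sketch, not code from the PR):

import torch

amax = 1.0
x = torch.tensor([-amax, amax])
s7, s8 = amax / 7.0, amax / 8.0
q7 = torch.clamp(torch.round(x / s7), -8, 7)   # [-7., 7.]
q8 = torch.clamp(torch.round(x / s8), -8, 7)   # [-8., 7.]
x7 = q7 * s7   # ~[-1.0, 1.0]: both endpoints of ±amax reconstructed
x8 = q8 * s8   # ~[-1.0, 0.875]: +amax clips to 7 and comes back at 7/8 of its magnitude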

@yzh119
Collaborator

yzh119 commented Apr 13, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !542 has been created, and the CI pipeline #48432957 is currently running. I'll report back once the pipeline job completes.

@lesj0610 lesj0610 force-pushed the codex/int4-paged-kv-main-path branch from 3ee92f4 to 502f158 on April 14, 2026 at 04:18
@lesj0610 lesj0610 marked this pull request as draft on April 17, 2026 at 07:16
@lesj0610
Author

Keeping this PR as the release-v0.6.7 snapshot and moving active review to #3101 against main.
