
Add int8 paged KV support to main paths#3048

Draft
lesj0610 wants to merge 4 commits into flashinfer-ai:release-v0.6.7 from lesj0610:codex/int8-paged-kv-main-path

Conversation

@lesj0610

📌 Description

The main paged-KV path had no int8 support. This PR extends the following to accept int8 KV cache:

  • append
  • single decode, single prefill
  • batch decode, batch prefill

On Hopper, auto backend selection routes to FA2 when FA3 int8 KV is unavailable, so no combination falls through to an unsupported path.
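The fallback described above can be sketched as follows. This is an illustrative model only, not FlashInfer's actual dispatch code; the function name `select_backend` and the dtype set are assumptions made for this sketch.

```python
# Hypothetical sketch of the auto backend selection described above.
# The real FlashInfer routing logic differs; names here are illustrative.
def select_backend(arch: str, kv_dtype: str) -> str:
    # FA3 is only considered on Hopper (sm90), and only for KV dtypes
    # it supports; int8 KV is not among them in this sketch.
    fa3_supported_kv = {"float16", "bfloat16", "float8_e4m3fn"}
    if arch == "sm90" and kv_dtype in fa3_supported_kv:
        return "fa3"
    # Everything else, including int8 KV on Hopper, routes to FA2,
    # so no combination falls through to an unsupported path.
    return "fa2"

print(select_backend("sm90", "int8"))     # int8 KV on Hopper -> fa2
print(select_backend("sm90", "float16"))  # fp16 KV on Hopper -> fa3
print(select_backend("sm80", "float16"))  # Ampere -> fa2
```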

Tested on Ampere (A100) and Hopper (H100):

python -m pytest tests/attention/test_int8_paged_kv.py -v

7 tests passed on both architectures. The int4 part is in a separate follow-up PR.
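For readers unfamiliar with int8 KV caching, the basic idea is symmetric per-tensor quantization: the cache stores int8 values plus a float scale, and attention dequantizes on load. The sketch below is a minimal NumPy illustration of that scheme; the helper names and the per-tensor granularity are assumptions for this example, not the PR's actual kernel code.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor scale: map the max magnitude to 127.
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
k = rng.standard_normal((16, 8), dtype=np.float32)  # toy K-cache page
k_q, k_scale = quantize_int8(k)
k_hat = dequantize_int8(k_q, k_scale)

# Round-to-nearest keeps the error within half a quantization step.
print(k_q.dtype, float(np.abs(k - k_hat).max()) <= k_scale)
```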

🔍 Related Issues

🚀 Pull Request Checklist

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

@coderabbitai
Contributor

coderabbitai Bot commented Apr 13, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 731b3528-e57c-4240-8ccf-93ae7164e8a4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for int8 quantized KV cache across various attention kernels. Key changes include the addition of int8_t vector type specializations in the CUDA backend, updated dispatch macros for FFI, and logic in the Python layer to handle scaling factors and backend selection (ensuring int8 KV falls back to the fa2 backend). New tests have been added to verify the correctness of int8 paged KV operations and scaling factor application. One review comment suggests improving the reliability of data type detection in the prefill logic by checking the tensor dtype directly instead of relying on itemsize.

Comment thread flashinfer/prefill.py Outdated
Comment on lines +1344 to +1347
if out.itemsize == 1:
    out = (out.to(float) * scale_v).to(out.dtype)
else:
    out *= scale_v
Contributor


medium

The condition if out.itemsize == 1 is used to check for FP8/INT8 output types, but itemsize is not a reliable way to distinguish between data types in all cases. It is better to check out.dtype directly against torch.float8_e4m3fn or torch.int8 to avoid potential ambiguity.
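The ambiguity the reviewer points out can be demonstrated directly. The snippet below uses NumPy as a stand-in (the PR's code uses torch dtypes such as torch.int8 and torch.float8_e4m3fn): several 1-byte dtypes share itemsize == 1, so only a dtype comparison distinguishes them.

```python
import numpy as np

# Three distinct 1-byte dtypes: itemsize alone cannot tell them apart.
out_i8 = np.zeros(4, dtype=np.int8)
out_u8 = np.zeros(4, dtype=np.uint8)
out_b = np.zeros(4, dtype=np.bool_)

print([a.itemsize for a in (out_i8, out_u8, out_b)])  # [1, 1, 1]

# A direct dtype check is unambiguous.
print(out_i8.dtype == np.int8)  # True
print(out_u8.dtype == np.int8)  # False
```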

Author


ok. fixed!

@yzh119
Collaborator

yzh119 commented Apr 13, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !543 has been created, and the CI pipeline #48432967 is currently running. I'll report back once the pipeline job completes.

@lesj0610 force-pushed the codex/int8-paged-kv-main-path branch from 1aba362 to 45110a6 on April 14, 2026 04:18
@lesj0610 lesj0610 marked this pull request as draft April 17, 2026 07:16
@lesj0610
Author

Keeping this PR as the release-v0.6.7 snapshot and moving active review to #3100 against main.
