
Conversation

@vincentzed (Contributor)

Motivation

flashinfer-ai/flashinfer#1979 enables the auto backend for mm_fp4, so we should use it for cu13 (except when the user explicitly specifies a FlashInfer mm_fp4 backend).
If cuDNN >= 9.15 is available, it should be used, since it is faster.
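
A minimal sketch of the intended resolution logic, reusing the names from the snippet reviewed below (FLASHINFER_FP4_GEMM_BACKEND, _IS_CUDA_13, resolve_mm_fp4_gemm_backend); how these flags are actually populated in the repository may differ:

import os

import torch

# Illustrative stand-ins; the real definitions live elsewhere in the repository.
FLASHINFER_FP4_GEMM_BACKEND = os.environ.get("FLASHINFER_FP4_GEMM_BACKEND", "")
_IS_CUDA_13 = torch.version.cuda is not None and torch.version.cuda.startswith("13")


def resolve_mm_fp4_gemm_backend() -> str:
    # An explicit user choice always wins.
    if FLASHINFER_FP4_GEMM_BACKEND:
        return FLASHINFER_FP4_GEMM_BACKEND
    if _IS_CUDA_13:
        # Let FlashInfer choose: cuDNN when cuDNN >= 9.15, otherwise Cutlass.
        # https://github.com/flashinfer-ai/flashinfer/pull/1979
        return "auto"
    return "cutlass"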

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Signed-off-by: vincentzed <[email protected]>
@github-actions bot added the quant LLM Quantization label on Dec 31, 2025
@gemini-code-assist (Contributor)

Summary of Changes

Hello @vincentzed, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing FP4 GEMM (General Matrix Multiply) operations by integrating the latest features from FlashInfer 0.6.0. The core change introduces an intelligent, automatic backend selection for FP4 GEMM, specifically targeting CUDA 13 environments. This enhancement aims to improve performance and efficiency by leveraging the most suitable underlying library (cuDNN or Cutlass) for matrix multiplications, without requiring manual configuration, thereby streamlining the execution of quantized models.

Highlights

  • FlashInfer Dependency Update: The pull request updates the FlashInfer dependency to version 0.6.0, which introduces enhanced capabilities for automatic backend selection in FP4 matrix multiplication operations.
  • Automatic FP4 GEMM Backend Selection for CUDA 13: A new mechanism has been implemented to automatically resolve the mm_fp4 GEMM backend. For CUDA 13 environments, if the FLASHINFER_FP4_GEMM_BACKEND environment variable is not explicitly set, the backend will default to 'auto', allowing FlashInfer to dynamically choose between cuDNN (if version >= 9.15) and Cutlass for optimal performance. For other CUDA versions, it defaults to 'cutlass'. (A rough sketch of the cuDNN version gate follows this list.)
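
For illustration only, one way such a cuDNN version gate could be checked from Python, assuming PyTorch's integer encoding of the bundled cuDNN version (major*10000 + minor*100 + patch for cuDNN 9); the actual check lives inside FlashInfer and may differ:

import torch


def cudnn_version_at_least(major: int, minor: int) -> bool:
    # torch reports the bundled cuDNN version as an integer, e.g. 91500 for 9.15.0.
    version = torch.backends.cudnn.version()
    if version is None:
        return False
    return version >= major * 10000 + minor * 100


# Under the "auto" backend, cuDNN would be preferred roughly when:
prefer_cudnn = cudnn_version_at_least(9, 15)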


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request updates the mm_fp4 backend selection to use auto for CUDA 13, which is a good improvement that leverages new features in flashinfer. The refactoring to centralize the backend selection logic into a new function, resolve_mm_fp4_gemm_backend, improves code clarity and maintainability. I have one suggestion to further simplify this new function to reduce code duplication. Overall, the changes are well-structured and align with the goals of the pull request.

Comment on lines +171 to +179
if _IS_CUDA_13:
    # auto resolution: if cudnn < 9.15, use cutlass else use cudnn
    # https://github.com/flashinfer-ai/flashinfer/pull/1979
    backend = FLASHINFER_FP4_GEMM_BACKEND if FLASHINFER_FP4_GEMM_BACKEND else "auto"
else:
    backend = (
        FLASHINFER_FP4_GEMM_BACKEND if FLASHINFER_FP4_GEMM_BACKEND else "cutlass"
    )
return backend


Severity: medium

This function can be simplified to avoid repeating the check for FLASHINFER_FP4_GEMM_BACKEND. By checking for the user-specified backend first, you can reduce code duplication and make the logic more straightforward.

Suggested change

Replace:

if _IS_CUDA_13:
    # auto resolution: if cudnn < 9.15, use cutlass else use cudnn
    # https://github.com/flashinfer-ai/flashinfer/pull/1979
    backend = FLASHINFER_FP4_GEMM_BACKEND if FLASHINFER_FP4_GEMM_BACKEND else "auto"
else:
    backend = (
        FLASHINFER_FP4_GEMM_BACKEND if FLASHINFER_FP4_GEMM_BACKEND else "cutlass"
    )
return backend

With:

if FLASHINFER_FP4_GEMM_BACKEND:
    return FLASHINFER_FP4_GEMM_BACKEND
if _IS_CUDA_13:
    # auto resolution: if cudnn < 9.15, use cutlass else use cudnn
    # https://github.com/flashinfer-ai/flashinfer/pull/1979
    return "auto"
return "cutlass"
