
Conversation

Contributor

@raayandhar raayandhar commented Feb 9, 2026

📌 Description

Adds support for benchmarking the various backends for BF16 GEMM for mm and bmm

This is 90% done. There are still a few remaining things to do:

  • Test the benchmarking very thoroughly (sizes, backends, etc.; only basic testing so far)
  • Based on these results, update what the default backend should be (what's the best way to do this / how should I be determining it?)
  • Test on B200 (only testing done on B300 at the moment)
  • Small cleanups of language
  • Investigation of why FP16 output doesn't work on B300
  • Small changes from the last PR

Will do these remaining items on Monday.

🔍 Related Issues

#1974

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features
    • Added BF16 matrix-multiplication routines and TGV backend support
    • New CLI flags for BF16 runs, including --bias and PDL enablement
  • Documentation
    • Updated docs, legends and docstrings to include BF16 details and TGV
    • Autotune/help text updated to cover BF16 routines
  • Tests
    • Added BF16 benchmarks, reporting metrics, and a new "bias" output column

Signed-off-by: raayandhar <[email protected]>
Signed-off-by: raayandhar <[email protected]>
Contributor

coderabbitai bot commented Feb 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds BF16 GEMM benchmarking: introduces mm_bf16 and bmm_bf16 benchmarks, new --bias and --enable_pdl CLI flags, tgv backend support in CLI/docs, a new bias output column in benchmark outputs, and docstring clarifications for mm_bf16. No public API signatures changed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Benchmark docs & config: `benchmarks/README.md`, `benchmarks/routines/flashinfer_benchmark_utils.py` | Add BF16 routine entries (mm_bf16, bmm_bf16), add bias output column, extend backend legend to include tgv, and update backend-support notes and benchmark API lists. |
| BF16 GEMM benchmark tests & CLI: `benchmarks/routines/gemm.py` | Add testMmBf16 and testBmmBf16, wire mm_bf16/bmm_bf16 into run_gemm_test, extend CLI parser to include tgv, --bias, and --enable_pdl, and update autotune descriptions. |
| Public API docs: `flashinfer/gemm/gemm_base.py` | Reword mm_bf16 docstrings to indicate bias/PDL enabled for relevant backends; no behavioral changes. |

Sequence Diagram(s)

sequenceDiagram
  participant CLI as CLI
  participant Runner as Benchmark Runner
  participant Autotune as Autotuner
  participant Backend as Backend (CUTLASS / TGV / cuDNN)
  participant Ref as Reference (torch)
  participant Reporter as Reporter

  CLI->>Runner: parse args (routine=mm_bf16/bmm_bf16, --bias, --enable_pdl,...)
  alt autotune enabled
    Runner->>Autotune: run autotune trials
    Autotune->>Backend: execute candidate kernels
    Backend-->>Autotune: trial results
  end
  Runner->>Backend: execute BF16 GEMM kernel
  Backend-->>Runner: results (timing, TFLOPs)
  Runner->>Ref: run reference (torch.mm / torch.bmm float32)
  Ref-->>Runner: correctness outcome
  Runner->>Reporter: emit metrics & CSV row (includes bias column)
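
In code, the flow above roughly corresponds to a loop like the following. This is a condensed sketch only, assuming the `mm_bf16` signature shown in the cuDNN repro later in this thread; it is not the benchmark implementation from this PR:

```python
import torch
import flashinfer

def bench_mm_bf16_sketch(m, n, k, backends=("cudnn", "cutlass", "tgv")):
    """Condensed sketch of the diagram: per-backend BF16 GEMM plus a float32 torch reference."""
    a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(n, k, device="cuda", dtype=torch.bfloat16).transpose(-2, -1)
    ref = torch.mm(a.float(), b.float())  # reference (torch) path from the diagram
    rows = []
    for backend in backends:
        out = flashinfer.mm_bf16(a, b, out_dtype=torch.bfloat16, backend=backend)
        cos = torch.nn.functional.cosine_similarity(
            out.flatten().float(), ref.flatten(), dim=0
        )
        rows.append({"backend": backend, "cosine_similarity": float(cos)})  # metrics row
    return rows
```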

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested labels

op: gemm

Suggested reviewers

  • nvmbreughe
  • yongwww
  • cyx-6
  • jiahanc
  • jimmyzho
  • Anerudhan
  • bkryu

Poem

🐰 I hopped through tensors, light and spry,
BF16 whispers as kernels fly,
mm and bmm now join the race,
TGV hums with nimble grace,
Benchmarks blink — carrots for speed ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: BF16 GEMM benchmarking support' directly and clearly describes the main change: adding benchmarking support for BF16 GEMM operations. |
| Description check | ✅ Passed | The PR description covers the main objective (BF16 GEMM benchmarking support), lists related issues, and completes the template checklist items. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@gemini-code-assist
Contributor

Summary of Changes

Hello @raayandhar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the benchmarking capabilities by adding comprehensive support for BFloat16 (BF16) General Matrix Multiplication (GEMM) operations, specifically for mm (matrix multiplication) and bmm (batched matrix multiplication). This allows developers to rigorously test and compare the performance of different backends, including cuDNN, CUTLASS, and the newly integrated TGV backend, for BF16 computations on modern hardware. The changes include new command-line arguments for fine-grained control over benchmarking, such as enabling bias and programmatic dependent launch (PDL) for the TGV backend, and extending autotuning to these new routines.

Highlights

  • BF16 GEMM Benchmarking: Introduced new benchmarking routines (mm_bf16 and bmm_bf16) to evaluate the performance of BFloat16 General Matrix Multiplication operations, specifically targeting Blackwell SM10.0+ architectures.
  • TGV Backend Integration: Added support for the TGV backend for mm_bf16, including specific arguments like --bias and --pdl to enable TGV-specific features during benchmarking.
  • Extended Autotuning Support: The --autotune flag now supports mm_bf16 and bmm_bf16 routines, allowing for performance optimization and selection of the best tactic across various available backends.
  • Documentation and Argument Updates: The benchmarks/README.md has been updated to reflect the new BF16 routines, their supported backends (cuDNN, CUTLASS, TGV), and new command-line arguments, along with clarifying mm_bf16 parameter documentation.


Changelog
  • benchmarks/README.md
    • Added mm_bf16 and bmm_bf16 to the list of supported GEMM routines for Blackwell SM10.0+.
    • Updated the --autotune flag description to include mm_bf16 and bmm_bf16 routines.
    • Introduced new command-line flags --bias and --pdl for mm_bf16 (TGV backend specific).
    • Added tgv to the backend legend and updated the backend support matrix for BF16 GEMM routines.
  • benchmarks/routines/flashinfer_benchmark_utils.py
    • Extended GEMM_ARGS_TO_PARSE to include bias and pdl arguments.
    • Added mm_bf16 and bmm_bf16 to the list of GEMM_ROUTINES.
    • Updated ROUTINE_SUPPORTED_BACKENDS to specify cudnn, cutlass, and tgv support for mm_bf16 and bmm_bf16 on SM10.0+.
  • benchmarks/routines/gemm.py
    • Integrated testMmBf16 and testBmmBf16 functions to handle the new BF16 GEMM routines.
    • Added tgv as a valid choice for the --backends argument.
    • Modified the --autotune help message to reflect support for BF16 GEMM routines.
    • Implemented --bias and --pdl command-line arguments for mm_bf16 (TGV backend).
    • Added testMmBf16 function, including logic for input tensor generation, backend execution (cuDNN, CUTLASS, TGV), reference checks, autotuning, and TGV-specific feature handling (bias, PDL, BF16 output constraint).
    • Added testBmmBf16 function, providing similar benchmarking capabilities for batched BF16 matrix multiplication.
  • flashinfer/gemm/gemm_base.py
    • Refined documentation for mm_bf16 parameters (bias, pdl, out, out_dtype) to clearly state which backends enable these features.
    • Updated backend descriptions for mm_bf16 to reflect TGV's support for bias and PDL.
Activity
  • The author, raayandhar, initiated this pull request to introduce BF16 GEMM benchmarking.
  • Pre-commit checks have been successfully completed, ensuring code quality and consistency.
  • All existing tests have passed, and new tests have been added or updated as needed to cover the new functionalities.
  • The author has identified and listed remaining tasks, including thorough testing across various sizes and backends, determining the optimal default backend, testing on B200 hardware, and minor code cleanups.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds benchmarking support for BF16 GEMM (mm and bmm) operations, including updates to documentation, utility functions, and new test routines. The changes are well-structured and follow the existing patterns in the benchmark suite.

My review includes a few suggestions for improvement:

  • Refactoring the backend filtering logic in the new test functions for better readability.
  • Addressing significant code duplication between the testMmBf16 and testBmmBf16 functions to improve maintainability.
  • Enhancing the docstrings for mm_bf16 to be more explicit about backend limitations, which would improve usability.

Overall, this is a solid contribution that extends the benchmarking capabilities of the project.
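
For the duplication point in particular, one possible shape is a shared helper parameterized on the batched case. This is a rough sketch only; the helper name and the `args` fields it reads are hypothetical, not code from this PR:

```python
import torch

def _run_bf16_gemm_case(args, batched: bool):
    """Hypothetical shared core for testMmBf16 / testBmmBf16 (sketch only)."""
    shape_a = (args.batch_size, args.m, args.k) if batched else (args.m, args.k)
    shape_b = (args.batch_size, args.k, args.n) if batched else (args.k, args.n)
    a = torch.randn(*shape_a, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(*shape_b, device="cuda", dtype=torch.bfloat16)
    ref_op = torch.bmm if batched else torch.mm
    reference_output = ref_op(a.float(), b.float())
    # ...shared backend loop, refcheck, and metric reporting would go here...
    return a, b, reference_output

def testMmBf16(args):
    return _run_bf16_gemm_case(args, batched=False)

def testBmmBf16(args):
    return _run_bf16_gemm_case(args, batched=True)
```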


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Around line 1283-1360: The backends string "auto" is dropped by
filter_backends_by_compute_capability and run_backend only handles explicit
names, making --backends auto unusable; fix by expanding "auto" into the
concrete BF16 candidate list (e.g. ["cudnn","cutlass","tgv"]) right after
reading backends (before calling filter_backends_by_compute_capability) or by
mapping "auto" to the filtered result before run_backend is used so run_backend
sees only explicit backend names; update the code around backends (the variable
backends, the call to filter_backends_by_compute_capability, and run_backend
usage) to perform this expansion/mapping.
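
A minimal sketch of the first option (expanding "auto" before the capability filter); the usage lines in the comments reuse the variable names from the comment above:

```python
def expand_auto_backends(requested_backends):
    """Sketch: map "auto" to the concrete BF16 candidate list so run_backend
    only ever sees explicit backend names."""
    bf16_candidates = ["cudnn", "cutlass", "tgv"]
    return bf16_candidates if "auto" in requested_backends else list(requested_backends)

# Usage (names from the comment above):
# backends = expand_auto_backends(args.backends)
# backends = filter_backends_by_compute_capability(backends, args.routine, device)
```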


@bkryu bkryu left a comment


Hi @raayandhar, thanks for contributing these changes. The added benchmarks should help compare the performances of backends in a nice apples-to-apples comparison.

I left a few comments about the benchmarking code.

"use_128x4_sf_layout",
"use_nvfp4",
"bias",
"pdl",
Collaborator


cf. enable_pdl below in the same file. Can we use that one?

Contributor Author


Yes, thanks for that catch. Should the API for mm_bf16 also use enable_pdl instead?

Signed-off-by: raayandhar <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Around line 1363-1419: The reference output is computed once
(reference_output) with bias applied regardless of backend, but run_backend only
sends bias to the "tgv" backend, causing spurious cosine-mismatch; fix by
computing a per-backend reference inside the backend loop (or conditionally
adding bias only when cur_backend == "tgv") before comparing: use the same input
path/logic as run_backend to produce a reference per cur_backend (referencing
variables/function names reference_output, run_backend, outputs, backends,
tested_backends, bias, run_refcheck) so the cosine_similarity compares
like-for-like.
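
A sketch of the conditional-bias variant; the input tensor names `a` and `b` are assumed, and only `bias`, `cur_backend`, and the TGV-only bias behavior come from the comment above:

```python
import torch

def reference_for_backend(a, b, bias, cur_backend):
    """Sketch: build a like-for-like float32 reference per backend; only the
    TGV path applies bias in the benchmarked kernel, so the reference adds
    bias only when comparing against "tgv"."""
    ref = torch.mm(a.float(), b.float())
    if bias is not None and cur_backend == "tgv":
        ref = ref + bias.float()
    return ref
```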

Signed-off-by: raayandhar <[email protected]>
Signed-off-by: raayandhar <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Around line 1436-1468: The code adds bias FLOPs/bytes unconditionally when
use_bias is true, inflating TFLOPs/TB/s for non-TGV backends; modify the bias
accounting so the extra m*n FLOPs and n*bfloat16 bytes are only added when
use_bias is true AND the current backend is the TGV implementation (check
backend or backend_name == "tgv" before adding to problem_flops/problem_bytes).
Update the same conditional logic that computes problem_flops and problem_bytes
(the block that precedes print_perf_metrics and the res append) so reported
metrics and saved cur_res reflect the true work for each backend.
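
A sketch of the conditional accounting, with hypothetical names (`m`, `n`, `k`, `use_bias`, `backend`); the byte counts assume BF16 (2 bytes per element) operands and output:

```python
def bf16_mm_work(m, n, k, use_bias, backend):
    """Sketch: count FLOPs/bytes, charging the bias term only to the TGV path."""
    problem_flops = 2 * m * n * k
    problem_bytes = 2 * (m * k + k * n + m * n)  # A, B, and output in bf16
    if use_bias and backend == "tgv":
        problem_flops += m * n   # bias add
        problem_bytes += 2 * n   # bf16 bias vector read
    return problem_flops, problem_bytes
```
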
🧹 Nitpick comments (2)
benchmarks/routines/gemm.py (2)

1517-1521: Nit: uppercase A, B variable names are inconsistent with the rest of the file.

Other test functions (testMmBf16, testBmmFp8, etc.) use lowercase a, b or input, mat2. Consider using lowercase for consistency.


1562-1571: run_backend should accept "tgv" if it's a valid backend choice for BF16.

The CLI allows tgv as a backend, and testMmBf16 supports it, but testBmmBf16.run_backend only accepts cudnn/cutlass and raises ValueError for tgv. If a user passes --backends tgv with bmm_bf16, the try/except validation at line 1541 will silently filter it out, which is fine — but it might be clearer to either document that tgv is not supported for bmm_bf16 or add it to the validation message.
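
If tgv stays unsupported for bmm_bf16, a sketch of a more explicit rejection in run_backend (names follow the nit above; this is not the PR's code):

```python
def run_backend(cur_backend, a, b):
    """Sketch: fail loudly on unsupported backends instead of relying on silent filtering."""
    if cur_backend == "tgv":
        raise ValueError("tgv backend is not supported for bmm_bf16; use cudnn or cutlass")
    if cur_backend not in ("cudnn", "cutlass"):
        raise ValueError(f"Unsupported backend for bmm_bf16: {cur_backend}")
    # ...dispatch to the cudnn / cutlass bmm_bf16 path would go here...
```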


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Line 1283: The code reads use_pdl = getattr(args, "enable_pdl", False) but
parse_gemm_args never defines --enable_pdl, so enable_pdl is always False;
update parse_gemm_args to add a boolean CLI flag (e.g.,
parser.add_argument("--enable_pdl", action="store_true", help="...")) so
args.enable_pdl is set when requested, or alternatively add a clear comment next
to use_pdl documenting that PDL is intentionally unsupported; modify the CLI
parsing function parse_gemm_args to include the new flag so the existing use_pdl
variable is meaningful.
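
A minimal sketch of the first option (registering the flag), assuming PDL expands to programmatic dependent launch; the help text is illustrative:

```python
import argparse

def add_enable_pdl_flag(parser: argparse.ArgumentParser) -> None:
    """Sketch: make args.enable_pdl meaningful by actually defining the flag."""
    parser.add_argument(
        "--enable_pdl",
        action="store_true",
        help="Enable programmatic dependent launch (PDL) for backends that support it.",
    )
```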

@raayandhar
Contributor Author

> Hi @raayandhar, thanks for contributing these changes. The added benchmarks should help compare the performances of backends in a nice apples-to-apples comparison.
>
> I left a few comments about the benchmarking code.

Hi @bkryu, thanks for the review - I'd appreciate another look when you get the chance; let me know what you think.
I was wondering how I should use the benchmark results to help determine the default backends for the mm_bf16 and bmm_bf16 APIs. Also, what kind of testing would you like to see for this PR?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Around line 1523-1532: The testBmmBf16 function is missing the
compute-capability filtering step used by other tests; update testBmmBf16 to
call filter_backends_by_compute_capability on the backends list (same pattern as
in testMmBf16), passing the current backends and any relevant supported list
(e.g., autotune_supported_backends) or dtype context for bfloat16 so unsupported
GPUs are removed before running; ensure you use the identical helper function
name filter_backends_by_compute_capability and update the local backends
variable used later in the function.
- Around line 1283-1294: The testMmBf16 setup misses filtering backends by
device compute capability; call filter_backends_by_compute_capability(backends,
args.routine, device) after backends is assigned (and after device is
determined) in the testMmBf16 function so the backend list is reduced to only
supported SMs—mirror the pattern used in
testGemmFp8NtGroupwise/testGroupGemmFp8NtGroupwise/testBmmFp8 by invoking
filter_backends_by_compute_capability(backends, args.routine, device) before
iterating or running backends.
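
Both items reduce to the same one-line call, reusing the local names quoted in the comments above (so this fragment is context-dependent rather than standalone):

```python
# Sketch: same pattern as testGemmFp8NtGroupwise / testBmmFp8 — drop backends the
# current GPU cannot run before any backend loop in testMmBf16 / testBmmBf16.
backends = filter_backends_by_compute_capability(backends, args.routine, device)
```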


@aleozlx aleozlx left a comment


so far so good

Contributor Author

raayandhar commented Feb 10, 2026

Also, per the last PR (#2376), I did the following for the cuDNN backend:

(flashinfer) root@a0f94960d09c:/sgl-workspace/sglang/flashinfer/benchmarks# touch cudnn.log
(flashinfer) root@a0f94960d09c:/sgl-workspace/sglang/flashinfer/benchmarks# export CUDNN_LOGLEVEL_DBG=2 CUDNN_LOGDEST_DBG=cudnn.log
(flashinfer) root@a0f94960d09c:/sgl-workspace/sglang/flashinfer/benchmarks# cd ..
(flashinfer) root@a0f94960d09c:/sgl-workspace/sglang/flashinfer# python3
Python 3.12.3 (main, Jan  8 2026, 11:30:50) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import flashinfer
>>> a = torch.randn([48, 64], device="cuda", dtype=torch.bfloat16)
>>> b = torch.randn([80, 64], device="cuda", dtype=torch.bfloat16).transpose(-2, -1)
>>> out = flashinfer.mm_bf16(a, b, out_dtype=torch.float16, backend="cudnn")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/sgl-workspace/sglang/flashinfer/flashinfer/utils.py", line 1176, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/flashinfer/flashinfer/gemm/gemm_base.py", line 419, in mm_bf16
    bf16_gemm_sm100(a, b, bias, pdl, out, workspace_buffer, backends)
  File "/sgl-workspace/sglang/flashinfer/flashinfer/gemm/gemm_base.py", line 908, in bf16_gemm_sm100
    runner(inputs=inputs, tactic=tactic)
  File "/sgl-workspace/sglang/flashinfer/flashinfer/autotuner.py", line 217, in __call__
    return self.forward(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/flashinfer/flashinfer/gemm/gemm_base.py", line 2247, in forward
    _cudnn_gemm_bf16(workspace_buffer, a, b, out, tactic=tactic)
  File "/sgl-workspace/sglang/flashinfer/flashinfer/gemm/gemm_base.py", line 2202, in _cudnn_gemm_bf16
    graph = build_cudnn_gemm_bf16_graph(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/flashinfer/flashinfer/gemm/gemm_base.py", line 2161, in build_cudnn_gemm_bf16_graph
    graph.build_plans()
cudnn._compiled_module.cudnnGraphNotSupportedError: [cudnn_frontend] Error: No valid execution plans built.
>>> exit()

and the log is the following:
https://gist.github.com/raayandhar/8e5495df8fcc392219bac29646be4eff

It doesn't seem very descriptive, so maybe I messed up somewhere with the environment variables?

Let me know if any experts have more details. (B300, cuDNN, out_dtype=fp16)

Edit: I opened an issue NVIDIA/cudnn-frontend#203

@raayandhar raayandhar requested a review from bkryu February 10, 2026 22:59