feat: BF16 GEMM benchmarking support #2525
base: main
Conversation
Signed-off-by: raayandhar <[email protected]>
Signed-off-by: raayandhar <[email protected]>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough: adds BF16 GEMM benchmarking by introducing the `mm_bf16` and `bmm_bf16` routines.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI
    participant Runner as Benchmark Runner
    participant Autotune as Autotuner
    participant Backend as Backend (CUTLASS / TGV / cuDNN)
    participant Ref as Reference (torch)
    participant Reporter as Reporter
    CLI->>Runner: parse args (routine=mm_bf16/bmm_bf16, --bias, --enable_pdl, ...)
    alt autotune enabled
        Runner->>Autotune: run autotune trials
        Autotune->>Backend: execute candidate kernels
        Backend-->>Autotune: trial results
    end
    Runner->>Backend: execute BF16 GEMM kernel
    Backend-->>Runner: results (timing, TFLOPs)
    Runner->>Ref: run reference (torch.mm / torch.bmm float32)
    Ref-->>Runner: correctness outcome
    Runner->>Reporter: emit metrics & CSV row (includes bias column)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed (1 warning)
Summary of Changes

Hello @raayandhar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the benchmarking capabilities by adding comprehensive support for BFloat16 (BF16) General Matrix Multiplication (GEMM) operations, specifically for `mm` and `bmm`.
Code Review
This pull request adds benchmarking support for BF16 GEMM (mm and bmm) operations, including updates to documentation, utility functions, and new test routines. The changes are well-structured and follow the existing patterns in the benchmark suite.
My review includes a few suggestions for improvement:
- Refactoring the backend filtering logic in the new test functions for better readability.
- Addressing significant code duplication between the `testMmBf16` and `testBmmBf16` functions to improve maintainability.
- Enhancing the docstrings for `mm_bf16` to be more explicit about backend limitations, which would improve usability.
Overall, this is a solid contribution that extends the benchmarking capabilities of the project.
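One way the duplication the review points at could be factored out, as a rough sketch; the helper name, signature, and tolerances here are assumptions, not code from the PR:

```python
import torch

def _run_bf16_gemm_case(make_inputs, run_backend, backends, run_refcheck=True):
    """Hypothetical shared driver for the mm_bf16 / bmm_bf16 tests."""
    a, b = make_inputs()
    outputs = {be: run_backend(be, a, b) for be in backends}
    if run_refcheck:
        # torch.matmul covers both the 2-D (mm) and batched (bmm) cases.
        ref = torch.matmul(a.float(), b.float())
        for be, out in outputs.items():
            torch.testing.assert_close(out.float(), ref, rtol=2e-2, atol=2e-2)
    return outputs
```

Each of `testMmBf16` and `testBmmBf16` would then only supply its own `make_inputs` and `run_backend`.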
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Around line 1283-1360: The backends string "auto" is dropped by
filter_backends_by_compute_capability and run_backend only handles explicit
names, making --backends auto unusable; fix by expanding "auto" into the
concrete BF16 candidate list (e.g. ["cudnn","cutlass","tgv"]) right after
reading backends (before calling filter_backends_by_compute_capability) or by
mapping "auto" to the filtered result before run_backend is used so run_backend
sees only explicit backend names; update the code around backends (the variable
backends, the call to filter_backends_by_compute_capability, and run_backend
usage) to perform this expansion/mapping.
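A minimal runnable sketch of the suggested expansion; the stubs, the `BF16_AUTO_BACKENDS` constant, and the helper's argument order are assumptions standing in for the file's real objects:

```python
class _Args:  # stub for the parsed CLI arguments
    backends = ["auto"]
    routine = "mm_bf16"

args, device = _Args(), "cuda:0"

def filter_backends_by_compute_capability(backends, routine, device):
    return backends  # stub; the real helper filters by SM version

BF16_AUTO_BACKENDS = ["cudnn", "cutlass", "tgv"]  # hypothetical candidate list

backends = list(args.backends)
if "auto" in backends:
    # Expand "auto" into explicit names so run_backend never sees it.
    backends = [b for b in backends if b != "auto"] + BF16_AUTO_BACKENDS
    backends = list(dict.fromkeys(backends))  # de-duplicate, keep order
backends = filter_backends_by_compute_capability(backends, args.routine, device)
print(backends)  # ['cudnn', 'cutlass', 'tgv']
```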
bkryu left a comment
Hi @raayandhar, thanks for contributing these changes. The added benchmarks should help compare the performance of the backends in a nice apples-to-apples comparison.
I left a few comments about the benchmarking code.
| "use_128x4_sf_layout", | ||
| "use_nvfp4", | ||
| "bias", | ||
| "pdl", |
cf. `enable_pdl` below in the same file. Can we use that one?
Yes, thanks for that catch. Should the API for `mm_bf16` also use `enable_pdl` instead?
Signed-off-by: raayandhar <[email protected]>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Around line 1363-1419: The reference output is computed once
(reference_output) with bias applied regardless of backend, but run_backend only
sends bias to the "tgv" backend, causing spurious cosine-mismatch; fix by
computing a per-backend reference inside the backend loop (or conditionally
adding bias only when cur_backend == "tgv") before comparing: use the same input
path/logic as run_backend to produce a reference per cur_backend (referencing
variables/function names reference_output, run_backend, outputs, backends,
tested_backends, bias, run_refcheck) so the cosine_similarity compares
like-for-like.
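A sketch of the per-backend reference the comment asks for, reusing its names (`tested_backends`, `outputs`, `bias`, `cur_backend`); the wrapper function and the 0.99 threshold are assumptions:

```python
import torch

def refcheck_per_backend(a, b, bias, tested_backends, outputs):
    """Compare each backend output against a reference that matches its
    own bias handling, instead of one global reference."""
    for cur_backend, out in zip(tested_backends, outputs):
        ref = torch.mm(a.float(), b.float())
        if bias is not None and cur_backend == "tgv":
            # Only the tgv path receives the bias in run_backend.
            ref = ref + bias.float()
        cos = torch.nn.functional.cosine_similarity(
            ref.flatten(), out.float().flatten(), dim=0
        ).item()
        assert cos > 0.99, f"{cur_backend}: cosine similarity {cos:.4f}"
```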
Signed-off-by: raayandhar <[email protected]>
Signed-off-by: raayandhar <[email protected]>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Around line 1436-1468: The code adds bias FLOPs/bytes unconditionally when
use_bias is true, inflating TFLOPs/TB/s for non-TGV backends; modify the bias
accounting so the extra m*n FLOPs and n*bfloat16 bytes are only added when
use_bias is true AND the current backend is the TGV implementation (check
backend or backend_name == "tgv" before adding to problem_flops/problem_bytes).
Update the same conditional logic that computes problem_flops and problem_bytes
(the block that precedes print_perf_metrics and the res append) so reported
metrics and saved cur_res reflect the true work for each backend.
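In sketch form, with the TGV-only condition applied; the variable names follow the comment, the sample problem size is arbitrary, and the formulas are the usual GEMM accounting, stated as assumptions about the file's existing logic:

```python
# Example problem size; use_bias/backend stand in for the loop variables.
m, n, k = 4096, 4096, 4096
use_bias, backend = True, "tgv"
BF16_BYTES = 2  # bytes per bfloat16 element

problem_flops = 2 * m * n * k                         # multiply-adds of the GEMM
problem_bytes = (m * k + k * n + m * n) * BF16_BYTES  # A, B reads + C write
if use_bias and backend == "tgv":
    problem_flops += m * n            # one extra add per output element
    problem_bytes += n * BF16_BYTES   # bias vector read
```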
🧹 Nitpick comments (2)

benchmarks/routines/gemm.py (2)

1517-1521: Nit: uppercase `A`, `B` variable names are inconsistent with the rest of the file. Other test functions (`testMmBf16`, `testBmmFp8`, etc.) use lowercase `a`, `b` or `input`, `mat2`. Consider using lowercase for consistency.

1562-1571: `run_backend` should accept `"tgv"` if it's a valid backend choice for BF16. The CLI allows `tgv` as a backend, and `testMmBf16` supports it, but `testBmmBf16.run_backend` only accepts `cudnn`/`cutlass` and raises `ValueError` for `tgv`. If a user passes `--backends tgv` with `bmm_bf16`, the try/except validation at line 1541 will silently filter it out, which is fine, but it might be clearer to either document that `tgv` is not supported for `bmm_bf16` or add it to the validation message.
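If the second option is taken, the validation message could simply name the limitation; a sketch with the dispatch bodies elided and illustrative error text:

```python
def run_backend(backend, a, b):
    """Sketch of testBmmBf16.run_backend with an explicit message for tgv."""
    if backend == "cudnn":
        ...  # cuDNN bmm path
    elif backend == "cutlass":
        ...  # CUTLASS bmm path
    else:
        raise ValueError(
            f"Backend {backend!r} is not supported for bmm_bf16; "
            "tgv is only available for mm_bf16."
        )
```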
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Line 1283: The code reads use_pdl = getattr(args, "enable_pdl", False) but
parse_gemm_args never defines --enable_pdl, so enable_pdl is always False;
update parse_gemm_args to add a boolean CLI flag (e.g.,
parser.add_argument("--enable_pdl", action="store_true", help="...")) so
args.enable_pdl is set when requested, or alternatively add a clear comment next
to use_pdl documenting that PDL is intentionally unsupported; modify the CLI
parsing function parse_gemm_args to include the new flag so the existing use_pdl
variable is meaningful.
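A minimal argparse sketch of the suggested flag; the help text is an assumption:

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag matching the comment's suggestion.
parser.add_argument(
    "--enable_pdl",
    action="store_true",
    help="Enable programmatic dependent launch (PDL) on supported backends.",
)

args = parser.parse_args(["--enable_pdl"])
use_pdl = getattr(args, "enable_pdl", False)  # now True when the flag is given
```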
Hi @bkryu, thanks for the review. I'd appreciate another look when you get the chance; let me know what you think.
Signed-off-by: raayandhar <[email protected]>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@benchmarks/routines/gemm.py`:
- Around line 1523-1532: The testBmmBf16 function is missing the
compute-capability filtering step used by other tests; update testBmmBf16 to
call filter_backends_by_compute_capability on the backends list (same pattern as
in testMmBf16), passing the current backends and any relevant supported list
(e.g., autotune_supported_backends) or dtype context for bfloat16 so unsupported
GPUs are removed before running; ensure you use the identical helper function
name filter_backends_by_compute_capability and update the local backends
variable used later in the function.
- Around line 1283-1294: The testMmBf16 setup misses filtering backends by
device compute capability; call filter_backends_by_compute_capability(backends,
args.routine, device) after backends is assigned (and after device is
determined) in the testMmBf16 function so the backend list is reduced to only
supported SMs—mirror the pattern used in
testGemmFp8NtGroupwise/testGroupGemmFp8NtGroupwise/testBmmFp8 by invoking
filter_backends_by_compute_capability(backends, args.routine, device) before
iterating or running backends.
Signed-off-by: raayandhar <[email protected]>
Signed-off-by: raayandhar <[email protected]>
aleozlx left a comment
so far so good
Also, per the last PR (#2376) for the cuDNN backend, I did the following, and the log is the following: it doesn't seem very descriptive, so maybe I messed up somewhere with the environment variables? Let me know if any experts have more details. (B300, cuDNN, out_dtype=fp16)

Edit: I opened an issue, NVIDIA/cudnn-frontend#203.
Signed-off-by: raayandhar <[email protected]>
Signed-off-by: raayandhar <[email protected]>
📌 Description
Adds support for benchmarking the various backends for BF16 GEMM for `mm` and `bmm`.

This is 90% done. There are still a few remaining things to do:
- Small cleanups of language
- Investigation of why FP16 output doesn't work for B300
- Small changes from the last PR

Will do these remaining things on Monday.
🔍 Related Issues
#1974
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used my preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

Reviewer Notes