[LoRA] Set default MXFP4 LoRA backend to Marlin #30598
Conversation
Signed-off-by: Xin Yang <[email protected]>
Code Review
This pull request changes the default MXFP4 backend for LoRA from Triton to Marlin, citing better performance and accuracy. The logic is updated to select Marlin by default, and only fall back to Triton if VLLM_MXFP4_USE_MARLIN is explicitly set to 0 and Triton kernels are supported. The change is correct, well-scoped to LoRA as per the PR title, and aligns with the stated purpose. The implementation is clear and concise.
yewentao256
left a comment
Please add metrics to the PR description to show the accuracy / performance issue you mentioned: `lm_eval` for accuracy, `vllm bench ...` for performance.
jeejeelee
left a comment
After addressing @yewentao256's comment, LGTM
@yewentao256 Thanks for the review!
Let's fix the issue first, then. Do you have time to take a deep look into this issue?
Purpose
This PR sets the default MXFP4 LoRA backend to Marlin, because Triton has accuracy issues and Marlin has slightly better performance.

Triton is used only when `VLLM_MXFP4_USE_MARLIN=0` is set explicitly and `triton_kernels` is supported. Marlin is used in all other cases:
- `VLLM_MXFP4_USE_MARLIN` is not set
- `VLLM_MXFP4_USE_MARLIN=1`
- `triton_kernels` is not supported

Benchmarking
Marlin:
Triton:
Marlin is slightly better because the MXFP4 Triton LoRA path is implemented in `UnfusedOAITritonExperts`. It has to unfuse the activation and reduction in order to inject LoRA modules, so it loses `triton_kernels`'s optimizations of fused activation and fused `moe_sum`.
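The backend-selection rule described in this PR can be sketched as follows. This is a minimal illustration, not vLLM's actual code: the function name and the `triton_kernels_supported` parameter are assumptions, and only the `VLLM_MXFP4_USE_MARLIN` environment variable comes from the source.

```python
import os

def choose_mxfp4_lora_backend(triton_kernels_supported: bool) -> str:
    """Hypothetical helper mirroring the selection rule in this PR.

    Marlin is the default; Triton is chosen only when the user
    explicitly opts out of Marlin AND triton_kernels is available.
    """
    flag = os.environ.get("VLLM_MXFP4_USE_MARLIN")  # None if unset
    if flag == "0" and triton_kernels_supported:
        return "triton"
    # Covers: flag unset, flag == "1", or triton_kernels unsupported.
    return "marlin"
```

Note that the explicit opt-out (`VLLM_MXFP4_USE_MARLIN=0`) alone is not enough to get Triton; `triton_kernels` must also be supported, otherwise the code still falls back to Marlin.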