
Conversation

@xyang16 (Contributor) commented Dec 13, 2025

Purpose

This PR sets the default MXFP4 LoRA backend to Marlin, because the Triton path has accuracy issues and Marlin performs slightly better. The resulting selection logic (sketched in code after the list below) is:

  • Use Triton only if Marlin is explicitly disabled (VLLM_MXFP4_USE_MARLIN=0) and triton_kernels is supported.
  • Use Marlin by default:
    • if VLLM_MXFP4_USE_MARLIN is not set
    • if VLLM_MXFP4_USE_MARLIN=1
    • if triton_kernels is not supported
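
To make the fallback order concrete, here is a minimal sketch of the selection rules above. The helper name and its argument are hypothetical; the real check lives in vLLM's MXFP4 quantization/LoRA code and may differ in detail.

import os

def use_marlin_for_mxfp4_lora(triton_kernels_supported: bool) -> bool:
    """Hypothetical helper: decide whether Marlin handles MXFP4 LoRA."""
    flag = os.environ.get("VLLM_MXFP4_USE_MARLIN")  # None, "0", or "1"
    if flag == "0" and triton_kernels_supported:
        # Triton is used only when Marlin is explicitly disabled and
        # triton_kernels is available.
        return False
    # Flag unset, flag == "1", or triton_kernels unsupported: use Marlin.
    return True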

Benchmarking

Marlin:

VLLM_MXFP4_USE_MARLIN=1 vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --max-num-seqs 16 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE", "compile_sizes": [1, 2, 4, 8, 16]}' \
  --enable-lora \
  --max-loras 1 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora-gsm8k \
  --max-lora-rank 32 \
  --no-enable-prefix-caching
vllm bench serve \
  --model openai/gpt-oss-20b \
  --lora-modules lora1 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --sharegpt-output-len 800 \
  --max-concurrency 16 \
  --num-prompts 1000 \
  --num-warmups 60 \
  --ignore-eos
============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  418.75    
Total input tokens:                      226792    
Total generated tokens:                  800000    
Request throughput (req/s):              2.39      
Output token throughput (tok/s):         1910.47   
Peak output token throughput (tok/s):    2053.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2452.06   
---------------Time to First Token----------------
Mean TTFT (ms):                          70.91     
Median TTFT (ms):                        60.91     
P99 TTFT (ms):                           204.49    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.24      
Median TPOT (ms):                        8.25      
P99 TPOT (ms):                           8.50      
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.24      
Median ITL (ms):                         8.11      
P99 ITL (ms):                            9.11      
==================================================

Triton:

vllm serve openai/gpt-oss-20b \
  --tensor-parallel-size 1 \
  --max-num-seqs 16 \
  --compilation-config '{"cudagraph_mode": "PIECEWISE", "compile_sizes": [1, 2, 4, 8, 16]}' \
  --enable-lora \
  --max-loras 1 \
  --lora-modules lora1=/opt/dlami/nvme/models/gpt-oss-20b-lora-gsm8k \
  --max-lora-rank 32 \
  --no-enable-prefix-caching
vllm bench serve \
  --model openai/gpt-oss-20b \
  --lora-modules lora1 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --sharegpt-output-len 800 \
  --max-concurrency 16 \
  --num-prompts 1000 \
  --num-warmups 60 \
  --ignore-eos
============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  439.55    
Total input tokens:                      226792    
Total generated tokens:                  800000    
Request throughput (req/s):              2.28      
Output token throughput (tok/s):         1820.06   
Peak output token throughput (tok/s):    1968.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2336.02   
---------------Time to First Token----------------
Mean TTFT (ms):                          75.77     
Median TTFT (ms):                        49.08     
P99 TTFT (ms):                           214.07    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.65      
Median TPOT (ms):                        8.66      
P99 TPOT (ms):                           8.94      
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.65      
Median ITL (ms):                         8.50      
P99 ITL (ms):                            9.49      
==================================================

Marlin is slightly better because the MXFP4 Triton LoRA path is implemented in UnfusedOAITritonExperts: the activation and the reduction have to be unfused so that LoRA modules can be injected between them, which means the path loses triton_kernels' fused-activation and fused moe_sum optimizations.
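
As an illustration only (toy shapes, plain torch ops, not vLLM's actual kernels or the real UnfusedOAITritonExperts code), the unfused path roughly looks like this: the LoRA deltas are added between the two expert GEMMs, so the activation and the final reduction can no longer be fused into them.

import torch
import torch.nn.functional as F

def lora_delta(x, a, b):
    # Low-rank update: (x @ A) @ B
    return (x @ a) @ b

def unfused_expert_with_lora(x, w1, w2, a1, b1, a2, b2):
    h = x @ w1 + lora_delta(x, a1, b1)   # LoRA injected after the first GEMM
    h = F.silu(h)                        # activation runs as a separate step
    y = h @ w2 + lora_delta(h, a2, b2)   # LoRA injected after the second GEMM
    return y                             # per-expert outputs are reduced later (moe_sum)

# Toy shapes just to show the call pattern.
x = torch.randn(4, 64)
w1, w2 = torch.randn(64, 128), torch.randn(128, 64)
a1, b1 = torch.randn(64, 8), torch.randn(8, 128)
a2, b2 = torch.randn(128, 8), torch.randn(8, 64)
print(unfused_expert_with_lora(x, w1, w2, a1, b1, a2, b2).shape)  # torch.Size([4, 64])

In the fused triton_kernels path, by contrast, the activation and the moe_sum reduction are folded into the expert GEMM kernels, which is where the small throughput gap above comes from.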




@gemini-code-assist (bot) left a comment


Code Review

This pull request changes the default MXFP4 backend for LoRA from Triton to Marlin, citing better performance and accuracy. The logic is updated to select Marlin by default, and only fall back to Triton if VLLM_MXFP4_USE_MARLIN is explicitly set to 0 and Triton kernels are supported. The change is correct, well-scoped to LoRA as per the PR title, and aligns with the stated purpose. The implementation is clear and concise.

@yewentao256 (Member) left a comment


Please add metrics in the PR description to show the acc / perf issue you mentioned.

  1. lm_eval for acc
  2. vllm bench... for performance

@jeejeelee (Collaborator) left a comment


After addressing @yewentao256's comment, LGTM

@xyang16 (Contributor, Author) commented Dec 15, 2025

Please add metrics in the PR description to show the acc / perf issue you mentioned.

  1. lm_eval for acc
  2. vllm bench... for performance

@yewentao256 Thanks for review!

  1. Accuracy: Currently gpt-oss mxfp4 with LoRA + triton kernel is generating garbage output, see [Bug]: FULL_AND_PIECEWISE cudagraph mode leading to !!! in generated text #29539 (comment)
  2. Performance: I have pasted vllm bench numbers in the description.

@yewentao256 (Member) commented

Please add metrics in the PR description to show the acc / perf issue you mentioned.

  1. lm_eval for acc
  2. vllm bench... for performance

@yewentao256 Thanks for review!

  1. Accuracy: Currently gpt-oss mxfp4 with LoRA + triton kernel is generating garbage output, see [Bug]: FULL_AND_PIECEWISE cudagraph mode leading to !!! in generated text #29539 (comment)
  2. Performance: I have pasted vllm bench numbers in the description.

Let's fix the issue first then. Do you have time to take a deep look into it?

@xyang16 (Contributor, Author) commented Dec 16, 2025

Let's fix the issue first then, do you have time to take a deep look into this issue?

I raised a PR to fix this: #30585

There's another PR to fix the cudagraph issue: #30650

@jeejeelee enabled auto-merge (squash) December 18, 2025 00:08
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Dec 18, 2025
