Description
I am trying to serve DeepSeek-V3.2 with NVFP4 quantization and Spec V2 (SGLANG_ENABLE_SPEC_V2=1) in SGLang. The error below occurs with 1k input/1k output at concurrency=10.
Image: lmsysorg/sglang:v0.5.8.post1-cu130
Launch command:
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-V3.2-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --tool-call-parser deepseekv32 \
  --reasoning-parser deepseek-v3 \
  --port 8000 \
  --host 0.0.0.0 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --max-running-requests=64 \
  --max-queued-requests=128 \
  --quantization modelopt_fp4 \
  --moe-runner-backend flashinfer_trtllm \
  --fp4-gemm-backend=flashinfer_cutlass \
  --mem-fraction-static=0.8
RuntimeError: Error in function 'run' at /workspace/csrc/trtllm_batched_gemm_runner.cu:258: Error occurred when running GEMM! (numBatches: 256 , GemmMNK: 2 512 7168 , Kernel: bmm_E2m1_E2m1E2m1_Fp32_t128x8x512u2_s5_et128x8_m128x8x64_cga1x1x1_16dp256b_rM_TN_transOut_schedS_bN_ldgsts_tmaOpt_clmp_swiGlu_dynBatch_sm100f )
[2026-02-06 21:15:27 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2974, in run_scheduler_process
scheduler.event_loop_overlap()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1125, in event_loop_overlap
batch_result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2294, in run_batch
batch_result = self.model_worker.forward_batch_generation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 646, in forward_batch_generation
batch_output = self.target_worker.forward_batch_generation(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 454, in forward_batch_generation
out = self.model_runner.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2235, in forward
output = self._forward_raw(
^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2334, in _forward_raw
ret = self.forward_extend(
^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 2174, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2897, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2710, in forward
hidden_states, residual = layer(
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2410, in forward
hidden_states = self.mlp(
^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 569, in forward
return self.forward_normal(
^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 667, in forward_normal
final_hidden_states = self.experts(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1243, in forward
return self.forward_impl(hidden_states, topk_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1293, in forward_impl
result = trtllm_fp4_block_scale_moe(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 2414, in trtllm_fp4_block_scale_moe
return get_trtllm_moe_sm100_module().trtllm_fp4_block_scale_moe(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/fused_moe/core.py", line 1836, in trtllm_fp4_block_scale_moe_op
intermediate_output = moe_op.trtllm_fp4_block_scale_moe(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.__call__
RuntimeError: Error in function 'run' at /workspace/csrc/trtllm_batched_gemm_runner.cu:258: Error occurred when running GEMM! (numBatches: 256 , GemmMNK: 2 512 7168 , Kernel: bmm_E2m1_E2m1E2m1_Fp32_t128x8x512u2_s5_et128x8_m128x8x64_cga1x1x1_16dp256b_rM_TN_transOut_schedS_bN_ldgsts_tmaOpt_clmp_swiGlu_dynBatch_sm100f )
This was originally reported to SGLang, but it appears to be a FlashInfer issue, so I am reposting it upstream. See sgl-project/sglang#16624 (comment).
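For anyone trying to reproduce, here is a minimal sketch of the kind of load described above (1k input/1k output at concurrency=10). It is not the exact client that triggered the crash. It assumes the server is reachable at localhost on port 8000 (the port from the launch command) and that the OpenAI-compatible `/v1/completions` endpoint is used. The repeated-word prompt is only a rough stand-in for a 1k-token input, and `build_payload`, `send_one`, and `run_load` are illustrative helper names, not part of SGLang or FlashInfer.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: localhost host, port 8000 from the launch command.
BASE_URL = "http://localhost:8000"
MODEL = "nvidia/DeepSeek-V3.2-NVFP4"
CONCURRENCY = 10
INPUT_WORDS = 1024    # rough proxy for ~1k input tokens (not exact)
OUTPUT_TOKENS = 1024  # ~1k output tokens


def build_payload(input_words=INPUT_WORDS, output_tokens=OUTPUT_TOKENS):
    """Build one completion request with a synthetic prompt."""
    prompt = " ".join(["hello"] * input_words)
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": output_tokens,
        "temperature": 0,
    }


def send_one(payload):
    """POST a single request and return the HTTP status code."""
    req = urllib.request.Request(
        BASE_URL + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return resp.status


def run_load(num_requests=100):
    """Keep up to CONCURRENCY requests in flight until all are sent."""
    payloads = [build_payload() for _ in range(num_requests)]
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        return list(pool.map(send_one, payloads))


if __name__ == "__main__":
    statuses = run_load()
    print(f"completed {len(statuses)} requests")
```

Generation may stop before 1024 tokens if the model emits an end-of-sequence token, so actual output lengths can be shorter than in the original run.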