Two rounds of code-review fixes:
- server_args: disable the prefill CUDA graph under EPv2 -- only the direct-mode
decode masked-GEMM path is capture-safe; the direct-mode prefill (extend) path
uses a non-masked layout with a host readback and is not capture-validated.
- fp8: DeepGEMM UE8M0 weight requant now fails fast for the EPv2 FusedMoE layer
with a clear message, instead of asserting isinstance(layer, DeepEPMoE).
- deepseek_v2: revert the 3 newly added a2a helpers to the original inline
backend checks plus `or is_epv2()`, so EPv2 integration is purely additive. Fix
two over-broad sites where a wide helper had replaced narrower checks: the AMD
gfx95 allocator-size path (restore is_deepep_class_backend + epv2) and
enable_a2a_moe (restore is_deepep/is_mooncake + epv2) -- unrelated backends
(nixl / ascend / flashinfer / megamoe) keep their original behavior.
- epv2: dispatch_b/combine_b raise a clear RuntimeError when called without a
preceding dispatch_a/combine_a, aligned with the combine_a stage check.
- kernels: document that the masked-slab overflow fast-fail is skipped during
CUDA graph capture, and that safety then relies on the static
max_m = cap * ep_group_size upper bound.
- utils: drop the now-unused a2a helpers; clarify the capability-resolver comment
(it reads runner flags to build the contract, like the DeepEP dispatcher does;
the dispatcher itself only consumes the resolved contract).
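
For reviewers, the dispatch_b/combine_b stage check described in the epv2 item can be sketched roughly as follows. This is a minimal illustration, not the actual EPv2 dispatcher: the class name `StagedDispatcher`, the stage strings, and the `_pending` buffer are hypothetical, and the full four-stage ordering is an assumption beyond what the item above states.

```python
class StagedDispatcher:
    """Hypothetical two-phase dispatcher; illustrates the stage check only."""

    def __init__(self):
        self._stage = "idle"      # assumed stage names, not the real ones
        self._pending = None

    def _require(self, expected, caller, predecessor):
        # Fail fast with a clear message instead of a bare assert.
        if self._stage != expected:
            raise RuntimeError(
                f"{caller} called without a preceding {predecessor} "
                f"(current stage: {self._stage!r})"
            )

    def dispatch_a(self, tokens):
        self._require("idle", "dispatch_a", "combine_b")
        self._pending = list(tokens)
        self._stage = "after_dispatch_a"

    def dispatch_b(self):
        self._require("after_dispatch_a", "dispatch_b", "dispatch_a")
        out, self._pending = self._pending, None
        self._stage = "after_dispatch_b"
        return out

    def combine_a(self, expert_out):
        self._require("after_dispatch_b", "combine_a", "dispatch_b")
        self._pending = list(expert_out)
        self._stage = "after_combine_a"

    def combine_b(self):
        self._require("after_combine_a", "combine_b", "combine_a")
        out, self._pending = self._pending, None
        self._stage = "idle"
        return out
```

Calling `dispatch_b()` on a fresh instance raises a RuntimeError that names the missing `dispatch_a`, which is the behavior the item describes.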
No functional or perf change for EPv2 or DeepEP: re-verified chat-completions
3-question correctness (direct + hybrid), unit tests (7 passed), and 4 throughput
points (decode/prefill, EPv2 vs DeepEP) -- all within run-to-run noise of the
pre-fix numbers.
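
The static bound from the kernels item can be sketched as below. This assumes `cap` is a per-rank token cap, so that no expert slab can receive more than `cap * ep_group_size` rows. The function names and the `capturing` flag are hypothetical stand-ins for illustration, not the kernel code.

```python
def masked_slab_max_m(cap: int, ep_group_size: int) -> int:
    # Assumption: each of the ep_group_size ranks sends at most `cap` tokens,
    # so a single expert slab never needs more than this many rows.
    return cap * ep_group_size


def check_slab_overflow(rows_per_expert, max_m: int, capturing: bool) -> None:
    # During CUDA graph capture the fast-fail is skipped; safety then relies
    # on max_m having been sized from the static upper bound.
    if capturing:
        return
    for expert_id, rows in enumerate(rows_per_expert):
        if rows > max_m:
            raise RuntimeError(
                f"masked slab overflow: expert {expert_id} received "
                f"{rows} rows, max_m={max_m}"
            )


max_m = masked_slab_max_m(cap=128, ep_group_size=8)
check_slab_overflow([1024, 17, 0], max_m, capturing=False)  # within the bound
```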