[Bug] Missing saving kv for LLaDA2 #19019

@DarkSharpness

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

When `enable_fused_set_kv_buffer` is overridden to return `False` and test/registered/dllm/test_llada2_mini.py is run, the test fails.

The root cause is that the flashinfer attention backend skips `save_kv_cache` for ENCODER_ONLY layers, which is incorrect. After removing the following lines, the test passes again.

```python
if save_kv_cache and layer.attn_type == AttentionType.ENCODER_ONLY:
    save_kv_cache = False
```
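
For illustration, here is a self-contained toy sketch (it mirrors sglang's names, but it is not the real backend code) of how this guard silently turns off KV writes for ENCODER_ONLY layers, which appears to be the path LLaDA2's layers hit once the fused `set_kv_buffer` path is disabled:

```python
# Toy sketch only: AttentionType and Layer are stand-ins mirroring sglang's names.
from dataclasses import dataclass
from enum import Enum, auto


class AttentionType(Enum):
    DECODER = auto()
    ENCODER_ONLY = auto()


@dataclass
class Layer:
    attn_type: AttentionType


def resolve_save_kv_cache(layer: Layer, save_kv_cache: bool = True) -> bool:
    # The guard quoted above: it downgrades the caller's request to save KV.
    if save_kv_cache and layer.attn_type == AttentionType.ENCODER_ONLY:
        save_kv_cache = False
    return save_kv_cache


# An ENCODER_ONLY layer never gets its K/V written, even though the caller asked for it,
# so later steps read from a KV pool that was never filled.
assert resolve_save_kv_cache(Layer(AttentionType.ENCODER_ONLY)) is False
assert resolve_save_kv_cache(Layer(AttentionType.DECODER)) is True
```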

Reproduction

  1. Modify this function so that it always returns `False` (a sketch of the modified function follows this list):

     ```python
     def enable_fused_set_kv_buffer(forward_batch: ForwardBatch):
         """Enable fused set_kv_buffer only on CUDA with bfloat16 KV cache."""
         return (
             _is_cuda
             and hasattr(forward_batch.token_to_kv_pool, "dtype")
             and forward_batch.token_to_kv_pool.dtype == torch.bfloat16
             and not isinstance(forward_batch.token_to_kv_pool, SWAKVPool)
         )
     ```

  2. Run `pytest -xss test/registered/dllm/test_llada2_mini.py`
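
A minimal sketch of the modification from step 1 (only the return value changes; the original CUDA / bfloat16 / SWAKVPool checks are bypassed):

```python
# Step 1 of the reproduction: the function body after the modification.
def enable_fused_set_kv_buffer(forward_batch: "ForwardBatch") -> bool:
    """Disabled for reproduction: always take the non-fused set_kv_buffer path."""
    return False
```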

Environment

H200
