
[Bug]: scheduler issue with PD disaggregation #30659

@David9857

Description


Your current environment

The output of python collect_env.py

vllm v0.11.0

🐛 Describe the bug

Hello, I ran into a scheduler issue when deploying with prefill/decode (PD) disaggregation.
Since the current scheduling strategy doesn't free the blocks occupied by requests in the WAITING_FOR_REMOTE_KVS state, can the server get stuck in certain scenarios?
For example, in step 4 the scheduler allocates blocks for request 1 first, because request 1 was put back at the front of the waiting queue in step 3. Request 2 then never gets into the running queue, since it requires more blocks for its next token, and the scheduler gets stuck looping from step 2 to step 4.

[Image: diagram of the scheduling steps referenced above]
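To make the suspected starvation concrete, here is a minimal toy model. It is a sketch under stated assumptions, not vLLM's scheduler and not a reconstruction of the exact steps in the image: `Request`, `step`, `run`, `TOTAL_BLOCKS` and all block counts are hypothetical. It encodes only the two rules described in the report: blocks held by a request waiting for remote KVs are not freed, and that request is put back at the front of the waiting queue on every pass.

```python
from collections import deque
from dataclasses import dataclass

# Toy model only (hypothetical names and numbers, NOT vLLM code).
TOTAL_BLOCKS = 8  # hypothetical size of the KV block pool


@dataclass
class Request:
    name: str
    blocks_needed: int             # blocks the request needs to be scheduled
    remote_kvs_ready: bool = True  # False ~ "WAITING_FOR_REMOTE_KVS"
    held_blocks: int = 0


def step(waiting: deque, running: list) -> None:
    """One toy scheduling pass over the waiting queue, front first."""
    free = TOTAL_BLOCKS - sum(r.held_blocks for r in list(waiting) + running)
    requeue = []
    while waiting:
        req = waiting.popleft()
        if not req.remote_kvs_ready:
            # Assumption 1: blocks are allocated once and kept (never freed)
            # while the request waits for its remote KVs.
            if req.held_blocks == 0 and free >= req.blocks_needed:
                req.held_blocks = req.blocks_needed
                free -= req.blocks_needed
            requeue.append(req)
            continue
        if free < req.blocks_needed:
            # Not enough free blocks: stop this pass, request stays queued.
            waiting.appendleft(req)
            break
        req.held_blocks = req.blocks_needed
        free -= req.blocks_needed
        running.append(req)
    # Assumption 2: requests still waiting on remote KVs go back to the FRONT.
    waiting.extendleft(reversed(requeue))


def run(waiting: deque, running: list, max_steps: int = 10):
    """Return the step at which the queue state first repeats, else None.

    In this deterministic toy, a repeated state means no further progress.
    """
    seen = set()
    for i in range(max_steps):
        step(waiting, running)
        state = (tuple((r.name, r.held_blocks) for r in waiting),
                 tuple((r.name, r.held_blocks) for r in running))
        if state in seen:
            return i + 1
        seen.add(state)
    return None


if __name__ == "__main__":
    req1 = Request("req1", blocks_needed=5, remote_kvs_ready=False)
    req2 = Request("req2", blocks_needed=4)
    waiting, running = deque([req1, req2]), []
    repeated_at = run(waiting, running)
    # prints: state repeated at step 2; running=[]
    print(f"state repeated at step {repeated_at}; running={[r.name for r in running]}")
```

With 8 blocks, request 1 holding 5 while its remote KVs are pending, and request 2 needing 4, the state after step 2 is identical to the state after step 1, so request 2 is never admitted for as long as request 1's KVs stay pending (the toy never delivers them). Checking for a repeated state is just a simple way to flag a no-progress loop in a deterministic simulation.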


Metadata


Assignees: No one assigned
Labels: bug (Something isn't working)
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
