[Doc] add design docs for async chunk in qwen3-omni #962
R2-Y wants to merge 4 commits into vllm-project:main
| @@ -0,0 +1,419 @@ | |||
| # Async Chunking | |||
For the async-chunk-arch PNG, please use a larger font.
| |--------|----------------------------|------------------------|-------------|-------------|-------------|------------------------|-------------|-------------|-------------|-------------| | ||
| |single request | text | text + audio | True | 10 | 10 | 1 | 268.27 | 1268.83 | 20.28 | 1363.31 | | ||
| |single request | text | text + audio | False | 10 | 10 | 1 | 56.73 | 1407.34 | 24.57 | 1408.03 | | ||
| |single request | text | text + audio | True | 2500 | 900 | 1 | 380.03 | 1910.39 | 8.82 | 15650.26 | |
Is the E2E time even worse here?
The total number of generated tokens differs between the two runs: 942 when `async_chunk` is set to false and 1732 when it is set to true. We should probably update the data so that the number of generated tokens is consistent.
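Until the table is regenerated, one way to compare runs with different output lengths is to normalize end-to-end time by the number of generated tokens. The sketch below is only illustrative: `ms_per_token` is a hypothetical helper, and it assumes that the last column of the quoted row is E2E time in milliseconds and that the 1732-token count belongs to that row.

```python
def ms_per_token(e2e_ms: float, generated_tokens: int) -> float:
    """Return end-to-end latency divided by the number of generated tokens."""
    if generated_tokens <= 0:
        raise ValueError("generated_tokens must be positive")
    return e2e_ms / generated_tokens


# async_chunk=True row quoted above: 15650.26 ms (assumed to be the E2E column)
# and 1732 generated tokens (taken from the reply above).
async_true = ms_per_token(15650.26, 1732)
print(f"async_chunk=True: {async_true:.2f} ms per generated token")
```

The same helper can be applied to the `async_chunk=False` run once its E2E time for the same input size is available.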
| - **Queue Coordination**: Temporary queues (waiting_for_chunk_waiting_requests, waiting_for_chunk_running_requests) keep requests out of base scheduler until chunk is ready, then restore | ||
|
|
||
| ## Performance | ||
| 1. **Reduced Latency**: Next stage can start processing immediately |
You didn't mention throughput, memory, or GPU utilization in the following table.
hsliuustc0106 left a comment
I think we need to put the data table at the beginning. Drawing some histograms for comparison would be better.
No need to run CI for docs.
Fixed.
hsliuustc0106 left a comment
For the plots, we should add three separate figures (TTFT, TTFP, and TPOT), using only two colors.
| 3. **IO-Compute Overlap**: Chunk retrieval happens asynchronously while other requests compute | ||
| 4. **Non-blocking Scheduler**: Requests waiting for chunks don't block the entire scheduler | ||
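The non-blocking behaviour in items 3 and 4 can be pictured with a toy model. This is a minimal sketch, not the vllm-omni scheduler: `ToyScheduler`, `ready_chunks`, and `step` are hypothetical names, and only the queue name `waiting_for_chunk_running_requests` is taken from the doc under review. Requests without a ready chunk are parked instead of stalling the step, and are restored once their chunk arrives.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Illustrative only; assumes one pending chunk per request at a time."""

    running: deque = field(default_factory=deque)
    waiting_for_chunk_running_requests: deque = field(default_factory=deque)
    ready_chunks: set = field(default_factory=set)

    def step(self) -> list:
        # 1) Restore parked requests whose chunk has arrived in the meantime.
        still_parked = deque()
        while self.waiting_for_chunk_running_requests:
            req = self.waiting_for_chunk_running_requests.popleft()
            if req in self.ready_chunks:
                self.running.append(req)
            else:
                still_parked.append(req)
        self.waiting_for_chunk_running_requests = still_parked

        # 2) Schedule requests that have a chunk; park the rest so they do
        #    not block the requests that can make progress.
        scheduled = []
        for req in list(self.running):
            if req in self.ready_chunks:
                self.ready_chunks.discard(req)  # the chunk is consumed
                scheduled.append(req)
            else:
                self.running.remove(req)
                self.waiting_for_chunk_running_requests.append(req)
        return scheduled


sched = ToyScheduler(running=deque(["a", "b"]), ready_chunks={"a"})
first = sched.step()   # "a" runs, "b" is parked
sched.ready_chunks.add("b")
second = sched.step()  # "b" is restored and runs, "a" waits for its next chunk
```

A real implementation would presumably handle the waiting queue (`waiting_for_chunk_waiting_requests`) the same way and fetch chunks asynchronously; both are left out here to keep the sketch short.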
|
|
||
| | Scenario | Input Modality | Output Modality | async_chunk | Input tokens num | Output tokens num | Request num | TTFT(ms) | TTFP(ms) | |
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
|
|
||
| The `async_chunk` feature enables asynchronous, chunked processing of data across multiple stages in a multi-stage pipeline (e.g., Qwen3-Omni with Thinker → Talker → Code2Wav stages). Instead of waiting for a complete stage output before forwarding to the next stage, this feature allows stages to process and forward data in chunks as it becomes available, significantly reducing latency and improving throughput. | ||
|
|
||
| **Chunk Size Definition** |
The chunk size is defined as the `num_scheduled_tokens` of each step for each request. The `num_scheduled_tokens` may differ across steps and across requests. For example, if `num_scheduled_tokens` is 1 in the decoding phase, the chunk size is 1.
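To make the definition concrete, here is a small sketch under stated assumptions: `split_into_chunks` is a hypothetical helper, and the per-step values (a first step of 4 tokens followed by decode steps of 1 token each) are invented for illustration. Each step's `num_scheduled_tokens` becomes the size of the chunk forwarded for that step.

```python
from typing import List, Sequence


def split_into_chunks(
    tokens: Sequence[int], num_scheduled_tokens_per_step: Sequence[int]
) -> List[List[int]]:
    """Split one request's tokens into per-step chunks.

    The chunk size of step i equals num_scheduled_tokens_per_step[i].
    """
    if sum(num_scheduled_tokens_per_step) != len(tokens):
        raise ValueError("scheduled token counts must cover all tokens")
    chunks, start = [], 0
    for n in num_scheduled_tokens_per_step:
        chunks.append(list(tokens[start:start + n]))
        start += n
    return chunks


# Hypothetical request: one 4-token step, then three decode steps of 1 token.
chunks = split_into_chunks(list(range(7)), [4, 1, 1, 1])
print([len(c) for c in chunks])
```

In the decode phase every chunk here has size 1, which matches the example in the comment above.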
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
It seems that adding [skip ci] to the commit message or PR title will skip the Buildkite CI.
Purpose
add design docs for async chunk in qwen3-omni
Test Plan
Test Result