
[Doc] add design docs for async chunk in qwen3-omni #962

Open

R2-Y wants to merge 4 commits into vllm-project:main from R2-Y:async_chunk_doc

Conversation

Contributor

@R2-Y R2-Y commented Jan 26, 2026


Purpose

add design docs for async chunk in qwen3-omni

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


@R2-Y R2-Y changed the title from "[WIP] add design docs for async chunk in qwen3-omni" to "[WIP] [Doc] add design docs for async chunk in qwen3-omni" on Jan 26, 2026
@R2-Y R2-Y force-pushed the async_chunk_doc branch 4 times, most recently from a25a6a9 to 493de9c on January 30, 2026 01:48
@@ -0,0 +1,419 @@
# Async Chunking
Collaborator

For the async-chunk-arch PNG, please use a larger font.

|--------|----------------------------|------------------------|-------------|-------------|-------------|------------------------|-------------|-------------|-------------|-------------|
|single request | text | text + audio | True | 10 | 10 | 1 | 268.27 | 1268.83 | 20.28 | 1363.31 |
|single request | text | text + audio | False | 10 | 10 | 1 | 56.73 | 1407.34 | 24.57 | 1408.03 |
|single request | text | text + audio | True | 2500 | 900 | 1 | 380.03 | 1910.39 | 8.82 | 15650.26 |
Collaborator

the e2e time is even worse?

Contributor

The total number of generated tokens is different. When async chunk is set to false, the number of generated tokens is 942, while when async chunk is set to true, the number of generated tokens is 1732. Perhaps we need to update the data to ensure the number of generated tokens is consistent.

Collaborator

Or remove E2EL.

Contributor

Updated

- **Queue Coordination**: Temporary queues (waiting_for_chunk_waiting_requests, waiting_for_chunk_running_requests) keep requests out of base scheduler until chunk is ready, then restore
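
For illustration only, not code from this PR: a minimal Python sketch of how the temporary-queue coordination above could work. Only the two queue names come from the doc text; `AsyncChunkScheduler`, the base scheduler's `add`/`remove` methods, and `Request.chunk_ready()` are hypothetical stand-ins.

```python
# Illustrative sketch only -- the base scheduler API and chunk_ready() are
# hypothetical stand-ins; only the two queue names come from the doc text.
from collections import deque


class AsyncChunkScheduler:
    def __init__(self, base_scheduler):
        self.base = base_scheduler
        # Requests parked here are invisible to the base scheduler
        # until their next input chunk has arrived.
        self.waiting_for_chunk_waiting_requests = deque()
        self.waiting_for_chunk_running_requests = deque()

    def park(self, request, was_running: bool):
        """Pull a request out of the base scheduler while it waits for a chunk."""
        self.base.remove(request)
        if was_running:
            self.waiting_for_chunk_running_requests.append(request)
        else:
            self.waiting_for_chunk_waiting_requests.append(request)

    def restore_ready(self):
        """Put requests whose chunk has arrived back into the base scheduler."""
        for queue, running in (
            (self.waiting_for_chunk_running_requests, True),
            (self.waiting_for_chunk_waiting_requests, False),
        ):
            for _ in range(len(queue)):
                request = queue.popleft()
                if request.chunk_ready():      # hypothetical readiness check
                    self.base.add(request, running=running)
                else:
                    queue.append(request)      # still waiting; keep it parked
```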

## Performance
1. **Reduced Latency**: Next stage can start processing immediately
Collaborator

You didn't mention throughput, memory, or GPU utilization in the following table.

Contributor

Fixed

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

I think we need to put the data table at the beginning; drawing a histogram for comparison would be better.

@hsliuustc0106
Collaborator

No need to run CI for docs.

@amy-why-3459 amy-why-3459 force-pushed the async_chunk_doc branch 3 times, most recently from 0f3f013 to 3ed3864 on January 31, 2026 09:08
@amy-why-3459
Contributor

I think we need to put the data table at the beginning; drawing a histogram for comparison would be better.

fixed

@amy-why-3459 amy-why-3459 force-pushed the async_chunk_doc branch 2 times, most recently from a2b35b5 to 99c9da1 on January 31, 2026 09:24
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

For the figure plot, we should add 3 separate figures (TTFT, TTFP, and TPOT), using only two colors.

3. **IO-Compute Overlap**: Chunk retrieval happens asynchronously while other requests compute
4. **Non-blocking Scheduler**: Requests waiting for chunks don't block the entire scheduler
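
For illustration only, not code from this PR: a rough asyncio sketch of points 3 and 4 above, in which the fetch of the next chunk is started before the current compute step finishes. `fetch_next_chunk` and `run_model_step` are hypothetical placeholders.

```python
# Illustrative sketch only -- fetch_next_chunk() and run_model_step() are
# hypothetical; this just shows chunk retrieval overlapping with compute.
import asyncio


async def fetch_next_chunk(stage_queue: asyncio.Queue):
    """IO side: wait for the upstream stage to publish the next chunk."""
    return await stage_queue.get()


async def stage_loop(stage_queue: asyncio.Queue, run_model_step):
    # Kick off the fetch of chunk N+1 before chunk N is processed,
    # so IO (chunk retrieval) overlaps with compute (model execution).
    next_chunk = asyncio.create_task(fetch_next_chunk(stage_queue))
    while True:
        chunk = await next_chunk
        if chunk is None:             # upstream finished
            break
        next_chunk = asyncio.create_task(fetch_next_chunk(stage_queue))
        await run_model_step(chunk)   # compute while the next fetch is in flight
```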

| Scenario | Input Modality | Output Modality | async_chunk | Input tokens num | Output tokens num | Request num | TTFT(ms) | TTFP(ms) |
Collaborator

TPOT is missing.

Contributor Author

fixed

@R2-Y R2-Y changed the title from "[WIP] [Doc] add design docs for async chunk in qwen3-omni" to "[Doc] add design docs for async chunk in qwen3-omni" on Feb 2, 2026
Signed-off-by: Rein Yang <ruiruyang2@gmail.com>

Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
@R2-Y R2-Y force-pushed the async_chunk_doc branch 2 times, most recently from b82b385 to 3185731 on February 2, 2026 12:30
@hsliuustc0106 hsliuustc0106 mentioned this pull request Feb 3, 2026

The `async_chunk` feature enables asynchronous, chunked processing of data across multiple stages in a multi-stage pipeline (e.g., Qwen3-Omni with Thinker → Talker → Code2Wav stages). Instead of waiting for a complete stage output before forwarding to the next stage, this feature allows stages to process and forward data in chunks as it becomes available, significantly reducing latency and improving throughput.
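
For illustration only, not the vllm-omni implementation: a minimal asyncio sketch of the chunked stage-to-stage forwarding described above, where each stage forwards chunks downstream as soon as they are produced instead of waiting for the full stage output. All function and parameter names here are hypothetical.

```python
# Illustrative sketch only -- stage functions are hypothetical placeholders,
# not the vllm-omni implementation.
import asyncio

END = object()  # sentinel marking end-of-stream


async def producer_stage(out_q: asyncio.Queue, generate_chunks):
    # Forward each chunk downstream as soon as it is produced.
    async for chunk in generate_chunks():
        await out_q.put(chunk)
    await out_q.put(END)


async def consumer_stage(in_q: asyncio.Queue, out_q: asyncio.Queue, process_chunk):
    # Start processing the first chunk without waiting for the full upstream output.
    while (chunk := await in_q.get()) is not END:
        await out_q.put(await process_chunk(chunk))
    await out_q.put(END)


async def sink_stage(in_q: asyncio.Queue, emit):
    while (chunk := await in_q.get()) is not END:
        await emit(chunk)  # e.g., stream an audio chunk to the client


async def pipeline(thinker, talker, code2wav, emit):
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        producer_stage(q1, thinker),        # Thinker emits chunks
        consumer_stage(q1, q2, talker),     # Talker consumes/forwards per chunk
        consumer_stage(q2, q3, code2wav),   # Code2Wav consumes/forwards per chunk
        sink_stage(q3, emit),
    )
```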

**Chunk Size Definition**
Contributor

The chunk size is defined as the num_scheduled_tokens of each step of each request; it can differ across steps and across requests. For example, if num_scheduled_tokens is 1 in the decoding phase, the chunk size is 1.
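
For illustration only, with hypothetical request IDs and token counts: since the chunk size equals num_scheduled_tokens for that request in that step, a prefill step may forward a large chunk while a decode step forwards a chunk of size 1.

```python
# Illustration only: hypothetical scheduler output for two engine steps.
# The chunk forwarded downstream has size num_scheduled_tokens for that
# request in that step, so it varies per step and per request.
steps = [
    {"req-0": 512, "req-1": 1},   # step 0: req-0 prefilling, req-1 decoding
    {"req-0": 1, "req-1": 1},     # step 1: both decoding -> chunk size 1
]
for step_idx, num_scheduled_tokens in enumerate(steps):
    for req_id, n_tokens in num_scheduled_tokens.items():
        chunk_size = n_tokens
        print(f"step {step_idx}, {req_id}: chunk_size={chunk_size}")
```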

Contributor Author

done

Signed-off-by: Rein Yang <ruiruyang2@gmail.com>
@congw729
Contributor

congw729 commented Feb 6, 2026

No need to run CI for docs.

It seems adding [skip ci] to the commit message or PR title will skip the BuildKite CI.
