feat: Add NPU+GPU async pipelining for vision-language models #936
liangliangchang wants to merge 2 commits into
logger = logging.getLogger(__name__)


class Qwen2_5_VL_CPUPreprocessor:
Seems strange that a model-specific class is in the generic vllm/vision_npu package. Can this be split better?
@@ -0,0 +1,45 @@
# SPDX-License-Identifier: Apache-2.0
Please include reproducible build instructions in the PR description and modify the README.md accordingly.
Make sure to only use the public release of Ryzen AI.
Yes. I also want to ask: what is the best way to handle the NPU model cache? We have two options:
1. Provide scripts and instructions so users can generate their own NPU model cache by extracting the vision model from Hugging Face VL models and compiling it with the released Ryzen AI toolflow.
2. Provide the model cache directly, so users don't have to worry about model extraction and compilation.
Which one looks better to you? In both cases, we will provide instructions on how to install flexmlRT from the released Ryzen AI packages.
# Triton compilation to fail.
"VLLM_LORA_DISABLE_PDL": lambda: bool(int(os.getenv("VLLM_LORA_DISABLE_PDL", "0"))),
# NPU vision backend to use (e.g., "flexmlrt" for FlexMLRT backend)
"VLLM_VISION_NPU_BACKEND": lambda: os.getenv("VLLM_VISION_NPU_BACKEND", ""),
There is only one backend, right? I.e., can we drop this env var?
) -> None:
    super().__init__()

    # Store minimal config needed for both NPU and PyTorch paths
I wonder whether making a new Qwen2_5_VisionTransformerNPU class would be cleaner instead of doing the overwriting/conditionals here. Can you try that?
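One way to read this suggestion, as a sketch only: keep the PyTorch path in the base class and move the NPU path into a subclass that is chosen once at construction time. Apart from the suggested name Qwen2_5_VisionTransformerNPU, every name here is hypothetical (VisionTransformerBase, build_vision_transformer, the npu_session callable), and plain lists stand in for tensors.

```python
from typing import Callable, List, Optional

Embedding = List[float]


class VisionTransformerBase:
    """Stand-in for the existing PyTorch vision tower (hypothetical)."""

    def forward(self, pixels: List[float]) -> Embedding:
        # Placeholder for the regular PyTorch forward pass.
        return [p * 2.0 for p in pixels]


class Qwen2_5_VisionTransformerNPU(VisionTransformerBase):
    """NPU variant: overrides forward() instead of branching inside the base."""

    def __init__(self, npu_session: Callable[[List[float]], Embedding]) -> None:
        super().__init__()
        # Hypothetical handle to the NPU runtime; any callable works here.
        self._npu_session = npu_session

    def forward(self, pixels: List[float]) -> Embedding:
        return self._npu_session(pixels)


def build_vision_transformer(
    npu_session: Optional[Callable[[List[float]], Embedding]] = None,
) -> VisionTransformerBase:
    """Pick the implementation once, so callers never check the backend."""
    if npu_session is not None:
        return Qwen2_5_VisionTransformerNPU(npu_session)
    return VisionTransformerBase()
```

With this shape, the conditionals collapse into the single factory call, and the GPU-only class stays untouched.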
Hi, thanks for the PR! I think your performance comparisons are not fair. We use "--max-num-seqs 1" to measure single-request latency. If you want to measure throughput under the assumption that two requests will be available, you should use "--max-num-seqs 2" on the GPU-only side.
In vLLM, setting max-num-seqs anywhere from 1 to 3 gives almost the same speed, because vLLM processes up to 3 requests together in one iteration, SPMD-style, on the GPU. The NPU vision model we extracted can only process 1 request at a time, so that would not be a fair comparison. That is why max-num-seqs was set to 1 during the test. However, 3 requests were also sent in parallel for the GPU test, and that performed better than sending them sequentially; it seems some startup and pre/post-processing overhead is avoided.
Enable true hardware parallelism between NPU (vision encoding) and GPU (LLM inference) for vision-language models, improving concurrent request throughput by ~1.4-1.5× even with max-num-seqs=1 (no LLM batching).

Key components:

1. Vision Scheduler (vllm/v1/engine/core.py):
   - Proactively processes vision for ALL waiting requests
   - Runs BEFORE core scheduler in every step()
   - Submits NPU vision encoding to background thread pool
   - Enables Request 2's vision to start while Request 1's LLM runs

2. Background Vision Pre-encoding (vllm/v1/worker/gpu_model_runner.py):
   - Thread pool (max_workers=1) for NPU vision processing
   - GIL release during C++ FlexMLRT inference (enables parallelism)
   - Vision embeddings cached in _VISION_PREENCODING_CACHE
   - Hybrid scheduler checks cache before scheduling LLM

3. Hybrid Scheduler Check (vllm/v1/core/sched/scheduler.py):
   - Vision readiness check before scheduling WAITING requests
   - Defers requests whose vision isn't ready yet
   - Re-checks on next scheduler iteration

Performance:
- 3 concurrent requests: 192s vs 241s sequential (1.26× speedup)
- Eliminates ~17s cold-start overhead per request
- NPU+GPU pipelining: Request N+1 vision overlaps with Request N decode

Environment variables:
  export VLLM_NPU_ASYNC_PIPELINE=1
  export VLLM_VISION_NPU_BACKEND=flexmlrt
  export VLLM_VISION_NPU_DEVICE=stx
  export VLLM_VISION_NPU_CACHE=/path/to/model.xrt

Test suite: tests/async_pipelining/
- compare_npu_vs_gpu.py: NPU vs GPU performance comparison
- test_server_async_pipelining.py: End-to-end pipelining test
- start_vllm_server.sh / test_pure_gpu.sh: Server startup scripts

Modified files:
- vllm/v1/engine/core.py (+131 lines)
- vllm/v1/worker/gpu_model_runner.py (+711 lines)
- vllm/v1/core/sched/scheduler.py (+41 lines)
- vllm/v1/engine/output_processor.py (+30 lines)
- vllm/v1/executor/uniproc_executor.py (+26 lines)

New files:
- vllm/vision_npu/: NPU backend infrastructure (FlexMLRT, CPU preprocess)
- tests/async_pipelining/: Test suite and benchmarking tools

Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com>
Signed-off-by: lichang <liangliang.chang@amd.com>
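The background pre-encoding described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: VisionPreencoder and fake_npu_encode are hypothetical names, a dict of futures stands in for _VISION_PREENCODING_CACHE, and time.sleep stands in for the FlexMLRT call (both release the GIL while they wait).

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List


def fake_npu_encode(image_id: str) -> List[float]:
    """Hypothetical stand-in for the NPU vision call."""
    time.sleep(0.05)  # sleep releases the GIL, mimicking native inference
    return [float(len(image_id))]


class VisionPreencoder:
    """Submits vision work to one background worker and caches the results."""

    def __init__(self) -> None:
        # max_workers=1 mirrors the single NPU worker named in the commit message.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._futures: Dict[str, Future] = {}
        self._lock = threading.Lock()

    def submit(self, request_id: str, image_id: str) -> None:
        """Start encoding for a request; repeated submits are ignored."""
        with self._lock:
            if request_id not in self._futures:
                self._futures[request_id] = self._pool.submit(
                    fake_npu_encode, image_id
                )

    def is_ready(self, request_id: str) -> bool:
        """Non-blocking check a scheduler could call before admitting a request."""
        with self._lock:
            fut = self._futures.get(request_id)
        return fut is not None and fut.done()

    def take(self, request_id: str) -> List[float]:
        """Remove the entry and return its embedding, blocking if still running."""
        with self._lock:
            fut = self._futures.pop(request_id)
        return fut.result()

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

In this sketch the caller submits every waiting request up front, keeps running LLM steps on the main thread, and only calls take() for requests where is_ready() returns True.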
Signed-off-by: lichang <liangliang.chang@amd.com>
if current_platform.is_cuda_alike():
# Check if the per-block quant operation is available (newer ROCm/CUDA versions)
if current_platform.is_cuda_alike() and hasattr(
    torch.ops._C, "silu_and_mul_per_block_quant"
How is this related to the NPU work?
Can you check whether the NPU prefill can be integrated via the existing pipeline parallelism in vLLM (see

Thanks, I agree. I only looked briefly at the current pipeline parallelism during my implementation, and it seems to be for splitting layers across GPUs at a fine granularity. I will evaluate it in detail and see how to reuse the existing flow as much as possible. Will update later.
Purpose
Enable NPU+GPU async pipelining for vision-language models to improve concurrent request throughput on heterogeneous hardware (NPU for vision encoding + GPU for LLM inference).
Problem: When using NPU for vision encoding and GPU for LLM, requests process sequentially even when submitted concurrently, because the V1 scheduler cannot pipeline different stages of different requests when max-num-seqs=1:
Solution: This PR adds a Vision Scheduler that proactively processes NPU vision encoding for waiting requests in background threads, enabling true hardware parallelism:
Key Components
Vision Scheduler (vllm/v1/engine/core.py): step(), max_num_running_reqs setting
Background Vision Pre-encoding (vllm/v1/worker/gpu_model_runner.py):
Hybrid Scheduler Check (vllm/v1/core/sched/scheduler.py):
Environment Variables
Enable with:
Gracefully degrades when disabled (zero overhead for GPU-only mode).
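A minimal sketch of how such gating might look, written in the same lazy-getter style as the env-var lines quoted in the review diff. The variable names come from the commit message; the NPU_ENV table, the npu_pipeline_enabled helper, the default values other than the backend's empty string, and the rule that a backend and a cache path must both be set are assumptions made for illustration.

```python
import os
from typing import Callable, Dict

# Lazily evaluated getters, so changes to os.environ are picked up on each call.
NPU_ENV: Dict[str, Callable[[], object]] = {
    # Assumed default "0": pipelining stays off unless explicitly enabled.
    "VLLM_NPU_ASYNC_PIPELINE": lambda: bool(
        int(os.getenv("VLLM_NPU_ASYNC_PIPELINE", "0"))
    ),
    "VLLM_VISION_NPU_BACKEND": lambda: os.getenv("VLLM_VISION_NPU_BACKEND", ""),
    "VLLM_VISION_NPU_DEVICE": lambda: os.getenv("VLLM_VISION_NPU_DEVICE", ""),
    "VLLM_VISION_NPU_CACHE": lambda: os.getenv("VLLM_VISION_NPU_CACHE", ""),
}


def npu_pipeline_enabled() -> bool:
    """Hypothetical helper: True only if the flag, backend and cache are all set."""
    return bool(
        NPU_ENV["VLLM_NPU_ASYNC_PIPELINE"]()
        and NPU_ENV["VLLM_VISION_NPU_BACKEND"]()
        and NPU_ENV["VLLM_VISION_NPU_CACHE"]()
    )
```

Checking one boolean like this at startup is one way a GPU-only run could skip the vision scheduler and thread pool entirely.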
Test Plan
Prerequisites
Test 1: NPU Mode Performance
Expected behavior:
Test 2: GPU-only Mode (Baseline)
Expected behavior:
Test 3: Compare Results
Test 4: End-to-end Pipelining Validation
# With NPU server running
python test_server_async_pipelining.py
Verify in logs:
Test Result
Hardware Setup
--max-num-seqs 1 --gpu-memory-utilization 0.83
Performance Results (3 concurrent requests)
Timing breakdown
End-to-end time (3 concurrent requests)
The "Deferred" requests prove the Vision Scheduler is working - requests wait for vision to complete while not blocking the scheduler.
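The deferral behaviour can be illustrated with a small sketch of one scheduler iteration. schedule_step and its parameters are hypothetical names, and letting a ready request be scheduled ahead of a deferred one is an assumption about the policy, not something confirmed by the PR.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def schedule_step(
    waiting: Deque[str],
    vision_ready: Callable[[str], bool],
    max_running: int = 1,
) -> Tuple[List[str], List[str]]:
    """Return (scheduled, deferred) request ids for one scheduler iteration.

    Requests whose vision embeddings are not ready are skipped without
    blocking the requests behind them, and stay queued for the next iteration.
    """
    scheduled: List[str] = []
    deferred: List[str] = []
    for _ in range(len(waiting)):
        req = waiting.popleft()
        ready = vision_ready(req)
        if ready and len(scheduled) < max_running:
            scheduled.append(req)
            continue
        if not ready:
            deferred.append(req)  # would be logged as "Deferred"
        waiting.append(req)  # full rotation keeps the original relative order
    return scheduled, deferred
```

Calling schedule_step again on the next iteration re-checks the same requests, which matches the "re-checks on next scheduler iteration" behaviour listed in the commit message.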
Output Quality Verification
Sample outputs from both modes (NPU vs GPU) are semantically identical, confirming NPU vision encoding produces correct results:
NPU+GPU Output:
Pure GPU Output:
Files Changed
Core pipelining (939 lines):
- vllm/v1/engine/core.py (+112 lines)
- vllm/v1/worker/gpu_model_runner.py (+789 lines)
- vllm/v1/core/sched/scheduler.py (+41 lines)
- vllm/v1/engine/output_processor.py (+24 lines)
- vllm/v1/executor/uniproc_executor.py (+20 lines)
NPU backend (1,193 lines):
- vllm/vision_npu/*: FlexMLRT backend, CPU preprocessing, C++ bridge
Test suite (1,097 lines):
- tests/async_pipelining/*: Performance comparison tools, test scripts, test images
Essential Elements Checklist
tests/async_pipelining/README.md
Additional Notes