Qwen3-VL, created by Alibaba Cloud, is the most powerful vision-language model in the Qwen series to date.
This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.
It is available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on‑demand deployment.
```bash
uv venv
source .venv/bin/activate

# Install vLLM >=0.11.0
uv pip install -U vllm

# Install Qwen-VL utility library (recommended for offline inference)
uv pip install qwen-vl-utils==0.0.14
```

Qwen3-VL-235B-A22B-Instruct is the Qwen3-VL flagship MoE model, which requires a minimum of 8 GPUs, each with at least 80 GB of memory (e.g., A100, H100, or H200). On some types of hardware the model may not launch successfully with its default settings. Recommended approaches by hardware type are:
- H100 with `fp8`: Use the FP8 checkpoint for optimal memory efficiency.
- A100 & H100 with `bfloat16`: Either reduce `--max-model-len` or restrict inference to images only.
- H200 & B200: Run the model out of the box, supporting full context length and concurrent image and video processing.
See sections below for detailed launch arguments for each configuration. We are actively working on optimizations and the recommended ways to launch the model will be updated accordingly.
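A back-of-the-envelope estimate of weight memory helps explain why the BF16 checkpoint is tight on 80 GB GPUs while the FP8 checkpoint is not. The sketch below assumes roughly 235 billion parameters (read off the model name), 2 bytes per parameter for BF16 and 1 byte for FP8. It ignores activations, the KV cache, multimodal profiling, and any layers that stay unquantized, so treat the numbers as rough guidance rather than measurements.

```python
# Rough weight-memory estimate for Qwen3-VL-235B-A22B.
# Assumptions (not taken from vLLM): ~235e9 parameters, 2 bytes/param for
# BF16, 1 byte/param for FP8. Activations and KV cache are ignored.

TOTAL_PARAMS = 235e9
BYTES_PER_PARAM = {"bfloat16": 2, "fp8": 1}
GPU_MEMORY_GB = 80  # per-GPU memory of an 80 GB A100/H100


def weight_gb_per_gpu(dtype: str, num_gpus: int = 8) -> float:
    """Return approximate weight memory per GPU in GB (1 GB = 1e9 bytes)."""
    total_bytes = TOTAL_PARAMS * BYTES_PER_PARAM[dtype]
    return total_bytes / num_gpus / 1e9


for dtype in BYTES_PER_PARAM:
    per_gpu = weight_gb_per_gpu(dtype)
    # Memory left on each card for KV cache, activations, and encoder inputs.
    headroom = GPU_MEMORY_GB - per_gpu
    print(f"{dtype}: ~{per_gpu:.1f} GB weights per GPU, ~{headroom:.1f} GB left")
```

Under these assumptions, FP8 weights sharded over 4 GPUs occupy about as much per GPU as BF16 weights sharded over 8, which is consistent with the TP4 FP8 configuration disabling video inputs and raising `--gpu-memory-utilization`.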
H100 (Image + Video Inference, FP8)
```bash
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --enable-expert-parallel \
  --async-scheduling
```

H100 (Image Inference, FP8, TP4)

```bash
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --limit-mm-per-prompt.video 0 \
  --async-scheduling \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 128
```

A100 & H100 (Image Inference, BF16)

```bash
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --limit-mm-per-prompt.video 0 \
  --async-scheduling
```

A100 & H100 (Image + Video Inference, BF16)

```bash
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 128000 \
  --async-scheduling
```

H200 & B200

```bash
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --async-scheduling
```

ℹ️ Note
Qwen3-VL-235B-A22B-Instruct also excels on text-only tasks, ranking as the #1 open model on text by lmarena.ai at the time this guide was created.
You can enable text-only mode by passing `--limit-mm-per-prompt.video 0 --limit-mm-per-prompt.image 0`, which skips the vision encoder and multimodal profiling to free up memory for additional KV cache.
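When the server runs in text-only mode, requests carry no image or video parts. A minimal sketch of such a request against the OpenAI-compatible chat endpoint follows. The helper names `build_text_only_payload` and `send_chat_request` are illustrative, and the address is assumed to be the default `http://localhost:8000` used by the client example in this guide.

```python
import json
import urllib.request

# Assumed server address; adjust to wherever `vllm serve` is listening.
BASE_URL = "http://localhost:8000/v1"
MODEL = "Qwen/Qwen3-VL-235B-A22B-Instruct"


def build_text_only_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Build a chat-completions payload whose content is plain text only."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload: dict, timeout: float = 600.0) -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_text_only_payload("Summarize the benefits of expert parallelism.")
print(json.dumps(payload, indent=2))
# With a running server: print(send_chat_request(payload))
```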
- It's highly recommended to specify `--limit-mm-per-prompt.video 0` if your inference server will only process image inputs, since enabling video inputs reserves additional memory for long video embeddings. Alternatively, you can skip memory profiling for multimodal inputs with `--skip-mm-profiling` and lower `--gpu-memory-utilization` accordingly at your own risk.
- To avoid undesirable CPU contention, it's recommended to limit the number of threads allocated to preprocessing by setting the environment variable `OMP_NUM_THREADS=1`. This is particularly useful and shows significant throughput improvement when deploying multiple vLLM instances on the same host.
- You can set `--max-model-len` to preserve memory. By default the model's context length is 262K, but `--max-model-len 128000` is good for most scenarios.
- Specifying `--async-scheduling` improves the overall system performance by overlapping scheduling overhead with the decoding process. Note: With vLLM >= 0.11.1, compatibility has been improved for structured output and sampling with penalties, but it may still be incompatible with speculative decoding (features merged but not yet released). Check the latest releases for continued improvements.
- Specifying `--mm-encoder-tp-mode data` deploys the vision encoder in a data-parallel fashion for better performance. The vision encoder is very small, so tensor parallelism brings little gain but incurs significant communication overhead. Enabling this feature does consume additional memory and may require adjusting `--gpu-memory-utilization`.
- If your workload consists mostly of unique multimodal inputs, it is recommended to pass `--mm-processor-cache-gb 0` to avoid caching overhead. Otherwise, specifying `--mm-processor-cache-type shm` enables an experimental feature that uses host shared memory to cache preprocessed input images and/or videos, which shows better performance at high TP settings.
- vLLM supports Expert Parallelism (EP) via `--enable-expert-parallel`, which allows experts in MoE models to be deployed on separate GPUs for better throughput. Check out Expert Parallelism Deployment for more details.
- You can use benchmark_moe to perform MoE Triton kernel tuning for your hardware.
- You can further extend the model's context window with `YaRN` by passing `--rope-scaling '{"rope_type":"yarn","factor":3.0,"original_max_position_embeddings": 262144,"mrope_section":[24,20,20],"mrope_interleaved": true}' --max-model-len 1000000`.
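Several of the tips above can be combined in a small launcher script, which is convenient when starting more than one instance on a host. The following is only a sketch: `build_serve_command` and `build_env` are hypothetical helpers, every flag is taken from the tips above, and the defaults chosen here (images only, 128000-token context) are one possible combination rather than a recommended one.

```python
import json
import os
import shlex
from typing import Dict, List, Optional


def build_serve_command(
    model: str = "Qwen/Qwen3-VL-235B-A22B-Instruct",
    tensor_parallel_size: int = 8,
    images_only: bool = True,
    max_model_len: Optional[int] = 128000,
    yarn_factor: Optional[float] = None,
) -> List[str]:
    """Assemble a `vllm serve` argument list from the tips in this guide."""
    cmd = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--mm-encoder-tp-mode", "data",
        "--async-scheduling",
    ]
    if images_only:
        # Skip reserving memory for long video embeddings.
        cmd += ["--limit-mm-per-prompt.video", "0"]
    if yarn_factor is not None:
        # Same keys as the YaRN example above; only the factor is variable.
        rope_scaling = {
            "rope_type": "yarn",
            "factor": yarn_factor,
            "original_max_position_embeddings": 262144,
            "mrope_section": [24, 20, 20],
            "mrope_interleaved": True,
        }
        cmd += ["--rope-scaling", json.dumps(rope_scaling)]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


def build_env() -> Dict[str, str]:
    """Copy the current environment and limit preprocessing threads."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = "1"
    return env


command = build_serve_command()
print(shlex.join(command))
# To actually launch: subprocess.run(command, env=build_env(), check=True)
```

Building the argument list in Python and serializing the `--rope-scaling` value with `json.dumps` avoids hand-quoting the JSON string in a shell.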
Once the server for the Qwen3-VL-235B-A22B-Instruct model is running, open another terminal and run the benchmark client:
```bash
vllm bench serve \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model Qwen/Qwen3-VL-235B-A22B-Instruct \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 1000 \
  --request-rate 20
```

```python
import time
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
    timeout=3600
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                }
            },
            {
                "type": "text",
                "text": "Read all the text in the image."
            }
        ]
    }
]

start = time.time()
response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    messages=messages,
    max_tokens=2048
)
print(f"Response costs: {time.time() - start:.2f}s")
print(f"Generated text: {response.choices[0].message.content}")
```

For more usage examples, check out the vLLM user guide for multimodal models and the official Qwen3-VL GitHub Repository!
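As a variation on the client example above, an image stored on local disk can be sent inline instead of by URL. The sketch below assumes the server accepts base64 `data:` URLs in the `image_url` field, as OpenAI-compatible servers commonly do; `image_to_data_url` and `build_image_messages` are hypothetical helper names, not part of vLLM or the OpenAI client.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        # Fall back to a generic type when the extension is unknown.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_messages(path: str, prompt: str) -> list:
    """Build a messages list with one inline image and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The returned list can be passed as `messages` to `client.chat.completions.create` in place of the URL-based list, for example `build_image_messages("receipt.png", "Read all the text in the image.")`.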