Step 1
Run the following command:

./build/examples/llm/llm_inference \
--engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 \
--multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual \
--inputFile ./multi_input.json \
--outputFile ./output_vlm_qwen3-quant-carbin.json \
--dumpProfile true
The following error is reported:
[19:18:46.755] [INFO] [llmInferenceRuntime.cpp:218:LLMInferenceRuntime] Start loading tokenizer from model directory: /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4
[19:18:48.516] [INFO] [tokenizer.cpp:383:loadVocabulary] Loaded 151643 vocabulary tokens
[19:18:48.682] [INFO] [tokenizer.cpp:94:loadFromHF] Loaded 26 special tokens
[19:18:48.767] [INFO] [tokenizer.cpp:729:loadChatTemplate] Successfully loaded chat template from /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/processed_chat_template.json (for model: ./quantized/Qwen3-VL-4B-Instruct-nvfp4)
[19:18:48.767] [INFO] [tokenizer.cpp:121:loadFromHF] Successfully loaded tokenizer from /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 (vocab_size=151669)
[19:18:48.785] [INFO] [TensorRT] Loaded engine size: 799 MiB
[19:18:48.920] [INFO] [TensorRT] [MS] Running engine with multi stream info
[19:18:48.920] [INFO] [TensorRT] [MS] Number of aux streams is 3
[19:18:48.920] [INFO] [TensorRT] [MS] Number of total worker streams is 4
[19:18:48.920] [INFO] [TensorRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[19:18:49.120] [INFO] [TensorRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +101, now: CPU 0, GPU 3585 (MiB)
[19:18:49.140] [INFO] [llmInferenceRuntime.cpp:268:operator()] Vision runner successfully initialized
[19:18:49.140] [INFO] [TensorRT] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[19:18:49.777] [INFO] [llmInferenceRuntime.cpp:843:captureDecodingCUDAGraph] LLMInferenceRuntime(): Successfully captured the decoding CUDA graph for all execution batch sizes and LoRA weights.
[19:18:49.777] [INFO] [llm_inference.cpp:768:main] Processing 2 batched requests...
[19:18:49.777] [INFO] [llm_inference.cpp:778:main] Progress: 1/2 (50.000000%)
[19:18:49.782] [INFO] [TensorRT] Switching optimization profile from: 1 to 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
terminate called after throwing an instance of 'std::runtime_error'
what(): CUDA runtime error in cudaStreamSynchronize(stream): an illegal memory access was encountered
./start_qwen3_vl_4b_quant.sh: line 5: 33928 Aborted (core dumped) ./build/examples/llm/llm_inference --engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 --multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual --inputFile ./multi_input.json --outputFile ./output_vlm_qwen3-quant-carbin.json
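For triage, the same invocation can be run under NVIDIA's compute-sanitizer to localize which kernel performs the illegal access (a diagnostic sketch, not something tried here; it assumes compute-sanitizer from the CUDA toolkit is on PATH):

# Memcheck is the default tool; it reports the faulting kernel and address.
compute-sanitizer --tool memcheck ./build/examples/llm/llm_inference \
    --engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 \
    --multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual \
    --inputFile ./multi_input.json \
    --outputFile ./output_vlm_qwen3-quant-carbin.json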
"batch_size": 1,
"temperature": 1.0,
"top_p": 1.0,
"top_k": 50,
"max_generate_length": 128,
"requests": [
{
"messages": [
{
"role": "system",
"content": "你是一个专业的汽车座舱场景分析助手,具备精准的视觉理解与结构化信息提取能力。xxx"
},
{
"role": "user",
"content": [
{"type": "image", "image": "/ai_data/TensorRT-Edge-LLM-0.6.0/input/images/carbin.png"},
{"type": "text", "text": "所给图片是汽车座舱内的照片。描述图片内容"
}
]
}
],
"save_system_prompt_kv_cache": true
},
{
"messages": [
{
"role": "system",
"content": "你是一个专业的汽车座舱场景分析助手,具备精准的视觉理解与结构化信息提取能力。xxx"
},
{
"role": "user",
"content": [
{"type": "image", "image": "/ai_data/TensorRT-Edge-LLM-0.6.0/input/images/carbin.png"},
{"type": "text", "text": "图中有几个人,只需回答数字即可"
}
]
}
]
}
]
}
System information (Edge Device)

Other
Adding "save_system_prompt_kv_cache": true to a request causes the inference error above; without "save_system_prompt_kv_cache": true, inference runs normally.
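One way to confirm the flag is the sole trigger is to strip it from an otherwise identical input and rerun (a sketch assuming jq is installed; multi_input_nokv.json is a hypothetical file name):

# Remove the flag from every request entry (a no-op for requests that lack it).
jq 'del(.requests[].save_system_prompt_kv_cache)' ./multi_input.json > ./multi_input_nokv.json

# Rerun with the stripped input; per the observation above, this configuration completes normally.
./build/examples/llm/llm_inference \
    --engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 \
    --multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual \
    --inputFile ./multi_input_nokv.json \
    --outputFile ./output_vlm_qwen3-quant-carbin.json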