Step 1
Run the following command:

./build/examples/llm/llm_inference \
--engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 \
--multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual \
--inputFile ./multi_input.json \
--outputFile ./output_vlm_qwen3-quant-carbin.json \
--dumpProfile true
The following error is reported:
[19:18:46.755] [INFO] [llmInferenceRuntime.cpp:218:LLMInferenceRuntime] Start loading tokenizer from model directory: /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4
[19:18:48.516] [INFO] [tokenizer.cpp:383:loadVocabulary] Loaded 151643 vocabulary tokens
[19:18:48.682] [INFO] [tokenizer.cpp:94:loadFromHF] Loaded 26 special tokens
[19:18:48.767] [INFO] [tokenizer.cpp:729:loadChatTemplate] Successfully loaded chat template from /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/processed_chat_template.json (for model: ./quantized/Qwen3-VL-4B-Instruct-nvfp4)
[19:18:48.767] [INFO] [tokenizer.cpp:121:loadFromHF] Successfully loaded tokenizer from /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 (vocab_size=151669)
[19:18:48.785] [INFO] [TensorRT] Loaded engine size: 799 MiB
[19:18:48.920] [INFO] [TensorRT] [MS] Running engine with multi stream info
[19:18:48.920] [INFO] [TensorRT] [MS] Number of aux streams is 3
[19:18:48.920] [INFO] [TensorRT] [MS] Number of total worker streams is 4
[19:18:48.920] [INFO] [TensorRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[19:18:49.120] [INFO] [TensorRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +101, now: CPU 0, GPU 3585 (MiB)
[19:18:49.140] [INFO] [llmInferenceRuntime.cpp:268:operator()] Vision runner successfully initialized
[19:18:49.140] [INFO] [TensorRT] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[19:18:49.777] [INFO] [llmInferenceRuntime.cpp:843:captureDecodingCUDAGraph] LLMInferenceRuntime(): Successfully captured the decoding CUDA graph for all execution batch sizes and LoRA weights.
[19:18:49.777] [INFO] [llm_inference.cpp:768:main] Processing 2 batched requests...
[19:18:49.777] [INFO] [llm_inference.cpp:778:main] Progress: 1/2 (50.000000%)
[19:18:49.782] [INFO] [TensorRT] Switching optimization profile from: 1 to 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
terminate called after throwing an instance of 'std::runtime_error'
what(): CUDA runtime error in cudaStreamSynchronize(stream): an illegal memory access was encountered
./start_qwen3_vl_4b_quant.sh: line 5: 33928 Aborted (core dumped) ./build/examples/llm/llm_inference --engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 --multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual --inputFile ./multi_input.json --outputFile ./output_vlm_qwen3-quant-carbin.json
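For triage, the same invocation can be run under NVIDIA's compute-sanitizer to localize which kernel performs the illegal access (a diagnostic sketch, not something tried here; it assumes compute-sanitizer from the CUDA toolkit is on PATH):

# Memcheck is the default tool; it reports the faulting kernel and address.
compute-sanitizer --tool memcheck ./build/examples/llm/llm_inference \
    --engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 \
    --multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual \
    --inputFile ./multi_input.json \
    --outputFile ./output_vlm_qwen3-quant-carbin.json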
"batch_size": 1,
"temperature": 1.0,
"top_p": 1.0,
"top_k": 50,
"max_generate_length": 128,
"requests": [
{
"messages": [
{
"role": "system",
"content": "你是一个专业的汽车座舱场景分析助手,具备精准的视觉理解与结构化信息提取能力。xxx"
},
{
"role": "user",
"content": [
{"type": "image", "image": "/ai_data/TensorRT-Edge-LLM-0.6.0/input/images/carbin.png"},
{"type": "text", "text": "所给图片是汽车座舱内的照片。描述图片内容"
}
]
}
],
"save_system_prompt_kv_cache": true
},
{
"messages": [
{
"role": "system",
"content": "你是一个专业的汽车座舱场景分析助手,具备精准的视觉理解与结构化信息提取能力。xxx"
},
{
"role": "user",
"content": [
{"type": "image", "image": "/ai_data/TensorRT-Edge-LLM-0.6.0/input/images/carbin.png"},
{"type": "text", "text": "图中有几个人,只需回答数字即可"
}
]
}
]
}
]
}
System information (Edge Device)

Other
Adding "save_system_prompt_kv_cache": true to a request causes the inference error above; without "save_system_prompt_kv_cache": true, inference runs normally.
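One way to confirm the flag is the sole trigger is to strip it from an otherwise identical input and rerun (a sketch assuming jq is installed; multi_input_nokv.json is a hypothetical file name):

# Remove the flag from every request entry (a no-op for requests that lack it).
jq 'del(.requests[].save_system_prompt_kv_cache)' ./multi_input.json > ./multi_input_nokv.json

# Rerun with the stripped input; per the observation above, this configuration completes normally.
./build/examples/llm/llm_inference \
    --engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 \
    --multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual \
    --inputFile ./multi_input_nokv.json \
    --outputFile ./output_vlm_qwen3-quant-carbin.json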