
Inference throws an exception when using save_system_prompt_kv_cache #74

@815034762

Description


Inference fails with an error when "save_system_prompt_kv_cache": true is added to a request in the requests array.

Step 1

Run the following command:

 ./build/examples/llm/llm_inference \
  --engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 \
  --multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual \
  --inputFile ./multi_input.json \
  --outputFile ./output_vlm_qwen3-quant-carbin.json \
  --dumpProfile true

The following error is reported:
[19:18:46.755] [INFO] [llmInferenceRuntime.cpp:218:LLMInferenceRuntime] Start loading tokenizer from model directory: /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4
[19:18:48.516] [INFO] [tokenizer.cpp:383:loadVocabulary] Loaded 151643 vocabulary tokens
[19:18:48.682] [INFO] [tokenizer.cpp:94:loadFromHF] Loaded 26 special tokens
[19:18:48.767] [INFO] [tokenizer.cpp:729:loadChatTemplate] Successfully loaded chat template from /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/processed_chat_template.json (for model: ./quantized/Qwen3-VL-4B-Instruct-nvfp4)
[19:18:48.767] [INFO] [tokenizer.cpp:121:loadFromHF] Successfully loaded tokenizer from /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 (vocab_size=151669)
[19:18:48.785] [INFO] [TensorRT] Loaded engine size: 799 MiB
[19:18:48.920] [INFO] [TensorRT] [MS] Running engine with multi stream info
[19:18:48.920] [INFO] [TensorRT] [MS] Number of aux streams is 3
[19:18:48.920] [INFO] [TensorRT] [MS] Number of total worker streams is 4
[19:18:48.920] [INFO] [TensorRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[19:18:49.120] [INFO] [TensorRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +101, now: CPU 0, GPU 3585 (MiB)
[19:18:49.140] [INFO] [llmInferenceRuntime.cpp:268:operator()] Vision runner successfully initialized
[19:18:49.140] [INFO] [TensorRT] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[19:18:49.777] [INFO] [llmInferenceRuntime.cpp:843:captureDecodingCUDAGraph] LLMInferenceRuntime(): Successfully captured the decoding CUDA graph for all execution batch sizes and LoRA weights.
[19:18:49.777] [INFO] [llm_inference.cpp:768:main] Processing 2 batched requests...
[19:18:49.777] [INFO] [llm_inference.cpp:778:main] Progress: 1/2 (50.000000%)
[19:18:49.782] [INFO] [TensorRT] Switching optimization profile from: 1 to 0. Please ensure there are no enqueued operations pending in this context prior to switching profiles
terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA runtime error in cudaStreamSynchronize(stream): an illegal memory access was encountered
./start_qwen3_vl_4b_quant.sh: line 5: 33928 Aborted                 (core dumped) ./build/examples/llm/llm_inference --engineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4 --multimodalEngineDir /ai_data/engines/Qwen3-VL-4B-Instruct-nvfp4/visual --inputFile ./multi_input.json --outputFile ./output_vlm_qwen3-quant-carbin.json

The contents of multi_input.json are as follows:

{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 128,
    "requests": [
        {
            "messages": [
                {
                    "role": "system",
                    "content": "You are a professional automotive-cabin scene analysis assistant with precise visual understanding and structured information-extraction capabilities. xxx"
                },
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": "/ai_data/TensorRT-Edge-LLM-0.6.0/input/images/carbin.png"},
                        {"type": "text", "text": "The image shows the interior of a car cabin. Describe the image content."
                        }
                    ]
                }
            ],
            "save_system_prompt_kv_cache": true
        },
        {
            "messages": [
                {
                    "role": "system",
                    "content": "You are a professional automotive-cabin scene analysis assistant with precise visual understanding and structured information-extraction capabilities. xxx"
                },
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": "/ai_data/TensorRT-Edge-LLM-0.6.0/input/images/carbin.png"},
                        {"type": "text", "text": "How many people are in the image? Answer with the number only."
                        }
                    ]
                }
            ]
        }
    ]
}
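As a sanity check (not part of the original report), the request structure can be validated with a short Python sketch before running inference; the message texts below are abbreviated placeholders, and the key point is that "save_system_prompt_kv_cache" must be a JSON boolean, not a string:

```python
import json

# Minimal, abbreviated version of the first request from multi_input.json.
request_json = """
{
    "batch_size": 1,
    "max_generate_length": 128,
    "requests": [
        {
            "messages": [
                {"role": "system", "content": "..."},
                {"role": "user", "content": [{"type": "text", "text": "..."}]}
            ],
            "save_system_prompt_kv_cache": true
        }
    ]
}
"""

cfg = json.loads(request_json)  # raises ValueError if the JSON is malformed
flag = cfg["requests"][0]["save_system_prompt_kv_cache"]
assert isinstance(flag, bool), "flag must be a JSON boolean, not a string"
print("save_system_prompt_kv_cache =", flag)
```

This rules out a malformed input file as the cause, leaving the flag itself as the trigger.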

System information (Edge Device)

  • Platform: Thor-x
  • Software release: 7.0.3
  • CPU architecture: aarch64
  • GPU compute capability:
  • Total device memory: 60G
  • Build type:
  • Library versions:
    • TensorRT Edge-LLM version or commit hash: ?
    • CUDA: 12.8
    • TensorRT: TensorRT-Edge-LLM 0.6.0
    • C++ compiler: GCC 13.3.0
  • CMake options used:
    • CMAKE_TOOLCHAIN_FILE: cmake/aarch64_linux_toolchain.cmake
    • EMBEDDED_TARGET: auto-thor
    • TRT_PACKAGE_DIR: /usr/src/tensorrt

Other notes

Inference works normally if "save_system_prompt_kv_cache": true is not included.
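Since the only difference between the failing and working runs is this one flag, the two input variants can be generated from a single base config to keep the A/B comparison clean. This is a hypothetical helper for reproducing the issue, not part of the project:

```python
import copy

def make_variants(base_cfg):
    """Return (with_flag, without_flag) copies of an input config that
    differ only in "save_system_prompt_kv_cache" on each request."""
    with_flag = copy.deepcopy(base_cfg)
    without_flag = copy.deepcopy(base_cfg)
    for req in with_flag["requests"]:
        req["save_system_prompt_kv_cache"] = True
    for req in without_flag["requests"]:
        req.pop("save_system_prompt_kv_cache", None)
    return with_flag, without_flag

# Abbreviated base config; real messages/paths go here.
base = {"batch_size": 1, "requests": [{"messages": []}]}
a, b = make_variants(base)
print("flag in variant A:", a["requests"][0].get("save_system_prompt_kv_cache"))
print("flag in variant B:", b["requests"][0].get("save_system_prompt_kv_cache"))
```

Writing each variant to its own JSON file and passing it via --inputFile should reproduce the crash only for the variant that sets the flag.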

Labels

bug
