Commit 1cb5058

Merge branch 'main' into feat/concat-slice
2 parents d1a8806 + 2a5a950 commit 1cb5058

46 files changed

Lines changed: 531 additions & 210 deletions

docs/qnn_backend/aot_execute.rst

Lines changed: 125 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,125 @@
1+
QNN AOT Execution Flow
2+
================================================================
3+
4+
.. note::
5+
Please refer to the `Environment Setup <setup_env.html>`_ documentation to configure the QNN and Hexagon SDK environments before proceeding.
6+
7+
This document explains the main execution flow of QNN AOT (Ahead-of-Time). The implementation uses the offline compilation capabilities of the Qualcomm QNN framework to run fully integer-quantized Large Language Models (LLMs) efficiently on mobile devices. Offline compilation is the de facto workflow for LLM execution on the Hexagon NPU.
8+
9+
Specifically, our implementation employs a W4A16 quantization scheme. The Key-Value (KV) Cache is quantized to ``uint8``, and the linear weights are quantized using Low-Power Blockwise Quantization (LPBQ).
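As a rough illustration of what "blockwise" 4-bit weight quantization means, the sketch below quantizes a flat weight list in fixed-size blocks, each with its own scale. It is a generic per-block symmetric int4 scheme written for this note, not LPBQ itself: it does not reproduce how LPBQ encodes its block scales, and the function names and block size are hypothetical.

```python
def quantize_blockwise_int4(weights, block_size):
    """Quantize a flat list of floats to signed 4-bit codes, one scale per block.

    Generic illustration only; this is NOT the LPBQ scale encoding.
    """
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to code 7.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate float weights from (scale, codes) pairs."""
    return [code * scale for scale, codes in blocks for code in codes]


weights = [7.0, -3.0, 0.0, 1.0, 0.875, -0.25, 0.125, 0.0]
blocks = quantize_blockwise_int4(weights, block_size=4)
restored = dequantize_blockwise(blocks)
```

Because each block carries its own scale, the small values in the second block keep their resolution even though the first block contains much larger weights.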
10+
11+
The implementation of this module was inspired by the `PyTorch ExecuTorch`_ project, especially its `Hybrid Execution Mode`_ designed for the Qualcomm backend, for which we are grateful.
12+
13+
.. _PyTorch ExecuTorch: https://pytorch.org/executorch/
14+
.. _Hybrid Execution Mode: https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/README.md
15+
16+
Overall Flow
17+
----------------------------------------------------------------
18+
19+
The QNN AOT execution flow is mainly divided into three stages:
20+
21+
1. **Model Quantization and Export (Python)**: On the host machine, a Python script quantizes the pre-trained floating-point model and exports it to a ``.safetensors`` file. The ``.safetensors`` file is then converted to a ``.mllm`` file using ``mllm-convertor``.
22+
2. **Offline Compilation (C++)**: On the host machine, a C++ compiler program loads the ``.mllm`` file, invokes the QNN toolchain for model compilation, graph optimization, and quantization parameter adjustment, and finally generates a QNN Context Binary.
23+
3. **On-Device Execution (C++)**: On the target device (e.g., a mobile phone), the AOT runner program loads the pre-compiled context binary and executes inference.
24+
25+
26+
Detailed Steps
27+
----------------------------------------------------------------
28+
29+
Taking ``qwen3_qnn_aot`` as an example, the detailed steps are as follows.
30+
31+
1. **Model Quantization and Export**
32+
33+
First, we need to run a Python script on the host to quantize the model and export it as a ``.safetensors`` file.
34+
35+
.. code-block:: shell
36+
37+
cd ./pymllm/backends/qualcomm/transformers/qwen3
38+
python train.py --model_path "/your/qwen3/model/path/" --max_length 1024 --num_samples 128 --output_dir "/path/to/output"
39+
40+
This step generates a key file:
41+
42+
* ``model.safetensors``: The quantized model file, saved in the specified output directory.
43+
44+
Next, convert the exported ``.safetensors`` model to the MLLM format (``.mllm``) using the ``mllm-convertor`` script.
45+
46+
.. code-block:: shell
47+
pip install pymllm
48+
49+
mllm-convertor --input_path /path/to/output/model.safetensors --output_path /path/to/output/qwen3_1.7b.mllm
50+
51+
This will generate the ``qwen3_1.7b.mllm`` file, which will be used in the subsequent compilation step.
52+
53+
2. **Offline Compilation to Generate QNN Context**
54+
55+
Next, we use a C++ compiler program (``compile.cpp``) on the host to generate the QNN context. This process invokes the QNN SDK to convert the MLLM IR into a QNN-supported format and performs optimizations.
56+
57+
Compile and run the ``compile`` program:
58+
59+
.. code-block:: shell
60+
61+
# In the mllm-v2 project root directory
62+
python task.py tasks/build_x86_qnn_aot.yaml
63+
64+
# Run the compiler program
65+
./build-qnn-aot/bin/mllm-qwen3-aot-sha-c \
66+
-m /path/to/output/qwen3_1.7b.mllm \
67+
-c ./examples/qwen3_qnn_aot/config_1.7B.json \
68+
--aot_config ./examples/qwen3_qnn_aot/qnn_aot_cfg_1.7B.json
69+
70+
71+
This program reads the ``.mllm`` model file and the quantization recipe, and finally generates a QNN context binary file named ``qwen3-1.7B-lpbq-sha.bin``. This file contains all the information needed to execute inference on the target device.
72+
73+
.. note::
74+
The ``HtpSignedPd`` setting in ``qnn_aot_cfg_1.7B.json`` sets ``QNN_HTP_DEVICE_CONFIG_OPTION_SIGNEDPD`` during QNN initialization, which may cause an "Unsupported config option 2" error in older QNN versions. It is recommended to change the setting in the JSON file to ``HtpUnsignedPd``.
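The change recommended in this note can be scripted. The sketch below is a minimal example that assumes ``HtpSignedPd`` appears somewhere in the JSON as a string value; the layout of ``qnn_aot_cfg_1.7B.json`` is not shown here, so the code walks the whole tree instead of relying on a particular key. The function names and the key names in the sample are hypothetical.

```python
import json


def switch_to_unsigned_pd(node):
    """Return a copy of a parsed JSON tree with 'HtpSignedPd' values replaced."""
    if isinstance(node, dict):
        return {key: switch_to_unsigned_pd(value) for key, value in node.items()}
    if isinstance(node, list):
        return [switch_to_unsigned_pd(value) for value in node]
    if node == "HtpSignedPd":
        return "HtpUnsignedPd"
    return node


def patch_config_file(path):
    """Rewrite a JSON config file in place (assumes UTF-8 JSON)."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(switch_to_unsigned_pd(cfg), f, indent=2)


# Hypothetical layout, for illustration only.
sample = {"target_machine": {"security_pd": "HtpSignedPd", "soc": ["x", 1]}}
patched = switch_to_unsigned_pd(sample)
```

Calling ``patch_config_file("./examples/qwen3_qnn_aot/qnn_aot_cfg_1.7B.json")`` would apply the same replacement to the config used in the compile command above.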
75+
76+
3. **On-Device AOT Inference**
77+
78+
Finally, we push the generated ``qwen3-1.7B-lpbq-sha.bin`` file and other resources like the tokenizer to the target device. The on-device AOT runner program (``aot_run.cpp``) will load this binary file and execute inference.
79+
80+
Compile and run the ``aot_run`` program:
81+
82+
.. code-block:: shell
83+
84+
# Cross-compile the aot_run program for the target device (e.g., Android)
85+
python task.py tasks/build_android_qnn.yaml
86+
87+
# Push compiled context file to the device
88+
adb push qwen3-1.7B-lpbq-sha.bin /data/local/tmp/
89+
90+
# Push QNN libraries and Op Packages
91+
ANDR_LIB=$QNN_SDK_ROOT/lib/aarch64-android
92+
OP_PATH=mllm/backends/qnn/custom-op-package/LLaMAPackage/build
93+
94+
adb push $ANDR_LIB/libQnnHtp.so /data/local/tmp
95+
adb push $ANDR_LIB/libQnnHtpV75Stub.so /data/local/tmp
96+
adb push $ANDR_LIB/libQnnHtpPrepare.so /data/local/tmp
97+
adb push $ANDR_LIB/libQnnHtpProfilingReader.so /data/local/tmp
98+
adb push $ANDR_LIB/libQnnHtpOptraceProfilingReader.so /data/local/tmp
99+
adb push $ANDR_LIB/libQnnHtpV75CalculatorStub.so /data/local/tmp
100+
adb push $QNN_SDK_ROOT/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so /data/local/tmp
101+
adb push $QNN_SDK_ROOT/lib/aarch64-android/libQnnSystem.so /data/local/tmp
102+
103+
adb push $OP_PATH/aarch64-android/libQnnLLaMAPackage.so /data/local/tmp/libQnnLLaMAPackage_CPU.so
104+
adb push $OP_PATH/hexagon-v75/libQnnLLaMAPackage.so /data/local/tmp/libQnnLLaMAPackage_HTP.so
105+
106+
# Push mllm runner and libs to device
107+
adb push build-android-arm64-v8a-qnn/bin/*.so /data/local/tmp
108+
adb push build-android-arm64-v8a-qnn/bin/mllm-qwen3-aot-runner /data/local/tmp
109+
110+
# Execute on the device
111+
adb shell "cd /data/local/tmp && export LD_LIBRARY_PATH=. &&
112+
./mllm-qwen3-aot-runner -m qwen3-1.7B-lpbq-sha.bin \
113+
-t qwen3-tokenizer.json -c config_1.7B.json --ar_len 32"
114+
115+
The AOT runner program loads the ``.bin`` file to initialize the QNN context, then receives input tokens, performs model inference, and outputs the next token, thus realizing the language model generation process.
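The push commands above hard-code Hexagon ``v75``. The sketch below rebuilds the same list of ``adb push`` commands with the Hexagon version as a parameter. It assumes that other Hexagon versions follow the same file-naming pattern as the ``V75`` files listed above; the function names are hypothetical, and nothing is executed, the commands are only assembled.

```python
import posixpath


def qnn_push_plan(sdk_root, op_path, hexagon="75", dest="/data/local/tmp"):
    """Return (source, destination) pairs mirroring the adb push list above."""
    android_lib = posixpath.join(sdk_root, "lib/aarch64-android")
    tag = f"V{hexagon}"
    android_names = [
        "libQnnHtp.so",
        f"libQnnHtp{tag}Stub.so",
        "libQnnHtpPrepare.so",
        "libQnnHtpProfilingReader.so",
        "libQnnHtpOptraceProfilingReader.so",
        f"libQnnHtp{tag}CalculatorStub.so",
        "libQnnSystem.so",
    ]
    plan = [(posixpath.join(android_lib, name), dest) for name in android_names]
    # DSP-side skeleton library from the unsigned directory.
    plan.append((
        posixpath.join(sdk_root, f"lib/hexagon-v{hexagon}/unsigned/libQnnHtp{tag}Skel.so"),
        dest,
    ))
    # Op packages are renamed on the device to distinguish CPU and HTP builds.
    plan.append((
        posixpath.join(op_path, "aarch64-android/libQnnLLaMAPackage.so"),
        posixpath.join(dest, "libQnnLLaMAPackage_CPU.so"),
    ))
    plan.append((
        posixpath.join(op_path, f"hexagon-v{hexagon}/libQnnLLaMAPackage.so"),
        posixpath.join(dest, "libQnnLLaMAPackage_HTP.so"),
    ))
    return plan


def as_adb_commands(plan):
    """Turn the plan into argument lists suitable for subprocess.run."""
    return [["adb", "push", src, dst] for src, dst in plan]


plan = qnn_push_plan("/sdk", "mllm/backends/qnn/custom-op-package/LLaMAPackage/build")
commands = as_adb_commands(plan)
```

Each entry in ``commands`` can be passed to ``subprocess.run(cmd, check=True)`` once a device is attached.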
116+
117+
Hybrid Mode Explanation
118+
----------------------------------------------------------------
119+
120+
Our QNN AOT implementation adopts a hybrid mode similar to that of ExecuTorch to make both prompt processing and token generation efficient.
121+
122+
* **Prefill Phase**: When processing the user's input (Prompt) for the first time, the model calculates and caches the Key-Value (KV) states for all input tokens at once. This phase is computationally intensive but is performed only once.
123+
* **Decode Phase**: When generating subsequent tokens, the model takes only the previously generated token as input and uses the cached KV state for computation. This process is computationally light and fast, suitable for token-by-token generation.
124+
125+
In this way, we combine the advantages of batch processing and stream processing to improve overall throughput while ensuring low latency.
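The two phases can be sketched as a small generation loop. This is a toy model, not the runner's code: ``toy_forward`` stands in for the compiled graph, and processing the prompt in chunks of ``ar_len`` tokens is an assumption about what the ``--ar_len 32`` flag means. All names are hypothetical.

```python
def toy_forward(tokens, kv_cache):
    """Stand-in for the compiled graph: caches one entry per token, returns a token id."""
    kv_cache.extend(tokens)
    return (sum(kv_cache) + len(kv_cache)) % 1000


def generate(prompt, max_new_tokens, ar_len=32):
    """Run prefill over the prompt, then decode one token per step."""
    if not prompt or max_new_tokens < 1:
        raise ValueError("need a non-empty prompt and max_new_tokens >= 1")
    kv_cache = []
    forward_calls = 0

    # Prefill: feed the prompt in chunks of up to ar_len tokens (assumed behavior).
    for start in range(0, len(prompt), ar_len):
        next_token = toy_forward(prompt[start:start + ar_len], kv_cache)
        forward_calls += 1

    # Decode: each step feeds only the latest token and reuses the cache.
    output = [next_token]
    while len(output) < max_new_tokens:
        next_token = toy_forward([output[-1]], kv_cache)
        forward_calls += 1
        output.append(next_token)
    return output, len(kv_cache), forward_calls


tokens, cache_len, calls = generate(list(range(70)), max_new_tokens=5, ar_len=32)
```

With a 70-token prompt and ``ar_len=32``, prefill takes three forward calls and decode takes four more, one for each new token after the first.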

docs/qnn_backend/index.rst

Lines changed: 1 addition & 1 deletion
@@ -6,4 +6,4 @@ QNN Backend
66

77
setup_env
88
core_design
9-
qnn_model_convert
9+
aot_execute

docs/qnn_backend/setup_env.rst

Lines changed: 4 additions & 0 deletions
@@ -98,6 +98,10 @@ Compilation Commands
9898
9999
This will build the necessary QNN op packages for both AArch64 and HVX v75 targets.
100100

101+
.. note::
102+
The Hexagon tools version in the Makefile may change. If compilation fails, please update the version number in the Makefile accordingly.
103+
104+
101105
Development Tips
102106
----------------
103107

examples/llama_qnn_aot/compile.cpp

Lines changed: 4 additions & 1 deletion
@@ -17,6 +17,9 @@ MLLM_MAIN({
1717
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model file path.");
1818
auto& model_cfg_path = Argparse::add<std::string>("-c|--config").help("Model config file path.");
1919
auto& qnn_aot_cfg_files = Argparse::add<std::string>("-aot_cfg|--aot_config").help("AOT Config file path.");
20+
auto& qnn_env_path = Argparse::add<std::string>("-qnn_env|--qnn_env_path")
21+
.def("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/")
22+
.help("QNN AOT Environment path.");
2023

2124
Argparse::parse(argc, argv);
2225

@@ -47,7 +50,7 @@ MLLM_MAIN({
4750
model.load(params);
4851

4952
// Create Qnn AOT Model
50-
auto qnn_aot_env = mllm::qnn::aot::QnnAOTEnv("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/",
53+
auto qnn_aot_env = mllm::qnn::aot::QnnAOTEnv(qnn_env_path.get(),
5154
mllm::qnn::aot::parseQcomTargetMachineFromJSONFile(qnn_aot_cfg_files.get()));
5255

5356
// Model length 32.
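With the path now exposed as ``--qnn_env_path``, a host-side wrapper can derive the value instead of relying on the built-in default. The sketch below assumes that ``QNN_SDK_ROOT`` (the variable used in the docs' push commands) points at the SDK root and that the x86 libraries live under ``lib/x86_64-linux-clang/``, as the default path suggests. The helper names are hypothetical.

```python
import os

# Default taken from the .def(...) value in this change.
DEFAULT_QNN_ENV_PATH = "/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/"


def resolve_qnn_env_path(environ=None):
    """Prefer $QNN_SDK_ROOT/lib/x86_64-linux-clang/, else fall back to the default."""
    environ = os.environ if environ is None else environ
    sdk_root = environ.get("QNN_SDK_ROOT")
    if not sdk_root:
        return DEFAULT_QNN_ENV_PATH
    return sdk_root.rstrip("/") + "/lib/x86_64-linux-clang/"


def qnn_env_args(environ=None):
    """Arguments to append to the compile program's command line."""
    return ["--qnn_env_path", resolve_qnn_env_path(environ)]


args = qnn_env_args({"QNN_SDK_ROOT": "/opt/qairt/"})
```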

examples/llama_qnn_aot/compile_sha.cpp

Lines changed: 4 additions & 1 deletion
@@ -25,6 +25,9 @@ MLLM_MAIN({
2525
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model file path.");
2626
auto& model_cfg_path = Argparse::add<std::string>("-c|--config").help("Model config file path.");
2727
auto& qnn_aot_cfg_files = Argparse::add<std::string>("-aot_cfg|--aot_config").help("AOT Config file path.");
28+
auto& qnn_env_path = Argparse::add<std::string>("-qnn_env|--qnn_env_path")
29+
.def("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/")
30+
.help("QNN AOT Environment path.");
2831

2932
Argparse::parse(argc, argv);
3033

@@ -73,7 +76,7 @@ MLLM_MAIN({
7376
model.load(params);
7477

7578
// Create Qnn AOT Model
76-
auto qnn_aot_env = mllm::qnn::aot::QnnAOTEnv("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/",
79+
auto qnn_aot_env = mllm::qnn::aot::QnnAOTEnv(qnn_env_path.get(),
7780
mllm::qnn::aot::parseQcomTargetMachineFromJSONFile(qnn_aot_cfg_files.get()));
7881

7982
// Model length 32.

examples/llama_qnn_aot/modeling_llama_qnn_aot.hpp

Lines changed: 1 addition & 1 deletion
@@ -83,7 +83,7 @@ Tensor QDQ_KV(nn::Module* m, Tensor in, const std::string& qdq_name_in_pytorch)
8383
case kUInt8PerTensorSym: {
8484
auto scale = m->getTopParameterFile()->pull(scale_name);
8585
auto zp = m->getTopParameterFile()->pull(zp_name);
86-
MLLM_RT_ASSERT_EQ(zp.item<mllm_int32_t>(), 0);
86+
MLLM_RT_ASSERT_EQ(zp.item<mllm_int32_t>(), 128);
8787

8888
// Is 128! not 127!
8989
auto new_zp = Tensor::constant(128, kInt32).setName(zp_name).setMemType(kParamsNormal);
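One reading of the new assertion: a symmetric ``uint8`` tensor is an ``int8`` tensor shifted by 128, so the code for real zero is 128 and the full range [-128, 127] maps onto [0, 255]. The sketch below illustrates that arithmetic only. It is an interpretation of the change, not code from the repository, and the function names are hypothetical.

```python
ZERO_POINT = 128  # value asserted in QDQ_KV for kUInt8PerTensorSym


def quantize_u8_sym(values, scale):
    """Quantize floats to uint8 codes with a fixed zero point of 128."""
    codes = []
    for v in values:
        q = round(v / scale) + ZERO_POINT
        codes.append(max(0, min(255, q)))  # clamp to the uint8 range
    return codes


def dequantize_u8_sym(codes, scale):
    """Map uint8 codes back to floats."""
    return [(q - ZERO_POINT) * scale for q in codes]


scale = 0.25
codes = quantize_u8_sym([-32.0, -1.0, 0.0, 0.25, 31.75], scale)
restored = dequantize_u8_sym(codes, scale)
```

With a zero point of 127 the same shift would leave the most negative ``int8`` code, -128, without a ``uint8`` counterpart, which is consistent with the "Is 128! not 127!" comment in the source.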

examples/llama_qnn_aot/modeling_llama_qnn_aot_sha.hpp

Lines changed: 1 addition & 1 deletion
@@ -88,7 +88,7 @@ Tensor QDQ_KV(nn::Module* m, Tensor in, const std::string& qdq_name_in_pytorch)
8888
case kUInt8PerTensorSym: {
8989
auto scale = m->getTopParameterFile()->pull(scale_name);
9090
auto zp = m->getTopParameterFile()->pull(zp_name);
91-
MLLM_RT_ASSERT_EQ(zp.item<mllm_int32_t>(), 0);
91+
MLLM_RT_ASSERT_EQ(zp.item<mllm_int32_t>(), 128);
9292

9393
// Is 128! not 127!
9494
auto new_zp = Tensor::constant(128, kInt32).setName(zp_name).setMemType(kParamsNormal);

examples/qwen2_qnn_aot/compile.cpp

Lines changed: 4 additions & 1 deletion
@@ -17,6 +17,9 @@ MLLM_MAIN({
1717
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model file path.");
1818
auto& model_cfg_path = Argparse::add<std::string>("-c|--config").help("Model config file path.");
1919
auto& qnn_aot_cfg_files = Argparse::add<std::string>("-aot_cfg|--aot_config").help("AOT Config file path.");
20+
auto& qnn_env_path = Argparse::add<std::string>("-qnn_env|--qnn_env_path")
21+
.def("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/")
22+
.help("QNN AOT Environment path.");
2023

2124
Argparse::parse(argc, argv);
2225

@@ -47,7 +50,7 @@ MLLM_MAIN({
4750
model.load(params);
4851

4952
// Create Qnn AOT Model
50-
auto qnn_aot_env = mllm::qnn::aot::QnnAOTEnv("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/",
53+
auto qnn_aot_env = mllm::qnn::aot::QnnAOTEnv(qnn_env_path.get(),
5154
mllm::qnn::aot::parseQcomTargetMachineFromJSONFile(qnn_aot_cfg_files.get()));
5255

5356
// Model length 32.

examples/qwen2_qnn_aot/compile_sha.cpp

Lines changed: 4 additions & 1 deletion
@@ -25,6 +25,9 @@ MLLM_MAIN({
2525
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model file path.");
2626
auto& model_cfg_path = Argparse::add<std::string>("-c|--config").help("Model config file path.");
2727
auto& qnn_aot_cfg_files = Argparse::add<std::string>("-aot_cfg|--aot_config").help("AOT Config file path.");
28+
auto& qnn_env_path = Argparse::add<std::string>("-qnn_env|--qnn_env_path")
29+
.def("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/")
30+
.help("QNN AOT Environment path.");
2831

2932
Argparse::parse(argc, argv);
3033

@@ -73,7 +76,7 @@ MLLM_MAIN({
7376
model.load(params);
7477

7578
// Create Qnn AOT Model
76-
auto qnn_aot_env = mllm::qnn::aot::QnnAOTEnv("/opt/qcom/aistack/qairt/2.41.0.251128/lib/x86_64-linux-clang/",
79+
auto qnn_aot_env = mllm::qnn::aot::QnnAOTEnv(qnn_env_path.get(),
7780
mllm::qnn::aot::parseQcomTargetMachineFromJSONFile(qnn_aot_cfg_files.get()));
7881

7982
// Model length 32.

examples/qwen2_qnn_aot/modeling_qwen2_qnn_aot.hpp

Lines changed: 1 addition & 1 deletion
@@ -83,7 +83,7 @@ Tensor QDQ_KV(nn::Module* m, Tensor in, const std::string& qdq_name_in_pytorch)
8383
case kUInt8PerTensorSym: {
8484
auto scale = m->getTopParameterFile()->pull(scale_name);
8585
auto zp = m->getTopParameterFile()->pull(zp_name);
86-
MLLM_RT_ASSERT_EQ(zp.item<mllm_int32_t>(), 0);
86+
MLLM_RT_ASSERT_EQ(zp.item<mllm_int32_t>(), 128);
8787

8888
// Is 128! not 127!
8989
auto new_zp = Tensor::constant(128, kInt32).setName(zp_name).setMemType(kParamsNormal);
