This repo tries to make RKNN LLM usage easier for people who don't want to read through Rockchip's docs. The main repo is https://github.com/Pelochus/ezrknpu, where you can find more instructions and documentation for general use. This repo focuses on the details of RKLLM and on how to convert models.
Keep in mind this repo is focused on:
- High-end Rockchip SoCs, mainly the RK3588
- Linux, not Android
- Linux kernels from Rockchip (as of writing, Rockchip's 5.10 and 6.1 kernels should work; if your board runs one of those versions, it is very likely Rockchip's kernel)
First, clone the repo:

`git clone https://github.com/Pelochus/ezrknn-llm`

Then run the install script:

`cd ezrknn-llm && bash install.sh`

To run an LLM (assuming you are in the folder where your `.rkllm` file is located):

`rkllm name-of-the-model.rkllm # Or any other model you like`
To convert models yourself, you need an x86 Linux PC (Intel or AMD). Currently, Rockchip does not provide ARM support for converting models, so it cannot be done on an Orange Pi or similar. You can use my Docker container, which makes things easier; however, I do not provide a guide for this (it used to be easier), so you are on your own here:

`docker run -it pelochus/ezrkllm-toolkit:latest bash`
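Inside the container (or any x86 Linux environment with the RKLLM-Toolkit wheel installed), conversion is driven from Python. Below is a minimal sketch of what a conversion script looks like with Rockchip's toolkit API; the model path, output name and quantization settings are placeholders, and parameter names can vary between toolkit versions, so check the examples shipped with your toolkit.

```python
# Minimal RKLLM-Toolkit conversion sketch (placeholder paths and settings,
# assuming the rkllm-toolkit wheel is installed; verify parameter names
# against the examples shipped with your toolkit version).
from rkllm.api import RKLLM

llm = RKLLM()

# Load a Hugging Face model from a local directory
ret = llm.load_huggingface(model='./Qwen2-0.5B-Instruct')
if ret != 0:
    raise SystemExit(f'load_huggingface failed with code {ret}')

# Quantize and build for the target SoC (w8a8 is a common choice for the RK3588)
ret = llm.build(do_quantization=True,
                quantized_dtype='w8a8',
                target_platform='rk3588')
if ret != 0:
    raise SystemExit(f'build failed with code {ret}')

# Export the .rkllm file that the board-side runtime consumes
ret = llm.export_rkllm('./qwen2-0.5b-rk3588.rkllm')
if ret != 0:
    raise SystemExit(f'export_rkllm failed with code {ret}')
```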
I recommend checking models already converted by the community on Hugging Face. Search for something like "name-of-the-model RK3588" or similar. Thanks to everyone converting models!

Check this Reddit post if your LLM seems to be responding with garbage:
https://www.reddit.com/r/RockchipNPU/comments/1cpngku/rknnllm_v101_lets_talk_about_converting_and/
The RKLLM software stack helps users quickly deploy AI models on Rockchip chips. The overall workflow is as follows: to use the RKNPU, users first run the RKLLM-Toolkit on a PC to convert the trained model into an RKLLM-format model, and then run inference on the development board through the RKLLM C API.
- RKLLM-Toolkit is a software development kit for users to perform model conversion and quantization on a PC.
- RKLLM Runtime provides C/C++ programming interfaces for the Rockchip NPU platform, helping users deploy RKLLM models and accelerate the implementation of LLM applications.
- The RKNPU kernel driver is responsible for interacting with the NPU hardware. It is open source and can be found in the Rockchip kernel code.
Currently supported platforms:

- RK3588 Series
- RK3576 Series
- RK3562 Series
Currently supported models:

- LLAMA models
- TinyLLAMA models
- Qwen models
- Phi models
- ChatGLM3-6B
- Gemma2
- Gemma3
- InternLM2 models
- MiniCPM models
- TeleChat models
- Qwen2-VL-2B-Instruct
- Qwen2-VL-7B-Instruct
- MiniCPM-V-2_6
- DeepSeek-R1-Distill
- Janus-Pro-1B
- InternVL2-1B
- Qwen2.5-VL-3B-Instruct
- Qwen3
| model | platform | dtype | seqlen | max_context | new_tokens | TTFT(ms) | Tokens/s | memory(G) |
|---|---|---|---|---|---|---|---|---|
| Qwen2-0.5B | RK3562 | w4a16_g128 | 64 | 320 | 256 | 524 | 5.67 | 0.39 |
| | RK3562 | w4a8_g32 | 64 | 320 | 256 | 873 | 12.00 | 0.48 |
| | RK3562 | w8a8 | 64 | 320 | 256 | 477 | 11.50 | 0.61 |
| | RK3576 | w4a16 | 64 | 320 | 256 | 204 | 34.50 | 0.4 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 212 | 32.40 | 0.4 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 79 | 41.50 | 0.62 |
| | RK3588 | w8a8_g128 | 64 | 320 | 256 | 183 | 25.07 | 0.75 |
| TinyLLAMA-1.1B | RK3576 | w4a16 | 64 | 320 | 256 | 345 | 21.10 | 0.77 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 410 | 18.50 | 0.8 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 140 | 24.21 | 1.25 |
| | RK3588 | w8a8_g512 | 64 | 320 | 256 | 195 | 20.08 | 1.29 |
| Qwen2-1.5B | RK3576 | w4a16 | 64 | 320 | 256 | 512 | 14.40 | 1.75 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 550 | 12.75 | 1.76 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 206 | 16.46 | 2.47 |
| | RK3588 | w8a8_g128 | 64 | 320 | 256 | 725 | 7 | 2.65 |
| Phi-3-3.8B | RK3576 | w4a16 | 64 | 320 | 256 | 975 | 6.6 | 2.16 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 1180 | 5.85 | 2.23 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 516 | 7.44 | 3.88 |
| | RK3588 | w8a8_g512 | 64 | 320 | 256 | 610 | 6.13 | 3.95 |
| ChatGLM3-6B | RK3576 | w4a16 | 64 | 320 | 256 | 1168 | 4.62 | 3.86 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 1583 | 3.82 | 3.96 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 800 | 4.95 | 6.69 |
| | RK3588 | w8a8_g128 | 64 | 320 | 256 | 2190 | 2.7 | 7.18 |
| Gemma2-2B | RK3576 | w4a16 | 64 | 320 | 256 | 628 | 8 | 3.63 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 776 | 7.4 | 3.63 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 342 | 9.67 | 4.84 |
| | RK3588 | w8a8_g128 | 64 | 320 | 256 | 1055 | 5.49 | 5.14 |
| InternLM2-1.8B | RK3576 | w4a16 | 64 | 320 | 256 | 475 | 13.3 | 1.59 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 572 | 11.95 | 1.62 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 206 | 15.66 | 2.38 |
| | RK3588 | w8a8_g512 | 64 | 320 | 256 | 298 | 12.66 | 2.45 |
| MiniCPM3-4B | RK3576 | w4a16 | 64 | 320 | 256 | 1397 | 4.8 | 2.7 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 1645 | 4.39 | 2.8 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 702 | 6.15 | 4.65 |
| | RK3588 | w8a8_g128 | 64 | 320 | 256 | 1691 | 3.42 | 5.06 |
| llama3-8B | RK3576 | w4a16 | 64 | 320 | 256 | 1608 | 3.6 | 5.63 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 2010 | 3 | 5.76 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 1128 | 3.79 | 9.21 |
| | RK3588 | w8a8_g512 | 64 | 320 | 256 | 1281 | 3.05 | 9.45 |
| Qwen3-0.6B | RK3576 | w4a16 | 64 | 320 | 256 | 288.70 | 21.44 | 0.57 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 313.44 | 18.36 | 0.61 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 100.00 | 32.60 | 0.75 |
| Qwen3-1.7B | RK3576 | w4a16 | 64 | 320 | 256 | 517.00 | 11.19 | 1.17 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 594.00 | 10.25 | 1.25 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 215.00 | 14.37 | 1.82 |
| Qwen3-4B | RK3576 | w4a16 | 64 | 320 | 256 | 1056.00 | 5.52 | 2.31 |
| | RK3576 | w4a16_g128 | 64 | 320 | 256 | 1308.00 | 5.05 | 2.47 |
| | RK3588 | w8a8 | 64 | 320 | 256 | 586.00 | 6.53 | 4.00 |
| multimodal model | image input size | vision model dtype | vision infer time(s) | vision memory(MB) | llm model dtype | seqlen | max_context | new_tokens | TTFT(ms) | Tokens/s | llm memory(G) | platform |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-VL-2B | (1, 3, 392, 392) | fp16 | 3.55 | 1436.52 | w4a16 | 256 | 384 | 128 | 2094.17 | 13.23 | 1.75 | RK3576 |
| | | fp16 | 3.28 | 1436.52 | w8a8 | 256 | 384 | 128 | 856.86 | 16.19 | 2.47 | RK3588 |
| MiniCPM-V-2_6 | (1, 3, 448, 448) | fp16 | 2.40 | 1031.30 | w4a16 | 128 | 256 | 128 | 2997.70 | 3.84 | 5.50 | RK3576 |
| | | fp16 | 3.27 | 976.98 | w8a8 | 128 | 256 | 128 | 1720.60 | 4.13 | 8.88 | RK3588 |
- This performance data was collected with the CPU and NPU of each platform running at their maximum frequencies.
- The script for setting the frequencies is located in the `scripts` directory.
- The vision models were tested using all NPU cores with rknn-toolkit2 version 2.2.0.
- Run the frequency-setting script from the `scripts` directory on the target platform.
- Execute `export RKLLM_LOG_LEVEL=1` on the device to log model inference performance and memory usage.
- Use the `eval_perf_watch_cpu.sh` script to measure CPU utilization.
- Use the `eval_perf_watch_npu.sh` script to measure NPU utilization.
- You can download the latest package from RKLLM_SDK (fetch code: rkllm).
- You can download converted RKLLM models from rkllm_model_zoo (fetch code: rkllm).
- Multimodal deployment demo: Qwen2-VL_Demo
- API usage demo: DeepSeek-R1-Distill-Qwen-1.5B_Demo
- API server demo: rkllm_server_demo (see the client sketch after this list)
- Multimodal interactive dialogue demo: Multimodal_Interactive_Dialogue_Demo
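As a rough illustration of how such a server can be queried, here is a hypothetical Python client. It assumes the Flask variant of rkllm_server_demo is listening on port 8080 with an OpenAI-style chat endpoint at `/rkllm_chat`; the host, port, endpoint path and payload fields are assumptions, so verify them against the client script shipped with the demo.

```python
# Hypothetical client for the Flask variant of rkllm_server_demo.
# Host, port, endpoint path and payload fields are assumptions; check the
# demo's own client script for the exact interface before relying on this.
import requests

SERVER_URL = 'http://192.168.1.100:8080/rkllm_chat'  # replace with your board's address

payload = {
    'model': 'your_model_deploy_with_RKLLM_Server',
    'messages': [
        {'role': 'user', 'content': 'What can the RK3588 NPU do?'}
    ],
    'stream': False,
}

response = requests.post(SERVER_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json())
```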
The supported Python versions are:
- Python 3.8
- Python 3.9
- Python 3.10
- Python 3.11
- Python 3.12
Note: Before installing the package in a Python 3.12 environment, please run the command:

`export BUILD_CUDA_EXT=0`
- On some platforms, you may encounter an error indicating that `libomp.so` cannot be found. To resolve this, locate the library in the corresponding cross-compilation toolchain and place it in the board's `lib` directory, at the same level as `librkllmrt.so`.
- Latest version: v1.2.0
If you want to deploy additional AI models, we have introduced an SDK called RKNN-Toolkit2. For details, please refer to:
https://github.com/airockchip/rknn-toolkit2
Changes in v1.2.0:

- Supports custom model conversion.
- Supports chat_template configuration.
- Enables multi-turn dialogue interactions.
- Implements automatic prompt cache reuse for improved inference efficiency.
- Expands maximum context length to 16K.
- Supports embedding flash storage to reduce memory usage.
- Introduces the GRQ Int4 quantization algorithm.
- Supports GPTQ-Int8 model conversion.
- Compatible with the RK3562 platform.
- Added support for visual multimodal models such as InternVL2, Janus, and Qwen2.5-VL.
- Supports CPU core configuration.
- Added support for Gemma3.
- Added support for Python 3.9/3.11/3.12.
For older versions, please refer to the CHANGELOG.