Releases: airockchip/rknn-llm

release-v1.2.0

08 Apr 10:05
  • Supports custom model conversion.
  • Supports chat_template configuration.
  • Enables multi-turn dialogue interactions.
  • Implements automatic prompt cache reuse for improved inference efficiency.
  • Expands maximum context length to 16K.
  • Supports embedding flash storage to reduce memory usage.
  • Introduces the GRQ Int4 quantization algorithm (see the conversion sketch after this list).
  • Supports GPTQ-Int8 model conversion.
  • Adds compatibility with the RK3562 platform.
  • Adds support for visual multimodal models such as InternVL2, Janus, and Qwen2.5-VL.
  • Supports CPU core configuration.
  • Adds support for Gemma3.
  • Adds support for Python 3.9/3.11/3.12.
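
A minimal conversion sketch using the rkllm-toolkit Python API. The load_huggingface/build/export_rkllm flow follows the toolkit's published examples, but the model path is illustrative and the quantized_algorithm='grq' value is an assumption inferred from these notes; check the documentation shipped with v1.2.0 for the exact parameter names.

```python
# Sketch: custom model conversion with GRQ Int4 quantization (rkllm-toolkit).
# NOTE: the model path is illustrative, and quantized_algorithm='grq' is an
# assumption based on the release notes; verify against the v1.2.0 docs.
from rkllm.api import RKLLM

llm = RKLLM()

# Load the original HuggingFace checkpoint.
ret = llm.load_huggingface(model='./Qwen2.5-1.5B-Instruct')
assert ret == 0, 'load failed'

# Quantize to 4-bit weights with the GRQ algorithm introduced in this release.
ret = llm.build(do_quantization=True,
                quantized_dtype='w4a16',
                quantized_algorithm='grq',
                target_platform='rk3588')
assert ret == 0, 'build failed'

# Export the .rkllm artifact for on-device inference.
ret = llm.export_rkllm('./qwen2.5-1.5b-w4a16.rkllm')
assert ret == 0, 'export failed'
```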

release-v1.1.4

11 Dec 11:39
  • Add support for converting HuggingFace GPTQ-int4 models (requires group_size to be 32, 64, or 128, and desc_act set to false; see the check sketched after this list).
  • Add support for TeleChat/TeleChat2/MiniCPM-S models.
  • Add support for exporting the LLM component of Qwen2VL.
  • Resolve issues with LoRA inference.
  • Fix an import error related to IPython.
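
The GPTQ-int4 constraints above can be verified before conversion by reading the checkpoint's quantize_config.json, the standard metadata file in AutoGPTQ exports. A small pre-flight sketch; the model directory is illustrative.

```python
# Sketch: verify a HuggingFace GPTQ-int4 checkpoint meets the toolkit's
# requirements (group_size in {32, 64, 128}, desc_act == false) before
# attempting conversion. quantize_config.json is the standard AutoGPTQ
# metadata file; the model directory below is illustrative.
import json
from pathlib import Path

def check_gptq_config(model_dir: str) -> None:
    cfg = json.loads(Path(model_dir, 'quantize_config.json').read_text())
    group_size = cfg.get('group_size')
    if group_size not in (32, 64, 128):
        raise ValueError(f'unsupported group_size: {group_size}')
    if cfg.get('desc_act'):
        raise ValueError('desc_act must be false for rkllm conversion')

check_gptq_config('./Qwen2-1.5B-Instruct-GPTQ-Int4')
```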

release-v1.1.2

05 Nov 07:41
  • Fix an inference error in the ChatGLM3 model.
  • Fix an inference issue with embedding input.
  • Add support for exporting the LLM component of MiniCPM-V.

release-v1.1.1

18 Oct 10:17
  • Fixed the inference error in the MiniCPM3 model.
  • Fixed the runtime error in rkllm_server_demo.
  • Added the rkllm-toolkit installation package for Python 3.10.
  • Supported gguf model conversion when tie_word_embeddings is set to true.

release-v1.1.0

11 Oct 08:53
  • Added support for grouped quantization (w4a16 group sizes of 32/64/128, w8a8 group sizes of 128/256/512); see the sketch after this list.
  • Added the GDQ algorithm to improve 4-bit quantization accuracy.
  • Added hybrid quantization algorithm, supporting a combination of grouped and non-grouped quantization based on specified ratios.
  • Added support for Llama3, Gemma2, and Minicpm3 models.
  • Added support for gguf model conversion (currently supports q4_0 and fp16 only).
  • Added support for LoRA models.
  • Added storage and loading of the prompt cache.
  • Added PC-side emulation accuracy testing and inference interface support for rkllm-toolkit.
  • Fixed catastrophic forgetting issue when the token count exceeds max_context.
  • Optimized prefill speed.
  • Optimized generate speed.
  • Optimized model initialization time.
  • Added support for four input interfaces: prompt, embedding, token, and multimodal.
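
A grouped-quantization sketch under the same toolkit flow. The dtype string 'w4a16_g64' is an assumption derived from the group sizes listed above, and the model path is illustrative; consult the v1.1.0 toolkit documentation for the exact spelling.

```python
# Sketch: grouped w4a16 quantization with group size 64. The 'w4a16_g64'
# dtype string is an assumption inferred from the group sizes in the notes
# above; w8a8 variants would use group sizes of 128/256/512.
from rkllm.api import RKLLM

llm = RKLLM()
ret = llm.load_huggingface(model='./Llama-3-8B-Instruct')  # illustrative path
assert ret == 0, 'load failed'

ret = llm.build(do_quantization=True,
                quantized_dtype='w4a16_g64',
                target_platform='rk3588')
assert ret == 0, 'build failed'

ret = llm.export_rkllm('./llama3-8b-w4a16-g64.rkllm')
assert ret == 0, 'export failed'
```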

release-v1.0.1

09 May 09:37
  • Optimize memory usage during model conversion.
  • Optimize memory usage during inference.
  • Increase prefill speed.
  • Reduce initialization time.
  • Improve quantization accuracy.
  • Add support for Gemma, ChatGLM3, MiniCPM, InternLM2, and Phi-3.
  • Add server invocation.
  • Add an inference interruption interface.
  • Add logprob and token_id to the return value.