Qwen3.Ink.Cpp is a study-oriented repository that reproduces the Qwen3-8B model using pure C++.
This project integrates the quantization methods and optimized QGEMM (quantized GEMM) kernels from GGML with the SIMD-aware weight packing strategy proposed in AWQ. By combining these techniques, it aims to get the best of both worlds.
Compared to the llama.cpp implementation of Qwen3-8B, this project delivers higher throughput during the prompt (prefill) phase, while maintaining comparable (slightly lower) throughput during the autoregressive generation phase.
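To give a concrete feel for the SIMD-aware packing mentioned above, below is a minimal C++ sketch of one common SIMD-friendly 4-bit layout (the same spirit as GGML's Q4_0 blocks): the two nibbles of each packed byte are chosen so that one mask and one shift unpack a whole block into two contiguous half-blocks, with no cross-lane shuffles. This is an illustration under an assumed block size of 32, not the exact interleaving order used by this repository or by AWQ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack 4-bit quantized weights (one per byte, values 0..15) so that byte i
// of a block holds q[i] in its low nibble and q[i + block/2] in its high
// nibble. Unpacking a block is then just `v & 0x0F` and `v >> 4`, each of
// which yields a contiguous half-block -- no cross-lane shuffles needed.
std::vector<uint8_t> pack_q4(const std::vector<uint8_t>& q, size_t block = 32) {
    std::vector<uint8_t> packed(q.size() / 2);
    for (size_t b = 0; b < q.size(); b += block) {
        const size_t half = block / 2;
        for (size_t i = 0; i < half; ++i) {
            packed[b / 2 + i] = static_cast<uint8_t>(
                (q[b + i] & 0x0F) | ((q[b + half + i] & 0x0F) << 4));
        }
    }
    return packed;
}
```

The point of choosing the layout at quantization time is that the expensive reordering happens once, offline, so the hot GEMM loop only pays for a mask and a shift.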
[Warning] Please note that the benchmark results depend heavily on my personal hardware and may not be reproducible on other machines.
Demo video: `comparison.mp4`
Benchmark hardware:

| Component | Specification |
|---|---|
| Operating System | Ubuntu 22.04 |
| CPU | Intel Core i5-13600KF |
| DRAM | 3600 MT/s, Dual-channel |
| model | quantization scheme | perplexity (wikitext) | GSM8K (%) |
|---|---|---|---|
| qwen3-8b | fp16 | 10.97 | 87.57 |
| qwen3-8b | w41 | 12.08 | 86.20 |
| qwen3-8b | w40 | 11.73 | 84.99 |
| qwen3-8b | w4z | 11.49 | 85.52 |
| qwen3-8b | A80W41 | 12.06 | 87.03 |
| qwen3-8b | A80W4z | 11.53 | 85.29 |
| qwen3-8b-awq | - | 11.52 | 86.35 |
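For readers unfamiliar with the `w4*` naming, these are group-wise 4-bit weight quantization variants defined in `quantize_methods.py` and the notebooks. The sketch below shows the basic symmetric recipe (one float scale per group, weights rounded to [-8, 7]); it illustrates the general technique, not the exact definitions of `w40`/`w41`/`w4z` (the `z` variant plausibly adds a zero-point, i.e. asymmetric quantization, but the notebooks are authoritative).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric group-wise 4-bit quantization: each group of `group_size`
// weights shares one scale; weights are rounded to integers in [-8, 7].
struct QuantizedGroup {
    float scale;
    std::vector<int8_t> q;  // 4-bit values kept one per byte for clarity
};

QuantizedGroup quantize_group(const float* w, size_t group_size) {
    float amax = 0.0f;
    for (size_t i = 0; i < group_size; ++i)
        amax = std::max(amax, std::fabs(w[i]));
    QuantizedGroup g;
    g.scale = amax / 7.0f;  // map the largest magnitude to +/-7
    g.q.resize(group_size);
    const float inv = g.scale != 0.0f ? 1.0f / g.scale : 0.0f;
    for (size_t i = 0; i < group_size; ++i)
        g.q[i] = static_cast<int8_t>(std::clamp(
            static_cast<int>(std::lround(w[i] * inv)), -8, 7));
    return g;
}

// Dequantization is just q * scale; the perplexity gaps in the table
// above come from the rounding error this introduces.
float dequantize(const QuantizedGroup& g, size_t i) {
    return g.q[i] * g.scale;
}
```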
| model | mem | backend | threads | test | t/s |
|---|---|---|---|---|---|
| qwen3 8B Q4_0 | 8.6 GiB | CPU | 16 | pp322 | 60.36 ± 0.23 |
| qwen3 8B Q4_0 | 8.6 GiB | CPU | 16 | tg128 | 10.40 ± 0.00 |
| qwen3 8B A81Q41-repack-FP16_FP32_mix ink | 14.0 GiB | CPU | 16 | pp322 | 134.35 |
| qwen3 8B A81Q41-repack-FP16_FP32_mix ink | 14.0 GiB | CPU | 16 | tg128 | 9.26 |
- We used `llama-bench` to benchmark the llama.cpp models, and our own code with a custom input to benchmark our model, so this is not a strict apples-to-apples comparison. `tg128` stands for generating 128 tokens in the autoregressive generation phase; `pp322` denotes processing an input prompt of 322 tokens.
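The prompt-phase advantage comes from the quantized-GEMM path: with 8-bit activations and 4-bit weights, the inner product reduces to integer multiply-accumulates plus one rescale per group, which vectorizes far better than dequantize-then-FP-multiply. Here is a minimal scalar C++ sketch of that inner loop, assuming symmetric per-group scales for both operands; the real kernels use SIMD intrinsics over the packed layout, and the actual A8/W4 scheme details live in the notebooks.

```cpp
#include <cstddef>
#include <cstdint>

// One output element of a quantized GEMM: the dot product of an int8
// activation row and an int4 weight column (stored one value per byte
// here), accumulated in int32 within each group and rescaled once per
// group with the two float scales. Assumes k is a multiple of group_size.
float qgemm_dot(const int8_t* a, const float* a_scales,
                const int8_t* w, const float* w_scales,
                size_t k, size_t group_size) {
    float acc = 0.0f;
    for (size_t g = 0; g < k; g += group_size) {
        int32_t iacc = 0;  // integer accumulation within a group
        for (size_t i = g; i < g + group_size; ++i)
            iacc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(w[i]);
        acc += iacc * a_scales[g / group_size] * w_scales[g / group_size];
    }
    return acc;
}
```

Accumulating in `int32_t` within a group keeps the hot loop free of float conversions; the float work is amortized to one multiply per group.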
Before setting up the environment, clone the repository and its submodules:

```bash
git clone https://github.com/jacksonsc007/Qwen3.Ink.Cpp.git
cd Qwen3.Ink.Cpp
git submodule update --init --recursive
```

We recommend using uv to set up the Python environment quickly:

```bash
uv sync
```

Follow the instructions in the notebooks (see the file overview below) to produce the quantized weights. Then run the build script:

```bash
bash build_qwen.sh
```

Once built, you can start interacting with the model:

```bash
build/chat
```

Here is an overview of the essential files:
| File Name | Description |
|---|---|
| `evaluate-qwen3_8b_W4.ipynb` | Evaluates the impact of weight-only quantization on Qwen3's performance. Perplexity on wikitext and the benchmark result on GSM8K are reported. |
| `evaluate-qwen3_8b_A8W4.ipynb` | Evaluates the impact of activation and weight quantization on Qwen3's performance. Perplexity on wikitext and the benchmark result on GSM8K are reported. |
| `save_A81W41_quantized_weight-qwen3_8b.ipynb` | Applies A81W41 quantization to the FP32 model and saves the quantized weights and metadata to disk. |
| `save_A80W40_quantized_weight-qwen3_8b.ipynb` | Applies A80W40 quantization to the FP32 model and saves the quantized weights and metadata to disk. |
| `quantize_methods.py` | Contains the core quantization methods used throughout the repository. |
This repository was heavily inspired by and built upon the following resources:
Much credit goes to Professor Han for his open-source spirit and wonderful lectures.
Much appreciation to Re:ゼロから始める異世界生活 (Re:Zero − Starting Life in Another World) for providing the benchmark text during development.