ArcLight is a lightweight LLM inference framework written in C/C++ for unified-memory systems. It is designed for inference scenarios beyond high-performance GPU servers, with current v1.0 optimizations focused on many-core CPU platforms and cross-NUMA tensor parallelism.
ArcLight currently supports CPU backends for ARM and x86 platforms, with basic Windows build support. The recommended path today is to build from source and run GGUF models locally.
```bash
git clone https://github.com/OpenBMB/ArcLight.git
cd ArcLight
cmake -B build -DARCLIGHT_BACKEND=AUTO -DNNML_USE_NUMA=OFF
cmake --build build --config Release -j 32
./build/al-gen \
  --model /path/to/MiniCPM5-1B-Q4_0.gguf \
  --prompt "Hello!" \
  --numa none --nodes 1 \
  --threads 4
```

ArcLight uses the GGUF model format from llama.cpp. The current nnml backend only ships kernels for f32 / f16 / q4_0 / q8_0 / q6_K / q8_K (see nnml/src/ops/types.cpp); other quants such as Q4_K_M will not load.
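The supported-type list above can be turned into a quick pre-download check. This is a minimal sketch: `is_supported_quant` is a hypothetical helper, not something ArcLight ships, and it only compares a quant label (as it appears in a file name or model card) against the list quoted above. It does not inspect the tensors inside a GGUF file.

```bash
# Hypothetical helper, not part of ArcLight: returns 0 if the given quant label
# is in the kernel list quoted above (f32 / f16 / q4_0 / q8_0 / q6_K / q8_K).
is_supported_quant() {
  # Lower-case the label so Q4_0 and q4_0 are treated the same.
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    f32|f16|q4_0|q8_0|q6_k|q8_k) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: Q4_K_M is not in the list, so this prints the warning.
is_supported_quant Q4_K_M || echo "Q4_K_M is not in the supported list"
```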
The official openbmb/MiniCPM5-1B-GGUF repo publishes F16, Q8_0, and Q4_K_M — note that Q4_0 is not included. For a q4_0 build you have to quantize it yourself from the released F16:
```bash
huggingface-cli download openbmb/MiniCPM5-1B-GGUF MiniCPM5-1B-F16.gguf --local-dir .
llama-quantize ./MiniCPM5-1B-F16.gguf ./MiniCPM5-1B-Q4_0.gguf Q4_0
```

The current codebase includes model definitions for:
- MiniCPM5-1B
- Qwen3
- Llama2
For first-time testing, start with MiniCPM5-1B or another small GGUF model, preferably the locally produced Q4_0 (smallest) or the released Q8_0.
```bash
git clone https://github.com/OpenBMB/ArcLight.git
cd ArcLight
cmake -B build -DARCLIGHT_BACKEND=AUTO
cmake --build build --config Release -j 32
```

ARCLIGHT_BACKEND can be set to:

| Value | Meaning |
|---|---|
| AUTO | Select the backend from the target CPU architecture. Recommended. |
| X86 | Enable the x86 backend explicitly. |
| NEON | Enable the ARM NEON backend explicitly. |
| NONE | Build without architecture-specific backend code. |

Make sure your machine has a C++17-capable toolchain, such as GCC/G++ on Linux or MSVC on Windows.
```powershell
git clone https://github.com/OpenBMB/ArcLight.git
cd ArcLight
cmake -B build -G "Visual Studio 18 2026"
cmake --build build --config Release -j 32
```

On Windows, executables are emitted under the build output directory configured by CMake, typically build\bin for Visual Studio builds.
ArcLight currently provides three command-line apps:
- al-gen: one-shot generation
- al-chat: interactive chat
- al-ppl: perplexity evaluation for one text

If you build from source, run them from the build directory, for example ./build/al-gen on Linux.
```bash
./build/al-gen \
  --model /path/to/MiniCPM5-1B-Q4_0.gguf \
  --prompt "Explain what unified memory means in one sentence." \
  --numa none --nodes 1 \
  --threads 4 \
  --max_length 4096 \
  --max_gen 256
```

For a Chinese prompt:
```bash
./build/al-gen \
  --model /path/to/MiniCPM5-1B-Q4_0.gguf \
  --prompt "用一句话解释什么是统一内存。" \
  --numa none --nodes 1 \
  --threads 4 \
  --max_gen 256
```

For interactive chat, use al-chat:

```bash
./build/al-chat \
  --model /path/to/MiniCPM5-1B-Q4_0.gguf \
  --numa none --nodes 1 \
  --threads 4 \
  --max_length 4096 \
  --max_gen 512
```

Press Ctrl+C during generation to interrupt the current response. Press Ctrl+C while waiting for input to exit and print the performance profile.
For perplexity evaluation, use al-ppl:

```bash
./build/al-ppl \
  --model /path/to/MiniCPM5-1B-Q4_0.gguf \
  --prompt "Good morning, Miss Lee!" \
  --numa none --nodes 1 \
  --threads 4
```

The app prints the evaluated text and a final perplexity: ... line.
ArcLight supports single-node inference and cross-node tensor parallelism.

| Mode | Required args | When to use |
|---|---|---|
| --numa none | --nodes 1 | Single-node mode. Start here for correctness checks and small models. |
| --numa tp | --nodes N where N > 1 | Cross-NUMA tensor parallelism. Use on many-core CPU machines for higher throughput. |
| --numa pp | Not ready | Reserved for future pipeline parallelism; currently not implemented. |

For tensor parallelism, --nodes should be a power of 2 in the current version. Choose --threads so that it can be evenly divided across NUMA nodes.
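The constraints above are easy to get wrong when scripting runs, so a small pre-flight check can help. The following is a sketch only: `check_numa_args` is a hypothetical wrapper-side helper, not an ArcLight command, and it encodes just the rules stated in this section (node count of 1 for `none`, a power of 2 greater than 1 for `tp`, and a thread count divisible by the node count).

```bash
# Hypothetical helper, not part of ArcLight: validates a --numa / --nodes /
# --threads combination against the rules described in this section.
check_numa_args() {
  mode=$1; nodes=$2; threads=$3
  case "$mode" in
    none)
      [ "$nodes" -eq 1 ] || { echo "--numa none requires --nodes 1" >&2; return 1; } ;;
    tp)
      [ "$nodes" -gt 1 ] || { echo "--numa tp requires --nodes > 1" >&2; return 1; }
      # A power of 2 has exactly one bit set, so n & (n - 1) is 0.
      [ $(( nodes & (nodes - 1) )) -eq 0 ] || { echo "--nodes should be a power of 2" >&2; return 1; } ;;
    *)
      echo "unsupported --numa mode: $mode" >&2; return 1 ;;
  esac
  # Threads must split evenly across the NUMA nodes.
  [ $(( threads % nodes )) -eq 0 ] || { echo "--threads should be divisible by --nodes" >&2; return 1; }
  echo "ok: $(( threads / nodes )) threads per node"
}

check_numa_args tp 4 32   # prints "ok: 8 threads per node"
```

Calling it before the real binary, for example `check_numa_args tp 4 32 && ./build/al-gen ...`, keeps an invalid combination from reaching the launch step.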
Example for a 4-node many-core machine:
```bash
./build/al-gen \
  --model /path/to/MiniCPM5-1B-Q4_0.gguf \
  --prompt "Hello!" \
  --numa tp --nodes 4 \
  --threads 32
```

| Scenario | Suggested settings |
|---|---|
| First run / small model | --numa none --nodes 1 --threads <cores-on-one-node> |
| Many-core CPU throughput | --numa tp --nodes <power-of-2> --threads <total-threads> |
| Longer context | Increase --max_length and --kv_gb. |
| Larger model | Increase --w_gb, then tune --a_gb and --work_gb if allocation fails. |
Use --numa none --nodes 1. The current implementation requires --nodes to be exactly 1 when --numa none is selected.
Use --numa tp --nodes N with N > 1. In the current version, N should be a power of 2. Also make sure --threads is large enough and divisible by --nodes.
--numa pp is planned but not implemented yet. Use --numa none or --numa tp.
Check that the model is a GGUF checkpoint from a supported model family. The current codebase includes MiniCPM5-1B, Qwen3, and Llama2 definitions. Also verify that --w_gb is large enough for the selected model.
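A quick way to rule out a wrong or truncated download is to look at the file header. The sketch below relies on one fact about the format, that a GGUF file begins with the 4-byte ASCII magic `GGUF`. The `is_gguf` function is a hypothetical helper, not an ArcLight tool, and it says nothing about the model family or the quantization type inside the file.

```bash
# Hypothetical pre-flight check, not part of ArcLight: succeeds only if the
# file is readable and its first 4 bytes are the ASCII magic "GGUF".
is_gguf() {
  [ -r "$1" ] || return 1
  # Read the first 4 bytes; strip NUL bytes so the shell never sees them.
  magic=$(dd if="$1" bs=4 count=1 2>/dev/null | tr -d '\000')
  [ "$magic" = "GGUF" ]
}

# Usage (path is a placeholder):
#   is_gguf /path/to/MiniCPM5-1B-Q4_0.gguf || echo "not a GGUF file" >&2
```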
Increase --a_gb, --kv_gb, or --work_gb. Longer contexts require a larger KV cache, so --max_length 8192 generally needs a larger --kv_gb than --max_length 4096.
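To get a feel for how --kv_gb scales with --max_length, a rough estimate is often enough. The sketch below assumes a standard transformer KV cache that stores one key and one value vector per layer, per KV head, per position, at 2 bytes per element (f16). Whether ArcLight's own accounting matches this exactly is an assumption, and the layer and head counts in the example are illustrative placeholders, not the dimensions of any listed model.

```bash
# Hypothetical estimator, not part of ArcLight. Arguments:
#   layers  kv_heads  head_dim  max_length  [bytes_per_element, default 2]
# Prints the estimated KV cache size in MiB, rounded up.
kv_cache_mib() {
  layers=$1; kv_heads=$2; head_dim=$3; max_length=$4; bytes_per_elem=${5:-2}
  # Factor 2 covers the separate key and value tensors.
  bytes=$(( 2 * layers * kv_heads * head_dim * max_length * bytes_per_elem ))
  echo $(( (bytes + 1048575) / 1048576 ))
}

# Illustrative dimensions only: 24 layers, 2 KV heads, head size 128.
kv_cache_mib 24 2 128 4096   # prints 96
kv_cache_mib 24 2 128 8192   # prints 192
```

Under these assumptions the estimate grows linearly with the context length, which is why doubling --max_length roughly doubles the KV budget you need.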