This project is a Topo template and follows the Topo Template Format Specification.
Complete LLM chat application optimized for Arm CPU inference.
Features: SVE, NEON
This project demonstrates running large language models on CPU using llama.cpp compiled with Arm baseline optimizations and accelerated using NEON SIMD and SVE (when supported and enabled).
The stack includes:
- llama.cpp server with Arm NEON optimizations (SVE optional)
- Quantized SmolLM2-135M-Instruct model bundled in the image
- Simple web-based chat interface
- No GPU required - pure CPU inference
Requirements:
- Arm Hardware: An Arm system (physical or virtual). Note that SVE support in llama.cpp requires an Armv8.2-A (or newer) CPU with the SVE extension.
- Docker: For container orchestration with Topo
- LLM Model: Optional when overriding the bundled default; provide a supported single-file GGUF model (e.g., Llama 3.1, Mistral, etc.)
Note:
`MODEL` must point to a supported single-file `.gguf` model artifact. Use a Hugging Face repo ID to auto-select a CPU-friendly quantization (preferring Q4_K_M), a Hugging Face repo plus exact filename as `<repo>:<filename>`, or a direct `.gguf` URL. Sharded GGUFs and multimodal projector files (mmproj) are rejected with a clear error because this template only supports single-file text model GGUFs today. Not all model repos include GGUF quantizations; look for repos with `-GGUF` in the name. The selected model is baked into the image at `/models/model.gguf`.
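The three accepted `MODEL` forms and the two rejected cases can be sketched as a small shell check. This is an illustration only, not the template's actual validation code: the function name `classify_model` and its output labels are hypothetical, and the shard pattern assumes the common `-00001-of-00003.gguf` naming used for split GGUF files.

```shell
# Hypothetical helper, NOT the template's real validation logic.
# Prints "repo", "repo-file", or "url" for an accepted MODEL value;
# prints an error to stderr and returns 1 otherwise.
classify_model() {
  local model="$1"
  local file

  # Isolate the filename part, if one was given
  # (URL basename, or the text after "<repo>:").
  case "$model" in
    http://*|https://*) file="${model##*/}" ;;
    *:*)                file="${model#*:}" ;;
    *)                  file="" ;;
  esac

  if [ -n "$file" ]; then
    case "$file" in
      *mmproj*)
        echo "error: multimodal projector files are not supported" >&2
        return 1 ;;
      # Assumed shard naming: <name>-NNNNN-of-NNNNN.gguf
      *-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9].gguf)
        echo "error: sharded GGUFs are not supported" >&2
        return 1 ;;
      *.gguf) ;;
      *)
        echo "error: expected a .gguf file" >&2
        return 1 ;;
    esac
  fi

  case "$model" in
    http://*|https://*) echo "url" ;;
    *:*)                echo "repo-file" ;;
    */*)                echo "repo" ;;
    *)
      echo "error: expected a repo ID, <repo>:<filename>, or a .gguf URL" >&2
      return 1 ;;
  esac
}
```

For example, `classify_model unsloth/SmolLM2-135M-Instruct-GGUF` prints `repo`, while a value ending in `-00001-of-00003.gguf` is rejected. The sketch only inspects the string; it does not contact Hugging Face or pick a quantization.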
| Parameter | Description | Default |
|---|---|---|
| `MODEL` | Hugging Face GGUF repo, `<repo>:<filename>`, or direct `.gguf` URL | `unsloth/SmolLM2-135M-Instruct-GGUF` |
| `ENABLE_SVE` | Enable SVE optimizations | `OFF` |
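Because `ENABLE_SVE` defaults to `OFF`, it can help to check whether the target CPU advertises SVE before turning it on. A minimal sketch, assuming a Linux target where `/proc/cpuinfo` exposes a `Features` line; the helper name `has_sve` is hypothetical and is not part of topo or llama.cpp.

```shell
# Hypothetical helper: succeeds if the cpuinfo file lists the "sve" flag.
# The optional argument exists only so the function can be tested
# against a sample file; by default it reads /proc/cpuinfo.
has_sve() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  # On arm64 Linux, the "Features" line lists flags such as
  # "asimd" (NEON) and "sve". -w avoids matching longer flag names.
  grep -m1 -i '^Features' "$cpuinfo" 2>/dev/null | grep -qw 'sve'
}

if has_sve; then
  echo "SVE flag found: ENABLE_SVE=ON is worth trying"
else
  echo "No SVE flag found: keep the default ENABLE_SVE=OFF"
fi
```

Run this on the target itself. If the flag is present, the parameter can be passed the same way as `MODEL`, for example `topo deploy --target <ip-address-of-target> --arg ENABLE_SVE=ON`.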
The easiest way to deploy is with topo. Download and install topo first, then clone and deploy the template:
topo clone git@github.com:Arm-Examples/topo-v9-cpu-chat.git
cd topo-v9-cpu-chat
topo deploy --target <ip-address-of-target>

Use a different model:

topo deploy --target <ip-address-of-target> \
  --arg MODEL=bartowski/Qwen_Qwen3.5-0.8B-GGUF

Force an exact GGUF file:

topo deploy --target <ip-address-of-target> \
  --arg MODEL=unsloth/SmolLM2-135M-Instruct-GGUF:SmolLM2-135M-Instruct-Q4_K_M.gguf

Open your browser to http://<ip-address-of-target>:3000 to start chatting!