A small, readable Go runtime for SmolLM3-3B local inference, with tokenizer, int8 weight-only quantization, KV cache, and ARM64/amd64 SIMD kernels.
Inspired by llama2.c and adapted from the smollm2.go implementation.

    cmd/smollm3/          CLI entry point
    internal/model/       SML3 loader, weights, KV cache, forward pass
    internal/tokenizer/   TOK3 loader and byte-level BPE tokenizer
    internal/sampler/     greedy, multinomial, and top-p sampling
    tools/                Hugging Face model/tokenizer export scripts
    docs/CHECKPOINT.md    SML3/TOK3 binary format notes

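
The three sampling strategies listed for `internal/sampler/` can be illustrated with a minimal sketch. This is not the repository's code: the function names (`argmax`, `softmax`, `sampleTopP`, `sample`) and their signatures are assumptions made for illustration, and the approach follows the common llama2.c-style pattern of greedy decoding at temperature 0 and nucleus (top-p) sampling otherwise.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// argmax returns the index of the largest value (greedy decoding).
func argmax(x []float32) int {
	best := 0
	for i, v := range x {
		if v > x[best] {
			best = i
		}
	}
	return best
}

// softmax turns logits into probabilities after dividing by temp (temp > 0).
func softmax(logits []float32, temp float32) []float32 {
	maxv := logits[argmax(logits)] // subtract the max for numerical stability
	probs := make([]float32, len(logits))
	var sum float32
	for i, v := range logits {
		probs[i] = float32(math.Exp(float64((v - maxv) / temp)))
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

// sampleTopP samples from the smallest set of most-probable tokens whose
// cumulative probability reaches topP. coin must be uniform in [0, 1).
// With topP >= 1 this degrades to plain multinomial sampling.
func sampleTopP(probs []float32, topP, coin float32) int {
	idx := make([]int, len(probs))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool { return probs[idx[a]] > probs[idx[b]] })

	// Find the end of the nucleus and its total probability mass.
	var cum float32
	last := len(idx) - 1
	for i, id := range idx {
		cum += probs[id]
		if cum >= topP {
			last = i
			break
		}
	}

	// Sample inside the nucleus by rescaling the coin to the nucleus mass.
	r := coin * cum
	var cdf float32
	for i := 0; i <= last; i++ {
		cdf += probs[idx[i]]
		if r < cdf {
			return idx[i]
		}
	}
	return idx[last] // guards against rounding error
}

// sample picks the next token: greedy when temp is 0, top-p otherwise.
func sample(logits []float32, temp, topP float32, rng *rand.Rand) int {
	if temp == 0 {
		return argmax(logits)
	}
	return sampleTopP(softmax(logits, temp), topP, rng.Float32())
}

func main() {
	rng := rand.New(rand.NewSource(1))
	logits := []float32{1.0, 3.0, 2.0, -1.0}
	fmt.Println("greedy:", sample(logits, 0, 0.9, rng))
	fmt.Println("top-p: ", sample(logits, 0.8, 0.9, rng))
}
```

Sorting the full vocabulary on every step is the simplest way to show the idea; a real sampler would likely pre-filter low-probability tokens first to keep the sort small.
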
The export scripts need torch, transformers, sentencepiece, safetensors,
accelerate, and numpy.

    python3 -m venv .venv
    .venv/bin/python -m pip install --upgrade pip torch transformers sentencepiece safetensors accelerate numpy

SmolLM3 requires recent Transformers support. Use transformers>=4.53.

    mkdir -p models
    .venv/bin/python tools/export.py models/smollm3-3b-f32.bin \
      --hf HuggingFaceTB/SmolLM3-3B
    .venv/bin/python tools/export_tokenizer.py models/smollm3-tokenizer.bin \
      --hf HuggingFaceTB/SmolLM3-3B

To convert the FP32 checkpoint to weight-only int8:

    .venv/bin/python tools/quantize.py \
      models/smollm3-3b-f32.bin \
      models/smollm3-3b-int8.bin

Build the CLI:

    mkdir -p bin
    go build -o bin/smollm3 ./cmd/smollm3

Run the benchmarks using:

    go test ./internal/model -bench='Benchmark(Prefill|Decode)' -benchtime=1x -run '^$'

Reference results on an Apple M2 Max:

| Benchmark | FP32 | Int8 |
|---|---|---|
| Prefill 128 tokens | 29.61 tok/s | 64.50 tok/s |
| Prefill 512 tokens | 26.70 tok/s | 53.38 tok/s |
| Decode at 128-token context | 6.517 tok/s | 20.74 tok/s |
| Decode at 512-token context | 6.441 tok/s | 17.89 tok/s |

Reference results on Windows/amd64 with an AMD Ryzen 9 9950X:

| Benchmark | FP32 | Int8 |
|---|---|---|
| Prefill 128 tokens | 62.07 tok/s | 99.45 tok/s |
| Prefill 512 tokens | 52.72 tok/s | 84.39 tok/s |
| Decode at 128-token context | 3.418 tok/s | 11.69 tok/s |
| Decode at 512-token context | 3.356 tok/s | 11.29 tok/s |
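
To compare rows in the tables above, it can help to convert throughput into per-token latency and an int8-over-FP32 speedup. The sketch below does that arithmetic for the two 128-token decode rows; the `row` type and helper names are made up for this example and are not part of the repository.

```go
package main

import "fmt"

// row holds one benchmark line from the tables, in tokens per second.
type row struct {
	name     string
	fp32TokS float64
	int8TokS float64
}

// speedup is how many times faster int8 is than FP32 for this row.
func speedup(r row) float64 { return r.int8TokS / r.fp32TokS }

// msPerToken converts throughput (tok/s) to latency (ms per token).
func msPerToken(tokPerSec float64) float64 { return 1000 / tokPerSec }

func main() {
	rows := []row{
		{"M2 Max, decode @128", 6.517, 20.74},
		{"Ryzen 9 9950X, decode @128", 3.418, 11.69},
	}
	for _, r := range rows {
		fmt.Printf("%-28s int8 speedup %.2fx, int8 latency %.1f ms/token\n",
			r.name, speedup(r), msPerToken(r.int8TokS))
	}
}
```
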

Generate plain continuation text:

    bin/smollm3 \
      -model models/smollm3-3b-int8.bin \
      -tokenizer models/smollm3-tokenizer.bin \
      -mode generate \
      -n 128 \
      -prompt "The galaxy empire" \
      -temp 0

Run a single chat turn:

    bin/smollm3 \
      -model models/smollm3-3b-int8.bin \
      -tokenizer models/smollm3-tokenizer.bin \
      -mode chat \
      -prompt "Give me a brief explanation of gravity in simple terms." \
      -temp 0

Disable thinking:

    bin/smollm3 \
      -model models/smollm3-3b-int8.bin \
      -tokenizer models/smollm3-tokenizer.bin \
      -mode chat \
      -think=false \
      -system "Answer as concisely as possible. For arithmetic, give only the equation and result." \
      -prompt "What is 2+2?" \
      -temp 0

Run the built-in tool-calling demo:

    bin/smollm3 \
      -model models/smollm3-3b-int8.bin \
      -tokenizer models/smollm3-tokenizer.bin \
      -mode toolcall \
      -prompt "I have 40 dollars. Can I buy 3 notebooks, and how much money would be left?" \
      -temp 0

Other Go programs can import the root package and call the runtime directly:

    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/zhuyie/smollm3.go"
    )

    func main() {
        client, err := smollm3.Load(smollm3.Config{
            ModelPath:     "models/smollm3-3b-int8.bin",
            TokenizerPath: "models/smollm3-tokenizer.bin",
        })
        if err != nil {
            log.Fatal(err)
        }
        text, stats, err := client.Generate(context.Background(), "The galaxy empire", smollm3.GenerateOptions{
            MaxNewTokens: 128,
            Temperature:  0,
        })
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%s\n%.2f tok/s\n", text, stats.TokensPerSecond())
    }

For chat-style prompts, use client.Chat with a message history:

    reply, _, err := client.Chat(context.Background(), []smollm3.Message{
        {Role: "user", Content: "Give me a brief explanation of gravity."},
    }, smollm3.ChatOptions{
        GenerateOptions: smollm3.GenerateOptions{MaxNewTokens: 128, Temperature: 0},
        Thinking:        true,
    })

GenerateOptions.TokenCallback receives decoded token pieces as they are produced, which is useful for streaming responses into an application UI. A loaded client owns mutable model state, so do not call generation methods concurrently on the same client.
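
If several goroutines need to share one client, one option is to serialize calls behind a mutex. The sketch below shows the pattern against a hypothetical `generator` interface with a simplified `Generate` signature; it is a stand-in for illustration, not the package's real method set, and `fakeGen` only exists to demonstrate that calls no longer overlap.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// generator is a hypothetical, simplified view of a client that is not
// safe for concurrent use. The real smollm3 client has a richer signature.
type generator interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// lockedGenerator serializes all Generate calls on the wrapped generator.
type lockedGenerator struct {
	mu  sync.Mutex
	gen generator
}

func (l *lockedGenerator) Generate(ctx context.Context, prompt string) (string, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.gen.Generate(ctx, prompt)
}

// fakeGen stands in for a model and records whether two calls ever overlapped.
type fakeGen struct {
	active     atomic.Int32
	overlapped atomic.Bool
}

func (f *fakeGen) Generate(ctx context.Context, prompt string) (string, error) {
	if f.active.Add(1) > 1 {
		f.overlapped.Store(true)
	}
	time.Sleep(5 * time.Millisecond) // pretend to run the forward pass
	f.active.Add(-1)
	return "echo: " + prompt, nil
}

// runConcurrently issues n Generate calls from n goroutines and waits for all.
func runConcurrently(g generator, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_, _ = g.Generate(context.Background(), fmt.Sprintf("prompt %d", i))
		}(i)
	}
	wg.Wait()
}

func main() {
	f := &fakeGen{}
	runConcurrently(&lockedGenerator{gen: f}, 4)
	fmt.Println("calls overlapped:", f.overlapped.Load())
}
```

A mutex keeps memory use at a single model instance at the cost of queuing requests; loading one client per worker trades memory for parallelism.
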
