@DrJesseGlass
Contributor

Summary

This PR adds comprehensive support for SmolLM3-3B with both full precision (safetensors) and quantized (GGUF) implementations, unified under a single example interface.

What's New

Model Implementation

  • Full precision model (models/smol/smollm3.rs): Native safetensors support with F32/F16/BF16
  • Quantized model (models/smol/quantized_smollm3.rs): GGUF support with Q4_K_M, Q8_0, and F16 quantization
  • Unified example (examples/smollm3/main.rs): Single CLI that supports both model types seamlessly

SmolLM3 Architecture Features

  • Hybrid RoPE/NoPE: 3:1 ratio with every 4th layer using No Positional Encoding (see the sketch after this list)
  • Grouped Query Attention: 32 attention heads with 8 KV heads (4 groups)
  • High RoPE theta: 5,000,000 (vs typical 10k-500k)
  • Long context support: Up to 128k tokens
  • Thinking mode: Support for explicit reasoning with <think> tags
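
As a rough illustration of the 3:1 layer pattern, here is a minimal sketch (not the PR's actual code; the interval parameter and the 0-based indexing are assumptions):

// Minimal sketch: with an interval of 4 and 0-based layer indices,
// layers 0-2 apply rotary embeddings and every 4th layer (3, 7, 11, ...)
// skips them, giving the 3:1 RoPE/NoPE ratio.
fn layer_uses_rope(layer_idx: usize, nope_interval: usize) -> bool {
    (layer_idx + 1) % nope_interval != 0
}

A NoPE layer then uses the raw Q/K projections instead of rotating them; the grouped-query attention layout (32 query heads sharing 8 KV heads) is unaffected by this choice.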

Verification

Output correctness verified against reference implementations:

  • Full precision: Validated against HuggingFace Transformers Python implementation
  • Quantized: Validated against llama.cpp (HuggingFace Transformers doesn't yet support quantized SmolLM3)

Performance

Tested on CPU and GPU with identical prompts (9 tokens generated):

Model Type   Device   Speed (tokens/s)   Speedup
Q8_0         CPU      7.31               1.0x
Q8_0         GPU      45.84              6.3x
Full F16     CPU      2.54               1.0x
Full F16     GPU      32.22              12.7x

Technical Details

Quantized Weight Reconstruction

The quantized implementation includes special handling for Q/K weight deinterleaving to maintain compatibility with GGUF format's interleaved storage pattern. The reconstruct_qk_weights() function properly reorganizes the attention weights.
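
For illustration, here is a minimal sketch of one such row deinterleave on a dequantized projection, assuming the GGUF tensor stores the two rotary halves interleaved the way llama.cpp's converter permutes them; this is not necessarily the exact logic of reconstruct_qk_weights():

use candle_core::{Result, Tensor};

// View each head's rows as (head_dim / 2, 2) pairs, swap the pair axis back
// out, and flatten, recovering the [first half | second half] row order that
// a half-rotation RoPE implementation expects.
fn deinterleave_qk(w: &Tensor, n_head: usize) -> Result<Tensor> {
    let (rows, cols) = w.dims2()?;
    let head_dim = rows / n_head;
    w.reshape((n_head, head_dim / 2, 2, cols))?
        .transpose(1, 2)?
        .contiguous()?
        .reshape((rows, cols))
}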

Future Work: Add optimized CPU kernels with better thread utilization, similar to llama.cpp's implementation.

KV-Cache Optimization Opportunity

The current implementation uses .contiguous() calls when appending to KV cache:

// Can remove this contiguous call if using ConcatKV-Cache
// See: https://github.com/huggingface/candle/pull/3143
let (k, v) = self.kv_cache.append(&k.contiguous()?, &v.contiguous()?)?;

The ConcatKV-Cache implementation (#3143) offers significant performance improvements:

  • GPU: Multiple orders of magnitude faster
  • CPU/WASM: Equivalent performance with cleaner code

Action Item: I will open a separate issue to discuss adopting ConcatKV-Cache as the default KV-cache implementation across all transformer models in Candle. This would reduce duplicated cache logic across models and improve performance by default.

Code Organization

This PR introduces an improved organizational pattern that should be considered for future transformer implementations:

Unified Module Structure

models/smol/
├── mod.rs                   # Module documentation and exports
├── smollm3.rs               # Full precision implementation
├── quantized_smollm3.rs     # Quantized implementation
└── README.md                # Family documentation
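
A minimal sketch of what mod.rs exports under this layout (the actual file also carries the module-level documentation noted above):

// models/smol/mod.rs
pub mod quantized_smollm3;
pub mod smollm3;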

Single Example for Multiple Model Types

The examples/smollm3/main.rs example demonstrates a unified approach (a rough sketch follows the list below):

  • Single enum SmolLM3Model wrapping both implementations
  • Unified ModelConfig abstraction for consistent access
  • Shared generation logic regardless of model type
  • Simple --model-type flag switches between full and quantized
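
A rough sketch of that wrapper follows; the SmolLM3Model name comes from this PR, while the Model type names and the forward signature are assumptions:

use candle_core::{Result, Tensor};
use candle_transformers::models::smol::{quantized_smollm3, smollm3};

// Sketch only: one enum wraps both back-ends so the generation loop is
// written once. The `Model` types and forward(input, position) signature
// are assumptions about the two modules added in this PR.
enum SmolLM3Model {
    Full(smollm3::Model),
    Quantized(quantized_smollm3::Model),
}

impl SmolLM3Model {
    // Dispatch to whichever implementation --model-type selected.
    fn forward(&mut self, input: &Tensor, pos: usize) -> Result<Tensor> {
        match self {
            Self::Full(m) => m.forward(input, pos),
            Self::Quantized(m) => m.forward(input, pos),
        }
    }
}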

Benefits:

  • User Experience: One example to learn, consistent CLI across model types
  • Maintainability: Shared logic reduces duplication
  • Testing: Single test harness validates both implementations
  • Documentation: Easier to explain trade-offs between model types

This pattern could be adopted for other model families (e.g., Llama, Mistral) to provide a more cohesive user experience.

Example Usage

# Quantized model (fast, smaller memory)
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --quantization q8_0 \
  --prompt "Explain Rust's ownership system"

# Full precision model (highest quality)
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --prompt "Explain Rust's ownership system"

# Enable thinking mode for reasoning tasks
cargo run --release --example smollm3 -- \
  --thinking \
  --prompt "Solve this logic puzzle step by step"

Testing

  • Builds successfully on CPU and GPU configurations
  • Quantized model (Q8_0) outputs match llama.cpp reference
  • Full model outputs match HuggingFace Transformers
  • KV-cache correctly maintains state across generation
  • NoPE layers properly skip positional encoding per config
  • Thinking mode formats prompts correctly

Files Changed

New Files:

  • candle-transformers/src/models/smol/mod.rs
  • candle-transformers/src/models/smol/smollm3.rs
  • candle-transformers/src/models/smol/quantized_smollm3.rs
  • candle-transformers/src/models/smol/README.md
  • candle-examples/examples/smollm3/main.rs
  • candle-examples/examples/smollm3/README.md

Modified Files:

  • candle-transformers/src/models/mod.rs (added pub mod smol;)
  • candle-examples/Cargo.toml (added chrono = "0.4")

Related Issues

Checklist

  • Code follows Candle style guidelines
  • Verified outputs against reference implementations
  • Documentation added (README, rustdoc comments)
  • Example demonstrates both quantized and full precision usage
  • Tested on CPU and GPU
  • No compiler warnings
  • Proper error handling throughout
