@DrJesseGlass
Contributor

Summary

This PR adds comprehensive support for SmolLM3-3B with both full precision (safetensors) and quantized (GGUF) implementations, unified under a single example interface.

What's New

Model Implementation

  • Full precision model (models/smol/smollm3.rs): Native safetensors support with F32/F16/BF16
  • Quantized model (models/smol/quantized_smollm3.rs): GGUF support with Q4_K_M, Q8_0, and F16 quantization
  • Unified example (examples/smollm3/main.rs): Single CLI that supports both model types seamlessly

SmolLM3 Architecture Features

  • Hybrid RoPE/NoPE: 3:1 ratio with every 4th layer using No Positional Encoding (see the sketch after this list)
  • Grouped Query Attention: 32 attention heads with 8 KV heads (4 groups)
  • High RoPE theta: 5,000,000 (vs typical 10k-500k)
  • Long context support: Up to 128k tokens
  • Thinking mode: Support for explicit reasoning with <think> tags
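
As a rough illustration of the 3:1 layer pattern, here is a minimal sketch (not the PR's actual code; the interval parameter and the 0-based indexing are assumptions):

// Minimal sketch: with an interval of 4 and 0-based layer indices,
// layers 0-2 apply rotary embeddings and every 4th layer (3, 7, 11, ...)
// skips them, giving the 3:1 RoPE/NoPE ratio.
fn layer_uses_rope(layer_idx: usize, nope_interval: usize) -> bool {
    (layer_idx + 1) % nope_interval != 0
}

A NoPE layer then uses the raw Q/K projections instead of rotating them; the grouped-query attention layout (32 query heads sharing 8 KV heads) is unaffected by this choice.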

Verification

Output correctness verified against reference implementations:

  • Full precision: Validated against HuggingFace Transformers Python implementation
  • Quantized: Validated against llama.cpp (HuggingFace Transformers doesn't yet support quantized SmolLM3)

Performance

Tested on CPU and GPU with identical prompts (9 tokens generated):

Model Type   Device   Speed (tokens/s)   Speedup
Q8_0         CPU      7.31               1.0x
Q8_0         GPU      45.84              6.3x
Full F16     CPU      2.54               1.0x
Full F16     GPU      32.22              12.7x

Technical Details

Quantized Weight Reconstruction

The quantized implementation includes special handling for Q/K weight deinterleaving to maintain compatibility with GGUF format's interleaved storage pattern. The reconstruct_qk_weights() function properly reorganizes the attention weights.
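
For illustration, here is a minimal sketch of one such row deinterleave on a dequantized projection, assuming the GGUF tensor stores the two rotary halves interleaved the way llama.cpp's converter permutes them; this is not necessarily the exact logic of reconstruct_qk_weights():

use candle_core::{Result, Tensor};

// View each head's rows as (head_dim / 2, 2) pairs, swap the pair axis back
// out, and flatten, recovering the [first half | second half] row order that
// a half-rotation RoPE implementation expects.
fn deinterleave_qk(w: &Tensor, n_head: usize) -> Result<Tensor> {
    let (rows, cols) = w.dims2()?;
    let head_dim = rows / n_head;
    w.reshape((n_head, head_dim / 2, 2, cols))?
        .transpose(1, 2)?
        .contiguous()?
        .reshape((rows, cols))
}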

Future Work: Add optimized CPU kernels with better thread utilization, similar to llama.cpp's implementation.

KV-Cache Optimization Opportunity

The current implementation uses .contiguous() calls when appending to KV cache:

// Can remove this contiguous call if using ConcatKV-Cache
// See: https://github.com/huggingface/candle/pull/3143
let (k, v) = self.kv_cache.append(&k.contiguous()?, &v.contiguous()?)?;

The ConcatKV-Cache implementation (#3143) offers significant performance improvements:

  • GPU: Multiple orders of magnitude faster
  • CPU/WASM: Equivalent performance with cleaner code

Action Item: I will open a separate issue to discuss adopting ConcatKV-Cache as the default KV-cache implementation across all transformer models in Candle. This would reduce duplicated cache logic across models and improve performance by default.

Code Organization

This PR introduces an improved organizational pattern that should be considered for future transformer implementations:

Unified Module Structure

models/smol/
├── mod.rs                   # Module documentation and exports
├── smollm3.rs               # Full precision implementation
├── quantized_smollm3.rs     # Quantized implementation
└── README.md                # Family documentation
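
A minimal sketch of what mod.rs exports under this layout (the actual file also carries the module-level documentation noted above):

// models/smol/mod.rs
pub mod quantized_smollm3;
pub mod smollm3;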

Single Example for Multiple Model Types

The examples/smollm3/main.rs example demonstrates a unified approach (a rough sketch follows the list below):

  • Single enum SmolLM3Model wrapping both implementations
  • Unified ModelConfig abstraction for consistent access
  • Shared generation logic regardless of model type
  • Simple --model-type flag switches between full and quantized
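
A rough sketch of that wrapper follows; the SmolLM3Model name comes from this PR, while the Model type names and the forward signature are assumptions:

use candle_core::{Result, Tensor};
use candle_transformers::models::smol::{quantized_smollm3, smollm3};

// Sketch only: one enum wraps both back-ends so the generation loop is
// written once. The `Model` types and forward(input, position) signature
// are assumptions about the two modules added in this PR.
enum SmolLM3Model {
    Full(smollm3::Model),
    Quantized(quantized_smollm3::Model),
}

impl SmolLM3Model {
    // Dispatch to whichever implementation --model-type selected.
    fn forward(&mut self, input: &Tensor, pos: usize) -> Result<Tensor> {
        match self {
            Self::Full(m) => m.forward(input, pos),
            Self::Quantized(m) => m.forward(input, pos),
        }
    }
}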

Benefits:

  • User Experience: One example to learn, consistent CLI across model types
  • Maintainability: Shared logic reduces duplication
  • Testing: Single test harness validates both implementations
  • Documentation: Easier to explain trade-offs between model types

This pattern could be adopted for other model families (e.g., Llama, Mistral) to provide a more cohesive user experience.

Example Usage

# Quantized model (fast, smaller memory)
cargo run --release --example smollm3 -- \
  --model-type quantized \
  --quantization q8_0 \
  --prompt "Explain Rust's ownership system"

# Full precision model (highest quality)
cargo run --release --example smollm3 -- \
  --model-type full \
  --dtype f16 \
  --prompt "Explain Rust's ownership system"

# Enable thinking mode for reasoning tasks
cargo run --release --example smollm3 -- \
  --thinking \
  --prompt "Solve this logic puzzle step by step"

Testing

  • Builds successfully on CPU and GPU configurations
  • Quantized model (Q8_0) outputs match llama.cpp reference
  • Full model outputs match HuggingFace Transformers
  • KV-cache correctly maintains state across generation
  • NoPE layers properly skip positional encoding per config
  • Thinking mode formats prompts correctly

Files Changed

New Files:

  • candle-transformers/src/models/smol/mod.rs
  • candle-transformers/src/models/smol/smollm3.rs
  • candle-transformers/src/models/smol/quantized_smollm3.rs
  • candle-transformers/src/models/smol/README.md
  • candle-examples/examples/smollm3/main.rs
  • candle-examples/examples/smollm3/README.md

Modified Files:

  • candle-transformers/src/models/mod.rs (added pub mod smol;)
  • candle-examples/Cargo.toml (added chrono = "0.4")

Related Issues

Checklist

  • Code follows Candle style guidelines
  • Verified outputs against reference implementations
  • Documentation added (README, rustdoc comments)
  • Example demonstrates both quantized and full precision usage
  • Tested on CPU and GPU
  • No compiler warnings
  • Proper error handling throughout
