Add SmolLM3: Full and Quantized Implementation #3180
Summary
This PR adds comprehensive support for SmolLM3-3B with both full precision (safetensors) and quantized (GGUF) implementations, unified under a single example interface.
What's New
Model Implementation
- `models/smol/smollm3.rs`: native safetensors support with F32/F16/BF16
- `models/smol/quantized_smollm3.rs`: GGUF support with Q4_K_M, Q8_0, and F16 quantization
- `examples/smollm3/main.rs`: single CLI that supports both model types seamlessly

SmolLM3 Architecture Features
- Support for `<think>` tags

Verification
Output correctness was verified against reference implementations.
Performance
Tested on CPU and GPU with identical prompts (9 tokens generated).
Technical Details
Quantized Weight Reconstruction
The quantized implementation includes special handling for Q/K weight deinterleaving to maintain compatibility with the GGUF format's interleaved storage pattern. The `reconstruct_qk_weights()` function reorganizes the attention weights accordingly (see the sketch below).

Future Work: Add optimized kernels for CPU thread utilization, similar to llama.cpp's implementation.
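For illustration, here is a minimal sketch of the kind of per-head row reordering involved. The function name, shapes, and the exact interleaving pattern are assumptions and may differ from what `reconstruct_qk_weights()` actually does.

```rust
// Hypothetical illustration of per-head row deinterleaving for an attention
// weight matrix stored row-major as (n_heads * head_dim) x hidden.
// Assumes the exporter interleaved the two rotary halves of each head's rows;
// this maps interleaved row r back to its assumed original position.
fn deinterleave_qk_rows(w: &[f32], n_heads: usize, head_dim: usize, hidden: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; w.len()];
    for h in 0..n_heads {
        for r in 0..head_dim {
            let src = h * head_dim + r;
            // Even interleaved rows came from the first half of the head,
            // odd rows from the second half.
            let dst_in_head = if r % 2 == 0 { r / 2 } else { head_dim / 2 + r / 2 };
            let dst = h * head_dim + dst_in_head;
            out[dst * hidden..(dst + 1) * hidden]
                .copy_from_slice(&w[src * hidden..(src + 1) * hidden]);
        }
    }
    out
}
```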
KV-Cache Optimization Opportunity
The current implementation uses `.contiguous()` calls when appending to the KV cache. The ConcatKV-Cache implementation (#3143) offers significant performance improvements.
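For context, a minimal sketch of the append-with-`.contiguous()` pattern being described, assuming the usual candle convention of concatenating along the sequence axis; the helper name and shapes are illustrative, not the PR's exact code.

```rust
use candle_core::{Result, Tensor};

// Illustrative append step for the key cache (the value cache is analogous).
// Each call concatenates the new keys onto the cached ones along the sequence
// dimension and forces a contiguous layout, which copies the whole cache
// on every generated token.
fn append_k(cache_k: Option<&Tensor>, new_k: &Tensor) -> Result<Tensor> {
    // Shapes assumed: (batch, n_kv_heads, seq_len, head_dim); dim 2 = sequence.
    match cache_k {
        Some(prev) => Tensor::cat(&[prev, new_k], 2)?.contiguous(),
        None => new_k.contiguous(),
    }
}
```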
Action Item: I will open a separate issue to discuss adopting ConcatKV-Cache as the default KV-cache implementation across all transformer models in Candle. This would enable DRY practices and better performance by default.
Code Organization
This PR introduces an improved organizational pattern that should be considered for future transformer implementations:
Unified Module Structure
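A plausible sketch of the module root, inferred from the file list under Files Changed below; the actual `mod.rs` may contain additional items or re-exports.

```rust
// candle-transformers/src/models/smol/mod.rs
pub mod quantized_smollm3;
pub mod smollm3;
```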
Single Example for Multiple Model Types
The `examples/smollm3/main.rs` demonstrates a unified approach (sketched below):
- `SmolLM3Model` wrapping both implementations
- `ModelConfig` abstraction for consistent access
- `--model-type` flag switches between full and quantized

Benefits:
This pattern could be adopted for other model families (e.g., Llama, Mistral) to provide a more cohesive user experience.
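To make the wrapper pattern concrete, here is a minimal sketch, assuming both modules expose a model type with a `forward(input, position)` method; the variant types, method signature, and names are assumptions rather than the PR's exact code.

```rust
use candle_core::{Result, Tensor};
use candle_transformers::models::smol::{quantized_smollm3, smollm3};

// Single wrapper the example can drive regardless of which weights were loaded.
enum SmolLM3Model {
    Full(smollm3::Model),
    Quantized(quantized_smollm3::ModelWeights),
}

impl SmolLM3Model {
    fn forward(&mut self, input: &Tensor, index_pos: usize) -> Result<Tensor> {
        match self {
            Self::Full(m) => m.forward(input, index_pos),
            Self::Quantized(m) => m.forward(input, index_pos),
        }
    }
}
```

Dispatching through a single enum keeps the tokenizer handling and sampling loop identical for both back ends.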
Example Usage
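A hypothetical invocation, assuming the usual candle example flags; only `--model-type` is confirmed by this PR, and the other flags and accepted values are assumptions.

```bash
# Quantized (GGUF) run; change the value to select the full safetensors model.
cargo run --example smollm3 --release -- \
  --model-type quantized \
  --prompt "Give me a short introduction to Rust."
```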
Testing
Files Changed
New Files:
- `candle-transformers/src/models/smol/mod.rs`
- `candle-transformers/src/models/smol/smollm3.rs`
- `candle-transformers/src/models/smol/quantized_smollm3.rs`
- `candle-transformers/src/models/smol/README.md`
- `candle-examples/examples/smollm3/main.rs`
- `candle-examples/examples/smollm3/README.md`

Modified Files:
- `candle-transformers/src/models/mod.rs` (added `pub mod smol;`)
- `candle-examples/Cargo.toml` (added `chrono = "0.4"`)

Related Issues
Checklist