Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
I wanted to benchmark the newly released NVIDIA Nemotron 3 Nano model, but it does not seem to be supported by ik_llama.cpp.
I'm using the following version of ik_llama.cpp:
./build/bin/llama-server --version
version: 4072 (21fc9322)
built with gcc (GCC) 15.2.0 for x86_64-pc-linux-gnu
and this quantisation from Unsloth.
The Unsloth guide is located at
https://docs.unsloth.ai/models/nemotron-3
When trying to run llama-bench, it exits with an error:
./build/bin/llama-bench -v -m ~/Downloads/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf
... verbose output shortened ...
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'nemotron_h_moe'
llama_load_model_from_file: failed to load model
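For reference, the architecture string that the loader rejects here is the `general.architecture` key in the GGUF metadata, which can be checked without llama.cpp at all. The following is a minimal sketch using only the Python standard library; it assumes the GGUF v3 header layout and handles only string-typed metadata values (real files contain many other value types, but `general.architecture` is conventionally the first key):

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_TYPE_STRING = 8  # GGUF metadata value type for strings

def _gguf_str(s: bytes) -> bytes:
    # GGUF strings: uint64 length followed by raw bytes
    return struct.pack("<Q", len(s)) + s

def build_minimal_gguf(arch: bytes) -> bytes:
    # Hypothetical helper for demonstration: a GGUF v3 header with no
    # tensors and a single metadata key/value pair.
    buf = GGUF_MAGIC + struct.pack("<IQQ", 3, 0, 1)  # version, tensors, kv count
    buf += _gguf_str(b"general.architecture")
    buf += struct.pack("<I", GGUF_TYPE_STRING)
    buf += _gguf_str(arch)
    return buf

def read_architecture(data: bytes) -> str:
    # Parse the header and scan metadata for general.architecture.
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    _version, _n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    off = 24  # 4 magic + 4 version + 8 tensor count + 8 kv count
    for _ in range(n_kv):
        klen, = struct.unpack_from("<Q", data, off); off += 8
        key = data[off:off + klen]; off += klen
        vtype, = struct.unpack_from("<I", data, off); off += 4
        if vtype != GGUF_TYPE_STRING:
            raise ValueError("only string values handled in this sketch")
        vlen, = struct.unpack_from("<Q", data, off); off += 8
        val = data[off:off + vlen]; off += vlen
        if key == b"general.architecture":
            return val.decode()
    raise KeyError("general.architecture not found")

if __name__ == "__main__":
    data = build_minimal_gguf(b"nemotron_h_moe")
    print(read_architecture(data))  # nemotron_h_moe
```

Against a real model file, reading the first few kilobytes and passing them to `read_architecture` would confirm whether the quant really carries the `nemotron_h_moe` architecture string that the loader does not recognise.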
To be honest, I have no idea how much effort would be required to support this model; if the effort is too high, it might not make sense to invest work here. On the other hand, if it turns out to be easy, it would be interesting to compare CPU-only performance with llama.cpp.
Motivation
NVIDIA Nemotron 3 Nano looks promising, at least on paper: NVIDIA claims it is noticeably faster on the GPU than comparable MoE models such as Qwen3-30B-A3B. That would make it an interesting option as a fast MoE model for CPU-only or mixed CPU/GPU inference on weak hardware.
Possible Implementation
No response