What happened?
Summary
When running ik_llama in server mode with embeddings enabled, the server can enter a tight infinite loop during embedding requests. The loop repeatedly posts NEXT_RESPONSE, clears the KV cache, and creates new internal tasks. Once triggered, the server never completes the request, CPU usage spikes, and logs grow without bound. The only recovery is to forcibly terminate and restart the server process.
This issue has been reproduced consistently with Qwen3-Embedding-4B, including the official GGUF release, and appears independent of custom quantization.
Environment
- Commit: a3737f42
- Operating System: Windows
- Execution mode: llama-server.exe (server mode)
- Hardware:
  - GPU 0: NVIDIA GeForce RTX 5090
  - GPU 1: NVIDIA GeForce RTX 3080
- Client: SillyTavern embedding plugin configured to use the llama.cpp API
- Model: Qwen3-Embedding-4B
- Reproduced with:
  - Qwen3-Embedding-4B-f16.gguf (official release)
  - Qwen3-Embedding-4B-Bf16-IQ6_K.gguf (custom quant)
- Behavior is identical across these GGUFs.
- Official model source used for reproduction: https://huggingface.co/Qwen/Qwen3-Embedding-4B-GGUF
Server Startup Command
(Executable path redacted)
llama-server.exe
-m Qwen3-Embedding-4B-f16.gguf
--threads 15
--threads-batch 16
--tensor-split 32,10
--n-gpu-layers 10
--embedding
Server Startup / Model Load (excerpt)
The server starts normally and loads the model without errors. Startup output:
Launching ik-llama-server...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 10239 MiB
CUDA0: using device CUDA0 - 30841 MiB free
CUDA1: using device CUDA1 - 9071 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 398 tensors from A:\AI\AI Models\Embedding\Qwen3-Embedding-4B-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 Embedding 4B
llama_model_loader: - kv 3: general.basename str = Qwen3-Embedding
llama_model_loader: - kv 4: general.size_label str = 4B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.base_model.count u32 = 1
llama_model_loader: - kv 7: general.base_model.0.name str = Qwen3 4B Base
llama_model_loader: - kv 8: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 9: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B-...
llama_model_loader: - kv 10: general.tags arr[str,5] = ["transformers", "sentence-transforme...
llama_model_loader: - kv 11: qwen3.block_count u32 = 36
llama_model_loader: - kv 12: qwen3.context_length u32 = 40960
llama_model_loader: - kv 13: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 14: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 15: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 16: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 18: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 20: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 21: general.file_type u32 = 1
llama_model_loader: - kv 22: qwen3.pooling_type u32 = 3
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151665] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151665] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 31: tokenizer.ggml.eot_token_id u32 = 151645
llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = true
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 35: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type f16: 253 tensors
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen3
llm_load_print_meta: n_ctx_train = 40960
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_layer = 36
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 9728
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 3
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 40960
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 4B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 4.022 B
llm_load_print_meta: model size = 7.492 GiB (16.001 BPW)
llm_load_print_meta: general.name = Qwen3 Embedding 4B
print_info: vocab type = BPE
print_info: n_vocab = 151665
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151643 '<|endoftext|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
llm_load_tensors: ggml ctx size = 2.06 MiB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/37 layers to GPU
llm_load_tensors: CPU buffer size = 7671.30 MiB
llm_load_tensors: CUDA0 buffer size = 1540.16 MiB
llm_load_tensors: CUDA1 buffer size = 385.04 MiB
.....................................................................................
llama_new_context_with_model: n_ctx = 40960
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: grouped er = 0
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: fused_mmad = 1
llama_new_context_with_model: rope_cache = 0
llama_new_context_with_model: graph_reuse = 0
llama_new_context_with_model: k_cache_hadam = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 4160.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1280.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 320.00 MiB
llama_new_context_with_model: KV self size = 5760.00 MiB, K (f16): 2880.00 MiB, V (f16): 2880.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1046.78 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 115.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 46.01 MiB
llama_new_context_with_model: graph nodes = 978
llama_new_context_with_model: graph splits = 316
XXXXXXXXXXXXXXXXXXXXX Setting only active experts offload
prompt cache is enabled, size limit: 8192 MiB
use `--cache-ram 0` to disable the prompt cache
There are no warnings or errors during initialization that would indicate a misconfiguration.
Reproduction Steps (Observed)
This issue is reproducible using SillyTavern as a client:
- Start ik_llama in server mode with embeddings enabled, using the command above.
- Configure SillyTavern:
  - Enable embedding / vector storage.
  - Set the provider to the llama.cpp API.
  - Configure vectorization to embed previous chat messages.
- Use a chat with at least ~10–20 messages.
  - Initial vectorization often succeeds.
- In SillyTavern’s vector storage settings:
  - Click “Purge all”.
  - Then click “Vectorize all” to trigger bulk embedding of messages.
After this step, the server consistently enters the infinite loop described below.
Note: The exact HTTP endpoint used internally by SillyTavern is not visible in the UI.
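To take SillyTavern out of the equation, it may help to hit the server's embedding endpoints directly. The sketch below is a minimal probe, assuming the fork exposes the same /embedding and /v1/embeddings routes as upstream llama.cpp and listens on 127.0.0.1:8080 as in the log above; I have not verified either assumption against ik_llama's HTTP layer.

# probe_embedding.py - minimal direct probe of the embedding endpoints.
# Assumes ik_llama exposes the upstream llama.cpp routes /embedding and
# /v1/embeddings on 127.0.0.1:8080 (see the startup log); adjust if the
# fork differs. Pure stdlib, no extra dependencies.
import json
import urllib.request

BASE = "http://127.0.0.1:8080"

def post(path, payload, timeout=30.0):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    # Native llama.cpp-style endpoint.
    print(post("/embedding", {"content": "hello world"})[:200])
    # OpenAI-compatible endpoint (what embedding plugins typically use).
    print(post("/v1/embeddings", {"input": ["hello world"]})[:200])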
Observed Behavior
Once triggered, the server enters a tight loop and never recovers. Debug logs show a repeating pattern similar to:
VERB [update_slots] posting NEXT_RESPONSE
INFO [update_slots] kv cache rm [p0, end)
VERB [update_slots] prompt processing progress | n_past=0 n_tokens=0
VERB [update_slots] no tokens to decode
VERB [post] new task id | new_id=...
Characteristics of the failure:
- The same slot remains active indefinitely.
- n_tokens=0 and n_past=0 on every iteration (see the probe sketch after this list).
- A new internal task ID is created on each loop iteration.
- KV cache is repeatedly cleared.
- CPU usage spikes (tight control-flow loop).
- Log output grows extremely fast (on the order of GB per minute when verbose logging is enabled and redirected to a file).
- The embedding request never completes.
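Because every iteration reports n_tokens=0, and the detailed log at the bottom shows "empty prompt - releasing slot" on each pass, one plausible trigger is an input that is empty (or empty after tokenization) somewhere in the batch SillyTavern submits. Below is a small probe for that hypothesis, under the same endpoint assumptions as the previous sketch; run it against a disposable server instance, since a positive result leaves the server spinning until it is killed.

# empty_input_probe.py - does an empty input alone wedge the server?
# Same endpoint/port assumptions as the previous sketch (not verified
# against ik_llama). An HTTP error here would be a graceful rejection;
# a timeout suggests the livelock.
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/embeddings"

payloads = [
    {"input": [""]},          # a single empty string
    {"input": ["ok", ""]},    # an empty string mixed into a batch
]

for payload in payloads:
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            print(payload["input"], "-> HTTP", resp.status)
    except Exception as exc:  # HTTPError, timeout, etc. - report and move on
        print(payload["input"], "->", repr(exc))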
Expected Behavior
Embedding requests should either:
- Complete successfully and return embeddings, or
- Fail gracefully with an error response.
Under no circumstances should an embedding request cause an infinite scheduler loop, CPU spin, or unbounded log growth.
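Put differently, any embedding request, well-formed or not, should terminate within bounded time with either a result or an error status. A rough client-side acceptance check along those lines, carrying over the same endpoint and port assumptions as the sketches above:

# expected_behavior_check.py - every embedding request must finish in bounded
# time with either data or an error; hanging past the timeout is a failure.
# Endpoint/port are assumptions carried over from the earlier sketches.
import json
import urllib.error
import urllib.request

URL = "http://127.0.0.1:8080/v1/embeddings"

def check(payload, timeout=30):
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return f"ok (HTTP {resp.status})"
    except urllib.error.HTTPError as err:
        return f"graceful error (HTTP {err.code})"  # acceptable outcome
    except (TimeoutError, urllib.error.URLError) as err:
        # A timeout here is the suspected livelock; other URLErrors are
        # transport issues (e.g. connection refused).
        return f"FAIL or transport issue: {err!r}"

for payload in [{"input": ["hello world"]}, {"input": [""]}]:
    print(payload["input"], "->", check(payload))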
Scope and Notes
- The issue reproduces with the official Qwen3-Embedding-4B GGUF, not only with custom quants.
- I have not yet tested other embedding models.
- The behavior appears entirely server-side once triggered.
- Even if the triggering input or client behavior is invalid, the server should handle it gracefully rather than entering a livelock.
Impact
This issue makes ik_llama effectively unusable for embeddings in this configuration. A single triggering request permanently wedges the server until restart, disrupting a core workflow and potentially exhausting disk space due to log growth.
Attachments
- Full server debug log showing the infinite NEXT_RESPONSE loop and repeated KV cache clearing.
Name and Version
Version field
From the provided log.zip, the binary reports:
.\llama-server.exe --version
version: 4050 (a2d9b040)
built with MSVC 19.50.35719.0 for Windows
Note: The upstream commit my build is actually based on is a3737f42. The version string reports a2d9b040 because the working tree contains additional local commits (modifying CMakePresets.json), so the reported commit hash differs from the upstream commit.
If the issue form wants a single “version” value, paste the full string above (preferred), or just the commit hash:
a2d9b040
What operating system are you seeing the problem on?
Windows
Relevant log output
INFO [ main] build info | tid="39704" timestamp=1766174872 build=4050 commit="a2d9b040"
INFO [ main] system info | tid="39704" timestamp=1766174872 n_threads=15 / n_threads_batch=16 ...
INFO [ init] initializing slots | tid="39704" timestamp=1766174874 n_slots=1
INFO [ init] new slot | tid="39704" timestamp=1766174874 id_slot=0 n_ctx_slot=40960
INFO [ main] model loaded | tid="39704" timestamp=1766174874
INFO [ main] HTTP server listening | tid="39704" timestamp=1766174874 hostname="127.0.0.1" port="8080" n_threads_http="15"
VERB [ update_slots] posting NEXT_RESPONSE | tid="39704" timestamp=1766174901
VERB [ post] new task id | tid="39704" timestamp=1766174901 new_id=8
VERB [ update_slots] tokenizing prompt | tid="39704" timestamp=1766174901 id_slot=0 id_task=2
INFO [ update_slots] empty prompt - releasing slot | tid="39704" timestamp=1766174901 id_slot=0 id_task=2
VERB [ update_slots] no tokens to decode | tid="39704" timestamp=1766174901
VERB [ get_available_slot] selected slot by lru | tid="39704" timestamp=1766174901 id_slot=0 t_last=28604106
VERB [ update_slots] posting NEXT_RESPONSE | tid="39704" timestamp=1766174901
VERB [ post] new task id | tid="39704" timestamp=1766174901 new_id=9
VERB [ update_slots] tokenizing prompt | tid="39704" timestamp=1766174901 id_slot=0 id_task=3
INFO [ update_slots] empty prompt - releasing slot | tid="39704" timestamp=1766174901 id_slot=0 id_task=3
VERB [ update_slots] no tokens to decode | tid="39704" timestamp=1766174901
VERB [ get_available_slot] selected slot by lru | tid="39704" timestamp=1766174901 id_slot=0 t_last=28605024
VERB [ update_slots] posting NEXT_RESPONSE | tid="39704" timestamp=1766174901
VERB [ post] new task id | tid="39704" timestamp=1766174901 new_id=10
VERB [ update_slots] tokenizing prompt | tid="39704" timestamp=1766174901 id_slot=0 id_task=4
INFO [ update_slots] empty prompt - releasing slot | tid="39704" timestamp=1766174901 id_slot=0 id_task=4
VERB [ update_slots] no tokens to decode | tid="39704" timestamp=1766174901
...
(repeats indefinitely; CPU pegged; log grows extremely fast)