
failures running many GGUF embedding models #1647

@arvindvenkataramani

Description


Which version of LM Studio?
Example: LM Studio 0.4.6

Which operating system?
macOS Sequoia 15.6.4

What is the bug?
I get errors with GGUF versions of various embedding models when trying to use them with knowledge stacks in Msty Studio.

The error claims an out-of-memory condition, but I have a 48 GB M4 Max and LM Studio reports only 10 GB in use (verified with Activity Monitor).

I am able to use text-embedding-qwen3-embedding-0.6b, text-embedding-nomic-embed-text-v1.5@q8_0, and text-embedding-mxbai-embed-large-v1 successfully.

Logs
Attempting to use the Qwen 3 4b and 8b embedding models:

```
2026-03-13 23:13:24 [DEBUG]
 [lmstudio-llama-cpp] LLM: Embedding failed: Failed during string embedding. Message: Unknown exception caused embedding to stop: Failed to decode batch! Error:
2026-03-13 23:13:24 [DEBUG]
 [WARNING] At least one last token in strings embedded is not SEP. 'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header
2026-03-13 23:13:24 [DEBUG]
 ggml_metal_synchronize: error: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
ggml_metal_synchronize: error: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
2026-03-13 23:13:24 [DEBUG]
 ggml_metal_synchronize: error: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
      [repeated several times]
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
ggml_metal_graph_compute: backend is in error state from a previous command buffer failure - recreate the backend to recover
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
process_ubatch: failed to compute graph, compute status: -1
decode: removing memory module entries for seq_id = 0, pos = [0, +inf)
llama_decode: failed to decode, ret = -3
BatchDecode : failed to decode
```
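For reference, the `[WARNING]` line above says the model's GGUF header should set `tokenizer.ggml.add_eos_token` to `true`. A minimal sketch of that check, assuming the metadata has already been dumped to a plain dict (e.g. via the `gguf` Python package or llama.cpp's `gguf_dump.py`; the helper name and sample values below are hypothetical, not part of LM Studio):

```python
def eos_flag_ok(metadata: dict) -> bool:
    """Return True if the GGUF metadata asks the tokenizer to append
    an EOS/SEP token to each embedded string (the key named in the
    LM Studio warning). A missing or false value triggers the warning."""
    return bool(metadata.get("tokenizer.ggml.add_eos_token", False))

# Hypothetical metadata dicts for illustration only:
working = {"tokenizer.ggml.add_eos_token": True}
failing = {"tokenizer.ggml.model": "gpt2"}  # flag missing entirely

print(eos_flag_ok(working))  # True
print(eos_flag_ok(failing))  # False
```

If the flag really is absent in the affected Qwen 3 GGUF conversions, that would distinguish a model-conversion problem from the Metal out-of-memory errors that follow.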

llama-embed-nemotron-8b (GGUF) does not auto-load on request, and when loaded manually it fails with:

```
No models loaded. Please load a model in the developer page or use the 'lms load' command.
```

To Reproduce

Steps to reproduce the behavior:
1. Install GGUF embedding models in LM Studio
2. Select an embedding model in Msty Studio
3. Attempt to compute embeddings
4. The request fails with one of the two categories of errors above
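To help isolate whether the failure sits in LM Studio or in Msty Studio's client, the same embedding call can be exercised directly against LM Studio's OpenAI-compatible local server. A sketch, assuming the server is running on its default port 1234 (the model identifier is an example; use whatever `lms ls` reports):

```python
import json
import urllib.request


def embedding_payload(texts: list[str], model: str) -> dict:
    """Build the JSON body for an OpenAI-compatible /v1/embeddings call."""
    return {"model": model, "input": texts}


def embed(texts: list[str], model: str,
          base_url: str = "http://localhost:1234/v1") -> dict:
    """POST to LM Studio's local /v1/embeddings endpoint and return
    the parsed JSON response. Port 1234 is LM Studio's default but
    may differ on a given machine."""
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=json.dumps(embedding_payload(texts, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (with the server running and the model loaded):
#   out = embed(["hello world"], "text-embedding-qwen3-embedding-0.6b")
#   print(len(out["data"][0]["embedding"]))
```

If this call reproduces the Metal out-of-memory errors with no client in the loop, the bug is in LM Studio's embedding path rather than in how Msty Studio drives it.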
