Crashes in llama #189

@iilyak

Description

🐛 Bug Description

I tried different models and different backends. Shimmy always crashes on the first message from VS Code.

🔄 Steps to Reproduce

  1. Run shimmy serve --gpu-backend cpu --model-dirs /Volumes/VMs/models
  2. Configure VS Code with the Local Model Provider extension:
{
    "local.model.provider.serverUrl": "http://127.0.0.1:11435/v1"
}
  3. Select a model (I tried "typst-coder-9b.q8-0" and "qwen3-coder-30b-a3b-instruct-q6-k") and send a chat message, or use the standalone sketch after this list.
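
To isolate the crash from VS Code, here is a minimal standalone reproduction sketch. It assumes shimmy's /v1 endpoint speaks the OpenAI chat-completions protocol (the extension's serverUrl setting suggests it does, but I have not confirmed the exact path), and the repeated prompt stands in for the large first message the extension sends:

// repro.cpp - hypothetical standalone reproduction, independent of VS Code.
// Build and run: c++ repro.cpp -o repro && ./repro
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <string>

int main() {
    // A long user message; the extension's first message is similarly large.
    std::string prompt;
    for (int i = 0; i < 300; ++i) prompt += "tell me about llamas ";

    const std::string body =
        "{\"model\":\"typst-coder-9b.q8-0\",\"messages\":"
        "[{\"role\":\"user\",\"content\":\"" + prompt + "\"}]}";
    const std::string request =
        "POST /v1/chat/completions HTTP/1.1\r\n"
        "Host: 127.0.0.1:11435\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: " + std::to_string(body.size()) + "\r\n"
        "Connection: close\r\n\r\n" + body;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(11435);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (sockaddr *) &addr, sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }
    send(fd, request.data(), request.size(), 0);

    // Print whatever comes back; if shimmy aborts, the connection just drops.
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0) {
        fwrite(buf, 1, (size_t) n, stdout);
    }
    close(fd);
    return 0;
}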

✅ Expected Behavior

I expected output from the model in the VS Code chat window.

❌ Actual Behavior

I got an error in VS Code and shimmy crashed, which could be related to Q6 quantization.

📦 Shimmy Version

Latest (main branch)

💻 Operating System

macOS

📥 Installation Method

Pre-built binary from releases

🌍 Environment Details

My hardware:

  • macOS: 15.7.3
  • CPU: Apple M1 Max
  • Unified memory: 64 GB

📋 Logs/Error Messages

Error message on the VS Code side:

Sorry, your request failed. Please try again.

Copilot Request id: 229d6540-1300-4d6e-867a-4be43093cf1f

Reason: Chat completion request failed: terminated: GatewayError: Chat completion request failed: terminated
    at q.streamChatCompletion (/Users/iilyak2/.vscode/extensions/krevas.local-model-provider-1.1.1/out/extension.js:2:314)
    at process.processTicksAndRejections (node:internal/process/task_queues:103:5)
    at async N.provideLanguageModelChatResponse (/Users/iilyak2/.vscode/extensions/krevas.local-model-provider-1.1.1/out/extension.js:14:1037)


Tail of the shimmy process output at the crash, followed by an lldb backtrace:

llama_kv_cache: layer  47: dev = CPU
llama_kv_cache:        CPU KV buffer size =   384.00 MiB
llama_kv_cache: size =  384.00 MiB (  4096 cells,  48 layers,  1/1 seqs), K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 3480
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      Metal compute buffer size =   398.62 MiB
llama_context:        CPU compute buffer size =    24.01 MiB
llama_context: graph nodes  = 1495
llama_context: graph splits = 530 (with bs=512), 1 (with bs=1)
/Users/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/shimmy-llama-cpp-sys-2-0.1.123/llama.cpp/src/llama-context.cpp:997: GGML_ASSERT(n_tokens_all <= cparams.n_batch) failed
(lldb) process attach --pid 4675
Process 4675 stopped
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x000000018cb053cc libsystem_kernel.dylib`__psynch_cvwait + 8
libsystem_kernel.dylib`__psynch_cvwait:
->  0x18cb053cc <+8>:  b.lo   0x18cb053ec    ; <+40>
    0x18cb053d0 <+12>: pacibsp
    0x18cb053d4 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x18cb053d8 <+20>: mov    x29, sp
Target 0: (shimmy) stopped.
Executable binary set to "/Users/iilyak/.local/bin/shimmy".
Architecture set to: arm64-apple-macosx-.
(lldb) bt
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000018cb053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018cb4409c libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x0000000100f0c608 shimmy`std::sync::poison::condvar::Condvar::wait::hd046f0052c651dc8 + 76
    frame #3: 0x0000000100f0c94c shimmy`tokio::runtime::park::Inner::park::h3f94c11df27ac8d1 + 92
    frame #4: 0x0000000100bb3864 shimmy`shimmy::main::hba1ae758e82f1e8f + 3604
    frame #5: 0x0000000100b81f94 shimmy`std::sys::backtrace::__rust_begin_short_backtrace::haf34922edfb8182d + 12
    frame #6: 0x0000000100bf10dc shimmy`main + 884
    frame #7: 0x000000018c7a2b98 dyld`start + 6076
(lldb) quit
fish: Job 1, 'shimmy serve --gpu-backend cpu …' terminated by signal SIGABRT (Abort)
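
My reading of the assert (an interpretation, not a confirmed diagnosis): GGML_ASSERT(n_tokens_all <= cparams.n_batch) in llama-context.cpp fires when a single llama_decode call is handed more tokens than the context's configured batch size, which the graph_reserve lines above show is 512. The extension's first chat message carries a sizeable system prompt, so it can easily exceed 512 tokens; if shimmy submits the whole tokenized prompt in one decode call, the assert aborts the process regardless of model or quantization. The usual caller-side pattern, as in llama.cpp's own examples, is to feed the prompt in n_batch-sized chunks. A sketch against the current C API (note that llama_batch_get_one took extra position/sequence arguments in older llama.cpp versions):

#include "llama.h"

#include <algorithm>
#include <vector>

// Feed a tokenized prompt to the context in chunks of at most n_batch tokens,
// so that llama_decode never violates n_tokens_all <= cparams.n_batch.
static int decode_prompt(llama_context * ctx, std::vector<llama_token> & tokens) {
    const int n_batch = (int) llama_n_batch(ctx); // 512 in the log above

    for (int i = 0; i < (int) tokens.size(); i += n_batch) {
        const int n_eval = std::min(n_batch, (int) tokens.size() - i);

        // Single-sequence batch over tokens[i .. i + n_eval); positions
        // continue from the KV cache state, and logits are produced only
        // for the batch's last token.
        llama_batch batch = llama_batch_get_one(tokens.data() + i, n_eval);

        if (llama_decode(ctx, batch) != 0) {
            return 1; // decode failed
        }
    }
    return 0;
}

Raising n_batch in llama_context_params when creating the context would also hide the crash for prompts up to the new limit, but chunking handles arbitrary prompt lengths. Either way, the server should surface a decode error to the client instead of letting GGML_ASSERT abort the whole process.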

📝 Additional Context

No response
