🐛 Bug Description
I tried different models and different backends; shimmy always crashes on the first message sent from VS Code.

🔄 Steps to Reproduce
1. Start shimmy (shimmy serve --gpu-backend cpu …, as shown in the fish job line in the logs below).
2. Point the VS Code extension (krevas.local-model-provider) at it:
   { "local.model.provider.serverUrl": "http://127.0.0.1:11435/v1" }
3. Send a first message in the VS Code chat window. A direct-HTTP reproduction sketch follows the environment fields below.

✅ Expected Behavior
Output from the model in the VS Code chat window.

❌ Actual Behavior
An error in VS Code and a crash of shimmy, which could be related to Q6 quantization.

📦 Shimmy Version
Latest (main branch)

💻 Operating System
macOS

📥 Installation Method
Pre-built binary from releases

🌍 Environment Details
Apple Silicon Mac (arm64-apple-macosx, per the lldb session in the logs below)
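To reproduce without VS Code in the loop, the same first message can be sent directly to the shimmy server configured above. This is a minimal sketch, assuming shimmy exposes an OpenAI-compatible /v1/chat/completions route at the serverUrl the extension uses; the endpoint path and the MODEL_NAME placeholder are assumptions, not details confirmed by the report.

```python
# Minimal reproduction sketch: send one streaming chat request straight
# to the shimmy server, bypassing the VS Code extension. Assumes an
# OpenAI-compatible /v1/chat/completions route on the configured
# serverUrl; "MODEL_NAME" is a hypothetical placeholder for whatever
# model shimmy has loaded.
import json
import urllib.request

url = "http://127.0.0.1:11435/v1/chat/completions"
payload = {
    "model": "MODEL_NAME",  # placeholder, not taken from the report
    "messages": [{"role": "user", "content": "hello"}],
    "stream": True,  # the extension streams (q.streamChatCompletion)
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# On the failing setup this request should never complete: shimmy
# aborts with SIGABRT and the connection terminates.
with urllib.request.urlopen(req, timeout=60) as resp:
    for line in resp:
        print(line.decode("utf-8", errors="replace").rstrip())
```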
📋 Logs/Error Messages
Error message on the VS Code side:
Sorry, your request failed. Please try again.
Copilot Request id: 229d6540-1300-4d6e-867a-4be43093cf1f
Reason: Chat completion request failed: terminated: GatewayError: Chat completion request failed: terminated
    at q.streamChatCompletion (/Users/iilyak2/.vscode/extensions/krevas.local-model-provider-1.1.1/out/extension.js:2:314)
    at process.processTicksAndRejections (node:internal/process/task_queues:103:5)
    at async N.provideLanguageModelChatResponse (/Users/iilyak2/.vscode/extensions/krevas.local-model-provider-1.1.1/out/extension.js:14:1037)
Tail of the shimmy process crash:
llama_kv_cache: layer 47: dev = CPU
llama_kv_cache: CPU KV buffer size = 384.00 MiB
llama_kv_cache: size = 384.00 MiB ( 4096 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 3480
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
llama_context: Metal compute buffer size = 398.62 MiB
llama_context: CPU compute buffer size = 24.01 MiB
llama_context: graph nodes = 1495
llama_context: graph splits = 530 (with bs=512), 1 (with bs=1)
/Users/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/shimmy-llama-cpp-sys-2-0.1.123/llama.cpp/src/llama-context.cpp:997: GGML_ASSERT(n_tokens_all <= cparams.n_batch) failed
(lldb) process attach --pid 4675
Process 4675 stopped
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x000000018cb053cc libsystem_kernel.dylib`__psynch_cvwait + 8
libsystem_kernel.dylib`__psynch_cvwait:
-> 0x18cb053cc <+8>: b.lo 0x18cb053ec ; <+40>
0x18cb053d0 <+12>: pacibsp
0x18cb053d4 <+16>: stp x29, x30, [sp, #-0x10]!
0x18cb053d8 <+20>: mov x29, sp
Target 0: (shimmy) stopped.
Executable binary set to "/Users/iilyak/.local/bin/shimmy".
Architecture set to: arm64-apple-macosx-.
(lldb) bt
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000018cb053cc libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x000000018cb4409c libsystem_pthread.dylib`_pthread_cond_wait + 984
frame #2: 0x0000000100f0c608 shimmy`std::sync::poison::condvar::Condvar::wait::hd046f0052c651dc8 + 76
frame #3: 0x0000000100f0c94c shimmy`tokio::runtime::park::Inner::park::h3f94c11df27ac8d1 + 92
frame #4: 0x0000000100bb3864 shimmy`shimmy::main::hba1ae758e82f1e8f + 3604
frame #5: 0x0000000100b81f94 shimmy`std::sys::backtrace::__rust_begin_short_backtrace::haf34922edfb8182d + 12
frame #6: 0x0000000100bf10dc shimmy`main + 884
frame #7: 0x000000018c7a2b98 dyld`start + 6076
(lldb) quit
fish: Job 1, 'shimmy serve --gpu-backend cpu …' terminated by signal SIGABRT (Abort)
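Note that the lldb backtrace above shows only thread #1, the main thread, parked in tokio's runtime on a condvar; the GGML_ASSERT abort fires on a worker thread, so the crashing frame is not captured here (thread backtrace all in lldb would show it). The failed assertion, GGML_ASSERT(n_tokens_all <= cparams.n_batch) at llama-context.cpp:997, means a single decode call was handed more tokens than the context's configured batch size (likely 512, given the worst-case n_tokens = 512 reserve lines, though n_batch itself is not printed). Below is a hedged Python sketch of the chunking invariant llama.cpp expects from its caller; the function is illustrative, not shimmy's actual code.

```python
# Hedged illustration (not shimmy's actual code) of the batching
# invariant behind the failed assert: llama.cpp requires every decode
# call to carry at most n_batch tokens, so a caller must split a long
# prompt into n_batch-sized chunks instead of submitting it whole.
from typing import Callable

def decode_prompt(decode: Callable[[list[int]], None],
                  tokens: list[int],
                  n_batch: int = 512) -> None:
    """Feed `tokens` to `decode` in chunks of at most n_batch.

    `decode` stands in for a llama_decode binding. Passing all of
    `tokens` in one call when len(tokens) > n_batch is exactly what
    trips GGML_ASSERT(n_tokens_all <= cparams.n_batch) and aborts.
    """
    for start in range(0, len(tokens), n_batch):
        decode(tokens[start:start + n_batch])
```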
📝 Additional Context
No response