Crashes in llama #189

@iilyak

Description

🐛 Bug Description

I tried different models and different backends. Shimmy always crashes on the first message from VS Code.

🔄 Steps to Reproduce

  1. Run shimmy serve --gpu-backend cpu --model-dirs /Volumes/VMs/models
  2. Configure VS Code with the Local Model Provider extension:
{
    "local.model.provider.serverUrl": "http://127.0.0.1:11435/v1"
}
  3. Select a model (I tried "typst-coder-9b.q8-0" and "qwen3-coder-30b-a3b-instruct-q6-k") and send a chat message, or use the standalone sketch after this list.
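
To isolate the crash from VS Code, here is a minimal standalone reproduction sketch. It assumes shimmy's /v1 endpoint speaks the OpenAI chat-completions protocol (the extension's serverUrl setting suggests it does, but I have not confirmed the exact path), and the repeated prompt stands in for the large first message the extension sends:

// repro.cpp - hypothetical standalone reproduction, independent of VS Code.
// Build and run: c++ repro.cpp -o repro && ./repro
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <string>

int main() {
    // A long user message; the extension's first message is similarly large.
    std::string prompt;
    for (int i = 0; i < 300; ++i) prompt += "tell me about llamas ";

    const std::string body =
        "{\"model\":\"typst-coder-9b.q8-0\",\"messages\":"
        "[{\"role\":\"user\",\"content\":\"" + prompt + "\"}]}";
    const std::string request =
        "POST /v1/chat/completions HTTP/1.1\r\n"
        "Host: 127.0.0.1:11435\r\n"
        "Content-Type: application/json\r\n"
        "Content-Length: " + std::to_string(body.size()) + "\r\n"
        "Connection: close\r\n\r\n" + body;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(11435);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (sockaddr *) &addr, sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }
    send(fd, request.data(), request.size(), 0);

    // Print whatever comes back; if shimmy aborts, the connection just drops.
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof(buf), 0)) > 0) {
        fwrite(buf, 1, (size_t) n, stdout);
    }
    close(fd);
    return 0;
}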

✅ Expected Behavior

I expected output from the model in the VS Code chat window.

❌ Actual Behavior

I got an error in VS Code and shimmy crashed, which could be related to Q6 quantization.

📦 Shimmy Version

Latest (main branch)

💻 Operating System

macOS

📥 Installation Method

Pre-built binary from releases

🌍 Environment Details

My hardware:

  • macOS: 15.7.3
  • CPU: Apple M1 Max
  • Unified memory: 64 GB

📋 Logs/Error Messages

Error message on the VS Code side:

Sorry, your request failed. Please try again.

Copilot Request id: 229d6540-1300-4d6e-867a-4be43093cf1f

Reason: Chat completion request failed: terminated: GatewayError: Chat completion request failed: terminated
    at q.streamChatCompletion (/Users/iilyak2/.vscode/extensions/krevas.local-model-provider-1.1.1/out/extension.js:2:314)
    at process.processTicksAndRejections (node:internal/process/task_queues:103:5)
    at async N.provideLanguageModelChatResponse (/Users/iilyak2/.vscode/extensions/krevas.local-model-provider-1.1.1/out/extension.js:14:1037)


Tail of the shimmy process output at the crash, followed by an lldb backtrace:

llama_kv_cache: layer  47: dev = CPU
llama_kv_cache:        CPU KV buffer size =   384.00 MiB
llama_kv_cache: size =  384.00 MiB (  4096 cells,  48 layers,  1/1 seqs), K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 3480
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      Metal compute buffer size =   398.62 MiB
llama_context:        CPU compute buffer size =    24.01 MiB
llama_context: graph nodes  = 1495
llama_context: graph splits = 530 (with bs=512), 1 (with bs=1)
/Users/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/shimmy-llama-cpp-sys-2-0.1.123/llama.cpp/src/llama-context.cpp:997: GGML_ASSERT(n_tokens_all <= cparams.n_batch) failed
(lldb) process attach --pid 4675
Process 4675 stopped
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x000000018cb053cc libsystem_kernel.dylib`__psynch_cvwait + 8
libsystem_kernel.dylib`__psynch_cvwait:
->  0x18cb053cc <+8>:  b.lo   0x18cb053ec    ; <+40>
    0x18cb053d0 <+12>: pacibsp
    0x18cb053d4 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x18cb053d8 <+20>: mov    x29, sp
Target 0: (shimmy) stopped.
Executable binary set to "/Users/iilyak/.local/bin/shimmy".
Architecture set to: arm64-apple-macosx-.
(lldb) bt
* thread #1, name = 'main', queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000018cb053cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018cb4409c libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x0000000100f0c608 shimmy`std::sync::poison::condvar::Condvar::wait::hd046f0052c651dc8 + 76
    frame #3: 0x0000000100f0c94c shimmy`tokio::runtime::park::Inner::park::h3f94c11df27ac8d1 + 92
    frame #4: 0x0000000100bb3864 shimmy`shimmy::main::hba1ae758e82f1e8f + 3604
    frame #5: 0x0000000100b81f94 shimmy`std::sys::backtrace::__rust_begin_short_backtrace::haf34922edfb8182d + 12
    frame #6: 0x0000000100bf10dc shimmy`main + 884
    frame #7: 0x000000018c7a2b98 dyld`start + 6076
(lldb) quit
fish: Job 1, 'shimmy serve --gpu-backend cpu …' terminated by signal SIGABRT (Abort)
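
My reading of the assert (an interpretation, not a confirmed diagnosis): GGML_ASSERT(n_tokens_all <= cparams.n_batch) in llama-context.cpp fires when a single llama_decode call is handed more tokens than the context's configured batch size, which the graph_reserve lines above show is 512. The extension's first chat message carries a sizeable system prompt, so it can easily exceed 512 tokens; if shimmy submits the whole tokenized prompt in one decode call, the assert aborts the process regardless of model or quantization. The usual caller-side pattern, as in llama.cpp's own examples, is to feed the prompt in n_batch-sized chunks. A sketch against the current C API (note that llama_batch_get_one took extra position/sequence arguments in older llama.cpp versions):

#include "llama.h"

#include <algorithm>
#include <vector>

// Feed a tokenized prompt to the context in chunks of at most n_batch tokens,
// so that llama_decode never violates n_tokens_all <= cparams.n_batch.
static int decode_prompt(llama_context * ctx, std::vector<llama_token> & tokens) {
    const int n_batch = (int) llama_n_batch(ctx); // 512 in the log above

    for (int i = 0; i < (int) tokens.size(); i += n_batch) {
        const int n_eval = std::min(n_batch, (int) tokens.size() - i);

        // Single-sequence batch over tokens[i .. i + n_eval); positions
        // continue from the KV cache state, and logits are produced only
        // for the batch's last token.
        llama_batch batch = llama_batch_get_one(tokens.data() + i, n_eval);

        if (llama_decode(ctx, batch) != 0) {
            return 1; // decode failed
        }
    }
    return 0;
}

Raising n_batch in llama_context_params when creating the context would also hide the crash for prompts up to the new limit, but chunking handles arbitrary prompt lengths. Either way, the server should surface a decode error to the client instead of letting GGML_ASSERT abort the whole process.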

📝 Additional Context

No response
