Description:
I'm experiencing random assertion failures and segmentation faults when streaming responses from a fine-tuned Llama 3.1 70B GGUF model. The error occurs in GGML's matrix multiplication validation. Sometimes it reports the GGML assertion shown below, but most of the time it just prints `Segmentation fault (core dumped)` and my pipeline crashes.
Environment:
- llama_cpp_python version: 0.3.4
- GPU: NVIDIA A40
- Model: Custom fine-tuned Llama 3.1 70B GGUF (originally fine-tuned with Unsloth at 4k context, running at 16k `n_ctx`)
- OS: Ubuntu
- Python version: 3.11
Error Log:

```
llama-cpp-python/llama-cpp-python/vendor/llama.cpp/ggml/src/ggml.c:3513: GGML_ASSERT(a->ne[2] == b->ne[0]) failed
Segmentation fault (core dumped)
```
Reproduction Steps:
- Load the fine-tuned 70B GGUF model:

  ```python
  from llama_cpp import Llama

  llm = Llama(
      model_path="llama3.1_70B_finetuned.Q4_K_M.gguf",
      n_ctx=16384,
      n_gpu_layers=-1,
      logits_all=True,
  )
  ```

- Start a streaming chat completion:

  ```python
  for chunk in llm.create_chat_completion(
      messages=[...],
      stream=True,
      max_tokens=1000,
  ):
      print(chunk)
  ```

- The error occurs randomly during streaming, usually after several successful chunks (a consolidated repro sketch follows below).
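For reference, here is a minimal, self-contained version of what I'm running. The model path and the prompt are placeholders for my actual pipeline, and the `faulthandler` call is only there to get a Python-level traceback when the crash happens, so treat this as a sketch rather than the exact code:

```python
import faulthandler

from llama_cpp import Llama

# Dump a Python traceback if the process dies with a fatal signal
# (helps locate where in the streaming loop the segfault happens).
faulthandler.enable()

llm = Llama(
    model_path="llama3.1_70B_finetuned.Q4_K_M.gguf",  # placeholder path
    n_ctx=16384,
    n_gpu_layers=-1,
    logits_all=True,
)

messages = [
    {"role": "user", "content": "Summarize the attached report."},  # placeholder prompt
]

for chunk in llm.create_chat_completion(
    messages=messages,
    stream=True,
    max_tokens=1000,
):
    # The crash usually happens after several chunks have already streamed.
    print(chunk)
```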
Additional Context:
- The model was fine-tuned using Unsloth with a 4k context length but is loaded here with `n_ctx=16384` (see the metadata check sketch below)
- Converted to GGUF using `llama.cpp`'s convert script
- Works fine for non-streaming inference
- The error appears more frequently with longer contexts (>8k tokens)
- Memory usage appears normal before the crash (~80GB GPU memory for 70B Q4_K_M)
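To check whether the 4k-vs-16k mismatch is even visible to llama.cpp, I put together this probe that reads the trained context length from the GGUF metadata. It assumes `Llama.metadata` exposes the GGUF key/value pairs as strings (it appears to in the 0.3.x releases), so take it as illustrative rather than authoritative:

```python
from llama_cpp import Llama

MODEL_PATH = "llama3.1_70B_finetuned.Q4_K_M.gguf"  # placeholder path

# Load with a tiny context and no GPU offload just to read the metadata cheaply.
probe = Llama(model_path=MODEL_PATH, n_ctx=512, n_gpu_layers=0, verbose=False)

# "llama.context_length" is the context size recorded in the GGUF header;
# the assumption here is that Llama.metadata surfaces that key as a string.
trained_ctx = probe.metadata.get("llama.context_length")
print("trained/fine-tuned context length:", trained_ctx)
print("n_ctx requested at inference:", 16384)
```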
Debugging Attempts:
- Tried different `n_ctx` values (4096, 8192, 16384)
- Verified model integrity with `llama.cpp`'s main example
- Added thread locking around model access (no effect; a sketch of what I tried is below)
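For completeness, this is roughly what the locking attempt looked like. The lock and wrapper function are my own names and the real pipeline has more going on around them, so this is a sketch of the approach rather than the exact code:

```python
import threading

from llama_cpp import Llama

# Single lock guarding all access to the (non-thread-safe) Llama instance.
_llm_lock = threading.Lock()

llm = Llama(
    model_path="llama3.1_70B_finetuned.Q4_K_M.gguf",  # placeholder path
    n_ctx=16384,
    n_gpu_layers=-1,
)

def stream_completion(messages, max_tokens=1000):
    """Serialize streaming calls so only one thread touches the model at a time."""
    with _llm_lock:
        # The lock is held for the entire stream, since chunks are generated lazily.
        for chunk in llm.create_chat_completion(
            messages=messages,
            stream=True,
            max_tokens=max_tokens,
        ):
            yield chunk
```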
System Info:
- CUDA 12.2
- Python 3.11
Request:
Could you help investigate:
- Potential causes for the GGML tensor dimension mismatch
- Whether this relates to the context length difference between fine-tuning (4k) and inference (16k)
- Any known issues with streaming large (70B) models