
llama_cpp_python server > 0.2.79 breaks the vulkan image #742

Description

@lstocchi

When you build the Vulkan image using llama_cpp_python 0.2.79, it is actually able to detect and use the GPU, because in the logs you can find:

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M2 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =  0.30 MiB
warning: failed to mlock 73732096-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:    CPU buffer size =  70.31 MiB
llm_load_tensors:  Vulkan0 buffer size = 4095.05 MiB
.................................................................................................
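
For reference, a minimal way to reproduce that load path outside the server (a sketch; the model path is a hypothetical stand-in for whatever model the image ships) is the high-level API with full offload and verbose logging:

from llama_cpp import Llama

# Hypothetical GGUF path; substitute the model baked into the image.
llm = Llama(
    model_path="/models/model.gguf",
    n_gpu_layers=-1,  # ask llama.cpp to offload every layer to the GPU
    verbose=True,     # emit the ggml_vulkan / llm_load_tensors lines above
)
print(llm("Hello", max_tokens=8))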

However, starting from 0.2.80 something is broken and GPU detection/usage is skipped entirely. In the logs you just find:

...
llm_load_tensors:    CPU buffer size = 4165.37 MiB
...
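
A useful check inside the broken image (a sketch; llama_supports_gpu_offload is the low-level binding exposed by recent llama_cpp_python releases) is to ask the compiled library directly whether GPU offload was built in at all, which would separate a wheel compiled without the Vulkan backend from a runtime detection failure:

import llama_cpp

# Report the installed binding version and whether the bundled
# libllama was compiled with any GPU offload backend.
print(llama_cpp.__version__)
print(llama_cpp.llama_supports_gpu_offload())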

I also tested the latest version, 0.2.87, and it is still broken. We are currently on 0.2.85 -> https://github.com/containers/ai-lab-recipes/blob/main/model_servers/llamacpp_python/src/requirements.txt#L1
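
As a stopgap until the regression is tracked down, one option is to pin that requirements file back to the last known-good release (the exact requirement line below is an assumption; adjust it to match the extras the file actually uses):

llama-cpp-python[server]==0.2.79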
