
llama_cpp_python server > 0.2.79 breaks the vulkan image #742

Description

@lstocchi

When you build the Vulkan image using llama_cpp_python 0.2.79, it is actually able to detect and use the GPU, because in the logs you can find:

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M2 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =  0.30 MiB
warning: failed to mlock 73732096-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:    CPU buffer size =  70.31 MiB
llm_load_tensors:  Vulkan0 buffer size = 4095.05 MiB
.................................................................................................
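
For reference, a minimal way to reproduce that load path outside the server (a sketch; the model path is a hypothetical stand-in for whatever model the image ships) is the high-level API with full offload and verbose logging:

from llama_cpp import Llama

# Hypothetical GGUF path; substitute the model baked into the image.
llm = Llama(
    model_path="/models/model.gguf",
    n_gpu_layers=-1,  # ask llama.cpp to offload every layer to the GPU
    verbose=True,     # emit the ggml_vulkan / llm_load_tensors lines above
)
print(llm("Hello", max_tokens=8))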

However, starting from 0.2.80 something is broken and GPU detection/usage is skipped entirely. In the logs you just find:

...
llm_load_tensors:    CPU buffer size = 4165.37 MiB
...
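
A useful check inside the broken image (a sketch; llama_supports_gpu_offload is the low-level binding exposed by recent llama_cpp_python releases) is to ask the compiled library directly whether GPU offload was built in at all, which would separate a wheel compiled without the Vulkan backend from a runtime detection failure:

import llama_cpp

# Report the installed binding version and whether the bundled
# libllama was compiled with any GPU offload backend.
print(llama_cpp.__version__)
print(llama_cpp.llama_supports_gpu_offload())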

I also tested the latest version, 0.2.87, and it is still broken. We are currently on 0.2.85 -> https://github.com/containers/ai-lab-recipes/blob/main/model_servers/llamacpp_python/src/requirements.txt#L1
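
As a stopgap until the regression is tracked down, one option is to pin that requirements file back to the last known-good release (the exact requirement line below is an assumption; adjust it to match the extras the file actually uses):

llama-cpp-python[server]==0.2.79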
