Prerequisites
Feature Description
Add support for VK_EXT_memory_priority in the Vulkan backend to assign high priority (1.0f) to model allocations such as model itself, its kv cache. This tells the driver to prefer keeping model weights resident in device-local VRAM under memory pressure.
Current behavior:
All Vulkan allocations (weights, KV cache) initially reside in VRAM. When memory pressure occurs and llama.cpp is idle, the driver may evict some of these allocations (weights and/or KV cache) to GTT.
Expected behavior:
- Model allocations get
priority=1.0f -> they are preferred to stay in VRAM
- Desktop apps / other processes with default priority (0.5f when unspecified) are more likely to be evicted first under pressure.
Control
- Implement support to control this through flags and/or environment variables separately for both model itself and kv cache if possible
References:
Motivation
Universal problem across GPU sizes: When llama.cpp runs alongside desktop environments or other apps:
- Model loads -> weights fully resident in VRAM
- Idle period -> switch to desktop/browser
- Memory pressure -> compositor requests VRAM
- Driver evicts -> driver sees llama.cpp buffers as "idle", moves allocations to GTT
- Resume generation -> allocations reload from host, but VRAM still partially occupied -> leading to more data offloaded to GTT and slower generation
This especially affects people with low vram available
VK_EXT_memory_priority is designed exactly for this situation: with priority set to 1.0f for critical allocations, the driver can keep model allocations in VRAM and prefer evicting lower‑priority data (which defaults to 0.5f when not specified), reducing eviction‑induced stalls and performance drops.
Possible Implementation
No response
Prerequisites
Feature Description
Add support for
VK_EXT_memory_priorityin the Vulkan backend to assign high priority (1.0f) to model allocations such as model itself, its kv cache. This tells the driver to prefer keeping model weights resident in device-local VRAM under memory pressure.Current behavior:
All Vulkan allocations (weights, KV cache) initially reside in VRAM. When memory pressure occurs and llama.cpp is idle, the driver may evict some of these allocations (weights and/or KV cache) to GTT.
Expected behavior:
priority=1.0f-> they are preferred to stay in VRAMControl
References:
Motivation
Universal problem across GPU sizes: When llama.cpp runs alongside desktop environments or other apps:
This especially affects people with low vram available
VK_EXT_memory_priorityis designed exactly for this situation: with priority set to 1.0f for critical allocations, the driver can keep model allocations in VRAM and prefer evicting lower‑priority data (which defaults to 0.5f when not specified), reducing eviction‑induced stalls and performance drops.Possible Implementation
No response