Metal: bound temporary buffer cache and prevent runaway memory usage on large softmax/broadcast/matmul workloads #3197
base: main
Conversation
Look @pcuenca, it's the power of open source ✨

This looks good to me. Just want to try it out in various contexts first :)

Indeed, I saw several softmax buffers when checking with Instruments!
let pending_limit = parse_env_mebibytes("CANDLE_METAL_PENDING_LIMIT_MB")
    .or_else(|| system_memory_bytes().map(|mem| (mem / 3).clamp(MIN_PENDING, MAX_PENDING)))
Just verifying that I understand the intention here.
Given that we are between 512MB and 12GB, we set the pending_limit to 1/3 of the available memory?
Yes, we pick pending_limit from CANDLE_METAL_PENDING_LIMIT_MB if set, otherwise as one-third of the available system memory (after OS/GPU reservations) clamped between 512 MiB and 12 GiB, and fall back to 4 GiB if we can’t detect the memory.
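For reference, here is a self-contained sketch of that selection logic. The env var name, the `/ 3` rule, and the 512 MiB / 12 GiB clamp follow the snippet quoted above; the `DEFAULT_PENDING` name is mine, and the system-memory helper is stubbed out instead of doing real platform detection (e.g. `sysctl hw.memsize` on macOS):

```rust
// Illustrative sketch of the pending-limit selection, not the PR's exact code.
const MIB: u64 = 1024 * 1024;
const MIN_PENDING: u64 = 512 * MIB; // 512 MiB lower bound
const MAX_PENDING: u64 = 12 * 1024 * MIB; // 12 GiB upper bound
const DEFAULT_PENDING: u64 = 4 * 1024 * MIB; // 4 GiB fallback (name assumed)

/// Parse an environment variable expressed in mebibytes into bytes.
fn parse_env_mebibytes(name: &str) -> Option<u64> {
    std::env::var(name).ok()?.trim().parse::<u64>().ok().map(|mib| mib * MIB)
}

/// Total system memory in bytes; platform-specific detection is omitted here.
fn system_memory_bytes() -> Option<u64> {
    None
}

fn pending_limit_bytes() -> u64 {
    parse_env_mebibytes("CANDLE_METAL_PENDING_LIMIT_MB")
        .or_else(|| system_memory_bytes().map(|mem| (mem / 3).clamp(MIN_PENDING, MAX_PENDING)))
        .unwrap_or(DEFAULT_PENDING)
}

fn main() {
    println!("pending limit: {} MiB", pending_limit_bytes() / MIB);
}
```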
It might be that

Thanks for catching that. I'll look into removing those
I did a quick test; this is what I saw:
The last spike is the VAE decoding phase; it's normal that memory grows there. Memory is ~stable during the UNet forward steps, although it increases a bit each step, which could point to a memory leak somewhere (including in my test code).
I think we can merge this PR if we can determine that the iOS regression was introduced elsewhere; I'll run some more tests on that.

Description:
On the Metal backend, large transformer/VLM workloads (e.g. Dots‑style OCR with a heavy vision tower and Qwen2‑style text tower) can cause the process RSS on macOS to grow to tens or even hundreds of GiB during a single forward pass, even
though the model’s working set should fit comfortably in memory.
I used Instruments to profile this. Traces on a Dots‑style model show that the bulk of the "resource size" comes from a chain of large tensor ops:

- `candle_nn::ops::softmax` on 3D tensors of shape `[batch * heads, seq_len, total_len]`
- `Tensor::max_keepdim`
- `Tensor::broadcast_sub`
- `Tensor::exp`
- `Tensor::sum_keepdim`
- `Tensor::broadcast_div`
- `Tensor::matmul` and `broadcast_mul` on matching shapes
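For context, here is a minimal sketch (not code from this PR, nor `candle_nn`'s implementation) of that op chain written out with candle tensor ops; every step materializes a full-sized temporary on the Metal backend. The `metal` feature and the tensor sizes are assumptions:

```rust
use candle_core::{D, Device, Result, Tensor};

/// Numerically-stable softmax spelled out op by op; each intermediate is a
/// separate Metal buffer until the next flush lets the pool recycle it.
fn manual_softmax(xs: &Tensor) -> Result<Tensor> {
    let max = xs.max_keepdim(D::Minus1)?;   // temporary #1
    let shifted = xs.broadcast_sub(&max)?;  // temporary #2
    let exp = shifted.exp()?;               // temporary #3
    let sum = exp.sum_keepdim(D::Minus1)?;  // temporary #4
    exp.broadcast_div(&sum)                 // output
}

fn main() -> Result<()> {
    let device = Device::new_metal(0)?;
    // [batch * heads, seq_len, total_len] with illustrative sizes.
    let logits = Tensor::randn(0f32, 1.0, (32, 1024, 1024), &device)?;
    let probs = manual_softmax(&logits)?;
    println!("{:?}", probs.shape());
    Ok(())
}
```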
Each of these ops, on the Metal backend, allocates its output via:

- `MetalStorage::unary_impl` / `MetalStorage::binary` / `MetalStorage::matmul`
- `MetalDevice::new_buffer` → `MetalDevice::allocate_buffer`
- `Device::new_buffer` → `MTLDevice::newBufferWithLength_options`

Because command buffers are only flushed based on a fixed "kernels per command buffer" counter and we don't track allocation volume, long sequences of these ops can allocate many large temporary buffers before any flush happens. Those buffers only become reusable after a distant flush, so peak RSS grows roughly with the sum of all intermediate outputs in a stage rather than with the maximum.
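As a purely illustrative model of why this matters (the names and numbers below are made up, not candle's actual internals): when flushing is driven only by a kernel counter, the bytes pending between flushes are unbounded, so peak memory tracks the sum of the temporaries rather than the largest one.

```rust
// Toy model of counter-only flushing; sizes and the threshold are illustrative.
const KERNELS_PER_COMMAND_BUFFER: usize = 50; // fixed counter, independent of sizes

fn main() {
    // Hypothetical temporary-output sizes (in MiB) for one attention stage.
    let temporaries_mib = [512u64, 512, 512, 512, 512, 512, 512, 512];

    let mut kernels_since_flush = 0usize;
    let mut pending_mib = 0u64;
    let mut peak_pending_mib = 0u64;

    for size in temporaries_mib {
        pending_mib += size; // buffer stays "in flight" until the next flush
        peak_pending_mib = peak_pending_mib.max(pending_mib);
        kernels_since_flush += 1;
        if kernels_since_flush >= KERNELS_PER_COMMAND_BUFFER {
            kernels_since_flush = 0;
            pending_mib = 0; // only now can the pool recycle those buffers
        }
    }
    // With only 8 kernels, no flush happens: peak equals the sum of all temporaries.
    println!("peak pending: {peak_pending_mib} MiB");
}
```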
Minimal example to reproduce this issue
Cargo.toml
main.rs
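The actual `Cargo.toml` and `main.rs` sit in collapsed sections of the original issue. Purely as an illustration of the kind of workload involved, here is a rough sketch of a reproduction; the dependencies (`candle-core` and `candle-nn` with the `metal` feature), device index, loop count, and tensor sizes are assumptions, not the files from this PR:

```rust
use candle_core::{D, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::new_metal(0)?;
    for step in 0..64 {
        // [batch * heads, seq_len, total_len] with illustrative sizes.
        let logits = Tensor::randn(0f32, 1.0, (32, 2048, 2048), &device)?;
        let probs = candle_nn::ops::softmax(&logits, D::Minus1)?;
        // Force the result to materialize so all intermediate buffers are produced.
        let _ = probs.sum_all()?.to_scalar::<f32>()?;
        println!("step {step} done");
    }
    Ok(())
}
```

Watching RSS (e.g. in Activity Monitor or Instruments) while this runs should show memory climbing between flushes rather than staying near the size of a single batch of temporaries.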
What's changed
This PR introduces a simple allocation policy for the Metal backend so that, once a configurable amount of new buffer memory has been allocated, we automatically synchronize and trim the reuse cache, giving the existing Metal buffer pooling a chance to recycle large temporaries and preventing runaway memory growth.
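To make the policy concrete, here is a simplified, hypothetical sketch of the bookkeeping it describes; the struct and method names are illustrative, not candle's actual internals:

```rust
// Track bytes allocated since the last synchronization and signal when the
// caller should synchronize the command queue and trim the buffer reuse cache.
struct AllocationPolicy {
    pending_bytes: u64, // bytes allocated since the last flush
    pending_limit: u64, // e.g. from CANDLE_METAL_PENDING_LIMIT_MB or memory / 3
}

impl AllocationPolicy {
    fn new(pending_limit: u64) -> Self {
        Self { pending_bytes: 0, pending_limit }
    }

    /// Record a new buffer allocation; returns true when the configured amount
    /// of new buffer memory has been exceeded and a sync-and-trim is due.
    fn record_allocation(&mut self, size_in_bytes: u64) -> bool {
        self.pending_bytes += size_in_bytes;
        if self.pending_bytes >= self.pending_limit {
            self.pending_bytes = 0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut policy = AllocationPolicy::new(512 * 1024 * 1024); // 512 MiB for the demo
    for i in 0..8 {
        let must_flush = policy.record_allocation(128 * 1024 * 1024); // 128 MiB buffer
        println!("alloc {i}: flush-and-trim = {must_flush}");
    }
}
```

Bounding by allocated bytes rather than by kernel count keeps the pending temporary memory near `pending_limit` regardless of how many ops fit into one command buffer.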