Metal: bound temporary buffer cache and prevent runaway memory usage on large softmax/broadcast/matmul workloads #3197
base: main
Conversation
Look @pcuenca, it's the power of open source ✨

This looks good to me. Just want to try it out in various contexts first :)

Indeed, I saw several softmax buffers when checking with Instruments!
let pending_limit = parse_env_mebibytes("CANDLE_METAL_PENDING_LIMIT_MB")
    .or_else(|| system_memory_bytes().map(|mem| (mem / 3).clamp(MIN_PENDING, MAX_PENDING)))
Just verifying that I understand the intention here.
Given that we are between 512MB and 12GB, we set the pending_limit to 1/3 of the available memory?
Yes, we pick pending_limit from CANDLE_METAL_PENDING_LIMIT_MB if set, otherwise as one-third of the available system memory (after OS/GPU reservations) clamped between 512 MiB and 12 GiB, and fall back to 4 GiB if we can’t detect the memory.
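For reference, here is a self-contained sketch of that selection logic. The env var name, the `/ 3` rule, and the 512 MiB / 12 GiB clamp follow the snippet quoted above; the `DEFAULT_PENDING` name is mine, and the system-memory helper is stubbed out instead of doing real platform detection (e.g. `sysctl hw.memsize` on macOS):

```rust
// Illustrative sketch of the pending-limit selection, not the PR's exact code.
const MIB: u64 = 1024 * 1024;
const MIN_PENDING: u64 = 512 * MIB; // 512 MiB lower bound
const MAX_PENDING: u64 = 12 * 1024 * MIB; // 12 GiB upper bound
const DEFAULT_PENDING: u64 = 4 * 1024 * MIB; // 4 GiB fallback (name assumed)

/// Parse an environment variable expressed in mebibytes into bytes.
fn parse_env_mebibytes(name: &str) -> Option<u64> {
    std::env::var(name).ok()?.trim().parse::<u64>().ok().map(|mib| mib * MIB)
}

/// Total system memory in bytes; platform-specific detection is omitted here.
fn system_memory_bytes() -> Option<u64> {
    None
}

fn pending_limit_bytes() -> u64 {
    parse_env_mebibytes("CANDLE_METAL_PENDING_LIMIT_MB")
        .or_else(|| system_memory_bytes().map(|mem| (mem / 3).clamp(MIN_PENDING, MAX_PENDING)))
        .unwrap_or(DEFAULT_PENDING)
}

fn main() {
    println!("pending limit: {} MiB", pending_limit_bytes() / MIB);
}
```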
It might be that

Thanks for catching that. I'll look into removing those
I did a quick test; this is what I saw:
The last spike is the VAE decoding phase; it's normal that memory grows there. Memory is ~stable during the UNet forward steps, although it increases a bit each step, which could point to a memory leak somewhere (including in my test code).
I think we can merge this PR if we can determine that the iOS regression was introduced elsewhere; I'll run some more tests on that.

Description:
On the Metal backend, large transformer/VLM workloads (e.g. Dots‑style OCR with a heavy vision tower and Qwen2‑style text tower) can cause the process RSS on macOS to grow to tens or even hundreds of GiB during a single forward pass, even
though the model’s working set should fit comfortably in memory.
I used Instruments to profile this. Traces on a Dots‑style model show that the bulk of the "resource size" comes from a chain of large tensor ops:

- `candle_nn::ops::softmax` on 3D tensors of shape `[batch * heads, seq_len, total_len]`
- `Tensor::max_keepdim`
- `Tensor::broadcast_sub`
- `Tensor::exp`
- `Tensor::sum_keepdim`
- `Tensor::broadcast_div`
- `Tensor::matmul` and `broadcast_mul` on matching shapes
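For context, here is a minimal sketch (not code from this PR, nor `candle_nn`'s implementation) of that op chain written out with candle tensor ops; every step materializes a full-sized temporary on the Metal backend. The `metal` feature and the tensor sizes are assumptions:

```rust
use candle_core::{D, Device, Result, Tensor};

/// Numerically-stable softmax spelled out op by op; each intermediate is a
/// separate Metal buffer until the next flush lets the pool recycle it.
fn manual_softmax(xs: &Tensor) -> Result<Tensor> {
    let max = xs.max_keepdim(D::Minus1)?;   // temporary #1
    let shifted = xs.broadcast_sub(&max)?;  // temporary #2
    let exp = shifted.exp()?;               // temporary #3
    let sum = exp.sum_keepdim(D::Minus1)?;  // temporary #4
    exp.broadcast_div(&sum)                 // output
}

fn main() -> Result<()> {
    let device = Device::new_metal(0)?;
    // [batch * heads, seq_len, total_len] with illustrative sizes.
    let logits = Tensor::randn(0f32, 1.0, (32, 1024, 1024), &device)?;
    let probs = manual_softmax(&logits)?;
    println!("{:?}", probs.shape());
    Ok(())
}
```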
Each of these ops, on the Metal backend, allocates its output via:

- `MetalStorage::unary_impl` / `MetalStorage::binary` / `MetalStorage::matmul`
- `MetalDevice::new_buffer` → `MetalDevice::allocate_buffer`
- `Device::new_buffer` → `MTLDevice::newBufferWithLength_options`

Because command buffers are only flushed based on a fixed "kernels per command buffer" counter and we don't track allocation volume, long sequences of these ops can allocate many large temporary buffers before any flush happens. Those buffers only become reusable after a distant flush, so peak RSS grows roughly with the sum of all intermediate outputs in a stage rather than with the maximum.
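As a purely illustrative model of why this matters (the names and numbers below are made up, not candle's actual internals): when flushing is driven only by a kernel counter, the bytes pending between flushes are unbounded, so peak memory tracks the sum of the temporaries rather than the largest one.

```rust
// Toy model of counter-only flushing; sizes and the threshold are illustrative.
const KERNELS_PER_COMMAND_BUFFER: usize = 50; // fixed counter, independent of sizes

fn main() {
    // Hypothetical temporary-output sizes (in MiB) for one attention stage.
    let temporaries_mib = [512u64, 512, 512, 512, 512, 512, 512, 512];

    let mut kernels_since_flush = 0usize;
    let mut pending_mib = 0u64;
    let mut peak_pending_mib = 0u64;

    for size in temporaries_mib {
        pending_mib += size; // buffer stays "in flight" until the next flush
        peak_pending_mib = peak_pending_mib.max(pending_mib);
        kernels_since_flush += 1;
        if kernels_since_flush >= KERNELS_PER_COMMAND_BUFFER {
            kernels_since_flush = 0;
            pending_mib = 0; // only now can the pool recycle those buffers
        }
    }
    // With only 8 kernels, no flush happens: peak equals the sum of all temporaries.
    println!("peak pending: {peak_pending_mib} MiB");
}
```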
Minimal example to reproduce this issue
Cargo.toml
main.rs
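The actual `Cargo.toml` and `main.rs` sit in collapsed sections of the original issue. Purely as an illustration of the kind of workload involved, here is a rough sketch of a reproduction; the dependencies (`candle-core` and `candle-nn` with the `metal` feature), device index, loop count, and tensor sizes are assumptions, not the files from this PR:

```rust
use candle_core::{D, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::new_metal(0)?;
    for step in 0..64 {
        // [batch * heads, seq_len, total_len] with illustrative sizes.
        let logits = Tensor::randn(0f32, 1.0, (32, 2048, 2048), &device)?;
        let probs = candle_nn::ops::softmax(&logits, D::Minus1)?;
        // Force the result to materialize so all intermediate buffers are produced.
        let _ = probs.sum_all()?.to_scalar::<f32>()?;
        println!("step {step} done");
    }
    Ok(())
}
```

Watching RSS (e.g. in Activity Monitor or Instruments) while this runs should show memory climbing between flushes rather than staying near the size of a single batch of temporaries.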
What's changed
This PR introduces a simple allocation policy for the Metal backend so that, once a configurable amount of new buffer memory has been allocated, we automatically synchronize and trim the reuse cache, giving the existing Metal buffer pooling a chance to recycle large temporaries and preventing runaway memory growth.
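To make the policy concrete, here is a simplified, hypothetical sketch of the bookkeeping it describes; the struct and method names are illustrative, not candle's actual internals:

```rust
// Track bytes allocated since the last synchronization and signal when the
// caller should synchronize the command queue and trim the buffer reuse cache.
struct AllocationPolicy {
    pending_bytes: u64, // bytes allocated since the last flush
    pending_limit: u64, // e.g. from CANDLE_METAL_PENDING_LIMIT_MB or memory / 3
}

impl AllocationPolicy {
    fn new(pending_limit: u64) -> Self {
        Self { pending_bytes: 0, pending_limit }
    }

    /// Record a new buffer allocation; returns true when the configured amount
    /// of new buffer memory has been exceeded and a sync-and-trim is due.
    fn record_allocation(&mut self, size_in_bytes: u64) -> bool {
        self.pending_bytes += size_in_bytes;
        if self.pending_bytes >= self.pending_limit {
            self.pending_bytes = 0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut policy = AllocationPolicy::new(512 * 1024 * 1024); // 512 MiB for the demo
    for i in 0..8 {
        let must_flush = policy.record_allocation(128 * 1024 * 1024); // 128 MiB buffer
        println!("alloc {i}: flush-and-trim = {must_flush}");
    }
}
```

Bounding by allocated bytes rather than by kernel count keeps the pending temporary memory near `pending_limit` regardless of how many ops fit into one command buffer.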