Skip to content

Conversation

@TimmyOVO
Copy link
Contributor

Description:

On the Metal backend, large transformer/VLM workloads (e.g. Dots‑style OCR with a heavy vision tower and Qwen2‑style text tower) can cause the process RSS on macOS to grow to tens or even hundreds of GiB during a single forward pass, even
though the model’s working set should fit comfortably in memory.

I use Instruments traces on a Dots‑style model show that the bulk of the “resource size” comes from a chain of large tensor ops:

  • candle_nn::ops::softmax on 3D tensors [batch * heads, seq_len, total_len]
  • the underlying elementwise ops:
    • Tensor::max_keepdim
    • Tensor::broadcast_sub
    • Tensor::exp
    • Tensor::sum_keepdim
    • Tensor::broadcast_div
  • plus large Tensor::matmul and broadcast_mul on matching shapes
Click to expand raw data
Resource Size	Self Resource Size	Symbol Names
390.07 GiB  66.5%	0 Bytes	                                       _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h56c4e8aa02cfcf3b
250.01 GiB  42.6%	0 Bytes	                                        candle_nn::ops::softmax::hf1a7e8c20702ed6b
84.00 GiB  14.3%	0 Bytes	                                         candle_core::tensor::Tensor::broadcast_div::h1f0e384f7eef2e07
84.00 GiB  14.3%	0 Bytes	                                          candle_core::tensor::Tensor::div::h9a2bcb60b7c25bce
84.00 GiB  14.3%	0 Bytes	                                           candle_core::storage::Storage::binary_impl::hfda09b175c7c7058
84.00 GiB  14.3%	0 Bytes	                                            _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h35f1c58c739baa69
84.00 GiB  14.3%	0 Bytes	                                             candle_core::metal_backend::MetalStorage::binary::h8a4fc6643c45827f
84.00 GiB  14.3%	0 Bytes	                                              _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
84.00 GiB  14.3%	0 Bytes	                                               candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
84.00 GiB  14.3%	0 Bytes	                                                candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
84.00 GiB  14.3%	0 Bytes	                                                 objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
84.00 GiB   0.0%	0 Bytes	                                                  _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
84.00 GiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
84.00 GiB   0.0%	0 Bytes	                                                    objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
84.00 GiB   0.0%	0 Bytes	                                                     _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
84.00 GiB   0.0%	0 Bytes	                                                      -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                        -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
84.00 GiB   0.0%	0 Bytes	                                                          -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
84.00 GiB   0.0%	84.00 GiB	                                                           __kdebug_trace64
84.00 GiB  14.3%	0 Bytes	                                         candle_core::tensor::Tensor::exp::hadf5e168faedee1f
84.00 GiB  14.3%	0 Bytes	                                          candle_core::storage::Storage::unary_impl::h736da20777c1466c
84.00 GiB  14.3%	0 Bytes	                                           _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::unary_impl::he570d8aad3e1c738
84.00 GiB  14.3%	0 Bytes	                                            _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
84.00 GiB  14.3%	0 Bytes	                                             candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
84.00 GiB  14.3%	0 Bytes	                                              candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
84.00 GiB  14.3%	0 Bytes	                                               objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
84.00 GiB   0.0%	0 Bytes	                                                _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
84.00 GiB   0.0%	0 Bytes	                                                 objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
84.00 GiB   0.0%	0 Bytes	                                                  objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
84.00 GiB   0.0%	0 Bytes	                                                   _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
84.00 GiB   0.0%	0 Bytes	                                                    -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                     -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                      -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
84.00 GiB   0.0%	0 Bytes	                                                       -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
84.00 GiB   0.0%	0 Bytes	                                                        -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
84.00 GiB   0.0%	84.00 GiB	                                                         __kdebug_trace64
82.00 GiB  14.0%	0 Bytes	                                         candle_core::tensor::Tensor::broadcast_sub::heb9fb3e79a80bbd1
82.00 GiB   0.0%	0 Bytes	                                          candle_core::tensor::Tensor::sub::h571175f735b5dfbb
82.00 GiB   0.0%	0 Bytes	                                           candle_core::storage::Storage::binary_impl::h5d345597b9bcf69f
82.00 GiB   0.0%	0 Bytes	                                            _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h35f1c58c739baa69
82.00 GiB   0.0%	0 Bytes	                                             candle_core::metal_backend::MetalStorage::binary::h8a4fc6643c45827f
82.00 GiB   0.0%	0 Bytes	                                              _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
82.00 GiB   0.0%	0 Bytes	                                               candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
82.00 GiB   0.0%	0 Bytes	                                                candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
82.00 GiB   0.0%	0 Bytes	                                                 objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
82.00 GiB   0.0%	0 Bytes	                                                  _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
82.00 GiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
82.00 GiB   0.0%	0 Bytes	                                                    objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
82.00 GiB   0.0%	0 Bytes	                                                     _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
82.00 GiB   0.0%	0 Bytes	                                                      -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
82.00 GiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
82.00 GiB   0.0%	0 Bytes	                                                        -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
82.00 GiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
82.00 GiB   0.0%	0 Bytes	                                                          -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
82.00 GiB   0.0%	82.00 GiB	                                                           __kdebug_trace64
7.50 MiB   0.0%	0 Bytes	                                         candle_core::tensor::Tensor::sum_keepdim::h648a11aa78a7272d
7.50 MiB   0.0%	0 Bytes	                                          candle_core::tensor::Tensor::sum_impl::h56c45f280ce49025
7.50 MiB   0.0%	0 Bytes	                                           candle_core::storage::Storage::reduce_op::ha2d9bc02bdba85cb
7.50 MiB   0.0%	0 Bytes	                                            _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::reduce_op::h6c1fa34e549af913
7.50 MiB   0.0%	0 Bytes	                                             _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
7.50 MiB   0.0%	0 Bytes	                                              candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
7.50 MiB   0.0%	0 Bytes	                                               candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
7.50 MiB   0.0%	0 Bytes	                                                objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
7.50 MiB   0.0%	0 Bytes	                                                 _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
7.50 MiB   0.0%	0 Bytes	                                                  objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
7.50 MiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
7.50 MiB   0.0%	0 Bytes	                                                    _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
7.50 MiB   0.0%	0 Bytes	                                                     -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
7.50 MiB   0.0%	0 Bytes	                                                      -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
7.50 MiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
7.50 MiB   0.0%	0 Bytes	                                                        -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
7.50 MiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
7.50 MiB   0.0%	7.50 MiB	                                                          __kdebug_trace64
1.50 MiB   0.0%	0 Bytes	                                         candle_core::tensor::Tensor::max_keepdim::ha1543e08396d8976
1.50 MiB   0.0%	0 Bytes	                                          candle_core::tensor::Tensor::reduce_impl::h9b9a30dfacd26553
1.50 MiB   0.0%	0 Bytes	                                           candle_core::storage::Storage::reduce_op::ha2d9bc02bdba85cb
1.50 MiB   0.0%	0 Bytes	                                            _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::reduce_op::h6c1fa34e549af913
1.50 MiB   0.0%	0 Bytes	                                             _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
1.50 MiB   0.0%	0 Bytes	                                              candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
1.50 MiB   0.0%	0 Bytes	                                               candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
1.50 MiB   0.0%	0 Bytes	                                                objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
1.50 MiB   0.0%	0 Bytes	                                                 _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
1.50 MiB   0.0%	0 Bytes	                                                  objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
1.50 MiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
1.50 MiB   0.0%	0 Bytes	                                                    _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
1.50 MiB   0.0%	0 Bytes	                                                     -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
1.50 MiB   0.0%	0 Bytes	                                                      -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
1.50 MiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
1.50 MiB   0.0%	0 Bytes	                                                        -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
1.50 MiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
1.50 MiB   0.0%	1.50 MiB	                                                          __kdebug_trace64
76.00 GiB  13.0%	0 Bytes	                                        candle_core::tensor::Tensor::broadcast_mul::h87612ac4e7a9271d
76.00 GiB   0.0%	0 Bytes	                                         candle_core::tensor::Tensor::mul::h0369c368b543f4c8
76.00 GiB   0.0%	0 Bytes	                                          candle_core::storage::Storage::binary_impl::h3594d7ee262558eb
76.00 GiB   0.0%	0 Bytes	                                           _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h35f1c58c739baa69
76.00 GiB   0.0%	0 Bytes	                                            candle_core::metal_backend::MetalStorage::binary::h8a4fc6643c45827f
76.00 GiB   0.0%	0 Bytes	                                             _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
76.00 GiB   0.0%	0 Bytes	                                              candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
76.00 GiB   0.0%	0 Bytes	                                               candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
76.00 GiB   0.0%	0 Bytes	                                                objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
76.00 GiB   0.0%	0 Bytes	                                                 _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
76.00 GiB   0.0%	0 Bytes	                                                  objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
76.00 GiB   0.0%	0 Bytes	                                                   objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
76.00 GiB   0.0%	0 Bytes	                                                    _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
76.00 GiB   0.0%	0 Bytes	                                                     -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
76.00 GiB   0.0%	0 Bytes	                                                      -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
76.00 GiB   0.0%	0 Bytes	                                                       -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
76.00 GiB   0.0%	0 Bytes	                                                        -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
76.00 GiB   0.0%	0 Bytes	                                                         -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
76.00 GiB   0.0%	76.00 GiB	                                                          __kdebug_trace64
64.00 GiB  10.9%	0 Bytes	                                        candle_core::tensor::Tensor::matmul::h78ebc324e001cd34
64.00 GiB   0.0%	0 Bytes	                                         candle_core::storage::Storage::matmul::h63e24b756488a23b
64.00 GiB   0.0%	0 Bytes	                                          _$LT$candle_core..metal_backend..MetalStorage$u20$as$u20$candle_core..backend..BackendStorage$GT$::matmul::hcdf3ad195cd136a7
64.00 GiB   0.0%	0 Bytes	                                           _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
64.00 GiB   0.0%	0 Bytes	                                            candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
64.00 GiB   0.0%	0 Bytes	                                             candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
64.00 GiB   0.0%	0 Bytes	                                              objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
64.00 GiB   0.0%	0 Bytes	                                               _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
64.00 GiB   0.0%	0 Bytes	                                                objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
64.00 GiB   0.0%	0 Bytes	                                                 objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
64.00 GiB   0.0%	0 Bytes	                                                  _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
64.00 GiB   0.0%	0 Bytes	                                                   -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
64.00 GiB   0.0%	0 Bytes	                                                    -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 GiB   0.0%	0 Bytes	                                                     -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 GiB   0.0%	0 Bytes	                                                      -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
64.00 GiB   0.0%	0 Bytes	                                                       -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
64.00 GiB   0.0%	64.00 GiB	                                                        __kdebug_trace64
64.00 MiB   0.0%	0 Bytes	                                        candle_core::tensor::Tensor::reshape::h6f8d982e8766aede
64.00 MiB   0.0%	0 Bytes	                                         candle_core::device::Device::alloc_uninit::h6c7d3c31740be006
64.00 MiB   0.0%	0 Bytes	                                          candle_core::metal_backend::_$LT$impl$u20$candle_core..backend..BackendDevice$u20$for$u20$candle_core..metal_backend..device..MetalDevice$GT$::alloc_uninit::h3925f8c3eca38242
64.00 MiB   0.0%	0 Bytes	                                           _$LT$core..result..Result$LT$T$C$E$GT$$u20$as$u20$core..ops..try_trait..Try$GT$::branch::h13fc16f7d7a15a9a
64.00 MiB   0.0%	0 Bytes	                                            candle_core::metal_backend::device::MetalDevice::allocate_buffer::h87e110be93f79a4d
64.00 MiB   0.0%	0 Bytes	                                             candle_metal_kernels::metal::device::Device::new_buffer::hdff57d7c15bf891f
64.00 MiB   0.0%	0 Bytes	                                              objc2_metal::generated::__MTLDevice::MTLDevice::newBufferWithLength_options::hc27f4530b25513c7
64.00 MiB   0.0%	0 Bytes	                                               _$LT$MethodFamily$u20$as$u20$objc2..__macro_helpers..msg_send_retained..MsgSend$LT$Receiver$C$Return$GT$$GT$::send_message::h2d47315393e8e979
64.00 MiB   0.0%	0 Bytes	                                                objc2::runtime::message_receiver::MessageReceiver::send_message::h81dfd419189b6a37
64.00 MiB   0.0%	0 Bytes	                                                 objc2::runtime::message_receiver::msg_send_primitive::send::hc0961fbfcac6b629
64.00 MiB   0.0%	0 Bytes	                                                  _$LT$$LP$A$C$B$RP$$u20$as$u20$objc2..encode..EncodeArguments$GT$::__invoke::hd2941f8e31fd9e9b
64.00 MiB   0.0%	0 Bytes	                                                   -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
64.00 MiB   0.0%	0 Bytes	                                                    -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 MiB   0.0%	0 Bytes	                                                     -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
64.00 MiB   0.0%	0 Bytes	                                                      -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
64.00 MiB   0.0%	0 Bytes	                                                       -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:placementSparsePageSize:placementSparseResidencyBytes:args:argsSize:deallocator:]
64.00 MiB   0.0%	64.00 MiB	                                                        __kdebug_trace64

Each of these ops, on the Metal backend, allocates its output via:

  • MetalStorage::unary_impl / MetalStorage::binary / MetalStorage::matmul
  • MetalDevice::new_bufferMetalDevice::allocate_buffer
  • Device::new_bufferMTLDevice::newBufferWithLength_options

Because command buffers are only flushed based on a fixed “kernels per command buffer” counter and we don’t track allocation volume, long sequences of these ops can allocate many large temporary buffers before any flush happens. Those buffers only become reusable after a distant flush, so peak RSS grows roughly with the sum of all intermediate outputs in a stage rather than the maximum.

Minimum exmaples to reproduce this issue

Cargo.toml
[dependencies]
candle-core = { git = "https://github.com/huggingface/candle", default-features = false,features = ["accelerate","metal"] }
candle-nn = { git = "https://github.com/huggingface/candle", default-features = false,features = ["accelerate","metal"] }
main.rs
use candle_core::{DType, Device, Result, Tensor, D};

fn main() -> Result<()> {
    let dev = Device::new_metal(0)?;
    let dtype = DType::BF16;

    let batch = 1usize;
    let num_heads = 12usize;
    let head_dim = 1536 / num_heads;
    let seq_len = 8192usize;
    let total_len = seq_len;

    let bh = batch * num_heads;

    let q_flat = Tensor::randn(0f32, 1f32, (bh, seq_len, head_dim), &dev)?.to_dtype(dtype)?;
    let k_tiled = Tensor::randn(0f32, 1f32, (bh, head_dim, total_len), &dev)?.to_dtype(dtype)?;
    let v_flat = Tensor::randn(0f32, 1f32, (bh, total_len, head_dim), &dev)?.to_dtype(dtype)?;

    let use_mask = true;
    let mask = if use_mask {
        Some(Tensor::randn(0f32, 1f32, (batch, 1, seq_len, total_len), &dev)?.to_dtype(dtype)?)
    } else {
        None
    };

    for _layer in 0..32 {
        let mut scores = q_flat.matmul(&k_tiled)?;
        let scale = 1.0f64 / (head_dim as f64).sqrt();
        let scale_tensor = Tensor::full(scale as f32, (), scores.device())?.to_dtype(dtype)?;
        scores = scores.broadcast_mul(&scale_tensor)?;
        if let Some(mask) = &mask {
            let expanded = mask
                .expand((batch, num_heads, seq_len, total_len))?
                .reshape((bh, seq_len, total_len))?;
            scores = scores.add(&expanded)?;
        }

        let probs = candle_nn::ops::softmax(&scores, D::Minus1)?;
        let _ctx = probs.matmul(&v_flat)?;
    }

    dev.synchronize()?;
    Ok(())
}

Whats changed

This PR introduces a simple allocation policy for the Metal backend so that, once a configurable amount of new buffer memory has been allocated, we automatically synchronize and trim the reuse cache, giving the existing Metal buffer pooling a chance to recycle large temporaries and preventing runaway memory growth.

@ivarflakstad
Copy link
Member

Look @pcuenca, it's the power of open source ✨
(Ref pointing out the memory spikes just the other day☺️)

@ivarflakstad
Copy link
Member

This looks good to me. Just want to try it out in various contexts first :)

@pcuenca
Copy link
Member

pcuenca commented Nov 18, 2025

Indeed, I saw several softmax buffers when checking with Instruments!

}

let pending_limit = parse_env_mebibytes("CANDLE_METAL_PENDING_LIMIT_MB")
.or_else(|| system_memory_bytes().map(|mem| (mem / 3).clamp(MIN_PENDING, MAX_PENDING)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just verifying that I understand the intention here.
Given that we are between 512MB and 12GB, we set the pending_limit to 1/3 of the available memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we pick pending_limit from CANDLE_METAL_PENDING_LIMIT_MB if set, otherwise as one-third of the available system memory (after OS/GPU reservations) clamped between 512 MiB and 12 GiB, and fall back to 4 GiB if we can’t detect the memory.

@ivarflakstad
Copy link
Member

It might be that sysctl_usize behaves correctly in all cases, if you want to get rid of the u64 -> usize conversions. Worth trying.
Btw, I can't guarantee the code I provided via the review is completely correct. It's all freehand :)

@TimmyOVO
Copy link
Contributor Author

It might be that sysctl_usize behaves correctly in all cases, if you want to get rid of the u64 -> usize conversions. Worth trying. Btw, I can't guarantee the code I provided via the review is completely correct. It's all freehand :)

Thanks for catching that. I'll look into removing those u64 to usize conversions first thing tomorrow. It's 8 AM here and I need to get some sleep, so I'd rather tackle this with a fresh mind.

@pcuenca
Copy link
Member

pcuenca commented Nov 21, 2025

I did a quick test, this is what I saw:

  • The memory profile is much better behaved on macOS. This is what it looks like running Stable Diffusion:
Screenshot 2025-11-21 at 19 18 16

The last spike is the VAE decoding phase; it's normal that memory grows. Memory is ~stable during the UNet forward steps. However, it increases a bit each step. This could point to some memory leak somewhere (including my test code).

I think we can merge this PR if we can determine that the iOS regression was introduced elsewhere; I'll run some more tests about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants