System Info
- transformers version: 5.3.0
- Platform: Linux-6.19.0-9-generic-x86_64-with-glibc2.42
- Python version: 3.13.12
- Huggingface_hub version: 1.7.1
- Safetensors version: 0.7.0
- Accelerate version: 1.13.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.11.0a0+rocm7.11.0a20260106 (CUDA)
- Using distributed or parallel set-up in script?: No
- Using GPU in script?: Yes
- GPU type: Radeon 8060S Graphics
The Strix Halo APU has 128 GB of unified memory. I've configured 125 GB of TTM memory and a 1 GB UMA buffer.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When I run

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("Qwen/Qwen3.5-35B-A3B")
```

it errors like:
```
...
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/models/auto/auto_factory.py", line 374, in from_pretrained
    return model_class.from_pretrained(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/modeling_utils.py", line 4137, in from_pretrained
    loading_info, disk_offload_index = cls._load_pretrained_model(model, state_dict, checkpoint_files, load_config)
                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/modeling_utils.py", line 4256, in _load_pretrained_model
    loading_info, disk_offload_index = convert_and_load_state_dict_in_model(
                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        model=model,
        ^^^^^^^^^^^^
    ...<3 lines>...
        disk_offload_index=disk_offload_index,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 1212, in convert_and_load_state_dict_in_model
    realized_value = mapping.convert(
        first_param_name,
    ...<3 lines>...
        loading_info=loading_info,
    )
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 678, in convert
    collected_tensors = self.materialize_tensors()
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 657, in materialize_tensors
    tensors = [func() for func in tensors]
              ~~~~^^
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 800, in _job
    return _materialize_copy(tensor, device, dtype)
  File "/home/wd/venv_torch/lib/python3.13/site-packages/transformers/core_model_loading.py", line 789, in _materialize_copy
    tensor = tensor.to(device=device, dtype=dtype)
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 2.00 MiB. GPU 0 has a total capacity of 125.00 GiB of which 24.00 KiB is free. Of the allocated memory 58.06 GiB is allocated by PyTorch, and 4.94 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
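As a side note, the error message itself suggests `PYTORCH_ALLOC_CONF=expandable_segments:True` for the 4.94 GiB of reserved-but-unallocated memory. That won't address the mmap behavior described below, but for completeness, a sketch of setting it (the allocator reads the variable at initialization, so it must be set before `torch` is imported or at least before the first GPU allocation):

```python
import os

# Set before `import torch` so the allocator picks it up at init time.
# This only mitigates fragmentation; it does not change mmap behavior.
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"

# import torch  # import torch only after the variable is set
```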
Expected behavior
The Qwen3.5 35B fp16 model takes about 67 GB of memory. With 125 GB of TTM memory it should not OOM.
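For reference, my own back-of-envelope arithmetic (not taken from the checkpoint itself): 35B parameters at 2 bytes each in fp16 come to roughly 65 GiB for the weights alone, which is consistent with the ~67 GB figure:

```python
# fp16 stores 2 bytes per parameter. "35B" is the nominal count; the
# real checkpoint also carries embeddings and buffers, so expect the
# on-disk/in-memory size to be somewhat larger than this lower bound.
params = 35e9
size_gib = params * 2 / 2**30
print(f"~{size_gib:.1f} GiB")  # roughly 65 GiB for weights alone
```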
From what I know, safetensors loads checkpoints via mmap, and mmap currently does not work well on Strix Halo. A proper fix would land in the amdgpu driver. For now I use the following trick to work around the issue:
```python
import torch
from transformers import AutoModel


def patch_transformers_disable_mmap():
    import transformers.core_model_loading as _cml

    def _materialize_copy_no_mmap(t, device=None, dtype=None):
        # The indexing materializes the tensor from safetensors.
        t = t[...]
        # If safetensors returned an mmapped tensor on CPU,
        # force a copy on CPU before any device/dtype conversion.
        if isinstance(t, torch.Tensor) and t.device.type == "cpu":
            t = t.to(device="cpu", copy=True)
        if dtype is not None or device is not None:
            t = t.to(device=device, dtype=dtype)
        return t

    assert hasattr(_cml, "_materialize_copy")
    _cml._materialize_copy = _materialize_copy_no_mmap
    print("Patched transformers to disable mmap.")


def main():
    patch_transformers_disable_mmap()
    model = AutoModel.from_pretrained("Qwen/Qwen3.5-35B-A3B")
    ...
```

It's inspired by how ComfyUI disables mmap:
https://github.com/Comfy-Org/ComfyUI/blob/593be209a45a8a306c26de550e240a363de405a7/comfy/utils.py#L137
and disabling mmap is known to improve model loading speed in ComfyUI on Strix Halo.
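To illustrate the general idea the patch relies on, here is a toy stdlib-only sketch (no torch or safetensors involved): an mmap-backed view is paged in lazily from the file, while an explicit copy eagerly materializes the data into anonymous memory and detaches it from the file. The patch does the tensor-level analogue with `t.to(device="cpu", copy=True)` before any device transfer.

```python
import mmap
import os
import tempfile

# Write a small file to stand in for checkpoint bytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x01\x02\x03\x04" * 1024)
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)  # lazily paged in from the file
    copy = bytes(view)     # eagerly copied into anonymous memory
    assert copy == mm[:]
    view.release()
    mm.close()

# After the mapping is closed, `copy` stays valid because it no longer
# references the file; a live view into `mm` would not.
print(len(copy))  # 4096

os.remove(path)
```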