Failed to save quantized model when some tensors are offloaded to CPU #491

@DKingAlpha

Description

Describe the bug

Saving the quantized model fails when some of its tensors are offloaded to the CPU.

Saving original model config to /workspace/output_models/TensorRT-Model-Optimizer/examples/vlm_ptq/saved_models_GLM-4_5V-nvfp4
Saving processor config to /workspace/output_models/TensorRT-Model-Optimizer/examples/vlm_ptq/saved_models_GLM-4_5V-nvfp4
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 8559.80it/s]
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 4387.35it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 9020.01it/s]
/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py:545: UserWarning: Cannot export model to the model_config. The modelopt-optimized model state_dict can be saved with torch.save for further inspection.
  warnings.warn(
Traceback (most recent call last):
  File "/workspace/output_models/TensorRT-Model-Optimizer/examples/vlm_ptq/../llm_ptq/hf_ptq.py", line 780, in <module>
    main(args)
  File "/workspace/output_models/TensorRT-Model-Optimizer/examples/vlm_ptq/../llm_ptq/hf_ptq.py", line 629, in main
    export_hf_checkpoint(
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 549, in export_hf_checkpoint
    raise e
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 515, in export_hf_checkpoint
    post_state_dict, hf_quant_config = _export_hf_checkpoint(model, dtype)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 465, in _export_hf_checkpoint
    _export_quantized_weight(sub_module, dtype)
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 307, in _export_quantized_weight
    weight_scale = NVFP4QTensor.get_weights_scaling_factor(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/quantization/qtensor/nvfp4_tensor.py", line 84, in get_weights_scaling_factor
    per_block_scale = per_block_amax / (6.0 * weights_scaling_factor_2)
                      ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
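
For reference, the failure reduces to a cross-device elementwise op in PyTorch. A minimal sketch (with hypothetical stand-in tensors, not ModelOpt's actual shapes) reproduces the same RuntimeError and shows the generic fix of co-locating the operands:

```python
import torch

# Hypothetical stand-ins for the tensors in nvfp4_tensor.py: one left on the
# GPU, one offloaded to the CPU. Shapes are made up for illustration.
per_block_amax = torch.rand(16, device="cuda")
weights_scaling_factor_2 = torch.rand(1)  # on CPU, e.g. after offloading

try:
    per_block_scale = per_block_amax / (6.0 * weights_scaling_factor_2)
except RuntimeError as e:
    print(e)  # "Expected all tensors to be on the same device, ..."

# Co-locating the operands avoids the mismatch:
per_block_scale = per_block_amax / (
    6.0 * weights_scaling_factor_2.to(per_block_amax.device)
)
```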

Steps/Code to reproduce bug

  • ./huggingface_example.sh --model zai-org/GLM-4.5V --quant nvfp4
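
For context on why some tensors end up on the CPU at all: GLM-4.5V in bf16 is too large for a single 96 GB GPU, so a `device_map="auto"` load (an assumption about what the example script does under the hood; the actual loader class may differ) lets accelerate spill overflow layers to CPU. A hypothetical sketch of the resulting mixed placement:

```python
import torch
# Hypothetical loader class for illustration; hf_ptq.py may use a different one.
from transformers import AutoModelForImageTextToText

# Assumption: loading with device_map="auto" places layers that do not fit in
# the single 96 GB GPU on the CPU via accelerate.
model = AutoModelForImageTextToText.from_pretrained(
    "zai-org/GLM-4.5V",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# With a model too large for one GPU, parameters now span two devices:
print({p.device for p in model.parameters()})  # e.g. {device('cuda:0'), device('cpu')}
```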

Expected behavior

Export should either handle the CPU-offloaded tensors (e.g., move them to a common device for the scale computation) or find enough GPU memory somewhere so the model can be saved.
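
Until there is a proper fix, one possible workaround is to pull the offloaded parameters back onto the GPU right before `export_hf_checkpoint` is called. This is only a sketch under the assumption that the quantized model fits in GPU memory at export time (and that no accelerate hooks interfere); `move_offloaded_to` is a hypothetical helper, not a ModelOpt API:

```python
import torch
import torch.nn as nn

# Hypothetical helper (not a ModelOpt API): move any CPU-offloaded parameters
# and buffers back onto the GPU so the NVFP4 scale computation sees a single
# device. Only viable if the whole model fits in GPU memory.
def move_offloaded_to(model: nn.Module, device: str = "cuda:0") -> None:
    for module in model.modules():
        for name, param in list(module.named_parameters(recurse=False)):
            if param.device.type == "cpu":
                module._parameters[name] = nn.Parameter(
                    param.data.to(device), requires_grad=param.requires_grad
                )
        for name, buf in list(module.named_buffers(recurse=False)):
            if buf.device.type == "cpu":
                module._buffers[name] = buf.to(device)

# Usage sketch: call move_offloaded_to(model) just before export_hf_checkpoint(...).
```

Alternatively, the fix could live in `get_weights_scaling_factor` itself, e.g. moving `weights_scaling_factor_2` to `per_block_amax.device` before the division, which would make the export path device-agnostic.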

System information

  • Container used (if applicable): nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04.2 LTS
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  • GPU memory size: 95.6 GB
  • Number of GPUs: 1
  • Library versions (if applicable):
    • Python: 3.12.3
    • ModelOpt version or commit hash: 0.37.0
    • CUDA: 13.0
      /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
      import pynvml  # type: ignore[import]
    • PyTorch: 2.8.0a0+34c6371d24.nv25.08
    • Transformers: 4.56.0
    • TensorRT-LLM: 1.2.0rc1
    • ONNXRuntime: 1.22.0
    • TensorRT: 10.13.2.6
  • Any other details that may help: None
