Failed to save quantized model when some tensors are offloaded to CPU #491

@DKingAlpha

Description

Describe the bug

Saving the quantized model fails when some of its tensors are offloaded to the CPU.

Saving original model config to /workspace/output_models/TensorRT-Model-Optimizer/examples/vlm_ptq/saved_models_GLM-4_5V-nvfp4
Saving processor config to /workspace/output_models/TensorRT-Model-Optimizer/examples/vlm_ptq/saved_models_GLM-4_5V-nvfp4
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 8559.80it/s]
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 4387.35it/s]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 9020.01it/s]
/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py:545: UserWarning: Cannot export model to the model_config. The modelopt-optimized model state_dict can be saved with torch.save for further inspection.
  warnings.warn(
Traceback (most recent call last):
  File "/workspace/output_models/TensorRT-Model-Optimizer/examples/vlm_ptq/../llm_ptq/hf_ptq.py", line 780, in <module>
    main(args)
  File "/workspace/output_models/TensorRT-Model-Optimizer/examples/vlm_ptq/../llm_ptq/hf_ptq.py", line 629, in main
    export_hf_checkpoint(
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 549, in export_hf_checkpoint
    raise e
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 515, in export_hf_checkpoint
    post_state_dict, hf_quant_config = _export_hf_checkpoint(model, dtype)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 465, in _export_hf_checkpoint
    _export_quantized_weight(sub_module, dtype)
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/export/unified_export_hf.py", line 307, in _export_quantized_weight
    weight_scale = NVFP4QTensor.get_weights_scaling_factor(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/output_models/TensorRT-Model-Optimizer/modelopt/torch/quantization/qtensor/nvfp4_tensor.py", line 84, in get_weights_scaling_factor
    per_block_scale = per_block_amax / (6.0 * weights_scaling_factor_2)
                      ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
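
For reference, the failure reduces to a cross-device elementwise op in PyTorch. A minimal sketch (with hypothetical stand-in tensors, not ModelOpt's actual shapes) reproduces the same RuntimeError and shows the generic fix of co-locating the operands:

```python
import torch

# Hypothetical stand-ins for the tensors in nvfp4_tensor.py: one left on the
# GPU, one offloaded to the CPU. Shapes are made up for illustration.
per_block_amax = torch.rand(16, device="cuda")
weights_scaling_factor_2 = torch.rand(1)  # on CPU, e.g. after offloading

try:
    per_block_scale = per_block_amax / (6.0 * weights_scaling_factor_2)
except RuntimeError as e:
    print(e)  # "Expected all tensors to be on the same device, ..."

# Co-locating the operands avoids the mismatch:
per_block_scale = per_block_amax / (
    6.0 * weights_scaling_factor_2.to(per_block_amax.device)
)
```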

Steps/Code to reproduce bug

  • ./huggingface_example.sh --model zai-org/GLM-4.5V --quant nvfp4
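
For context on why some tensors end up on the CPU at all: GLM-4.5V in bf16 is too large for a single 96 GB GPU, so a `device_map="auto"` load (an assumption about what the example script does under the hood; the actual loader class may differ) lets accelerate spill overflow layers to CPU. A hypothetical sketch of the resulting mixed placement:

```python
import torch
# Hypothetical loader class for illustration; hf_ptq.py may use a different one.
from transformers import AutoModelForImageTextToText

# Assumption: loading with device_map="auto" places layers that do not fit in
# the single 96 GB GPU on the CPU via accelerate.
model = AutoModelForImageTextToText.from_pretrained(
    "zai-org/GLM-4.5V",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# With a model too large for one GPU, parameters now span two devices:
print({p.device for p in model.parameters()})  # e.g. {device('cuda:0'), device('cpu')}
```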

Expected behavior

Export should either handle the CPU-offloaded tensors (e.g., move them to a common device for the scale computation) or find enough GPU memory somewhere so the model can be saved.
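
Until there is a proper fix, one possible workaround is to pull the offloaded parameters back onto the GPU right before `export_hf_checkpoint` is called. This is only a sketch under the assumption that the quantized model fits in GPU memory at export time (and that no accelerate hooks interfere); `move_offloaded_to` is a hypothetical helper, not a ModelOpt API:

```python
import torch
import torch.nn as nn

# Hypothetical helper (not a ModelOpt API): move any CPU-offloaded parameters
# and buffers back onto the GPU so the NVFP4 scale computation sees a single
# device. Only viable if the whole model fits in GPU memory.
def move_offloaded_to(model: nn.Module, device: str = "cuda:0") -> None:
    for module in model.modules():
        for name, param in list(module.named_parameters(recurse=False)):
            if param.device.type == "cpu":
                module._parameters[name] = nn.Parameter(
                    param.data.to(device), requires_grad=param.requires_grad
                )
        for name, buf in list(module.named_buffers(recurse=False)):
            if buf.device.type == "cpu":
                module._buffers[name] = buf.to(device)

# Usage sketch: call move_offloaded_to(model) just before export_hf_checkpoint(...).
```

Alternatively, the fix could live in `get_weights_scaling_factor` itself, e.g. moving `weights_scaling_factor_2` to `per_block_amax.device` before the division, which would make the export path device-agnostic.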

System information

  • Container used (if applicable): nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc1
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04.2 LTS
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): NVIDIA RTX PRO 6000 Blackwell Workstation Edition
  • GPU memory size: 95.6 GB
  • Number of GPUs: 1
  • Library versions (if applicable):
    • Python: 3.12.3
    • ModelOpt version or commit hash: 0.37.0
    • CUDA: 13.0
      /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
      import pynvml  # type: ignore[import]
    • PyTorch: 2.8.0a0+34c6371d24.nv25.08
    • Transformers: 4.56.0
    • TensorRT-LLM: 1.2.0rc1
    • ONNXRuntime: 1.22.0
    • TensorRT: 10.13.2.6
  • Any other details that may help: None
