Describe the bug
Running olive auto-opt to optimize a model for CPU with INT4 precision fails when --use_model_builder is not specified. The default ONNX export path in olive/passes/onnx/conversion.py calls DynamicCache.from_legacy_cache(), which was removed in transformers 5.x, causing an AttributeError.
Adding --use_model_builder (and --use_ort_genai) bypasses this by using the onnxruntime-genai model builder instead of torch.onnx.export, and the optimization + inference completes successfully.
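For completeness, inference on the optimized model was checked with an onnxruntime-genai loop along these lines (a rough sketch: the output subfolder and the generator API names, such as append_tokens and generate_next_token, are assumptions based on recent onnxruntime-genai releases):

import onnxruntime_genai as og

# Load the genai model produced by the working command below (path layout assumed).
model = og.Model("models/qwen3-cpu-int4/model")
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
params.set_search_options(max_length=128)

# Feed a prompt and greedily generate until the model stops.
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is ONNX Runtime?"))
while not generator.is_done():
    generator.generate_next_token()
print(tokenizer.decode(generator.get_sequence(0)))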
The --use_model_builder flag is documented as optional, but omitting it when targeting CPU with INT4 precision on transformers 5.x results in a crash. The official quickstart example in the README omits this flag, which may lead users to the same failure.
This was discovered while investigating a related issue where, on older package versions (transformers 4.x, onnxruntime-genai 0.5.0), the model builds successfully without --use_model_builder but fails at inference time with an OrtException related to GatherBlockQuantized and uint8 tensors. On current package versions, the failure occurs earlier — at the conversion stage itself.
To Reproduce
- Install current packages (a sample pip command follows these steps):
  - olive-ai[all]==0.11.0
  - onnxruntime-genai==0.11.4
  - transformers==5.1.0
  - torch==2.10.0
  - Python 3.13
- Run optimization without --use_model_builder:
python -m olive auto-opt \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--output_path models/qwen3-cpu-int4 \
--device cpu \
--provider CPUExecutionProvider \
--precision int4 \
--log_level 1
- Observe crash at the ONNX conversion stage.
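For reference, the packages above can be installed in one step (a sketch; assumes the PyPI package names match the versions listed):

pip install "olive-ai[all]==0.11.0" onnxruntime-genai==0.11.4 transformers==5.1.0 torch==2.10.0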
Expected behavior
olive auto-opt should either:
- Default to --use_model_builder when targeting CPU with INT4 precision, or
- Be compatible with transformers 5.x on the standard ONNX export path, or
- Surface a clear error message directing users to use --use_model_builder (see the sketch after this list)
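A minimal sketch of what the third option could look like, assuming a version check early in the conversion pass (the exact location and wording are illustrative, not Olive's actual code):

import transformers
from packaging import version

# Fail fast with an actionable message instead of an AttributeError mid-export.
if version.parse(transformers.__version__) >= version.parse("5.0"):
    raise RuntimeError(
        "The default ONNX export path relies on DynamicCache.from_legacy_cache(), "
        "which was removed in transformers 5.x. Re-run olive auto-opt with "
        "--use_model_builder (and --use_ort_genai), or pin transformers<5."
    )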
Olive config
No JSON config — reproduced via CLI.
Working command:
python -m olive auto-opt \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--output_path models/qwen3-cpu-int4 \
--device cpu \
--provider CPUExecutionProvider \
--use_model_builder \
--use_ort_genai \
--precision int4 \
--log_level 1
Olive logs
Traceback (most recent call last):
File "...\olive\engine\engine.py", line 732, in _run_pass
output_model_config = host.run_pass(p, input_model_config, output_model_path)
File "...\olive\systems\local.py", line 52, in run_pass
return the_pass.run(model_config, output_model_path)
File "...\olive\passes\onnx\conversion.py", line 196, in run
return self._run_for_config(model_config, config, output_model_path)
File "...\olive\passes\onnx\conversion.py", line 390, in _run_for_config
return OnnxConversion._convert_model_on_device(...)
File "...\olive\passes\onnx\conversion.py", line 596, in _convert_model_on_device
ir_model = _export_pytorch_model(...)
File "...\torch\utils\_contextlib.py", line 124, in decorate_context
return func(*args, **kwargs)
File "...\olive\passes\onnx\conversion.py", line 267, in _export_pytorch_model
torch.onnx.export(...)
File "...\torch\onnx\__init__.py", line 341, in export
export(...)
File "...\torch\onnx\_internal\torchscript_exporter\utils.py", line 552, in export
_export(...)
File "...\torch\onnx\_internal\torchscript_exporter\utils.py", line 1513, in _export
graph, params_dict, torch_out = _model_to_graph(...)
File "...\torch\onnx\_internal\torchscript_exporter\utils.py", line 1112, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args)
File "...\torch\onnx\_internal\torchscript_exporter\utils.py", line 996, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "...\torch\onnx\_internal\torchscript_exporter\utils.py", line 903, in _trace_and_get_graph_from_model
trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(...)
File "...\torch\jit\_trace.py", line 1432, in _get_trace_graph
outs = ONNXTracedModule(...)(*args, **kwargs)
File "...\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "...\torch\nn\modules\module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "...\torch\jit\_trace.py", line 140, in forward
graph, _out = torch._C._create_graph_by_tracing(...)
File "...\torch\jit\_trace.py", line 131, in wrapper
outs.append(self.inner(*trace_inputs))
File "...\torch\nn\modules\module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "...\torch\nn\modules\module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "...\torch\nn\modules\module.py", line 1766, in _slow_forward
result = self.forward(*input, **kwargs)
File "...\olive\passes\onnx\conversion.py", line 104, in patched_forward
args[pkv_index] = DynamicCache.from_legacy_cache(args[pkv_index])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: type object 'DynamicCache' has no attribute 'from_legacy_cache'
Other information
- OS: Windows 11
- Olive version: 0.11.0
- ONNXRuntime package and version: onnxruntime-genai==0.11.4
- Transformers package version: transformers==5.1.0
- Torch version: 2.10.0
- Python version: 3.13
Additional context
- The root cause is in olive/passes/onnx/conversion.py line 104, which calls DynamicCache.from_legacy_cache(), a method that was removed in transformers 5.x (see the compatibility sketch after this list).
- This likely affects all models optimized via olive auto-opt without --use_model_builder on transformers 5.x, not just Qwen2.5.
- The olive-recipes repo currently has no CPU recipe for Qwen2.5-0.5B-Instruct; all existing recipes target GPU/NPU runtimes. Happy to contribute a CPU recipe PR.
- Related: an earlier report of the same underlying issue (missing --use_model_builder) on older packages (transformers 4.x, onnxruntime-genai 0.5.0) manifested as a GatherBlockQuantized/uint8 OrtException at inference time rather than at conversion time.
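A minimal sketch of a compatibility shim for the patched_forward path, assuming DynamicCache() and its update(key, value, layer_idx) method are still available in transformers 5.x (not verified against the 5.x API):

from transformers.cache_utils import DynamicCache

def legacy_to_dynamic_cache(past_key_values):
    # Convert legacy tuple-of-tuples past_key_values into a DynamicCache,
    # working both before and after the removal of from_legacy_cache().
    if hasattr(DynamicCache, "from_legacy_cache"):
        return DynamicCache.from_legacy_cache(past_key_values)
    cache = DynamicCache()
    for layer_idx, (key_states, value_states) in enumerate(past_key_values):
        cache.update(key_states, value_states, layer_idx)
    return cache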