[OpenVINO] Add Qwen3.5 (Gated Delta Networks) export and inference support#1635

Closed
taowen-paraflow wants to merge 1 commit into huggingface:main from taowen-paraflow:add-qwen3.5-openvino-support
Conversation

@taowen-paraflow

Summary

  • Add complete OpenVINO export and stateful inference support for Qwen3.5 (Gated Delta Networks hybrid architecture)
  • Qwen3.5 combines 18 linear attention layers (GatedDeltaNet with conv_states + recurrent_states) with 6 full attention layers (standard KV cache) in the 0.8B variant
  • Include compatibility fixes for transformers 5.x and OpenVINO dev version parsing

Qwen3.5 core support (6 files)

| File | Change |
| --- | --- |
| model_configs.py | `Qwen3_5OpenVINOConfig` + `Qwen3_5DummyPastKeyValuesGenerator` for the hybrid cache (conv + recurrent + KV) |
| model_patcher.py | `Qwen3_5Patcher` wrapping `Qwen3_5DynamicCache`, patching GDN layers to torch fallback paths |
| utils.py | Register `qwen3_5` / `qwen3_5_text` in `SSM_MODELS` and `SKIP_CHECK_TRACE_MODELS` |
| stateful.py | `patch_stateful_hybrid_ssm` supports a `recurrent` prefix for GDN states |
| modeling_decoder.py | `OVCacheWithMambaStates` extended with `recurrent_states`; token-by-token prefill with `position_ids` |
| configuration.py | Default int4 quantization presets for Qwen3.5-3B and Qwen3.5-8B |
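To make the hybrid cache layout concrete, here is a minimal sketch of what a dummy past-key-values generator for this architecture produces: conv + recurrent states for each GDN layer and a standard KV pair for each full-attention layer. The function name and all shapes below are illustrative assumptions, not the actual `Qwen3_5DummyPastKeyValuesGenerator` code or the real Qwen3.5 dimensions.

```python
# Hypothetical sketch: shape layout of a hybrid GDN + full-attention cache.
# All dimension values are made up for illustration.

def dummy_hybrid_cache(batch=1, num_linear=18, num_full=6,
                       conv_kernel=4, conv_dim=3072,
                       num_heads=16, head_dim=128, num_kv_heads=2):
    cache = {}
    for i in range(num_linear):
        # GatedDeltaNet state: short causal-conv buffer + recurrent state
        cache[f"conv_states.{i}"] = (batch, conv_dim, conv_kernel)
        cache[f"recurrent_states.{i}"] = (batch, num_heads, head_dim, head_dim)
    for i in range(num_full):
        # Standard KV cache entries; seq_len=1 matches the decode-only trace
        cache[f"key.{i}"] = (batch, num_kv_heads, 1, head_dim)
        cache[f"value.{i}"] = (batch, num_kv_heads, 1, head_dim)
    return cache

cache = dummy_hybrid_cache()
# 18 conv + 18 recurrent + 6 key + 6 value = 48 state tensors,
# matching the stateful-variable count quoted for the 0.8B model
print(len(cache))  # 48
```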

Compatibility fixes (7 files)

| File | Change |
| --- | --- |
| import_utils.py | OpenVINO dev version string parsing (`"2026.0.0-17740-abc"` → `"2026.0.0"`) |
| modeling_base.py | `AttributeError` handling for composite configs; `is_offline_mode` import fallback |
| modeling_open_clip.py | `is_offline_mode` import fallback |
| modeling_seq2seq.py | `AutoModelForVision2Seq` import fallback (removed in transformers 5.x) |
| utils.py (intel) | `ParameterFormat` / `compute_serialized_parameters_size` inline fallback |
| modeling_utils.py | `HfFolder` shim for huggingface_hub 0.25+ |
| setup.py | Remove the `transformers<4.58` upper bound (Qwen3.5 requires ≥5.3) |
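The dev-version fix amounts to stripping the build/commit suffix so a strict version parser accepts the string. A minimal sketch of the idea (the function name and regex are mine, not the actual import_utils.py code):

```python
import re

def strip_ov_dev_suffix(version: str) -> str:
    """Keep only the leading major.minor.patch of an OpenVINO dev version,
    e.g. "2026.0.0-17740-abc" -> "2026.0.0". Unparseable strings pass through."""
    match = re.match(r"\d+\.\d+\.\d+", version)
    return match.group(0) if match else version

print(strip_ov_dev_suffix("2026.0.0-17740-abc"))  # 2026.0.0
print(strip_ov_dev_suffix("2025.3.0"))            # 2025.3.0
```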

Key design decisions

  • Decode-only export: Model is traced with seq_len=1; prefill is handled token-by-token at runtime with correct position_ids for RoPE
  • No CUDA dependencies: GDN layers are patched to use torch fallback paths (torch_causal_conv1d_update, torch_recurrent_gated_delta_rule) during export, avoiding flash-linear-attention / causal-conv1d CUDA kernels
  • OV native stateful conversion: Uses apply_make_stateful_transformation directly (no custom reimplementation)
  • 48 stateful variables: 18 conv + 18 recurrent + 6 key + 6 value for the 0.8B model
  • bfloat16 handling: Added bf16→OVType.bf16 path in _get_input_info since NumPy lacks bf16 support
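The decode-only design means prefill must replay the prompt one token at a time while the model's internal states accumulate, with explicit position_ids so RoPE sees the same positions as a full-sequence prefill would. The sketch below illustrates that loop; `infer_request` and its `run` method are stand-ins for an OpenVINO stateful inference request, and the greedy argmax is illustrative, not the optimum-intel implementation.

```python
# Sketch of token-by-token prefill + greedy decode against a decode-only
# (seq_len=1) stateful model. `infer_request` is a hypothetical stand-in.

def prefill_then_decode(infer_request, prompt_ids, max_new_tokens=8):
    position = 0
    logits = None
    # Prefill: feed the prompt one token at a time; conv/recurrent/KV states
    # accumulate inside the stateful model across calls. Passing the running
    # position keeps RoPE consistent with a normal full-sequence prefill.
    for token in prompt_ids:
        logits = infer_request.run(input_ids=[token], position_ids=[position])
        position += 1
    # Decode: greedy argmax over the last logits, then keep stepping
    generated = []
    for _ in range(max_new_tokens):
        next_token = max(range(len(logits)), key=logits.__getitem__)
        generated.append(next_token)
        logits = infer_request.run(input_ids=[next_token],
                                   position_ids=[position])
        position += 1
    return generated
```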

Tested on

  • Qwen3.5-0.8B on Intel Core Ultra 7 258V (Lunar Lake), CPU inference ~6-10 tok/s
  • Output quality verified against PyTorch reference ("The capital of France is Paris", arithmetic patterns, etc.)

Test plan

  • Export Qwen3.5-0.8B to OpenVINO IR with optimum-cli export openvino
  • Run stateful inference and verify output quality
  • Verify no regression on existing SSM models (Zamba2, Mamba, etc.)
  • Add tiny-random-qwen3.5 model to HF Hub for CI testing

🤖 Generated with Claude Code

[OpenVINO] Add Qwen3.5 (Gated Delta Networks) export and inference support

Qwen3.5 uses a hybrid GatedDeltaNet + full attention architecture
(18 linear_attention + 6 full_attention layers for the 0.8B model).
This adds complete OpenVINO export and stateful inference support.

Core changes:
- Qwen3_5OpenVINOConfig with custom DummyPastKeyValuesGenerator for
  conv_states, recurrent_states (linear attn) and KV cache (full attn)
- Qwen3_5Patcher: wraps Qwen3_5DynamicCache, patches GDN layers to use
  torch fallback paths (no CUDA-only flash-linear-attention dependency)
- Stateful conversion with recurrent state prefix support
- Token-by-token prefill in OVModelWithMambaForCausalLM with position_ids
- Default quantization presets for Qwen3.5-3B and Qwen3.5-8B

Also includes compatibility fixes:
- OpenVINO dev version string parsing (strip commit suffixes)
- AttributeError handling for composite model configs
- transformers 5.x import fallbacks (is_offline_mode, HfFolder,
  AutoModelForVision2Seq, ParameterFormat)
- bfloat16 tensor handling in _get_input_info (NumPy lacks bf16 support)
- Remove transformers<4.58 upper bound (Qwen3.5 requires transformers>=5.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rkazants
Collaborator

rkazants commented Mar 9, 2026

Hi @taowen-paraflow, thanks a lot for your contribution, but we are already working on this task in this PR: #1634

Feel free to pick up a good-first-issue to contribute.

rkazants closed this on Mar 9, 2026