[OpenVINO] Add Qwen3.5 (Gated Delta Networks) export and inference support#1635

Closed
taowen-paraflow wants to merge 1 commit into huggingface:main from taowen-paraflow:add-qwen3.5-openvino-support
Conversation

@taowen-paraflow

Summary

  • Add complete OpenVINO export and stateful inference support for Qwen3.5 (Gated Delta Networks hybrid architecture)
  • Qwen3.5 combines 18 linear attention layers (GatedDeltaNet with conv_states + recurrent_states) with 6 full attention layers (standard KV cache) in the 0.8B variant
  • Include compatibility fixes for transformers 5.x and OpenVINO dev version parsing

Qwen3.5 core support (6 files)

| File | Change |
| --- | --- |
| model_configs.py | `Qwen3_5OpenVINOConfig` + `Qwen3_5DummyPastKeyValuesGenerator` for the hybrid cache (conv + recurrent + KV) |
| model_patcher.py | `Qwen3_5Patcher` wrapping `Qwen3_5DynamicCache`, patching GDN layers to torch fallback paths |
| utils.py | Register `qwen3_5` / `qwen3_5_text` in `SSM_MODELS` and `SKIP_CHECK_TRACE_MODELS` |
| stateful.py | `patch_stateful_hybrid_ssm` supports a `recurrent` prefix for GDN states |
| modeling_decoder.py | `OVCacheWithMambaStates` extended with `recurrent_states`; token-by-token prefill with `position_ids` |
| configuration.py | Default int4 quantization presets for Qwen3.5-3B and Qwen3.5-8B |
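To make the hybrid cache layout concrete, here is a minimal sketch of what a dummy past-key-values generator for this architecture produces: conv + recurrent states for each GDN layer and a standard KV pair for each full-attention layer. The function name and all shapes below are illustrative assumptions, not the actual `Qwen3_5DummyPastKeyValuesGenerator` code or the real Qwen3.5 dimensions.

```python
# Hypothetical sketch: shape layout of a hybrid GDN + full-attention cache.
# All dimension values are made up for illustration.

def dummy_hybrid_cache(batch=1, num_linear=18, num_full=6,
                       conv_kernel=4, conv_dim=3072,
                       num_heads=16, head_dim=128, num_kv_heads=2):
    cache = {}
    for i in range(num_linear):
        # GatedDeltaNet state: short causal-conv buffer + recurrent state
        cache[f"conv_states.{i}"] = (batch, conv_dim, conv_kernel)
        cache[f"recurrent_states.{i}"] = (batch, num_heads, head_dim, head_dim)
    for i in range(num_full):
        # Standard KV cache entries; seq_len=1 matches the decode-only trace
        cache[f"key.{i}"] = (batch, num_kv_heads, 1, head_dim)
        cache[f"value.{i}"] = (batch, num_kv_heads, 1, head_dim)
    return cache

cache = dummy_hybrid_cache()
# 18 conv + 18 recurrent + 6 key + 6 value = 48 state tensors,
# matching the stateful-variable count quoted for the 0.8B model
print(len(cache))  # 48
```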

Compatibility fixes (7 files)

| File | Change |
| --- | --- |
| import_utils.py | OpenVINO dev version string parsing (`"2026.0.0-17740-abc"` → `"2026.0.0"`) |
| modeling_base.py | `AttributeError` handling for composite configs; `is_offline_mode` import fallback |
| modeling_open_clip.py | `is_offline_mode` import fallback |
| modeling_seq2seq.py | `AutoModelForVision2Seq` import fallback (removed in transformers 5.x) |
| utils.py (intel) | `ParameterFormat` / `compute_serialized_parameters_size` inline fallback |
| modeling_utils.py | `HfFolder` shim for huggingface_hub 0.25+ |
| setup.py | Remove the `transformers<4.58` upper bound (Qwen3.5 requires ≥5.3) |
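The dev-version fix amounts to stripping the build/commit suffix so a strict version parser accepts the string. A minimal sketch of the idea (the function name and regex are mine, not the actual import_utils.py code):

```python
import re

def strip_ov_dev_suffix(version: str) -> str:
    """Keep only the leading major.minor.patch of an OpenVINO dev version,
    e.g. "2026.0.0-17740-abc" -> "2026.0.0". Unparseable strings pass through."""
    match = re.match(r"\d+\.\d+\.\d+", version)
    return match.group(0) if match else version

print(strip_ov_dev_suffix("2026.0.0-17740-abc"))  # 2026.0.0
print(strip_ov_dev_suffix("2025.3.0"))            # 2025.3.0
```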

Key design decisions

  • Decode-only export: Model is traced with seq_len=1; prefill is handled token-by-token at runtime with correct position_ids for RoPE
  • No CUDA dependencies: GDN layers are patched to use torch fallback paths (torch_causal_conv1d_update, torch_recurrent_gated_delta_rule) during export, avoiding flash-linear-attention / causal-conv1d CUDA kernels
  • OV native stateful conversion: Uses apply_make_stateful_transformation directly (no custom reimplementation)
  • 48 stateful variables: 18 conv + 18 recurrent + 6 key + 6 value for the 0.8B model
  • bfloat16 handling: Added bf16→OVType.bf16 path in _get_input_info since NumPy lacks bf16 support
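The decode-only design means prefill must replay the prompt one token at a time while the model's internal states accumulate, with explicit position_ids so RoPE sees the same positions as a full-sequence prefill would. The sketch below illustrates that loop; `infer_request` and its `run` method are stand-ins for an OpenVINO stateful inference request, and the greedy argmax is illustrative, not the optimum-intel implementation.

```python
# Sketch of token-by-token prefill + greedy decode against a decode-only
# (seq_len=1) stateful model. `infer_request` is a hypothetical stand-in.

def prefill_then_decode(infer_request, prompt_ids, max_new_tokens=8):
    position = 0
    logits = None
    # Prefill: feed the prompt one token at a time; conv/recurrent/KV states
    # accumulate inside the stateful model across calls. Passing the running
    # position keeps RoPE consistent with a normal full-sequence prefill.
    for token in prompt_ids:
        logits = infer_request.run(input_ids=[token], position_ids=[position])
        position += 1
    # Decode: greedy argmax over the last logits, then keep stepping
    generated = []
    for _ in range(max_new_tokens):
        next_token = max(range(len(logits)), key=logits.__getitem__)
        generated.append(next_token)
        logits = infer_request.run(input_ids=[next_token],
                                   position_ids=[position])
        position += 1
    return generated
```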

Tested on

  • Qwen3.5-0.8B on Intel Core Ultra 7 258V (Lunar Lake), CPU inference ~6-10 tok/s
  • Output quality verified against PyTorch reference ("The capital of France is Paris", arithmetic patterns, etc.)

Test plan

  • Export Qwen3.5-0.8B to OpenVINO IR with optimum-cli export openvino
  • Run stateful inference and verify output quality
  • Verify no regression on existing SSM models (Zamba2, Mamba, etc.)
  • Add tiny-random-qwen3.5 model to HF Hub for CI testing

🤖 Generated with Claude Code

[OpenVINO] Add Qwen3.5 (Gated Delta Networks) export and inference support

Qwen3.5 uses a hybrid GatedDeltaNet + full attention architecture
(18 linear_attention + 6 full_attention layers for the 0.8B model).
This adds complete OpenVINO export and stateful inference support.

Core changes:
- Qwen3_5OpenVINOConfig with custom DummyPastKeyValuesGenerator for
  conv_states, recurrent_states (linear attn) and KV cache (full attn)
- Qwen3_5Patcher: wraps Qwen3_5DynamicCache, patches GDN layers to use
  torch fallback paths (no CUDA-only flash-linear-attention dependency)
- Stateful conversion with recurrent state prefix support
- Token-by-token prefill in OVModelWithMambaForCausalLM with position_ids
- Default quantization presets for Qwen3.5-3B and Qwen3.5-8B

Also includes compatibility fixes:
- OpenVINO dev version string parsing (strip commit suffixes)
- AttributeError handling for composite model configs
- transformers 5.x import fallbacks (is_offline_mode, HfFolder,
  AutoModelForVision2Seq, ParameterFormat)
- bfloat16 tensor handling in _get_input_info (NumPy lacks bf16 support)
- Remove transformers<4.58 upper bound (Qwen3.5 requires transformers>=5.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rkazants
Collaborator

rkazants commented Mar 9, 2026

Hi @taowen-paraflow, thanks a lot for your contribution, but we are already working on this task in this PR: #1634

Feel free to pick up a good-first-issue to contribute.

rkazants closed this on Mar 9, 2026