[OpenVINO] Add Qwen3.5 (Gated Delta Networks) export and inference support#1635
Closed

taowen-paraflow wants to merge 1 commit into `huggingface:main` from
Conversation
…pport

Qwen3.5 uses a hybrid GatedDeltaNet + full attention architecture (18 linear_attention + 6 full_attention layers for the 0.8B model). This adds complete OpenVINO export and stateful inference support.

Core changes:

- `Qwen3_5OpenVINOConfig` with custom `DummyPastKeyValuesGenerator` for conv_states, recurrent_states (linear attn) and KV cache (full attn)
- `Qwen3_5Patcher`: wraps `Qwen3_5DynamicCache`, patches GDN layers to use torch fallback paths (no CUDA-only flash-linear-attention dependency)
- Stateful conversion with recurrent state prefix support
- Token-by-token prefill in `OVModelWithMambaForCausalLM` with `position_ids`
- Default quantization presets for Qwen3.5-3B and Qwen3.5-8B

Also includes compatibility fixes:

- OpenVINO dev version string parsing (strip commit suffixes)
- `AttributeError` handling for composite model configs
- transformers 5.x import fallbacks (`is_offline_mode`, `HfFolder`, `AutoModelForVision2Seq`, `ParameterFormat`)
- bfloat16 tensor handling in `_get_input_info` (NumPy lacks bf16 support)
- Remove `transformers<4.58` upper bound (Qwen3.5 requires transformers>=5.3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
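To make the "torch fallback paths" idea concrete: a gated delta rule keeps a per-head state matrix that is decayed by a gate and updated by a delta-rule write each timestep, evaluated sequentially instead of via fused CUDA kernels. Below is a minimal NumPy sketch under simplified assumptions (single head, scalar `alpha`/`beta` gates); the function names and signatures are illustrative, not the actual `torch_recurrent_gated_delta_rule` implementation.

```python
import numpy as np

def recurrent_gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a gated delta rule recurrence (illustrative sketch).

    S:     (d_k, d_v) linear-attention state matrix
    q, k:  (d_k,) query/key vectors
    v:     (d_v,) value vector
    alpha: scalar decay gate in [0, 1]
    beta:  scalar write-strength gate in [0, 1]
    """
    S = alpha * S                           # gated decay of the old state
    S = S + beta * np.outer(k, v - k @ S)   # delta-rule update toward v
    o = q @ S                               # read out with the query
    return S, o

def run_sequence(qs, ks, vs, alphas, betas, d_k, d_v):
    # Sequential token-by-token evaluation, mirroring the non-CUDA
    # fallback path used during export.
    S = np.zeros((d_k, d_v))
    outs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, o = recurrent_gated_delta_step(S, q, k, v, a, b)
        outs.append(o)
    return np.stack(outs)
```

With `alpha = beta = 1` and a unit key, a second write with the same key fully overwrites the stored value, which is the classic delta-rule behavior the gates then modulate.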
Collaborator
Hi @taowen-paraflow, thanks a lot for your contribution, but we are already working on this task in this PR: #1634. Feel free to pick up a good-first-issue for contribution.
Summary
Qwen3.5 core support (6 files)
- `model_configs.py`: `Qwen3_5OpenVINOConfig` + `Qwen3_5DummyPastKeyValuesGenerator` for hybrid cache (conv + recurrent + KV)
- `model_patcher.py`: `Qwen3_5Patcher` wrapping `Qwen3_5DynamicCache`, patching GDN layers to torch fallback paths
- `utils.py`: `qwen3_5`/`qwen3_5_text` in `SSM_MODELS` and `SKIP_CHECK_TRACE_MODELS`
- `stateful.py`: `patch_stateful_hybrid_ssm` supports `recurrent` prefix for GDN states
- `modeling_decoder.py`: `OVCacheWithMambaStates` extended with `recurrent_states`; token-by-token prefill with `position_ids`
- `configuration.py`

Compatibility fixes (7 files)

- `import_utils.py`: OpenVINO dev version string parsing (`"2026.0.0-17740-abc"` → `"2026.0.0"`)
- `modeling_base.py`: `AttributeError` handling for composite configs; `is_offline_mode` import fallback
- `modeling_open_clip.py`: `is_offline_mode` import fallback
- `modeling_seq2seq.py`: `AutoModelForVision2Seq` import fallback (removed in transformers 5.x)
- `utils.py` (intel): `ParameterFormat`/`compute_serialized_parameters_size` inline fallback
- `modeling_utils.py`: `HfFolder` shim for huggingface_hub 0.25+
- `setup.py`: remove `transformers<4.58` upper bound (Qwen3.5 requires ≥5.3)

Key design decisions

- Export with `seq_len=1`; prefill is handled token-by-token at runtime with correct `position_ids` for RoPE
- Torch fallback paths (`torch_causal_conv1d_update`, `torch_recurrent_gated_delta_rule`) during export, avoiding `flash-linear-attention`/`causal-conv1d` CUDA kernels
- `apply_make_stateful_transformation` used directly (no custom reimplementation)
- bfloat16 tensor handling in `_get_input_info`, since NumPy lacks bf16 support

Tested on
Test plan
`optimum-cli export openvino`

🤖 Generated with Claude Code
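The dev-build version parsing fix listed under the compatibility fixes (stripping commit suffixes like `-17740-abc` so the base version can be compared) can be sketched as follows; the helper name is hypothetical and the real change lives in `import_utils.py`:

```python
import re

def normalize_openvino_version(raw: str) -> str:
    """Strip dev-build commit suffixes from an OpenVINO version string,
    e.g. "2026.0.0-17740-abc" -> "2026.0.0" (illustrative helper; the
    actual fix in import_utils.py may differ in detail)."""
    match = re.match(r"(\d+\.\d+\.\d+)", raw)
    return match.group(1) if match else raw
```

Release builds without a suffix pass through unchanged, so the helper is safe to apply unconditionally before version comparison.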
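The `seq_len=1` export decision described in the key design decisions implies that prefill must loop over the prompt one token at a time, passing an explicit position id each step while the model's states persist between calls. A toy sketch of that runtime loop, where `ToyStatefulModel` is a hypothetical stand-in for the stateful OpenVINO infer request used by `OVModelWithMambaForCausalLM`:

```python
class ToyStatefulModel:
    """Stand-in for a stateful inference request: state survives between
    calls, so each call only needs to see a single token (hypothetical
    stub, not the real OpenVINO API)."""

    def __init__(self):
        self.state_sum = 0.0  # plays the role of conv/recurrent/KV states

    def infer(self, token_id, position_id):
        # The state accumulates across calls; position_id matters because
        # rotary embeddings (RoPE) depend on absolute positions.
        self.state_sum += token_id * (position_id + 1)
        return self.state_sum

def token_by_token_prefill(model, input_ids):
    """Feed the prompt one token at a time with explicit position ids,
    mirroring the seq_len=1 export strategy."""
    output = None
    for position_id, token_id in enumerate(input_ids):
        output = model.infer(token_id, position_id)
    return output  # output for the last prompt token
```

The key point the sketch illustrates is that correctness relies entirely on the model carrying its own state: the loop never re-feeds earlier tokens.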