feat(qwen2): add KV cache management and selective attention #3236
+303
−30
Summary
Adds KV cache management and fixes a critical causal mask bug for Qwen2 multi-turn inference. Also includes numerical precision improvements for RoPE and attention.
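The NaN concern behind the mask-value change can be illustrated with a small standalone sketch. It uses only the Rust standard library, not candle, and the `softmax` helper is a hypothetical stand-in rather than the code in this PR. It shows two ways an infinite mask value can turn into NaN where `f32::MIN` stays finite.

```rust
/// Numerically stable softmax over one row of attention scores.
/// Hypothetical helper for illustration; not the candle implementation.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // A fully masked row, e.g. when combined masks hide every key.
    let inf_row = [f32::NEG_INFINITY; 4];
    let min_row = [f32::MIN; 4];

    // -inf - (-inf) is NaN, so every probability becomes NaN.
    assert!(softmax(&inf_row).iter().all(|p| p.is_nan()));

    // f32::MIN - f32::MIN is 0.0, so the row degrades to a uniform distribution.
    assert!(softmax(&min_row).iter().all(|&p| p == 0.25));

    // Multiplying by a 0/1 keep-mask is another NaN source with infinities.
    assert!((0.0_f32 * f32::NEG_INFINITY).is_nan());
    assert_eq!(0.0_f32 * f32::MIN, 0.0);
}
```

One caveat: adding two `f32::MIN` values still overflows to negative infinity in `f32`, so additive mask combination generally needs a clamp or a logical merge. How this PR handles that case is not shown here.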
Changes
Causal mask shape fixed (was [tgt, tgt], now [tgt, total]) - critical for multi-turn conversations
extract_kv_cache/restore_kv_cache methods for cache manipulation and inspection
prepare_4d_causal_attention_mask_with_cache_position for non-contiguous cache positions
forward_from_embeds methods enable custom embedding workflows (e.g., multimodal)
Replaced NEG_INFINITY with f32::MIN to avoid NaN propagation when combining masks
shift_kv_cache_first_to_last for advanced patterns (e.g., negative prompt refresh)
Motivation
The causal mask bug prevented proper multi-turn decoding with KV cache. The new cache management APIs enable advanced inference patterns like streaming audio generation (VibeVoice) and speculative decoding while maintaining precision for F16/BF16 inference.
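To make the [tgt, tgt] versus [tgt, total] distinction concrete, here is a minimal standalone sketch of a causal mask that accounts for cached positions. It uses plain `Vec`s instead of candle tensors, and the function name `causal_mask_with_cache` and its parameters are assumptions made for illustration, not the API added in this PR.

```rust
/// Build an additive causal mask of shape [tgt, total], where
/// total = past_len + tgt. Entry (i, j) is 0.0 when query i may attend
/// to key j, and f32::MIN otherwise.
/// Hypothetical helper for illustration only.
fn causal_mask_with_cache(tgt: usize, past_len: usize) -> Vec<Vec<f32>> {
    let total = past_len + tgt;
    (0..tgt)
        .map(|i| {
            // Query i sits at absolute position past_len + i.
            let q_pos = past_len + i;
            (0..total)
                .map(|j| if j <= q_pos { 0.0 } else { f32::MIN })
                .collect()
        })
        .collect()
}

fn main() {
    // Second turn: 3 tokens already cached, 2 new tokens to decode.
    let mask = causal_mask_with_cache(2, 3);
    assert_eq!(mask.len(), 2); // tgt rows
    assert_eq!(mask[0].len(), 5); // total = past_len + tgt columns

    for row in &mask {
        let cells: Vec<&str> = row
            .iter()
            .map(|&v| if v == 0.0 { "0" } else { "x" })
            .collect();
        println!("{}", cells.join(" "));
    }
    // prints:
    // 0 0 0 0 x
    // 0 0 0 0 0
}
```

A [tgt, tgt] mask would have only two columns in this example and could not be applied to attention scores that span all five key positions, which is why the mask width has to include the cached prefix in multi-turn decoding.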
Breaking Changes
None - all changes are backward compatible additions or bug fixes.
✅ Validation
Routine
cargo fmt --all
cargo test -p candle-transformers
cargo clippy -p candle-transformers
Test Qwen2 Example
Simple Query
cargo run --example qwen --features metal --release -- --prompt "Write a poem about butterflies." --model "2-1.5b"
Test with a very short prompt to ensure single-token decode works
cargo run --example qwen --features metal --release -- --prompt "Hi" --sample-len 10 --model "2-1.5b"