Karen/needle v2 #690
Merged
Pull request overview
This PR introduces first-class support for the Needle model family across the Python transpilation/conversion stack and the C++ runtime by adding a new metadata-driven encoder → cross-KV → decoder-step execution route, including optional NPU/CoreML source-encoder export and runtime loading.
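The route described above splits generation into three stages. Cross-attention keys and values depend only on the encoder output, so the encoder and the cross-KV projection can run once per request while only the decoder step runs once per generated token. Below is a minimal sketch of that control flow. It assumes each stage is a plain Python callable; the real runtime in `model.cpp` executes compiled graphs or CoreML models, and `run_route` and the `toy_*` components are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

def run_route(
    source_ids: Sequence[int],
    encoder: Callable[[Sequence[int]], List[float]],
    cross_kv: Callable[[List[float]], Tuple[float, int]],
    decoder_step: Callable[[int, Tuple[float, int], int], int],
    start_id: int,
    eos_id: int,
    max_new_tokens: int,
) -> List[int]:
    hidden = encoder(source_ids)   # source encoder: runs once per request
    kv_cache = cross_kv(hidden)    # cross-attention K/V: computed once, then reused
    generated: List[int] = []
    token = start_id
    for position in range(max_new_tokens):
        # Each step sees only the previous token, the shared cross-KV, and its position.
        token = decoder_step(token, kv_cache, position)
        if token == eos_id:
            break
        generated.append(token)
    return generated

# Toy components (hypothetical) that count how often the expensive stages run.
calls = {"encoder": 0, "cross_kv": 0}

def toy_encoder(ids: Sequence[int]) -> List[float]:
    calls["encoder"] += 1
    return [float(i) for i in ids]

def toy_cross_kv(hidden: List[float]) -> Tuple[float, int]:
    calls["cross_kv"] += 1
    return (sum(hidden), len(hidden))

def toy_step(prev: int, kv: Tuple[float, int], position: int) -> int:
    # Emit prev + 1 while position < source length, then EOS (0).
    return prev + 1 if position < kv[1] else 0

result = run_route([5, 6, 7], toy_encoder, toy_cross_kv, toy_step,
                   start_id=100, eos_id=0, max_new_tokens=10)
```

Here `result` is `[101, 102, 103]`, and both counters stay at 1 no matter how many decoder steps run.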
Changes:
- Add Needle family detection, naming rules, tokenizer conversion handling, and a new Needle `ModelProfile` with component-pipeline defaults.
- Add component-spec generation for a new `encoder_cross_kv_decoder_step` runtime route (used by Needle; Whisper is refactored to use the same route metadata).
- Extend the C++ engine/runtime to load per-component metadata from the bundle manifest, run the new route (including external cross-KV cache inputs), and add Needle-specific prompt/tooling support.
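The runtime selects the route from per-component metadata in the bundle manifest rather than from hard-coded model names. The sketch below shows one way such a selection could work. The manifest schema, the component names in `REQUIRED`, and the `"default"` fallback label are hypothetical, and the manifest actually written by `hf_model.py` may look different.

```python
import json
from typing import Dict

ROUTE = "encoder_cross_kv_decoder_step"              # route name taken from the PR
REQUIRED = ("encoder", "cross_kv", "decoder_step")   # hypothetical component names
FALLBACK = "default"                                 # hypothetical label for the existing path

def select_route(manifest_text: str) -> str:
    manifest = json.loads(manifest_text)
    # Map component name -> its metadata dict (missing metadata becomes {}).
    metadata: Dict[str, dict] = {
        c["name"]: c.get("metadata", {}) for c in manifest.get("components", [])
    }
    declared = {m["route"] for m in metadata.values() if "route" in m}
    # Take the new route only if every component that declares a route agrees
    # and all required components are present in the bundle.
    if declared == {ROUTE} and all(name in metadata for name in REQUIRED):
        return ROUTE
    return FALLBACK

example = json.dumps({
    "components": [
        {"name": "encoder", "metadata": {"route": ROUTE}},
        {"name": "cross_kv", "metadata": {"route": ROUTE}},
        {"name": "decoder_step", "metadata": {"route": ROUTE}},
    ]
})
```

Because the decision is keyed on metadata, a second family (here, Whisper) can reuse the same route simply by emitting the same metadata.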
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/tests/test_encoder_cross_kv_route.py | Adds a unit test validating Needle component spec routing and metadata. |
| python/cactus/transpile/npu/source.py | Adds CoreML/ANE export for a generic “source encoder” component. |
| python/cactus/transpile/npu/pipeline.py | Wires source_encoder into the NPU export pipeline. |
| python/cactus/transpile/model_profiles.py | Adds a Needle model profile and a default_max_new_tokens field to profiles. |
| python/cactus/transpile/model_adapters.py | Adds Needle component adapters/spec builder and introduces shared encoder→crossKV→step spec helper; refactors Whisper spec generation. |
| python/cactus/transpile/lower.py | Adds lowering support for external cross-KV cache inputs and cached-attention lowering path. |
| python/cactus/transpile/hf_model.py | Writes per-component metadata into the manifest and updates component ordering. |
| python/cactus/transpile/component_plan.py | Carries profile default_max_new_tokens through planning. |
| python/cactus/convert/model_adapters/naming.py | Adds Needle-specific weight naming rewrites (gates + encoder layer prefixing). |
| python/cactus/convert/model_adapters/detection.py | Adds needle to supported family detection. |
| python/cactus/convert/model_adapters/adapters.py | Registers a needle family adapter. |
| python/cactus/convert/cactus_adapters/tokenizer.py | Adds Needle tokenizer/model handling (SentencePiece path + extra special tokens handling). |
| python/cactus/convert/cactus_adapters/config_utils.py | Adds Needle model-type detection and exports decoder start/prompt token IDs (incl. Whisper prompt IDs). |
| python/cactus/cli/model.py | Uses profile-provided default_max_new_tokens when available. |
| cactus/CMakeLists.txt | Adds a macOS curl fallback link strategy when the vendored curl archive is missing. |
| cactus-engine/src/utils.h | Adds parsing for Needle `<tool_call>` output blocks before falling back to Gemma parsing. |
| cactus-engine/src/transcribe.cpp | Switches Whisper default decoder prompt tokens from hard-coded text to config-driven IDs. |
| cactus-engine/src/tokenizer.cpp | Adds Needle model-type detection and Needle-specific chat prompt formatting. |
| cactus-engine/src/npu_ane.mm | Adds INT32 multi-input support for CoreML multi-array inputs. |
| cactus-engine/src/model.cpp | Adds encoder→crossKV→step route selection via manifest metadata; adds external cross-KV cache packing/quantization helpers; adds Needle decode path and NPU source-encoder loading; extends config parsing. |
| cactus-engine/src/model_npu.cpp | Adds NPU source-encoder loading and INT32 input plumbing for source encoding. |
| cactus-engine/src/engine.h | Extends config with decoder start/prompt token IDs; adds Needle tokenizer type; extends NPU named input typing; adds new decode-route fields. |
| cactus-engine/src/constraints.cpp | Adds Needle-specific tool-call constraining and trie-based name/arg-key token filtering. |
| cactus-engine/src/complete.cpp | Adds Needle tool schema serialization and Needle-specific prompt/tool constraint behavior tweaks. |
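The `constraints.cpp` row mentions trie-based filtering of tool-name and argument-key tokens. The idea is to build a prefix tree over the token-ID sequences of every valid name, then at each step allow only tokens that keep the output on a path to some name. The sketch below assumes names are already tokenized. `TokenTrie`, `mask_logits`, and the token IDs are hypothetical, not the engine's API.

```python
from typing import Dict, List, Optional, Sequence, Set

class TokenTrie:
    """Prefix tree over token-ID sequences, e.g. tokenized tool names."""

    def __init__(self) -> None:
        self.children: Dict[int, "TokenTrie"] = {}
        self.terminal = False  # True if a complete name ends at this node

    def insert(self, token_ids: Sequence[int]) -> None:
        node = self
        for tid in token_ids:
            node = node.children.setdefault(tid, TokenTrie())
        node.terminal = True

    def _walk(self, prefix: Sequence[int]) -> Optional["TokenTrie"]:
        node = self
        for tid in prefix:
            child = node.children.get(tid)
            if child is None:
                return None  # prefix has left every valid name
            node = child
        return node

    def allowed_next(self, prefix: Sequence[int]) -> Set[int]:
        """Token IDs that keep the generated prefix on a path to some valid name."""
        node = self._walk(prefix)
        return set(node.children) if node is not None else set()

    def is_complete(self, prefix: Sequence[int]) -> bool:
        node = self._walk(prefix)
        return node is not None and node.terminal

def mask_logits(logits: List[float], allowed: Set[int]) -> List[float]:
    """Set every disallowed token's logit to -inf so only allowed IDs can be sampled."""
    return [x if i in allowed else float("-inf") for i, x in enumerate(logits)]

# Hypothetical token IDs for three tokenized tool names.
trie = TokenTrie()
trie.insert([10, 11])  # e.g. "get" + "_weather"
trie.insert([10, 12])  # e.g. "get" + "_time"
trie.insert([20])      # e.g. "send_email"
```

When `is_complete` is true, a real constrainer would also need to allow whatever token closes the name. That step is omitted here. Argument keys could presumably be handled with one such trie per tool schema.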
Comment on lines +318 to +323

```cpp
case State::NEEDLE_START:
    if (generated_text_.find("<tool_call>") != std::string::npos) {
        state_ = State::DONE;
        generated_text_.clear();
    }
    break;
```
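This state transition fires as soon as `<tool_call>` appears in the generated text. According to the `utils.h` row, the output is then parsed for Needle `<tool_call>` blocks before falling back to Gemma parsing. The sketch below shows that parse-then-fallback order. It assumes each block is closed by `</tool_call>` and contains a single JSON object; both the closing tag and the JSON payload are assumptions, and the function names are hypothetical.

```python
import json
import re
from typing import Callable, List, Optional

# Assumed block format: <tool_call>{...json...}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse_needle_tool_calls(text: str) -> Optional[List[dict]]:
    """Return the parsed calls, or None so the caller can try the next parser."""
    blocks = TOOL_CALL_RE.findall(text)
    if not blocks:
        return None
    calls: List[dict] = []
    for block in blocks:
        try:
            parsed = json.loads(block.strip())
        except json.JSONDecodeError:
            return None  # malformed block: defer to the fallback parser
        if not isinstance(parsed, dict):
            return None
        calls.append(parsed)
    return calls

def parse_tool_calls(text: str, fallback: Callable[[str], List[dict]]) -> List[dict]:
    # Try the Needle format first, then the fallback (e.g. a Gemma-style parser).
    calls = parse_needle_tool_calls(text)
    return calls if calls is not None else fallback(text)
```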
Comment on lines +3551 to +3553

```cpp
if (!decoder_start_token_seen) {
    decoder_start_token_id = bos_token_id;
}
```
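The snippet records whether `decoder_start_token_id` was actually present in the config and falls back to `bos_token_id` only when it was absent. Here is a minimal Python rendering of that rule, assuming the config has already been parsed into a dict. The dict shape and `resolve_decoder_start` are hypothetical. Checking presence, rather than comparing against a sentinel such as 0, matters because 0 is a valid token ID in many vocabularies.

```python
from typing import Dict

def resolve_decoder_start(config: Dict[str, int]) -> int:
    # Presence check, not a truthiness check: an explicit ID of 0 must be kept.
    if "decoder_start_token_id" in config:
        return config["decoder_start_token_id"]
    return config["bos_token_id"]  # same fallback as the C++ snippet
```

A shortcut such as `config.get("decoder_start_token_id") or config["bos_token_id"]` would silently replace an explicit 0 with the BOS ID.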
Comment on lines +5051 to +5055

```python
    components: tuple[str, ...] | None = None,
) -> list[ComponentModuleSpec] | None:
    input_ids = named_tensors.get("input_ids")
    if input_ids is None:
        return None
```
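The fragment returns `None` when no `input_ids` example tensor is available and otherwise builds specs, optionally restricted by `components`. The self-contained sketch below illustrates that contract. The function name `build_route_specs`, the simplified `ComponentModuleSpec` dataclass, and the component names are hypothetical stand-ins, because the real function's name and body are not shown. The sketch also assumes that `None` means "this builder does not apply."

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)
class ComponentModuleSpec:
    """Simplified stand-in for the real class."""
    name: str
    route: str

ROUTE = "encoder_cross_kv_decoder_step"
ALL_COMPONENTS: Tuple[str, ...] = ("encoder", "cross_kv", "decoder_step")  # hypothetical

def build_route_specs(
    named_tensors: Dict[str, object],
    components: Optional[Tuple[str, ...]] = None,
) -> Optional[List[ComponentModuleSpec]]:
    input_ids = named_tensors.get("input_ids")
    if input_ids is None:
        return None  # assumed meaning: this builder does not apply
    wanted = ALL_COMPONENTS if components is None else components
    unknown = set(wanted) - set(ALL_COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    # Keep the canonical order regardless of how `components` was ordered.
    return [ComponentModuleSpec(name, ROUTE) for name in ALL_COMPONENTS if name in wanted]
```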
Signed-off-by: jakmro <kubamroz124@gmail.com>
No description provided.