Fix Gemma 4 NPU vision encoder on non-square images #718
Merged
Signed-off-by: jakmro <kubamroz124@gmail.com>
Pull request overview
This PR updates the Gemma 4 vision encoder NPU/CoreML export and runtime wiring so the vision encoder can accept additional runtime inputs (notably pixel_position_ids), enabling correct behavior on non-square images.
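To illustrate why a non-square image needs explicit position inputs, here is a minimal sketch that builds one grid coordinate per patch. The function name `patch_position_ids`, the `patch_size` parameter, and the row-major (x, y) layout are all assumptions made for illustration; the PR excerpt does not show how Gemma 4 actually generates `pixel_position_ids`.

```python
from typing import List, Tuple


def patch_position_ids(height: int, width: int, patch_size: int) -> List[Tuple[int, int]]:
    """Return one (x, y) grid coordinate per patch, in row-major order.

    Assumption: row-major (x, y) pairs. The real Gemma 4 layout may differ.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    return [(x, y) for y in range(rows) for x in range(cols)]


# Both images yield 6 patches, but the coordinates differ, so the patch
# count alone cannot tell the encoder how the patches are arranged.
wide = patch_position_ids(height=32, width=48, patch_size=16)
tall = patch_position_ids(height=48, width=32, patch_size=16)
```

For a square image the grid shape can be inferred from the patch count; for the `wide` and `tall` cases above it cannot, which is presumably why the encoder needs the positions as a runtime input.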
Changes:
- Export the vision encoder CoreML package with multiple named runtime inputs instead of a single x tensor.
- Add NPU runtime support for named multi-input inference with both FP16 and INT32 inputs, plus input-shape/introspection helpers.
- Avoid decoding full images in the tokenizer when only image dimensions are needed (via an stbi_info wrapper).
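The last change relies on the fact that image dimensions live in the file header, so they can be read without decoding any pixel data. A rough Python sketch of the same idea for PNG files only is shown here; `png_dimensions` is a hypothetical helper, not part of this PR, and the actual signature of `cactus_image_info()` is not shown in the excerpt.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header: bytes) -> Tuple[int, int]:
    """Return (width, height) from the first 24 bytes of a PNG stream.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    ("IHDR"), then big-endian 32-bit width and height.
    """
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


# Build a synthetic header for a 640x480 image; no pixel data is needed.
example_header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
example_size = png_dimensions(example_header)
```

Only 24 bytes are inspected regardless of image size, which is the saving the tokenizer change is after when it only needs a soft token count.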
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/cactus/transpile/npu/vision.py | Switches CoreML export to accept multiple named runtime inputs and defines CoreML input dtypes. |
| python/cactus/transpile/npu/pipeline.py | Routes vision encoder emission through the new multi-input export path and supports npu_module. |
| python/cactus/transpile/model_adapters.py | Introduces a Gemma4 vision encoder adapter variant for NPU export and configures runtime input count. |
| python/cactus/transpile/component_pipeline.py | Extends ComponentModuleSpec with NPU-specific module + runtime input count. |
| cactus-kernels/src/image.cpp | Adds cactus_image_info() wrapper over stbi_info. |
| cactus-kernels/cactus_kernels.h | Exposes cactus_image_info() in the public kernels header. |
| cactus-engine/src/tokenizer.cpp | Uses cactus_image_info() to compute Gemma4 image soft token counts without loading the full image. |
| cactus-engine/src/npu_ane.mm | Adds INT32 -> MLMultiArray copy support; exposes input presence/shape querying; supports dtype-aware multi-input feeding. |
| cactus-engine/src/npu_ane.h | Extends ANEEncoder interface for input presence/shape querying (and stubs for non-ANE builds). |
| cactus-engine/src/model.cpp | Passes pixel_position_ids into NPU vision encode for Gemma4. |
| cactus-engine/src/model_npu.cpp | Updates NPU vision encode to optionally send pixel_position_ids as INT32 named input. |
| cactus-engine/src/engine.h | Extends NPUNamedInput with dtype + void* data and adds NPU encoder input-introspection APIs. |
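The dtype-tagged named inputs described for engine.h and npu_ane.mm can be pictured with a small Python sketch. The `NamedInput` class, the `pack_inputs` function, and the dtype strings below are hypothetical stand-ins for illustration; the real `NPUNamedInput` struct is C++ and its fields are only summarized in the table above.

```python
import struct
from dataclasses import dataclass
from typing import Dict, List, Sequence

# struct format codes per dtype: "e" is IEEE 754 half precision and "i" is a
# 32-bit signed integer when an explicit byte order ("<") is given.
_FORMATS = {"fp16": "e", "int32": "i"}


@dataclass
class NamedInput:
    name: str
    dtype: str
    data: Sequence[float]


def pack_inputs(inputs: List[NamedInput]) -> Dict[str, bytes]:
    """Serialize each input into a little-endian buffer chosen by its dtype."""
    packed = {}
    for item in inputs:
        if item.dtype not in _FORMATS:
            raise ValueError(f"unsupported dtype: {item.dtype}")
        fmt = f"<{len(item.data)}{_FORMATS[item.dtype]}"
        packed[item.name] = struct.pack(fmt, *item.data)
    return packed


buffers = pack_inputs([
    NamedInput("pixel_values", "fp16", [0.5, 1.0]),
    NamedInput("pixel_position_ids", "int32", [0, 1, 2]),
])
```

Carrying the dtype alongside an untyped buffer lets one call path feed both the FP16 pixel tensor and the INT32 position tensor, at the cost of moving type checking from compile time to run time.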
Comment on lines +21 to +27

     def forward(self, *runtime_inputs: torch.Tensor) -> torch.Tensor:
         coerced = tuple(
             t if torch.is_floating_point(t) else t.to(torch.long)
             for t in runtime_inputs
         )
         extra = tuple(getattr(self, f"_baked_{i}") for i in range(self._n_baked))
    -    return self.vision(pixel_values, *extra)
    +    return self.vision(*coerced, *extra)
Comment on lines +148 to +150

    for (size_t i = 0; i < pixel_position_ids->size(); ++i) {
        positions_i32[i] = static_cast<int32_t>((*pixel_position_ids)[i]);
    }
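The loop above narrows each position id with `static_cast<int32_t>`, which performs no range check. If the source element type is wider than 32 bits (the excerpt does not show it), an out-of-range value would be silently altered. A checked variant might look like the following Python sketch; `to_int32_bytes` is a hypothetical name, and this is an illustration of the idea rather than code from the PR.

```python
import struct
from typing import Iterable

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def to_int32_bytes(values: Iterable[int]) -> bytes:
    """Pack integers as little-endian int32, rejecting out-of-range values."""
    checked = []
    for index, value in enumerate(values):
        if not INT32_MIN <= value <= INT32_MAX:
            raise ValueError(f"value {value} at index {index} does not fit in int32")
        checked.append(value)
    return struct.pack(f"<{len(checked)}i", *checked)


position_bytes = to_int32_bytes([0, 1, 2**31 - 1])
```

Patch coordinates for realistic image sizes sit far below the int32 limit, so the unchecked cast is unlikely to misbehave in practice; the check mainly guards against corrupted or uninitialized input.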