Fix Gemma 4 NPU vision encoder on non-square images #718

Merged: jakmro merged 2 commits into main from gemma4-fix on Jun 11, 2026

jakmro (Collaborator) commented Jun 11, 2026

No description provided.

Signed-off-by: jakmro <kubamroz124@gmail.com>
Copilot AI review requested due to automatic review settings June 11, 2026 15:03

Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the Gemma 4 vision encoder NPU/CoreML export and runtime wiring so the vision encoder can accept additional runtime inputs (notably pixel_position_ids), enabling correct behavior on non-square images.

Changes:

  • Export the vision encoder CoreML package with multiple named runtime inputs instead of a single x tensor.
  • Add NPU runtime support for named multi-input inference with both FP16 and INT32 inputs, plus input-shape/introspection helpers.
  • Avoid decoding full images in the tokenizer when only the image dimensions are needed (via a wrapper around stbi_info).
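
To illustrate the last point, here is a minimal sketch of reading image dimensions from a file header without decoding any pixel data. It handles PNG only and is not the PR's `cactus_image_info()` implementation, which wraps `stbi_info` and so covers every format stb_image supports. The function name `png_dims_from_header` is hypothetical; its out-parameter shape loosely mirrors `stbi_info`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Read a big-endian 32-bit unsigned integer from four bytes.
static std::uint32_t read_be32(const std::uint8_t* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

// Hypothetical helper (not from the PR): extract width and height from the
// PNG signature plus IHDR chunk. Only the first 24 bytes are inspected, so
// the pixel data is never touched. Returns true on success.
bool png_dims_from_header(const std::uint8_t* data, std::size_t size,
                          int* width, int* height) {
    assert(width != nullptr && height != nullptr);
    static const std::uint8_t kSignature[8] = {0x89, 'P', 'N', 'G',
                                               0x0D, 0x0A, 0x1A, 0x0A};
    if (data == nullptr || size < 24) return false;
    if (std::memcmp(data, kSignature, 8) != 0) return false;
    // Bytes 8..11 hold the chunk length, bytes 12..15 the chunk type.
    if (std::memcmp(data + 12, "IHDR", 4) != 0) return false;
    const std::uint32_t w = read_be32(data + 16);
    const std::uint32_t h = read_be32(data + 20);
    // PNG requires non-zero dimensions no larger than 2^31 - 1.
    if (w == 0 || h == 0 || w > 0x7FFFFFFFu || h > 0x7FFFFFFFu) return false;
    *width = static_cast<int>(w);
    *height = static_cast<int>(h);
    return true;
}
```

A tokenizer that only needs a width and height to compute a token count can call a helper like this on the first few bytes of the file instead of allocating and filling a full pixel buffer.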

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| python/cactus/transpile/npu/vision.py | Switches CoreML export to accept multiple named runtime inputs and defines CoreML input dtypes. |
| python/cactus/transpile/npu/pipeline.py | Routes vision encoder emission through the new multi-input export path and supports npu_module. |
| python/cactus/transpile/model_adapters.py | Introduces a Gemma4 vision encoder adapter variant for NPU export and configures runtime input count. |
| python/cactus/transpile/component_pipeline.py | Extends ComponentModuleSpec with NPU-specific module + runtime input count. |
| cactus-kernels/src/image.cpp | Adds cactus_image_info() wrapper over stbi_info. |
| cactus-kernels/cactus_kernels.h | Exposes cactus_image_info() in the public kernels header. |
| cactus-engine/src/tokenizer.cpp | Uses cactus_image_info() to compute Gemma4 image soft token counts without loading the full image. |
| cactus-engine/src/npu_ane.mm | Adds INT32 -> MLMultiArray copy support; exposes input presence/shape querying; supports dtype-aware multi-input feeding. |
| cactus-engine/src/npu_ane.h | Extends ANEEncoder interface for input presence/shape querying (and stubs for non-ANE builds). |
| cactus-engine/src/model.cpp | Passes pixel_position_ids into NPU vision encode for Gemma4. |
| cactus-engine/src/model_npu.cpp | Updates NPU vision encode to optionally send pixel_position_ids as INT32 named input. |
| cactus-engine/src/engine.h | Extends NPUNamedInput with dtype + void* data and adds NPU encoder input-introspection APIs. |
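
The engine.h row describes a named input that carries a dtype tag alongside an untyped data pointer. A minimal sketch of that pattern follows. Every name in it (`InputDType`, `NamedInput`, `element_size`, `element_count`, `byte_size`) is hypothetical; the PR's actual `NPUNamedInput` layout is not shown on this page and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical dtype tag covering the two input types the PR mentions.
enum class InputDType { FP16, INT32 };

// Hypothetical named input: the dtype tells the runtime how to interpret
// the untyped `data` pointer when copying it into the backend's buffer.
struct NamedInput {
    std::string name;                // e.g. "pixel_position_ids"
    InputDType dtype;                // element type behind `data`
    const void* data;                // caller-owned storage, not freed here
    std::vector<std::size_t> shape;  // dimensions, outermost first
};

// Bytes per element for each supported dtype.
inline std::size_t element_size(InputDType dtype) {
    switch (dtype) {
        case InputDType::FP16:  return 2;
        case InputDType::INT32: return 4;
    }
    assert(false && "unhandled dtype");
    return 0;
}

// Number of elements implied by the shape (1 for an empty shape).
inline std::size_t element_count(const NamedInput& input) {
    std::size_t count = 1;
    for (std::size_t dim : input.shape) count *= dim;
    return count;
}

// Total number of bytes the runtime would need to copy for this input.
inline std::size_t byte_size(const NamedInput& input) {
    return element_count(input) * element_size(input.dtype);
}
```

Tagging the pointer with a dtype lets one feeding routine handle both FP16 and INT32 inputs by dispatching on the tag, rather than needing one entry point per element type.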

Comment on lines +21 to +27

      def forward(self, *runtime_inputs: torch.Tensor) -> torch.Tensor:
          coerced = tuple(
              t if torch.is_floating_point(t) else t.to(torch.long)
              for t in runtime_inputs
          )
          extra = tuple(getattr(self, f"_baked_{i}") for i in range(self._n_baked))
    -     return self.vision(pixel_values, *extra)
    +     return self.vision(*coerced, *extra)

Comment on lines +148 to +150

    for (size_t i = 0; i < pixel_position_ids->size(); ++i) {
        positions_i32[i] = static_cast<int32_t>((*pixel_position_ids)[i]);
    }

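The loop above narrows each position id to `int32_t` with an unchecked `static_cast`. For comparison, a range-checked variant might look like the sketch below. It assumes the source values are 64-bit signed integers, which this page does not confirm, and the helper name `to_int32_checked` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not from the PR): convert 64-bit ids to int32_t,
// throwing std::out_of_range instead of silently truncating when a value
// does not fit in 32 bits.
std::vector<std::int32_t> to_int32_checked(const std::vector<std::int64_t>& values) {
    constexpr std::int64_t kMin = std::numeric_limits<std::int32_t>::min();
    constexpr std::int64_t kMax = std::numeric_limits<std::int32_t>::max();

    std::vector<std::int32_t> out;
    out.reserve(values.size());
    for (std::int64_t v : values) {
        if (v < kMin || v > kMax) {
            throw std::out_of_range("position id does not fit in int32_t");
        }
        out.push_back(static_cast<std::int32_t>(v));
    }
    assert(out.size() == values.size());
    return out;
}
```

Patch position ids are bounded by the image's patch grid, so they will normally be far below the `int32_t` limit. The check mainly guards against corrupted or uninitialized input reaching the encoder.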
Signed-off-by: jakmro <kubamroz124@gmail.com>
jakmro merged commit dfd3d8b into main on Jun 11, 2026
7 checks passed