Fix Gemma 4 NPU vision encoder on non-square images by jakmro · Pull Request #718 · cactus-compute/cactus

jakmro · 2026-06-11T15:03:06Z

No description provided.

Signed-off-by: jakmro <kubamroz124@gmail.com>

Copilot

Pull request overview

This PR updates the Gemma 4 vision encoder NPU/CoreML export and runtime wiring so the vision encoder can accept additional runtime inputs (notably `pixel_position_ids`), enabling correct behavior on non-square images.

Changes:

Export the vision encoder CoreML package with multiple named runtime inputs instead of a single `x` tensor.
Add NPU runtime support for named multi-input inference with both FP16 and INT32 inputs, plus input-shape and introspection helpers.
Avoid decoding full images in the tokenizer when only image dimensions are needed (via a wrapper over `stbi_info`).
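
To illustrate why an explicit position input matters for non-square images, here is a minimal sketch of per-patch position IDs. It assumes (the PR does not confirm this) that `pixel_position_ids` carries one (x, y) grid coordinate per patch in row-major order; the helper name `patch_position_ids` and the patch size of 16 are hypothetical.

```python
def patch_position_ids(width, height, patch_size=16):
    """Return one (x, y) grid coordinate per patch, in row-major order.

    Hypothetical helper: the layout Gemma 4 actually uses is an assumption.
    """
    if width % patch_size or height % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    cols = width // patch_size
    rows = height // patch_size
    return [(x, y) for y in range(rows) for x in range(cols)]


square = patch_position_ids(64, 64)   # 4 x 4 grid
wide = patch_position_ids(128, 32)    # 8 x 2 grid

# Both images yield 16 patches, but the coordinates differ.
assert len(square) == len(wide) == 16
assert square[4] == (0, 1)  # the square grid wraps after 4 columns
assert wide[4] == (4, 0)    # the wide grid is still on its first row
```

Under these assumptions, an encoder that bakes in a square grid would place patch 4 of the wide image on the second row, which is the kind of mismatch a runtime position input can avoid.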

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

File	Description
python/cactus/transpile/npu/vision.py	Switches CoreML export to accept multiple named runtime inputs and defines CoreML input dtypes.
python/cactus/transpile/npu/pipeline.py	Routes vision encoder emission through the new multi-input export path and supports `npu_module`.
python/cactus/transpile/model_adapters.py	Introduces a Gemma4 vision encoder adapter variant for NPU export and configures runtime input count.
python/cactus/transpile/component_pipeline.py	Extends `ComponentModuleSpec` with NPU-specific module + runtime input count.
cactus-kernels/src/image.cpp	Adds `cactus_image_info()` wrapper over `stbi_info`.
cactus-kernels/cactus_kernels.h	Exposes `cactus_image_info()` in the public kernels header.
cactus-engine/src/tokenizer.cpp	Uses `cactus_image_info()` to compute Gemma4 image soft token counts without loading the full image.
cactus-engine/src/npu_ane.mm	Adds INT32 -> `MLMultiArray` copy support; exposes input presence/shape querying; supports dtype-aware multi-input feeding.
cactus-engine/src/npu_ane.h	Extends `ANEEncoder` interface for input presence/shape querying (and stubs for non-ANE builds).
cactus-engine/src/model.cpp	Passes `pixel_position_ids` into NPU vision encode for Gemma4.
cactus-engine/src/model_npu.cpp	Updates NPU vision encode to optionally send `pixel_position_ids` as INT32 named input.
cactus-engine/src/engine.h	Extends `NPUNamedInput` with dtype + `void*` data and adds NPU encoder input-introspection APIs.
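
The idea behind `cactus_image_info()`, reading image dimensions without decoding pixel data, can be sketched for a single format. The example below handles only PNG and is an illustration of the technique, not a description of how `stbi_info` or the Cactus wrapper is implemented; the function name `png_dimensions` is hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header: bytes):
    """Return (width, height) from the first 24 bytes of a PNG file.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    ("IHDR"), then big-endian 32-bit width and height.
    """
    if len(header) < 24 or header[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG header")
    if header[12:16] != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


# Build a synthetic 24-byte header for a 640x480 image (no pixel data needed).
sample = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
assert png_dimensions(sample) == (640, 480)
```

Only the first 24 bytes are inspected, so the cost is independent of image size, which is what makes a header-only query attractive when the tokenizer just needs to count image soft tokens.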

+    def forward(self, *runtime_inputs: torch.Tensor) -> torch.Tensor:
+        coerced = tuple(
+            t if torch.is_floating_point(t) else t.to(torch.long)
+            for t in runtime_inputs
+        )
        extra = tuple(getattr(self, f"_baked_{i}") for i in range(self._n_baked))
-        return self.vision(pixel_values, *extra)
+        return self.vision(*coerced, *extra)


+        for (size_t i = 0; i < pixel_position_ids->size(); ++i) {
+            positions_i32[i] = static_cast<int32_t>((*pixel_position_ids)[i]);
+        }


Fix Gemma 4 NPU vision encoder on non-square images

999dcf4

Copilot AI review requested due to automatic review settings June 11, 2026 15:03

Copilot started reviewing on behalf of jakmro June 11, 2026 15:03

Copilot AI reviewed Jun 11, 2026

Merge branch 'main' into gemma4-fix

790c96e

jakmro merged commit dfd3d8b into main Jun 11, 2026
7 checks passed

jakmro merged 2 commits into main from gemma4-fix
