[WebGPU] Enable Cast to int64 by default #28804
Open
fanchenkong1 wants to merge 1 commit into
Support casting to int64 from float32/float16 via IEEE-754 bit decomposition. T2 now always allows int64; casting from int64 stays gated by enable_int64. Adds cast_op_test.cc coverage for the newly introduced conversions.
30ad07e to 53a160d
qjia7 (Contributor) reviewed Jun 5, 2026
}
}
}
sh.MainFunctionBody() << " y[base] = " << values[0] << ";\n";
Contributor
nit: Please use output.SetByOffset instead of accessing y directly.
Contributor
Pull request overview
This PR updates the WebGPU EP’s Cast kernel to allow casting to int64 by default (even when enable_int64 is false), and implements float32/float16 → int64 conversion in WGSL via IEEE-754 bit decomposition to avoid CPU fallback and associated device/host sync overhead.
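The bit-decomposition idea can be sketched on the host in C++ rather than WGSL. This is a minimal illustration, not the PR's shader code: it assumes the int64 result is held as two 32-bit lanes (low word first), and the names `Int64Lanes`, `FloatToInt64Lanes`, and `LanesToInt64` are hypothetical. The handling of NaN (mapped to 0) and of out-of-range values (saturated) is also an assumption; the PR text does not say what the shader does in those cases.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical container: an int64 held as two 32-bit words, low word first.
struct Int64Lanes {
  uint32_t lo;
  uint32_t hi;
};

// Truncate a float32 toward zero into a two-lane int64 using only 32-bit
// integer operations on the IEEE-754 sign, exponent, and mantissa fields.
Int64Lanes FloatToInt64Lanes(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));  // reinterpret the float's bit pattern

  const bool negative = (bits >> 31) != 0;
  const uint32_t biased_exp = (bits >> 23) & 0xFFu;
  const uint32_t mantissa = bits & 0x7FFFFFu;

  if (biased_exp == 0xFFu && mantissa != 0u) return {0u, 0u};  // NaN -> 0 (assumption)
  if (biased_exp < 127u) return {0u, 0u};                      // |f| < 1 truncates to 0

  const uint32_t e = biased_exp - 127u;  // unbiased exponent, |f| is in [2^e, 2^(e+1))
  if (e >= 63u) {
    // Outside the int64 range (or exactly -2^63): saturate (assumption).
    return negative ? Int64Lanes{0u, 0x80000000u}
                    : Int64Lanes{0xFFFFFFFFu, 0x7FFFFFFFu};
  }

  const uint32_t sig = mantissa | 0x800000u;  // 24-bit significand with implicit 1
  uint32_t lo = 0u;
  uint32_t hi = 0u;
  if (e < 23u) {
    lo = sig >> (23u - e);  // drop fractional bits: truncation toward zero
  } else {
    const uint32_t shift = e - 23u;  // 0..39
    if (shift == 0u) {
      lo = sig;
    } else if (shift < 32u) {
      lo = sig << shift;
      hi = sig >> (32u - shift);
    } else {
      hi = sig << (shift - 32u);
    }
  }

  if (negative) {
    // Two's-complement negation across the two lanes.
    lo = ~lo + 1u;
    hi = ~hi + (lo == 0u ? 1u : 0u);  // carry into hi only when lo wrapped to 0
  }
  return {lo, hi};
}

// Reassemble the lanes into a native int64 for checking on the host.
int64_t LanesToInt64(Int64Lanes v) {
  return static_cast<int64_t>((static_cast<uint64_t>(v.hi) << 32) | v.lo);
}
```

Every shift amount stays strictly below 32, which is why the zero-shift and 32-or-more cases are handled separately. For in-range inputs the result should match `static_cast<int64_t>(f)`, for example `LanesToInt64(FloatToInt64Lanes(-1.9f))` gives -1.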
Changes:
- Add an IEEE-754 bit-decomposition path for float → int64 in the WebGPU Cast shader, including lane-safe stores for int64 outputs.
- Adjust WebGPU Cast kernel type constraints so T2 (output) always allows int64, while T1 (input) still gates int64 on enable_int64.
- Add CPU-side Cast tests covering large float→int64 values and several int32/uint32/bool→int64 and size%4 regression cases.
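The asymmetric gating described above could look roughly like the following. This is a sketch only: `CastInputTypes`, `CastOutputTypes`, and `Allows` are hypothetical helpers, the type list is illustrative, and none of this is the ONNX Runtime kernel-registration API.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative base set of element types; the real kernel's list may differ.
static std::vector<std::string> BaseCastTypes() {
  return {"float16", "float32", "int32", "uint32", "bool"};
}

// T1 (input): int64 is accepted only when enable_int64 is set.
std::vector<std::string> CastInputTypes(bool enable_int64) {
  std::vector<std::string> types = BaseCastTypes();
  if (enable_int64) types.push_back("int64");
  return types;
}

// T2 (output): int64 is always accepted, independent of enable_int64.
std::vector<std::string> CastOutputTypes() {
  std::vector<std::string> types = BaseCastTypes();
  types.push_back("int64");
  return types;
}

// Small lookup helper used to query either list.
bool Allows(const std::vector<std::string>& types, const std::string& name) {
  return std::find(types.begin(), types.end(), name) != types.end();
}
```

With these definitions, `Allows(CastOutputTypes(), "int64")` holds under any configuration, while `Allows(CastInputTypes(false), "int64")` does not.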
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/webgpu/tensor/cast.h | Extends CastProgram parameters and adds output_size uniform for int64 tail handling. |
| onnxruntime/core/providers/webgpu/tensor/cast.cc | Implements float→int64 WGSL helper and relaxes output type constraints for int64; updates shader codegen paths. |
| onnxruntime/test/providers/cpu/tensor/cast_op_test.cc | Adds test coverage for float32/float16/int32/uint32/bool to int64 conversions and non-multiple-of-4 sizes. |
Comment on lines 123 to 127
sh.MainFunctionBody() << " let a0 = " << input.GetByOffset("global_idx * 4") << ";\n"
                      << " let a1 = " << input.GetByOffset("global_idx * 4 + 1") << ";\n"
                      << " let a2 = " << input.GetByOffset("global_idx * 4 + 2") << ";\n"
                      << " let a3 = " << input.GetByOffset("global_idx * 4 + 3") << ";\n"
                      << " let a = vec4<i32>(a0, a1, a2, a3);\n";
Comment on lines +157 to +160
sh.MainFunctionBody() << " y[base] = " << values[0] << ";\n";
for (size_t i = 1; i < 4; ++i) {
  sh.MainFunctionBody() << " if (base + " << i << "u < uniforms.output_size) { y[base + " << i
                        << "u] = " << values[i] << "; }\n";
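The tail-guarded store pattern in the excerpt above can be reproduced in isolation as a sketch. It uses a plain `std::ostringstream` in place of `sh.MainFunctionBody()`, and `EmitGuardedStores` is a hypothetical name. It assumes, as the excerpt appears to, that lane 0 is always in bounds, so only lanes 1 to 3 need a check against `uniforms.output_size`.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Emit WGSL that stores four per-lane values starting at `base`. Lane 0 is
// written unconditionally; lanes 1..3 are guarded so an output whose size is
// not a multiple of 4 is never written past its end.
std::string EmitGuardedStores(const std::array<std::string, 4>& values) {
  std::ostringstream body;
  body << "  y[base] = " << values[0] << ";\n";
  for (std::size_t i = 1; i < 4; ++i) {
    body << "  if (base + " << i << "u < uniforms.output_size) { y[base + " << i
         << "u] = " << values[i] << "; }\n";
  }
  return body.str();
}
```

Calling `EmitGuardedStores({"v0", "v1", "v2", "v3"})` yields one unconditional store followed by three guarded ones, which is the shape the size%4 regression tests are meant to exercise.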
Description
Support casting to int64 from float32 via IEEE-754 bit decomposition.
- Add a float_to_int64 helper that emits the truncated-toward-zero value across the full int64 range.
- to_type now always allows int64, regardless of enable_int64; casting from int64 stays gated by enable_int64.
- Add cast_op_test.cc coverage for the newly introduced conversions.
Motivation and Context
While running the mask-generation vision encoder (Xenova/sam-vit-base) on the WebGPU EP via Transformers.js, float32-to-int64 cast nodes fall back to the CPU provider under the default session configuration, because casting to int64 was previously gated behind the enable_int64 flag. The fallback introduces host memcpy and synchronization overhead.
Making cast-to-int64 correct across the full int64 range lets it run on the WebGPU EP by default, keeping these nodes on-device and eliminating the stalls.
Performance Impact
Measured on the vision_encoder.onnx of Xenova/sam-vit-base (mask-generation, SAM ViT-base vision encoder) on the WebGPU EP.
This change yields a 1.2–1.3× speedup on the SAM ViT-base vision encoder under the default configuration.