[WebGPU] Enable Cast to int64 by default #28804

Open
fanchenkong1 wants to merge 1 commit into microsoft:main from fanchenkong1:enable-webgpu-float2int64

Conversation

@fanchenkong1
Contributor

@fanchenkong1 fanchenkong1 commented Jun 5, 2026

Description

Support casting to int64 from float32 via IEEE-754 bit decomposition.

  • Introduce a new float_to_int64 helper that emits the truncated-toward-zero value in full int64 range.
  • to_ type now always allows int64, regardless of enable_int64; casting from int64 stays gated by enable_int64.
  • Adds cast_op_test.cc coverage for the newly introduced conversions.
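To illustrate the bit-decomposition idea, here is a minimal CPU-side sketch in C++ that uses only 32-bit integer operations, as a shader without native 64-bit integers would need to. It is not the PR's WGSL helper: the names `Int64Halves`, `FloatToInt64Halves`, and `ToInt64` are hypothetical, and the handling of NaN (mapped to 0) and out-of-range inputs (saturated) is an assumption, since the description does not state what the real helper does in those cases.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical illustration only; not the float_to_int64 helper from this PR.
// An int64 value held as two 32-bit halves (low word, high word).
struct Int64Halves {
  uint32_t lo;
  uint32_t hi;
};

// Converts a float32 to int64, truncating toward zero, with 32-bit ops only.
inline Int64Halves FloatToInt64Halves(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));  // reinterpret the IEEE-754 bit pattern

  const bool negative = (bits >> 31) != 0;
  const uint32_t biased_exp = (bits >> 23) & 0xFFu;
  const uint32_t fraction = bits & 0x7FFFFFu;

  // |f| < 1 (including zeros and subnormals) truncates to 0.
  if (biased_exp < 127u) return {0u, 0u};
  // ASSUMPTION: NaN maps to 0.
  if (biased_exp == 0xFFu && fraction != 0u) return {0u, 0u};

  const uint32_t exp = biased_exp - 127u;  // unbiased exponent, >= 0 here
  // ASSUMPTION: saturate when |f| >= 2^63 (this also covers infinities).
  if (exp >= 63u) {
    return negative ? Int64Halves{0u, 0x80000000u}
                    : Int64Halves{0xFFFFFFFFu, 0x7FFFFFFFu};
  }

  // 24-bit significand with the implicit leading 1 restored.
  const uint32_t mantissa = fraction | 0x800000u;
  uint32_t lo = 0u;
  uint32_t hi = 0u;
  if (exp <= 23u) {
    lo = mantissa >> (23u - exp);  // drop fractional bits: truncation
  } else {
    const uint32_t shift = exp - 23u;  // 1..39
    if (shift < 32u) {
      lo = mantissa << shift;
      hi = mantissa >> (32u - shift);
    } else {
      hi = mantissa << (shift - 32u);
    }
  }

  if (negative) {
    // Two's-complement negation across the two halves.
    lo = ~lo + 1u;
    hi = ~hi + (lo == 0u ? 1u : 0u);
  }
  return {lo, hi};
}

// Recombines the halves on the host, for checking results.
inline int64_t ToInt64(Int64Halves h) {
  return static_cast<int64_t>((static_cast<uint64_t>(h.hi) << 32) | h.lo);
}
```

Every float32 with magnitude at least 2^24 is already an integer, so above that point the conversion is a pure left shift of the significand and no rounding decision is needed.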

Motivation and Context

While running the mask-generation vision encoder (Xenova/sam-vit-base) on the WebGPU EP via Transformers.js, float32-to-int64 Cast nodes fall back to the CPU provider under the default session configuration, because casting to int64 was previously gated behind the enable_int64 flag. This fallback introduces host memcpy and synchronization overhead.

Making cast-to-int64 correct across the full int64 range lets it run on the
WebGPU EP by default, keeping these nodes on-device and eliminating the stalls.

Performance Impact

Measured on the vision_encoder.onnx of Xenova/sam-vit-base (mask-generation, SAM ViT-base vision encoder) on the
WebGPU EP.

| Platform | Latency reduction | Speedup |
| --- | --- | --- |
| Intel Wildcat Lake | −22.8% | 1.30× |
| Intel Panther Lake | −17.1% | 1.21× |

This change yields a 1.2–1.3× speedup on the SAM ViT-base vision encoder under default configuration.
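As a consistency check on the two columns, and assuming the latency reduction $r$ is measured relative to the baseline latency (the description does not say so explicitly), the speedup follows directly from $r$:

```latex
\text{speedup} = \frac{t_{\text{before}}}{t_{\text{after}}} = \frac{1}{1 - r},
\qquad
\frac{1}{1 - 0.228} \approx 1.30,
\qquad
\frac{1}{1 - 0.171} \approx 1.21
```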

@fanchenkong1 fanchenkong1 changed the title [WebGPU EP] Enable Cast to int64 by default [WebGPU] Enable Cast to int64 by default Jun 5, 2026
Support casting to int64 from float32/float16 via IEEE-754 bit
decomposition. T2 now always allows int64; casting from int64 stays
gated by enable_int64. Adds cast_op_test.cc coverage for the newly
introduced conversions.
@fanchenkong1 fanchenkong1 force-pushed the enable-webgpu-float2int64 branch from 30ad07e to 53a160d Compare June 5, 2026 05:54
@fanchenkong1 fanchenkong1 marked this pull request as ready for review June 5, 2026 05:55
@daijh

daijh commented Jun 5, 2026

@qjia7 @guschmue PTAL.

}
}
}
sh.MainFunctionBody() << " y[base] = " << values[0] << ";\n";

nit: Please use output.SetByOffset instead of accessing y indirectly.

Copilot AI left a comment

Pull request overview

This PR updates the WebGPU EP’s Cast kernel to allow casting to int64 by default (even when enable_int64 is false), and implements float32/float16 → int64 conversion in WGSL via IEEE-754 bit decomposition to avoid CPU fallback and associated device/host sync overhead.

Changes:

  • Add an IEEE-754 bit-decomposition path for float → int64 in the WebGPU Cast shader, including lane-safe stores for int64 outputs.
  • Adjust WebGPU Cast kernel type constraints so T2 (output) always allows int64, while T1 (input) still gates int64 on enable_int64.
  • Add CPU-side Cast tests covering large float→int64 values and several int32/uint32/bool→int64 and size%4 regression cases.
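The lane-safe store and the size%4 regression cases can be pictured with a small CPU model, sketched below in C++. It assumes each shader invocation produces up to four outputs and that the element count is not always a multiple of four. `StoreLanes` is a hypothetical name, and unlike the excerpted shader code, which stores lane 0 unconditionally, this sketch guards every lane.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU model of one invocation that writes up to four int64
// outputs starting at global_idx * 4. Not the PR's shader code.
inline void StoreLanes(std::vector<int64_t>& y, std::size_t global_idx,
                       const int64_t (&values)[4]) {
  const std::size_t output_size = y.size();  // stands in for uniforms.output_size
  const std::size_t base = global_idx * 4;
  for (std::size_t i = 0; i < 4; ++i) {
    // Skip lanes that fall past the end of the output (the size % 4 tail).
    if (base + i < output_size) {
      y[base + i] = values[i];
    }
  }
}
```

With six outputs, for example, the second invocation writes only lanes 0 and 1; without the guard it would write two elements past the end of the buffer.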

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/providers/webgpu/tensor/cast.h | Extends CastProgram parameters and adds output_size uniform for int64 tail handling. |
| onnxruntime/core/providers/webgpu/tensor/cast.cc | Implements float→int64 WGSL helper and relaxes output type constraints for int64; updates shader codegen paths. |
| onnxruntime/test/providers/cpu/tensor/cast_op_test.cc | Adds test coverage for float32/float16/int32/uint32/bool to int64 conversions and non-multiple-of-4 sizes. |

Comment on lines 123 to 127
sh.MainFunctionBody() << " let a0 = " << input.GetByOffset("global_idx * 4") << ";\n"
                      << " let a1 = " << input.GetByOffset("global_idx * 4 + 1") << ";\n"
                      << " let a2 = " << input.GetByOffset("global_idx * 4 + 2") << ";\n"
                      << " let a3 = " << input.GetByOffset("global_idx * 4 + 3") << ";\n"
                      << " let a = vec4<i32>(a0, a1, a2, a3);\n";
Comment on lines +157 to +160
sh.MainFunctionBody() << " y[base] = " << values[0] << ";\n";
for (size_t i = 1; i < 4; ++i) {
  sh.MainFunctionBody() << " if (base + " << i << "u < uniforms.output_size) { y[base + " << i
                        << "u] = " << values[i] << "; }\n";