layernorm vulkan subgroup reduce optimization #6754
- add layernorm_reduce_subgroup shader using subgroupAdd arithmetic
- compute mean and variance in single dispatch, eliminating ~10+ ping-pong reduces
- fallback to shared memory tree reduction when subgroup arithmetic unavailable
- add perf_layernorm benchmark

perf improvement on RTX 4060 [4096,1,1] fp32: ~27ms -> ~7.5ms (3.6x)
perf improvement on RTX 4060 [4096,1,32] fp32: ~11.5ms -> ~1.3ms (8.8x)
perf improvement on RTX 4060 [4096,512,1] fp32: ~11.5ms -> ~4.8ms (2.4x)
perf improvement on RTX 4060 [1024,1,1] fp32: ~27ms -> ~7ms (3.8x)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ba13ae4698
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    float mean = float(r_sum) / float(p.group_size);
    float var = float(r_sqsum) / float(p.group_size) - mean * mean;
Avoid cancellation when computing variance
For groups whose values have a large common offset but small spread, computing E[x^2] - mean * mean in float can lose all significant bits or even go slightly negative, so LayerNorm returns badly scaled values or NaNs even though the previous two-pass path used squared deviations from the mean. The same formula appears in the pack4 subgroup shader, so tensors with values like large activations plus small differences will regress only when the new subgroup path is selected.
The layernorm_reduce_subgroup shaders are only created on devices that support subgroup arithmetic, so the outermost #else (pure shared-memory fallback without subgroup) is dead code. Note: InstanceNorm/GroupNorm/RMSNorm do not have subgroup reduce shaders yet, so there is no dead branch to clean up there.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6ef2d42224
    pipeline_layernorm_norm_pack4->create(LayerShaderType::layernorm_norm_pack4, opt, specializations);
    }

    if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)
Honor use_subgroup_ops before creating subgroup pipelines
When callers disable opt.use_subgroup_ops on a device that still reports arithmetic subgroup support, this branch still builds the new subgroup shaders. Those shaders use subgroup extensions/ops unconditionally, while Pipeline::create(..., opt, ...) targets Vulkan 1.0/SPIR-V 1.0 when use_subgroup_ops is false, so LayerNorm pipeline creation can fail or leave an unusable pipeline instead of falling back to the existing reduction path.
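A minimal C++ sketch of the suggested gate follows, assuming the decision can be reduced to the user option plus the device's reported subgroup feature mask. `ReduceOptions` and `select_subgroup_reduce` are hypothetical stand-ins, not ncnn types; the constant mirrors the value of `VK_SUBGROUP_FEATURE_ARITHMETIC_BIT` in the Vulkan headers.

```cpp
#include <cstdint>

// Same value as VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in the Vulkan headers.
constexpr std::uint32_t kSubgroupFeatureArithmeticBit = 0x00000004;

// Hypothetical stand-in for the relevant field of the layer options.
struct ReduceOptions
{
    bool use_subgroup_ops;
};

// Select the subgroup path only when the caller allows subgroup ops
// AND the device reports arithmetic subgroup support.
bool select_subgroup_reduce(const ReduceOptions& opt,
                            std::uint32_t supported_subgroup_ops)
{
    if (!opt.use_subgroup_ops)
        return false;
    return (supported_subgroup_ops & kSubgroupFeatureArithmeticBit) != 0;
}
```

Checking the option first means a caller who disables subgroup ops always gets the existing reduction path, regardless of what the device advertises.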
    afp v = buffer_ld1(bottom_top_blob_data, offset + t);
    sum += v;
    sqsum += v * v;
Accumulate scalar subgroup sums in fp32
With opt.use_fp16_arithmetic, afp is float16_t, so the scalar subgroup path accumulates both sum and sqsum in half precision. For common LayerNorm groups this can overflow or lose most precision (for example, 256 elements near 20 make sqsum exceed fp16's finite range), whereas the previous fp16_to_fp32 reduction used float accumulators; this produces incorrect variance and NaN/inf-normalized output only when the new scalar subgroup path is selected.
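The overflow example in this comment is plain arithmetic and can be checked in a few lines of C++. The helper names below are hypothetical, and the sketch only compares the exact sum of squares against fp16's largest finite value (65504); it does not emulate fp16 rounding, which would cause precision loss even earlier.

```cpp
// Largest finite value representable in IEEE-754 binary16 (fp16).
constexpr double kFp16MaxFinite = 65504.0;

// Exact sum of squares of `count` copies of `value`, computed in double.
double sum_of_squares(double value, int count)
{
    return value * value * static_cast<double>(count);
}

// True when an fp16 accumulator could not hold the sum of squares as a finite number.
bool sqsum_exceeds_fp16_range(double value, int count)
{
    return sum_of_squares(value, count) > kFp16MaxFinite;
}
```

For 256 elements of value 20 the sum of squares is 256 * 400 = 102400, which is above 65504, so a half-precision `sqsum` would reach infinity before the reduction finishes.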
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f3c433aabf
    dispatcher.h = num_groups_total;
    dispatcher.c = 1;
Split subgroup dispatch across z
When channels * num_groups_per_channel exceeds the device's Y dispatch limit, this collapsed one-dimensional group_id dispatch records too many Y workgroups; Vulkan only guarantees maxComputeWorkGroupCount[1] >= 65535, so a valid tensor such as c=1024, h=128 normalized over w would request 131072 Y groups. The fallback path kept h=num_groups_per_channel and c=channels, so the new subgroup path should preserve/split the channel dimension instead of packing all groups into Y.
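One way to keep the Y count within the limit is to spread the flattened group index over Y and Z. The C++ sketch below is a generic split, not the layout the fallback path uses; `DispatchYZ` and `split_groups` are hypothetical names, and `max_count_y` stands for the device's `maxComputeWorkGroupCount[1]`.

```cpp
#include <cstdint>

struct DispatchYZ
{
    std::uint32_t y;
    std::uint32_t z;
};

// Split total_groups workgroups over Y and Z so that y <= max_count_y
// and y * z >= total_groups. Returns {0, 0} for degenerate inputs.
DispatchYZ split_groups(std::uint32_t total_groups, std::uint32_t max_count_y)
{
    if (total_groups == 0 || max_count_y == 0)
        return {0, 0};

    // 64-bit intermediates avoid overflow in the ceiling divisions.
    const std::uint64_t total = total_groups;
    const std::uint64_t z = (total + max_count_y - 1) / max_count_y;
    const std::uint64_t y = (total + z - 1) / z;
    return {static_cast<std::uint32_t>(y), static_cast<std::uint32_t>(z)};
}
```

For the c=1024, h=128 case with a limit of 65535 this gives y=43691 and z=3, which covers 131073 slots for 131072 groups. A shader using such a split would rebuild `group_id` from both indices and return early when it is out of range, and the host would still need to check z against `maxComputeWorkGroupCount[2]`. Keeping `h = num_groups_per_channel` and `c = channels`, as the fallback does, avoids the extra bounds check.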
    if (use_subgroup)
    {
    pipeline_layernorm_reduce_subgroup = new Pipeline(vkdev);
    pipeline_layernorm_reduce_subgroup->set_local_size_xyz(256, 1, 1);
Guard fixed 256-thread subgroup pipelines
On subgroup-capable Vulkan devices that expose only the required minimum maxComputeWorkGroupInvocations or maxComputeWorkGroupSize[0] of 128, this fixed local size of 256 is invalid, while the old reduction pipelines used set_optimal_local_size_xyz() to clamp to device limits. In that environment use_subgroup selects this path and pipeline creation/dispatch can fail for LayerNorm instead of falling back to the existing reduction path.
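A guard along these lines could look like the following C++ sketch, which clamps the requested local size to the two device limits named above. `clamp_local_size_x` is a hypothetical helper, and rounding down to a power of two is an extra assumption made here so that a tree-style combine stays simple; the shader may not need it.

```cpp
#include <algorithm>
#include <cstdint>

// Largest power of two that does not exceed the requested size or either limit.
// max_invocations ~ maxComputeWorkGroupInvocations,
// max_size_x      ~ maxComputeWorkGroupSize[0].
// Assumes all three arguments are at least 1.
std::uint32_t clamp_local_size_x(std::uint32_t requested,
                                 std::uint32_t max_invocations,
                                 std::uint32_t max_size_x)
{
    const std::uint32_t limit = std::min({requested, max_invocations, max_size_x});
    std::uint32_t size = 1;
    while (size <= limit / 2)
        size *= 2;
    return size;
}
```

On a device that only exposes the minimum of 128, a request for 256 comes back as 128. The alternative is to treat such devices as ineligible and leave `use_subgroup` false so the existing reduction path is used.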