layernorm vulkan subgroup reduce optimization #6754

Open
futz12 wants to merge 3 commits into Tencent:master from futz12:layernorm-subgroup-reduce

@futz12 futz12 commented May 29, 2026

  • add layernorm_reduce_subgroup shader using subgroupAdd arithmetic

  • compute mean and variance in single dispatch, eliminating ~10+ ping-pong reduces

  • fallback to shared memory tree reduction when subgroup arithmetic unavailable

  • add perf_layernorm benchmark
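
The single-dispatch reduction described in the bullets above can be pictured with a small CPU sketch. This is not the PR's shader: it is a plain C++ stand-in in which `reduce_group`, `Moments`, and the default sizes (256 invocations, subgroups of 32) are hypothetical names and values chosen for illustration. Each simulated invocation accumulates a strided slice, a per-subgroup sum plays the role of `subgroupAdd`, and a final loop plays the role of the shared-memory combine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Running sum and sum of squares for one normalization group.
struct Moments
{
    float sum;
    float sqsum;
};

// CPU stand-in for one workgroup reducing one group in a single dispatch.
// workgroup_size and subgroup_size are illustrative assumptions.
inline Moments reduce_group(const std::vector<float>& x,
                            std::size_t workgroup_size = 256,
                            std::size_t subgroup_size = 32)
{
    assert(subgroup_size > 0 && workgroup_size % subgroup_size == 0);
    const std::size_t num_subgroups = workgroup_size / subgroup_size;

    // Step 1: every invocation strides over the group and keeps private partials.
    std::vector<Moments> lane(workgroup_size, Moments{0.0f, 0.0f});
    for (std::size_t i = 0; i < x.size(); i++)
    {
        Moments& m = lane[i % workgroup_size];
        m.sum += x[i];
        m.sqsum += x[i] * x[i];
    }

    // Step 2: stand-in for subgroupAdd, producing one partial per subgroup.
    std::vector<Moments> partial(num_subgroups, Moments{0.0f, 0.0f});
    for (std::size_t t = 0; t < workgroup_size; t++)
    {
        partial[t / subgroup_size].sum += lane[t].sum;
        partial[t / subgroup_size].sqsum += lane[t].sqsum;
    }

    // Step 3: combine the per-subgroup partials (shared memory in a shader).
    Moments total{0.0f, 0.0f};
    for (const Moments& p : partial)
    {
        total.sum += p.sum;
        total.sqsum += p.sqsum;
    }
    return total;
}
```

Because both moments come out of the same pass, mean and variance can be derived without intermediate buffers or further dispatches, which is the saving the description refers to.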

LayerNorm Vulkan Subgroup Reduce

| shape | precision | baseline | optimized | speedup |
|---|---|---|---|---|
| [4096,1,1] | fp32 | 30.74 us | 6.51 us | 4.7x |
| [4096,1,1] | fp16ps | 30.64 us | 5.96 us | 5.1x |
| [4096,1,1] | fp16psa | 30.62 us | 6.07 us | 5.0x |
| [4096,1,1] | bf16ps | 32.17 us | 6.08 us | 5.3x |
| [4096,1,32] | fp32 | 42.05 us | 14.70 us | 2.9x |
| [4096,1,32] | fp16ps | 41.59 us | 13.10 us | 3.2x |
| [4096,1,32] | fp16psa | 41.82 us | 12.30 us | 3.4x |
| [4096,1,32] | bf16ps | 42.28 us | 12.90 us | 3.3x |
| [16384,1,1] | fp32 | 39.77 us | 11.90 us | 3.3x |
| [16384,1,1] | fp16ps | 39.54 us | 12.10 us | 3.3x |
| [16384,1,1] | fp16psa | 39.53 us | 12.30 us | 3.2x |
| [16384,1,1] | bf16ps | 39.64 us | 10.10 us | 3.9x |
| [5120,1,1] | fp32 | 35.40 us | 7.26 us | 4.9x |
| [5120,1,1] | fp16ps | 32.70 us | 6.61 us | 4.9x |
| [5120,1,1] | fp16psa | 32.60 us | 6.68 us | 4.9x |
| [5120,1,1] | bf16ps | 32.10 us | 6.66 us | 4.8x |
| [4096,512,1] | fp32 | 112.20 us | 44.20 us | 2.5x |
| [4096,512,1] | fp16ps | 113.40 us | 43.05 us | 2.6x |
| [4096,512,1] | fp16psa | 113.30 us | 43.29 us | 2.6x |
| [4096,512,1] | bf16ps | 113.50 us | 42.82 us | 2.7x |
| [1024,1,1] | fp32 | 27.33 us | 6.13 us | 4.5x |
| [1024,1,1] | fp16ps | 26.88 us | 5.21 us | 5.2x |
| [1024,1,1] | fp16psa | 26.89 us | 5.25 us | 5.1x |
| [1024,1,1] | bf16ps | 27.21 us | 5.32 us | 5.1x |
| [768,1,1] | fp32 | 26.77 us | 5.85 us | 4.6x |
| [768,1,1] | fp16ps | 26.86 us | 5.40 us | 5.0x |
| [768,1,1] | fp16psa | 26.66 us | 5.35 us | 5.0x |
| [768,1,1] | bf16ps | 26.78 us | 5.26 us | 5.1x |


perf improvement on RTX 4060 [4096,1,1] fp32: ~27ms -> ~7.5ms (3.6x)
perf improvement on RTX 4060 [4096,1,32] fp32: ~11.5ms -> ~1.3ms (8.8x)
perf improvement on RTX 4060 [4096,512,1] fp32: ~11.5ms -> ~4.8ms (2.4x)
perf improvement on RTX 4060 [1024,1,1] fp32: ~27ms -> ~7ms (3.8x)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ba13ae4698

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +72 to +73
float mean = float(r_sum) / float(p.group_size);
float var = float(r_sqsum) / float(p.group_size) - mean * mean;

P2: Avoid cancellation when computing variance

For groups whose values have a large common offset but small spread, computing E[x^2] - mean * mean in float can lose all significant bits or even go slightly negative, so LayerNorm returns badly scaled values or NaNs even though the previous two-pass path used squared deviations from the mean. The same formula appears in the pack4 subgroup shader, so tensors with values like large activations plus small differences will regress only when the new subgroup path is selected.

Useful? React with 👍 / 👎.
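
The cancellation concern can be reproduced on the CPU with a minimal sketch. The functions `variance_one_pass` and `variance_two_pass` are hypothetical names, not code from the PR; the first mirrors the `E[x^2] - mean * mean` form quoted above, and the second uses squared deviations from the mean, as the review says the previous path did.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One-pass form: E[x^2] - mean^2 with fp32 accumulators.
inline float variance_one_pass(const std::vector<float>& x)
{
    assert(!x.empty());
    const float n = static_cast<float>(x.size());
    float sum = 0.0f;
    float sqsum = 0.0f;
    for (float v : x)
    {
        sum += v;
        sqsum += v * v;
    }
    const float mean = sum / n;
    return sqsum / n - mean * mean;
}

// Two-pass form: compute the mean first, then average squared deviations.
inline float variance_two_pass(const std::vector<float>& x)
{
    assert(!x.empty());
    const float n = static_cast<float>(x.size());
    float sum = 0.0f;
    for (float v : x) sum += v;
    const float mean = sum / n;

    float acc = 0.0f;
    for (float v : x)
    {
        const float d = v - mean;
        acc += d * d;
    }
    return acc / n;
}
```

For the input `{10000, 10001, 10002, 10003}` the exact variance is 1.25. The two-pass form recovers it, while the one-pass form loses it because the squares are near 1e8, where adjacent fp32 values are 8 apart. Clamping the one-pass result with `max(var, 0)` would only hide the negative case; it would not restore the lost digits.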

The layernorm_reduce_subgroup shaders are only created on devices that
support subgroup arithmetic, so the outermost #else (pure shared-memory
fallback without subgroup) is dead code.

Note: InstanceNorm/GroupNorm/RMSNorm do not have subgroup reduce shaders
yet, so there is no dead branch to clean up there.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ef2d42224


Comment thread src/layer/vulkan/layernorm_vulkan.cpp Outdated
pipeline_layernorm_norm_pack4->create(LayerShaderType::layernorm_norm_pack4, opt, specializations);
}

if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)

P2: Honor use_subgroup_ops before creating subgroup pipelines

When callers disable opt.use_subgroup_ops on a device that still reports arithmetic subgroup support, this branch still builds the new subgroup shaders. Those shaders use subgroup extensions/ops unconditionally, while Pipeline::create(..., opt, ...) targets Vulkan 1.0/SPIR-V 1.0 when use_subgroup_ops is false, so LayerNorm pipeline creation can fail or leave an unusable pipeline instead of falling back to the existing reduction path.

Useful? React with 👍 / 👎.
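
One way to express the gating this comment asks for is sketched below. It is an illustration, not the PR's fix: `should_use_subgroup_reduce` is a hypothetical helper, and `kSubgroupArithmeticBit` is a local stand-in for `VK_SUBGROUP_FEATURE_ARITHMETIC_BIT` (value 0x4 in the Vulkan headers) so the snippet compiles without Vulkan.

```cpp
#include <cstdint>

// Local stand-in for VK_SUBGROUP_FEATURE_ARITHMETIC_BIT (0x4 in the Vulkan headers).
constexpr std::uint32_t kSubgroupArithmeticBit = 0x00000004u;

// Hypothetical helper: take the subgroup path only when the caller has not
// disabled subgroup ops AND the device reports arithmetic subgroup support.
inline bool should_use_subgroup_reduce(bool opt_use_subgroup_ops,
                                       std::uint32_t supported_ops)
{
    return opt_use_subgroup_ops && (supported_ops & kSubgroupArithmeticBit) != 0;
}
```

In the layer this would presumably be fed with `opt.use_subgroup_ops` and `vkdev->info.support_subgroup_ops()`, with the existing reduction pipelines created whenever it returns false.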

Comment on lines +40 to +42
afp v = buffer_ld1(bottom_top_blob_data, offset + t);
sum += v;
sqsum += v * v;

P2: Accumulate scalar subgroup sums in fp32

With opt.use_fp16_arithmetic, afp is float16_t, so the scalar subgroup path accumulates both sum and sqsum in half precision. For common LayerNorm groups this can overflow or lose most precision (for example, 256 elements near 20 make sqsum exceed fp16's finite range), whereas the previous fp16_to_fp32 reduction used float accumulators; this produces incorrect variance and NaN/inf-normalized output only when the new scalar subgroup path is selected.

Useful? React with 👍 / 👎.
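
The arithmetic behind the overflow example can be checked with a short sketch. It does not use a half-precision type; it accumulates in fp32 and compares the result against 65504, the largest finite fp16 value. The names `kFp16Max`, `sum_of_squares_fp32`, and `exceeds_fp16_range` are hypothetical.

```cpp
#include <cstddef>

// Largest finite IEEE 754 binary16 (fp16) value.
constexpr float kFp16Max = 65504.0f;

// Sum of squares of `count` copies of `value`, accumulated in fp32.
inline float sum_of_squares_fp32(std::size_t count, float value)
{
    float sqsum = 0.0f;
    for (std::size_t i = 0; i < count; i++)
    {
        sqsum += value * value;
    }
    return sqsum;
}

// True when a value would not fit in a finite fp16 accumulator.
inline bool exceeds_fp16_range(float v)
{
    return v > kFp16Max;
}
```

With 256 elements equal to 20, the sum of squares is 256 * 400 = 102400, which is above 65504, so an fp16 accumulator would overflow to infinity while an fp32 accumulator holds the value exactly.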


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3c433aabf


Comment on lines +248 to +249
dispatcher.h = num_groups_total;
dispatcher.c = 1;

P2: Split subgroup dispatch across z

When channels * num_groups_per_channel exceeds the device's Y dispatch limit, this collapsed one-dimensional group_id dispatch records too many Y workgroups; Vulkan only guarantees maxComputeWorkGroupCount[1] >= 65535, so a valid tensor such as c=1024, h=128 normalized over w would request 131072 Y groups. The fallback path kept h=num_groups_per_channel and c=channels, so the new subgroup path should preserve/split the channel dimension instead of packing all groups into Y.

Useful? React with 👍 / 👎.
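
A generic way to keep the Y workgroup count within the device limit is to spill the excess into Z. The sketch below is one possible approach, not the PR's code and not the exact `h`/`c` layout the review suggests preserving: `split_groups` and `DispatchYZ` are hypothetical names, and `max_y` stands for `maxComputeWorkGroupCount[1]`.

```cpp
#include <cassert>
#include <cstdint>

// Workgroup counts along Y and Z for a collapsed list of groups.
struct DispatchYZ
{
    std::uint32_t y;
    std::uint32_t z;
};

// Hypothetical helper: choose y <= max_y and z such that y * z >= total_groups.
inline DispatchYZ split_groups(std::uint32_t total_groups, std::uint32_t max_y)
{
    assert(total_groups > 0 && max_y > 0);
    const std::uint64_t total = total_groups;
    // Smallest number of Z slices that lets each slice fit within max_y.
    const std::uint64_t z = (total + max_y - 1) / max_y;
    // Spread the groups evenly over those slices (rounded up).
    const std::uint64_t y = (total + z - 1) / z;
    return DispatchYZ{static_cast<std::uint32_t>(y), static_cast<std::uint32_t>(z)};
}
```

Because `y * z` can exceed `total_groups` by a few entries, a shader using this scheme would need to rebuild the linear group index from its Y and Z workgroup IDs and return early when the index is out of range. Keeping `h = num_groups_per_channel` and `c = channels`, as the fallback path does, avoids that extra check.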

if (use_subgroup)
{
pipeline_layernorm_reduce_subgroup = new Pipeline(vkdev);
pipeline_layernorm_reduce_subgroup->set_local_size_xyz(256, 1, 1);

P2: Guard fixed 256-thread subgroup pipelines

On subgroup-capable Vulkan devices that expose only the required minimum maxComputeWorkGroupInvocations or maxComputeWorkGroupSize[0] of 128, this fixed local size of 256 is invalid, while the old reduction pipelines used set_optimal_local_size_xyz() to clamp to device limits. In that environment use_subgroup selects this path and pipeline creation/dispatch can fail for LayerNorm instead of falling back to the existing reduction path.

Useful? React with 👍 / 👎.
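
A minimal sketch of the guard this comment describes follows. `clamp_local_size_x` is a hypothetical helper, and its two limit parameters stand for `maxComputeWorkGroupInvocations` and `maxComputeWorkGroupSize[0]`; it halves the preferred size until it fits both.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: shrink a preferred local size X until it respects
// both the total-invocation limit and the per-dimension X limit.
inline std::uint32_t clamp_local_size_x(std::uint32_t preferred,
                                        std::uint32_t max_invocations,
                                        std::uint32_t max_size_x)
{
    assert(preferred > 0 && max_invocations > 0 && max_size_x > 0);
    const std::uint32_t limit = std::min(max_invocations, max_size_x);
    std::uint32_t size = preferred;
    // Halving keeps a power-of-two preferred size a power of two.
    while (size > limit)
    {
        size /= 2;
    }
    return size;
}
```

If the shader's reduction assumes exactly 256 invocations, clamping alone is not enough; in that case the simpler option is to skip the subgroup pipelines when the clamped value differs from 256 and use the existing reduction path.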
