layernorm vulkan subgroup reduce optimization by futz12 · Pull Request #6754 · Tencent/ncnn

futz12 · 2026-05-29T04:55:58Z

add layernorm_reduce_subgroup shader using subgroupAdd arithmetic
compute mean and variance in single dispatch, eliminating ~10+ ping-pong reduces
fall back to shared memory tree reduction when subgroup arithmetic is unavailable
add perf_layernorm benchmark
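
As a rough CPU-side illustration of the single-dispatch idea, the sketch below emulates one workgroup: each block of `subgroup_size` lanes is summed first (the role `subgroupAdd` plays on the GPU), then the per-subgroup partials are combined (the role of a small shared-memory step). The function and struct names are hypothetical and do not come from the PR; the shader itself is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct MeanVar
{
    float mean;
    float var;
};

// Hypothetical CPU emulation of a two-level reduction over one LayerNorm group.
MeanVar layernorm_stats_two_level(const std::vector<float>& x, std::size_t subgroup_size)
{
    assert(!x.empty() && subgroup_size > 0);

    // Level 1: one partial sum and one partial sum of squares per "subgroup".
    std::vector<float> partial_sum;
    std::vector<float> partial_sqsum;
    for (std::size_t base = 0; base < x.size(); base += subgroup_size)
    {
        float s = 0.f;
        float q = 0.f;
        for (std::size_t lane = 0; lane < subgroup_size && base + lane < x.size(); lane++)
        {
            const float v = x[base + lane];
            s += v;
            q += v * v;
        }
        partial_sum.push_back(s);
        partial_sqsum.push_back(q);
    }

    // Level 2: combine the per-subgroup partials.
    float sum = 0.f;
    float sqsum = 0.f;
    for (std::size_t i = 0; i < partial_sum.size(); i++)
    {
        sum += partial_sum[i];
        sqsum += partial_sqsum[i];
    }

    const float n = static_cast<float>(x.size());
    const float mean = sum / n;
    return {mean, sqsum / n - mean * mean};
}
```

Because both statistics come out of the same pass over the data, no intermediate buffer has to be written and re-read between dispatches, which is presumably where the saving over the ping-pong reduces comes from.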

LayerNorm Vulkan Subgroup Reduce

shape	precision	baseline	optimized	speedup
[4096,1,1]	fp32	30.74 us	6.51 us	4.7x
[4096,1,1]	fp16ps	30.64 us	5.96 us	5.1x
[4096,1,1]	fp16psa	30.62 us	6.07 us	5.0x
[4096,1,1]	bf16ps	32.17 us	6.08 us	5.3x
[4096,1,32]	fp32	42.05 us	14.70 us	2.9x
[4096,1,32]	fp16ps	41.59 us	13.10 us	3.2x
[4096,1,32]	fp16psa	41.82 us	12.30 us	3.4x
[4096,1,32]	bf16ps	42.28 us	12.90 us	3.3x
[16384,1,1]	fp32	39.77 us	11.90 us	3.3x
[16384,1,1]	fp16ps	39.54 us	12.10 us	3.3x
[16384,1,1]	fp16psa	39.53 us	12.30 us	3.2x
[16384,1,1]	bf16ps	39.64 us	10.10 us	3.9x
[5120,1,1]	fp32	35.40 us	7.26 us	4.9x
[5120,1,1]	fp16ps	32.70 us	6.61 us	4.9x
[5120,1,1]	fp16psa	32.60 us	6.68 us	4.9x
[5120,1,1]	bf16ps	32.10 us	6.66 us	4.8x
[4096,512,1]	fp32	112.20 us	44.20 us	2.5x
[4096,512,1]	fp16ps	113.40 us	43.05 us	2.6x
[4096,512,1]	fp16psa	113.30 us	43.29 us	2.6x
[4096,512,1]	bf16ps	113.50 us	42.82 us	2.7x
[1024,1,1]	fp32	27.33 us	6.13 us	4.5x
[1024,1,1]	fp16ps	26.88 us	5.21 us	5.2x
[1024,1,1]	fp16psa	26.89 us	5.25 us	5.1x
[1024,1,1]	bf16ps	27.21 us	5.32 us	5.1x
[768,1,1]	fp32	26.77 us	5.85 us	4.6x
[768,1,1]	fp16ps	26.86 us	5.40 us	5.0x
[768,1,1]	fp16psa	26.66 us	5.35 us	5.0x
[768,1,1]	bf16ps	26.78 us	5.26 us	5.1x

perf improvement on RTX 4060 [4096,1,1] fp32: ~27ms -> ~7.5ms (3.6x)
perf improvement on RTX 4060 [4096,1,32] fp32: ~11.5ms -> ~1.3ms (8.8x)
perf improvement on RTX 4060 [4096,512,1] fp32: ~11.5ms -> ~4.8ms (2.4x)
perf improvement on RTX 4060 [1024,1,1] fp32: ~27ms -> ~7ms (3.8x)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ba13ae4698

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-29T04:59:47Z

+            float mean = float(r_sum) / float(p.group_size);
+            float var = float(r_sqsum) / float(p.group_size) - mean * mean;


Avoid cancellation when computing variance

For groups whose values have a large common offset but small spread, computing E[x^2] - mean * mean in float can lose all significant bits or even go slightly negative, so LayerNorm returns badly scaled values or NaNs even though the previous two-pass path used squared deviations from the mean. The same formula appears in the pack4 subgroup shader, so tensors with values like large activations plus small differences will regress only when the new subgroup path is selected.
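
To make the concern concrete, here is a minimal CPU sketch contrasting the one-pass formula from the diff with a two-pass variance that sums squared deviations from the mean. Both function names are hypothetical, and the two-pass form is shown only as one possible remedy (it would need a second reduction in the shader); it is not necessarily what the PR adopted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One-pass form used in the diff: E[x^2] - mean^2.
// Prone to cancellation when the values share a large offset.
float variance_one_pass(const std::vector<float>& x)
{
    assert(!x.empty());
    float sum = 0.f;
    float sqsum = 0.f;
    for (float v : x)
    {
        sum += v;
        sqsum += v * v;
    }
    const float n = static_cast<float>(x.size());
    const float mean = sum / n;
    return sqsum / n - mean * mean;
}

// Two-pass form: compute the mean first, then average squared deviations.
// The result cannot go negative because every term is a square.
float variance_two_pass(const std::vector<float>& x)
{
    assert(!x.empty());
    float sum = 0.f;
    for (float v : x)
        sum += v;
    const float n = static_cast<float>(x.size());
    const float mean = sum / n;

    float acc = 0.f;
    for (float v : x)
    {
        const float d = v - mean;
        acc += d * d;
    }
    return acc / n;
}
```

For an input such as {10000, 10001, 10002, 10003}, the squares are near 1e8, where adjacent floats are 8 apart, so the one-pass result can land far from the true variance of 1.25, while the two-pass result is exact. A cheaper partial guard, if a second reduction is too costly, is clamping the one-pass variance at zero before adding eps; that avoids NaNs but does not restore the lost precision.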

The layernorm_reduce_subgroup shaders are only created on devices that support subgroup arithmetic, so the outermost #else (pure shared-memory fallback without subgroup) is dead code. Note: InstanceNorm/GroupNorm/RMSNorm do not have subgroup reduce shaders yet, so there is no dead branch to clean up there.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ef2d42224

chatgpt-codex-connector · 2026-05-29T08:41:38Z

        pipeline_layernorm_norm_pack4->create(LayerShaderType::layernorm_norm_pack4, opt, specializations);
    }

+    if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)


Honor use_subgroup_ops before creating subgroup pipelines

When callers disable opt.use_subgroup_ops on a device that still reports arithmetic subgroup support, this branch still builds the new subgroup shaders. Those shaders use subgroup extensions/ops unconditionally, while Pipeline::create(..., opt, ...) targets Vulkan 1.0/SPIR-V 1.0 when use_subgroup_ops is false, so LayerNorm pipeline creation can fail or leave an unusable pipeline instead of falling back to the existing reduction path.
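
A minimal sketch of the gating the review asks for, assuming a simplified stand-in for the option struct. `OptionSketch` and `should_use_subgroup_reduce` are hypothetical names; only `use_subgroup_ops` and the arithmetic feature bit come from the review text, and the constant mirrors the value of `VK_SUBGROUP_FEATURE_ARITHMETIC_BIT`.

```cpp
#include <cstdint>

// Same value as VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in the Vulkan headers.
constexpr std::uint32_t kSubgroupFeatureArithmeticBit = 0x00000004;

// Hypothetical stand-in for the relevant field of the option struct.
struct OptionSketch
{
    bool use_subgroup_ops;
};

// Select the subgroup path only when the caller allows subgroup ops
// AND the device reports arithmetic subgroup support.
bool should_use_subgroup_reduce(const OptionSketch& opt, std::uint32_t supported_subgroup_ops)
{
    return opt.use_subgroup_ops
           && (supported_subgroup_ops & kSubgroupFeatureArithmeticBit) != 0;
}
```

With a check of this shape, a caller that sets `use_subgroup_ops` to false would keep the existing reduction path even on hardware that advertises subgroup arithmetic.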

chatgpt-codex-connector · 2026-05-29T08:41:38Z

+        afp v = buffer_ld1(bottom_top_blob_data, offset + t);
+        sum += v;
+        sqsum += v * v;


Accumulate scalar subgroup sums in fp32

With opt.use_fp16_arithmetic, afp is float16_t, so the scalar subgroup path accumulates both sum and sqsum in half precision. For common LayerNorm groups this can overflow or lose most precision (for example, 256 elements near 20 make sqsum exceed fp16's finite range), whereas the previous fp16_to_fp32 reduction used float accumulators; this produces incorrect variance and NaN/inf-normalized output only when the new scalar subgroup path is selected.
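
The arithmetic in the example above can be checked with a few lines of host code. This sketch accumulates the sum of squares in a float and compares it against 65504, the largest finite fp16 value; the helper name is hypothetical.

```cpp
#include <cassert>

// Largest finite value representable in IEEE 754 half precision.
constexpr float kFp16Max = 65504.0f;

// Returns true if the sum of squares of `count` copies of `value`,
// accumulated in fp32, is larger than fp16 can represent.
bool sqsum_exceeds_fp16_range(int count, float value)
{
    assert(count >= 0);
    float sqsum = 0.f; // fp32 accumulator, as the review recommends
    for (int i = 0; i < count; i++)
        sqsum += value * value;
    return sqsum > kFp16Max;
}
```

For 256 elements equal to 20 the sum of squares is 102400, which a float accumulator holds exactly but a half accumulator cannot represent as a finite number.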

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3c433aabf

chatgpt-codex-connector · 2026-06-02T05:41:50Z

+        dispatcher.h = num_groups_total;
+        dispatcher.c = 1;


Split subgroup dispatch across z

When channels * num_groups_per_channel exceeds the device's Y dispatch limit, this collapsed one-dimensional group_id dispatch records too many Y workgroups; Vulkan only guarantees maxComputeWorkGroupCount[1] >= 65535, so a valid tensor such as c=1024, h=128 normalized over w would request 131072 Y groups. The fallback path kept h=num_groups_per_channel and c=channels, so the new subgroup path should preserve/split the channel dimension instead of packing all groups into Y.
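
One generic way to stay under the Y limit is sketched below: spread the flat group count over Y and Z. The names are hypothetical, and this is an alternative to the simpler option the review mentions (keeping `h = num_groups_per_channel` and `c = channels`). The sketch assumes the shader recomputes a flat group index from Y and Z and returns early when that index is at or beyond the real group count, because `y * z` may slightly overshoot. It does not check the Z limit.

```cpp
#include <cassert>
#include <cstdint>

struct DispatchYZ
{
    std::uint32_t y;
    std::uint32_t z;
};

// Split total_groups into a y * z grid with y <= max_y and y * z >= total_groups.
DispatchYZ split_groups(std::uint32_t total_groups, std::uint32_t max_y)
{
    assert(total_groups > 0 && max_y > 0);
    if (total_groups <= max_y)
        return {total_groups, 1};

    // Ceiling divisions written as (a - 1) / b + 1 to avoid unsigned overflow.
    const std::uint32_t z = (total_groups - 1) / max_y + 1;
    const std::uint32_t y = (total_groups - 1) / z + 1;
    return {y, z};
}
```

For the case in the review (131072 groups, limit 65535) this yields y = 43691 and z = 3, one group more than needed, which is why the bounds check in the shader matters.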

chatgpt-codex-connector · 2026-06-02T05:41:50Z

+    if (use_subgroup)
+    {
+        pipeline_layernorm_reduce_subgroup = new Pipeline(vkdev);
+        pipeline_layernorm_reduce_subgroup->set_local_size_xyz(256, 1, 1);


Guard fixed 256-thread subgroup pipelines

On subgroup-capable Vulkan devices that expose only the required minimum maxComputeWorkGroupInvocations or maxComputeWorkGroupSize[0] of 128, this fixed local size of 256 is invalid, while the old reduction pipelines used set_optimal_local_size_xyz() to clamp to device limits. In that environment use_subgroup selects this path and pipeline creation/dispatch can fail for LayerNorm instead of falling back to the existing reduction path.
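
A minimal sketch of clamping the requested local size to device limits, assuming the shader can cope with any power-of-two workgroup width. The helper name and parameter names are hypothetical; the two limit arguments stand for `maxComputeWorkGroupInvocations` and `maxComputeWorkGroupSize[0]`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Largest power of two that does not exceed the preferred size or either device limit.
std::uint32_t pick_local_size_x(std::uint32_t preferred,
                                std::uint32_t max_invocations,
                                std::uint32_t max_size_x)
{
    const std::uint32_t limit = std::min({preferred, max_invocations, max_size_x});
    assert(limit >= 1);

    std::uint32_t size = 1;
    while (size <= limit / 2)
        size *= 2;
    return size;
}
```

On a device exposing only the minimum limit of 128, a request for 256 would come back as 128. The alternative described in the review is to skip the subgroup path entirely on such devices and keep the existing reduction pipelines.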

github-actions Bot added vulkan test labels May 29, 2026

chatgpt-codex-connector Bot reviewed May 29, 2026

fix route logic

f3c433a

chatgpt-codex-connector Bot reviewed Jun 2, 2026

futz12 wants to merge 3 commits into Tencent:master from futz12:layernorm-subgroup-reduce
