[ET-VK][ez] Fix 8 bit linear compute shader dispatch #9531

SS-JIA · 2025-03-23T23:53:05Z

Stack from ghstack (oldest at bottom):

-> [ET-VK][ez] Fix 8 bit linear compute shader dispatch #9531

Context

Currently, for the q_8w_linear shader, both the texture and the buffer variants use the same global work group and local work group settings.

Specifically, the global work group is set to {out.numel(), 1, 1} and the local work group is set to {64, 1, 1}.

However, I believe this results in very poor memory re-use for the texture shader. In this configuration:

* Within a work group, each invocation requests a different row of A, so 64 rows of A are requested in total
* All invocations in the work group request the same row of B
* One work group therefore loads 65 unique rows from A and B

Compare this to a local work group size of {8, 8, 1}:

* Across the work group, 8 rows are loaded from A and 8 rows are loaded from B
* One work group loads 16 unique rows in total from A and B

The {8, 8, 1} work group therefore has better memory re-use: the same 64 invocations load 16 unique rows instead of 65.
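The row counts above can be reproduced with a small sketch. This is only an illustration of the arithmetic in this description, not code from the PR: the function name is hypothetical, and it assumes that the first work group dimension selects a row of A and the second selects a row of B.

```python
def unique_rows_per_work_group(local_wg):
    """Return (unique rows loaded, invocations) for one work group.

    Assumption (not stated in the PR): the x index picks a row of A and
    the y index picks a row of B, so a work group of size (x, y, 1)
    touches x rows of A and y rows of B.
    """
    x, y, z = local_wg
    assert z == 1, "this sketch only covers 2D work groups"
    return x + y, x * y


if __name__ == "__main__":
    for wg in [(64, 1, 1), (8, 8, 1)]:
        rows, invocations = unique_rows_per_work_group(wg)
        print(f"local wg {wg}: {invocations} invocations, {rows} unique rows")
```

Both shapes run 64 invocations per work group; only the number of distinct rows those invocations touch changes.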

Changes

Modify the q_8w_linear shader to use a local work group size of {8, 8, 1} where possible. If M is small, use {4, 16, 1} or {2, 32, 1} instead to reduce the number of inactive invocations.
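A minimal sketch of how such a selection might look, and why it reduces idle invocations. The thresholds (8 and 4), the function names, and the assumption that the first local work group component spans M are all hypothetical; the description does not state the exact cutoffs or which axis maps to M.

```python
import math


def pick_local_wg(m):
    """Pick a local work group size from M (hypothetical thresholds)."""
    if m >= 8:
        return (8, 8, 1)
    if m >= 4:
        return (4, 16, 1)
    return (2, 32, 1)


def inactive_invocations(m, local_wg):
    """Idle invocations along the M axis, per work group column.

    Assumes the first component of local_wg spans M. Work groups are
    dispatched in whole units, so the last one along M may be only
    partly filled.
    """
    m_dim, other_dim, _ = local_wg
    groups_along_m = math.ceil(m / m_dim)
    return (groups_along_m * m_dim - m) * other_dim


if __name__ == "__main__":
    for m in (2, 4, 8):
        fixed = inactive_invocations(m, (8, 8, 1))
        chosen = pick_local_wg(m)
        adaptive = inactive_invocations(m, chosen)
        print(f"M={m}: (8, 8, 1) idles {fixed}, {chosen} idles {adaptive}")
```

Under these assumptions, M = 2 with {8, 8, 1} leaves 48 of 64 invocations idle in each work group, while {2, 32, 1} leaves none.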

Differential Revision: D71706489

pytorch-bot · 2025-03-23T23:53:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9531

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit a3a3d85 with merge base 7159650:

NEW FAILURES - The following jobs have failed:

Check Labels / Check labels (gh)
RuntimeError: Error checking labels: PR does not have required labels
pull / test-llava-runner-linux / linux-job (gh)
RuntimeError: Command docker exec -t 208798d1fa15267b5bbad3bef2d5ac147954ef4e02413f376c0641e53295c674 /exec failed with exit code 139

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 273548740
Pull Request resolved: #9531

facebook-github-bot · 2025-03-23T23:53:18Z

This pull request was exported from Phabricator. Differential Revision: D71706489

github-actions · 2025-03-23T23:53:45Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Pull Request resolved: #9531
ghstack-source-id: 274198011
@exported-using-ghexport

facebook-github-bot · 2025-03-26T16:01:00Z

This pull request was exported from Phabricator. Differential Revision: D71706489

Pull Request resolved: #9531
ghstack-source-id: 274260277
@exported-using-ghexport

facebook-github-bot · 2025-03-26T19:38:50Z

This pull request was exported from Phabricator. Differential Revision: D71706489

facebook-github-bot added the CLA Signed label Mar 23, 2025

facebook-github-bot added the fb-exported label Mar 23, 2025

yipjustin approved these changes Mar 26, 2025

facebook-github-bot merged commit e918ec2 into gh/SS-JIA/200/base Mar 26, 2025
80 of 83 checks passed

facebook-github-bot deleted the gh/SS-JIA/200/head branch March 26, 2025 22:17

facebook-github-bot temporarily deployed to cherry-pick-bot March 26, 2025 22:17 — with GitHub Actions Inactive

pytorchbot mentioned this pull request Mar 26, 2025

[ET-VK][ez] Fix 8 bit linear compute shader dispatch #9664

Merged
