[ET-VK][ez] Fix 8 bit linear compute shader dispatch #9531
base: gh/SS-JIA/200/base
## Context

Currently, for the `q_8w_linear` shader, both the texture and the buffer variants use the same global and local work group settings. Specifically, the global work group is set to `{out.numel(), 1, 1}` and the local work group is set to `{64, 1, 1}`.

However, I believe this results in very poor memory re-use for the texture shader. In this configuration:

* Within a work group, each invocation requests a different row of A, so 64 rows of A are requested in total
* All invocations in the work group request the same row of B
* One work group loads 65 unique rows from A and B

Compare this to a local work group size of `{8, 8, 1}`:

* Across the work group, 8 rows are loaded from A and 8 rows are loaded from B
* One work group loads 16 unique rows in total from A and B

The latter work group has better memory re-use, since fewer unique rows are loaded for the same 64 invocations.

## Changes

Modify the `q_8w_linear` shader to use a `{8, 8, 1}` local work group if possible. If `M` is small, use `{4, 16, 1}` or `{2, 32, 1}` instead to reduce the number of inactive invocations.

Differential Revision: [D71706489](https://our.internmc.facebook.com/intern/diff/D71706489/)

[ghstack-poisoned]
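The row counts in the Context section and the motivation for the smaller shapes can be checked with a short sketch. This is an illustrative model only, not the code in this PR: `unique_rows`, `pick_tile`, and `inactive_invocations` are hypothetical helpers, and the sketch assumes that the first component of each local work group shape spans `M` (the description does not state which axis maps to `M`) and that the shape is chosen as the largest of 8, 4, 2 that fits.

```python
def unique_rows(rows_a, rows_b):
    """Unique rows touched by one work group spanning rows_a rows of A
    and rows_b rows of B (simple model: every row is loaded once)."""
    return rows_a + rows_b


def pick_tile(m):
    """Hypothetical heuristic: pick the largest M-extent in (8, 4, 2) that
    does not exceed m, keeping 64 invocations per work group.
    Returns (extent along M, extent along the other axis)."""
    for m_extent in (8, 4, 2):
        if m >= m_extent:
            return (m_extent, 64 // m_extent)
    return (1, 64)


def inactive_invocations(m, tile):
    """Invocations that fall outside the M rows, per column of work groups."""
    m_extent, other_extent = tile
    groups = -(-m // m_extent)  # ceiling division
    return (groups * m_extent - m) * other_extent


print(unique_rows(64, 1))                 # flat {64, 1, 1} case from the description
print(unique_rows(8, 8))                  # {8, 8, 1} case from the description
print(pick_tile(4))                       # shape chosen when M = 4
print(inactive_invocations(4, (8, 8)))    # idle invocations if {8, 8, 1} were kept
print(inactive_invocations(4, (4, 16)))   # idle invocations with the smaller M-extent
```

Under these assumptions, keeping `{8, 8, 1}` when `M = 4` would leave half of each work group idle, which is the situation the alternative shapes are meant to avoid.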
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9531
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure
As of commit 44aace8 with merge base 76ae537:
NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 273548740
Pull Request resolved: #9531
This pull request was exported from Phabricator. Differential Revision: D71706489