
[GPU] Set minimum memory of count for reduce mean mode of scatter_elements_update, fix typos and remove space #30491

Open
wants to merge 5 commits into
base: master
from

Conversation

steve-y
Contributor

@steve-y steve-y commented May 11, 2025

Details:

  • Set minimum memory of count for reduce mean mode of scatter_elements_update
  • Fix typos and remove redundant spaces
  • Add unit test

Tickets:

  • 155068

@steve-y steve-y requested review from a team as code owners May 11, 2025 12:25
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label May 11, 2025
__local int count_v[1];
__local int count_k[COUNT_LIMIT/64];
__local int count_v[COUNT_LIMIT/64];
count_length = COUNT_LIMIT/64;
Contributor

What about implementing it on the jitter side? I think that would be clearer.

Contributor Author

The fix is now implemented on the jitter side, as you suggested.
Thanks for your comment.

@steve-y steve-y changed the title [GPU] Set minimum memory of count for reduce mean, fix typos and remove space [GPU] Set minimum memory of count for reduce mean mode of scatter_elements_update, fix typos and remove space May 12, 2025
@steve-y steve-y force-pushed the sy/fix_reduce_mean branch from 53c2aa7 to dd8ef87 Compare May 13, 2025 00:00
@@ -201,6 +201,7 @@ KernelsData ScatterElementsUpdateKernelRef::GetKernelsData(const Params& params)
if (i == 1) {
cldnn_jit.AddConstant(MakeJitConstant("IS_SECOND_ITER", "true"));
cldnn_jit.AddConstant(MakeJitConstant("COUNT_LIMIT", params.engineInfo.maxLocalMemSize));
cldnn_jit.AddConstant(MakeJitConstant("COUNT_MINIMUM", params.engineInfo.maxLocalMemSize/64));
Contributor

  • What does COUNT_LENGTH == 0 mean? I guess the total work-item size is then just 0 and no code will be executed, isn't it?

  • What about just setting COUNT_LENGTH = dispatchData.gws[0] * dispatchData.gws[1] * dispatchData.gws[2] if that product != 0, else COUNT_MINIMUM? Then you don't need to introduce an additional variable.
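
The fallback proposed in the second bullet can be sketched as a small host-side helper. This is only an illustration under stated assumptions: `count_length_or_fallback` is a hypothetical name, and a plain `std::array` stands in for `dispatchData.gws`; neither is part of the OpenVINO code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the total work-item count, or the given
// fallback when any gws dimension is zero (e.g. the shape is not known yet).
inline std::size_t count_length_or_fallback(const std::array<std::size_t, 3>& gws,
                                            std::size_t fallback) {
    // A zero fallback would still produce a zero-length count array.
    assert(fallback > 0);
    const std::size_t total = gws[0] * gws[1] * gws[2];
    return total != 0 ? total : fallback;
}
```

Computing the product once also avoids repeating the three-way multiplication inside the ternary expression.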

Contributor Author

Your suggestion avoids introducing an additional variable, so I applied it. Thanks for your comments.

@steve-y steve-y force-pushed the sy/fix_reduce_mean branch from dd8ef87 to 09c57b5 Compare May 27, 2025 06:16
cldnn_jit.AddConstant(MakeJitConstant("COUNT_LENGTH", dispatchData.gws[0] * dispatchData.gws[1] * dispatchData.gws[2]));
cldnn_jit.AddConstant(MakeJitConstant("COUNT_LENGTH", dispatchData.gws[0] * dispatchData.gws[1] * dispatchData.gws[2] != 0 ?
dispatchData.gws[0] * dispatchData.gws[1] * dispatchData.gws[2] :
params.engineInfo.maxLocalMemSize/64));
Contributor

For dynamic shapes, does this always use SLM? Why? I think it should use global memory if we cannot guarantee that the size fits in local memory.

Contributor Author

image
Because __global did not work for function calls in the CL kernel, I updated it to use __local plus an exception. Thanks for your comments.

@steve-y steve-y force-pushed the sy/fix_reduce_mean branch 4 times, most recently from 08ff7c1 to fe00c91 Compare June 2, 2025 05:31
@@ -199,9 +199,15 @@ KernelsData ScatterElementsUpdateKernelRef::GetKernelsData(const Params& params)
auto entry_point = GetEntryPoint(kernelName, newParams.layerID, params, i);

if (i == 1) {
auto count_limit = params.engineInfo.maxLocalMemSize;
auto count_length = dispatchData.gws[0] * dispatchData.gws[1] * dispatchData.gws[2];
auto min_count_length = count_limit / 64;
Contributor

@yeonbok yeonbok Jun 4, 2025

  1. What is this 64 for? If we need a hard-coded number, please use a named constant that conveys its purpose, e.g., const int var_name_for_purpose = 64;
  2. Also, limiting the ref kernel to the number of elements that fit in local memory is not acceptable. E.g., on desktop iGPUs with a local memory size of 64K, 64K/64 => 1024 elements is not acceptable.
  3. So we should be able to handle cases that do not fit in local memory.
  4. I.e., we should have two versions: 1) use SLM, 2) do not use SLM. Please check the softmax implementation.
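
Points 1 and 2 can be made concrete with a short sketch. It assumes "64K" means 64 * 1024 in the same unit as `maxLocalMemSize`; the constant name `kCountLimitDivisor` and the function `max_count_length` are hypothetical, and the name is kept neutral because the thread does not say what the 64 stands for.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, deliberately neutral name: the purpose of the 64 is not
// stated in the thread, so the name only records where it is used.
constexpr std::size_t kCountLimitDivisor = 64;

// Largest count length available when the limit is maxLocalMemSize / 64.
inline std::size_t max_count_length(std::size_t max_local_mem_size) {
    assert(max_local_mem_size > 0);  // zero would mean no local memory at all
    return max_local_mem_size / kCountLimitDivisor;
}
```

With a local memory size of 64 * 1024, `max_count_length` returns 1024, which is the element limit the comment above calls too small for a reference kernel.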

Contributor Author

@steve-y steve-y Jun 10, 2025

The PR was updated as follows:

  1. As the count values are known only during execution, allocatable local memory needs to be provided.
  2. I measured the maximum allocatable local memory size. It was about maxLocalMemSize / 8, not maxLocalMemSize. (Checked using cliloader.)
  3. As count has k and v, each gets maxLocalMemSize / 8 / 2 of allocatable memory.
  4. If the required memory exceeds the allocatable memory, it is allocated in global memory; otherwise, it is allocated in local memory.
  5. As __global could not be used in a function call, macro functions are used instead.
  6. [DRAFT] [GPU] Set internal memory of count for reduce mean #30930 uses an internal buffer for the dynamic case.
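
Items 2 to 4 amount to a budget calculation followed by a storage decision. The sketch below is one way to write that down on the host side; all names (`CountStorage`, `per_array_budget`, `choose_count_storage`) are hypothetical, and the divisor 8 is simply the empirical value reported in item 2, not a documented constant. The required size is assumed to be expressed in the same unit as `maxLocalMemSize`.

```cpp
#include <cassert>
#include <cstddef>

enum class CountStorage { Local, Global };

// Values taken from the comment above; the names are hypothetical.
constexpr std::size_t kAllocatableDivisor = 8;  // measured: about maxLocalMemSize / 8
constexpr std::size_t kCountArrays = 2;         // count_k and count_v

// Budget available to each of the two count arrays.
inline std::size_t per_array_budget(std::size_t max_local_mem_size) {
    return max_local_mem_size / kAllocatableDivisor / kCountArrays;
}

// Local memory when the request fits the budget, global memory otherwise.
inline CountStorage choose_count_storage(std::size_t required,
                                         std::size_t max_local_mem_size) {
    assert(max_local_mem_size > 0);
    return required > per_array_budget(max_local_mem_size) ? CountStorage::Global
                                                           : CountStorage::Local;
}
```

For a `maxLocalMemSize` of 65536, the per-array budget is 4096, so a request of 4096 stays in local memory and 4097 falls back to global memory.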

Thanks for your comments.

Contributor

What I requested was to use SLM when available (i.e., count_length is less than the limit) and to use global memory when it is not available (i.e., count_length is larger than the limit). Currently, for dynamic shapes, global memory is always used.

Contributor Author

COUNT_LENGTH is used as the array size for count_k and count_v, so I thought COUNT_LENGTH should be fixed before the run and not change while running. And since COUNT_LENGTH == COUNT_LIMIT for dynamic shapes, __local is used with its maximum size. What do you think?

@steve-y steve-y force-pushed the sy/fix_reduce_mean branch from fe00c91 to eab2f42 Compare June 10, 2025 03:01
@steve-y steve-y changed the title [GPU] Set minimum memory of count for reduce mean mode of scatter_elements_update, fix typos and remove space [DO NOT REVIEW] [GPU] Set minimum memory of count for reduce mean mode of scatter_elements_update, fix typos and remove space Jun 10, 2025
@steve-y steve-y changed the title [DO NOT REVIEW] [GPU] Set minimum memory of count for reduce mean mode of scatter_elements_update, fix typos and remove space [GPU] Set minimum memory of count for reduce mean mode of scatter_elements_update, fix typos and remove space Jun 10, 2025
cldnn_jit.AddConstant(MakeJitConstant("COUNT_LIMIT", params.engineInfo.maxLocalMemSize));
cldnn_jit.AddConstant(MakeJitConstant("COUNT_LENGTH", newParams.inputs[1].LogicalSize()));
cldnn_jit.AddConstant(MakeJitConstant("COUNT_LIMIT", maxAllocatableMemSize));
cldnn_jit.AddConstant(MakeJitConstant("COUNT_LENGTH", newParams.inputs[1].LogicalSize() != 0 ?
Contributor

Why is count_length always the max SLM size for dynamic shapes?
We should define the actual size from the shape, using something like INPUT_SIZE_XXX.....

Contributor Author

As count_length is used as the array size of count, it has to be fixed before the run, so it is set to the max SLM size.
I don't think I understand how INPUT_SIZE_XXX... is used. Could you explain more?
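
One possible reading of this exchange is that the array capacity has to be a compile-time constant, while the number of entries actually used can still come from the shape at run time. The C++ sketch below illustrates only that separation; it is not OpenCL and not the plugin's code, and `FixedCountBuffer` with its members is a hypothetical stand-in for a fixed-size `__local` array.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a fixed-size array: Capacity is set at compile
// time, the number of entries in use is supplied per run.
template <std::size_t Capacity>
class FixedCountBuffer {
public:
    // Returns false when the runtime length does not fit, so the caller
    // can take a fallback path (for example, global memory).
    bool resize(std::size_t runtime_length) {
        if (runtime_length > Capacity) return false;
        length_ = runtime_length;
        return true;
    }

    std::size_t length() const { return length_; }
    static constexpr std::size_t capacity() { return Capacity; }

    int& operator[](std::size_t i) {
        assert(i < length_);  // only entries within the runtime length are valid
        return storage_[i];
    }

private:
    std::array<int, Capacity> storage_{};
    std::size_t length_ = 0;
};
```

In this sketch a failed `resize` leaves the previous length untouched, which keeps the decision about the fallback path with the caller.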

Labels
category: GPU OpenVINO GPU plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants