
SYCL: fix index order #2488


Draft: wants to merge 2 commits into develop from fix-syclIndexOrder
Conversation

psychocoderHPC
Member

The fast-moving index in SYCL's `nd_item` is the rightmost one, which matches alpaka's index order.
In our code base we implemented it like CUDA's index order, where the leftmost index is the fast-moving one.

This PR should be backported to version 1.3.

You can read more about it here: https://www.intel.com/content/www/us/en/docs/dpcpp-compatibility-tool/developer-guide-reference/2023-2/cuda-and-sycl-programming-model-comparison.html

Thanks to @SimeonEhrig for pointing me to this issue.

AuroraPerego previously approved these changes Mar 27, 2025
@psychocoderHPC
Member Author

I updated the PR because I had missed changing `TaskKernelGenericSycl.hpp`, where we translate the alpaka work division into SYCL ranges.

@psychocoderHPC psychocoderHPC force-pushed the fix-syclIndexOrder branch 2 times, most recently from 5c30c29 to 9797f22 Compare March 27, 2025 15:41
@AuroraPerego AuroraPerego self-requested a review March 27, 2025 17:47
@psychocoderHPC psychocoderHPC marked this pull request as draft March 28, 2025 13:08
@psychocoderHPC psychocoderHPC marked this pull request as ready for review March 28, 2025 13:57
@psychocoderHPC
Member Author

@AuroraPerego could you try this PR on a CPU SYCL device? On an Intel GPU all tests pass, but for unknown reasons the CI fails where we execute the code on a CPU accelerator.

The error is:

ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 0 : 1) == 1)'
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 1 : 0) == expected)'

The fast-moving index in SYCL's `nd_item` is the rightmost one, which
matches alpaka's index order.
In our code base we implemented it like CUDA's index order, where the
leftmost index is the fast-moving one.
@psychocoderHPC
Member Author

I think a comment from @fwyzard is maybe a good starting point for the current problem seen on FPGA emulation and CPU:

// Workaround for a weird bug in oneAPI 2024.x targetting the CPU backend and FPGA emulator.
if constexpr(accMatchesTags<TAcc, TagCpuSycl, TagFpgaSyclIntel>)
{
// SYCL accelerator specific code
return acc.m_item_workdiv.get_global_linear_id() == 0;
}

In the original alpaka code we permuted the indices twice: once before the kernel start, to calculate the grid size, and again within the kernel, where we permute all SYCL indices back. If we linearize the permuted indices, the result can differ from what `get_global_linear_id()` returns, or, when we talk about warp functions, from `get_sub_group().get_local_linear_id()`.

I am currently trying to find out whether the AI is hallucinating or whether the following is true.

There is a known issue with the get_global_linear_id() function in SYCL, which affects FPGA and CPU devices but not GPUs. The problem arises because this function relies on the get_global_id() function, which returns the global ID of a work item within a work group.

On FPGAs, the global ID calculation is different due to the way FPGAs handle parallelism. Specifically, FPGAs use a concept called "work-item replication," where multiple work items are executed in parallel within a single processing element. As a result, the get_global_id() function can return an ID that is not unique across the entire work group, leading to incorrect results when using get_global_linear_id().

Similarly, on CPUs, the get_global_linear_id() function can be affected by the way CPUs handle parallelism. Since CPUs execute work items sequentially within a work group, the get_global_id() function may return an ID that is not representative of the actual global index of the work item.

On the other hand, GPUs are designed to handle massive parallelism and have optimized architectures for handling work-item IDs. As a result, the get_global_linear_id() function typically works as expected on GPUs.

To work around this issue, you can use alternative methods to calculate the global linear ID, such as manually calculating the ID based on the work group size and the local ID of the work item. Alternatively, you can use the get_global_id() function with the range<3> parameter to get the global ID of the work item in three-dimensional space and then calculate the linear ID manually.

It's worth noting that this issue may be addressed in future versions of the SYCL specification or by specific SYCL implementations. If you're experiencing issues with get_global_linear_id() on FPGAs or CPUs, I recommend checking the documentation of your SYCL implementation or reaching out to the vendor for more information.

@psychocoderHPC psychocoderHPC marked this pull request as draft March 28, 2025 17:10
@psychocoderHPC
Member Author

I have set this PR to draft and added debug output in the last commit.

@psychocoderHPC
Member Author

It could be that we are not allowed to run the Any test in SYCL. If I am not missing something, `sycl::any_of_group()` is a collective function, but we terminate some threads early and call `any` with only a few work-items in the group.

// Some threads quit the kernel to test that the warp operations
// properly operate on the active threads only
if(threadIdxInWarp % 2)
return;
for(auto idx = 0; idx < warpExtent; idx++)
{
ALPAKA_CHECK(*success, alpaka::warp::any(acc, threadIdxInWarp == idx ? 0 : 1) == 1);

4.17.2. Group functions
SYCL provides a number of functions that expose functionality tied to groups of work-items (such as
group barriers and collective operations). These group functions act as synchronization points and must
be encountered in converged control flow by all work-items in the group.
The behavior of every group function is as follows:
• Each work-item in the group arrives at the synchronization point associated with the group function,
then blocks until any operation(s) specified by the group function have completed.
• Once all work-items in the group have arrived, an unspecified subset of those work-items cooperate
to execute any operation(s) specified by the group function.
• When the set of cooperating work-items have completed execution of all operation(s) specified by the
group function, all work-items blocked on the synchronization point associated with the group function are unblocked.

@AuroraPerego
Contributor

It could be that we are not allowed to run the Any test in SYCL. If I am not missing something, `sycl::any_of_group()` is a collective function, but we terminate some threads early and call `any` with only a few work-items in the group.

We have already disabled the tests for sycl::all_of_group() for the same reason, so probably yes.

disable FPGA SYCL tests

CI_FILTER: linux_icpx
@fwyzard
Contributor

fwyzard commented Mar 29, 2025

I am currently trying to find out whether the AI is hallucinating or whether the following is true.

I strongly believe the AI is hallucinating; that would be a very weird and very common bug.

Looking in the source code, get_global_linear_id() is an always-inline function that computes the linear id from the N-dimensional values returned by get_global_id(), get_global_range() and get_offset().

So, if the N-dimensional values are correct, it would be extremely surprising that the linear id is wrong...

@fwyzard
Contributor

fwyzard commented Mar 29, 2025

We have already disabled the tests for sycl::all_of_group() for the same reason, so probably yes.

@psychocoderHPC, see #2470 .
Looks like we missed the sycl::any_of_group because it was (by chance ?) passing the unit test.

We need to agree what the behaviour of the alpaka warp functions should be, and in case implement #2485.

@psychocoderHPC
Member Author

@psychocoderHPC, see #2470 .
Looks like we missed the sycl::any_of_group because it was (by chance ?) passing the unit test.

Yes, and we missed `shfl`; this test is failing too.

We need to agree what the behaviour of the alpaka warp functions should be, and in case implement #2485.

Yes, we should find an agreement in the next meeting in mid-April.

@psychocoderHPC
Member Author

I am currently trying to understand the output of my last debug test. I disabled FPGA and ran CPU only. Within the `any` warp function I added some debug output. The strange thing is that the test passes many times with different warp sizes; later it starts to fail with a warp size of 32, and only thread zero writes the debug output.

[... my working cases with different warp sizes 4/8/16/32 ...]
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 0 : 1) == 1)'
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 1 : 0) == expected)'
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 1 : 0) == expected)'
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 1 : 0) == expected)'

I do not currently have an explanation for why we see so many valid outputs and then fail with thread zero only.
