
SYCL: fix index order #2488


Draft: wants to merge 2 commits into develop from fix-syclIndexOrder
Conversation

psychocoderHPC
Member

The fast-moving index in SYCL's `nd_item` is the rightmost one, which matches alpaka's index order.
In our code base we implemented it like CUDA's index order, where the leftmost index is the fast-moving one.

This PR should be backported to version 1.3.

You can read more about it here: https://www.intel.com/content/www/us/en/docs/dpcpp-compatibility-tool/developer-guide-reference/2023-2/cuda-and-sycl-programming-model-comparison.html

Thanks to @SimeonEhrig for pointing me to this issue.

AuroraPerego previously approved these changes Mar 27, 2025
@psychocoderHPC
Member Author

I updated the PR because I had missed changing `TaskKernelGenericSycl.hpp`, where we translate the alpaka work division into SYCL ranges.

@psychocoderHPC psychocoderHPC force-pushed the fix-syclIndexOrder branch 2 times, most recently from 5c30c29 to 9797f22 Compare March 27, 2025 15:41
@AuroraPerego AuroraPerego self-requested a review March 27, 2025 17:47
@psychocoderHPC psychocoderHPC marked this pull request as draft March 28, 2025 13:08
@psychocoderHPC psychocoderHPC marked this pull request as ready for review March 28, 2025 13:57
@psychocoderHPC
Member Author

@AuroraPerego could you try this PR on a CPU SYCL device? On an Intel GPU all tests pass, but for unknown reasons the CI fails where we execute the code on a CPU accelerator.

The error is:

ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 0 : 1) == 1)'
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 1 : 0) == expected)'

The fast-moving index in SYCL's `nd_item` is the rightmost one, which
matches alpaka's index order.
In our code base we implemented it like CUDA's index order, where the
leftmost index is the fast-moving one.
@psychocoderHPC
Member Author

I think a comment from @fwyzard is maybe a good starting point for the current problem seen on FPGA emulation and CPU:

// Workaround for a weird bug in oneAPI 2024.x targetting the CPU backend and FPGA emulator.
if constexpr(accMatchesTags<TAcc, TagCpuSycl, TagFpgaSyclIntel>)
{
// SYCL accelerator specific code
return acc.m_item_workdiv.get_global_linear_id() == 0;
}

In the original alpaka code we permuted the indices twice: once before the kernel start, to calculate the grid size, and again within the kernel, where we permute all SYCL indices back. If we linearize the permuted indices, the result can differ from what `get_global_linear_id()` returns, or, when we talk about warp functions, from `get_sub_group().get_local_linear_id()`.

I am currently trying to find out whether the AI is hallucinating or whether the following is true.

There is a known issue with the get_global_linear_id() function in SYCL, which affects FPGA and CPU devices but not GPUs. The problem arises because this function relies on the get_global_id() function, which returns the global ID of a work item within a work group.

On FPGAs, the global ID calculation is different due to the way FPGAs handle parallelism. Specifically, FPGAs use a concept called "work-item replication," where multiple work items are executed in parallel within a single processing element. As a result, the get_global_id() function can return an ID that is not unique across the entire work group, leading to incorrect results when using get_global_linear_id().

Similarly, on CPUs, the get_global_linear_id() function can be affected by the way CPUs handle parallelism. Since CPUs execute work items sequentially within a work group, the get_global_id() function may return an ID that is not representative of the actual global index of the work item.

On the other hand, GPUs are designed to handle massive parallelism and have optimized architectures for handling work-item IDs. As a result, the get_global_linear_id() function typically works as expected on GPUs.

To work around this issue, you can use alternative methods to calculate the global linear ID, such as manually calculating the ID based on the work group size and the local ID of the work item. Alternatively, you can use the get_global_id() function with the range<3> parameter to get the global ID of the work item in three-dimensional space and then calculate the linear ID manually.

It's worth noting that this issue may be addressed in future versions of the SYCL specification or by specific SYCL implementations. If you're experiencing issues with get_global_linear_id() on FPGAs or CPUs, I recommend checking the documentation of your SYCL implementation or reaching out to the vendor for more information.

@psychocoderHPC psychocoderHPC marked this pull request as draft March 28, 2025 17:10
@psychocoderHPC
Member Author

I have set this PR to draft and added debug output in the last commit.

@psychocoderHPC
Member Author

It could be that we are not allowed to run the Any test in SYCL. If I am not missing something, `sycl::any_of_group()` is a collective function, but we terminate some threads early and call `any` with only a few work-items in the group.

// Some threads quit the kernel to test that the warp operations
// properly operate on the active threads only
if(threadIdxInWarp % 2)
return;
for(auto idx = 0; idx < warpExtent; idx++)
{
ALPAKA_CHECK(*success, alpaka::warp::any(acc, threadIdxInWarp == idx ? 0 : 1) == 1);

4.17.2. Group functions
SYCL provides a number of functions that expose functionality tied to groups of work-items (such as
group barriers and collective operations). These group functions act as synchronization points and must
be encountered in converged control flow by all work-items in the group.
The behavior of every group function is as follows:
• Each work-item in the group arrives at the synchronization point associated with the group function,
then blocks until any operation(s) specified by the group function have completed.
• Once all work-items in the group have arrived, an unspecified subset of those work-items cooperate
to execute any operation(s) specified by the group function.
• When the set of cooperating work-items have completed execution of all operation(s) specified by the
group function, all work-items blocked on the synchronization point associated with the group function are unblocked.

@AuroraPerego
Contributor

It could be that we are not allowed to run the Any test in SYCL. If I am not missing something, `sycl::any_of_group()` is a collective function, but we terminate some threads early and call `any` with only a few work-items in the group.

We have already disabled the tests for sycl::all_of_group() for the same reason, so probably yes.

disable FPGA SYCL tests

CI_FILTER: linux_icpx
@fwyzard
Contributor

fwyzard commented Mar 29, 2025

I am currently trying to find out whether the AI is hallucinating or whether the following is true.

I strongly believe the AI is hallucinating; that would be a very weird and very common bug.

Looking in the source code, get_global_linear_id() is an always-inline function that computes the linear id from the N-dimensional values returned by get_global_id(), get_global_range() and get_offset().

So, if the N-dimensional values are correct, it would be extremely surprising that the linear id is wrong...

@fwyzard
Contributor

fwyzard commented Mar 29, 2025

We have already disabled the tests for sycl::all_of_group() for the same reason, so probably yes.

@psychocoderHPC, see #2470 .
Looks like we missed the sycl::any_of_group because it was (by chance ?) passing the unit test.

We need to agree what the behaviour of the alpaka warp functions should be, and in case implement #2485.

@psychocoderHPC
Member Author

@psychocoderHPC, see #2470 .
Looks like we missed the sycl::any_of_group because it was (by chance ?) passing the unit test.

Yes, and we missed `shfl`; this test is failing too.

We need to agree what the behaviour of the alpaka warp functions should be, and in case implement #2485.

Yes, we should find an agreement in the next meeting in mid-April.

@psychocoderHPC
Member Author

I am currently trying to understand the output of my last debug test. I disabled FPGA and ran CPU only. Within the `any` warp function I added some debug output. The strange thing is that the test passes many times with different warp sizes; later it starts to fail with a warp size of 32, and only thread zero writes the debug output.

[... my working cases with different warp sizes 4/8/16/32 ...]
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=4
id=2 lid=2 max=4
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 0 : 1) == 1)'
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 1 : 0) == expected)'
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 1 : 0) == expected)'
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
id=0 lid=0 max=32
ALPAKA_CHECK failed because '!(alpaka::warp::any(acc, threadIdxInWarp == idx ? 1 : 0) == expected)'

I do not currently have an explanation for why we see so many valid outputs and then fail with thread zero only.
