[experimental] Use kernel foundry DBSCAN get_core optimizations #3592
ethanglaser wants to merge 15 commits into
/intelci: run
| ++iter; | ||
|
|
||
| const Float v = xi[k] - xj[k]; | ||
| sum = sycl::fma(v, v, sum); |
@Vika-F Would it be a problem to have std::fma in a header?
I guess it's not possible inside the parallel_for.
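For comparison, a minimal host-side sketch of the same accumulation written with std::fma from <cmath>. The helper name squared_distance_fma is hypothetical and not part of oneDAL; the kernel in the diff uses sycl::fma inside the parallel_for instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical host-side helper (not oneDAL code): squared Euclidean
// distance between two rows of length n, accumulated with std::fma.
template <typename Float>
Float squared_distance_fma(const Float* xi, const Float* xj, std::size_t n) {
    assert(xi != nullptr && xj != nullptr);
    Float sum = 0;
    for (std::size_t k = 0; k < n; ++k) {
        const Float v = xi[k] - xj[k];
        // sum = v * v + sum, computed with a single rounding step
        sum = std::fma(v, v, sum);
    }
    return sum;
}
```

Calling it on two float rows of length 3 returns the sum of the three squared differences.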
/intelci: run
Alexandr-Solovev left a comment
The changes are really good. Please fix clang-format, then run and attach the oneDAL/sklearnex benchmarks.
| const bk::event_vector& deps) { | ||
| const std::int64_t local_row_count = data.get_dimension(0); | ||
| const std::int64_t column_count = data.get_dimension(1); | ||
| const std::int64_t row_count64 = data.get_dimension(0); |
Maybe the naming update is redundant here.
| auto sg = item.get_sub_group(); | ||
| const std::uint32_t sg_id = sg.get_group_id()[0]; | ||
| if (sg_id > 0) | ||
| sycl::sub_group sg = item.get_sub_group(); |
I'm not sure the new one is better. It looks the same, so it may be redundant.
| const std::uint32_t wg_id = item.get_global_id(1); | ||
| if (wg_id >= local_row_count) | ||
| const std::uint32_t row_count = static_cast<std::uint32_t>(row_count64); | ||
| const std::uint32_t col_count = static_cast<std::uint32_t>(col_count64); |
ONEDAL_ASSERT(row_count64 <= std::numeric_limits<std::uint32_t>::max());
ONEDAL_ASSERT(col_count64 <= std::numeric_limits<std::uint32_t>::max());
There could be an overflow here.
Switched it back to int64 because it didn't seem worth it
Added assertion
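A minimal sketch of the checked narrowing discussed in this thread. The helper narrow_to_u32 is hypothetical, and the standard assert stands in for ONEDAL_ASSERT so the snippet compiles without oneDAL headers.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not oneDAL code): narrow a 64-bit dimension to
// uint32_t only after checking that the value fits.
inline std::uint32_t narrow_to_u32(std::int64_t value) {
    assert(value >= 0);
    assert(value <=
           static_cast<std::int64_t>(std::numeric_limits<std::uint32_t>::max()));
    return static_cast<std::uint32_t>(value);
}
```

Like most assertion macros, assert is compiled out when NDEBUG is defined, so this only guards debug builds; keeping the indices in 64-bit types avoids the problem in every build.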
@ethanglaser Do we have benchmarks for these changes?
This is with the first kernel foundry result, restoring to that now.
Applied some simplifications, addressed some comments, and added the same optimizations to the sendrecv_replace kernel. I will rerun unitrace and check the Aurora benchmarks next week when the CI machines are back online.
/intelci: run
/intelci: run
/intelci: run
/intelci: run
Pull request overview
This PR updates the oneAPI GPU backend implementation of DBSCAN “get_core” kernels to use a more optimized distance accumulation / pruning approach (kernel-foundry style), with some additional type narrowing to 32-bit indices.
Changes:
- Refactors get_core_wide_kernel inner loops to use subgroup lanes, sycl::fma, and periodic early-pruning reductions.
- Applies the same pruning/refactor approach to get_core_send_recv_replace_wide_kernel.
- Adds dimension upper-bound assertions in get_core_wide_kernel before narrowing to uint32_t.
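The early-pruning idea in the first bullet can be sketched on the host as a scalar loop. This is an illustration only, not the PR's kernel: the function name within_epsilon and the check_every parameter are hypothetical, the subgroup-lane split and reduction are left out, and it assumes a neighbour is a point whose Euclidean distance is at most epsilon.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical scalar sketch (not oneDAL code) of periodic early pruning.
// Returns true when the distance between xi and xj is at most epsilon.
template <typename Float>
bool within_epsilon(const Float* xi, const Float* xj, std::size_t n,
                    Float epsilon, std::size_t check_every = 8) {
    assert(check_every > 0);
    const Float eps2 = epsilon * epsilon;
    Float sum = 0;
    for (std::size_t k = 0; k < n; ++k) {
        const Float v = xi[k] - xj[k];
        sum = std::fma(v, v, sum);
        // The partial sum of squares never decreases, so once it exceeds
        // eps2 the remaining columns cannot bring the pair back in range.
        if ((k + 1) % check_every == 0 && sum > eps2) {
            return false;
        }
    }
    return sum <= eps2;
}
```

Checking only every check_every columns trades a few wasted iterations for fewer comparisons, which is the usual motivation for a periodic rather than per-element test.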
| const std::uint32_t base_i = row_i * col_count; | ||
| const Float* const xi = data_ptr + base_i; | ||
|
|
||
| Float count = neighbours_ptr[row_i]; | ||
|
|
||
| for (std::uint32_t j = 0; j < row_count; ++j) { | ||
| const Float* const xj = data_ptr + (j * col_count); |
base_i/xj pointer offsets are computed using 32-bit multiplication (row_i * col_count and j * col_count). Even though row_count/col_count are individually asserted to fit in uint32_t, their product can still overflow uint32_t, leading to incorrect pointer arithmetic and potential out-of-bounds reads on the device. Consider using std::uint64_t/std::size_t for offsets (or add an assert that row_count * col_count fits) before doing pointer arithmetic.
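A small sketch of the wraparound described above, using hypothetical helper names. With row = 70000 and col_count = 70000, each factor fits in uint32_t but the product (4,900,000,000) does not, so the 32-bit result wraps modulo 2^32.

```cpp
#include <cstdint>

// Hypothetical helper: offset computed in 32 bits, as in the diff.
// The result wraps modulo 2^32 when the product does not fit in uint32_t.
inline std::uint32_t offset_u32(std::uint32_t row, std::uint32_t col_count) {
    return row * col_count;
}

// Hypothetical helper: widen one operand first so the multiplication
// itself is carried out in 64 bits.
inline std::uint64_t offset_u64(std::uint32_t row, std::uint32_t col_count) {
    return static_cast<std::uint64_t>(row) * col_count;
}
```

The cast has to be applied to an operand, not to the finished product; casting the result of row * col_count would widen a value that has already wrapped.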
| const std::uint32_t row_count_local = | ||
| static_cast<std::uint32_t>(local_row_count); | ||
| const std::uint32_t row_count_repl = | ||
| static_cast<std::uint32_t>(row_count_replace); | ||
| const std::uint32_t col_count = static_cast<std::uint32_t>(column_count); |
local_row_count, row_count_replace, and column_count are cast to std::uint32_t without any upper-bound checks. If any dimension exceeds uint32_t::max(), the cast will truncate and the kernel will compute incorrect results (and may produce invalid pointer offsets later). Add ONEDAL_ASSERT upper-bound checks similar to get_core_wide_kernel before these casts, or keep the indices in 64-bit types.
| const std::uint32_t base_i = row_i * col_count; | ||
| const Float* const xi = data_ptr + base_i; | ||
|
|
||
| Float count = neighbours_ptr[row_i]; | ||
|
|
||
| for (std::uint32_t j = 0; j < row_count_repl; ++j) { | ||
| const Float* const xj = data_replace_ptr + (j * col_count); | ||
|
|
Like in get_core_wide_kernel, base_i and xj offsets are computed via 32-bit multiplication (row_i * col_count, j * col_count). This can overflow uint32_t when the total element count exceeds 2^32-1, resulting in incorrect pointer arithmetic. Prefer std::uint64_t/std::size_t offsets (or an explicit overflow-preventing assert on the product).
/intelci: run
How are we feeling about this @david-cortes-intel @Alexandr-Solovev @Vika-F @avolkov-intel?
| const std::uint32_t base_i = row_i * col_count; | ||
| const Float* const xi = data_ptr + base_i; | ||
|
|
||
| Float count = neighbours_ptr[row_i]; |
I see this is a Float, but it deals with counts. Shouldn't it be changed to an integer type?
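One concrete reason the type could matter, sketched under the assumption that Float is an IEEE-754 single-precision float: integers are represented exactly only up to 2^24 = 16,777,216, so incrementing a float counter past that value has no effect. Whether neighbour counts can get that large in practice is not stated in this thread, and both helper names below are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: add 1.0f to a float counter n times.
inline float count_as_float(float start, std::uint32_t n) {
    float c = start;
    for (std::uint32_t i = 0; i < n; ++i) {
        c += 1.0f; // stops changing once c reaches 2^24
    }
    return c;
}

// Hypothetical helper: the same count kept in an integer type.
inline std::uint32_t count_as_int(std::uint32_t start, std::uint32_t n) {
    std::uint32_t c = start;
    for (std::uint32_t i = 0; i < n; ++i) {
        ++c;
    }
    return c;
}
```

Starting both counters at 16,777,216 and adding ten increments leaves the float unchanged, while the integer reaches 16,777,226.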
