[experimental] Use kernel foundry DBSCAN get_core optimizations by ethanglaser · Pull Request #3592 · uxlfoundation/oneDAL

ethanglaser · 2026-04-03T15:23:41Z

Description

Checklist:

Completeness and readability

I have commented my code, particularly in hard-to-understand areas.
I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
I have resolved any merge conflicts that might occur with the base branch.

Testing

I have run it locally and tested the changes extensively.
All CI jobs are green or I have provided justification why they aren't.
I have extended testing suite if new functionality was introduced in this PR.

Performance

I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if performance change is expected.
I have provided justification why performance and/or quality metrics have changed or why changes are not expected.
I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

ethanglaser · 2026-04-03T15:24:05Z

/intelci: run

david-cortes-intel · 2026-04-07T06:37:46Z

+                            ++iter;
+
+                            const Float v = xi[k] - xj[k];
+                            sum = sycl::fma(v, v, sum);


@Vika-F Would it be a problem to have std::fma in a header?

I guess it's not possible inside the parallel_for.
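For reference, the accumulation in the hunk above can be sketched on the host with `std::fma`. This is only an illustration in plain C++. Whether `std::fma` can be used inside the device kernel is the open question in this thread, and `squared_distance_fma` is a hypothetical name, not a oneDAL function.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of oneDAL): squared Euclidean distance
// between two rows of length n, accumulated with a fused multiply-add.
template <typename Float>
Float squared_distance_fma(const Float* xi, const Float* xj, std::size_t n) {
    assert(xi != nullptr && xj != nullptr);
    Float sum = Float(0);
    for (std::size_t k = 0; k < n; ++k) {
        const Float v = xi[k] - xj[k];
        // Computes v * v + sum with a single rounding step.
        sum = std::fma(v, v, sum);
    }
    return sum;
}
```

For example, the rows `{1, 2, 3}` and `{4, 6, 3}` differ by `{-3, -4, 0}`, so the helper returns 25.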

ethanglaser · 2026-04-13T15:48:11Z

/intelci: run

Alexandr-Solovev

The changes are really good. Please fix the clang-format issues, then run the oneDAL/sklearnex benchmarks and attach the results.

Alexandr-Solovev · 2026-04-16T16:39:56Z

                    const bk::event_vector& deps) {
-        const std::int64_t local_row_count = data.get_dimension(0);
-        const std::int64_t column_count = data.get_dimension(1);
+        const std::int64_t row_count64 = data.get_dimension(0);


Maybe the naming update is redundant here.

Alexandr-Solovev · 2026-04-16T16:43:58Z

-                    auto sg = item.get_sub_group();
-                    const std::uint32_t sg_id = sg.get_group_id()[0];
-                    if (sg_id > 0)
+                    sycl::sub_group sg = item.get_sub_group();


I'm not sure the new version is better. It looks the same, so the change may be redundant.

Update: the combined changes in a5477ec and 8ecfdbd significantly reduced performance, so I largely reverted them.

Alexandr-Solovev · 2026-04-16T16:47:39Z

-                    const std::uint32_t wg_id = item.get_global_id(1);
-                    if (wg_id >= local_row_count)
+                    const std::uint32_t row_count = static_cast<std::uint32_t>(row_count64);
+                    const std::uint32_t col_count = static_cast<std::uint32_t>(col_count64);


ONEDAL_ASSERT(row_count64 <= std::numeric_limits<std::uint32_t>::max());
ONEDAL_ASSERT(col_count64 <= std::numeric_limits<std::uint32_t>::max());
There could be an overflow here.

Switched it back to int64 because it didn't seem worth it

Added assertion
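A minimal sketch of the check-then-narrow pattern discussed in this thread. It uses the standard `assert` as a stand-in for `ONEDAL_ASSERT`, and the helper names `narrow_to_u32` and `unchecked_u32` are hypothetical, not code from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: narrow a 64-bit dimension to uint32_t only after
// checking that the value is representable.
inline std::uint32_t narrow_to_u32(std::int64_t value) {
    assert(value >= 0);
    assert(value <= static_cast<std::int64_t>(std::numeric_limits<std::uint32_t>::max()));
    return static_cast<std::uint32_t>(value);
}

// For comparison: an unchecked cast keeps only the low 32 bits,
// so out-of-range values silently wrap modulo 2^32.
inline std::uint32_t unchecked_u32(std::int64_t value) {
    return static_cast<std::uint32_t>(value);
}
```

With the unchecked cast, a dimension of 2^32 + 5 becomes 5, which is the silent truncation the assertion is meant to catch in debug builds.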

Alexandr-Solovev · 2026-04-16T16:51:00Z

+                            ++iter;
+
+                            const Float v = xi[k] - xj[k];
+                            sum = sycl::fma(v, v, sum);


I guess it's not possible inside the parallel_for.

david-cortes-intel · 2026-04-17T07:20:52Z

@ethanglaser Do we have benchmarks for these changes?

ethanglaser · 2026-04-17T18:47:19Z

@ethanglaser Do we have benchmarks for these changes?

This is with the first kernel foundry result, restoring to that now.

ethanglaser · 2026-04-17T19:39:35Z

Applied some simplifications, addressed some comments, and added the same optimizations to the sendrecv_replace kernel. I will rerun unitrace and check the Aurora benchmarks next week when the CI machines are back online.

ethanglaser · 2026-04-20T17:34:12Z

/intelci: run

ethanglaser · 2026-04-20T21:59:13Z

/intelci: run

ethanglaser · 2026-04-20T22:05:03Z

/intelci: run

ethanglaser · 2026-04-20T22:12:15Z

/intelci: run

Copilot

Pull request overview

This PR updates the oneAPI GPU backend implementation of DBSCAN “get_core” kernels to use a more optimized distance accumulation / pruning approach (kernel-foundry style), with some additional type narrowing to 32-bit indices.

Changes:

Refactors get_core_wide_kernel inner loops to use subgroup lanes, sycl::fma, and periodic early-pruning reductions.
Applies the same pruning/refactor approach to get_core_send_recv_replace_wide_kernel.
Adds dimension upper-bound assertions in get_core_wide_kernel before narrowing to uint32_t.
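The early-pruning idea in the summary above can be illustrated with a scalar host-side sketch. It assumes the kernel compares a squared distance against a squared radius. The name `within_radius` and the `check_interval` parameter are hypothetical. The actual kernels spread the columns across subgroup lanes and reduce the partial sums, which this sketch does not attempt to reproduce.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical scalar sketch: returns true if the squared distance between
// xi and xj is at most eps_sq. Every check_interval columns the partial sum
// is compared with eps_sq so that hopeless pairs are abandoned early.
template <typename Float>
bool within_radius(const Float* xi,
                   const Float* xj,
                   std::size_t n,
                   Float eps_sq,
                   std::size_t check_interval = 8) {
    assert(check_interval > 0);
    Float sum = Float(0);
    for (std::size_t k = 0; k < n; ++k) {
        const Float v = xi[k] - xj[k];
        sum = std::fma(v, v, sum);
        // Each term is non-negative, so the partial sum never decreases:
        // once it exceeds eps_sq the pair cannot be within the radius.
        if ((k + 1) % check_interval == 0 && sum > eps_sq) {
            return false;
        }
    }
    return sum <= eps_sq;
}
```

A smaller `check_interval` prunes sooner but adds more comparisons per row pair, so the best value would have to come from benchmarks rather than from this sketch.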

Copilot · 2026-04-21T00:26:28Z

+                    const std::uint32_t base_i = row_i * col_count;
+                    const Float* const xi = data_ptr + base_i;
+
+                    Float count = neighbours_ptr[row_i];
+
+                    for (std::uint32_t j = 0; j < row_count; ++j) {
+                        const Float* const xj = data_ptr + (j * col_count);


base_i/xj pointer offsets are computed using 32-bit multiplication (row_i * col_count and j * col_count). Even though row_count/col_count are individually asserted to fit in uint32_t, their product can still overflow uint32_t, leading to incorrect pointer arithmetic and potential out-of-bounds reads on the device. Consider using std::uint64_t/std::size_t for offsets (or add an assert that row_count * col_count fits) before doing pointer arithmetic.
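The wraparound described in this comment can be reproduced on the host. The sketch below uses hypothetical helper names and the standard `assert` in place of `ONEDAL_ASSERT`. It contrasts a 32-bit product with a widened 64-bit product and shows the alternative of asserting that the product fits.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// 32-bit offset: the product wraps modulo 2^32 when it does not fit.
inline std::uint32_t offset_u32(std::uint32_t row, std::uint32_t col_count) {
    return row * col_count;
}

// 64-bit offset: widen one operand before multiplying so the product is exact.
inline std::uint64_t offset_u64(std::uint32_t row, std::uint32_t col_count) {
    return static_cast<std::uint64_t>(row) * col_count;
}

// Alternative: keep the 32-bit offset but assert that the product fits.
inline std::uint32_t checked_offset_u32(std::uint32_t row, std::uint32_t col_count) {
    const std::uint64_t wide = offset_u64(row, col_count);
    assert(wide <= std::numeric_limits<std::uint32_t>::max());
    return static_cast<std::uint32_t>(wide);
}
```

With 70000 rows of 70000 columns, each dimension fits in `uint32_t`, but the offset of the last row start is about 4.9 billion and wraps in 32-bit arithmetic.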

Copilot · 2026-04-21T00:26:29Z

+                    const std::uint32_t row_count_local =
+                        static_cast<std::uint32_t>(local_row_count);
+                    const std::uint32_t row_count_repl =
+                        static_cast<std::uint32_t>(row_count_replace);
+                    const std::uint32_t col_count = static_cast<std::uint32_t>(column_count);


local_row_count, row_count_replace, and column_count are cast to std::uint32_t without any upper-bound checks. If any dimension exceeds uint32_t::max(), the cast will truncate and the kernel will compute incorrect results (and may produce invalid pointer offsets later). Add ONEDAL_ASSERT upper-bound checks similar to get_core_wide_kernel before these casts, or keep the indices in 64-bit types.

Copilot · 2026-04-21T00:26:29Z

+                    const std::uint32_t base_i = row_i * col_count;
+                    const Float* const xi = data_ptr + base_i;
+
+                    Float count = neighbours_ptr[row_i];
+
+                    for (std::uint32_t j = 0; j < row_count_repl; ++j) {
+                        const Float* const xj = data_replace_ptr + (j * col_count);
+


Like in get_core_wide_kernel, base_i and xj offsets are computed via 32-bit multiplication (row_i * col_count, j * col_count). This can overflow uint32_t when the total element count exceeds 2^32-1, resulting in incorrect pointer arithmetic. Prefer std::uint64_t/std::size_t offsets (or an explicit overflow-preventing assert on the product).

ethanglaser · 2026-04-29T18:55:35Z

/intelci: run

ethanglaser · 2026-05-12T15:10:21Z

How are we feeling about this @david-cortes-intel @Alexandr-Solovev @Vika-F @avolkov-intel ?

david-cortes-intel · 2026-05-20T09:44:53Z

+                    const std::uint32_t base_i = row_i * col_count;
+                    const Float* const xi = data_ptr + base_i;
+
+                    Float count = neighbours_ptr[row_i];


I see this is a Float, but it deals with counts. Shouldn't it be changed to an integer type?
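One consideration relevant to this question: a 32-bit float represents consecutive integers exactly only up to 2^24, so a float counter stops advancing past that point. The sketch below demonstrates this with a hypothetical `count_up` helper. It assumes unit increments, which may not match how `neighbours_ptr` is filled in the kernel (for example, if the values are weighted).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: add 1 to a counter `steps` times, starting at `start`.
template <typename Counter>
Counter count_up(Counter start, std::uint32_t steps) {
    Counter c = start;
    for (std::uint32_t s = 0; s < steps; ++s) {
        c += Counter(1);
    }
    return c;
}
```

Starting from 16777216 (2^24), four increments leave a `float` counter unchanged because 16777217 is not representable, while an integer counter reaches 16777220.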

Use kernel foundry DBSCAN get_core optimizations

2978a51

ethanglaser changed the title ~~Use kernel foundry DBSCAN get_core optimizations~~ [experimental] Use kernel foundry DBSCAN get_core optimizations Apr 3, 2026

david-cortes-intel reviewed Apr 7, 2026

ethanglaser and others added 2 commits April 13, 2026 08:46

Take 2 on optimization

5036a47

Merge branch 'main' into dev/eglaser-dbscan-kernel

ac227a1

resolve errors

49d822c

Alexandr-Solovev reviewed Apr 16, 2026

ethanglaser added 2 commits April 17, 2026 11:55

restore to original

5fa381f

Merge branch 'main' into dev/eglaser-dbscan-kernel

44bcadc

ethanglaser added the perf Performance optimization label Apr 17, 2026

ethanglaser added 3 commits April 17, 2026 12:32

simplify kernel diff from main

8ecfdbd

apply optimizations to sendrecv_replace kernel

ecec4dd

restore unnecessary diff

a5477ec

clang formatting

9ec2061

restore restored diff due to weak perf of last

61fd8b7

add overflow assertions

5a9d63a

clang formatting

0fd4cbc

ethanglaser marked this pull request as ready for review April 21, 2026 00:22

ethanglaser requested a review from avolkov-intel as a code owner April 21, 2026 00:22

Copilot AI review requested due to automatic review settings April 21, 2026 00:22

ethanglaser requested a review from icfaust as a code owner April 21, 2026 00:22

Copilot started reviewing on behalf of ethanglaser April 21, 2026 00:22

Copilot AI reviewed Apr 21, 2026

ethanglaser requested review from Alexandr-Solovev and david-cortes-intel April 21, 2026 17:17

Merge branch 'main' into dev/eglaser-dbscan-kernel

32d81c3

Merge branch 'main' into dev/eglaser-dbscan-kernel

6ffbcca

ethanglaser requested a review from Vika-F May 12, 2026 15:09

david-cortes-intel reviewed May 20, 2026
