Skip to content

[experimental] Use kernel foundry DBSCAN get_core optimizations#3592

Open
ethanglaser wants to merge 15 commits into
uxlfoundation:mainfrom
ethanglaser:dev/eglaser-dbscan-kernel
Open

[experimental] Use kernel foundry DBSCAN get_core optimizations#3592
ethanglaser wants to merge 15 commits into
uxlfoundation:mainfrom
ethanglaser:dev/eglaser-dbscan-kernel

Conversation

@ethanglaser
Copy link
Copy Markdown
Contributor

@ethanglaser ethanglaser commented Apr 3, 2026

Description


Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended testing suite if new functionality was introduced in this PR.

Performance

  • I have measured performance for affected algorithms using scikit-learn_bench and provided at least a summary table with measured data, if performance change is expected.
  • I have provided justification why performance and/or quality metrics have changed or why changes are not expected.
  • I have extended the benchmarking suite and provided a corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

@ethanglaser ethanglaser changed the title Use kernel foundry DBSCAN get_core optimizations [experimental] Use kernel foundry DBSCAN get_core optimizations Apr 3, 2026
@ethanglaser
Copy link
Copy Markdown
Contributor Author

/intelci: run

++iter;

const Float v = xi[k] - xj[k];
sum = sycl::fma(v, v, sum);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Vika-F Would it be a problem to have std::fma in a header?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess its not possible inside the parallel_for

@ethanglaser
Copy link
Copy Markdown
Contributor Author

/intelci: run

Copy link
Copy Markdown
Contributor

@Alexandr-Solovev Alexandr-Solovev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes are really good. Please fix clang format and run+attach oneDAL/sklearenx benchmarks.

const bk::event_vector& deps) {
const std::int64_t local_row_count = data.get_dimension(0);
const std::int64_t column_count = data.get_dimension(1);
const std::int64_t row_count64 = data.get_dimension(0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be the naming update is redundat here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored

Comment on lines -99 to +101
auto sg = item.get_sub_group();
const std::uint32_t sg_id = sg.get_group_id()[0];
if (sg_id > 0)
sycl::sub_group sg = item.get_sub_group();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that the new one is better. Looks the same, may be redundat

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: the combined changes of a5477ec and 8ecfdbd did significantly reduce performance so I largely restored them

const std::uint32_t wg_id = item.get_global_id(1);
if (wg_id >= local_row_count)
const std::uint32_t row_count = static_cast<std::uint32_t>(row_count64);
const std::uint32_t col_count = static_cast<std::uint32_t>(col_count64);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ONEDAL_ASSERT(row_count64 <= std::numeric_limitsstd::uint32_t::max());
ONEDAL_ASSERT(col_count64 <= std::numeric_limitsstd::uint32_t::max());
Could be overflow here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched it back to int64 because it didn't seem worth it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added assertion

++iter;

const Float v = xi[k] - xj[k];
sum = sycl::fma(v, v, sum);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess its not possible inside the parallel_for

@david-cortes-intel
Copy link
Copy Markdown
Contributor

@ethanglaser Do we have benchmarks for these changes?

@ethanglaser
Copy link
Copy Markdown
Contributor Author

@ethanglaser Do we have benchmarks for these changes?

image

This is with the first kernel foundry result, restoring to that now.

@ethanglaser ethanglaser added the perf Performance optimization label Apr 17, 2026
@ethanglaser
Copy link
Copy Markdown
Contributor Author

Applied some simplifications, address some comments, and added same optimizations to sendrecv_replace kernel. I will rerun unitrace and check aurora benchmarks next week when CI machines are back online

@ethanglaser
Copy link
Copy Markdown
Contributor Author

/intelci: run

@ethanglaser
Copy link
Copy Markdown
Contributor Author

/intelci: run

@ethanglaser
Copy link
Copy Markdown
Contributor Author

/intelci: run

@ethanglaser
Copy link
Copy Markdown
Contributor Author

/intelci: run

@ethanglaser ethanglaser marked this pull request as ready for review April 21, 2026 00:22
Copilot AI review requested due to automatic review settings April 21, 2026 00:22
@ethanglaser ethanglaser requested a review from icfaust as a code owner April 21, 2026 00:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the oneAPI GPU backend implementation of DBSCAN “get_core” kernels to use a more optimized distance accumulation / pruning approach (kernel-foundry style), with some additional type narrowing to 32-bit indices.

Changes:

  • Refactors get_core_wide_kernel inner loops to use subgroup lanes, sycl::fma, and periodic early-pruning reductions.
  • Applies the same pruning/refactor approach to get_core_send_recv_replace_wide_kernel.
  • Adds dimension upper-bound assertions in get_core_wide_kernel before narrowing to uint32_t.

Comment on lines +124 to +130
const std::uint32_t base_i = row_i * col_count;
const Float* const xi = data_ptr + base_i;

Float count = neighbours_ptr[row_i];

for (std::uint32_t j = 0; j < row_count; ++j) {
const Float* const xj = data_ptr + (j * col_count);
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base_i/xj pointer offsets are computed using 32-bit multiplication (row_i * col_count and j * col_count). Even though row_count/col_count are individually asserted to fit in uint32_t, their product can still overflow uint32_t, leading to incorrect pointer arithmetic and potential out-of-bounds reads on the device. Consider using std::uint64_t/std::size_t for offsets (or add an assert that row_count * col_count fits) before doing pointer arithmetic.

Copilot uses AI. Check for mistakes.
Comment on lines +324 to +328
const std::uint32_t row_count_local =
static_cast<std::uint32_t>(local_row_count);
const std::uint32_t row_count_repl =
static_cast<std::uint32_t>(row_count_replace);
const std::uint32_t col_count = static_cast<std::uint32_t>(column_count);
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local_row_count, row_count_replace, and column_count are cast to std::uint32_t without any upper-bound checks. If any dimension exceeds uint32_t::max(), the cast will truncate and the kernel will compute incorrect results (and may produce invalid pointer offsets later). Add ONEDAL_ASSERT upper-bound checks similar to get_core_wide_kernel before these casts, or keep the indices in 64-bit types.

Copilot uses AI. Check for mistakes.
Comment on lines +342 to +349
const std::uint32_t base_i = row_i * col_count;
const Float* const xi = data_ptr + base_i;

Float count = neighbours_ptr[row_i];

for (std::uint32_t j = 0; j < row_count_repl; ++j) {
const Float* const xj = data_replace_ptr + (j * col_count);

Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like in get_core_wide_kernel, base_i and xj offsets are computed via 32-bit multiplication (row_i * col_count, j * col_count). This can overflow uint32_t when the total element count exceeds 2^32-1, resulting in incorrect pointer arithmetic. Prefer std::uint64_t/std::size_t offsets (or an explicit overflow-preventing assert on the product).

Copilot uses AI. Check for mistakes.
@ethanglaser
Copy link
Copy Markdown
Contributor Author

/intelci: run

@ethanglaser ethanglaser requested a review from Vika-F May 12, 2026 15:09
@ethanglaser
Copy link
Copy Markdown
Contributor Author

How are we feeling about this @david-cortes-intel @Alexandr-Solovev @Vika-F @avolkov-intel ?

const std::uint32_t base_i = row_i * col_count;
const Float* const xi = data_ptr + base_i;

Float count = neighbours_ptr[row_i];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is a Float, but it deals with counts. Shouldn't it be changed to an integer type?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

perf Performance optimization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants