
Optimize euclidean distance in host refine phase #689


Open
wants to merge 5 commits into base: branch-25.06

Conversation

anstellaire

@anstellaire anstellaire commented Feb 13, 2025

Issue

The original code (below) generated serial assembly and used the strictly-ordered fadda instruction on Arm with both GCC and Clang, which resulted in suboptimal performance.

for (size_t k = 0; k < dim; k++) {
  distance += DC::template eval<DistanceT>(query[k], row[k]);
}

Proposed solution

This PR provides a euclidean distance implementation optimized with partial vector sums (below), which enables vectorization at the cost of strictly-ordered floating-point accumulation.

template <typename DC, typename DistanceT, typename DataT>
DistanceT euclidean_distance_squared_generic(DataT const* a, DataT const* b, size_t n) {
  size_t constexpr max_vreg_len = 512 / (8 * sizeof(DistanceT));

  // max_vreg_len is a power of two, so this rounds n down to a multiple of it
  size_t n_rounded = n & ~(max_vreg_len - 1);
  DistanceT distance[max_vreg_len] = {0};

  for (size_t i = 0; i < n_rounded; i += max_vreg_len) {
    for (size_t j = 0; j < max_vreg_len; ++j) {
      distance[j] += DC::template eval<DistanceT>(a[i + j], b[i + j]);
    }
  }

  for (size_t i = n_rounded; i < n; ++i) {
    distance[i - n_rounded] += DC::template eval<DistanceT>(a[i], b[i]);
  }

  for (size_t i = 1; i < max_vreg_len; ++i) {
    distance[0] += distance[i];
  }

  return distance[0];
}

In addition, the PR includes an implementation with NEON intrinsics that provides further speedup on certain test cases (this can be removed if arch-specific code is undesired).

Results

[benchmark results image not preserved]

@anstellaire anstellaire requested a review from a team as a code owner February 13, 2025 13:02

copy-pr-bot bot commented Feb 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Feb 13, 2025
@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Feb 13, 2025
@cjnolet
Member

cjnolet commented Feb 13, 2025

/ok to test

@anstellaire
Author

anstellaire commented Feb 14, 2025

/ok to test

UPD:
@cjnolet, it seems CI can be triggered only by repository members; could you please run it one more time? I have updated the formatting with clang-format.

@lowener
Contributor

lowener commented Feb 19, 2025

/ok to test

Contributor

@tfeher tfeher left a comment


Thanks @anstellaire for the PR! It is a clean implementation and looks good overall.

You have changed the distance computation for the large batch size case, but did not change it for the small batch case (which is handled in a separate branch here). Is this because your benchmarks have shown no improvement for the small batch case? Or is it the other way around: do we see no improvement for small batch cases because the new distance computation routines are not used there?

(In any case, we can limit the scope of this PR to the large batch case, but please clarify the question above.)

Contributor

@tfeher tfeher left a comment


Thanks Anna for the PR. The changes look good to me! The remaining question about small batch refinement can be discussed separately.

@anstellaire
Author

You have changed the distance computation for the large batch size case, but did not change for the small batch case (which is handled in a separate branch here). Is this because your benchmarks have shown no improvement for the small batch case?

Correct, for small batch sizes I saw a minor performance degradation, so I decided to apply the optimization only to the large batch case.

@tfeher
Contributor

tfeher commented Mar 22, 2025

/ok to test


@tfeher tfeher changed the base branch from branch-25.04 to branch-25.06 April 24, 2025 15:16
Labels
cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Status: In Progress

4 participants