
Optimize euclidean distance in host refine phase #689


Open
wants to merge 5 commits into base: branch-25.06

Conversation

anstellaire

@anstellaire anstellaire commented Feb 13, 2025

Issue

The original code (below) generated serial assembly and used the strictly-ordered fadda instruction on Arm with both GCC and Clang, which resulted in suboptimal performance.

for (size_t k = 0; k < dim; k++) {
  distance += DC::template eval<DistanceT>(query[k], row[k]);
}

Proposed solution

This PR provides a euclidean distance implementation optimized with partial vector sums (below), which enables vectorization at the cost of strictly-ordered floating-point accumulation.

template <typename DC, typename DistanceT, typename DataT>
DistanceT euclidean_distance_squared_generic(DataT const* a, DataT const* b, size_t n) {
  size_t constexpr max_vreg_len = 512 / (8 * sizeof(DistanceT));

  // max_vreg_len is a power of two, so this rounds n down to a multiple of it
  size_t n_rounded = n & ~(max_vreg_len - 1);
  DistanceT distance[max_vreg_len] = {0};

  for (size_t i = 0; i < n_rounded; i += max_vreg_len) {
    for (size_t j = 0; j < max_vreg_len; ++j) {
      distance[j] += DC::template eval<DistanceT>(a[i + j], b[i + j]);
    }
  }

  for (size_t i = n_rounded; i < n; ++i) {
    distance[i - n_rounded] += DC::template eval<DistanceT>(a[i], b[i]);
  }

  for (size_t i = 1; i < max_vreg_len; ++i) {
    distance[0] += distance[i];
  }

  return distance[0];
}

In addition, the PR includes an implementation with NEON intrinsics that provides further speedup on certain test cases (this can be removed if arch-specific code is undesired).

Results

[benchmark results image not preserved]

@anstellaire anstellaire requested a review from a team as a code owner February 13, 2025 13:02

copy-pr-bot bot commented Feb 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Feb 13, 2025
@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Feb 13, 2025
@cjnolet
Member

cjnolet commented Feb 13, 2025

/ok to test

@anstellaire
Author

anstellaire commented Feb 14, 2025

/ok to test

UPD:
@cjnolet, it seems CI can be triggered only by repository members; could you please run it one more time? I have updated the formatting with clang-format.

@lowener
Contributor

lowener commented Feb 19, 2025

/ok to test

Contributor

@tfeher tfeher left a comment


Thanks @anstellaire for the PR! It is a clean implementation and looks good overall.

You have changed the distance computation for the large batch size case, but did not change it for the small batch case (which is handled in a separate branch here). Is this because your benchmarks have shown no improvement for the small batch case? Or is it the other way around: do we see no improvement for small batch cases because the new distance computation routines are not used there?

(In any case, we can limit the scope of this PR to the large batch case, but please clarify the question above.)

Contributor

@tfeher tfeher left a comment


Thanks Anna for the PR. The changes look good to me! The remaining question about small batch refinement can be discussed separately.

@anstellaire
Author

You have changed the distance computation for the large batch size case, but did not change for the small batch case (which is handled in a separate branch here). Is this because your benchmarks have shown no improvement for the small batch case?

Correct, for small batch sizes I saw a minor performance degradation, so I decided to apply the optimization only to the large batch case.

@tfeher
Contributor

tfeher commented Mar 22, 2025

/ok to test


@tfeher tfeher changed the base branch from branch-25.04 to branch-25.06 April 24, 2025 15:16
Labels
cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Status: In Progress

4 participants