Conversation

0ctopus13prime
Collaborator

Description

RFC: #2875

This PR introduces a new build structure in k-NN for SIMD-based computation, implementing the V1 design described in the RFC linked above.
V1 relies on Faiss distance calculation functions for similarity scoring. The structure introduced here is designed to make it easy to add bulk SIMD operations in subsequent versions. Starting with V1 helps reduce the initial review complexity for maintainers.

The core concept is to extract mapped memory pointers from MemorySegmentIndexInput and leverage native SIMD acceleration to enhance search performance.
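As an illustration of the reflective-extraction idea, here is a minimal, self-contained sketch. The class and field names below are made up for the example; the real code targets Lucene's MemorySegmentIndexInput and its internal MemorySegment[]:

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;

public class SegmentExtractionSketch {
    // Hypothetical stand-in for Lucene's MemorySegmentIndexInput, which keeps
    // its mapped regions in a private field (field name assumed for the sketch).
    static class FakeIndexInput {
        private final ByteBuffer[] segments = { ByteBuffer.allocateDirect(16) };
    }

    // Reflectively pulls the private segments array out: the same general
    // technique the PR describes for extracting MemorySegment[].
    static Object[] extractSegments(Object indexInput) throws Exception {
        Field f = indexInput.getClass().getDeclaredField("segments");
        f.setAccessible(true);
        return (Object[]) f.get(indexInput);
    }

    public static void main(String[] args) throws Exception {
        Object[] segments = extractSegments(new FakeIndexInput());
        System.out.println(segments.length); // 1
    }
}
```

Once the mapped regions are in hand, their native addresses can be passed across JNI so the native SIMD code reads vectors directly from the mapped file, with no per-vector copy into the JVM heap.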

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [O] New functionality includes testing.
  • [O] New functionality has been documented.
  • [O] API changes companion pull request created.
  • [O] Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@0ctopus13prime
Collaborator Author

0ctopus13prime commented Oct 6, 2025

Performance improvement summary

With early termination + native scoring for FP16,
I'm seeing a 70% QPS improvement over Faiss C++ in the multi-segment scenario with Cohere-10M, at a 1-2% recall drop.
For the single-segment case, QPS is slightly lower than with Faiss C++, which I suspect is due to the overhead of the JNI call plus the reflection logic extracting MemorySegment[].

This matches what we saw in the POC performance benchmark in the RFC.
Therefore, once the BulkSimd + prefetch optimizations (i.e. V2) come into play, I think it will improve further.

@@ -0,0 +1,562 @@
#include <algorithm>
Collaborator Author

oops, mistakenly included V2.
This will be removed in the next revision.

@@ -0,0 +1,538 @@
#include <algorithm>
Collaborator Author

oops, mistakenly included V2.
This will be removed in the next revision.

@@ -0,0 +1,690 @@
#include <algorithm>
Collaborator Author

oops, mistakenly included V2.
This will be removed in the next revision.

@navneet1v
Collaborator

@0ctopus13prime can we fix the failing CIs? It seems some tests have failed.

@navneet1v
Collaborator

Therefore, once BulkSimd + prefetch optimization (e.g. V2) comes in play, I think it will be improved further.

@0ctopus13prime why are we not implementing V2 here and going with V1 instead?

@0ctopus13prime
Collaborator Author

@navneet1v
Yeah, I think it's a flaky test: KNN1030CodecTests complains that Settings was null, so initialization is probably missing.
As for Windows, it's a MinGW compilation issue that I'm fixing right now.

Also, I'm taking an iterative path to V2. Including V2 in this PR would make it too heavy to review, as 36 files have already been modified 😅
Since V2 will only add intrinsics for each chip, it will be much easier to review than this version.

@0ctopus13prime 0ctopus13prime force-pushed the use-faise-scoring-v1 branch 5 times, most recently from 060028c to 2ed780f Compare October 6, 2025 16:47
@0ctopus13prime 0ctopus13prime changed the base branch from main to feature/fp16-faiss-bulk October 6, 2025 16:52
Signed-off-by: Doo Yong Kim <[email protected]>
@0ctopus13prime
Collaborator Author

Hi @Vikasht34 @shatejas
Fixed the build failure on Windows. Please take a look!

private final KnnCollectorManager knnCollectorManager;

// Ported from Lucene because, as of 10.3.0, TopKnnCollectorManager no longer uses MultiLeafKnnCollector.
private static class MultiLeafTopKnnCollectorManager implements KnnCollectorManager {
Collaborator Author

@0ctopus13prime 0ctopus13prime Oct 7, 2025

This will not be put into main; it's for benchmarking.

@Vikasht34
Collaborator

Will look into PR tomorrow morning

Collaborator

@shatejas shatejas left a comment

Still working through the PR - cpp review is pending

public static FloatVectorValues getBottomFloatVectorValues(KnnVectorValues knnVectorValues) {
if (knnVectorValues instanceof FloatVectorValues floatVectorValues) {
while (floatVectorValues instanceof WrappedFloatVectorValues wrappedFloatVectorValues) {
floatVectorValues = wrappedFloatVectorValues.nestedVectorValues;
Collaborator

Just curious, are we expecting multiple layers of this? From what I understand, these will always be flat values from the Faiss graph. Do we really need a while loop here? Maybe an assert !(wrappedFloatVectorValues.nestedVectorValues instanceof WrappedFloatVectorValues) is sufficient before returning.

Collaborator Author

No, we're not expecting multiple layers. I just added a general way to handle wrappers.
Will get rid of the loop in the next rev.
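For illustration, a minimal sketch of the single-unwrap-plus-guard shape suggested above, using hypothetical stand-in types rather than the actual Lucene/k-NN classes:

```java
public class UnwrapSketch {
    // Hypothetical stand-ins for the FloatVectorValues types discussed above.
    interface FloatVectorValuesLike {}
    static class FlatValues implements FloatVectorValuesLike {}
    static class WrappedValues implements FloatVectorValuesLike {
        final FloatVectorValuesLike nested;
        WrappedValues(FloatVectorValuesLike nested) { this.nested = nested; }
    }

    // Single-layer unwrap: assumes wrappers are never stacked, and fails fast otherwise.
    static FloatVectorValuesLike unwrap(FloatVectorValuesLike values) {
        if (values instanceof WrappedValues wrapped) {
            if (wrapped.nested instanceof WrappedValues) {
                throw new IllegalStateException("Unexpected multi-layer wrapping");
            }
            return wrapped.nested;
        }
        return values;
    }

    public static void main(String[] args) {
        FlatValues flat = new FlatValues();
        System.out.println(unwrap(new WrappedValues(flat)) == flat); // true
        System.out.println(unwrap(flat) == flat);                    // true
    }
}
```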

public abstract class WrappedFloatVectorValues extends FloatVectorValues {

// The wrapped (nested) {@link FloatVectorValues} instance.
protected final FloatVectorValues nestedVectorValues;
Collaborator

The prefix nested can be confusing, considering k-NN supports nested index mappings. I understand this refers to how Faiss stores the index in a nested fashion, but maybe flatVectorValues or something else would avoid the confusion?

Collaborator Author

Sure, I will just use flatVectorValues.
Will update in the next rev.

// Collect only chunks that overlap with baseOffset or are placed after it
addressAndSize[addressIndex] = address;
addressIndex += 2;
addressAndSize[sizeIndex] = chunkSize;
Collaborator

Just checking: do we need to adjust the final chunkSize based on the size of the faissSection? You only need access to the flat vectors, right? I think this gives access up to the end of the Faiss file. Let me know if I'm missing something here.

Collaborator Author

Yes, you're right, you're not missing anything. What we need is exactly the flat vector section. The starting offset is already handled, but for the end offset of the section, as you said, the size of the file is being used.
Do you think it would be better to make it explicitly require startOffset and size?

Collaborator

Do you think it would be better to make it explicitly require startOffset and size?

Yes, I know it's being managed correctly, but it's best to give access only to the required section.

Collaborator Author

sure, will update in the next rev.
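To illustrate the slicing being discussed, here is a self-contained sketch (not the PR's actual code; the interleaved [address, size, address, size, ...] layout and the method name are assumptions) that trims a chunk list down to exactly the [baseOffset, baseOffset + sectionSize) window of the file:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkTrimSketch {
    // Trims mapped chunks (parallel address/size pairs laid out as
    // [addr0, size0, addr1, size1, ...]) to the section window
    // [baseOffset, baseOffset + sectionSize) of the underlying file.
    static long[] trimToSection(long[] addressAndSize, long baseOffset, long sectionSize) {
        long sectionEnd = baseOffset + sectionSize;
        List<Long> out = new ArrayList<>();
        long fileOffset = 0; // running file offset of the current chunk's start
        for (int i = 0; i < addressAndSize.length; i += 2) {
            long address = addressAndSize[i];
            long chunkSize = addressAndSize[i + 1];
            long chunkStart = fileOffset;
            long chunkEnd = fileOffset + chunkSize;
            fileOffset = chunkEnd;
            // Keep only the part of this chunk that overlaps the section.
            long overlapStart = Math.max(chunkStart, baseOffset);
            long overlapEnd = Math.min(chunkEnd, sectionEnd);
            if (overlapStart < overlapEnd) {
                out.add(address + (overlapStart - chunkStart));
                out.add(overlapEnd - overlapStart);
            }
        }
        long[] result = new long[out.size()];
        for (int i = 0; i < result.length; i++) result[i] = out.get(i);
        return result;
    }

    public static void main(String[] args) {
        // Two 100-byte chunks at addresses 1000 and 5000; section = file bytes [50, 150).
        long[] trimmed = trimToSection(new long[] {1000, 100, 5000, 100}, 50, 100);
        System.out.println(java.util.Arrays.toString(trimmed)); // [1050, 50, 5000, 50]
    }
}
```

Passing both startOffset and size, as agreed above, bounds the result on both ends so the scorer only ever sees the flat vector section.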

*/
@Override
public long[] extractAddressAndSize(final IndexInput indexInput) {
public long[] extractAddressAndSize(final IndexInput indexInput, long baseOffset) {
Collaborator

So the responsibility of this method is to extract addressAndSize; shouldn't the manipulation of base and sectionSize happen in MMap*VectorValues?

I think this still works since addressAndSize is not a member variable. I just want to confirm whether, like baseOffset, we need the total size as a parameter in this method.

Collaborator Author

MMap*VectorValues is a general VectorValues that returns addressAndSize; it's then the FaissIndex implementation's (e.g. FaissIndexScalarQuantizedFlat) responsibility to slice the necessary parts from it and hand them to the scorer. Since each FaissIndex implementation knows the flat vector section size within the Faiss index, I divided the responsibility into two parts:

  1. MMap*VectorValues returns the whole addressAndSize for the underlying file.
  2. Each FaissIndex implementation then slices it based on the flat vector FaissSection.

Will add a size parameter so it returns exactly the region needed for scoring.

Comment on lines +61 to +83
final long endOffsetExclusive = address + chunkSize;
if (endOffsetExclusive > baseOffset) {
// If this chunk contains `baseOffset`, then force its address value to be `baseOffset`
if (address < baseOffset) {
chunkSize = endOffsetExclusive - baseOffset;
address += baseOffset - startOffset;
}

// Collect only chunks that overlap with baseOffset or are placed after it
addressAndSize[addressIndex] = address;
addressIndex += 2;
addressAndSize[sizeIndex] = chunkSize;
sizeIndex += 2;
}
startOffset += originalChunkSize;
}

if (addressIndex != addressAndSize.length) {
// A chunk was excluded, so shrink the array
long[] newAddressAndSize = new long[addressIndex];
System.arraycopy(addressAndSize, 0, newAddressAndSize, 0, addressIndex);
return newAddressAndSize;
}
Collaborator

nit: can we extract this into a common method and avoid duplicates

Collaborator Author

Sure, will do in next rev.

reconstructor
);
} else {
log.debug("Failed to extract mapped pointers from IndexInput, falling back to ByteVectorValuesImpl.");
Collaborator

nit: FloatVectorValuesImpl

Collaborator Author

will update in the next rev.

@shatejas
Collaborator

shatejas commented Oct 9, 2025

Looks good overall, please remove v2 files in the next rev

this.addressAndSize = mmapVectorValues.getAddressAndSize();
this.maxOrd = knnVectorValues.size();
this.nativeFunctionTypeOrd = similarityFunctionType.ordinal();
SimdVectorComputeService.saveSearchContext(query, addressAndSize, nativeFunctionTypeOrd);
Collaborator

I am wondering if the use of thread-local state should be minimized to queryVectorSimdAligned. Every other parameter in the search context can be local to the thread. This increases the maintainability of the code while keeping the optimization; let me know what you think.

Collaborator Author

Sure, sounds good.
But I think it's better to keep saving the list of addresses so the vector pointer calculation can be done quickly.
Other than that, most of the values are native types, so we can pass them as parameters.
Will update in the next rev.
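A minimal sketch of the shape being discussed, with only the query cached per thread and everything else passed as ordinary parameters. The names and the plain dot product are illustrative, not the PR's actual API; the real code would copy the query into SIMD-aligned native memory and call into JNI:

```java
public class QueryContextSketch {
    // Thread-local holds only the query vector (the expensive-to-prepare, reused
    // piece); other search-context values travel as ordinary method parameters.
    private static final ThreadLocal<float[]> alignedQuery = new ThreadLocal<>();

    static void saveQuery(float[] query) {
        alignedQuery.set(query.clone()); // real code: copy into SIMD-aligned memory
    }

    // Scores a document vector against the cached query; values like maxOrd or
    // the similarity function ordinal would be passed in the same way.
    static float dot(float[] docVector) {
        float[] q = alignedQuery.get();
        float sum = 0f;
        for (int i = 0; i < q.length; i++) sum += q[i] * docVector[i];
        return sum;
    }

    public static void main(String[] args) {
        saveQuery(new float[] {1f, 2f, 3f});
        System.out.println(dot(new float[] {4f, 5f, 6f})); // 4 + 10 + 18 = 32.0
    }
}
```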


static void ipToMaxIpTransformBulk(float* scores, const int32_t numScores) noexcept {
int32_t i = 0;
for (; (i + 8) <= numScores ; i += 8, scores += 8) {
scores[0] = scores[0] < 0 ? 1 / (1 - scores[0]) : (1 + scores[0]);
Collaborator

Just trying to figure out the intent here: why not loop through and call ipToMaxIpTransform instead of doing 8 at a time? Will this be replaced by SIMD intrinsics in the same file?

Collaborator Author

Just a typical loop unrolling optimization, like Lucene does 😛 - https://en.wikipedia.org/wiki/Loop_unrolling
Beyond SIMD, it can also take advantage of the CPU instruction pipeline.
In some simple tests, the loop-unrolled version was 30%-50% faster than the simple loop.
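For readers unfamiliar with the technique, a small self-contained sketch (in Java here, though the snippet under discussion is C++) showing an 8-way unrolled transform next to the straightforward loop; both compute the same ipToMaxIp mapping, and the tail loop handles the remainder when the length is not a multiple of 8:

```java
public class UnrollSketch {
    // Maps an inner-product score to the max-inner-product score space.
    static float ipToMaxIp(float score) {
        return score < 0 ? 1f / (1f - score) : 1f + score;
    }

    // Straightforward loop.
    static void transformSimple(float[] scores) {
        for (int i = 0; i < scores.length; i++) scores[i] = ipToMaxIp(scores[i]);
    }

    // Unrolled by 8, matching the C++ snippet above; fewer branch checks per
    // element and more independent work for the CPU pipeline.
    static void transformUnrolled(float[] scores) {
        int i = 0;
        for (; i + 8 <= scores.length; i += 8) {
            scores[i]     = ipToMaxIp(scores[i]);
            scores[i + 1] = ipToMaxIp(scores[i + 1]);
            scores[i + 2] = ipToMaxIp(scores[i + 2]);
            scores[i + 3] = ipToMaxIp(scores[i + 3]);
            scores[i + 4] = ipToMaxIp(scores[i + 4]);
            scores[i + 5] = ipToMaxIp(scores[i + 5]);
            scores[i + 6] = ipToMaxIp(scores[i + 6]);
            scores[i + 7] = ipToMaxIp(scores[i + 7]);
        }
        for (; i < scores.length; i++) scores[i] = ipToMaxIp(scores[i]); // tail
    }

    public static void main(String[] args) {
        float[] a = new float[19], b = new float[19];
        for (int i = 0; i < 19; i++) { a[i] = i - 9.5f; b[i] = a[i]; }
        transformSimple(a);
        transformUnrolled(b);
        System.out.println(java.util.Arrays.equals(a, b)); // true
    }
}
```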
