Conversation

0ctopus13prime
Collaborator

Description

RFC: #2875

This PR introduces a new build structure in k-NN for SIMD-based computation, implementing the V1 design described in the RFC linked above.
V1 relies on Faiss distance calculation functions for similarity scoring. The structure introduced here is designed to make it easy to add bulk SIMD operations in subsequent versions. Starting with V1 helps reduce the initial review complexity for maintainers.

The core concept is to extract mapped memory pointers from MemorySegmentIndexInput and leverage native SIMD acceleration to enhance search performance.
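As an illustration of the reflective-extraction idea, here is a minimal, self-contained sketch. The class and field names below are made up for the example; the real code targets Lucene's MemorySegmentIndexInput and its internal MemorySegment[]:

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;

public class SegmentExtractionSketch {
    // Hypothetical stand-in for Lucene's MemorySegmentIndexInput, which keeps
    // its mapped regions in a private field (field name assumed for the sketch).
    static class FakeIndexInput {
        private final ByteBuffer[] segments = { ByteBuffer.allocateDirect(16) };
    }

    // Reflectively pulls the private segments array out: the same general
    // technique the PR describes for extracting MemorySegment[].
    static Object[] extractSegments(Object indexInput) throws Exception {
        Field f = indexInput.getClass().getDeclaredField("segments");
        f.setAccessible(true);
        return (Object[]) f.get(indexInput);
    }

    public static void main(String[] args) throws Exception {
        Object[] segments = extractSegments(new FakeIndexInput());
        System.out.println(segments.length); // 1
    }
}
```

Once the mapped regions are in hand, their native addresses can be passed across JNI so the native SIMD code reads vectors directly from the mapped file, with no per-vector copy into the JVM heap.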

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [O] New functionality includes testing.
  • [O] New functionality has been documented.
  • [O] API changes companion pull request created.
  • [O] Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@0ctopus13prime
Collaborator Author

0ctopus13prime commented Oct 6, 2025

Performance improvement summary

With early termination + native scoring for FP16,
I'm seeing a 70% QPS improvement over Faiss C++ in the multi-segment scenario with Cohere-10M, at a 1-2% recall drop.
For the single-segment case, QPS is slightly lower than with Faiss C++, which I suspect is due to the overhead of the JNI call plus the reflection logic extracting MemorySegment[].

This matches what we saw in the POC performance benchmark in the RFC.
Therefore, once the BulkSimd + prefetch optimizations (i.e. V2) come into play, I think it will improve further.

@@ -0,0 +1,562 @@
#include <algorithm>
Collaborator Author

oops, mistakenly included V2.
This will be removed in the next revision.

@@ -0,0 +1,538 @@
#include <algorithm>
Collaborator Author

oops, mistakenly included V2.
This will be removed in the next revision.

@@ -0,0 +1,690 @@
#include <algorithm>
Collaborator Author

oops, mistakenly included V2.
This will be removed in the next revision.

@navneet1v
Collaborator

@0ctopus13prime can we fix the failing CIs? It seems some tests have failed.

@navneet1v
Collaborator

Therefore, once BulkSimd + prefetch optimization (e.g. V2) comes in play, I think it will be improved further.

@0ctopus13prime why are we not implementing V2 here and going with V1 instead?

@0ctopus13prime
Collaborator Author

@navneet1v
Yeah, I think it's a flaky test: KNN1030CodecTests complains that Settings was null, so initialization is probably missing.
As for Windows, it's a MinGW compilation issue that I'm fixing right now.

Also, I'm taking an iterative path to V2. Including V2 in this PR would make it too heavy to review, as 36 files have already been modified 😅
Since V2 will only add intrinsics for each chip, it will be much easier to review than this version.

@0ctopus13prime 0ctopus13prime force-pushed the use-faise-scoring-v1 branch 5 times, most recently from 060028c to 2ed780f Compare October 6, 2025 16:47
@0ctopus13prime 0ctopus13prime changed the base branch from main to feature/fp16-faiss-bulk October 6, 2025 16:52
Signed-off-by: Doo Yong Kim <[email protected]>
@0ctopus13prime
Collaborator Author

Hi @Vikasht34 @shatejas
Fixed the build failure on Windows. Please take a look!

private final KnnCollectorManager knnCollectorManager;

// Ported from Lucene because, as of 10.3.0, TopKnnCollectorManager no longer uses MultiLeafKnnCollector.
private static class MultiLeafTopKnnCollectorManager implements KnnCollectorManager {
Collaborator Author

@0ctopus13prime 0ctopus13prime Oct 7, 2025

This will not be put into main; it's for benchmarking.

@Vikasht34
Collaborator

Will look into PR tomorrow morning

Collaborator

@shatejas shatejas left a comment

Still working through the PR - cpp review is pending

public static FloatVectorValues getBottomFloatVectorValues(KnnVectorValues knnVectorValues) {
if (knnVectorValues instanceof FloatVectorValues floatVectorValues) {
while (floatVectorValues instanceof WrappedFloatVectorValues wrappedFloatVectorValues) {
floatVectorValues = wrappedFloatVectorValues.nestedVectorValues;
Collaborator

Just curious, are we expecting multiple layers of this? From what I understand, these will always be flat values from the Faiss graph. Do we really need a while loop here? Maybe an assert !(wrappedFloatVectorValues.nestedVectorValues instanceof WrappedFloatVectorValues) is sufficient before returning.

Collaborator Author

No, we're not expecting multiple layers. I just added a general way to handle wrappers.
Will get rid of the loop in the next rev.
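For illustration, a minimal sketch of the single-unwrap-plus-guard shape suggested above, using hypothetical stand-in types rather than the actual Lucene/k-NN classes:

```java
public class UnwrapSketch {
    // Hypothetical stand-ins for the FloatVectorValues types discussed above.
    interface FloatVectorValuesLike {}
    static class FlatValues implements FloatVectorValuesLike {}
    static class WrappedValues implements FloatVectorValuesLike {
        final FloatVectorValuesLike nested;
        WrappedValues(FloatVectorValuesLike nested) { this.nested = nested; }
    }

    // Single-layer unwrap: assumes wrappers are never stacked, and fails fast otherwise.
    static FloatVectorValuesLike unwrap(FloatVectorValuesLike values) {
        if (values instanceof WrappedValues wrapped) {
            if (wrapped.nested instanceof WrappedValues) {
                throw new IllegalStateException("Unexpected multi-layer wrapping");
            }
            return wrapped.nested;
        }
        return values;
    }

    public static void main(String[] args) {
        FlatValues flat = new FlatValues();
        System.out.println(unwrap(new WrappedValues(flat)) == flat); // true
        System.out.println(unwrap(flat) == flat);                    // true
    }
}
```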

public abstract class WrappedFloatVectorValues extends FloatVectorValues {

// The wrapped (nested) {@link FloatVectorValues} instance.
protected final FloatVectorValues nestedVectorValues;
Collaborator

The prefix nested can be confusing, considering k-NN supports nested index mappings. I understand this refers to how Faiss stores the index in a nested fashion, but maybe flatVectorValues or something else would avoid the confusion?

Collaborator Author

Sure, I will just use flatVectorValues.
Will update in the next rev.

// Collect only chunks that overlap with baseOffset or are placed after it
addressAndSize[addressIndex] = address;
addressIndex += 2;
addressAndSize[sizeIndex] = chunkSize;
Collaborator

Just checking: do we need to adjust the final chunkSize based on the size of the faissSection? You only need access to the flat vectors, right? I think this gives access up to the end of the Faiss file. Let me know if I'm missing something here.

Collaborator Author

Yes, you're right, you're not missing anything. What we need is exactly the flat vector section. The starting offset is already handled, but for the end offset of the section, as you said, the size of the file is being used.
Do you think it would be better to make it explicitly require startOffset and size?

Collaborator

Do you think it would be better to make it explicitly require startOffset and size?

Yes, I know it's being managed correctly, but it's best to give access only to the required section.

Collaborator Author

sure, will update in the next rev.
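To illustrate the slicing being discussed, here is a self-contained sketch (not the PR's actual code; the interleaved [address, size, address, size, ...] layout and the method name are assumptions) that trims a chunk list down to exactly the [baseOffset, baseOffset + sectionSize) window of the file:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkTrimSketch {
    // Trims mapped chunks (parallel address/size pairs laid out as
    // [addr0, size0, addr1, size1, ...]) to the section window
    // [baseOffset, baseOffset + sectionSize) of the underlying file.
    static long[] trimToSection(long[] addressAndSize, long baseOffset, long sectionSize) {
        long sectionEnd = baseOffset + sectionSize;
        List<Long> out = new ArrayList<>();
        long fileOffset = 0; // running file offset of the current chunk's start
        for (int i = 0; i < addressAndSize.length; i += 2) {
            long address = addressAndSize[i];
            long chunkSize = addressAndSize[i + 1];
            long chunkStart = fileOffset;
            long chunkEnd = fileOffset + chunkSize;
            fileOffset = chunkEnd;
            // Keep only the part of this chunk that overlaps the section.
            long overlapStart = Math.max(chunkStart, baseOffset);
            long overlapEnd = Math.min(chunkEnd, sectionEnd);
            if (overlapStart < overlapEnd) {
                out.add(address + (overlapStart - chunkStart));
                out.add(overlapEnd - overlapStart);
            }
        }
        long[] result = new long[out.size()];
        for (int i = 0; i < result.length; i++) result[i] = out.get(i);
        return result;
    }

    public static void main(String[] args) {
        // Two 100-byte chunks at addresses 1000 and 5000; section = file bytes [50, 150).
        long[] trimmed = trimToSection(new long[] {1000, 100, 5000, 100}, 50, 100);
        System.out.println(java.util.Arrays.toString(trimmed)); // [1050, 50, 5000, 50]
    }
}
```

Passing both startOffset and size, as agreed above, bounds the result on both ends so the scorer only ever sees the flat vector section.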

*/
@Override
public long[] extractAddressAndSize(final IndexInput indexInput) {
public long[] extractAddressAndSize(final IndexInput indexInput, long baseOffset) {
Collaborator

So the responsibility of this method is to extract addressAndSize; shouldn't the manipulation of base and sectionSize happen in MMap*VectorValues?

I think this still works since addressAndSize is not a member variable. I just want to confirm whether, like baseOffset, we need the total size as a parameter in this method.

Collaborator Author

MMap*VectorValues is a general VectorValues that returns addressAndSize; it's then the FaissIndex implementation's (e.g. FaissIndexScalarQuantizedFlat) responsibility to slice the necessary parts from it and hand them to the scorer. Since each FaissIndex implementation knows the flat vector section size within the Faiss index, I divided the responsibility into two parts:

  1. MMap*VectorValues returns the whole addressAndSize for the underlying file.
  2. Each FaissIndex implementation then slices it based on the flat vector FaissSection.

Will add a size parameter so it returns exactly the region needed for scoring.

Comment on lines +61 to +83
final long endOffsetExclusive = address + chunkSize;
if (endOffsetExclusive > baseOffset) {
// If this chunk contains `baseOffset`, then force its address value to be `baseOffset`
if (address < baseOffset) {
chunkSize = endOffsetExclusive - baseOffset;
address += baseOffset - startOffset;
}

// Collect only chunks that overlap with baseOffset or are placed after it
addressAndSize[addressIndex] = address;
addressIndex += 2;
addressAndSize[sizeIndex] = chunkSize;
sizeIndex += 2;
}
startOffset += originalChunkSize;
}

if (addressIndex != addressAndSize.length) {
// A chunk was excluded, so shrink the array
long[] newAddressAndSize = new long[addressIndex];
System.arraycopy(addressAndSize, 0, newAddressAndSize, 0, addressIndex);
return newAddressAndSize;
}
Collaborator

nit: can we extract this into a common method and avoid duplicates

Collaborator Author

Sure, will do in next rev.

reconstructor
);
} else {
log.debug("Failed to extract mapped pointers from IndexInput, falling back to ByteVectorValuesImpl.");
Collaborator

nit: FloatVectorValuesImpl

Collaborator Author

will update in the next rev.

@shatejas
Collaborator

shatejas commented Oct 9, 2025

Looks good overall, please remove v2 files in the next rev

this.addressAndSize = mmapVectorValues.getAddressAndSize();
this.maxOrd = knnVectorValues.size();
this.nativeFunctionTypeOrd = similarityFunctionType.ordinal();
SimdVectorComputeService.saveSearchContext(query, addressAndSize, nativeFunctionTypeOrd);
Collaborator

I am wondering if the use of thread-local state should be minimized to queryVectorSimdAligned. Every other parameter in the search context can be local to the thread. This increases the maintainability of the code while keeping the optimization; let me know what you think.

Collaborator Author

Sure, sounds good.
But I think it's better to keep saving the list of addresses so the vector pointer calculation can be done quickly.
Other than that, most of the values are native types, so we can pass them as parameters.
Will update in the next rev.
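A minimal sketch of the shape being discussed, with only the query cached per thread and everything else passed as ordinary parameters. The names and the plain dot product are illustrative, not the PR's actual API; the real code would copy the query into SIMD-aligned native memory and call into JNI:

```java
public class QueryContextSketch {
    // Thread-local holds only the query vector (the expensive-to-prepare, reused
    // piece); other search-context values travel as ordinary method parameters.
    private static final ThreadLocal<float[]> alignedQuery = new ThreadLocal<>();

    static void saveQuery(float[] query) {
        alignedQuery.set(query.clone()); // real code: copy into SIMD-aligned memory
    }

    // Scores a document vector against the cached query; values like maxOrd or
    // the similarity function ordinal would be passed in the same way.
    static float dot(float[] docVector) {
        float[] q = alignedQuery.get();
        float sum = 0f;
        for (int i = 0; i < q.length; i++) sum += q[i] * docVector[i];
        return sum;
    }

    public static void main(String[] args) {
        saveQuery(new float[] {1f, 2f, 3f});
        System.out.println(dot(new float[] {4f, 5f, 6f})); // 4 + 10 + 18 = 32.0
    }
}
```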


static void ipToMaxIpTransformBulk(float* scores, const int32_t numScores) noexcept {
int32_t i = 0;
for (; (i + 8) <= numScores ; i += 8, scores += 8) {
scores[0] = scores[0] < 0 ? 1 / (1 - scores[0]) : (1 + scores[0]);
Collaborator

Just trying to figure out the intent here: why not loop through and call ipToMaxIpTransform instead of doing 8 at a time? Will this be replaced by SIMD intrinsics in the same file?

Collaborator Author

Just a typical loop unrolling optimization, like Lucene does 😛 - https://en.wikipedia.org/wiki/Loop_unrolling
Beyond SIMD, it can also take advantage of the CPU instruction pipeline.
In some simple tests, the loop-unrolled version was 30%-50% faster than the simple loop.
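For readers unfamiliar with the technique, a small self-contained sketch (in Java here, though the snippet under discussion is C++) showing an 8-way unrolled transform next to the straightforward loop; both compute the same ipToMaxIp mapping, and the tail loop handles the remainder when the length is not a multiple of 8:

```java
public class UnrollSketch {
    // Maps an inner-product score to the max-inner-product score space.
    static float ipToMaxIp(float score) {
        return score < 0 ? 1f / (1f - score) : 1f + score;
    }

    // Straightforward loop.
    static void transformSimple(float[] scores) {
        for (int i = 0; i < scores.length; i++) scores[i] = ipToMaxIp(scores[i]);
    }

    // Unrolled by 8, matching the C++ snippet above; fewer branch checks per
    // element and more independent work for the CPU pipeline.
    static void transformUnrolled(float[] scores) {
        int i = 0;
        for (; i + 8 <= scores.length; i += 8) {
            scores[i]     = ipToMaxIp(scores[i]);
            scores[i + 1] = ipToMaxIp(scores[i + 1]);
            scores[i + 2] = ipToMaxIp(scores[i + 2]);
            scores[i + 3] = ipToMaxIp(scores[i + 3]);
            scores[i + 4] = ipToMaxIp(scores[i + 4]);
            scores[i + 5] = ipToMaxIp(scores[i + 5]);
            scores[i + 6] = ipToMaxIp(scores[i + 6]);
            scores[i + 7] = ipToMaxIp(scores[i + 7]);
        }
        for (; i < scores.length; i++) scores[i] = ipToMaxIp(scores[i]); // tail
    }

    public static void main(String[] args) {
        float[] a = new float[19], b = new float[19];
        for (int i = 0; i < 19; i++) { a[i] = i - 9.5f; b[i] = a[i]; }
        transformSimple(a);
        transformUnrolled(b);
        System.out.println(java.util.Arrays.equals(a, b)); // true
    }
}
```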
