Native scoring for FP16 V1 implementation. #2922
base: feature/fp16-faiss-bulk
Conversation
Signed-off-by: Dooyong Kim <[email protected]>
Force-pushed 359665b to 546cfd9.
Performance improvement summary: with early termination plus native scoring for FP16, this is what we saw in the POC performance benchmark in the RFC.
@@ -0,0 +1,562 @@
#include <algorithm>
oops, mistakenly included V2.
This will be removed in the next revision.
@@ -0,0 +1,538 @@
#include <algorithm>
oops, mistakenly included V2.
This will be removed in the next revision.
@@ -0,0 +1,690 @@
#include <algorithm>
oops, mistakenly included V2.
This will be removed in the next revision.
Force-pushed 546cfd9 to 2acabbe.
@0ctopus13prime can we fix the failing CIs? It seems like some tests have failed.
@0ctopus13prime why are we not implementing V2 here and going with V1?
@navneet1v Also, I'm taking an iterative path to get to V2. Including V2 in this PR would make it too heavy to review, as 36 files have already been modified 😅
Force-pushed 060028c to 2ed780f.
Force-pushed 2ed780f to 1b651ab.
Force-pushed 1b651ab to 4e6572d.
Signed-off-by: Dooyong Kim <[email protected]>
Force-pushed 4e6572d to 189670b.
Signed-off-by: Doo Yong Kim <[email protected]>
Hi @Vikasht34 @shatejas
private final KnnCollectorManager knnCollectorManager;

// Ported from Lucene since, as of 10.3.0, TopKnnCollectorManager no longer uses MultiLeafKnnCollector.
private static class MultiLeafTopKnnCollectorManager implements KnnCollectorManager {
This will not be put into main, it's for benchmarking.
Will look into the PR tomorrow morning.
Still working through the PR - cpp review is pending
public static FloatVectorValues getBottomFloatVectorValues(KnnVectorValues knnVectorValues) {
    if (knnVectorValues instanceof FloatVectorValues floatVectorValues) {
        while (floatVectorValues instanceof WrappedFloatVectorValues wrappedFloatVectorValues) {
            floatVectorValues = wrappedFloatVectorValues.nestedVectorValues;
Just curious, are we expecting multiple layers of this? From what I understand these will always be flat values from the Faiss graph. Do we really need a while loop here? Maybe an assert !(wrappedFloatVectorValues.nestedVectorValues instanceof WrappedFloatVectorValues) is sufficient before returning.
No, we are not expecting it to be multi-layered. I just added a general way to handle wrappers.
Will get rid of the loop in the next rev.
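The single-level unwrap the reviewer suggests might look like the sketch below. The type names here are minimal stand-ins, not the actual PR classes, so the shape of the check is the point, not the signatures:

```java
// Minimal stand-ins for the vector-values types discussed in the thread.
interface VectorValues {}

class FlatVectorValues implements VectorValues {}

class WrappedVectorValues implements VectorValues {
    final VectorValues nested;

    WrappedVectorValues(VectorValues nested) {
        this.nested = nested;
    }
}

public class Unwrap {
    // Single-level unwrap with an assert instead of a while loop,
    // since wrappers are expected to be exactly one layer deep.
    static VectorValues bottom(VectorValues v) {
        if (v instanceof WrappedVectorValues w) {
            assert !(w.nested instanceof WrappedVectorValues) : "wrappers must not nest";
            return w.nested;
        }
        return v;
    }
}
```

The assert documents the invariant (flat values come directly from the Faiss graph) while avoiding the cost and ambiguity of a loop that implies arbitrary nesting.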
public abstract class WrappedFloatVectorValues extends FloatVectorValues {

    // The wrapped (nested) {@link FloatVectorValues} instance.
    protected final FloatVectorValues nestedVectorValues;
The prefix nested can be confusing considering knn supports nested index mappings. I understand this refers to how Faiss stores the index in a nested fashion, but maybe flatVectorValues or something else to avoid the confusion?
Sure, I will just use flatVectorValues.
Will update in the next rev.
// Collect only chunks that overlap with baseOffset or are placed after baseOffset
addressAndSize[addressIndex] = address;
addressIndex += 2;
addressAndSize[sizeIndex] = chunkSize;
Just checking: do we need to adjust the final chunkSize based on the size of the faissSection? You only need access to the flat vectors, right? As written, I think this gives access until the end of the Faiss file. Let me know if I am missing something here.
Yes, you're right, you're not missing anything. What we need is exactly the flat vector section. The starting offset is already addressed, but for the end offset of the section, as you said, the size of the file is being used.
Do you think it would be better to make it explicitly require startOffset and size?
Do you think it would be better to make it explicitly require startOffset and size?
Yes. I know it's being managed correctly, but it's best to give access only to the required section.
sure, will update in the next rev.
*/
@Override
public long[] extractAddressAndSize(final IndexInput indexInput) {
public long[] extractAddressAndSize(final IndexInput indexInput, long baseOffset) {
So the responsibility of this method was to extract addressAndSize. Shouldn't the manipulation of base and sectionSize happen in MMap*VectorValues?
I think this still works considering addressAndSize is not a member variable; I just want to make sure whether, like baseOffset, we need the total size as a parameter in this method.
MMap*VectorValues is a general VectorValues that returns addressAndSize; it is then FaissIndex's (e.g. FaissIndexScalarQuantizedFlat) responsibility to slice the necessary parts from it and give them to the scorer. Since each FaissIndex implementation knows the flat vector section size in the Faiss index, I divided the responsibility into two parts:
- Let MMap*VectorValues return the whole addressAndSize for the underlying file.
- Each FaissIndex implementation then slices it based on the flat vector FaissSection.
Will add a size parameter to make it return exactly the region it needs for scoring.
final long endOffsetExclusive = address + chunkSize;
if (endOffsetExclusive > baseOffset) {
    // If this chunk contains `baseOffset`, then force its address value to be `baseOffset`
    if (address < baseOffset) {
        chunkSize = endOffsetExclusive - baseOffset;
        address += baseOffset - startOffset;
    }

    // Collect only chunks that overlap with baseOffset or are placed after baseOffset
    addressAndSize[addressIndex] = address;
    addressIndex += 2;
    addressAndSize[sizeIndex] = chunkSize;
    sizeIndex += 2;
}
startOffset += originalChunkSize;
}

if (addressIndex != addressAndSize.length) {
    // There was a chunk that was excluded, so shrink the array
    long[] newAddressAndSize = new long[addressIndex];
    System.arraycopy(addressAndSize, 0, newAddressAndSize, 0, addressIndex);
    return newAddressAndSize;
}
nit: can we extract this into a common method to avoid duplication?
Sure, will do in next rev.
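The extracted helper might look like the sketch below. The names (clipToSection, the even/odd address-size layout) are assumptions based on the diff above, not the code the author will actually write; it clips each [address, size] chunk to the region at or after baseOffset and compacts the result:

```java
import java.util.Arrays;

public class ChunkClipper {
    /**
     * Keeps only the portions of the mapped chunks that lie at or after
     * {@code baseOffset}. The input array alternates [address, size] pairs;
     * {@code fileOffset} tracks where each chunk starts within the file.
     */
    static long[] clipToSection(long[] addressAndSize, long baseOffset) {
        long[] out = new long[addressAndSize.length];
        int outIdx = 0;
        long fileOffset = 0;
        for (int i = 0; i < addressAndSize.length; i += 2) {
            long address = addressAndSize[i];
            long size = addressAndSize[i + 1];
            final long end = fileOffset + size;
            if (end > baseOffset) {
                if (fileOffset < baseOffset) {
                    // Chunk straddles baseOffset: trim its head.
                    address += baseOffset - fileOffset;
                    size = end - baseOffset;
                }
                out[outIdx++] = address;
                out[outIdx++] = size;
            }
            fileOffset = end;
        }
        // Shrink if any chunk was excluded entirely.
        return outIdx == out.length ? out : Arrays.copyOf(out, outIdx);
    }
}
```

A single helper like this would let both call sites in the diff share the straddle-trimming and array-shrinking logic.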
        reconstructor
    );
} else {
    log.debug("Failed to extract mapped pointers from IndexInput, falling back to ByteVectorValuesImpl.");
nit: FloatVectorValuesImpl
will update in the next rev.
Looks good overall, please remove V2 files in the next rev.
this.addressAndSize = mmapVectorValues.getAddressAndSize();
this.maxOrd = knnVectorValues.size();
this.nativeFunctionTypeOrd = similarityFunctionType.ordinal();
SimdVectorComputeService.saveSearchContext(query, addressAndSize, nativeFunctionTypeOrd);
I am wondering if the use of thread-locals should be minimized to queryVectorSimdAligned. Every other parameter in the search context can be local to the thread. This increases the maintainability of the code while keeping the optimization. Let me know what you think.
Sure, sounds good.
But I think it's better to also save the list of addresses so the vector-pointer calculation can be done quickly.
Other than that, the rest are mostly native types, so we can pass them as parameters.
Will update in the next rev.
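The suggested shape, keeping only the reusable aligned query buffer in a ThreadLocal, might look like this sketch. The class and field names are hypothetical, not the PR's actual API:

```java
public class AlignedQueryCache {
    // Per-thread scratch buffer for the SIMD-aligned query copy; this is the
    // only piece of search state worth caching across calls on a thread.
    private static final ThreadLocal<float[]> ALIGNED_QUERY =
            ThreadLocal.withInitial(() -> new float[0]);

    // Copies the query into the thread's reusable buffer, growing it if needed.
    // Everything else (addresses, function-type ordinal) would travel as plain
    // method parameters instead of living in the thread-local context.
    static float[] alignedCopy(float[] query) {
        float[] buf = ALIGNED_QUERY.get();
        if (buf.length < query.length) {
            buf = new float[query.length];
            ALIGNED_QUERY.set(buf);
        }
        System.arraycopy(query, 0, buf, 0, query.length);
        return buf;
    }
}
```

Restricting the thread-local to the one buffer that amortizes allocation keeps the optimization while making the remaining data flow explicit and easier to reason about.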
static void ipToMaxIpTransformBulk(float* scores, const int32_t numScores) noexcept {
    int32_t i = 0;
    for (; (i + 8) <= numScores; i += 8, scores += 8) {
        scores[0] = scores[0] < 0 ? 1 / (1 - scores[0]) : (1 + scores[0]);
Just trying to figure out the intent here: why not loop through and call ipToMaxIpTransform instead of doing 8 at a time? Will this be replaced by SIMD intrinsics in the same file?
Just typical loop unrolling optimization, like Lucene does 😛 - https://en.wikipedia.org/wiki/Loop_unrolling
Beyond SIMD, it also takes advantage of the CPU instruction pipeline.
I did some simple tests, and the loop-unrolled version was 30%–50% faster than simple looping.
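For illustration, here is the same unrolled max-inner-product transform sketched in Java (the C++ in the PR unrolls by 8; this sketch unrolls by 4, and the class name is hypothetical). Negative scores map to 1/(1 - s) and non-negative ones to 1 + s, so all results are positive and the ordering of scores is preserved:

```java
public class MaxIpTransform {
    // Order-preserving map from raw inner product to a positive score.
    static float transform(float s) {
        return s < 0 ? 1f / (1f - s) : 1f + s;
    }

    // Unrolled by 4: fewer loop-condition checks and more work available to the
    // CPU pipeline per iteration; the tail loop handles the remainder.
    static void transformBulk(float[] scores) {
        int i = 0;
        for (; i + 4 <= scores.length; i += 4) {
            scores[i] = transform(scores[i]);
            scores[i + 1] = transform(scores[i + 1]);
            scores[i + 2] = transform(scores[i + 2]);
            scores[i + 3] = transform(scores[i + 3]);
        }
        for (; i < scores.length; i++) {
            scores[i] = transform(scores[i]);
        }
    }
}
```

For example, scores {-1, 0, 1, 3} become {0.5, 1, 2, 4}; the relative order is unchanged, which is all a top-k collector needs.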
Description
RFC : #2875
This PR introduces a new build structure in k-NN for SIMD-based computation, implementing the V1 design described in the RFC linked above.
V1 relies on Faiss distance calculation functions for similarity scoring. The structure introduced here is designed to make it easy to add bulk SIMD operations in subsequent versions. Starting with V1 helps reduce the initial review complexity for maintainers.
The core concept is to extract mapped memory pointers from MemorySegmentIndexInput and leverage native SIMD acceleration to enhance search performance.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
--signoff.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.