Deprecate custom compound codec for native index write logic #2633
Conversation
Force-pushed from a5ed81a to a2ca66c: …s. Implement a NativeIndexReader that can read these indexes from compound files and provide the IndexInput for loading it into the cache. Signed-off-by: Enes Palaz <[email protected]>
I have one concern regarding search performance degradation when the compound format is enabled. Could you address it?
try (Engine.Searcher searcher = indexShard.acquireSearcher("knn-warmup")) {
    log.info("[KNN] entered searcher");
    getAllEngineFileContexts(searcher.getIndexReader()).forEach((engineFileContext) -> {
        log.info("[KNN] Engine file context: [{}]", engineFileContext);
Can you see any stringified logs with this? I think EngineFileContext needs the @ToString annotation.
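For reference, Lombok's @ToString (the codebase already uses Lombok annotations such as @Log4j2) would generate the string form; the fields shown here are hypothetical, not the actual class:

import lombok.ToString;

// Lombok generates a toString() over all fields, so the "[{}]"
// placeholder in the log statement prints a readable value.
@ToString
public class EngineFileContext {
    private final String indexFileName;
    private final long fileLength;

    public EngineFileContext(String indexFileName, long fileLength) {
        this.indexFileName = indexFileName;
        this.fileLength = fileLength;
    }
}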
I will remove these new info logs; I was only using them for tracking some behavior on a cluster.
@Override
public void close() throws IOException {
    this.indexInput.close();
I don't think we need to close the sliced IndexInput unless it was created explicitly with the clone() method. It is safe to just close the source IndexInput (e.g. sliceSourceIndexInput) that the sliced IndexInput originates from. You can find many places in Lucene that never call close() on a sliced IndexInput, e.g. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L484
In the code above, close() is never called on dataIn.
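For illustration, a minimal sketch of that pattern (method and parameter names here are hypothetical):

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

static byte[] readSlice(Directory directory, String fileName, long offset, long length) throws IOException {
    // Close only the source input; the slice is a view over it and
    // does not need its own close() call.
    try (IndexInput source = directory.openInput(fileName, IOContext.DEFAULT)) {
        IndexInput slice = source.slice("sliced-view", offset, length);
        byte[] buffer = new byte[(int) length];
        slice.readBytes(buffer, 0, buffer.length);
        return buffer;
    }
}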
I was having issues with KNNWeightTests, which extend LuceneTestCase, when it is not closed. The test framework uses MockDirectoryWrapper to verify that all files are closed. Without closing the source, tests were failing with a file being left open.
I also saw some other examples where sliced clones are kept around to be closed with the source, like here: https://github.com/opensearch-project/OpenSearch/blob/edd854a61b6fc051dfab2a3ce4641748ff799491/server/src/main/java/org/opensearch/index/store/remote/filecache/FullFileCachedIndexInput.java#L75
Similarly here as well:
https://github.com/opensearch-project/OpenSearch/blob/edd854a61b6fc051dfab2a3ce4641748ff799491/server/src/main/java/org/opensearch/gateway/MetadataStateFormat.java#L312-L339
That's from OpenSearch. Could you find an example of closing a sliced IndexInput in Lucene?
A sliced IndexInput is a view, and it does not have to be closed. (A cloned IndexInput is a whole different story, though.)
Not sure why MockDirectoryWrapper recognizes a sliced IndexInput as a separate stream to track. Let me try to reproduce on my end.
@epalaz @0ctopus13prime If the slice doesn't need to be closed, can we please remove this entire class? From what I can see, it is simply used to close the slice and the IndexInput together.
@shatejas yes, that is the approach I will take if we don't need it. I am considering some other options as well.
}

@Override
public IndexInput slice(String s, long l, long l1) throws IOException {
Could you give better names to s, l, l1?
For sure, I will update these in the final draft. If we can get away without closing the source IndexInput, we won't need this class anyway.
}

@Override
public void readBytes(byte[] bytes, int i, int i1) throws IOException {
Could you give better names to i, i1?
For sure, I will update these in the final draft. If we can get away without closing the source IndexInput, we won't need this class anyway.
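For reference, Lucene's own signatures use descriptive names that could be adopted here (a sketch, assuming plain delegation to the wrapped indexInput):

@Override
public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
    // Same descriptive parameter names Lucene's IndexInput.slice uses.
    return indexInput.slice(sliceDescription, offset, length);
}

@Override
public void readBytes(byte[] bytes, int offset, int length) throws IOException {
    indexInput.readBytes(bytes, offset, length);
}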
public static final int FLOAT_BYTE_SIZE = 4;

public static boolean isSegmentUsingLegacyIndexCompoundCodec(SegmentInfo segmentInfo) {
    return !segmentInfo.getVersion().onOrAfter(Version.LUCENE_10_0_0);
SUPER NIT: could you follow the convention?
segmentInfo.getVersion().onOrAfter(..) == false
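Applied to the method above, the check would read:

public static boolean isSegmentUsingLegacyIndexCompoundCodec(SegmentInfo segmentInfo) {
    return segmentInfo.getVersion().onOrAfter(Version.LUCENE_10_0_0) == false;
}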
 * This index input extension allows passing both index inputs down to consumer so when close is called, both
 * index inputs are closed. See NativeIndexReader for usage of this class.
 */
public class NativeIndexInput extends IndexInput {
Is there a use case for calling clone()?
Not in our use case, but in Lucene jargon, input streams created from cloning or slicing are called clones of the original.
Let's add a clone() method here for completeness. Relying on the default clone() will give subtle errors.
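One common way to do that, sketched under the assumption that the wrapper's indexInput field is mutable (the real class may differ):

@Override
public IndexInput clone() {
    // IndexInput's default clone() is shallow, so both copies would
    // share the delegate and its file pointer; clone it explicitly.
    NativeIndexInput cloned = (NativeIndexInput) super.clone();
    cloned.indexInput = indexInput.clone();
    return cloned;
}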
import org.apache.lucene.store.IndexInput;

@Log4j2
public class NativeIndexReader implements Closeable {
Could you give brief comments on what this class does?
// If it is non-legacy compound file, calculate size from file size - header - footer
// Cache will hold the index file bytes after the header hence the size calculation
long compoundFileLength = directory.fileLength(indexFileName);
NIT: hmm... this name is somewhat misleading? It should be indexFileLength?
that is true, will update.
 * @return List of engine files
 */
- public static List<String> getEngineFiles(String extension, String fieldName, SegmentInfo segmentInfo) {
+ public static List<String> getEngineFiles(KNNEngine engine, String fieldName, SegmentInfo segmentInfo) throws IOException {
Oh wait, this will be called for every query. With this change, for the compound format, every query will involve opening a file and decoding the compound metadata, which will likely impact search latency. I think we need to come up with a plan to avoid this cost; one possible caching approach is sketched after the links below.
- Decoding compound meta info: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90CompoundReader.java#L60
- Querying engine files during a query: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNWeight.java#L321
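For illustration only, a per-segment cache could amortize the decode cost. This sketch keys on segment name plus field; all names are hypothetical, and the decode step is a placeholder for the existing getEngineFiles logic:

import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.index.SegmentInfo;

// Compound metadata is immutable per segment, so the engine-file list
// can be decoded once per (segment, field) and reused across queries.
// A real implementation would need eviction tied to segment lifecycle.
public final class EngineFileCache {
    private static final Map<String, List<String>> CACHE = new ConcurrentHashMap<>();

    public static List<String> engineFiles(String fieldName, SegmentInfo segmentInfo) throws IOException {
        String key = segmentInfo.name + "/" + fieldName;
        List<String> cached = CACHE.get(key);
        if (cached == null) {
            cached = decodeEngineFiles(fieldName, segmentInfo);
            CACHE.put(key, cached);
        }
        return cached;
    }

    private static List<String> decodeEngineFiles(String fieldName, SegmentInfo segmentInfo) throws IOException {
        // Placeholder for the compound-metadata decode discussed above.
        throw new UnsupportedOperationException("illustrative placeholder");
    }
}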
Great point! Let me see if there is an alternative.
package org.opensearch.knn.index.codec.nativeindex;

public class NativeIndexReaderTests {
haha what is this?
Oh yeah, I wanted to get the general logic verified before finalizing all the tests. I will polish this before the actual PR.
public static NativeIndexReader getReader(SegmentInfo segmentInfo) throws IOException {
I am not sure we want this reader; it deviates from the construct of a Lucene reader, which is encapsulated in a codec. Shouldn't this logic be encapsulated in KNN80CompoundReader? Is there a way we can do that?
Moreover, this circles back to the performance point @0ctopus13prime brought up. While this is currently being cached, it might still impact tail latencies.
The idea behind the change is to get rid of the custom compound codec. There is custom write logic that made this codec necessary: when the compound codec is used, we take the native index out of the compound file and save it separately.
The functionality of this reader is to strip the codec header/footer that is added in NativeIndexWriter before loading the index into memory; it is not part of the codec in that respect.
In terms of performance, I am working on getting the benchmark numbers, so if there is a significant impact, we may abandon this approach entirely.
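For context, a minimal sketch of such header/footer stripping with Lucene's CodecUtil (the codec name and version constants are placeholders, not the PR's actual values):

import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

static IndexInput openNativeIndexData(Directory directory, String fileName) throws IOException {
    IndexInput in = directory.openInput(fileName, IOContext.DEFAULT);
    // Validate and skip the codec header; the file pointer then sits
    // at the first byte of the native index data.
    CodecUtil.checkHeader(in, "NativeEngineFiles", 0, 0);
    long dataStart = in.getFilePointer();
    // Everything between the header and the fixed-length footer is the
    // raw native index payload.
    long dataLength = in.length() - dataStart - CodecUtil.footerLength();
    // Per the discussion above, the caller closes the source input;
    // the returned slice itself does not need to be closed.
    return in.slice("native-index-data", dataStart, dataLength);
}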
- Readers and writers are supposed to be inside the codec, so let's not deviate from the convention; we also need to rename this, as it can get really confusing.
- The problem with this approach is that it opens a new compound reader with each request, which underneath opens an IndexInput that assigns new virtual address space every time. With enough load it will eventually run out of address space and crash the node, making the cluster red. So even if the latencies are okay, we might not be able to use this.

I looked around a bit and there doesn't seem to be a way to access the compound reader that is cached in SegmentReader. From what it looks like, Lucene expects the KnnVectorsReader to be used, with a compound directory passed to it in the segment read state. The reader, as part of opening the index input, should check the header and position the index input at the right location to start reading the graph file. The links below should help you verify this:
- SegmentCoreReaders caching the compound format
- Segment read state with the right directory (cfs or otherwise)
- Caching of the KnnVectorsReader in SegmentReader
- Lucene's reader implementation where it checks the header

The right way would be to move doANNSearch into a VectorSearcher which is used in Native99VectorReader; a sketch of the header-check pattern follows below.
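For illustration, the Lucene-style pattern being described (a sketch; the codec name and version constants are placeholders):

import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.store.IndexInput;

static IndexInput openVectorData(SegmentReadState state, String fileName) throws IOException {
    // state.directory is already the compound directory when the segment
    // uses the compound format, so one code path serves both cases.
    IndexInput in = state.directory.openInput(fileName, state.context);
    CodecUtil.checkIndexHeader(in, "NativeVectorsFormatData", 0, 0, state.segmentInfo.getId(), state.segmentSuffix);
    // The file pointer now sits at the start of the graph data.
    return in;
}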
@epalaz I know this is in Draft... Can we quickly run benchmarks for the above approach and post the benchmarking results before we deep dive into this PR?
Description
Deprecating the use of the custom compound codec for native index files. Previously, the custom codec KNN80CompoundFormat would take native index files out of the compound file list and write them as standalone files with extensions like .faissc and .hnswc.

With this CR, we remove this custom logic and write native indexes into compound files like all other index files. Extensions like .faissc for compound files are removed, and all index files have the same names whether or not they were written with compound files enabled.

To enable this, a header is added to all native index files written in NativeIndexWriter, since every subfile in a compound file requires a header. A new component, NativeIndexReader, is implemented to read the IndexInput from either the compound file or the segment directory, depending on whether compound files are enabled. This component slices the input to remove the header and footer so that only the native binary index data is loaded into the native index cache.
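A minimal sketch of the write side described above, using Lucene's CodecUtil (the codec name and version are placeholders, not the PR's actual constants):

import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.SegmentWriteState;
import org.apache.lucene.store.IndexOutput;

static void writeNativeIndexFile(SegmentWriteState state, String fileName, byte[] nativeIndexBytes) throws IOException {
    try (IndexOutput out = state.directory.createOutput(fileName, state.context)) {
        // Header first: every subfile of a compound file is expected
        // to start with a codec header.
        CodecUtil.writeIndexHeader(out, "NativeEngineFiles", 0, state.segmentInfo.getId(), state.segmentSuffix);
        // The raw serialized native index bytes (e.g. a faiss index).
        out.writeBytes(nativeIndexBytes, 0, nativeIndexBytes.length);
        // Footer last: adds the checksum that readers verify.
        CodecUtil.writeFooter(out);
    }
}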
Backwards Compatibility
To ensure backwards compatibility, a check on the Lucene version used for the segment is implemented. Since with 3.0 we are moving to Lucene 10, we assume any index written with a Lucene version older than 10 uses the older custom codec.
For those, we fall back to the existing read logic of reading the custom file, and skip the header and footer removal.
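Sketched out, that fallback looks roughly like this (both helper methods are illustrative stand-ins for the legacy and new read paths):

import java.io.IOException;
import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.util.Version;

static IndexInput openNativeIndex(SegmentInfo segmentInfo, String fileName) throws IOException {
    // Segments written before Lucene 10 used the custom compound codec
    // (.faissc/.hnswc files) and carry no codec header or footer.
    if (segmentInfo.getVersion().onOrAfter(Version.LUCENE_10_0_0) == false) {
        return openLegacyNativeIndexFile(segmentInfo, fileName); // existing read path
    }
    return openAndStripHeader(segmentInfo, fileName); // new header-stripping path
}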
Related Issues
Resolves #2536
Check List
- Commits are signed per the Developer Certificate of Origin using --signoff.
- By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.