Conversation

@epalaz epalaz commented Mar 28, 2025

Description

This PR deprecates the custom compound codec for native index files. Previously, the custom codec KNN80CompoundFormat took native index files out of the compound file list and wrote them as standalone files with extensions such as .faissc and .hnswc.

With this CR, that custom logic is removed and native indexes are written into compound files like all other index files. As a result, compound-specific extensions such as .faissc go away, and native index files keep the same name and extension whether or not the index is written with compound files enabled.

To enable this, a header is added to every native index file written by NativeIndexWriter, since every subfile inside a compound file requires a header. A new component, NativeIndexReader, reads the IndexInput from either the compound file or the segment directory, depending on whether compound files are enabled. This component slices the input to strip the header and footer so that only the native binary index data is loaded into the native index cache.
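As a rough illustration of that slicing (a minimal sketch, assuming the header is a standard CodecUtil header under a hypothetical codec name; the actual NativeIndexReader in this PR may differ):

import java.io.IOException;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.store.IndexInput;

final class NativeIndexSliceSketch {
    // Hypothetical codec name; the real header written by NativeIndexWriter may differ.
    private static final String NATIVE_INDEX_CODEC_NAME = "NativeIndexHeader";

    static IndexInput sliceNativePayload(IndexInput source) throws IOException {
        long headerLength = CodecUtil.headerLength(NATIVE_INDEX_CODEC_NAME);
        long footerLength = CodecUtil.footerLength();
        long payloadLength = source.length() - headerLength - footerLength;
        // Only the raw native bytes (e.g. the serialized FAISS index) are handed to the native cache;
        // the caller keeps ownership of the source input and closes it.
        return source.slice("native-index-payload", headerLength, payloadLength);
    }
}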

Backwards Compatibility

To ensure backwards compatibility, a check on the Lucene version used to write the segment is implemented. Since 3.0 moves to Lucene 10, we assume that any index written with a Lucene version older than 10 uses the older custom codec.

For those segments, we fall back to the existing read logic of reading the custom file and skip the header and footer removal.
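A minimal sketch of that version check, mirroring the isSegmentUsingLegacyIndexCompoundCodec helper that appears later in this diff (the wrapper class is illustrative, and the check is written with the == false convention suggested in the review below):

import org.apache.lucene.index.SegmentInfo;
import org.apache.lucene.util.Version;

final class LegacyCompoundCodecCheckSketch {
    // Segments written before Lucene 10 are assumed to use the legacy custom compound codec
    // and take the old read path (no header/footer stripping).
    static boolean isSegmentUsingLegacyIndexCompoundCodec(SegmentInfo segmentInfo) {
        return segmentInfo.getVersion().onOrAfter(Version.LUCENE_10_0_0) == false;
    }
}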

Related Issues

Resolves #2536

Check List

  • [In progress for full coverage] New functionality includes testing.
  • [In progress] New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@epalaz epalaz changed the title Draft: #2536 Remove custom code for writing indexes outside of compound file… Deprecate custom compound codec for native index write logic Mar 28, 2025
…s. Implement a NativeIndexReader that can read these indexes from compound files and provide the IndexInput for loading it into the cache

Signed-off-by: Enes Palaz <[email protected]>
@epalaz epalaz force-pushed the deprecate-custom-compound-codec branch from a5ed81a to a2ca66c on March 28, 2025 20:15
@0ctopus13prime (Collaborator) left a comment:

I have one concern with respect to search performance degradation when the compound format is enabled. Could you address it?

try (Engine.Searcher searcher = indexShard.acquireSearcher("knn-warmup")) {
    log.info("[KNN] entered searcher");
    getAllEngineFileContexts(searcher.getIndexReader()).forEach((engineFileContext) -> {
        log.info("[KNN] Engine file context: [{}]", engineFileContext);
Collaborator:

Can you see any stringified logs with this?
I think EngineFileContext needs the @ToString annotation.
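For context, this is roughly what the suggestion amounts to (a sketch with hypothetical fields; the real EngineFileContext lives in the k-NN codebase):

import lombok.AllArgsConstructor;
import lombok.ToString;

// Illustrative only: with Lombok's @ToString, the "[{}]" placeholder in the log line prints the
// field values instead of the default Object#toString output. The fields here are hypothetical.
@ToString
@AllArgsConstructor
class EngineFileContextExample {
    private final String indexFileName;
    private final String vectorFieldName;
}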

@epalaz (Author):

I will remove these new info logs; I was using them to track some cluster logs.


@Override
public void close() throws IOException {
this.indexInput.close();
Collaborator:

I don't think we need to close the sliced IndexInput unless it was created explicitly with the clone() method.
It is safe to close just the source IndexInput (e.g. sliceSourceIndexInput) that the sliced IndexInput originates from.

You can find many places in Lucene that never call close() on a sliced IndexInput - https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L484

^^ In the example above, close() is not called on dataIn.
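A minimal sketch of the convention being described (the directory and file name are placeholders, not code from this PR): only the source input is closed, while the slice is treated as a plain view.

import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

final class SliceCloseConventionExample {
    // The file name below is a placeholder; the point is only the close() ownership.
    static byte readFirstByte(Directory directory) throws IOException {
        try (IndexInput source = directory.openInput("_0_my_field.faiss", IOContext.DEFAULT)) {
            IndexInput view = source.slice("whole-file-view", 0, source.length());
            return view.readByte(); // read through the slice...
        } // ...and close only the source; the slice is a view and needs no close() of its own
    }
}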

@epalaz (Author) commented Mar 28, 2025:

I was having issues with KNNWeightTests (which extends LuceneTestCase) when it is not closed. LuceneTestCase uses MockDirectoryWrapper to verify that all files are closed; without closing the source, tests were failing because a file was left open.

I also saw some other examples where sliced clones are kept around so they can be closed along with the source, like here: https://github.com/opensearch-project/OpenSearch/blob/edd854a61b6fc051dfab2a3ce4641748ff799491/server/src/main/java/org/opensearch/index/store/remote/filecache/FullFileCachedIndexInput.java#L75

Similarly here:
https://github.com/opensearch-project/OpenSearch/blob/edd854a61b6fc051dfab2a3ce4641748ff799491/server/src/main/java/org/opensearch/gateway/MetadataStateFormat.java#L312-L339

Collaborator:

That's from OpenSearch. Could you find an example of closing a sliced IndexInput in Lucene?
A sliced IndexInput is a view, and it does not have to be closed. (A cloned IndexInput is a whole different story, though.)

I'm not sure why MockDirectoryWrapper recognizes the sliced IndexInput as a separate stream to track. Let me reproduce it on my end.

Collaborator:

@epalaz @0ctopus13prime If the slice doesn't need to be closed, can we please remove this entire class? From what I can see, it is used only to close the slice and the IndexInput together.

@epalaz (Author):

@shatejas Yes, that is the approach I will take if we don't need it. I am considering some other options as well.

}

@Override
public IndexInput slice(String s, long l, long l1) throws IOException {
Collaborator:

Could you give better names to s, l, and l1?

@epalaz (Author):

For sure, I will update these in the final draft. If we can get away without closing the source IndexInput, we won't need this class anyway.

}

@Override
public void readBytes(byte[] bytes, int i, int i1) throws IOException {
Collaborator:

Could you give better names to i and i1?

@epalaz (Author):

For sure, I will update these in the final draft. If we can get away without closing the source IndexInput, we won't need this class anyway.

public static final int FLOAT_BYTE_SIZE = 4;

public static boolean isSegmentUsingLegacyIndexCompoundCodec(SegmentInfo segmentInfo) {
return !segmentInfo.getVersion().onOrAfter(Version.LUCENE_10_0_0);
Collaborator:

SUPER NIT: could you follow the convention?
segmentInfo.getVersion().onOrAfter(..) == false

* This index input extension allows passing both index inputs down to consumer so when close is called, both
* index inputs are closed. See NativeIndexReader for usage of this class.
*/
public class NativeIndexInput extends IndexInput {
Collaborator:

Is there a use case for calling clone()?

@epalaz (Author):

Not in our use case, but in Lucene jargon, input streams created by cloning or slicing are called clones of the original.

Collaborator:

Let's add a clone() method here for completeness.
Relying on the default clone() will cause subtle errors.
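A sketch of what such a wrapper with an explicit clone() might look like (assumptions: the class holds a header/footer-stripped slice plus its source, and it uses the descriptive parameter names requested earlier in this review; this is not the PR's actual NativeIndexInput):

import java.io.IOException;

import org.apache.lucene.store.IndexInput;

// Sketch only: wraps the header/footer-stripped slice together with its source so that
// close() releases the underlying file handle, and overrides clone() explicitly as suggested.
final class NativeIndexInputSketch extends IndexInput {
    private final IndexInput payload; // the sliced view that is actually read
    private final IndexInput source;  // the input the slice was created from; owns the file handle

    NativeIndexInputSketch(IndexInput payload, IndexInput source) {
        super("NativeIndexInputSketch(" + payload + ")");
        this.payload = payload;
        this.source = source;
    }

    @Override
    public IndexInput clone() {
        // Clones are views that callers never close, so clone only the payload view
        // instead of handing out another owner of the source.
        return payload.clone();
    }

    @Override
    public void close() throws IOException {
        source.close(); // closing the source releases the sliced view as well
    }

    @Override
    public long getFilePointer() {
        return payload.getFilePointer();
    }

    @Override
    public void seek(long position) throws IOException {
        payload.seek(position);
    }

    @Override
    public long length() {
        return payload.length();
    }

    @Override
    public IndexInput slice(String sliceDescription, long offset, long length) throws IOException {
        return payload.slice(sliceDescription, offset, length);
    }

    @Override
    public byte readByte() throws IOException {
        return payload.readByte();
    }

    @Override
    public void readBytes(byte[] buffer, int offset, int length) throws IOException {
        payload.readBytes(buffer, offset, length);
    }
}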

import org.apache.lucene.store.IndexInput;

@Log4j2
public class NativeIndexReader implements Closeable {
Collaborator:

Could you give a brief comment on what this class does?


// If it is non-legacy compound file, calculate size from file size - header - footer
// Cache will hold the index file bytes after the header hence the size calculation
long compoundFileLength = directory.fileLength(indexFileName);
Collaborator:

NIT: hmm... this is somewhat misleading? It should be indexFileLength.

@epalaz (Author):

That is true, will update.

* @return List of engine files
*/
public static List<String> getEngineFiles(String extension, String fieldName, SegmentInfo segmentInfo) {
public static List<String> getEngineFiles(KNNEngine engine, String fieldName, SegmentInfo segmentInfo) throws IOException {
@0ctopus13prime (Collaborator) commented Mar 28, 2025:

Oh wait, this will be called for every query.
With this change, the compound format path will involve opening a file and decoding it for every query, which will likely impact search latency.
I think we need to come up with a plan to avoid this cost.

@epalaz (Author):

Great point! Let me see if there is an alternative.
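Purely as an illustration of the kind of mitigation being discussed here (not something proposed in this PR), the resolved engine files could be memoized per segment and field so the compound file is opened and decoded once rather than on every query; the cache key and resolver below are hypothetical:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: memoize engine files per segment + field so repeated queries
// do not re-open and re-decode the compound file. Eviction on segment close is omitted.
final class EngineFileLookupCache {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    List<String> getEngineFiles(String segmentName, String fieldName, Supplier<List<String>> resolver) {
        return cache.computeIfAbsent(segmentName + "/" + fieldName, key -> resolver.get());
    }
}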


package org.opensearch.knn.index.codec.nativeindex;

public class NativeIndexReaderTests {
Collaborator:

Haha, what is this?

@epalaz (Author):

Oh yeah, I wanted to get the general logic verified before finalizing all the tests. I will polish this before the actual PR.


}
}

public static NativeIndexReader getReader(SegmentInfo segmentInfo) throws IOException {
Collaborator:

I am not sure we want this reader; it deviates from the construct of a Lucene reader, which is encapsulated in a codec. Shouldn't this logic be encapsulated in KNN80CompoundReader? Is there a way we can do that?

Moreover, this circles back to the performance point @0ctopus13prime brought up. While this is currently being cached, it might still impact tail latencies.

@epalaz (Author):

The idea behind the change is to get rid of the custom compound codec. The custom write logic is what made that codec necessary: when the compound format is used, we take the native index out of the compound file and save it separately.

The functionality of this reader is to strip the codec header/footer that NativeIndexWriter adds before the index is loaded into memory, so it is not part of the codec in that sense.

In terms of performance, I am working on getting the benchmark numbers, so if there is a significant impact we may abandon this approach entirely.

Collaborator:

  • Readers and writers are supposed to be inside the codec, so let's not deviate from the convention; we also need to rename this, as it can get really confusing.
  • The problem with this approach is that it opens a new compound reader with each request, which underneath opens an IndexInput that is assigned new virtual address space every time. With enough load it will eventually run out of address space and crash the node, making the cluster red. So even if the latencies are okay, we might not be able to use this.

I looked around a bit and there doesn't seem to be a way to access the compound reader that is cached in SegmentReader. From what it looks like, Lucene expects the KNN vector reader to be used, with a compound directory passed to it in the segment read state. The reader, as part of opening the IndexInput, should check the header and position the IndexInput at the right location to start reading the graph file. The links below should help you verify this.

The right way would be to move doANNSearch into a VectorSearcher which is used in Native99VectorReader.

@Vikasht34 (Collaborator):

@epalaz I know this is in draft, but can we quickly run a benchmark of the above approach and post the benchmarking results before we deep-dive into this PR?
