Move away from custom compoundformat #2536

@jmazanec15

Description

Recently, for native libraries, we introduced a change to interact with the files via IndexInput and IndexOutput. With this, we should be able to remove the custom CompoundFormat in our codec (see #2185).

However, when removing it, we get an error like:

Caused by: org.apache.lucene.index.CorruptIndexException: compound sub-files must have a valid codec header and footer: codec header mismatch: actual header=1232620912 vs expected header=1071082519 (resource=BufferedChecksumIndexInput(MemorySegmentIndexInput(path="/Users/jmazane/workspace/Opensearch/DockerRunner/k-NN-1/build/testclusters/integTest-0/data/nodes/0/indices/6zG12XzjQLaWyWiAL_OFaQ/0/index/_0_165_test_nested.test_vector.faiss")))
        at org.apache.lucene.codecs.CodecUtil.verifyAndCopyIndexHeader(CodecUtil.java:287) ~[lucene-core-10.1.0.jar:10.1.0 884954006de769dc43b811267230d625886e6515 - 2024-12-17 16:15:44]

This happens because we write a footer but no header for the native index files: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/nativeindex/NativeIndexWriter.java#L141-L150. The default compound format expects every sub-file to carry both a valid codec header and a footer; see the CompoundFormat interface: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/CompoundFormat.java#L41-L45.
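To make the mismatch concrete, the two integers in the exception can be decoded by hand. The sketch below uses only the JDK; the class and method names are made up for illustration. It shows that the expected value equals Lucene's `CodecUtil.CODEC_MAGIC` (`0x3fd76c17`), while the actual value is four ASCII bytes, which is most likely the start of the raw Faiss payload rather than any Lucene header.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for reading the numbers in the CorruptIndexException.
final class HeaderDecode {
    // Same value as Lucene's CodecUtil.CODEC_MAGIC, copied here so that the
    // sketch does not need Lucene on the classpath.
    static final int CODEC_MAGIC = 0x3fd76c17;

    // Interprets the four big-endian bytes of an int as ASCII characters.
    static String toAscii(int value) {
        byte[] bytes = ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        int expected = 1071082519; // "expected header" from the exception
        int actual = 1232620912;   // "actual header" from the exception

        System.out.println(expected == CODEC_MAGIC); // prints "true"
        System.out.println(toAscii(actual));         // prints "IxMp"
    }
}
```

In other words, the compound writer reads the first four bytes of the `.faiss` file, finds library data where it expects the codec magic, and rejects the file.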

We should get rid of the custom CompoundFormat so we can move towards just extending PerFieldKnnVectorsFormat. To do this, we need to write the header when creating the native index file, and make sure to read past it before forwarding the file contents to the underlying libraries.
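A rough sketch of that layout, using only the JDK so it runs without Lucene. Everything here is a simplification for illustration: the class name is hypothetical, and a real Lucene index header also carries the codec name, version, segment ID, and suffix. In the plugin this would presumably go through `CodecUtil.writeIndexHeader`, `CodecUtil.checkIndexHeader`, and `CodecUtil.writeFooter` instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical, simplified framing: [header magic][payload][footer].
final class FramedNativeFile {
    static final int HEADER_MAGIC = 0x3fd76c17;   // same value as Lucene's CODEC_MAGIC
    static final int FOOTER_MAGIC = ~HEADER_MAGIC;
    static final int HEADER_LENGTH = Integer.BYTES;
    // Footer: magic (int), checksum algorithm id (int), checksum (long).
    static final int FOOTER_LENGTH = 2 * Integer.BYTES + Long.BYTES;

    // Writes the header first, then the library bytes, then the footer.
    static byte[] write(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH + payload.length + FOOTER_LENGTH);
        buf.putInt(HEADER_MAGIC);
        buf.put(payload);
        buf.putInt(FOOTER_MAGIC);
        buf.putInt(0); // checksum algorithm id
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, buf.position()); // covers all bytes before the checksum
        buf.putLong(crc.getValue());
        return buf.array();
    }

    // Validates and skips the header, then returns only the bytes that the
    // native library should see.
    static byte[] readPayload(byte[] file) {
        if (file.length < HEADER_LENGTH + FOOTER_LENGTH) {
            throw new IllegalStateException("file too short to be framed");
        }
        int magic = ByteBuffer.wrap(file).getInt();
        if (magic != HEADER_MAGIC) {
            throw new IllegalStateException("header mismatch: actual=" + magic);
        }
        return Arrays.copyOfRange(file, HEADER_LENGTH, file.length - FOOTER_LENGTH);
    }

    public static void main(String[] args) {
        byte[] payload = "IxMp-native-index-bytes".getBytes(StandardCharsets.US_ASCII);
        byte[] framed = write(payload);
        // The library-facing bytes are unchanged after the round trip.
        System.out.println(Arrays.equals(readPayload(framed), payload));
    }
}
```

The point of the design is that the native library never sees the header: the reader checks it, then hands over a view that starts at the payload offset and ends before the footer.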

From a backward-compatibility (BWC) perspective, we will need to keep the old CompoundFormat around for old codecs. But we should be able to remove it from new codecs.

Metadata

Labels

Refactoring (Improve the design, structure, and implementation while preserving its functionality), v3.0.0
