Description
What is the bug?
We wrap DirectoryReader with SoftDeletesDirectoryReaderWrapper when soft deletes are possible. If soft deletes are present on a segment, SoftDeletesDirectoryReaderWrapper wraps the SegmentReader in a FilterCodecReader. This breaks the segmentName extraction in our LeafReader implementations, which then return null for segmentName. That in turn causes comparison conflicts in SegmentNameSorter, which throws the following warnings:
2025-03-30 17:21:49,462 WARN o.o.m.b.l.SegmentNameSorter [workFinishScheduler-1] Unexpected equality during leafReader sorting, expected sort to yield no equality to ensure consistent segment ordering. This may cause missing documents if both segmentscontains docs. LeafReader1DebugInfo: Class: org.opensearch.migrations.bulkload.lucene.version_9.LeafReader9
Context: LeafReaderContext(_np(9.7.0):C2770541:[diagnostics={mergeFactor=10, java.vendor=Eclipse Adoptium, os=Linux, os.version=5.10.230-223.885.amzn2.aarch64, timestamp=1739516803646, mergeMaxNumSegments=-1, lucene.version=9.7.0, source=merge, os.arch=aarch64, java.runtime.version=17.0.8+7}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=4ys2kvsdoahmrjm2aztass02u docBase=0 ord=0)
SegmentInfo: _np(9.7.0):C2770541:[diagnostics={mergeFactor=10, java.vendor=Eclipse Adoptium, os=Linux, os.version=5.10.230-223.885.amzn2.aarch64, timestamp=1739516803646, mergeMaxNumSegments=-1, lucene.version=9.7.0, source=merge, os.arch=aarch64, java.runtime.version=17.0.8+7}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=4ys2kvsdoahmrjm2aztass02u
LeafReader2DebugInfo: Class: org.opensearch.migrations.bulkload.lucene.version_9.LeafReader9
Context: LeafReaderContext(shadow.lucene9.org.apache.lucene.index.SoftDeletesDirectoryReaderWrapper$SoftDeletesFilterCodecReader@3dc22860 docBase=0 ord=0)
Now TimSort (used by Arrays.sort) validates the Comparator contract: the comparator is expected to define an equivalence relation consistent with its ordering (see the Comparator docs). This leads to the following error (only sometimes, when the mergeHi code path is hit with conflicting inputs):
2025-03-30 22:04:00,017 ERROR o.o.m.b.w.DocumentsRunner [DocumentBatchReindexer-1] Error prevented some batches from being processed
java.lang.IllegalArgumentException: Comparison method violates its general contract!
at java.base/java.util.TimSort.mergeHi(TimSort.java:903) ~[?:?]
at java.base/java.util.TimSort.mergeAt(TimSort.java:520) ~[?:?]
at java.base/java.util.TimSort.mergeForceCollapse(TimSort.java:461) ~[?:?]
at java.base/java.util.TimSort.sort(TimSort.java:254) ~[?:?]
at java.base/java.util.Arrays.sort(Arrays.java:1307) ~[?:?]
at java.base/java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:353) ~[?:?]
at java.base/java.util.stream.Sink$ChainedReference.end(Sink.java:258) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:510) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
at org.opensearch.migrations.bulkload.lucene.LuceneReader.getSegmentsFromStartingSegment(LuceneReader.java:69) ~[RFS-0.1.0-SNAPSHOT.jar:?]
at org.opensearch.migrations.bulkload.lucene.LuceneReader.readDocsByLeavesFromStartingPosition(LuceneReader.java:39) ~[RFS-0.1.0-SNAPSHOT.jar:?]
at org.opensearch.migrations.bulkload.lucene.LuceneIndexReader.lambda$readDocuments$0(LuceneIndexReader.java:69) ~[RFS-0.1.0-SNAPSHOT.jar:?]
at reactor.core.publisher.FluxUsing.subscribe(FluxUsing.java:85) ~[reactor-core-3.7.4.jar:3.7.4]
at reactor.core.publisher.InternalFluxOperator.subscribe(InternalFluxOperator.java:68) ~[reactor-core-3.7.4.jar:3.7.4]
at reactor.core.publisher.FluxSubscribeOn$SubscribeOnSubscriber.run(FluxSubscribeOn.java:194) ~[reactor-core-3.7.4.jar:3.7.4]
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) [reactor-core-3.7.4.jar:3.7.4]
at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) [reactor-core-3.7.4.jar:3.7.4]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
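A minimal, Lucene-free sketch of why a null segment name breaks the comparator contract. The comparator below is hypothetical (it is not the actual SegmentNameSorter code) but mirrors the failure mode: any null key compares equal to everything, which violates transitivity.

```java
import java.util.Comparator;

public class NullSegmentNameContract {
    // Hypothetical comparator mirroring the failure mode: when either
    // segment name is null, it falls back to "equal" (returns 0).
    static final Comparator<String> BROKEN = (a, b) -> {
        if (a == null || b == null) {
            return 0; // a null segment name compares equal to everything
        }
        return a.compareTo(b);
    };

    public static void main(String[] args) {
        // compare("_a", null) == 0 and compare(null, "_b") == 0,
        // but compare("_a", "_b") != 0. Transitivity of equality is
        // violated, which is exactly what TimSort detects and reports
        // as "Comparison method violates its general contract!".
        System.out.println(BROKEN.compare("_a", null));
        System.out.println(BROKEN.compare(null, "_b"));
        System.out.println(BROKEN.compare("_a", "_b"));
    }
}
```

TimSort only performs this check on some merge paths, which is why the failure is intermittent and input-order dependent.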
I ran into the error for 4 of the 1680 shards I was trying to migrate.
What are your migration environments?
Source snapshots from OS 2.11
How can one reproduce the bug?
Create snapshots from OS 2.11 with soft-deletes. LeafReader9#getSegmentName will return null for segments with soft-deletes. If you have many segments and run across multiple shards, you may hit the TimSort exception.
What is the expected behavior?
- segmentName is extracted correctly for segments with soft-deletes
- No failure from sorting
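One possible direction for a fix is to chase the wrapper's delegate until the underlying SegmentReader surfaces. The sketch below uses hypothetical stand-in types; the real code would unwrap Lucene's FilterCodecReader (e.g. via its delegate) rather than these minimal classes.

```java
// Hypothetical stand-ins for Lucene's reader types, illustrating the
// unwrap-the-delegate approach to recovering the segment name.
public class UnwrapSketch {
    interface Reader { }

    static final class SegmentReader implements Reader {
        final String segmentName;
        SegmentReader(String segmentName) { this.segmentName = segmentName; }
    }

    static final class FilterCodecReader implements Reader {
        final Reader delegate;
        FilterCodecReader(Reader delegate) { this.delegate = delegate; }
    }

    // Follow delegates until a SegmentReader surfaces; return its name,
    // or null if the chain bottoms out on something else.
    static String getSegmentName(Reader reader) {
        while (reader instanceof FilterCodecReader) {
            reader = ((FilterCodecReader) reader).delegate;
        }
        return (reader instanceof SegmentReader)
            ? ((SegmentReader) reader).segmentName
            : null;
    }

    public static void main(String[] args) {
        // A soft-deletes segment: SegmentReader wrapped in FilterCodecReader.
        Reader wrapped = new FilterCodecReader(new SegmentReader("_np"));
        System.out.println(getSegmentName(wrapped));
    }
}
```

With this, a wrapped soft-deletes segment yields its real name instead of null, so SegmentNameSorter never sees two "equal" readers.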
Do you have any additional context?
I have a fix on a personal branch and will raise a PR.