[BUG] SegmentName sort breaks for Lucene 9 indices with soft-deletes #1405

@kanatti

Description

What is the bug?

We wrap DirectoryReader with SoftDeletesDirectoryReaderWrapper whenever soft deletes may be present. If a segment actually contains soft deletes, SoftDeletesDirectoryReaderWrapper wraps its SegmentReader in a FilterCodecReader.

This breaks the segmentName extraction in our LeafReader implementations, which then return null for segmentName. That in turn causes a comparison conflict in SegmentNameSorter, which logs the following warning:

2025-03-30 17:21:49,462 WARN o.o.m.b.l.SegmentNameSorter [workFinishScheduler-1] Unexpected equality during leafReader sorting, expected sort to yield no equality to ensure consistent segment ordering. This may cause missing documents if both segmentscontains docs. LeafReader1DebugInfo: Class: org.opensearch.migrations.bulkload.lucene.version_9.LeafReader9
Context: LeafReaderContext(_np(9.7.0):C2770541:[diagnostics={mergeFactor=10, java.vendor=Eclipse Adoptium, os=Linux, os.version=5.10.230-223.885.amzn2.aarch64, timestamp=1739516803646, mergeMaxNumSegments=-1, lucene.version=9.7.0, source=merge, os.arch=aarch64, java.runtime.version=17.0.8+7}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=4ys2kvsdoahmrjm2aztass02u docBase=0 ord=0)
SegmentInfo: _np(9.7.0):C2770541:[diagnostics={mergeFactor=10, java.vendor=Eclipse Adoptium, os=Linux, os.version=5.10.230-223.885.amzn2.aarch64, timestamp=1739516803646, mergeMaxNumSegments=-1, lucene.version=9.7.0, source=merge, os.arch=aarch64, java.runtime.version=17.0.8+7}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=4ys2kvsdoahmrjm2aztass02u
 
LeafReader2DebugInfo: Class: org.opensearch.migrations.bulkload.lucene.version_9.LeafReader9
Context: LeafReaderContext(shadow.lucene9.org.apache.lucene.index.SoftDeletesDirectoryReaderWrapper$SoftDeletesFilterCodecReader@3dc22860 docBase=0 ord=0)
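For context, the segment name can still be recovered by peeling off the filter layers before looking for the underlying SegmentReader. A minimal sketch, not the actual fix; it assumes Lucene 9's FilterLeafReader.unwrap and FilterCodecReader.unwrap static helpers, and the helper name extractSegmentName is hypothetical:

```java
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.FilterCodecReader;
import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SegmentReader;

final class SegmentNames {
    // Hypothetical helper: unwrap filter layers before checking for SegmentReader.
    // FilterLeafReader.unwrap peels FilterLeafReader chains; FilterCodecReader.unwrap
    // peels FilterCodecReader chains such as SoftDeletesFilterCodecReader.
    static String extractSegmentName(LeafReader reader) {
        LeafReader unwrapped = FilterLeafReader.unwrap(reader);
        if (unwrapped instanceof CodecReader) {
            unwrapped = FilterCodecReader.unwrap((CodecReader) unwrapped);
        }
        if (unwrapped instanceof SegmentReader) {
            return ((SegmentReader) unwrapped).getSegmentName();
        }
        return null; // still unknown; callers should handle this explicitly
    }
}
```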

Now TimSort (used internally by Arrays.sort) validates the Comparator contract: the comparator must impose a consistent total ordering, with sgn(compare(x, y)) == -sgn(compare(y, x)) and transitivity (see the Comparator docs).

This leads to the following error (only sometimes, when the mergeHi code path is hit with conflicting inputs):

2025-03-30 22:04:00,017 ERROR o.o.m.b.w.DocumentsRunner [DocumentBatchReindexer-1] Error prevented some batches from being processed
java.lang.IllegalArgumentException: Comparison method violates its general contract!
	at java.base/java.util.TimSort.mergeHi(TimSort.java:903) ~[?:?]
	at java.base/java.util.TimSort.mergeAt(TimSort.java:520) ~[?:?]
	at java.base/java.util.TimSort.mergeForceCollapse(TimSort.java:461) ~[?:?]
	at java.base/java.util.TimSort.sort(TimSort.java:254) ~[?:?]
	at java.base/java.util.Arrays.sort(Arrays.java:1307) ~[?:?]
	at java.base/java.util.stream.SortedOps$SizedRefSortingSink.end(SortedOps.java:353) ~[?:?]
	at java.base/java.util.stream.Sink$ChainedReference.end(Sink.java:258) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:510) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499) ~[?:?]
	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921) ~[?:?]
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:682) ~[?:?]
	at org.opensearch.migrations.bulkload.lucene.LuceneReader.getSegmentsFromStartingSegment(LuceneReader.java:69) ~[RFS-0.1.0-SNAPSHOT.jar:?]
	at org.opensearch.migrations.bulkload.lucene.LuceneReader.readDocsByLeavesFromStartingPosition(LuceneReader.java:39) ~[RFS-0.1.0-SNAPSHOT.jar:?]
	at org.opensearch.migrations.bulkload.lucene.LuceneIndexReader.lambda$readDocuments$0(LuceneIndexReader.java:69) ~[RFS-0.1.0-SNAPSHOT.jar:?]
	at reactor.core.publisher.FluxUsing.subscribe(FluxUsing.java:85) ~[reactor-core-3.7.4.jar:3.7.4]
	at reactor.core.publisher.InternalFluxOperator.subscribe(InternalFluxOperator.java:68) ~[reactor-core-3.7.4.jar:3.7.4]
	at reactor.core.publisher.FluxSubscribeOn$SubscribeOnSubscriber.run(FluxSubscribeOn.java:194) ~[reactor-core-3.7.4.jar:3.7.4]
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84) [reactor-core-3.7.4.jar:3.7.4]
	at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37) [reactor-core-3.7.4.jar:3.7.4]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
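The contract violation can be shown in isolation. Below is a hypothetical comparator mirroring the failure mode (it is not the actual SegmentNameSorter code): when a segment name is null, it returns a positive value in both comparison directions, breaking the antisymmetry that TimSort relies on.

```java
import java.util.Comparator;

public class ComparatorContractDemo {
    // Hypothetical comparator mirroring the failure mode: segments whose
    // name could not be extracted (null) are pushed "after" everything,
    // in BOTH comparison directions. This breaks antisymmetry:
    // sgn(compare(a, b)) must equal -sgn(compare(b, a)).
    static final Comparator<String> BROKEN = (a, b) -> {
        if (a == null) return 1;  // unknown segment sorts last
        if (b == null) return 1;  // BUG: should return -1 here
        return a.compareTo(b);
    };

    public static void main(String[] args) {
        int ab = BROKEN.compare(null, "_np");
        int ba = BROKEN.compare("_np", null);
        // Both results are positive, so the comparator contradicts itself.
        // With enough elements, TimSort can detect this inconsistency and
        // throw "Comparison method violates its general contract!"
        System.out.println(ab > 0 && ba > 0); // prints "true"
    }
}
```

Whether the exception actually fires depends on the input ordering, which matches the "only sometimes" behavior observed above.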

I ran into this error on 4 of the 1,680 shards I was trying to migrate.

What are your migration environments?

Source snapshots from OS 2.11

How can one reproduce the bug?

Create snapshots from OS 2.11 with soft deletes. LeafReader9#getSegmentName will return null for segments with soft deletes. If you have many segments and run the migration across multiple shards, you may hit the TimSort exception.

What is the expected behavior?

  • The segment name is extracted correctly for segments with soft deletes.
  • No failures from sorting.

Do you have any additional context?

I have a fix on a personal branch and will raise a PR.
