[Feature Request] Add segment sorter to ShuffleForcedMergePolicy instead of name-based segment sorting #18168

Open
@prudhvigodithi

Description

Is your feature request related to a problem? Please describe

Currently, when OpenSearch performs a forced merge (e.g., _forcemerge?max_num_segments=1), it wraps the underlying merge policy with ShuffleForcedMergePolicy.

From my understanding, ShuffleForcedMergePolicy is a special merge policy that interleaves documents from the oldest and newest segments during the merge. "Interleaving" means mixing things together by alternating them. By shuffling documents from old and new segments, ShuffleForcedMergePolicy breaks the ordering pattern, enabling a more uniform distribution of data during force merge.

Now consider a time-series index using LogByteSizeMergePolicy. Merges always combine adjacent segments and assign the result a new Lucene name (via a global counter), so that name doesn't reflect the segment's true timestamp range. Because the current shuffle logic sorts segments by these Lucene names (info.name), the interleaving can drift from real time order.
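To make the naming drift concrete, here is a minimal simulation. It is a toy model, not Lucene code: the `NameCounter` class, the tuple layout, and the flush/merge sequence are all made up for illustration. The only Lucene behavior it assumes is that segment names are an underscore followed by a base-36 counter value.

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # base-36 alphabet


def to_base36(n: int) -> str:
    """Render a non-negative integer in base 36."""
    if n == 0:
        return "0"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out


class NameCounter:
    """Hypothetical stand-in for the global segment-name counter."""

    def __init__(self) -> None:
        self.next = 0

    def new_name(self) -> str:
        name = "_" + to_base36(self.next)
        self.next += 1
        return name


counter = NameCounter()

# Flush three segments in time order: (name, first_day, last_day).
segments = [(counter.new_name(), day, day) for day in (1, 2, 3)]

# Merge the two oldest adjacent segments; the result gets a NEW name.
a, b = segments[0], segments[1]
merged = (counter.new_name(), a[1], b[2])
segments = [merged] + segments[2:]

# Flush one more segment for day 4.
segments.append((counter.new_name(), 4, 4))

by_name = [s[0] for s in sorted(segments, key=lambda s: s[0])]
by_time = [s[0] for s in sorted(segments, key=lambda s: s[1])]
print("by name:", by_name)
print("by time:", by_time)
```

In this toy run the merged segment holding the oldest data (days 1 to 2) is named _3, so it sorts after _2 (day 3) by name even though it precedes it in time.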

This comes from #17404, where I saw a variance with desc_sort_timestamp after force merge. After some investigation, the main cause (#17404 (comment)) is document ID reassignment after the force merge, which comes from ShuffleForcedMergePolicy. More tests were done in #17737 (comment).

A high-level example: take the segments [_6 (Jan 1–4), _8 (Jan 4–6), _5 (Jan 6–7), _7 (Jan 7–8), _9 (Jan 8–9)]. With the current logic, sorting by name gives [_5, _6, _7, _8, _9], and the resulting merge order is [_5, _9, _6, _8, _7].
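The effect on document IDs can be sketched as follows. This assumes the merged segment receives the documents of each source segment as one contiguous block, in merge order, with no index sort configured. The per-segment document count of 100 is made up; only the names, date ranges, and merge order come from the example.

```python
# Time ranges (day of January) from the example; doc counts are hypothetical.
ranges = {"_6": (1, 4), "_8": (4, 6), "_5": (6, 7), "_7": (7, 8), "_9": (8, 9)}
doc_count = {name: 100 for name in ranges}


def doc_id_layout(merge_order):
    """Return (first_doc_id, last_doc_id, name, time_range) per source segment."""
    layout, base = [], 0
    for name in merge_order:
        layout.append((base, base + doc_count[name] - 1, name, ranges[name]))
        base += doc_count[name]
    return layout


layout = doc_id_layout(["_5", "_9", "_6", "_8", "_7"])
for first, last, name, (lo, hi) in layout:
    print(f"docIDs {first:>3}-{last:>3}  {name}  Jan {lo}-{hi}")
```

Under these assumptions the lowest doc IDs hold Jan 6–7 data rather than the oldest data, and the two halves of each pair are no longer the oldest and newest remaining time ranges.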

Describe the solution you'd like

Allow supplying a segment comparator (to be used for time-series indices) pulled from the index's own LeafSorter, so that the data is evenly interleaved and sort queries don't pay a penalty from skipping Lucene's BKD optimization.

Taking the same high-level example, [_6 (Jan 1–4), _8 (Jan 4–6), _5 (Jan 6–7), _7 (Jan 7–8), _9 (Jan 8–9)], the merge order should be [_6, _9, _8, _7, _5]. This gives a more predictable DocID assignment, and the newest and oldest data are interleaved properly.
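A rough model of the proposal, written in Python rather than against the Lucene API: the interleave step takes a pluggable sort key, so the same routine can order segments by name (current behavior) or by time range (proposed). The `Seg` type, the `interleave` function, and the use of the first day as the time key are illustrative assumptions. The sketch always emits the older segment of each pair first, which is a simplification of how the real policy orders a pair.

```python
from typing import Callable, List, NamedTuple


class Seg(NamedTuple):
    name: str       # Lucene segment name
    first_day: int  # first day of January covered (from the example)
    last_day: int   # last day of January covered


def interleave(segments: List[Seg], sort_key: Callable[[Seg], object]) -> List[Seg]:
    """Sort by sort_key, then alternate: oldest, newest, 2nd oldest, 2nd newest, ..."""
    ordered = sorted(segments, key=sort_key)
    out: List[Seg] = []
    left, right = 0, len(ordered) - 1
    while left < right:
        out.append(ordered[left])
        out.append(ordered[right])
        left += 1
        right -= 1
    if left == right:  # odd count: the middle segment goes last
        out.append(ordered[left])
    return out


segs = [
    Seg("_6", 1, 4),
    Seg("_8", 4, 6),
    Seg("_5", 6, 7),
    Seg("_7", 7, 8),
    Seg("_9", 8, 9),
]

by_name = [s.name for s in interleave(segs, lambda s: s.name)]
by_time = [s.name for s in interleave(segs, lambda s: s.first_day)]
print("name-based:", by_name)
print("time-based:", by_time)
```

With the name key this reproduces the current order [_5, _9, _6, _8, _7]; with the time key it yields the requested [_6, _9, _8, _7, _5]. In the actual change, the key would be replaced by a comparator derived from the index's LeafSorter.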

Related component

Search

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Labels

Search (Search query, autocomplete ...etc), enhancement (Enhancement or improvement to existing feature or request)
