Skip to content

Enable sort optimization on int, short and byte fields #127968

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

mayya-sharipova
Copy link
Contributor

@mayya-sharipova mayya-sharipova commented May 9, 2025

Before this PR sorting on integer, short and byte fields types used SortField.Type.LONG. This made sort optimization impossible for these field types.

This PR uses SortField.Type.INT for integer, short and byte fields. This enables sort optimization.

There are several caveats with changing sort type that are addressed:

  • Before mixed sort on integer and long fields was automatically supported, as both field types used SortField.TYPE.LONG. Now when merging results from different shards, we need to convert sort to LONG and results to long values.
  • Similar for collapsing when there is mixed INT and LONG sort types.
  • Index sorting. Similarly, before for index sorting on integer field, SortField.Type.LONG was used. This sort type is stored in the index writer config on disk and can't be modified. Now when providing sortField() for index sorting, we need to account for index version: for older indices return sort with SortField.Type.LONG and for new indices return SortField.Type.INT.

There is only 1 change that may be considered not backwards compatible:
Before if an integer field was missing a value , it sort values will return Long.MAX_VALUE in a search response. With this integer, it sort valeu will return Integer.MAX_VALUE. But I think this change is ok, as in our documentation, we don't provide information what value will be returned, we just say it will be sorted last.


Also closes #127965 (as same type validation in added for collapse queries)

Before this PR sorting on integer, short and byte fields types
used SortField.Type.LONG. This made sort optimization impossible
for these field types.

This PR uses SortField.Type.INT for integer, short and byte fields.
This enables sort optimization.

There are several caveats with changing sort type that are addressed:
- Before mixed sort on integer and long fields was automatically supported,
as both field types used SortField.TYPE.LONG. Now when merging results
from different shards, we need to convert sort to LONG and results to
long values.
- Similar for collapsing when there is mixed INT and LONG sort types.
- Index sorting. Similarly, before for index sorting on integer field,
SortField.Type.LONG was used. This sort type is stored in the index
writer config on disk and can't be modified. Now when providing
sortField() for index sorting, we need to account for index version:
for older indices return sort with SortField.Type.LONG and for new
indices return SortField.Type.INT.
@mayya-sharipova mayya-sharipova added >enhancement :Search Relevance/Search Catch all for Search Relevance labels May 9, 2025
@elasticsearchmachine elasticsearchmachine added v9.1.0 Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch labels May 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Copy link
Collaborator

Hi @mayya-sharipova, I've created a changelog YAML for you.

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented May 9, 2025

Benchmarks done on the geonames track where population and elevation were indexed as integer:
Results, the contender has over 2.4x - 6.7x times improvement in queries speed than the baseline

  • baseline (current main branch)
  • contender (this PR)
|                                                        Metric |                                       Task |         Baseline |        Contender |        Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------------------------------------------:|-----------------:|-----------------:|------------:|-------:|---------:|
|                                                 Segment count |                                            |     69           |     71           |     2       |        |   +2.90% |
|                                  90th percentile service time |                       desc_sort_population |     55.8922      |      9.73511     |   -46.1571  |     ms |  -82.58% |
|                                  90th percentile service time |    desc_sort_population_can_match_shortcut |     27.5603      |     11.3325      |   -16.2278  |     ms |  -58.88% |
|                                  90th percentile service time | desc_sort_population_no_can_match_shortcut |     27.6777      |     10.4456      |   -17.232   |     ms |  -62.26% |
|                                  90th percentile service time |                        asc_sort_population |     43.7791      |      7.9269      |   -35.8522  |     ms |  -81.89% |
|                                  90th percentile service time |             asc_sort_with_after_population |     62.1788      |      8.2714      |   -53.9074  |     ms |  -86.70% |
|                                  90th percentile service time |                        desc_sort_elevation |     52.0221      |      8.24807     |   -43.774   |     ms |  -84.15% |
|                                  90th percentile service time |                         asc_sort_elevation |     41.8961      |      6.16761     |   -35.7284  |     ms |  -85.28% |

Benchmarks done on the http_logs track where "ingest_percentage:20, where status and size were indexed as integer:
Results, the contender has over 6x-136x times improvement in queries speed than the baseline

  • baseline (current main branch)
  • contender (this PR)
|                                                        Metric |                                                   Task |         Baseline |        Contender |         Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|-------------------------------------------------------:|-----------------:|-----------------:|-------------:|-------:|---------:|
|                                                 Segment count |                                                        |     25           |     25           |      0       |        |    0.00% |
|                                  90th percentile service time |                                         sort_size_desc |    111.687       |     15.1521      |    -96.5345  |     ms |  -86.43% |
|                                  90th percentile service time |                                          sort_size_asc |     93.394       |     14.8044      |    -78.5896  |     ms |  -84.15% |
|                                  90th percentile service time |                                       sort_status_desc |    123.959       |     13.4924      |   -110.466   |     ms |  -89.12% |
|                                  90th percentile service time |                                        sort_status_asc |    105.192       |     11.857       |    -93.3347  |     ms |  -88.73% |
|                                  90th percentile service time |                 sort-size-desc-after-force-merge-1-seg |    111.995       |     21.2422      |    -90.7533  |     ms |  -81.03% |
|                                       90th percentile latency |                  sort-size-asc-after-force-merge-1-seg |  12386.2         |     11.8291      | -12374.4     |     ms |  -99.90% |
|                                       90th percentile latency |               sort-status-desc-after-force-merge-1-seg |    126.58        |     13.4277      |   -113.152   |     ms |  -89.39% |
|                                  90th percentile service time |                sort-status-asc-after-force-merge-1-seg |    104.873       |      7.42143     |    -97.4519  |     ms |  -92.92% |

@@ -154,8 +154,11 @@ private static Object convertValueFromSortType(String fieldName, SortField.Type
try {
switch (sortType) {
case DOC, INT:
if (value instanceof Number) {
return ((Number) value).intValue();
if (value instanceof Number valueNumber) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly thinking outloud since I'm not entirely sure what a user's expectations would be here. But when I read this snippet I was wondering if this logic should throw an error instead of defaulting the value to max int.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@john-wagster I've added a comment for clarification.

This is to support the current behaviour for compatibility. Currently, before this PR, all sort on INT fields was treated as Long sort, so search_after of values > Integer.MAX_VALUE were allowed.
Now to support this and also to support sort on mixed shards of int and long value, we should convert values larger than Integer.MAX_VALUE

Copy link
Contributor

@john-wagster john-wagster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left one comment, took a couple of passes through the PR, and overall lgtm to me.

@elasticsearchmachine
Copy link
Collaborator

Hi @mayya-sharipova, I've updated the changelog YAML for you.

@mayya-sharipova
Copy link
Contributor Author

@elasticsearchmachine run elasticsearch-ci/part-2

@mayya-sharipova
Copy link
Contributor Author

@elasticsearchmachine run elasticsearch-ci/part-3

@mayya-sharipova
Copy link
Contributor Author

@elasticsearchmachine run "Elasticsearch Serverless Checks"

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice results on the benchmark!
From what I understand bytes and shorts would benefit even more, right?
Let's follow up on that to ensure we have a rally track for these new optimisations.
It would be helpful to add dedicated YAML tests for this behavior to ensure full coverage.

@mayya-sharipova
Copy link
Contributor Author

@jimczi Thanks for the review, I will address it:

Let's follow up on that to ensure we have a rally track for these new optimisations.

I've added a PR for integer_sort to http_logs.

mayya-sharipova added a commit to elastic/rally-tracks that referenced this pull request May 26, 2025
This adds sorting on integer fields of "size" and "status".
We are optimizing integer sort in elastic/elasticsearch/pull/127968,
and it would be nice to have dedicated operations for integer sort.

Notice, no target-throughput for these operations, as when optimization is merged
we expect massive speedups.
@mayya-sharipova
Copy link
Contributor Author

@jimczi Thanks for the review.
I added more tests in : 8404d0f

@mayya-sharipova
Copy link
Contributor Author

@jimczi Do you have more comments for this PR, or could it be merged?

@mayya-sharipova
Copy link
Contributor Author

The test on transforms failing because of this: apache/lucene#14732

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was digging around and attempting to fix the transforms failure, indeed, I think its due to the bug you found here: apache/lucene#14732

Its frustrating that this class is just full of anonymous classes & private methods.

Any way for us to apply the fix here and use a XIndexSortSortedNumericDocValuesRangeQuery extends IndexSortSortedNumericDocValuesRangeQuery in the appropriate places?

Comment on lines +239 to +243
if (getType(sortFields[fieldIdx]) == SortField.Type.INT) {
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
FieldDoc fieldDoc = (FieldDoc) scoreDoc;
fieldDoc.fields[fieldIdx] = ((Number) fieldDoc.fields[fieldIdx]).longValue();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need to change the internal sort field type to LONG as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to convert them to sort field types to LONG, as TopDocs.merge will use this type to get Long comparator to sorting long values.

/**
* Comparator source for integer values.
*/
public class IntValuesComparatorSource extends LongValuesComparatorSource {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we don't provide an public BucketedSort newBucketedSort override for ints?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can do that in a follow up, but it should reduce memory consumption as we could tell BigArrays to use an IntArray, instead of a LongArray.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben, this is indeed a good follow up, I will add TODO:

@mayya-sharipova
Copy link
Contributor Author

mayya-sharipova commented May 30, 2025

I've updated http_logs benchmarks to compare integer sort before and after these changes. The following charts have been added:

  • for multiple segments: sort_size_desc, sort_size_asc, sort_status_desc, sort_status_asc,
  • for single segment: sort-size-desc-after-force-merge-1-seg, sort-size-asc-after-force-merge-1-seg, sort-status-desc-after-force-merge-1-seg, sort-status-asc-after-force-merge-1-seg

@mayya-sharipova
Copy link
Contributor Author

@elasticsearchmachine run elasticsearch-ci/part-1

@mayya-sharipova mayya-sharipova added auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) labels Jun 2, 2025
@elasticsearchmachine elasticsearchmachine merged commit 080a0cd into elastic:main Jun 2, 2025
18 checks passed
@mayya-sharipova mayya-sharipova deleted the sort_optimization_int_short_byte branch June 2, 2025 21:51
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 127968

mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 3, 2025
Before this PR sorting on integer, short and byte fields types used
SortField.Type.LONG. This made sort optimization impossible for these
field types.

This PR uses SortField.Type.INT for integer, short and byte fields. This
enables sort optimization.

There are several caveats with changing sort type that are addressed: -
Before mixed sort on integer and long fields was automatically
supported, as both field types used SortField.TYPE.LONG. Now when
merging results from different shards, we need to convert sort to LONG
and results to long values. - Similar for collapsing when there is mixed
INT and LONG sort types. - Index sorting. Similarly, before for index
sorting on integer field, SortField.Type.LONG was used. This sort type
is stored in the index writer config on disk and can't be modified. Now
when providing sortField() for index sorting, we need to account for
index version: for older indices return sort with SortField.Type.LONG
and for new indices return SortField.Type.INT.

---

There is only 1 change that  may be considered not backwards compatible:
Before if an integer field was [missing a
value](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/sort-search-results#_missing_values)
, it sort values will return Long.MAX_VALUE in a search response. With
this integer, it sort valeu will return Integer.MAX_VALUE.  But I think
this change is ok, as in our documentation, we don't provide information
what value will be returned, we just say it will be sorted last. 

---

Also closes elastic#127965 (as same type validation in added for collapse
queries)
joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Jun 3, 2025
Before this PR sorting on integer, short and byte fields types used
SortField.Type.LONG. This made sort optimization impossible for these
field types.

This PR uses SortField.Type.INT for integer, short and byte fields. This
enables sort optimization.

There are several caveats with changing sort type that are addressed: -
Before mixed sort on integer and long fields was automatically
supported, as both field types used SortField.TYPE.LONG. Now when
merging results from different shards, we need to convert sort to LONG
and results to long values. - Similar for collapsing when there is mixed
INT and LONG sort types. - Index sorting. Similarly, before for index
sorting on integer field, SortField.Type.LONG was used. This sort type
is stored in the index writer config on disk and can't be modified. Now
when providing sortField() for index sorting, we need to account for
index version: for older indices return sort with SortField.Type.LONG
and for new indices return SortField.Type.INT.

---

There is only 1 change that  may be considered not backwards compatible:
Before if an integer field was [missing a
value](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/sort-search-results#_missing_values)
, it sort values will return Long.MAX_VALUE in a search response. With
this integer, it sort valeu will return Integer.MAX_VALUE.  But I think
this change is ok, as in our documentation, we don't provide information
what value will be returned, we just say it will be sorted last. 

---

Also closes elastic#127965 (as same type validation in added for collapse
queries)
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Jun 3, 2025
elasticsearchmachine added a commit that referenced this pull request Jun 3, 2025
…8832)

* Enable sort optimization on int, short and byte fields (#127968)

Before this PR sorting on integer, short and byte fields types used
SortField.Type.LONG. This made sort optimization impossible for these
field types.

This PR uses SortField.Type.INT for integer, short and byte fields. This
enables sort optimization.

There are several caveats with changing sort type that are addressed: -
Before mixed sort on integer and long fields was automatically
supported, as both field types used SortField.TYPE.LONG. Now when
merging results from different shards, we need to convert sort to LONG
and results to long values. - Similar for collapsing when there is mixed
INT and LONG sort types. - Index sorting. Similarly, before for index
sorting on integer field, SortField.Type.LONG was used. This sort type
is stored in the index writer config on disk and can't be modified. Now
when providing sortField() for index sorting, we need to account for
index version: for older indices return sort with SortField.Type.LONG
and for new indices return SortField.Type.INT.

---

There is only 1 change that  may be considered not backwards compatible:
Before if an integer field was [missing a
value](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/sort-search-results#_missing_values)
, it sort values will return Long.MAX_VALUE in a search response. With
this integer, it sort valeu will return Integer.MAX_VALUE.  But I think
this change is ok, as in our documentation, we don't provide information
what value will be returned, we just say it will be sorted last.

---

Also closes #127965 (as same type validation in added for collapse
queries)

* [CI] Auto commit changes from spotless

* Add bucketedSort based on int

---------

Co-authored-by: elasticsearchmachine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Jun 3, 2025
Add bucketedSort on Int

Follow up on #127968
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request Jun 4, 2025
elasticsearchmachine pushed a commit that referenced this pull request Jun 4, 2025
Samiul-TheSoccerFan pushed a commit to Samiul-TheSoccerFan/elasticsearch that referenced this pull request Jun 5, 2025
Before this PR sorting on integer, short and byte fields types used
SortField.Type.LONG. This made sort optimization impossible for these
field types.

This PR uses SortField.Type.INT for integer, short and byte fields. This
enables sort optimization.

There are several caveats with changing sort type that are addressed: -
Before mixed sort on integer and long fields was automatically
supported, as both field types used SortField.TYPE.LONG. Now when
merging results from different shards, we need to convert sort to LONG
and results to long values. - Similar for collapsing when there is mixed
INT and LONG sort types. - Index sorting. Similarly, before for index
sorting on integer field, SortField.Type.LONG was used. This sort type
is stored in the index writer config on disk and can't be modified. Now
when providing sortField() for index sorting, we need to account for
index version: for older indices return sort with SortField.Type.LONG
and for new indices return SortField.Type.INT.

---

There is only 1 change that  may be considered not backwards compatible:
Before if an integer field was [missing a
value](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/sort-search-results#_missing_values)
, it sort values will return Long.MAX_VALUE in a search response. With
this integer, it sort valeu will return Integer.MAX_VALUE.  But I think
this change is ok, as in our documentation, we don't provide information
what value will be returned, we just say it will be sorted last. 

---

Also closes elastic#127965 (as same type validation in added for collapse
queries)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement :Search Relevance/Search Catch all for Search Relevance Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sorting in collapse on incompatible field types across shards returns 500
5 participants