New lexicographical sort mode for multi-valued keyword fields #127948

Closed
@igordemiranda

Description

Feature description

Consider the following 4 documents with a keyword field "names":

  • _doc/0: { names: [A, B] }
  • _doc/1: { names: [A, E] }
  • _doc/2: { names: [A, D] }
  • _doc/4: { names: [A, C] }

With the currently supported behavior, these 4 docs are considered ties when sorted by names in ascending order.

I would like to be able to sort these so that the lists are sorted lexicographically as a whole; i.e. when the first element is a tie, it compares the second element, and so on, like so:

  • _doc/0: { names: [A, B] }
  • _doc/4: { names: [A, C] }
  • _doc/2: { names: [A, D] }
  • _doc/1: { names: [A, E] }

More examples:

  • [ [A], [B], [A, B] ] would sort as [ [A], [A, B], [B] ]
  • [ [A], [A, B, C], [A, B, B] ] would sort as [ [A], [A, B, B], [A, B, C] ]
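For illustration only, here is a minimal Python sketch of the ordering I have in mind. It is not Elasticsearch code: the `docs` dictionary and the `lex_sort` helper are hypothetical names, and it simply relies on Python's built-in list comparison, which compares element by element and sorts a strict prefix first.

```python
# Each list stands in for one document's multi-valued "names" field.
docs = {
    "_doc/0": ["A", "B"],
    "_doc/1": ["A", "E"],
    "_doc/2": ["A", "D"],
    "_doc/4": ["A", "C"],
}


def lex_sort(values_by_id):
    """Return document ids ordered by comparing their value lists as a whole."""
    # Python compares lists lexicographically: first elements, then second, and so on.
    return sorted(values_by_id, key=lambda doc_id: values_by_id[doc_id])


print(lex_sort(docs))  # ['_doc/0', '_doc/4', '_doc/2', '_doc/1']

# The two extra examples behave the same way under plain list comparison.
print(sorted([["A"], ["B"], ["A", "B"]]))  # [['A'], ['A', 'B'], ['B']]
print(sorted([["A"], ["A", "B", "C"], ["A", "B", "B"]]))  # [['A'], ['A', 'B', 'B'], ['A', 'B', 'C']]
```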

Elasticsearch could solve this by introducing a new sort mode (e.g. lex, as a placeholder name for now) for multi-valued fields.

POST /_search
{
   "query" : {
      "match_all" : {}
   },
   "sort" : [
      {"names" : {"order" : "asc", "mode" : "lex"}}
   ]
}

How I had to solve this instead

I created a new field, names_sortKey, of type keyword. In my application, I join the list elements with a delimiter character that sorts before all printable characters (e.g. \u001F), and then I sort on this field.

Example:

{
  "names": ["A", "B", "C"],
  "names_sortKey": "A\u001fB\u001fC"
}

POST /_search
{
   "query" : {
      "match_all" : {}
   },
   "sort" : [
      {"names_sortKey" : {"order" : "asc"}}
   ]
}
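A rough Python sketch of the application-side join, for illustration. The `make_sort_key` helper is a hypothetical name, and the sketch assumes that no individual value contains a character at or below U+001F. Python compares strings by code point, which I assume matches how a plain keyword field orders these values.

```python
DELIMITER = "\u001f"  # unit separator, as in the example above


def make_sort_key(names, delimiter=DELIMITER):
    """Join one document's values into a single sortable string (hypothetical helper)."""
    return delimiter.join(names)


docs = [["A", "E"], ["AB"], ["A"], ["A", "C"], ["B"]]

# Sorting by the joined key gives the same order as comparing the lists directly.
print(sorted(docs, key=make_sort_key) == sorted(docs))  # True

# A delimiter that sorts after letters breaks the ordering, because "AB" < "A|C".
print(sorted(docs, key=lambda n: make_sort_key(n, "|")) == sorted(docs))  # False
```

The second check is why the delimiter has to sort before every character that can appear in a value: otherwise a longer first element such as "AB" can jump ahead of a list whose first element is "A".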

Activity

changed the title [-]New sort mode for multi-valued keyword fields[/-] [+]New lexicographical sort mode for multi-valued keyword fields[/+] on May 9, 2025
elasticsearchmachine

elasticsearchmachine commented on May 9, 2025

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

mayya-sharipova

mayya-sharipova commented on May 9, 2025

@mayya-sharipova
Contributor

You can also do that through a script for sort, like this (there is no need to index an extra field):

POST /_search
{
   "query" : {
      "match_all" : {}
   },
   "sort" : [
      {
         "_script" : {
            "type" : "string", 
            "script" : {
               "lang" : "painless",
               "source" : """
                  if (doc['names'].size() == 0) {
                    return ""; 
                  }
                  return doc['names'].join(' '); 
               """
            },
            "order" : "asc"
         }
      }
   ]
}

Is this a good option for you and can we close this issue?

igordemiranda

igordemiranda commented on May 9, 2025

@igordemiranda
Author

Hi Mayya, thanks for the quick response.

That does sound like a good general workaround. In my case, I'm using icu_collation_keyword for text sorting. For example:

PUT array_sorting_test/_mapping
{
  "properties": {
    "names": {
      "type": "text",
      "fields": {
        "sort": {
          "type": "icu_collation_keyword",
          "strength": "secondary",
          "alternate": "shifted",
          "variable_top": " ",
          "rules": "&[before 1]\u203E < \u001F"
        }
      }
    }
  }
}

I wonder:

  1. How to use this analyzer in a scripted sort.
  2. Whether that would be a heavy thing to do at sort time (especially given how ICU's sort key generation is considered "many times more expensive than doing a compare" - source) and, if so, having that join done at index time would be better for performance.

An alternative would be a custom analyzer that does the join at index time before the value goes through the ICU analyzer, but I couldn't find a way to do that. If I understood correctly, an analyzer cannot transform a multi-valued field into a single value? If you have any insights in that direction, that would be great too.

mayya-sharipova

mayya-sharipova commented on May 21, 2025

@mayya-sharipova
Contributor

@igordemiranda When you index a document with an icu_collation_keyword field, Elasticsearch uses the ICU library to generate a binary collation key for the field's string value. This precomputed binary value is stored in doc_values and is used later for sorting. Since it is already precomputed, sorting during search is very efficient, as it involves just a binary comparison.

Answering your specific questions:

  1. How to use this analyzer in a scripted sort? The analyzed value is precomputed and stored in doc values, which are used for sorting.
  2. Whether that would be a heavy thing to do at sort time? No, it is not, as sorting involves only a binary comparison of the precomputed values.

I think we can consider this issue closed.
Please feel free to reopen if you think it is not addressed.
