Scroll through list of IDs from Search index #6719

BartChris · 2025-10-13T15:54:56Z

One of the problems with the current Hibernate Search based setup is that it relies on the following flow:

1. Use the provided filter fields to search the search backend (e.g. Elasticsearch) for a specific term (TitleDocMain:Rhein).
2. Use the returned IDs to restrict the database query, which delivers the actual results, to the processes found by the search query.

Because of the hard limit of 10.000 results (from Elasticsearch/OpenSearch) in a search response, the ID list might not contain all relevant IDs (or might be filled with IDs of inactive processes), so the result returned from the DB is wrong. This experimental PR uses result set scrolling for the Excel export to retrieve all IDs.
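
The loop shape behind scrolling can be illustrated without a search backend. The sketch below is plain Java and does not use the Hibernate Search API from this PR: `ScrollSketch`, `drain` and `fakeScroll` are hypothetical names, and the supplier stands in for a cursor that hands out the next chunk of IDs on each call until it is exhausted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Sketch only: collects all IDs from a cursor-like source, chunk by chunk. */
final class ScrollSketch {

    private ScrollSketch() {
    }

    /**
     * Keeps asking for the next chunk until an empty chunk signals the end.
     *
     * @param nextChunk hypothetical stand-in for one scroll round trip
     */
    static List<Integer> drain(Supplier<List<Integer>> nextChunk) {
        List<Integer> allIds = new ArrayList<>();
        List<Integer> chunk = nextChunk.get();
        while (!chunk.isEmpty()) {
            allIds.addAll(chunk);
            chunk = nextChunk.get();
        }
        return allIds;
    }

    /** Fake cursor over the IDs 1..total, used only to exercise the loop. */
    static Supplier<List<Integer>> fakeScroll(int total, int chunkSize) {
        int[] position = {0}; // mutable cursor state captured by the lambda
        return () -> {
            List<Integer> chunk = new ArrayList<>();
            int end = Math.min(position[0] + chunkSize, total);
            for (int i = position[0]; i < end; i++) {
                chunk.add(i + 1);
            }
            position[0] = end;
            return chunk;
        };
    }
}
```

With a chunk size of 10.000, a fake index of 25.000 IDs is drained in three non-empty chunks, so no single response has to exceed the limit.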

BartChris · 2025-10-13T16:37:57Z

Additional thoughts here:

Why are we injecting multiple potentially large lists of IDs into the SQL query instead of building an intersection list from the multiple index queries?
The generated SQL already treats these lists effectively as an intersection of all passed-in IDs, but does so by applying large AND-joined filters. Maybe we can build the intersection in memory beforehand.

FROM Process AS process
WHERE process.project.client.id = :sessionClientId
AND process.id IN (:userFilter1query1)
AND process.id IN (:userFilter1query2)
AND (process.sortHelperStatus IS NULL OR process.sortHelperStatus != :completedState)
AND process.project.id IN (:projectIDs)
ORDER BY process.id DESC
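
Building the intersection in memory could look roughly like the following. This is a minimal sketch, not code from the PR: `IdIntersection` and `intersect` are hypothetical names, and it assumes each index query has already returned its own collection of process IDs.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: intersects the ID lists returned by several index queries. */
final class IdIntersection {

    private IdIntersection() {
    }

    /**
     * Returns the IDs contained in every list, in the order of the first list.
     * An empty input yields an empty result.
     */
    static Set<Integer> intersect(List<? extends Collection<Integer>> idLists) {
        if (idLists.isEmpty()) {
            return new LinkedHashSet<>();
        }
        Set<Integer> result = new LinkedHashSet<>(idLists.get(0));
        for (Collection<Integer> ids : idLists.subList(1, idLists.size())) {
            // Copy into a HashSet so each membership check is a hash lookup.
            result.retainAll(new HashSet<>(ids));
        }
        return result;
    }
}
```

The query would then need only a single `process.id IN (...)` predicate bound to the intersected set, instead of one predicate per index query.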

matthias-ronge · 2025-11-25T10:51:14Z

Why are we injecting multiple potentially large lists of IDs into the SQL query instead of building an intersection list from the multiple index queries?

I implemented something like this years ago in a different context, but then it was said that it unnecessarily complicated the code, and it was removed during the review. That's why I've omitted it this time. The code is naturally easier to read this way. If we really need it, we can of course reimplement it.

BartChris · 2025-11-25T10:53:27Z

I implemented something like this years ago in a different context, but then it was said that it unnecessarily complicated the code, and it was removed during the review. That's why I've omitted it this time. The code is naturally easier to read this way. If we really need it, we can of course reimplement it.

What I tried locally is sending multiple terms to Elasticsearch in a single query. This works as well:

public void performIndexSearches() {
    // Collect one (search field, token) pair per pending index query.
    List<Pair<String, String>> terms = new ArrayList<>();
    for (Pair<FilterField, String> indexQuery : indexQueries.values()) {
        terms.add(Pair.of(indexQuery.getLeft().getSearchField(), indexQuery.getRight()));
    }
    // One combined search returns only the IDs that match all terms.
    Collection<Integer> ids = indexingService.searchIds(Process.class, terms);
    Collection<Integer> finalIds = ids.isEmpty() ? NO_HIT : ids;

    // Bind the same ID list to every query parameter, then mark all index queries as done.
    for (String parameterName : indexQueries.keySet()) {
        parameters.put(parameterName, finalIds);
    }
    indexQueries.clear();
}

public Collection<Integer> searchIds(Class<? extends BaseBean> beanClass, List<Pair<String, String>> terms) {
    SearchSession searchSession = Search.session(HibernateUtil.getSession());
    // Project only the "id" field so that no entities are loaded.
    SearchProjection<Integer> idField = searchSession.scope(beanClass).projection().field("id", Integer.class)
            .toProjection();
    var query = searchSession.search(beanClass)
            .select(idField)
            .where(f -> {
                // Combine all terms with AND: a hit must match every (field, token) pair.
                var bool = f.bool();
                for (var term : terms) {
                    bool.must(f.match().field(term.getLeft()).matching(term.getRight()));
                }
                return bool;
            });
    List<Integer> ids = query.fetchAll().hits();

    logger.debug(
            "Searching {} IDs with terms {}: {} hits",
            beanClass.getSimpleName(),
            terms.stream()
                    .map(t -> t.getLeft() + "=\"" + t.getRight() + "\"")
                    .collect(Collectors.joining(", ")),
            ids.size()
    );
    return ids;
}

matthias-ronge · 2025-11-25T10:55:15Z

Yes, this seems like the most sensible solution. Do you want to set the pull request to ready for review?

BartChris · 2025-11-25T11:00:13Z

I can make a separate pull request proposing the multi-term query in Elasticsearch, which is independent of using the scroll API. Then we would at least have a more efficient usage of Elasticsearch. This PR here is still in draft state because we probably cannot solve key architectural constraints.
@henning-gerhardt has scenarios where 300.000 entries are coming back from Elasticsearch. I suppose the databases do not even support injecting more than 100.000 IDs into a WHERE clause, but we can of course try.

henning-gerhardt · 2025-11-25T11:11:23Z

Kitodo/src/main/java/org/kitodo/production/services/index/IndexingService.java

+                                         String value,
+                                         boolean useScroll) {
+
        SearchSession searchSession = Search.session(HibernateUtil.getSession());


While reading the code changes here: HibernateUtil.getSession() returns a Hibernate Session, which implements the AutoCloseable interface. Should this usage, here and in the other places in this class, be replaced by a try-with-resources statement? I noticed that searching in 3.9.x consumes more resources than before, and this could be a reason. Maybe this should be fixed in a separate pull request. I did not start a discussion about why search methods are defined and used in a class labeled IndexingService instead of in a SearchService class as in 3.8.x.

I will open a pull request to change this to use a try-with-resources statement for the Hibernate Session.
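
For readers unfamiliar with the pattern, here is a generic sketch of try-with-resources on an AutoCloseable. `DemoSession` is a hypothetical stand-in and not the Hibernate Session; the sketch only shows that the resource is closed when the block exits, including on an early return, and makes no claim about how HibernateUtil manages its sessions.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: demonstrates that try-with-resources closes the resource. */
final class TryWithResourcesSketch {

    private TryWithResourcesSketch() {
    }

    /** Hypothetical stand-in for a session that must be closed after use. */
    static final class DemoSession implements AutoCloseable {
        private boolean closed;

        List<Integer> searchIds() {
            if (closed) {
                throw new IllegalStateException("session already closed");
            }
            return List.of(1, 2, 3);
        }

        boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Runs the search and closes the session, even though the block returns early. */
    static List<Integer> searchAndClose(DemoSession session) {
        try (DemoSession s = session) {
            // Copy the hits so the result stays usable after the session is closed.
            return new ArrayList<>(s.searchIds());
        }
    }
}
```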

BartChris · 2025-11-25T11:12:59Z

My incremental approach would be to first improve the results from Elasticsearch by using more efficient query patterns (see above) and returning relevant data only (and not IDs of deactivated processes, which are not needed in many cases).
And once Elasticsearch returns ALL RELEVANT IDs in one go, we can start to think about how to make cases work where Elasticsearch returns more than 10.000 RELEVANT IDs.
One option would be to also use batching on the database side when creating the Excel export, to at least provide all relevant data there.
But the more I think about it, the more I doubt that we can achieve a good solution if Elasticsearch is only an ID feeder.
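
The database-side batching mentioned above could start from a simple partitioning step. This is a rough sketch with hypothetical names (`IdBatches`, `partition`); the batch size is an assumption to be tuned per database, and running one query per batch and concatenating the rows is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: splits a large ID list into batches for separate IN (...) queries. */
final class IdBatches {

    private IdBatches() {
    }

    /**
     * Splits the IDs into consecutive batches of at most batchSize elements.
     *
     * @throws IllegalArgumentException if batchSize is not positive
     */
    static List<List<Integer>> partition(List<Integer> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }
}
```

Each batch would be bound to the same query in turn. Note that an ORDER BY across the whole result then has to be re-established after the batches are merged.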

We only want to know one thing from Elasticsearch: Which IDs should be included in the result? We want no aggregations, no tokenizations, no relevance ranking. So we use nothing of what Elasticsearch is best at. But we still have to fight it.

henning-gerhardt · 2025-11-25T11:13:15Z

I must correct @BartChris on one point: we do not get only 300.000 hits back from Elasticsearch; we get over 500.000, or all data (over 800.000 hits, and this number is increasing day by day). The application must handle this in a way that the user can understand.

henning-gerhardt · 2025-11-25T11:17:11Z

We want no aggregations, no tokenizations, no relevance ranking. So we use nothing Elasticsearch is best at. But we still have to fight it.

Or in short: we misuse the search service (Elasticsearch or OpenSearch) in a way it was never developed for. We are not using relevance search at all, and maybe never will. We need this search service only for searching inside our metadata, which could possibly be solved with other solutions. But this will again be a long discussion and an even longer development change.

BartChris · 2025-11-25T11:20:49Z

We need this search service only for searching inside our metadata, which could possibly be solved with other solutions. But this will again be a long discussion and an even longer development change.

You are of course right. Maybe we need a longer discussion on that. My assertion would be that what we do right now in Elasticsearch can be done in MySQL and MariaDB, which have had sophisticated full-text search for many years. So we could store all the tokens we now generate by hand for Elasticsearch directly in the database and use the MATCH queries provided by the databases.

matthias-ronge · 2025-11-26T08:39:44Z

Let's discuss only this code review in this pull request. For the underlying problem, I've opened a new issue #6772

Scroll through list of IDs from Search index

794f492

BartChris mentioned this pull request Oct 13, 2025

[3.9] Scroll through list of IDs from Search index #6720

Closed

BartChris mentioned this pull request Oct 14, 2025

[3.9] Scroll through list of IDs from Search index #6723

Draft

henning-gerhardt reviewed Nov 25, 2025

This was referenced Nov 25, 2025

IndexingService: Use try-with-resource statement #6768

Merged

[3.9.x] IndexingService: Use try-with-resource statement #6769

Merged

matthias-ronge mentioned this pull request Nov 26, 2025

Searching for tens of thousands of hits yields incomplete results #6772

Open
