@BartChris
Collaborator

@BartChris BartChris commented Oct 13, 2025

One of the problems with the current Hibernate Search based setup is that it relies on the following flow:

  • First, use the provided filter fields to query the search backend (e.g. Elasticsearch) for a specific term (TitleDocMain:Rhein).
  • Then use the returned IDs to restrict the database query, which produces the actual results, to the hits of the search query.

Because Elasticsearch/OpenSearch enforces a hard limit of 10,000 results per search response, the ID list might not contain all relevant IDs (or might be filled up with IDs of inactive processes), so the result returned from the DB is wrong. This experimental PR uses result set scrolling for the Excel export to retrieve all IDs.
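The truncation problem and the chunked retrieval can be illustrated with a small plain-Java simulation. This is only a sketch under stated assumptions: `FakeIndex`, `fetchOnce` and `fetchAllChunked` are hypothetical names, the cap is modelled as a constant, and nothing here calls the real Elasticsearch or Hibernate Search scroll API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Plain-Java simulation; FakeIndex is a hypothetical stand-in for the search backend. */
final class ScrollSketch {

    /** Models the backend's hard cap on hits per response (assumed to be 10,000). */
    static final int MAX_RESULT_WINDOW = 10_000;

    /** Hypothetical index that holds the matching IDs 1..hitCount in ascending order. */
    static final class FakeIndex {
        private final List<Integer> matchingIds;

        FakeIndex(int hitCount) {
            matchingIds = IntStream.rangeClosed(1, hitCount).boxed().collect(Collectors.toList());
        }

        /** Returns at most min(size, MAX_RESULT_WINDOW) IDs that are greater than afterId. */
        List<Integer> search(int afterId, int size) {
            int limit = Math.min(size, MAX_RESULT_WINDOW);
            return matchingIds.stream()
                    .filter(id -> id > afterId)
                    .limit(limit)
                    .collect(Collectors.toList());
        }
    }

    /** A single request: the response is silently truncated at the cap. */
    static List<Integer> fetchOnce(FakeIndex index) {
        return index.search(0, Integer.MAX_VALUE);
    }

    /** Chunked retrieval: request the next chunk until an empty one comes back. */
    static List<Integer> fetchAllChunked(FakeIndex index, int chunkSize) {
        List<Integer> all = new ArrayList<>();
        int lastId = 0;
        while (true) {
            List<Integer> chunk = index.search(lastId, chunkSize);
            if (chunk.isEmpty()) {
                break;
            }
            all.addAll(chunk);
            lastId = chunk.get(chunk.size() - 1);
        }
        return all;
    }

    public static void main(String[] args) {
        FakeIndex index = new FakeIndex(25_000);
        System.out.println("single request: " + fetchOnce(index).size() + " IDs");
        System.out.println("chunked:        " + fetchAllChunked(index, 5_000).size() + " IDs");
    }
}
```

With 25,000 matching IDs, the single request returns only the first 10,000, while the chunked variant collects all of them.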

@BartChris
Collaborator Author

BartChris commented Oct 13, 2025

Additional thoughts here:

  • Why are we injecting multiple, potentially large lists of IDs into the SQL query instead of building an intersection list from the multiple index queries?
  • The generated SQL already treats these lists effectively as an intersection of all passed-in IDs, but it does so by applying large AND-joined filters. Maybe we can build the intersection in memory beforehand.
FROM Process AS process
WHERE process.project.client.id = :sessionClientId
  AND process.id IN (:userFilter1query1)
  AND process.id IN (:userFilter1query2)
  AND (process.sortHelperStatus IS NULL OR process.sortHelperStatus != :completedState)
  AND process.project.id IN (:projectIDs)
ORDER BY process.id DESC
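Building the intersection in memory could look roughly like the following. It is a minimal sketch, not code from this PR: `IdIntersectionSketch` and `intersect` are hypothetical names, and the ID lists are assumed to fit into memory. Starting from the smallest list keeps the working set small.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IdIntersectionSketch {

    /** Intersects all ID lists, starting with the smallest one. */
    static Set<Integer> intersect(List<? extends Collection<Integer>> idLists) {
        if (idLists.isEmpty()) {
            return new HashSet<>();
        }
        // Copy before sorting so that the caller's list is left untouched.
        List<Collection<Integer>> bySize = new ArrayList<Collection<Integer>>(idLists);
        bySize.sort((a, b) -> Integer.compare(a.size(), b.size()));

        Set<Integer> result = new HashSet<>(bySize.get(0));
        for (Collection<Integer> ids : bySize.subList(1, bySize.size())) {
            // Wrapping in a HashSet gives constant-time lookups during retainAll.
            result.retainAll(new HashSet<>(ids));
            if (result.isEmpty()) {
                break; // nothing can match any more
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> ids = intersect(List.of(
                List.of(1, 2, 3, 4),
                List.of(2, 3, 4, 5),
                List.of(3, 4, 9)));
        System.out.println(ids);
    }
}
```

The resulting set could then be bound to a single `process.id IN (...)` parameter instead of one parameter per index query. An empty intersection still needs explicit handling before it reaches the query.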

@matthias-ronge
Collaborator

Why are we injecting multiple, potentially large lists of IDs into the SQL query instead of building an intersection list from the multiple index queries?

I implemented something like this years ago in a different context, but then it was said that it unnecessarily complicated the code, and it was removed during the review. That's why I've omitted it this time. The code is naturally easier to read this way. If we really need it, we can of course reimplement it.

@BartChris
Collaborator Author

BartChris commented Nov 25, 2025

I implemented something like this years ago in a different context, but then it was said that it unnecessarily complicated the code, and it was removed during the review. That's why I've omitted it this time. The code is naturally easier to read this way. If we really need it, we can of course reimplement it.

What I tried locally is sending multiple terms to Elasticsearch in a single query. This works as well:

public void performIndexSearches() {
    // Collect one (search field, token) pair per pending index query.
    List<Pair<String, String>> terms = new ArrayList<>();
    for (Pair<FilterField, String> indexQuery : indexQueries.values()) {
        String field = indexQuery.getLeft().getSearchField();
        String token = indexQuery.getRight();
        terms.add(Pair.of(field, token));
    }
    // One multi-term search returns the IDs that match all terms.
    Collection<Integer> ids = indexingService.searchIds(Process.class, terms);
    Collection<Integer> finalIds = ids.isEmpty() ? NO_HIT : ids;

    // Every query parameter receives the same, already combined ID list.
    for (String parameterName : indexQueries.keySet()) {
        parameters.put(parameterName, finalIds);
    }
    indexQueries.clear();
}

public Collection<Integer> searchIds(Class<? extends BaseBean> beanClass, List<Pair<String, String>> terms) {
    SearchSession searchSession = Search.session(HibernateUtil.getSession());
    SearchProjection<Integer> idField = searchSession.scope(beanClass).projection().field("id", Integer.class)
            .toProjection();
    var query = searchSession.search(beanClass)
            .select(idField)
            .where(f -> {
                var bool = f.bool();
                for (var term : terms) {
                    bool.must(
                            f.match()
                                    .field(term.getLeft())
                                    .matching(term.getRight())
                    );
                }
                return bool;
            });
    List<Integer> ids = query.fetchAll().hits();

    logger.debug(
            "Searching {} IDs with terms {}: {} hits",
            beanClass.getSimpleName(),
            terms.stream()
                    .map(t -> t.getLeft() + "=\"" + t.getRight() + "\"")
                    .collect(Collectors.joining(", ")),
            ids.size()
    );
    return ids;
}

@matthias-ronge
Collaborator

Yes, this seems like the most sensible solution. Do you want to set the pull request to ready for review?

@BartChris
Collaborator Author

BartChris commented Nov 25, 2025

I can make a separate pull request proposing the multi-term query in Elasticsearch, which is independent of using the scroll API. Then we would at least have a more efficient usage of Elasticsearch. This PR is still in draft state because it probably cannot solve the key architectural constraints.
@henning-gerhardt has scenarios where 300,000 entries are coming back from Elasticsearch. I suppose the databases do not even support injecting more than 100,000 IDs into a WHERE clause, but we can of course try.

String value,
boolean useScroll) {

SearchSession searchSession = Search.session(HibernateUtil.getSession());
Collaborator

While reading the code changes here: HibernateUtil.getSession() returns a Hibernate Session, which implements the AutoCloseable interface. Should this usage, here and in the other places in this class, be replaced by a try-with-resources statement? I noticed that, starting with 3.9.x, searching consumes more resources than before, and this could be a reason. Maybe this should be fixed in a separate pull request. I am not starting a discussion here about why search methods are defined and used in a class named IndexingService instead of in a SearchService class as in 3.8.x.

Collaborator

I will open a pull request that switches to a try-with-resources statement for the Hibernate Session in use.
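The pattern under discussion can be sketched with a stand-in resource. `FakeSession` and `searchIds` below are hypothetical and only demonstrate that try-with-resources closes the resource when the block exits, including on an early return. Whether the session returned by HibernateUtil.getSession() should be closed by the caller depends on how that helper manages sessions, which this thread does not state.

```java
import java.util.List;
import java.util.function.Supplier;

final class SessionScopeSketch {

    /** Hypothetical stand-in for a Hibernate Session; it only tracks whether it is open. */
    static final class FakeSession implements AutoCloseable {
        private boolean open = true;

        boolean isOpen() {
            return open;
        }

        List<Integer> findIds() {
            if (!open) {
                throw new IllegalStateException("session is closed");
            }
            return List.of(1, 2, 3);
        }

        @Override
        public void close() {
            open = false;
        }
    }

    /** The session is closed automatically, even though the method returns from inside the block. */
    static List<Integer> searchIds(Supplier<FakeSession> sessionFactory) {
        try (FakeSession session = sessionFactory.get()) {
            return session.findIds();
        }
    }

    public static void main(String[] args) {
        FakeSession session = new FakeSession();
        List<Integer> ids = searchIds(() -> session);
        System.out.println(ids + ", session open afterwards: " + session.isOpen());
    }
}
```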

@BartChris
Collaborator Author

BartChris commented Nov 25, 2025

My incremental approach would be to first improve the results from Elasticsearch by using more efficient query patterns (see above) and returning only relevant data (not the IDs of deactivated processes, which are not needed in many cases).
Once Elasticsearch returns ALL RELEVANT IDs in one go, we can start to think about how to make cases work where Elasticsearch returns more than 10,000 RELEVANT IDs.
One option would be to also use batching on the database side when creating the Excel export, to at least provide all relevant data there.
But the more I think about it, the more I doubt that we can achieve a good solution if Elasticsearch is only an ID feeder.
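Batching on the database side could be sketched as follows. This is only an illustration: `partition`, `loadInBatches` and the `queryBatch` callback are hypothetical names, and the callback stands in for one database query with a bounded `IN (...)` list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class IdBatchingSketch {

    /** Splits the list into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            batches.add(items.subList(from, to));
        }
        return batches;
    }

    /** Runs one (hypothetical) database lookup per batch and concatenates the rows. */
    static <R> List<R> loadInBatches(List<Integer> ids, int batchSize,
            Function<List<Integer>, List<R>> queryBatch) {
        List<R> rows = new ArrayList<>();
        for (List<Integer> batch : partition(ids, batchSize)) {
            rows.addAll(queryBatch.apply(batch));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Integer> ids = IntStream.rangeClosed(1, 25).boxed().collect(Collectors.toList());
        // The lambda stands in for a query such as "... WHERE process.id IN (:batch)".
        List<String> rows = loadInBatches(ids, 10,
                batch -> batch.stream().map(id -> "process-" + id).collect(Collectors.toList()));
        System.out.println(partition(ids, 10).size() + " batches, " + rows.size() + " rows");
    }
}
```

Because each batch is queried separately, a global ordering such as ORDER BY process.id DESC would have to be re-established, for example by sorting the IDs before partitioning.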

We only want to know one thing from Elasticsearch: Which IDs should be included in the result? We want no aggregations, no tokenizations, no relevance ranking. So we use nothing of what Elasticsearch is best at. But we still have to fight it.

@henning-gerhardt
Collaborator

henning-gerhardt commented Nov 25, 2025

I must correct @BartChris on one point: we do not get only 300,000 hits back from Elasticsearch; we get over 500,000, or even all data (over 800,000 hits, and this number is increasing day by day). The application must handle this in a way that the user can understand.

@henning-gerhardt
Collaborator

We want no aggregations, no tokenizations, no relevance ranking. So we use nothing Elasticsearch is best at. But we still have to fight it.

Or in short: we misuse the search service (Elasticsearch or OpenSearch) in a way it was never developed for. We are not using relevance search at all, and maybe never will. We need this search service only for searching inside our metadata, which could perhaps be solved with other solutions. But this will again be a long discussion and an even longer development change.

@BartChris
Collaborator Author

BartChris commented Nov 25, 2025

We need this search service only for searching inside our metadata, which could perhaps be solved with other solutions. But this will again be a long discussion and an even longer development change.

You are of course right. Maybe we need a longer discussion on that. My assertion would be that what we do right now in Elasticsearch can be done in MySQL and MariaDB, which have had sophisticated full-text search for many years. So we could store all the tokens we currently generate by hand for Elasticsearch directly in the database and use the quite sophisticated MATCH queries provided by the databases.
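As a rough sketch of that idea, the search tokens could be turned into a boolean-mode full-text expression in which every word is required. The table name `process_tokens` and the column `tokens` are hypothetical, a FULLTEXT index on that column is assumed, and stopword and minimum token length settings of the database would still affect what actually matches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class FullTextQuerySketch {

    /** Hypothetical table and column; a FULLTEXT index on "tokens" is assumed. */
    static final String SQL =
            "SELECT id FROM process_tokens WHERE MATCH(tokens) AGAINST (? IN BOOLEAN MODE)";

    /** Builds a boolean-mode expression in which every word is required ("+word"). */
    static String toBooleanModeExpression(List<String> tokens) {
        return tokens.stream()
                // Replace boolean-mode operator characters so user input cannot alter the query logic.
                .map(token -> token.replaceAll("[+\\-<>()~*\"@]", " ").trim())
                .filter(token -> !token.isEmpty())
                .flatMap(token -> Arrays.stream(token.split("\\s+")))
                .map(word -> "+" + word)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The expression would be bound to the "?" placeholder of SQL via a PreparedStatement.
        System.out.println(SQL);
        System.out.println(toBooleanModeExpression(List.of("Rhein", "Karte (1850)")));
    }
}
```

For the tokens `Rhein` and `Karte (1850)`, the helper produces `+Rhein +Karte +1850`, which corresponds to the AND semantics of the multi-term query discussed earlier in this thread.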

@matthias-ronge
Collaborator

Let's discuss only this code review in this pull request. For the underlying problem, I've opened a new issue: #6772.
