-
Notifications
You must be signed in to change notification settings - Fork 68
Scroll through list of IDs from Search index #6719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Additional thoughts here:
FROM Process AS process
WHERE process.project.client.id = :sessionClientId
AND process.id IN (:userFilter1query1) AND process.id IN (:userFilter1query2)
AND (process.sortHelperStatus IS NULL OR process.sortHelperStatus != :completedState) AND process.project.id IN (:projectIDs) ORDER BY process.id DESC |
I implemented something like this years ago in a different context, but then it was said that it unnecessarily complicated the code, and it was removed during the review. That's why I've omitted it this time. The code is naturally easier to read this way. If we really need it, we can of course reimplement it. |
What i tried locally is sending multiple terms to Elasticsearch. this works as well: public void performIndexSearches() {
List<Pair<String, String>> terms = new ArrayList<>();
for (var iterator = indexQueries.entrySet().iterator(); iterator.hasNext();) {
Entry<String, Pair<FilterField, String>> entry = iterator.next();
String field = entry.getValue().getLeft().getSearchField();
String token = entry.getValue().getRight();
terms.add(Pair.of(field, token));
}
Collection<Integer> ids = indexingService.searchIds(Process.class, terms);
Collection<Integer> finalIds = ids.isEmpty() ? NO_HIT : ids;
for (var iterator = indexQueries.entrySet().iterator(); iterator.hasNext();) {
Entry<String, Pair<FilterField, String>> entry = iterator.next();
parameters.put(entry.getKey(), finalIds);
iterator.remove();
}
}
public Collection<Integer> searchIds(Class<? extends BaseBean> beanClass, List<Pair<String, String>> terms) {
SearchSession searchSession = Search.session(HibernateUtil.getSession());
SearchProjection<Integer> idField = searchSession.scope(beanClass).projection().field("id", Integer.class)
.toProjection();
var query = searchSession.search(beanClass)
.select(idField)
.where(f -> {
var bool = f.bool();
for (var term : terms) {
bool.must(
f.match()
.field(term.getLeft())
.matching(term.getRight())
);
}
return bool;
});
List<Integer> ids = query.fetchAll().hits();
logger.debug(
"Searching {} IDs with terms {}: {} hits",
beanClass.getSimpleName(),
terms.stream()
.map(t -> t.getLeft() + "=\"" + t.getRight() + "\"")
.collect(Collectors.joining(", ")),
ids.size()
);
return ids;
} |
|
Yes, this seems like the most sensible solution. Do you want to set the pull request to ready for review? |
|
I can make a seperate Pull request proposing the multi term query in Elasticsearch, which is independent of using the scroll API. Then we would have at least a more efficient usage of Elasticsearch. This PR here is still in draft state because we probably cannot solve key architectural constraints. |
| String value, | ||
| boolean useScroll) { | ||
|
|
||
| SearchSession searchSession = Search.session(HibernateUtil.getSession()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While reading the code changes here: HibernateUtil.getSession() is returing an Hibernate-Session which implements AutoCloseable interface. Should here and the other places in this class this usage replaced by a try-with-ressources statement? I noticed that using the search beginning with 3.9.x consumes more resources than before and this could be a reason. Maybe this should be fixed in a separate pull request. I did not start a discussion why in a IndexingService labeled class search methods are defined and used instead doing this in a SearchService class like in 3.8.x.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will open a pull request to change it to use try-with-resource statement for the used Hibernate Session.
|
My incremental approach would be to first improve the results from Elasticsearch, by using more efficient query patterns (see above) and return relevant data only (and not IDs of deactivated processes which are not needed in many cases). We only want to know one thing from Elasticsearch: Which IDs should be included in the result? We want no aggregations, no tokenizations, no relevance ranking. So we use nothing of what Elasticsearch is best at. But we still have to fight it. |
|
I must correct @BartChris in one point: we did not get only |
Or in short: we mis-use the search service (ElasticSearch or OpenSearch) in a way which the search service is never been developed for. We are not using the relevant search at all and maybe never. We need this search service only for searching inside our meta data which may can be solved with other solutions. But this will again a long discussion and an even long development change. |
You are of course right. Maybe we need a longer discussion on that. My assertion would be that what we do right now in Elasticsearch can be done in MySQL and MariaDB, which have sophisticated fulltext search since many years. So we could store all the tokens we generate by hand now for Elasticsearch directly in the database and use the quite sophisticated |
|
Let's discuss only this code review in this pull request. For the underlying problem, I've opened a new issue #6772 |
One of the problems with the current Hibernate Search based setup is that it relies on the following flow:
TitleDocMain:Rhein).Because of the hard limit of 10.000 results (from Elasticsearch/Opensearch) in a Search response the ID list might not contain all relevant (or IDs of non active processes) so the returned result from the DB is wrong. This experimental PR uses for the Excel export result set scrolling to retrieve all IDs.