@BartChris
Collaborator

@BartChris BartChris commented Nov 24, 2025

Fixes #6709

This addresses point 3) of #6743

This is a very small change which requires a longer explanation. As discussed extensively in the linked issue, the Kitodo search for process titles behaves differently depending on whether a title is searched over all fields or only in the process title field of the index.

The reason is that Kitodo tokenizes the input string differently at search time depending on which field is searched. In the process title search, Kitodo effectively drops all tokens with a length of less than three.

So Heutwia_898482011-1794081501_01-s is effectively searched as Heutwia_898482011-1794081501. This works because the same happens at index time.

The behaviour when searching over all fields, however, is different. In that case Kitodo drops no tokens and searches for Heutwia_898482011-1794081501_01-s. This is not what was indexed (only Heutwia_898482011-1794081501). As a result, the user gets no hits when searching for this process title over all fields, which is very confusing.
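The mismatch can be reproduced in a small sketch. This is not Kitodo's actual code: the class name, the splitting regex, and the minimum length of 1 used for the old all-fields query are assumptions chosen only to mirror the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of the Kitodo code base.
class TokenFilterSketch {
    // Default minimum token length, as described for the title search.
    static final int LENGTH_MIN_DEFAULT = 3;

    // Splits on every character that is neither a letter nor a digit
    // (an assumed rule) and drops tokens shorter than minLength.
    static List<String> tokenize(String input, int minLength) {
        List<String> tokens = new ArrayList<>();
        for (String token : input.split("[^\\p{L}\\p{N}]+")) {
            if (token.length() >= minLength) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "Heutwia_898482011-1794081501_01-s";

        // Index time: short tokens ("01", "s") are dropped.
        List<String> indexed = tokenize(title, LENGTH_MIN_DEFAULT);

        // Title search: same rule as at index time, so every query token is indexed.
        List<String> titleQuery = tokenize(title, LENGTH_MIN_DEFAULT);
        System.out.println(indexed.containsAll(titleQuery)); // true

        // Old all-fields search: no token dropped (assumed minimum length 1),
        // so "01" and "s" are requested although they were never indexed.
        List<String> allFieldsQuery = tokenize(title, 1);
        System.out.println(indexed.containsAll(allFieldsQuery)); // false
    }
}
```

With the same minimum length applied at index time and at search time, both queries reduce to the same token list, which is the effect this change aims for.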

The reason for this implementation was that, when searching projects over all fields, it was considered necessary that "project A" can be differentiated from "project B". Therefore, the tokenization for the project terms in the all field determines the search-time tokenization of all other fields, which leads to the confusing behaviour. (See: #6618 (comment))

As we already have a quite sophisticated project field search, the @kitodo/kitodo-community-board decided, after consultation with @matthias-ronge, that this special tokenization behaviour is not necessary. We opt to use the same default tokenization rules for all fields indexed in the global search field and to drop the special constant.

@solth
Member

solth commented Dec 4, 2025

@BartChris please rebase against the current main branch; that might resolve those failing Selenium tests!

@BartChris BartChris force-pushed the improve_search_coherence branch from f821434 to df50c0b Compare December 4, 2025 08:28
@solth
Member

solth commented Jan 8, 2026

@matthias-ronge does this change pose any problems in your opinion?

@matthias-ronge
Collaborator

This is a business decision, not a technical one. The only consideration would be to omit the parameter in FilterField entirely if it is set to LENGTH_MIN_DEFAULT for all search fields.

Collaborator

@matthias-ronge matthias-ronge left a comment

The change is OK.

I would suggest simplifying the code:

  • Remove parameter int minLength from function ProcessKeywords.filterMinLength(), use LENGTH_MIN_DEFAULT right in the for loop
  • Remove private final int minTokenLength and its getter from FilterField as it is always 3 now

But I'm also fine if this one is merged first and we clean it up in a later pull request.
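As a rough sketch of the first suggestion: the class below is hypothetical, and the parameter and return types are assumptions, since the actual signature of ProcessKeywords.filterMinLength() is not shown in this conversation. Only the idea is taken from the comment above: drop the int minLength parameter and compare against LENGTH_MIN_DEFAULT directly in the loop.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the relevant part of ProcessKeywords.
class FilterMinLengthSketch {
    static final int LENGTH_MIN_DEFAULT = 3;

    // Assumed previous shape: filterMinLength(Set<String> tokens, int minLength).
    // After the cleanup the constant is used right in the for loop.
    static Set<String> filterMinLength(Set<String> tokens) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : tokens) {
            if (token.length() >= LENGTH_MIN_DEFAULT) {
                result.add(token);
            }
        }
        return result;
    }
}
```

Once no caller passes a different minimum, the minTokenLength field and its getter in FilterField would have nothing left to configure, which is the second suggestion.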

Development

Successfully merging this pull request may close these issues.

Different filter behaviour with and without process:

3 participants