Search can't find text having soft hyphens and/or ligature control characters #133

@Moonbase59

Today I was scratching my head because TNT Search didn’t find pages that definitely contain my search words, until I realized that most of my text contains soft hyphens, which give the renderer hyphenation hints for our looong German words:

[Screenshot, 2023-11-21: Nite Radio blog page /blog/nite-radio-laeuft]

Now of course the search won’t find something like Text&shy;datei, or the same word with an invisible U+00AD inside, and a user cannot know how I hyphenated my text.
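To illustrate why the match fails, here is a minimal sketch using plain PHP string functions. It only demonstrates the byte-level mismatch and makes no claim about how TNT Search tokenizes or indexes internally; the variable names are made up for the example.

```php
<?php
// The indexed word contains an invisible soft hyphen (U+00AD) between "Text" and "datei".
$indexed = "Text\u{00AD}datei";
// What a visitor would realistically type into the search box.
$query   = "Textdatei";

var_dump($indexed === $query);       // bool(false): the byte sequences differ
var_dump(strlen($indexed));          // int(11): U+00AD takes 2 bytes in UTF-8
var_dump(strlen($query));            // int(9)
var_dump(strpos($indexed, $query));  // bool(false): no substring match either
```

Both strings look identical when rendered, which is exactly why the missing search hit is so confusing.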

It’s even worse with ligatures, which are heavily used in German Fraktursatz (Fraktur typesetting) as well as in Arabic, Persian and Indian languages to control how a word actually looks. This is mostly done using Unicode U+200C (zero-width non-joiner) and U+200D (zero-width joiner).

Here’s my proposal for better search:

Since we’re already "cleaning" the searched pages in getCleanContent() (in file user/plugins/tntsearch/classes/GravTNTSearch.php), we might as well remove these in-word Unicode control characters before looking for a match.

I have tried this here, using Grav v1.7.43, Admin v1.10.43, TNT Search v3.4.0, and it works well, just by adding:

// 2023-11-21 MCH - Remove some in-word Unicode that regularly breaks searches
$problematic = [
    '/&shy;/i', '/&#173;/', '/&#xAD;/i', '/\x{00AD}/u', // soft hyphen
    '/&zwj;/i', '/&#8205;/', '/&#x200D;/i', '/\x{200D}/u', // zero-width joiner
    '/&zwnj;/i', '/&#8204;/', '/&#x200C;/i', '/\x{200C}/u', // zero-width non-joiner
];
$content = preg_replace($problematic, '', $content) ?? $content;

in getCleanContent(). As you can see, we have to check the four most common spellings of each character (named entity, decimal entity, hexadecimal entity, and the literal character), since article editors could use any variant in their Markdown text. Some lucky ones even have keyboards with these characters on them.
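For trying the idea outside the plugin, here is a minimal, self-contained sketch that folds the four spellings of each character into one pattern. The function name stripInWordControls() is hypothetical and not part of TNT Search or Grav; allowing leading zeros in the numeric entities (e.g. &#x00AD;) is an extra assumption of mine, not something the snippet above does.

```php
<?php
// Hypothetical helper (not part of the plugin): removes soft hyphens and
// zero-width (non-)joiners, whether written as HTML entities or as literal characters.
function stripInWordControls(string $content): string
{
    $problematic = [
        '/&shy;|&#0*173;|&#x0*AD;|\x{00AD}/iu',     // soft hyphen
        '/&zwj;|&#0*8205;|&#x0*200D;|\x{200D}/iu',  // zero-width joiner
        '/&zwnj;|&#0*8204;|&#x0*200C;|\x{200C}/iu', // zero-width non-joiner
    ];
    // preg_replace() returns null on error (e.g. invalid UTF-8 with /u),
    // so fall back to the unmodified input in that case.
    return preg_replace($problematic, '', $content) ?? $content;
}

echo stripInWordControls("Text&shy;datei Brot&#8204;zeit Auf\u{200C}lage"), "\n";
// prints "Textdatei Brotzeit Auflage"
```

The /u modifier is needed so that \x{00AD} and friends are treated as Unicode code points rather than single bytes.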

I think this change would improve the TNT Search Plugin a lot, since it could then find text even if it has been typographically enhanced on the website. Of course, one could no longer search for the replaced entities themselves (like &shy;), but that shouldn’t be a problem, I think.

Strictly speaking, user input from the search box should also have these removed, but I assume a website user would probably never enter a soft hyphen or ligature control character in the search box. At least I wouldn’t enter Text&shy;datei, Brot&zwnj;zeit or Auf&zwnj;lage (or use the invisible keys) but would instead use a simple textdatei, brotzeit or auflage for searching:

[Screenshot, 2023-11-21: search page ("Suche") on Nite Radio]
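Should the search box input be cleaned as well, something like the following might be enough. This is only a sketch: normalizeQuery() is a hypothetical name, where it would be called in the plugin is left open, and it assumes visitors type literal characters rather than HTML entities.

```php
<?php
// Hypothetical helper for the search box input; not part of TNT Search.
function normalizeQuery(string $query): string
{
    // One character class covers U+00AD, U+200C and U+200D; /u enables UTF-8 mode.
    // Fall back to the raw query if preg_replace() fails and returns null.
    return preg_replace('/[\x{00AD}\x{200C}\x{200D}]/u', '', $query) ?? $query;
}

echo normalizeQuery("Brot\u{200C}zeit"), "\n"; // prints "Brotzeit"
```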

If there are no objections, I could prepare a pull request.
