Unexpected spaces in snippet around every character #35

Open
@Krinkle

Description

A web page containing QUnit.test('add', assert shows up in search result snippets as QUnit . test ( ' add ' , assert. Note the unexpected spaces around virtually every symbol. I believe this is most likely a side effect of each of these characters being wrapped in its own <span> in the HTML source. However, the source itself has no spaces around (most of) these characters.

Steps to reproduce

<code><span class="nx">QUnit</span><span class="p">.</span><span class="nx">test</span><span class="p">(</span><span class="dl">'</span><span class="s1">add</span><span class="dl">'</span><span class="p">,</span> <span class="nx">assert</span> <span class="o">=&gt;</span> <span class="p">{</span></code>

I'm evaluating Typesense for use on https://api.jquery.com, https://qunitjs.com and other OpenJS sites. I run typesense/docsearch-scraper via GitHub Actions, and docsearch is configured with "text": "p,li,tr,pre" among the selectors. The above code is part of a regular paragraph or PRE tag.

source: typense.yaml
source: /docsearch.config.json

Expected Behavior

Inline elements such as <span>, <em>, <code>, and <strong> should not cause additional spaces to be injected into the indexed text. It is not uncommon for prose to emphasize, underline, strike through, superscript, or otherwise wrap only part of a word in markup. It is probably most common in content with syntax-highlighted source code.
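
To illustrate the difference I mean, here is a minimal sketch in Python using only the standard library. It is not the scraper's actual code, and I have not confirmed that this is where the spaces come from. The names `TextCollector` and `extract_text` are hypothetical. It extracts the text nodes from the markup above and joins them once with a space separator (which reproduces the symptom) and once with an empty separator (the expected result).

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text between tags.
        self.chunks.append(data)


def extract_text(html, separator):
    """Join all text nodes of an HTML fragment with the given separator."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


# The markup from "Steps to reproduce".
SNIPPET = (
    '<code><span class="nx">QUnit</span><span class="p">.</span>'
    '<span class="nx">test</span><span class="p">(</span>'
    '<span class="dl">\'</span><span class="s1">add</span>'
    '<span class="dl">\'</span><span class="p">,</span> '
    '<span class="nx">assert</span> <span class="o">=&gt;</span> '
    '<span class="p">{</span></code>'
)

# Assumption: a space is inserted between every text node (the symptom).
spaced = extract_text(SNIPPET, " ")

# Expected: inline elements contribute no separator of their own.
joined = extract_text(SNIPPET, "")

print(spaced)
print(joined)
```

With an empty separator, the only spaces left are the ones that actually exist in the markup. Block-level elements such as <p> or <li> would still need a separator between them, so a real fix would have to distinguish inline from block elements.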

Metadata

Typesense Version: 0.24.1

OS: Debian 11 Bullseye
