Word splitting is not taking CJK into account #1506

@ncaq

Valid word

HashiCorp

Incorrect correction

HashCorp

Justification

https://www.hashicorp.com

Notes

When Japanese (CJK) characters are adjacent to ASCII characters without whitespace (e.g., 含むHashiCorp製品),
typos treats the entire mixed-script string as a single identifier rather than splitting at the script boundary.
As a result, an extend-identifiers entry such as HashiCorp = "HashiCorp" does not match, because the actual identifier is the longer mixed-script string.
The identifier is then split into subwords,
and Hashi is flagged as a typo for Hash.
An extend-words entry for HashiCorp cannot suppress this either, since extend-words is matched at the subword level and no subword equals the full string HashiCorp.
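
To make the failure mode concrete, here is a minimal Python sketch of the behavior described above. It is not typos' actual tokenizer: the `IDENTIFIER` and `SUBWORD` patterns and the `allowed_identifiers` set are hypothetical stand-ins, assuming an identifier is any run of Unicode word characters and subwords are split on case changes.

```python
import re

# Assumption: an identifier is a maximal run of Unicode word characters.
# In Python, \w on str patterns also matches CJK characters.
IDENTIFIER = re.compile(r"\w+")
# Assumption: subwords are ASCII case-delimited pieces such as "Hashi" and "Corp".
SUBWORD = re.compile(r"[A-Z]?[a-z]+|[A-Z]+")


def identifiers(text):
    """Return identifier-like tokens found in text."""
    return IDENTIFIER.findall(text)


def subwords(identifier):
    """Split one identifier into ASCII subwords."""
    return SUBWORD.findall(identifier)


# Hypothetical stand-in for an extend-identifiers allow list.
allowed_identifiers = {"HashiCorp"}

tokens = identifiers("含むHashiCorp製品")
# The whole mixed-script run is one token, so the exact-match lookup misses.
matched = [t for t in tokens if t in allowed_identifiers]
# The subwords are what would then be spell-checked individually.
pieces = subwords(tokens[0])
```

Under these assumptions, `tokens` contains only the full string 含むHashiCorp製品, `matched` is empty, and `pieces` contains Hashi and Corp, which mirrors why neither an identifier-level nor a word-level entry for HashiCorp applies.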

The expected behavior would be for typos to treat script boundaries (e.g., CJK to Latin) as identifier boundaries,
so that HashiCorp in 含むHashiCorp製品 is recognized as a standalone identifier and correctly matched against extend-identifiers entries.
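
The expected behavior could be sketched roughly as follows. This is only an illustration of the idea, not a proposed patch: `char_class` and `split_identifiers` are hypothetical names, and the ASCII versus non-ASCII test is a crude approximation of a real Unicode script lookup (it would, for example, also split accented Latin words).

```python
from itertools import groupby


def char_class(ch):
    """Crude stand-in for a script lookup: ASCII word chars vs. other word chars."""
    if ch.isascii():
        return "ascii" if (ch.isalnum() or ch == "_") else None
    return "non_ascii" if ch.isalnum() else None


def split_identifiers(text):
    """Return maximal runs of word characters that share one class."""
    return [
        "".join(run)
        for cls, run in groupby(text, key=char_class)
        if cls is not None  # drop whitespace and punctuation
    ]


# Hypothetical stand-in for an extend-identifiers allow list.
allowed_identifiers = {"HashiCorp"}

tokens = split_identifiers("含むHashiCorp製品")
# HashiCorp is now its own token, so an exact-match lookup can succeed.
matched = [t for t in tokens if t in allowed_identifiers]
```

With a boundary at each class change, the input splits into 含む, HashiCorp, and 製品, and the middle token matches the allow list directly.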


Labels: A-dict; A-token (Area: tokenization, including definition of identifiers and words); S-triage (Status: New; needs maintainer attention.)
