Word boundary detection broken for Unicode characters (ä, ö, ü, ß) on Windows without system-level UTF-8 #138

@zoig

Repository: ltex-ls-plus
Version: 18.6.1
OS: Windows 10/11 (without "Beta: Use Unicode UTF-8 for worldwide language support" system setting — requires admin rights to enable)
Java: bundled JDK 21.0.8+9


Description

On Windows systems where the system codepage is not UTF-8 (e.g. Cp1252 / Windows-1252), ltex-ls-plus incorrectly splits words at non-ASCII characters. Characters that are multi-byte in UTF-8, such as ä, ö, ü, Ö, ß, are not recognized as part of a word.

Example:

  • Österreich is treated as two tokens: Ö and sterreich
  • größte is treated as two tokens: gr and ßte

This means spell checking and grammar checking effectively do not work for any language using these characters (e.g. German, Austrian German).
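
One mechanism that would produce this kind of split is UTF-8 bytes being decoded as Windows-1252 somewhere between the editor and the tokenizer. This is offered as an assumption, not a confirmed diagnosis. The Java sketch below decodes the UTF-8 bytes of a word as Windows-1252 and then splits the result into runs of letters. The class `MojibakeDemo`, its methods, and the naive letter-run tokenizer are hypothetical and are not part of ltex-ls-plus or LanguageTool.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MojibakeDemo {
    // Hypothetical helper: encode as UTF-8, then decode the same bytes as Windows-1252.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // Naive tokenizer for illustration only: maximal runs of Unicode letters.
    static List<String> letterRuns(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this file independent of the source encoding:
        // \u00D6 = Ö, \u00F6 = ö, \u00DF = ß.
        String[] words = {"\u00D6sterreich", "gr\u00F6\u00DFte"};
        for (String word : words) {
            System.out.println(letterRuns(word) + " -> " + letterRuns(misdecode(word)));
        }
    }
}
```

Under this assumption, the two UTF-8 bytes of Ö (0xC3 0x96) become the letter Ã followed by an en dash in Windows-1252, so the word falls apart at that position. The console output itself may look garbled on a Cp1252 terminal; the token counts are what matter.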


Root Cause

The bundled JDK reports the following encoding settings on affected systems:

file.encoding   = UTF-8       ✅ (correctly set)
native.encoding = Cp1252      ❌
stdout.encoding = Cp1252      ❌
stderr.encoding = Cp1252      ❌
sun.jnu.encoding = Cp1252     ❌
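
For anyone who wants to compare their own system against the listing above, a small Java program along these lines prints the same properties. It is a minimal sketch: the class name `EncodingReport` is made up for illustration, and properties that a given JDK does not define are shown as `(unset)`.

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // Property names taken from the listing in this report.
    static final String[] KEYS = {
        "file.encoding",
        "native.encoding",
        "stdout.encoding",
        "stderr.encoding",
        "sun.jnu.encoding"
    };

    // Builds one "name = value" line per property, plus the JVM default charset.
    static String report() {
        StringBuilder sb = new StringBuilder();
        for (String key : KEYS) {
            sb.append(String.format("%-16s = %s%n", key, System.getProperty(key, "(unset)")));
        }
        sb.append(String.format("%-16s = %s%n", "defaultCharset", Charset.defaultCharset()));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Running it with the same JDK that launches ltex-ls-plus (for example `java EncodingReport.java`, assuming that runtime includes the compiler module needed for source-file mode) shows which values the server actually sees.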

Setting JAVA_TOOL_OPTIONS to override encodings partially helps:

-Dfile.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Dstderr.encoding=UTF-8 -Dnative.encoding=UTF-8

After applying these flags:

file.encoding   = UTF-8  ✅
stdout.encoding = UTF-8  ✅
stderr.encoding = UTF-8  ✅
native.encoding = Cp1252 ❌  (cannot be overridden via JAVA_TOOL_OPTIONS)
sun.jnu.encoding = Cp1252 ❌ (cannot be overridden via JAVA_TOOL_OPTIONS)

native.encoding and sun.jnu.encoding remain Cp1252 and cannot be overridden without system-level changes that require administrator rights.


Steps to Reproduce

  1. Use Windows without the "Beta: Use Unicode UTF-8" system codepage setting (requires admin)
  2. Install ltex-ls-plus via Mason (Neovim)
  3. Open a .tex or .md file containing German umlauts
  4. Observe that words containing ä, ö, ü, ß are split at the umlaut

Expected Behavior

Words like Österreich, größte, Übung should be recognized as single tokens and spell/grammar checked correctly.
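
As a point of comparison, the following sketch shows the expected tokenization using the JDK's own `java.text.BreakIterator`. This is only a stand-in: LanguageTool uses its own tokenizers, and the class `WordBoundaryCheck` is hypothetical. It illustrates that correctly decoded Java strings containing these characters come out as single words, which suggests looking at how the text is decoded before it reaches the tokenizer.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryCheck {
    // Returns the word-like segments of the text, skipping spaces and punctuation.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.GERMAN);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Keep only segments that begin with a letter.
            if (Character.isLetter(piece.codePointAt(0))) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // \u00D6 = Ö, \u00F6 = ö, \u00DF = ß, \u00DC = Ü
        System.out.println(words("\u00D6sterreich gr\u00F6\u00DFte \u00DCbung"));
    }
}
```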


Workaround

None available without administrator rights. Users with admin rights can enable:
Control Panel → Region → Administrative → Change system locale → Beta: Use Unicode UTF-8
and restart. This sets the system codepage to UTF-8 and should resolve the issue (untested, as I am working on a computer without admin rights).


Suggested Fix

The ltex-ls-plus launcher script (.bat on Windows) could explicitly pass -Dsun.jnu.encoding=UTF-8 as a JVM argument directly in the launch command rather than relying on JAVA_TOOL_OPTIONS. This would bypass the system codepage restriction:

"%JAVA_EXEC%" -Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 ... -jar ltex-ls.jar

If the JVM honors these flags when they are passed directly, this would fix the issue for all Windows users regardless of their system locale settings or admin rights.
