Word boundary detection broken for Unicode characters (ä, ö, ü, ß) on Windows without system-level UTF-8

**Repository:** ltex-ls-plus  
**Version:** 18.6.1  
**OS:** Windows 10/11 (without "Beta: Use Unicode UTF-8 for worldwide language support" system setting — requires admin rights to enable)  
**Java:** bundled JDK 21.0.8+9

---

### Description

On Windows systems where the system codepage is not UTF-8 (e.g. Cp1252 / Windows-1252), ltex-ls-plus splits words at non-ASCII characters. Characters that are multi-byte in UTF-8, such as `ä`, `ö`, `ü`, `Ö`, `ß`, are not recognized as part of a word.

**Example:**
- `Österreich` is treated as two tokens: `Ö` and `sterreich`
- `größte` is treated as two tokens: `gr` and `ßte`

This means spell checking and grammar checking effectively do not work for any language that uses these characters (e.g. German, Austrian German).

---

### Root Cause

The bundled JDK reports the following encoding settings on affected systems:
```
file.encoding   = UTF-8       ✅ (correctly set)
native.encoding = Cp1252      ❌
stdout.encoding = Cp1252      ❌
stderr.encoding = Cp1252      ❌
sun.jnu.encoding = Cp1252     ❌
```
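
The values above can be collected with a small standalone program run on the bundled JDK. This is a minimal sketch, not part of ltex-ls-plus; the class name `EncodingReport` is made up for illustration, and properties that a given JDK does not define are reported as `(unset)`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: prints the encoding-related system properties of the running JVM.
public class EncodingReport {
    static final String[] KEYS = {
        "file.encoding", "native.encoding", "stdout.encoding",
        "stderr.encoding", "sun.jnu.encoding"
    };

    // Returns the properties in a stable order; missing ones map to "(unset)".
    static Map<String, String> collect() {
        Map<String, String> result = new LinkedHashMap<>();
        for (String key : KEYS) {
            result.put(key, System.getProperty(key, "(unset)"));
        }
        return result;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.printf("%-17s= %s%n", key, value));
    }
}
```

Running it once without and once with `JAVA_TOOL_OPTIONS` set makes it easy to compare which properties actually change.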

Setting `JAVA_TOOL_OPTIONS` to override encodings partially helps:
```
-Dfile.encoding=UTF-8 -Dstdout.encoding=UTF-8 -Dstderr.encoding=UTF-8 -Dnative.encoding=UTF-8
```

After applying these flags:
```
file.encoding   = UTF-8  ✅
stdout.encoding = UTF-8  ✅
stderr.encoding = UTF-8  ✅
native.encoding = Cp1252 ❌  (cannot be overridden via JAVA_TOOL_OPTIONS)
sun.jnu.encoding = Cp1252 ❌ (cannot be overridden via JAVA_TOOL_OPTIONS)
```

`native.encoding` and `sun.jnu.encoding` remain `Cp1252` and cannot be overridden without system-level changes that require administrator rights.
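
One way an encoding mismatch of this kind can break word detection is sketched below. It is only an illustration of the general mechanism and is not confirmed to be what happens inside ltex-ls-plus: if UTF-8 bytes are decoded as Windows-1252 somewhere along the way, each umlaut turns into two characters, and some of those are not letters. The class name `MisdecodeDemo` is made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: what happens when UTF-8 bytes are read as Windows-1252.
public class MisdecodeDemo {
    // Encodes the text as UTF-8, then (wrongly) decodes the bytes as Windows-1252.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // True if every code point in the text is classified as a letter.
    static boolean allLetters(String text) {
        return text.codePoints().allMatch(Character::isLetter);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of the editor's encoding.
        String[] samples = {"\u00D6sterreich", "gr\u00F6\u00DFte", "\u00DCbung"};
        for (String word : samples) {
            String broken = misdecode(word);
            System.out.println(word + " -> " + broken
                + " (all letters before: " + allLetters(word)
                + ", after: " + allLetters(broken) + ")");
        }
    }
}
```

For example, `ö` is the byte pair `C3 B6` in UTF-8, which Windows-1252 reads as `Ã` followed by `¶`. The second character is not a letter, so a letter-based tokenizer would end the word there.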

---

### Steps to Reproduce

1. Use Windows without the "Beta: Use Unicode UTF-8" system codepage setting (requires admin)
2. Install ltex-ls-plus via Mason (Neovim)
3. Open a `.tex` or `.md` file containing German umlauts
4. Observe that words containing `ä`, `ö`, `ü`, `ß` are split at the umlaut

---

### Expected Behavior

Words like `Österreich`, `größte`, `Übung` should be recognized as single tokens and spell/grammar checked correctly.
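
For reference, the expected tokenization can be demonstrated with the JDK's own `java.text.BreakIterator`. This is a sketch under the assumption that the text has been decoded correctly; it is not the tokenizer ltex-ls-plus actually uses, and the class name `WordTokens` is made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo: word segmentation of correctly decoded German text.
public class WordTokens {
    // Returns all segments that contain at least one letter, in order.
    static List<String> words(String text) {
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.GERMAN);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            String piece = text.substring(start, end);
            // Skip whitespace and punctuation segments.
            if (piece.codePoints().anyMatch(Character::isLetter)) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Unicode escapes avoid any dependence on the source file encoding.
        System.out.println(words("\u00D6sterreich, gr\u00F6\u00DFte \u00DCbung"));
    }
}
```

With correctly decoded input, each of the three words comes back as a single token.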

---

### Workaround

None available without administrator rights. Users with admin rights can enable:  
`Control Panel → Region → Administrative → Change system locale → Beta: Use Unicode UTF-8`  
and restart. This sets the system codepage to UTF-8 and should resolve the issue. (I could not test this, as I am working on a computer without admin rights.)

---

### Suggested Fix

The ltex-ls-plus launcher script (`.bat` on Windows) could explicitly pass `-Dsun.jnu.encoding=UTF-8` as a JVM argument directly in the launch command rather than relying on `JAVA_TOOL_OPTIONS`. This would bypass the system codepage restriction:
```bat
"%JAVA_EXEC%" -Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 ... -jar ltex-ls.jar
```

If it works as intended, this would fix the issue for all Windows users regardless of their system locale settings or admin rights.
