Use byteidx()/utf16idx() for correct UTF-16 position conversion by mattn · Pull Request #1649 · prabirshrestha/vim-lsp

mattn · 2026-03-04T15:58:26Z

Summary

- Use byteidx(str, idx, v:true) and utf16idx(str, byteidx) for LSP position conversion when available (Vim 9.0.1485+).
- LSP uses UTF-16 code unit offsets, but the current conversion relies on strcharpart()/strchars(), which count Unicode code points. That is incorrect for characters outside the BMP (e.g. emoji encoded as surrogate pairs).
- Fall back to the existing code-point-based conversion on older Vim and on Neovim.
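To make the mismatch concrete, here is a minimal Python sketch (not the plugin's Vim script) that compares code point indices with UTF-16 code unit offsets. The helper names are hypothetical and exist only for illustration, and rounding an offset that lands inside a surrogate pair up to the next character is a choice made for this sketch.

```python
def utf16_len(s: str) -> int:
    """Number of UTF-16 code units needed to encode s."""
    return len(s.encode("utf-16-le")) // 2


def codepoint_to_utf16(line: str, cp_index: int) -> int:
    """Convert a code point index into a UTF-16 code unit offset."""
    return utf16_len(line[:cp_index])


def utf16_to_codepoint(line: str, utf16_offset: int) -> int:
    """Convert a UTF-16 code unit offset into a code point index.

    An offset inside a surrogate pair is rounded up to the next
    character (an assumption of this sketch).
    """
    units = 0
    for i, ch in enumerate(line):
        if units >= utf16_offset:
            return i
        # Characters outside the BMP take two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(line)


if __name__ == "__main__":
    for text in ("abc", "日本語", "a😀b"):
        print(repr(text), "code points:", len(text), "UTF-16 units:", utf16_len(text))
```

For ASCII and CJK text the two counts agree, which is why the discrepancy only shows up with non-BMP characters such as emoji.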

Changed files

- autoload/lsp/utils/position.vim: s:to_col() and s:to_char() now use the UTF-16-aware builtins
- autoload/lsp/utils.vim: lsp#utils#to_char() likewise

Test plan

- Verify that LSP go-to-definition, hover, and completion work correctly with ASCII, CJK, and emoji (non-BMP) characters
- Verify no regression on Neovim (falls back to existing behavior)

LSP uses UTF-16 code unit offsets for character positions, but the current implementation uses strcharpart()/strchars(), which count Unicode code points. This is incorrect for characters outside the BMP (e.g. emoji) that require surrogate pairs in UTF-16. When byteidx() with the utf16 flag and utf16idx() are available (Vim 9.0.1485+), use them for correct UTF-16 offset handling. Fall back to the existing code-point-based conversion on older Vim and on Neovim.

Avoid calling exists() on every lsp#utils#to_char() invocation.

Define separate s:to_col/s:to_char/lsp#utils#to_char functions at script load time based on exists('*utf16idx'), eliminating per-call branching overhead. Also extract the common line-fetching logic into an s:_get_line() helper.

mattn · 2026-03-04T16:19:07Z

Benchmark results

iterations: 100000
short : 48 bytes
long  : 303 bytes
vlong : 2903 bytes

to_col  short        old: 0.3346  new: 0.2818  (1.2x)
to_col  long         old: 0.3139  new: 0.3479  (0.9x)
to_col  vlong        old: 0.4681  new: 1.0679  (0.4x)

to_char short        old: 0.2833  new: 0.2712  (1.0x)
to_char long         old: 0.3356  new: 0.3827  (0.9x)
to_char vlong        old: 0.5920  new: 1.9010  (0.3x)

Short strings are slightly faster, but long strings are slower due to byteidx() calling ptr2len() twice per iteration in UTF-16 mode (once for surrogate pair check, once for advancing). This is a Vim core inefficiency that should be fixed separately.
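The figure in parentheses in the benchmark table appears to be the old time divided by the new time, so values below 1.0 mean the new code is slower. A short Python check, using only the numbers reported above (the time unit is not stated in the source), reproduces that column:

```python
# (function, input size) -> (old time, new time), copied from the table above.
results = {
    ("to_col", "short"): (0.3346, 0.2818),
    ("to_col", "long"): (0.3139, 0.3479),
    ("to_col", "vlong"): (0.4681, 1.0679),
    ("to_char", "short"): (0.2833, 0.2712),
    ("to_char", "long"): (0.3356, 0.3827),
    ("to_char", "vlong"): (0.5920, 1.9010),
}


def speedup(old: float, new: float) -> float:
    """Ratio old/new rounded to one decimal; values > 1.0 mean new is faster."""
    return round(old / new, 1)


if __name__ == "__main__":
    for (func, size), (old, new) in results.items():
        print(f"{func:7} {size:5} {speedup(old, new):.1f}x")
```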

The primary benefit of this PR is correctness: proper UTF-16 code unit handling for characters outside the BMP (emoji with surrogate pairs), which strcharpart()/strchars() (codepoint-based) cannot handle correctly.

- Replace v:none (Vim-only) with empty list [] in s:_get_line() to fix E121 on Neovim v0.4/v0.5
- Handle utf16idx() when byte index falls in the middle of a multi-byte character by rounding up to the next character index
- Handle byte index past end of string to avoid utf16idx() returning -1
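The two index edge cases in the list above can be illustrated with a small Python sketch. It is not the plugin's implementation: byte_to_utf16 is a hypothetical helper, and clamping an out-of-range byte index to the end of the line is an assumption here, since the source only says the -1 result is avoided.

```python
def byte_to_utf16(line: str, byte_index: int) -> int:
    """Map a 0-based UTF-8 byte index to a UTF-16 code unit offset.

    - A byte index inside a multi-byte character is rounded up to the
      offset of the next character.
    - A byte index past the end of the line is clamped to the line's
      UTF-16 length (assumption of this sketch).
    """
    raw = line.encode("utf-8")
    if byte_index >= len(raw):
        return len(line.encode("utf-16-le")) // 2

    units = 0  # UTF-16 code units consumed so far
    pos = 0    # UTF-8 bytes consumed so far
    for ch in line:
        if pos >= byte_index:
            break
        pos += len(ch.encode("utf-8"))
        units += 2 if ord(ch) > 0xFFFF else 1
    return units


if __name__ == "__main__":
    line = "a😀b"  # UTF-8 bytes: 1 + 4 + 1, UTF-16 units: 1 + 2 + 1
    for idx in (0, 1, 2, 5, 6, 100):
        print(idx, "->", byte_to_utf16(line, idx))
```

Byte index 2 falls inside the four-byte emoji, so the sketch returns the offset of the character that follows it.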

mattn added 4 commits March 5, 2026 00:57

Cache exists('*utf16idx') result in script variable

86ecc5a

Avoid calling exists() on every lsp#utils#to_char() invocation.

Remove fnametouri/fnamefromuri references not yet in Vim core

fad99d8

mattn mentioned this pull request Mar 4, 2026

Fix CI: replace v:none and handle mid-character utf16idx #1650

Closed

mattn wants to merge 5 commits into prabirshrestha:master from mattn:use-byteidx-utf16
