Use byteidx()/utf16idx() for correct UTF-16 position conversion#1649

Open
mattn wants to merge 5 commits into prabirshrestha:master from mattn:use-byteidx-utf16

mattn commented Mar 4, 2026

Collaborator

Summary

  • Use byteidx(str, idx, v:true) and utf16idx(str, byteidx) for LSP position conversion when available (Vim 9.0.1485+)
  • LSP uses UTF-16 code unit offsets, but the current strcharpart()/strchars() code counts Unicode codepoints, which is incorrect for characters outside the BMP (e.g. emoji, which need a surrogate pair in UTF-16)
  • Falls back to existing codepoint-based conversion on older Vim and Neovim
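
To make the mismatch concrete, here is a small Python sketch (not part of this PR, and not Vim script) that compares the two counts for one line. The helper name `utf16_len` is made up for illustration.

```python
def utf16_len(s: str) -> int:
    """Number of UTF-16 code units needed to encode s."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no BOM.
    return len(s.encode("utf-16-le")) // 2


line = "a😀b"
codepoints = len(line)         # Python's len() counts Unicode codepoints
utf16_units = utf16_len(line)  # what a UTF-16 based LSP offset counts

print(codepoints, utf16_units)  # prints "3 4"
```

With ASCII or CJK text inside the BMP the two counts agree, which is why the difference only shows up once a line contains something like an emoji.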

Changed files

  • autoload/lsp/utils/position.vim: s:to_col() and s:to_char() now use UTF-16 aware builtins
  • autoload/lsp/utils.vim: lsp#utils#to_char() likewise

Test plan

  • Verify LSP go-to-definition, hover, and completion work correctly with ASCII, CJK, and emoji (non-BMP) characters
  • Verify no regression on Neovim (falls back to existing behavior)

mattn added 4 commits March 5, 2026 00:57
LSP uses UTF-16 code unit offsets for character positions, but the
current implementation uses strcharpart()/strchars() which count
Unicode codepoints. This is incorrect for characters outside the BMP
(e.g. emoji) that require surrogate pairs in UTF-16.

When byteidx() with utf16 flag and utf16idx() are available
(Vim 9.0.1485+), use them for correct UTF-16 offset handling.
Falls back to the existing codepoint-based conversion on older
Vim and Neovim.

Avoid calling exists() on every lsp#utils#to_char() invocation.
Define separate s:to_col/s:to_char/lsp#utils#to_char functions at
script load time based on exists('*utf16idx'), eliminating per-call
branching overhead. Also extract common line-fetching logic into
s:_get_line() helper.
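
The load-time selection described above can be modelled in Python roughly as follows. This is only an analogy under stated assumptions: the names `select_to_char`, `_to_char_utf16` and `_to_char_codepoints` are hypothetical, the feature check is reduced to a boolean, and byte offsets are assumed to fall on character boundaries.

```python
from typing import Callable


def _prefix(line: str, byte_col: int) -> str:
    """Text before byte_col in the UTF-8 encoding of line.

    Assumes byte_col is on a character boundary; a trailing partial
    character is dropped by errors="ignore".
    """
    return line.encode("utf-8")[:byte_col].decode("utf-8", errors="ignore")


def _to_char_codepoints(line: str, byte_col: int) -> int:
    """Fallback: count Unicode codepoints before byte_col."""
    return len(_prefix(line, byte_col))


def _to_char_utf16(line: str, byte_col: int) -> int:
    """Preferred: count UTF-16 code units before byte_col."""
    return len(_prefix(line, byte_col).encode("utf-16-le")) // 2


def select_to_char(has_utf16_builtins: bool) -> Callable[[str, int], int]:
    """Pick the implementation once instead of branching on every call."""
    return _to_char_utf16 if has_utf16_builtins else _to_char_codepoints


# Bound once at "load time"; callers never re-check the feature flag.
to_char = select_to_char(has_utf16_builtins=True)
```

The point of the pattern is that the feature check runs once, and every later call goes straight to the chosen function.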

mattn commented Mar 4, 2026

Collaborator Author

Benchmark results

iterations: 100000
short : 48 bytes
long  : 303 bytes
vlong : 2903 bytes

to_col  short        old: 0.3346  new: 0.2818  (1.2x)
to_col  long         old: 0.3139  new: 0.3479  (0.9x)
to_col  vlong        old: 0.4681  new: 1.0679  (0.4x)

to_char short        old: 0.2833  new: 0.2712  (1.0x)
to_char long         old: 0.3356  new: 0.3827  (0.9x)
to_char vlong        old: 0.5920  new: 1.9010  (0.3x)

Short strings are about the same or slightly faster (1.0x to 1.2x), but long strings are slower, down to 0.3x for to_char on the 2903-byte string. The slowdown comes from byteidx() calling ptr2len() twice per iteration in UTF-16 mode (once for the surrogate pair check, once for advancing). This is a Vim core inefficiency that should be fixed separately.

The primary benefit of this PR is correctness: proper UTF-16 code unit handling for characters outside the BMP (emoji with surrogate pairs), which strcharpart()/strchars() (codepoint-based) cannot handle correctly.
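
As a rough model of the conversion that matters here, the Python sketch below maps a UTF-16 code unit offset to a UTF-8 byte offset. It is an illustration only, not Vim's byteidx() implementation; the function name is hypothetical, and the choice to round forward when the offset lands inside a surrogate pair is an assumption of this sketch.

```python
def utf16_to_byte_index(line: str, utf16_offset: int) -> int:
    """UTF-8 byte offset in line for a UTF-16 code unit offset.

    Offsets past the end clamp to the byte length of the line.
    An offset inside a surrogate pair rounds forward (sketch choice).
    """
    units = 0
    byte_idx = 0
    for ch in line:
        if units >= utf16_offset:
            break
        # Codepoints above U+FFFF take two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
        byte_idx += len(ch.encode("utf-8"))
    return byte_idx


line = "a😀b"
# "b" sits at UTF-16 offset 3 (1 for "a" + 2 for the emoji)
# and at byte offset 5 (1 for "a" + 4 for the emoji).
print(utf16_to_byte_index(line, 3))  # prints 5
```

A codepoint-based conversion would treat offset 3 as "after the third character", which is past "b" instead of in front of it.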

- Replace v:none (Vim-only) with empty list [] in s:_get_line() to fix
  E121 on Neovim v0.4/v0.5
- Handle utf16idx() when byte index falls in the middle of a multi-byte
  character by rounding up to the next character index
- Handle byte index past end of string to avoid utf16idx() returning -1
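
The two edge cases in this commit can be illustrated with a Python model of the reverse direction (byte offset to UTF-16 offset). This is a sketch of the intended behavior as described in the commit message, not the actual Vim script; the function name is hypothetical.

```python
def byte_to_utf16_index(line: str, byte_offset: int) -> int:
    """UTF-16 code unit offset for a UTF-8 byte offset into line.

    A byte offset inside a multi-byte character rounds up to the next
    character; an offset past the end clamps to the total length
    instead of signalling an error.
    """
    units = 0
    byte_idx = 0
    for ch in line:
        if byte_idx >= byte_offset:
            break
        byte_idx += len(ch.encode("utf-8"))
        units += 2 if ord(ch) > 0xFFFF else 1
    return units


line = "a😀b"  # bytes: "a" = 1, emoji = 4, "b" = 1
print(byte_to_utf16_index(line, 3))    # inside the emoji, prints 3
print(byte_to_utf16_index(line, 100))  # past the end, prints 4
```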