Tree-sitter's C API supports ts_parser_set_encoding, which makes node offsets line up with UTF-16 code units instead of UTF-8 bytes. This is a natural fit for JVM languages, where String is already UTF-16 internally.
ktreesitter v0.24.1 hardcodes UTF-8 in both parse paths:
- parse(source: String)
- parse(oldTree, callback)
This forces Kotlin callers to build a byte-to-char offset table on every parse to bridge tree-sitter's UTF-8 byte offsets and Kotlin's UTF-16 string indices. Exposing the encoding as an option would let us skip that step entirely.
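For context, here is a rough sketch of the kind of table callers end up building today. `buildByteToCharTable` is a hypothetical helper written for this issue, not part of ktreesitter, and it assumes the source is well-formed UTF-16 and that the node offsets are plain UTF-8 byte offsets.

```kotlin
/**
 * Hypothetical helper (not part of ktreesitter): maps each UTF-8 byte
 * offset of [source] to the index of the UTF-16 char it belongs to.
 * The final entry maps the end-of-input byte offset to source.length.
 * Assumes well-formed UTF-16 input.
 */
fun buildByteToCharTable(source: String): IntArray {
    // Upper bound: one UTF-16 char never needs more than 3 UTF-8 bytes
    // (a surrogate pair is 2 chars and 4 bytes).
    val table = IntArray(source.length * 3 + 1)
    var byteOffset = 0
    var charIndex = 0
    while (charIndex < source.length) {
        val codePoint = source.codePointAt(charIndex)
        val utf8Length = when {
            codePoint < 0x80 -> 1
            codePoint < 0x800 -> 2
            codePoint < 0x10000 -> 3
            else -> 4
        }
        // Every byte of this code point points back at its first char.
        repeat(utf8Length) { table[byteOffset + it] = charIndex }
        byteOffset += utf8Length
        charIndex += Character.charCount(codePoint)
    }
    table[byteOffset] = source.length
    // Trim the over-allocated tail.
    return table.copyOf(byteOffset + 1)
}
```

A caller would then translate a node's start and end byte offsets through this table before calling `substring`. The table costs O(n) time and memory per parse, which is the overhead a UTF-16 option would remove.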
Is there interest in this? Happy to contribute a PR if the direction sounds reasonable.