[BUG] caml_utf8_of_utf16 has a bug in high surrogate case #2006

Closed
@bzy-debug


Describe the bug

caml_utf8_of_utf16(String.fromCharCode(0xdbff, 0xdfff)) generates an incorrectly encoded UTF-8 string. The relevant branch is:

    } else if (
      c >= 0xdbff ||
      i + 1 === l ||
      (d = s.charCodeAt(i + 1)) < 0xdc00 ||
      d > 0xdfff
    ) {
      // Unmatched surrogate pair, replaced by \ufffd (replacement character)
      t += "\xef\xbf\xbd";

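Line 107 should be c > 0xdbff, since 0xdbff is a valid high surrogate (high surrogates span 0xd800 to 0xdbff). With >=, a pair that starts with 0xdbff is wrongly treated as unmatched.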
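
To make the proposed change concrete, here is a minimal standalone sketch of a UTF-16 to UTF-8 encoder that uses the `c > 0xdbff` comparison. It is not the jsoo source: `utf8OfUtf16Sketch` is a hypothetical name, the surrounding branches are my own reconstruction, and the sketch assumes every lone surrogate (including a lone 0xdfff) should become the replacement character.

```javascript
// Hypothetical helper, NOT the js_of_ocaml implementation.
// Returns a "byte string": each char code is one UTF-8 byte.
function utf8OfUtf16Sketch(s) {
  let t = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    let d;
    if (c < 0x80) {
      // ASCII: one byte
      t += String.fromCharCode(c);
    } else if (c < 0x800) {
      // Two-byte sequence
      t += String.fromCharCode(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    } else if (c < 0xd800 || c > 0xdfff) {
      // Three-byte sequence for non-surrogate BMP code units
      // (assumption of this sketch: all of 0xd800..0xdfff is treated as surrogate)
      t += String.fromCharCode(
        0xe0 | (c >> 12),
        0x80 | ((c >> 6) & 0x3f),
        0x80 | (c & 0x3f)
      );
    } else if (
      c > 0xdbff || // low surrogate with no high surrogate before it
      i + 1 === s.length || // high surrogate at end of input
      (d = s.charCodeAt(i + 1)) < 0xdc00 ||
      d > 0xdfff // next unit is not a low surrogate
    ) {
      // Unmatched surrogate, replaced by U+FFFD
      t += "\xef\xbf\xbd";
    } else {
      // Valid pair: combine into one code point and emit four bytes
      i++;
      const cp = 0x10000 + ((c - 0xd800) << 10) + (d - 0xdc00);
      t += String.fromCharCode(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return t;
}

// Small helper to print a byte string as hex
function hexOfByteString(bs) {
  return Array.from(bs)
    .map((ch) => ch.charCodeAt(0).toString(16).padStart(2, "0"))
    .join(" ");
}

console.log(hexOfByteString(utf8OfUtf16Sketch(String.fromCharCode(0xdbff, 0xdfff))));
// f4 8f bf bf
```

With `>=` in place of `>` on the first comparison, the same input would fall into the replacement branch, which is the behaviour reported here.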

Expected behavior

Expected result: b'\xf4\x8f\xbf\xbf' (obtained in Python with bytes.fromhex('dbffdfff').decode('utf-16be').encode('utf-8'))

Actual result: ef bf bd ed bf bf, obtained from:

Array.from(caml_utf8_of_utf16(String.fromCharCode(0xdbff, 0xdfff)))
    .map((c) => c.charCodeAt(0).toString(16).padStart(2, "0"))
    .join(" ")
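
The expected bytes can also be cross-checked from JavaScript itself, without Python. The sketch below uses the standard `TextEncoder` (available as a global in browsers and recent Node.js versions); `utf8Hex` is a hypothetical helper name introduced only for this example.

```javascript
// Reference encoding via the platform's built-in UTF-8 encoder.
// utf8Hex is a hypothetical helper for illustration.
function utf8Hex(s) {
  return Array.from(new TextEncoder().encode(s))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join(" ");
}

const expectedHex = utf8Hex(String.fromCharCode(0xdbff, 0xdfff));
console.log(expectedHex); // f4 8f bf bf
```

Comparing this output with the hex dump of caml_utf8_of_utf16 for the same input gives a quick regression check for the surrogate-pair branch.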

Versions

The latest version of jsoo contains this bug.
