fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextD…#5304
Open
tkshsbcue wants to merge 1 commit intoboa-dev:mainfrom
Open
fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextD…#5304tkshsbcue wants to merge 1 commit intoboa-dev:mainfrom
tkshsbcue wants to merge 1 commit intoboa-dev:mainfrom
Conversation
…ecoder The UTF-16 TextDecoder (both LE and BE) was passing raw code units directly to JsString, preserving unpaired surrogates instead of replacing them with U+FFFD as required by the WHATWG Encoding Standard. Route both decoders through a shared `decode_utf16_units` helper that uses `std::char::decode_utf16` with replacement, and simplify the UTF-16 BE decoder to borrow instead of taking ownership. Closes boa-dev#4612
Test262 conformance changes
Tested main commit: |
jedel1043
requested changes
Apr 7, 2026
Comment on lines
+482
to
+544
| const INVALID_UTF16_CASES: &[(&[u16], &[u16])] = &[ | ||
| // Lone high surrogate in the middle | ||
| ( | ||
| &[0x0061, 0x0062, 0xD800, 0x0077, 0x0078], | ||
| &[0x0061, 0x0062, 0xFFFD, 0x0077, 0x0078], | ||
| ), | ||
| // Lone high surrogate only | ||
| (&[0xD800], &[0xFFFD]), | ||
| // Two consecutive high surrogates | ||
| (&[0xD800, 0xD800], &[0xFFFD, 0xFFFD]), | ||
| // Lone low surrogate in the middle | ||
| ( | ||
| &[0x0061, 0x0062, 0xDFFF, 0x0077, 0x0078], | ||
| &[0x0061, 0x0062, 0xFFFD, 0x0077, 0x0078], | ||
| ), | ||
| // Low surrogate followed by high surrogate (wrong order) | ||
| (&[0xDFFF, 0xD800], &[0xFFFD, 0xFFFD]), | ||
| ]; | ||
|
|
||
| #[test] | ||
| fn decoder_utf16le_replaces_unpaired_surrogates() { | ||
| for (invalid, replaced) in INVALID_UTF16_CASES { | ||
| let mut input_bytes = Vec::with_capacity(invalid.len() * 2); | ||
| for &code_unit in *invalid { | ||
| input_bytes.extend_from_slice(&code_unit.to_le_bytes()); | ||
| } | ||
|
|
||
| let result = encodings::utf16le::decode(&input_bytes, false); | ||
| let expected = JsString::from(*replaced); | ||
| assert_eq!(result, expected, "utf16le failed for input {invalid:?}"); | ||
| } | ||
| } | ||
|
|
||
| #[test] | ||
| fn decoder_utf16be_replaces_unpaired_surrogates() { | ||
| for (invalid, replaced) in INVALID_UTF16_CASES { | ||
| let mut input_bytes = Vec::with_capacity(invalid.len() * 2); | ||
| for &code_unit in *invalid { | ||
| input_bytes.extend_from_slice(&code_unit.to_be_bytes()); | ||
| } | ||
|
|
||
| let result = encodings::utf16be::decode(&input_bytes, false); | ||
| let expected = JsString::from(*replaced); | ||
| assert_eq!(result, expected, "utf16be failed for input {invalid:?}"); | ||
| } | ||
| } | ||
|
|
||
| #[test] | ||
| fn decoder_utf16le_dangling_byte_produces_replacement() { | ||
| // Odd-length input: the last byte is truncated and replaced with U+FFFD | ||
| let input: &[u8] = &[0x41, 0x00, 0x42]; // 'A' (LE) + dangling byte | ||
| let result = encodings::utf16le::decode(input, false); | ||
| let expected = JsString::from(&[0x0041u16, 0xFFFD][..]); | ||
| assert_eq!(result, expected); | ||
| } | ||
|
|
||
| #[test] | ||
| fn decoder_utf16be_dangling_byte_produces_replacement() { | ||
| let input: &[u8] = &[0x00, 0x41, 0x42]; // 'A' (BE) + dangling byte | ||
| let result = encodings::utf16be::decode(input, false); | ||
| let expected = JsString::from(&[0x0041u16, 0xFFFD][..]); | ||
| assert_eq!(result, expected); | ||
| } |
Member
There was a problem hiding this comment.
Do we need these tests? Might be better to enable the utf16 decoding tests from the wpt suite instead, since those basically compile to the same kind of tests
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextDecoder
Summary
Fixes #4612.
The UTF-16
TextDecoder(both LE and BE) passed raw code units directly toJsString, preserving unpaired surrogates (e.g.\ud800) instead of replacing them with U+FFFD as required by the WHATWG Encoding Standard.Changes
decode_utf16_units()helper usingstd::char::decode_utf16with.unwrap_or('\u{FFFD}')to produce well-formed outpututf16le::decodeandutf16be::decodethrough this helperutf16be::decodeto borrow (&[u8]) instead of taking ownership (Vec<u8>)Test plan
decoder_utf16le_replaces_unpaired_surrogates— lone high/low surrogates, consecutive, reversed pairsdecoder_utf16be_replaces_unpaired_surrogates— same for big-endiandecoder_utf16le_dangling_byte_produces_replacement— odd-length inputdecoder_utf16be_dangling_byte_produces_replacement— odd-length input (BE)cargo fmtandcargo clippyclean