fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextD… by tkshsbcue · Pull Request #5304 · boa-dev/boa

tkshsbcue · 2026-04-07T14:43:54Z

fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextDecoder

Summary

The UTF-16 TextDecoder (both LE and BE) passed raw code units directly to JsString, preserving unpaired surrogates (e.g. \ud800) instead of replacing them with U+FFFD as required by the WHATWG Encoding Standard.
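To make the expected behaviour concrete, here is a minimal sketch that uses only the Rust standard library as a stand-in for a non-fatal decoder. It is not Boa's code: `decode_lossy` and `decode_strict` are hypothetical names introduced for illustration.

```rust
/// Lossy decode: every unpaired surrogate becomes U+FFFD,
/// which is the replacement behaviour this PR aims for.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Strict decode: returns `None` if the input contains an unpaired surrogate.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

With the input `[0x0061, 0xD800, 0x0062]`, `decode_lossy` yields `"a\u{FFFD}b"`, while `decode_strict` yields `None`. A well-formed pair such as `[0xD83D, 0xDE00]` passes through both unchanged.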

Changes

Add a shared decode_utf16_units() helper using std::char::decode_utf16 with .unwrap_or('\u{FFFD}') to produce well-formed output
Route both utf16le::decode and utf16be::decode through this helper
Simplify utf16be::decode to borrow (&[u8]) instead of taking ownership (Vec<u8>)
Handle dangling byte edge case: only append extra U+FFFD if the last code unit is not already a replaced high surrogate
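The changes listed above can be sketched roughly as follows. This is a reading of the description, not the PR's actual code: the body of `decode_utf16_units`, the `decode_bytes` wrapper, the `Vec<u16>` return type (the real decoders return a `JsString`), and the exact dangling-byte check are all assumptions.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical stand-in for the shared helper: decodes code units and
/// re-encodes them, replacing each unpaired surrogate with U+FFFD.
fn decode_utf16_units(units: impl IntoIterator<Item = u16>) -> Vec<u16> {
    let mut out = Vec::new();
    let mut buf = [0u16; 2];
    for c in decode_utf16(units) {
        // `decode_utf16` yields `Err` for every unpaired surrogate.
        let c = c.unwrap_or(REPLACEMENT_CHARACTER);
        out.extend_from_slice(c.encode_utf16(&mut buf));
    }
    out
}

/// Hypothetical wrapper shared by both endiannesses; `to_unit` is
/// `u16::from_le_bytes` or `u16::from_be_bytes`.
fn decode_bytes(bytes: &[u8], to_unit: fn([u8; 2]) -> u16) -> Vec<u16> {
    let chunks = bytes.chunks_exact(2);
    // An odd-length input leaves one byte that cannot form a code unit.
    let dangling = !chunks.remainder().is_empty();
    let units: Vec<u16> = chunks.map(|pair| to_unit([pair[0], pair[1]])).collect();
    // A high surrogate in final position is always unpaired, so it has
    // already been replaced by the helper (assumed reading of the PR).
    let ends_with_high = matches!(units.last().copied(), Some(0xD800..=0xDBFF));
    let mut out = decode_utf16_units(units);
    if dangling && !ends_with_high {
        out.push(0xFFFD);
    }
    out
}
```

Under these assumptions, `decode_bytes(&[0x41, 0x00, 0x42], u16::from_le_bytes)` gives `[0x0041, 0xFFFD]`, matching the dangling-byte tests in the test plan, and a valid surrogate pair passes through unchanged.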

Test plan

decoder_utf16le_replaces_unpaired_surrogates — lone high/low surrogates, consecutive, reversed pairs
decoder_utf16be_replaces_unpaired_surrogates — same for big-endian
decoder_utf16le_dangling_byte_produces_replacement — odd-length input
decoder_utf16be_dangling_byte_produces_replacement — odd-length input (BE)
All 33 existing text tests pass
cargo fmt and cargo clippy clean

…ecoder

The UTF-16 TextDecoder (both LE and BE) was passing raw code units directly to JsString, preserving unpaired surrogates instead of replacing them with U+FFFD as required by the WHATWG Encoding Standard.

Route both decoders through a shared `decode_utf16_units` helper that uses `std::char::decode_utf16` with replacement, and simplify the UTF-16 BE decoder to borrow instead of taking ownership.

Closes boa-dev#4612

github-actions · 2026-04-07T14:53:17Z

Test262 conformance changes

Test result	main count	PR count	difference
Total	53,125	53,125	0
Passed	51,049	51,049	0
Ignored	1,482	1,482	0
Failed	594	594	0
Panics	0	0	0
Conformance	96.09%	96.09%	0.00%

Tested main commit: b895f7411e6a74206c1bec8ba5a02df6791d12a2
Tested PR commit: 70c9ccba671f179e5a97b9be6ff53ef2f772502b
Compare commits: b895f74...70c9ccb

jedel1043 · 2026-04-07T17:50:01Z

core/runtime/src/text/tests.rs

+const INVALID_UTF16_CASES: &[(&[u16], &[u16])] = &[
+    // Lone high surrogate in the middle
+    (
+        &[0x0061, 0x0062, 0xD800, 0x0077, 0x0078],
+        &[0x0061, 0x0062, 0xFFFD, 0x0077, 0x0078],
+    ),
+    // Lone high surrogate only
+    (&[0xD800], &[0xFFFD]),
+    // Two consecutive high surrogates
+    (&[0xD800, 0xD800], &[0xFFFD, 0xFFFD]),
+    // Lone low surrogate in the middle
+    (
+        &[0x0061, 0x0062, 0xDFFF, 0x0077, 0x0078],
+        &[0x0061, 0x0062, 0xFFFD, 0x0077, 0x0078],
+    ),
+    // Low surrogate followed by high surrogate (wrong order)
+    (&[0xDFFF, 0xD800], &[0xFFFD, 0xFFFD]),
+];
+
+#[test]
+fn decoder_utf16le_replaces_unpaired_surrogates() {
+    for (invalid, replaced) in INVALID_UTF16_CASES {
+        let mut input_bytes = Vec::with_capacity(invalid.len() * 2);
+        for &code_unit in *invalid {
+            input_bytes.extend_from_slice(&code_unit.to_le_bytes());
+        }
+
+        let result = encodings::utf16le::decode(&input_bytes, false);
+        let expected = JsString::from(*replaced);
+        assert_eq!(result, expected, "utf16le failed for input {invalid:?}");
+    }
+}
+
+#[test]
+fn decoder_utf16be_replaces_unpaired_surrogates() {
+    for (invalid, replaced) in INVALID_UTF16_CASES {
+        let mut input_bytes = Vec::with_capacity(invalid.len() * 2);
+        for &code_unit in *invalid {
+            input_bytes.extend_from_slice(&code_unit.to_be_bytes());
+        }
+
+        let result = encodings::utf16be::decode(&input_bytes, false);
+        let expected = JsString::from(*replaced);
+        assert_eq!(result, expected, "utf16be failed for input {invalid:?}");
+    }
+}
+
+#[test]
+fn decoder_utf16le_dangling_byte_produces_replacement() {
+    // Odd-length input: the last byte is truncated and replaced with U+FFFD
+    let input: &[u8] = &[0x41, 0x00, 0x42]; // 'A' (LE) + dangling byte
+    let result = encodings::utf16le::decode(input, false);
+    let expected = JsString::from(&[0x0041u16, 0xFFFD][..]);
+    assert_eq!(result, expected);
+}
+
+#[test]
+fn decoder_utf16be_dangling_byte_produces_replacement() {
+    let input: &[u8] = &[0x00, 0x41, 0x42]; // 'A' (BE) + dangling byte
+    let result = encodings::utf16be::decode(input, false);
+    let expected = JsString::from(&[0x0041u16, 0xFFFD][..]);
+    assert_eq!(result, expected);
+}


Do we need these tests? It might be better to enable the UTF-16 decoding tests from the WPT suite instead, since those basically compile to the same kind of tests.

tkshsbcue requested a review from a team as a code owner April 7, 2026 14:43

github-actions bot added the Waiting On Review label Apr 7, 2026

github-actions bot added this to the v1.0.0 milestone Apr 7, 2026

github-actions bot added the C-Tests and C-Runtime labels Apr 7, 2026

jedel1043 requested changes Apr 7, 2026

tkshsbcue wants to merge 1 commit into boa-dev:main from tkshsbcue:fix/textdecoder-utf16-unpaired-surrogates
