Skip to content

fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextD…#5304

Open
tkshsbcue wants to merge 1 commit intoboa-dev:mainfrom
tkshsbcue:fix/textdecoder-utf16-unpaired-surrogates
Open

fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextD…#5304
tkshsbcue wants to merge 1 commit intoboa-dev:mainfrom
tkshsbcue:fix/textdecoder-utf16-unpaired-surrogates

Conversation

@tkshsbcue
Copy link
Copy Markdown
Contributor

fix(runtime): replace unpaired surrogates with U+FFFD in UTF-16 TextDecoder

Summary

Fixes #4612.

The UTF-16 TextDecoder (both LE and BE) passed raw code units directly to JsString, preserving unpaired surrogates (e.g. \ud800) instead of replacing them with U+FFFD as required by the WHATWG Encoding Standard.

Changes

  • Add a shared decode_utf16_units() helper using std::char::decode_utf16 with .unwrap_or('\u{FFFD}') to produce well-formed output
  • Route both utf16le::decode and utf16be::decode through this helper
  • Simplify utf16be::decode to borrow (&[u8]) instead of taking ownership (Vec<u8>)
  • Handle dangling byte edge case: only append extra U+FFFD if the last code unit is not already a replaced high surrogate

Test plan

  • decoder_utf16le_replaces_unpaired_surrogates — lone high/low surrogates, consecutive, reversed pairs
  • decoder_utf16be_replaces_unpaired_surrogates — same for big-endian
  • decoder_utf16le_dangling_byte_produces_replacement — odd-length input
  • decoder_utf16be_dangling_byte_produces_replacement — odd-length input (BE)
  • All 33 existing text tests pass
  • cargo fmt and cargo clippy clean

…ecoder

The UTF-16 TextDecoder (both LE and BE) was passing raw code units
directly to JsString, preserving unpaired surrogates instead of
replacing them with U+FFFD as required by the WHATWG Encoding Standard.

Route both decoders through a shared `decode_utf16_units` helper that
uses `std::char::decode_utf16` with replacement, and simplify the
UTF-16 BE decoder to borrow instead of taking ownership.

Closes boa-dev#4612
@tkshsbcue tkshsbcue requested a review from a team as a code owner April 7, 2026 14:43
@github-actions github-actions bot added the Waiting On Review Waiting on reviews from the maintainers label Apr 7, 2026
@github-actions github-actions bot added this to the v1.0.0 milestone Apr 7, 2026
@github-actions github-actions bot added C-Tests Issues and PRs related to the tests. C-Runtime Issues and PRs related to Boa's runtime features labels Apr 7, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 7, 2026

Test262 conformance changes

Test result main count PR count difference
Total 53,125 53,125 0
Passed 51,049 51,049 0
Ignored 1,482 1,482 0
Failed 594 594 0
Panics 0 0 0
Conformance 96.09% 96.09% 0.00%

Tested main commit: b895f7411e6a74206c1bec8ba5a02df6791d12a2
Tested PR commit: 70c9ccba671f179e5a97b9be6ff53ef2f772502b
Compare commits: b895f74...70c9ccb

Comment on lines +482 to +544
const INVALID_UTF16_CASES: &[(&[u16], &[u16])] = &[
// Lone high surrogate in the middle
(
&[0x0061, 0x0062, 0xD800, 0x0077, 0x0078],
&[0x0061, 0x0062, 0xFFFD, 0x0077, 0x0078],
),
// Lone high surrogate only
(&[0xD800], &[0xFFFD]),
// Two consecutive high surrogates
(&[0xD800, 0xD800], &[0xFFFD, 0xFFFD]),
// Lone low surrogate in the middle
(
&[0x0061, 0x0062, 0xDFFF, 0x0077, 0x0078],
&[0x0061, 0x0062, 0xFFFD, 0x0077, 0x0078],
),
// Low surrogate followed by high surrogate (wrong order)
(&[0xDFFF, 0xD800], &[0xFFFD, 0xFFFD]),
];

#[test]
fn decoder_utf16le_replaces_unpaired_surrogates() {
for (invalid, replaced) in INVALID_UTF16_CASES {
let mut input_bytes = Vec::with_capacity(invalid.len() * 2);
for &code_unit in *invalid {
input_bytes.extend_from_slice(&code_unit.to_le_bytes());
}

let result = encodings::utf16le::decode(&input_bytes, false);
let expected = JsString::from(*replaced);
assert_eq!(result, expected, "utf16le failed for input {invalid:?}");
}
}

#[test]
fn decoder_utf16be_replaces_unpaired_surrogates() {
for (invalid, replaced) in INVALID_UTF16_CASES {
let mut input_bytes = Vec::with_capacity(invalid.len() * 2);
for &code_unit in *invalid {
input_bytes.extend_from_slice(&code_unit.to_be_bytes());
}

let result = encodings::utf16be::decode(&input_bytes, false);
let expected = JsString::from(*replaced);
assert_eq!(result, expected, "utf16be failed for input {invalid:?}");
}
}

#[test]
fn decoder_utf16le_dangling_byte_produces_replacement() {
// Odd-length input: the last byte is truncated and replaced with U+FFFD
let input: &[u8] = &[0x41, 0x00, 0x42]; // 'A' (LE) + dangling byte
let result = encodings::utf16le::decode(input, false);
let expected = JsString::from(&[0x0041u16, 0xFFFD][..]);
assert_eq!(result, expected);
}

#[test]
fn decoder_utf16be_dangling_byte_produces_replacement() {
let input: &[u8] = &[0x00, 0x41, 0x42]; // 'A' (BE) + dangling byte
let result = encodings::utf16be::decode(input, false);
let expected = JsString::from(&[0x0041u16, 0xFFFD][..]);
assert_eq!(result, expected);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these tests? Might be better to enable the utf16 decoding tests from the wpt suite instead, since those basically compile to the same kind of tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-Runtime Issues and PRs related to Boa's runtime features C-Tests Issues and PRs related to the tests. Waiting On Review Waiting on reviews from the maintainers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

utf-16 TextDecoder is incorrect

2 participants