[Fix] parse: interpret astral numeric entities via `String.fromCodePoint` #563
chatman-media wants to merge 2 commits into
|
Unfortunately, |
7b0e2a2 to
a49bcb6
Compare
|
Good point — reworked to avoid |
…Point`

`interpretNumericEntities` used `String.fromCharCode`, which only handles UTF-16 code units (0 - 0xFFFF) and silently truncates anything larger to 16 bits. Numeric character references for astral code points - emoji and many CJK extension characters, e.g. `&#128512;` (U+1F600) - were turned into the wrong BMP character instead of the intended glyph.

Use `String.fromCodePoint`, which builds the correct surrogate pair for the full Unicode range. `fromCodePoint` throws a `RangeError` for values above the Unicode maximum (U+10FFFF), so guard against that and leave such out-of-range entities as the literal text, preserving the previous non-throwing behavior.
a49bcb6 to
a95074d
Compare
|
Agreed — pulling in a |
|
I still don't see those changes - also, it's looking increasingly like an LLM is being used for all of this content. Can you confirm you're a human, and avoid using an LLM to generate the entirety of your contribution, including prose? |
Bug
With `interpretNumericEntities: true` (in `iso-8859-1` charset), a numeric character reference for an astral code point (anything above U+FFFF, i.e. emoji and many CJK-extension characters) is decoded into the wrong character.
Cause
`interpretNumericEntities` uses `String.fromCharCode`. `String.fromCharCode` operates on UTF-16 code units (0 to 0xFFFF) and truncates larger values to 16 bits. For `&#128512;`, `128512 & 0xFFFF === 0xF600`, so it yields U+F600 (a lone Private-Use-Area character) rather than the surrogate pair for U+1F600. BMP references (e.g. the existing `&#9786;` → ☺, and the `&#10003;` checkmark used by the charset sentinel) happen to be unaffected because they already fit in 16 bits.
Fix
Use `String.fromCodePoint`, which produces the correct surrogate pair across the full Unicode range. `fromCodePoint` throws a `RangeError` for values above the Unicode maximum (U+10FFFF), which `fromCharCode` never did, so guard against that and leave out-of-range entities as the literal text instead of throwing. For valid BMP references the output is byte-for-byte identical to before.
Tests
Added a case under the existing `interpretNumericEntities` tests in `test/parse.js`:
- `&#128512;` (U+1F600) round-trips to 😀.
- An out-of-range reference is left untouched and does not throw.
Verification:
- `npx tape test/parse.js`: 404 passing (was 402 + 2 failing without the fix; the two new assertions fail on `master` and pass with the change).
- `npm run tests-only`: 939 passing.
- `npm run lint`: 0 errors (pre-existing warnings only, none on the changed lines).

`structuredClone`/the WHATWG URL encoder and browsers all resolve `&#128512;` to U+1F600; this brings `qs` in line.
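
For reference, the truncation and the guarded decode described in this PR can be sketched in isolation. This is a minimal illustration, not the code from the diff: `decodeNumericEntities` is a hypothetical helper name, the regular expression is an assumption about how decimal references are matched, and `&#1114112;` (U+110000) is just an example value chosen as the first code point past the Unicode maximum.

```javascript
// Hypothetical helper for illustration only; not the implementation in this PR.
// Decodes decimal numeric character references such as "&#128512;".
function decodeNumericEntities(str) {
    return str.replace(/&#(\d+);/g, function (match, digits) {
        var codePoint = parseInt(digits, 10);
        // String.fromCodePoint throws a RangeError above U+10FFFF,
        // so out-of-range references are returned as literal text.
        if (codePoint > 0x10FFFF) {
            return match;
        }
        return String.fromCodePoint(codePoint);
    });
}

// Old behaviour: fromCharCode keeps only the low 16 bits of 128512 (0x1F600).
var truncated = String.fromCharCode(128512);
console.log(truncated.charCodeAt(0).toString(16)); // f600

// Astral reference becomes the surrogate pair for U+1F600.
console.log(decodeNumericEntities('&#128512;') === '\uD83D\uDE00'); // true

// BMP reference is unchanged from the fromCharCode result.
console.log(decodeNumericEntities('&#9786;') === String.fromCharCode(9786)); // true

// Out-of-range reference is left as-is and does not throw.
console.log(decodeNumericEntities('&#1114112;')); // &#1114112;
```

Checking the upper bound before calling `String.fromCodePoint` keeps the function non-throwing without a `try`/`catch`; whether the actual change uses a comparison or a `try`/`catch` is not shown in this description.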