
[Fix] parse: interpret astral numeric entities via String.fromCodePoint#563

Draft
chatman-media wants to merge 2 commits into ljharb:main from chatman-media:fix/numeric-entities-astral-codepoints


@chatman-media


Bug

With interpretNumericEntities: true (in iso-8859-1 charset), a numeric character reference for an astral code point — anything above U+FFFF, i.e. emoji and many CJK-extension characters — is decoded into the wrong character:

qs.parse('a=%26%23128512%3B', { charset: 'iso-8859-1', interpretNumericEntities: true });
// %26%23128512%3B === encodeURIComponent('&#128512;'), the numeric reference for 😀 (U+1F600)

// actual:   { a: '\uF600' }   ← a single, wrong BMP char (U+F600)
// expected: { a: '😀' }       ← U+1F600

Cause

interpretNumericEntities uses String.fromCharCode:

return str.replace(/&#(\d+);/g, function ($0, numberStr) {
    return String.fromCharCode(parseInt(numberStr, 10));
});

String.fromCharCode operates on UTF-16 code units (0 to 0xFFFF) and truncates larger values to 16 bits. For &#128512;, 128512 & 0xFFFF === 0xF600, so it yields '\uF600' (a lone Private Use Area character) rather than the surrogate pair for U+1F600. BMP references (e.g. the existing &#9786; → ☺, and the &#10003; checkmark used by the charset sentinel) are unaffected because they already fit in 16 bits.
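The truncation is easy to reproduce in isolation. The snippet below is a minimal sketch, not code from the PR; it uses only the built-in String.fromCharCode and checks the result against the 16-bit mask.

```javascript
// 128512 is the decimal form of U+1F600.
var codePoint = 128512;

// fromCharCode keeps only the low 16 bits of its argument.
var truncated = String.fromCharCode(codePoint);

console.log(truncated === '\uF600');          // true
console.log((codePoint & 0xFFFF) === 0xF600); // true
console.log(truncated.length);                // 1 (a single code unit, not a surrogate pair)
```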

Fix

Use String.fromCodePoint, which produces the correct surrogate pair across the full Unicode range. fromCodePoint throws a RangeError for values above the Unicode maximum (U+10FFFF), which fromCharCode never did, so guard against that and leave out-of-range entities (any reference above 1114111, i.e. 0x10FFFF) as the literal text instead of throwing. For valid BMP references the output is byte-for-byte identical to before.

var codePoint = parseInt(numberStr, 10);
return codePoint > 0x10FFFF ? $0 : String.fromCodePoint(codePoint);
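For context, here is how that guard might look inside a complete replacement callback. This is a sketch only: the function name decodeWithFromCodePoint is hypothetical and not part of qs, and it assumes an engine that provides String.fromCodePoint.

```javascript
// Hypothetical standalone wrapper around the two-line fix above.
function decodeWithFromCodePoint(str) {
    return str.replace(/&#(\d+);/g, function ($0, numberStr) {
        var codePoint = parseInt(numberStr, 10);
        // Above the Unicode maximum: return the entity text exactly as written.
        return codePoint > 0x10FFFF ? $0 : String.fromCodePoint(codePoint);
    });
}

console.log(decodeWithFromCodePoint('&#128512;') === '\uD83D\uDE00'); // true (U+1F600)
console.log(decodeWithFromCodePoint('&#9786;') === '\u263A');         // true (BMP unchanged)
console.log(decodeWithFromCodePoint('&#1114112;'));                   // &#1114112;
```

The last call uses 1114112 (0x110000), the first value past the Unicode range, to show that the guard returns the literal entity rather than throwing.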

Tests

Added a case under the existing interpretNumericEntities tests in test/parse.js:

  • &#128512; (U+1F600) decodes to 😀.
  • An out-of-range reference (one above U+10FFFF) is left untouched and does not throw.

Verification:

  • npx tape test/parse.js: 404 passing (without the fix, 402 passing and 2 failing; the two new assertions fail on main and pass with the change).
  • npm run tests-only — 939 passing.
  • npm run lint — 0 errors (pre-existing warnings only, none on the changed lines).

Browsers' HTML parsers resolve &#128512; to U+1F600; this brings qs in line.

@ljharb

ljharb commented Jun 23, 2026

Owner

Unfortunately, String.fromCodePoint is not available on every engine we support - namely, node 0.x, and some old browsers. Adding https://www.npmjs.com/package/string.fromcodepoint seems like a pretty big cost, even if I extracted https://github.com/mathiasbynens/String.fromCodePoint/blob/main/implementation.js out to its own package.

@ljharb ljharb marked this pull request as draft June 23, 2026 06:17
@chatman-media chatman-media force-pushed the fix/numeric-entities-astral-codepoints branch from 7b0e2a2 to a49bcb6 Compare June 23, 2026 07:40
@chatman-media

Author

Good point, reworked to avoid String.fromCodePoint entirely. It now builds the surrogate pair by hand with String.fromCharCode: for code points above 0xFFFF it first subtracts 0x10000 (c = cp - 0x10000), then uses high = 0xD800 + (c >> 10) and low = 0xDC00 + (c & 0x3FF); anything else goes through plain fromCharCode. This mirrors the existing surrogate math in lib/utils.js. No new dependency, and it works on node 0.x and older browsers. Out-of-range references (> U+10FFFF) are still left as the literal entity rather than throwing, and BMP output is byte-for-byte unchanged. Tests updated to not rely on fromCodePoint either; full suite green locally (939 passing, lint clean).
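The branch contents are not visible in this thread, so the following is only a sketch of what such an ES5-only helper could look like. The names codePointToString and decodeWithSurrogates are hypothetical. Note that the sketch subtracts 0x10000 before shifting, which standard UTF-16 encoding requires for the high surrogate.

```javascript
// Hypothetical helper: converts one code point to a string without String.fromCodePoint.
function codePointToString(cp) {
    if (cp <= 0xFFFF) {
        return String.fromCharCode(cp);
    }
    // Astral code points: remove the 0x10000 offset, then split the
    // remaining 20 bits into two 10-bit halves.
    var offset = cp - 0x10000;
    var high = 0xD800 + (offset >> 10);
    var low = 0xDC00 + (offset & 0x3FF);
    return String.fromCharCode(high, low);
}

// Hypothetical wrapper: leaves references above U+10FFFF as literal text.
function decodeWithSurrogates(str) {
    return str.replace(/&#(\d+);/g, function ($0, numberStr) {
        var cp = parseInt(numberStr, 10);
        return cp > 0x10FFFF ? $0 : codePointToString(cp);
    });
}

console.log(codePointToString(0x1F600) === '\uD83D\uDE00');    // true
console.log(decodeWithSurrogates('a=&#128512;&b=&#1114112;')); // a=😀&b=&#1114112;
```

Without the subtraction, 0x1F600 >> 10 is 0x7D and the high surrogate would come out as 0xD87D instead of 0xD83D, so the result would be a different character.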

…Point`

`interpretNumericEntities` used `String.fromCharCode`, which only handles
UTF-16 code units (0 - 0xFFFF) and silently truncates anything larger to
16 bits. Numeric character references for astral code points - emoji and
many CJK extension characters, e.g. `&#128512;` (U+1F600) - were turned
into the wrong BMP character instead of the intended glyph.

Use `String.fromCodePoint`, which builds the correct surrogate pair for
the full Unicode range. `fromCodePoint` throws a `RangeError` for values
above the Unicode maximum (U+10FFFF), so guard against that and leave
such out-of-range entities as the literal text, preserving the previous
non-throwing behavior.
@chatman-media chatman-media force-pushed the fix/numeric-entities-astral-codepoints branch from a49bcb6 to a95074d Compare June 24, 2026 00:05
@chatman-media

Author

Agreed, pulling in a String.fromCodePoint polyfill/dependency would be too much for this. I've reworked it to avoid String.fromCodePoint entirely: there's now a tiny local fromCodePoint helper that does the surrogate-pair math by hand with String.fromCharCode (0xD800 + ((c - 0x10000) >> 10), 0xDC00 + (c & 0x3FF)), mirroring the existing String.fromCharCode usage already in lib/parse.js. No new dependency, ES5-only, works on node 0.x. Out-of-range code points (> 0x10FFFF) are left as the original literal entity rather than coerced. Rebased onto latest main.

@ljharb

ljharb commented Jun 24, 2026

Owner

I still don't see those changes - also, it's looking increasingly like an LLM is being used for all of this content. Can you confirm you're a human, and avoid using an LLM to generate the entirety of your contribution, including prose?
