Fix HtmlUtils unescape for supplementary chars #35477

juntae6942 · 2025-09-13T14:24:07Z

Currently, HtmlUtils.htmlUnescape() does not correctly handle numeric character references for Unicode supplementary characters (e.g., emojis).

For example, an entity like &#128512; (😀) is incorrectly converted to a garbled character corresponding to U+F600 because the code point is truncated to 16 bits.

Steps to Reproduce

import org.springframework.web.util.HtmlUtils;

// The class name is arbitrary; any class with a main method works.
public class HtmlUnescapeRepro {

    public static void main(String[] args) {
        // Test character: 'Grinning Face' emoji (😀)
        // Unicode code point: U+1F600
        // Hexadecimal: 1F600
        // Decimal: 128512

        // 1. Input value as a decimal HTML entity
        String inputDecimal = "&#128512;";

        // 2. Input value as a hexadecimal HTML entity
        String inputHex = "&#x1F600;";

        // 3. The expected result after correct conversion
        String expectedOutput = "😀";

        System.out.println("--- Decimal HTML Entity Test ---");
        System.out.println("Input: " + inputDecimal);

        // Unescape the decimal entity
        String actualOutputDecimal = HtmlUtils.htmlUnescape(inputDecimal);

        System.out.println("Actual Output: " + actualOutputDecimal);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputDecimal));

        System.out.println("\n--- Hexadecimal HTML Entity Test ---");
        System.out.println("Input: " + inputHex);

        // Unescape the hexadecimal entity
        String actualOutputHex = HtmlUtils.htmlUnescape(inputHex);

        System.out.println("Actual Output: " + actualOutputHex);
        System.out.println("Expected Output: " + expectedOutput);
        System.out.println("Result matches expected: " + expectedOutput.equals(actualOutputHex));
    }
}

Cause

The root cause was a narrowing cast to a 16-bit char in HtmlCharacterEntityDecoder. The cast keeps only the low 16 bits of the value, so any Unicode code point greater than U+FFFF loses its most significant bits (U+1F600 becomes U+F600).
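
The truncation can be reproduced without Spring. The following is a minimal sketch, not the actual HtmlCharacterEntityDecoder code: the class and method names are hypothetical and only mimic the narrowing cast described above.

```java
public class CharCastTruncation {

    // Hypothetical helper: mimics the narrowing cast described above.
    // Casting an int to char keeps only the low 16 bits of the value.
    static String viaCharCast(int codePoint) {
        return String.valueOf((char) codePoint);
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // U+1F600, outside the Basic Multilingual Plane

        String truncated = viaCharCast(grinningFace);

        // The leading 0x1 is discarded, leaving a single char with value 0xF600.
        System.out.printf("U+%04X%n", (int) truncated.charAt(0)); // prints U+F600
        System.out.println(truncated.length());                   // prints 1
    }
}
```

U+F600 lies in the Private Use Area, which is why the output shows up as a garbled or missing glyph rather than the emoji.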

Solution

This PR resolves the issue by replacing the direct (char) cast with a call to StringBuilder.appendCodePoint().

The appendCodePoint() method is designed to handle the full range of Unicode code points. It correctly converts supplementary characters into a two-character surrogate pair, ensuring that all characters are unescaped without data loss. A corresponding unit test has been added to verify this fix.
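
As an illustration of the behavior described above, here is a small standalone sketch using only the JDK. The class and method names are hypothetical and are not taken from the PR's diff.

```java
public class AppendCodePointSketch {

    // Hypothetical helper: appends a code point the way the fix is described,
    // letting StringBuilder emit a surrogate pair when one is needed.
    static String viaAppendCodePoint(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        String s = viaAppendCodePoint(0x1F600);

        System.out.println(s.length());                  // prints 2 (two UTF-16 code units)
        System.out.println(s.codePointAt(0) == 0x1F600); // prints true
        System.out.printf("%04X %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));   // prints D83D DE00
    }
}
```

Note that StringBuilder.appendCodePoint() throws IllegalArgumentException for values that are not valid Unicode code points (for example, anything above 0x10FFFF). How the decoder treats such out-of-range references is not stated in this description.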

Signed-off-by: potato <[email protected]>

spring-projects-issues added the status: waiting-for-triage label Sep 13, 2025

juntae6942 mentioned this pull request Sep 13, 2025

HtmlUtils.htmlUnescape() incorrect for numeric character references >= &#x10000; / &#65536; #35426

Open

Fix HtmlUtils unescape for supplementary chars

0b60800

Signed-off-by: potato <[email protected]>

juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from bc095df to 369ffe4 on September 14, 2025 04:41

Test: Add case for basic HTML entities in HtmlUtils

a6efa2a

Signed-off-by: potato <[email protected]>

juntae6942 force-pushed the fix/spring-framework-35426-htmlunescape-unicode branch from 369ffe4 to a6efa2a on September 14, 2025 04:48

Style: Format code to align with project conventions

7378478

Signed-off-by: potato <[email protected]>
