[Feature] Support KeepUnicodeEscape feature to fix Hive regex parsing… by aoeiuvb · Pull Request #6578 · alibaba/druid

aoeiuvb · 2025-12-25T03:23:25Z

1. Problem Description

In Hive SQL, regular expressions often use Unicode escape sequences to match specific character ranges, for example:

-- Matching Chinese characters
SELECT * FROM table WHERE col REGEXP '[\u4e00-\u9fa5]+';

Current Issue:
When formatting this SQL, the parser decodes Unicode escapes (sequences starting with \u) into the characters they denote (e.g., \u4e00 becomes 一).
However, SQL formatting should only change layout (such as newlines and indentation) and must not alter the original content or literal values of the SQL. Decoding these escapes changes the text of the regular expression, which can break its semantics or cause encoding issues.

2. Changes

I have introduced a new feature SQLParserFeature.KeepUnicodeEscape to address this issue.

New Feature: Added KeepUnicodeEscape.
Logic: When this feature is enabled, the Lexer will not decode Unicode sequences starting with \u into specific characters (overriding the behavior of SupportUnicodeCodePoint).
Result: The escape sequences are preserved as-is (raw string), ensuring the SQL content remains unchanged during formatting or parsing.
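
To make the difference between the two modes concrete, here is a minimal, self-contained sketch. It does not use Druid: the class UnicodeEscapeSketch, the method scan, and the boolean keepUnicodeEscape are hypothetical stand-ins for the lexer and the proposed feature flag, and the real HiveLexer may handle edge cases differently.

```java
// Hypothetical, self-contained sketch; this is not Druid's lexer code.
final class UnicodeEscapeSketch {

    /** Returns true if the four characters starting at 'from' are hex digits. */
    static boolean isHex4(String s, int from) {
        if (from + 4 > s.length()) {
            return false;
        }
        for (int k = 0; k < 4; k++) {
            if (Character.digit(s.charAt(from + k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    /**
     * Scans the body of a string literal. A backslash-u followed by four hex
     * digits is copied through unchanged when keepUnicodeEscape is true, and
     * decoded to the character it denotes otherwise. Everything else is copied.
     */
    static String scan(String body, boolean keepUnicodeEscape) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            char ch = body.charAt(i);
            boolean unicodeEscape = ch == '\\'
                    && i + 1 < body.length()
                    && body.charAt(i + 1) == 'u'
                    && isHex4(body, i + 2);
            if (unicodeEscape) {
                if (keepUnicodeEscape) {
                    out.append(body, i, i + 6); // raw six-character escape
                } else {
                    out.append((char) Integer.parseInt(body.substring(i + 2, i + 6), 16));
                }
                i += 6;
            } else {
                out.append(ch);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String regex = "[\\u4e00-\\u9fa5]+";
        System.out.println(scan(regex, true));  // escapes copied through as typed
        System.out.println(scan(regex, false)); // escapes decoded to CJK characters
    }
}
```

With keepUnicodeEscape set, the output is identical to the input, which is the property a formatter needs; without it, the character class is rewritten to the decoded range.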

3. Verification

I have added a new unit test class HiveRegContainUnicodeTest to verify the fix.

Test Case 1 (Feature Disabled): Verifies that without KeepUnicodeEscape, the parser follows the default SupportUnicodeCodePoint behavior (legacy behavior).
Test Case 2 (Feature Enabled): Verifies that when KeepUnicodeEscape is enabled, the Unicode escapes (e.g., \u4e00) are not decoded and are output exactly as they appear in the original input string.


wenshao · 2026-03-10T16:43:48Z

The escape handling changes should be conditional. The fix should look something like:

  switch (ch) {
      case 'u':
          if ((features & SQLParserFeature.KeepUnicodeEscape.mask) != 0) {
              putChar('\\');
              putChar('u');
          } else if ((features & SQLParserFeature.SupportUnicodeCodePoint.mask) != 0) {
              // existing unicode decode logic
          }
          break;
      // KEEP all existing cases: '0', '\'', '"', 'b', 'n', 'r', 't', '\\', 'Z', '%', '_'
      case '0':
          putChar('\0');
          break;
      // ... etc
      default:
          putChar(ch);
          break;
  }

The HiveOutputVisitor change should similarly be conditional: only change backslash handling when KeepUnicodeEscape is active.
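
For readers who want to run the idea, the gated switch can be sketched outside Druid as follows. Everything here is an assumption made for illustration: GatedEscapeSketch, unescape, and the two int masks stand in for the lexer and for SQLParserFeature.KeepUnicodeEscape.mask and SQLParserFeature.SupportUnicodeCodePoint.mask, and StringBuilder.append replaces putChar. The \Z, \% and \_ cases are left out because their exact handling is not shown in this thread; in this sketch they fall through to the default branch.

```java
// Hypothetical, self-contained sketch of the gated escape handling;
// the masks below are stand-ins, not Druid's SQLParserFeature values.
final class GatedEscapeSketch {
    static final int KEEP_UNICODE_ESCAPE = 1;
    static final int SUPPORT_UNICODE_CODE_POINT = 1 << 1;

    /** Returns true if the four characters starting at 'from' are hex digits. */
    static boolean isHex4(String s, int from) {
        if (from + 4 > s.length()) {
            return false;
        }
        for (int k = 0; k < 4; k++) {
            if (Character.digit(s.charAt(from + k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Decodes backslash escapes in a literal body according to the feature bits. */
    static String unescape(String body, int features) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char ch = body.charAt(i);
            if (ch != '\\' || i + 1 == body.length()) {
                out.append(ch); // ordinary character or trailing backslash
                continue;
            }
            ch = body.charAt(++i); // character after the backslash
            switch (ch) {
                case 'u':
                    if ((features & KEEP_UNICODE_ESCAPE) != 0) {
                        // Keep the escape raw; the hex digits follow as plain text.
                        out.append('\\').append('u');
                    } else if ((features & SUPPORT_UNICODE_CODE_POINT) != 0
                            && isHex4(body, i + 1)) {
                        out.append((char) Integer.parseInt(body.substring(i + 1, i + 5), 16));
                        i += 4;
                    } else {
                        out.append(ch);
                    }
                    break;
                // Standard escapes stay decoded regardless of the Unicode flags.
                case '0':
                    out.append('\0');
                    break;
                case 'b':
                    out.append('\b');
                    break;
                case 'n':
                    out.append('\n');
                    break;
                case 'r':
                    out.append('\r');
                    break;
                case 't':
                    out.append('\t');
                    break;
                default:
                    // Covers quotes and the backslash itself: emit the character.
                    out.append(ch);
                    break;
            }
        }
        return out.toString();
    }
}
```

The point of the structure is that the KeepUnicodeEscape check lives only inside the 'u' case, so a literal such as 'a\nb' decodes to the same three characters whether or not the flag is set.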

wenshao · 2026-03-10T16:44:38Z

Unconditional removal of all escape sequence handling (HiveLexer.java)

The PR removes the case branches for \0, \', \", \b, \n, \r, \t, \\, \Z, \%, \_ and replaces them with a blanket default that does putChar('\\'); putChar(ch). This change is not gated behind KeepUnicodeEscape: it applies to all Hive SQL parsing unconditionally.

This means:

'\n' in a Hive string is no longer decoded to a newline; it stays as the two literal characters \n
'\t' stays as \t instead of a tab
'\0' stays as \0 instead of a NUL character
All other standard escapes are likewise left undecoded

This is a breaking behavioral change that goes far beyond the stated goal of preserving Unicode escapes. Strings containing standard escape sequences will parse differently than before.

Commit 43915f4: [Feature] Support KeepUnicodeEscape feature to fix Hive regex parsing… issues

wenshao requested a review from lingo-xp February 8, 2026 12:14

aoeiuvb wants to merge 1 commit into alibaba:master from aoeiuvb:feature/hive-keep-unicode
