[Feature] Support KeepUnicodeEscape feature to fix Hive regex parsing… #6578

Open
aoeiuvb wants to merge 1 commit into alibaba:master from aoeiuvb:feature/hive-keep-unicode

aoeiuvb commented Dec 25, 2025

1. Problem Description

In Hive SQL, regular expressions often use Unicode escape sequences to match specific character ranges, for example:

-- Matching Chinese characters
SELECT * FROM table WHERE col REGEXP '[\u4e00-\u9fa5]+';

Current Issue:
When formatting this SQL, the parser currently converts Unicode escapes (starting with \u) into actual characters (e.g., converting \u4e00 to 一).
However, SQL formatting should only beautify the layout (such as newlines and indentation) and should not alter the original content or literal values of the SQL. Converting these escapes can break the semantics of regular expressions or cause encoding issues.

2. Changes

I have introduced a new feature SQLParserFeature.KeepUnicodeEscape to address this issue.

  • New Feature: Added KeepUnicodeEscape.
  • Logic: When this feature is enabled, the Lexer will not decode Unicode sequences starting with \u into specific characters (overriding the behavior of SupportUnicodeCodePoint).
  • Result: The escape sequences are preserved as-is (raw string), ensuring the SQL content remains unchanged during formatting or parsing.
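The gating described above can be sketched as a small self-contained stand-in; this is not the actual lexer code. The class `UnicodeEscapeSketch`, the method `unescape`, and the boolean `keepUnicodeEscape` are hypothetical names used only for illustration. The boolean models the `KeepUnicodeEscape` flag, and the handling of every other escape is deliberately simplified.

```java
// Hypothetical stand-in for the lexer's escape handling; not Druid's real implementation.
final class UnicodeEscapeSketch {
    /**
     * Processes backslash escapes in the body of a string literal.
     * keepUnicodeEscape models the proposed KeepUnicodeEscape feature flag.
     */
    static String unescape(String body, boolean keepUnicodeEscape) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            char ch = body.charAt(i);
            if (ch != '\\' || i + 1 >= body.length()) {
                out.append(ch);
                i++;
                continue;
            }
            char next = body.charAt(i + 1);
            if (next == 'u' && i + 6 <= body.length() && isHex4(body, i + 2)) {
                if (keepUnicodeEscape) {
                    // Flag on: copy the six-character escape verbatim.
                    out.append(body, i, i + 6);
                } else {
                    // Flag off: decode the four hex digits into one character.
                    out.append((char) Integer.parseInt(body.substring(i + 2, i + 6), 16));
                }
                i += 6;
            } else {
                // Simplification for this sketch: drop the backslash, keep the character.
                out.append(next);
                i += 2;
            }
        }
        return out.toString();
    }

    // True if the four characters starting at 'from' are hex digits (caller checks bounds).
    private static boolean isHex4(String s, int from) {
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, calling `unescape` on `[\u4e00-\u9fa5]+` with the flag set returns the input unchanged, while calling it with the flag cleared collapses each escape into a single character.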

3. Verification

I have added a new unit test class HiveRegContainUnicodeTest to verify the fix.

  • Test Case 1 (Feature Disabled): Verifies that without KeepUnicodeEscape, the parser follows the default SupportUnicodeCodePoint behavior (legacy behavior).
  • Test Case 2 (Feature Enabled): Verifies that when KeepUnicodeEscape is enabled, the Unicode escapes (e.g., \u4e00) are not decoded and are output exactly as they appear in the original input string.

wenshao requested a review from lingo-xp on February 8, 2026, 12:14

wenshao commented Mar 10, 2026

The escape handling changes should be conditional. The fix should look something like:

  switch (ch) {
      case 'u':
          if ((features & SQLParserFeature.KeepUnicodeEscape.mask) != 0) {
              putChar('\\');
              putChar('u');
          } else if ((features & SQLParserFeature.SupportUnicodeCodePoint.mask) != 0) {
              // existing unicode decode logic
          }
          break;
      // KEEP all existing cases: '0', '\'', '"', 'b', 'n', 'r', 't', '\\', 'Z', '%', '_'
      case '0':
          putChar('\0');
          break;
      // ... etc
      default:
          putChar(ch);
          break;
  }

The HiveOutputVisitor change should similarly be conditional — only change backslash handling when KeepUnicodeEscape is active.


wenshao commented Mar 10, 2026

Unconditional removal of all escape sequence handling (HiveLexer.java)

The PR removes the case branches for \0, \', \", \b, \n, \r, \t, \\, \Z, \%, \_ and replaces them with a blanket default that does putChar('\\'); putChar(ch). This change is not gated behind KeepUnicodeEscape, so it applies to all Hive SQL parsing unconditionally.

This means:

  • '\n' in a Hive string will no longer be decoded to a newline — it stays as the literal two characters \n
  • '\t' stays as \t instead of a tab
  • '\0' stays as \0 instead of a null byte
  • All other standard escapes are similarly broken

This is a breaking behavioral change that goes far beyond the stated goal of preserving Unicode escapes. Strings containing standard escape sequences will parse differently than before.
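To make the difference concrete, here is a minimal sketch that contrasts the two behaviors for the character following a backslash. The class `EscapeRegressionSketch` and both method names are hypothetical, and only a subset of the escapes is modeled; it is an illustration of the concern, not code from the PR.

```java
// Hypothetical illustration of the regression; not code taken from HiveLexer.java.
final class EscapeRegressionSketch {
    // Legacy-style decoding for a few standard escapes (subset, for illustration).
    static String decodeLegacy(char ch) {
        switch (ch) {
            case 'n': return "\n";   // newline
            case 't': return "\t";   // tab
            case '0': return "\0";   // NUL character
            default:  return String.valueOf(ch);
        }
    }

    // Blanket default described above: re-emit the backslash followed by the character.
    static String decodeBlanket(char ch) {
        return "\\" + ch;
    }
}
```

With the legacy-style path, `n` after a backslash yields a single newline character; with the blanket default it yields two characters, a backslash followed by `n`.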
