@@ -19,6 +19,124 @@ under the License.
19 19
20 20 # Regular Expressions
21 21
22- Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's
23- regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but
24- this can be overridden by setting ` spark.comet.expression.regexp.allowIncompatible=true ` .
22+ Comet evaluates Spark regular-expression expressions (` rlike ` , ` regexp_replace ` , ` split ` ,
23+ ` regexp_extract ` , ` regexp_extract_all ` , ` regexp_instr ` ) two ways:
24+
25+ - **Codegen dispatcher (default)**: Spark's own `doGenCode` for the expression runs inside Comet's
26+ Arrow-direct codegen dispatcher (the same dispatcher used by Comet's `ScalaUDF` codegen path).
27+ This is 100% compatible with Spark, at the cost of one JNI round-trip per batch. It is enabled by
28+ default (`spark.comet.exec.scalaUDF.codegen.enabled=true`); if the dispatcher is disabled, regex
29+ expressions fall back to Spark.
30+ - **Native (Rust) engine**: the Rust [`regex`] crate, run natively with no JNI overhead. It is
31+ faster but has different semantics from Java regex (see below), so it is **opt-in per expression**
32+ via that expression's `allowIncompatible` flag. `rlike`, `regexp_replace`, and `split` have a
33+ native implementation; `regexp_extract`, `regexp_extract_all`, and `regexp_instr` do not and
34+ always run through the codegen dispatcher.
35+
36+ | SQL | Native (Rust) opt-in config |
37+ | ---------------- | -------------------------------------------------------- |
38+ | ` rlike ` | ` spark.comet.expression.RLike.allowIncompatible ` |
39+ | ` regexp_replace ` | ` spark.comet.expression.RegExpReplace.allowIncompatible ` |
40+ | ` split ` | ` spark.comet.expression.StringSplit.allowIncompatible ` |
41+
42+ When the native path is opted in but a case has no native implementation (for example a non-scalar
43+ ` rlike ` pattern, or ` regexp_replace ` with a non-1 offset), Comet routes that case through the
44+ codegen dispatcher.
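As a rough illustration of the opt-in settings above, the following Java sketch maps the three SQL functions that have a native implementation to their `allowIncompatible` keys and renders them as `--conf` arguments for `spark-submit`. The class and method names (`NativeRegexOptIn`, `optInKey`, `confArgs`) are hypothetical helpers invented for this example and are not part of Comet; only the config keys are taken from the table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Comet: only the config keys come from the table above.
final class NativeRegexOptIn {
    // SQL function name -> expression class name used inside the config key.
    private static final Map<String, String> EXPRESSION_CLASS = new LinkedHashMap<>();
    static {
        EXPRESSION_CLASS.put("rlike", "RLike");
        EXPRESSION_CLASS.put("regexp_replace", "RegExpReplace");
        EXPRESSION_CLASS.put("split", "StringSplit");
    }

    // Returns the opt-in key for one SQL function; rejects functions without a native path.
    static String optInKey(String sqlFunction) {
        String cls = EXPRESSION_CLASS.get(sqlFunction);
        if (cls == null) {
            throw new IllegalArgumentException(sqlFunction + " has no native (Rust) implementation");
        }
        return "spark.comet.expression." + cls + ".allowIncompatible";
    }

    // Builds spark-submit arguments that opt the given functions in to the Rust engine.
    static List<String> confArgs(String... sqlFunctions) {
        List<String> args = new ArrayList<>();
        for (String fn : sqlFunctions) {
            args.add("--conf");
            args.add(optInKey(fn) + "=true");
        }
        return args;
    }

    public static void main(String[] argv) {
        System.out.println(String.join(" ", confArgs("rlike", "split")));
    }
}
```

Asking this helper for `regexp_extract`, `regexp_extract_all`, or `regexp_instr` throws, mirroring the statement that those expressions always run through the codegen dispatcher.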
45+
46+ ## Disabling Comet for individual regex expressions
47+
48+ Each regex expression has a per-class ` spark.comet.expression.<ClassName>.enabled ` flag (default
49+ ` true ` ) that disables Comet's serde for that expression and forces a Spark fallback. This is
50+ useful for narrowing a regression or comparing performance on a single operator without changing
51+ the engine selector:
52+
53+ | Expression | Config |
54+ | -------------------- | ------------------------------------------------------- |
55+ | ` rlike ` | ` spark.comet.expression.RLike.enabled=false ` |
56+ | ` regexp_extract ` | ` spark.comet.expression.RegExpExtract.enabled=false ` |
57+ | ` regexp_extract_all ` | ` spark.comet.expression.RegExpExtractAll.enabled=false ` |
58+ | ` regexp_instr ` | ` spark.comet.expression.RegExpInStr.enabled=false ` |
59+ | ` regexp_replace ` | ` spark.comet.expression.RegExpReplace.enabled=false ` |
60+ | ` split ` | ` spark.comet.expression.StringSplit.enabled=false ` |
61+
62+ ## Choosing an engine
63+
64+ | | Rust engine | Codegen dispatcher (default) |
65+ | --- | --- | --- |
66+ | **Compatibility** | Differs from Java regex (see below) | 100% compatible with Spark |
67+ | **Feature coverage** | `rlike`, `regexp_replace`, `split` natively; `regexp_extract`, `regexp_extract_all`, `regexp_instr` via fallthrough | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) |
68+ | **Performance** | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
69+ | **Pattern support** | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |
70+
71+ The **Rust engine** is faster but cannot match Java regex semantics for every pattern. Opting in per
72+ expression (for example `spark.comet.expression.RLike.allowIncompatible=true`) declares acceptance
73+ of those differences.
74+
75+ The **codegen dispatcher** is the default and is controlled by `spark.comet.exec.scalaUDF.codegen.enabled`
76+ (default `true`). Setting it to `false` disables the dispatcher globally, and regex expressions then fall back to Spark.
77+
78+ ## Why the engines differ
79+
80+ Java's ` java.util.regex ` is a backtracking engine in the Perl/PCRE family. It supports the full range of
81+ features that style of engine provides, including some whose worst-case running time grows exponentially with
82+ the input.
83+
84+ Rust's [ ` regex ` ] crate is a finite-automaton engine in the [ RE2] family. It deliberately omits features that
85+ cannot be implemented with a guarantee of linear-time matching. In exchange, every pattern it does accept runs
86+ in time linear in the size of the input. This is the same trade-off RE2, Go's ` regexp ` , and several other
87+ engines make.
88+
89+ The practical consequence is that Java accepts a strictly larger set of patterns than the Rust engine, and
90+ several constructs that look the same in source have different semantics on the two sides.
91+
92+ ## Features supported by Java but not by the Rust engine
93+
94+ Patterns that use any of the following will not compile in Comet's Rust engine and must run through the codegen
95+ dispatcher or fall back to Spark:
96+
97+ - **Backreferences** such as `\1`, `\2`, or `\k<name>`. The Rust engine has no backtracking and cannot match
98+ a previously captured group.
99+ - **Lookaround**, including lookahead (`(?=...)`, `(?!...)`) and lookbehind (`(?<=...)`, `(?<!...)`).
100+ - **Atomic groups** (`(?>...)`).
101+ - **Possessive quantifiers** (`*+`, `++`, `?+`, `{n,m}+`). Rust supports greedy and lazy quantifiers but not
102+ possessive.
103+ - **Embedded code, conditionals, and recursion** such as `(?(cond)yes|no)` or `(?R)`. Rust accepts none of
104+ these.
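To make a few of the constructs in this list concrete, here is a minimal Java sketch that exercises a backreference, a lookahead, a possessive quantifier, and an atomic group with `java.util.regex`. It demonstrates the Java side only; the Rust engine is described above as rejecting these patterns at compile time. The class and method names are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Java-side behavior only; names are illustrative and not part of Comet or Spark.
final class JavaOnlyConstructs {
    // Backreference: \1 must repeat whatever group 1 captured.
    static boolean hasDoubledLetter(String s) {
        return Pattern.compile("(\\w)\\1").matcher(s).find();
    }

    // Lookahead: match "foo" only when "bar" follows, without consuming "bar".
    static String fooBeforeBar(String s) {
        Matcher m = Pattern.compile("foo(?=bar)").matcher(s);
        return m.find() ? m.group() : null;
    }

    // Possessive quantifier: a*+ never gives back what it consumed,
    // so the trailing 'a' has nothing left to match.
    static boolean possessiveMatches(String s) {
        return Pattern.compile("a*+a").matcher(s).matches();
    }

    // Atomic group: once (?>a|ab) has matched "a", the "ab" alternative is not retried.
    static boolean atomicMatches(String s) {
        return Pattern.compile("(?>a|ab)c").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasDoubledLetter("hello"));   // true ("ll")
        System.out.println(fooBeforeBar("foobar"));      // foo
        System.out.println(possessiveMatches("aaa"));    // false
        System.out.println(atomicMatches("abc"));        // false
    }
}
```

A query whose pattern depends on any of these behaviors needs the codegen dispatcher or a Spark fallback, whatever the `allowIncompatible` setting.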
105+
106+ ## Features that exist on both sides but behave differently
107+
108+ Even where both engines accept a construct, the matching behavior is not always the same.
109+
110+ - **Unicode-aware character classes.** In the Rust engine, `\d`, `\w`, and `\s` are Unicode-aware by
111+ default, so `\d` matches every digit codepoint defined by Unicode rather than only `0`-`9`. Java's defaults for
112+ these classes match ASCII only and require the `UNICODE_CHARACTER_CLASS` flag (or `(?U)` inline) to switch to
113+ Unicode semantics. The same pattern can therefore match a different set of characters on each side.
114+ - **Line terminators.** In multiline mode, Java treats `\r`, `\n`, `\r\n`, and a few additional Unicode line
115+ separators as line boundaries by default. The Rust engine treats only `\n` as a line boundary unless CRLF
116+ mode is enabled. `^`, `$`, and `.` (with `(?s)` off) all depend on this definition.
117+ - **Case-insensitive matching.** Both engines support `(?i)`, but Java's default is ASCII case folding while
118+ the Rust engine uses full Unicode simple case folding when Unicode mode is on. Patterns that match characters
119+ outside ASCII can produce different results.
120+ - **POSIX character classes.** The Rust engine supports `[[:alpha:]]` style POSIX classes inside bracket
121+ expressions but not Java's `\p{Alpha}` shorthand. Java supports the `\p{Alpha}` form; it also compiles
122+ `[[:alpha:]]`, but reads it as a nested class of the literal characters `:`, `a`, `l`, `p`, and `h` rather than
122+ as a POSIX class. Unicode property escapes (`\p{L}`, `\p{Greek}`, etc.) are supported by both engines but cover
122+ slightly different sets of properties.
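123+ - **Octal and Unicode escapes.** Java accepts `\0nnn` for octal and `\uXXXX` for a BMP codepoint. Rust accepts
124+ `\uXXXX` and `\x{...}` for codepoints but, by default, rejects octal escapes such as `\0nnn`. Rust also rejects
124+ surrogate codepoints, so a Java pattern that spells a supplementary character as a pair of `\uXXXX` surrogates
124+ does not carry over.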
125+ - **Empty matches in `split`.** Spark's `StringSplit`, which is built on Java's regex, includes leading empty
126+ strings produced by zero-width matches at the start of the input. The Rust engine's `split` follows different
127+ rules, so split results can differ in edge cases involving empty matches even when the pattern itself is
128+ identical on both sides.
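The Java-side defaults described in the list above can be checked directly with `java.util.regex`. The sketch below covers three of them (the ASCII-only `\d`, ASCII-only case folding under `(?i)`, and `\r` as a line terminator in multiline mode) and shows the inline flags that change each default. It does not run the Rust engine, so it shows only what Java does; the class name and test inputs are chosen for illustration.

```java
import java.util.regex.Pattern;

// Demonstrates java.util.regex defaults only; names and inputs are illustrative.
final class JavaRegexDefaults {
    // True if the whole input matches the pattern.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the pattern matches anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String arabicIndicThree = "\u0663"; // a Unicode decimal digit outside ASCII

        // \d is ASCII-only by default; (?U) enables UNICODE_CHARACTER_CLASS.
        System.out.println(matches("\\d", arabicIndicThree));      // false
        System.out.println(matches("(?U)\\d", arabicIndicThree));  // true

        // (?i) folds only ASCII letters by default; adding u enables UNICODE_CASE.
        System.out.println(matches("(?i)\u00e9", "\u00c9"));       // false
        System.out.println(matches("(?iu)\u00e9", "\u00c9"));      // true

        // In multiline mode a lone carriage return is a line boundary by default;
        // (?d) (UNIX_LINES) restricts line boundaries to the newline character.
        System.out.println(finds("(?m)^b$", "a\rb\rc"));           // true
        System.out.println(finds("(?dm)^b$", "a\rb\rc"));          // false
    }
}
```

Each pair has one default result and one flagged result, which is why the same pattern text can return different rows depending on which engine evaluates it.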
129+
130+ ## When the Rust engine is safe
131+
132+ For most ASCII-only, non-anchored patterns that use only literal characters, simple character classes, and
133+ ordinary quantifiers, the two engines produce the same results. If you are confident your patterns fit this
134+ shape and want to avoid the JNI overhead of the codegen dispatcher, opting in to the Rust engine with the
135+ expression's `allowIncompatible=true` setting is generally safe.
136+
137+ For anything that uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling
138+ defaults, use the codegen dispatcher.
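One way to apply this guidance before opting in is a rough pre-check over the pattern text. The Java sketch below flags the constructs listed earlier as unsupported by the Rust engine (backreferences, lookaround, atomic groups, possessive quantifiers) and any non-ASCII characters in the pattern. It is a hypothetical heuristic, not part of Comet: it is deliberately conservative (an escaped backslash followed by a digit is flagged even though it is not a backreference), and it does not detect the Unicode, anchoring, or line-terminator differences, which depend on the data as well as the pattern.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical pre-check, not part of Comet. It errs on the side of flagging too much.
final class RustEngineHeuristic {
    private static final List<Pattern> JAVA_ONLY = Arrays.asList(
        Pattern.compile("\\\\[1-9]"),      // numbered backreference such as \1
        Pattern.compile("\\\\k<"),         // named backreference \k<name>
        Pattern.compile("\\(\\?<?[=!]"),   // lookahead (?= (?! and lookbehind (?<= (?<!
        Pattern.compile("\\(\\?>"),        // atomic group (?>
        Pattern.compile("[*+?}]\\+")       // possessive quantifier *+ ++ ?+ {n,m}+
    );

    // True if the pattern text contains a construct listed as Java-only.
    static boolean usesJavaOnlyConstruct(String regex) {
        return JAVA_ONLY.stream().anyMatch(p -> p.matcher(regex).find());
    }

    // True if every character of the pattern text is ASCII.
    static boolean isAscii(String regex) {
        return regex.chars().allMatch(c -> c < 128);
    }

    // Necessary, not sufficient: passing this check does not prove identical results.
    static boolean passesBasicChecks(String regex) {
        return isAscii(regex) && !usesJavaOnlyConstruct(regex);
    }

    public static void main(String[] args) {
        System.out.println(passesBasicChecks("[a-z]+[0-9]*"));  // true
        System.out.println(passesBasicChecks("(a)\\1"));        // false
    }
}
```

A pattern that fails the check should stay on the codegen dispatcher; one that passes still deserves a comparison against Spark on representative data before `allowIncompatible` is enabled.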
139+
140+ [`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
141+ [`regex`]: https://docs.rs/regex/latest/regex/
142+ [RE2]: https://github.com/google/re2/wiki/Syntax