feat: route structured-text functions through codegen dispatcher

Register the CSV / JSON / XPath / XML structured-text functions that previously
fell back to Spark so they stay native via the codegen dispatcher. None have a
native (Rust) implementation; they extend Spark's CodegenFallback.

- from_csv, schema_of_csv, schema_of_json, json_object_keys, xpath/xpath_*
- from_xml, to_xml, schema_of_xml (Spark 4.0+ only)

On Spark 3.4/3.5 these are plain expressions, registered directly in the serde
maps. On Spark 4.x they are RuntimeReplaceable and the optimizer rewrites them
to Invoke(evaluator)/StaticInvoke before Comet sees the plan, so they are
dispatched from CometExprShim4x.convertStructuredText, which matches the backing
evaluators by simple name to stay robust across 4.0/4.1/4.2. When the dispatcher
is disabled they fall back to Spark.

Adds CometStructuredTextSuite (XML tests gated to Spark 4.0+). Verified on the
spark-3.4, 3.5, 4.0, 4.1, and 4.2 profiles.
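
The "match the backing evaluators by simple name" idea described above can be illustrated with a minimal, self-contained sketch. It is not the code in CometExprShim4x.convertStructuredText. The evaluator classes, the handler table, and the `functionFor` helper are hypothetical stand-ins invented for illustration, and the class names are not confirmed Spark class names. The sketch only shows why keying on `getSimpleName()` avoids depending on the package an evaluator lives in.

```java
import java.util.Map;
import java.util.Optional;

public class SimpleNameDispatch {
    // Hypothetical stand-ins for evaluator objects found inside Invoke(...) nodes.
    // These are NOT real Spark classes; the names are assumptions for illustration.
    static final class FromXmlEvaluator {}
    static final class SchemaOfXmlEvaluator {}
    static final class UnrelatedEvaluator {}

    // Hypothetical table: simple class name -> SQL function it backs.
    private static final Map<String, String> HANDLERS = Map.of(
        "FromXmlEvaluator", "from_xml",
        "SchemaOfXmlEvaluator", "schema_of_xml");

    // Look up the handler by simple name only, so a change of package between
    // versions would not break the match. Unknown evaluators yield empty,
    // which a caller could treat as "fall back".
    static Optional<String> functionFor(Object evaluator) {
        String simpleName = evaluator.getClass().getSimpleName();
        return Optional.ofNullable(HANDLERS.get(simpleName));
    }

    public static void main(String[] args) {
        System.out.println(functionFor(new FromXmlEvaluator()));
        System.out.println(functionFor(new UnrelatedEvaluator()));
    }
}
```

A string-keyed lookup like this trades compile-time checking for tolerance of classes that move or are absent in a given version, which is presumably the robustness the commit message refers to.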
docs/source/user-guide/latest/expressions.md
31 additions & 4 deletions
@@ -52,11 +52,9 @@ Comet focuses acceleration on mainstream relational, string, datetime, math, and
 expressions. The following function families are **not currently planned** for native acceleration (they are not on the 1.0 roadmap): specialized functionality with narrow real-world analytics use and high implementation cost. They fall back to Spark and may be reconsidered based on demand:

 - **Probabilistic sketches and approximate top-k** (`kll_sketch_*`, `hll_*`, `theta_*`, `count_min_sketch`, `bitmap_*`, `approx_top_k*`): specialized data structures with exact-correctness traps.
-- **XML / XPath** (`from_xml`, `to_xml`, `schema_of_xml`, `xpath*`): legacy text format, rare in accelerated workloads.
 - **Avro / Protobuf codecs** (`from_avro`, `to_avro`, `from_protobuf`, `to_protobuf`, `schema_of_avro`): format conversion belongs at the IO layer, not expression evaluation.
 - **JVM reflection** (`java_method`, `reflect`): niche, and they invoke arbitrary JVM methods (a security concern).
-- **CSV functions** (`from_csv`, `to_csv`, `schema_of_csv`): row-level CSV parsing and formatting in expressions is niche and better handled at the data source layer.
 - **Miscellaneous niche** (`histogram_numeric`, `version`, `sentences`, `quote`): low-value or specialized functions with little benefit from native acceleration.
 | `to_json` | ✅ | Options and map/array inputs fall back ([audit](../../contributor-guide/expression-audits/json_funcs.md#to_json)) |

 ---
@@ -639,6 +647,25 @@ fall back to Spark.

 ---

+## xml_funcs
+
+| Function | Status | Notes |
+| --- | --- | --- |
+| `from_xml` | ✅ | Spark 4.0+ |
+| `schema_of_xml` | ✅ | Spark 4.0+ |
+| `to_xml` | ✅ | Spark 4.0+ |
+| `xpath` | ✅ | |
+| `xpath_boolean` | ✅ | |
+| `xpath_double` | ✅ | |
+| `xpath_float` | ✅ | |
+| `xpath_int` | ✅ | |
+| `xpath_long` | ✅ | |
+| `xpath_number` | ✅ | Alias of `xpath_double` |
+| `xpath_short` | ✅ | |
+| `xpath_string` | ✅ | |
+
+---
+
 ## Beyond SQL functions

 Comet also accelerates a number of Catalyst expressions that have no Spark SQL function name and therefore do not appear in the tables above. These arise from the DataFrame API, from SQL syntax other than function calls, or from the query optimizer. They include: