|
| 1 | +<!--- |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one |
| 3 | + or more contributor license agreements. See the NOTICE file |
| 4 | + distributed with this work for additional information |
| 5 | + regarding copyright ownership. The ASF licenses this file |
| 6 | + to you under the Apache License, Version 2.0 (the |
| 7 | + "License"); you may not use this file except in compliance |
| 8 | + with the License. You may obtain a copy of the License at |
| 9 | +
|
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | +
|
| 12 | + Unless required by applicable law or agreed to in writing, |
| 13 | + software distributed under the License is distributed on an |
| 14 | + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | + KIND, either express or implied. See the License for the |
| 16 | + specific language governing permissions and limitations |
| 17 | + under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +# Scala UDF and Java UDF Support |
| 21 | + |
| 22 | +Comet executes Spark's Scala and Java [scalar user-defined functions (UDFs)](https://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html) on the native Comet path. The presence of a UDF does not force the enclosing operator off the native path; surrounding native operators stay native. |
| 23 | + |
| 24 | +This page covers Spark's `ScalaUDF` (Scala `udf(...)`, `spark.udf.register(...)` over Scala or Java functional interfaces, and SQL `CREATE FUNCTION ... AS 'com.example.MyUDF'`). Other UDF kinds (Python / Pandas, Hive, aggregate) are out of scope and continue to fall back to Spark. |
| 25 | + |
| 26 | +This feature is experimental and disabled by default. |
| 27 | + |
| 28 | +## Configuration |
| 29 | + |
| 30 | +| Key | Default | Description | |
| 31 | +| ------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------ | |
| 32 | +| `spark.comet.exec.scalaUDF.codegen.enabled` | `false` | When `true`, eligible `ScalaUDF`s run on the Comet path. When `false`, the enclosing operator falls back to Spark. | |
| 33 | + |
| 34 | +## Supported |
| 35 | + |
| 36 | +- User functions registered via `udf(...)`, `spark.udf.register(...)` (Scala or Java functional interfaces), or SQL `CREATE FUNCTION ... AS 'com.example.MyUDF'`. |
| 37 | +- Scalar input/output types: `Boolean`, `Byte`, `Short`, `Int`, `Long`, `Float`, `Double`, `Decimal`, `String`, `Binary`, `Date`, `Timestamp`, `TimestampNTZ`. |
| 38 | +- Complex input/output types with arbitrary nesting: `ArrayType`, `StructType`, `MapType`. |
| 39 | +- Composition with other Catalyst expressions inside the argument tree (e.g. `myUdf(upper(s))` runs as one native unit). |
| 40 | +- Higher-order functions (`transform`, `filter`, `exists`, `aggregate`, `zip_with`, `map_filter`, `map_zip_with`, etc.) inside the argument tree. |
| 41 | + |
| 42 | +## Not supported |
| 43 | + |
| 44 | +- Aggregate UDFs (`ScalaAggregator`, `TypedImperativeAggregate`, the legacy `UserDefinedAggregateFunction`). |
| 45 | +- Table UDFs and generators. |
| 46 | +- Python `@udf` and Pandas `@pandas_udf`. |
| 47 | +- Hive `GenericUDF` and `SimpleUDF`. |
| 48 | +- `CalendarIntervalType`, `NullType`, and `UserDefinedType` arguments and return types. UDT-typed columns fall back to Spark; for native execution, store and read the underlying representation directly (e.g. write MLlib `Vector` outputs as `Struct<type: Byte, size: Int, indices: Array<Int>, values: Array<Double>>` rather than `VectorUDT`). |
| 49 | +- Trees whose total nested-field count (output plus all input columns the UDF tree references) exceeds `spark.sql.codegen.maxFields` (default 100). Comet refuses these at plan time and the operator falls back to Spark. |
| 50 | + |
| 51 | +When a UDF is rejected, the reason surfaces through Comet's standard fallback diagnostics; the query still runs on Spark. |
| 52 | + |
| 53 | +## Behavior |
| 54 | + |
| 55 | +- Non-deterministic expressions referenced from the argument tree (`rand`, `uuid`, `monotonically_increasing_id`) produce per-partition sequences consistent with Spark. |
| 56 | +- `TaskContext.get()` inside the user function returns the driving Spark task's context. |
| 57 | +- The user function must be closure-serializable; the same function that works with Spark's executor execution works here. |
| 58 | + |
| 59 | +## Known limitations |
| 60 | + |
| 61 | +- Each query containing a ScalaUDF pays a one-time codegen cost on its first batch and reuses the compiled kernel for subsequent batches, matching Spark's whole-stage codegen behavior. Bytecode is deduped JVM-wide via the same `CodeGenerator` cache, so structurally identical queries across a session share the compiled class. |
0 commit comments