Commit ebd011c

chore: force extension registration in regexp_extract benchmark
CometBenchmarkBase wires CometSparkSessionExtensions via `withExtensions`, but that call is silently dropped when `SparkSession.builder.getOrCreate()` returns an existing session, so the benchmark was running plain Spark in all four "modes" -- the EXPLAIN plan was just `Project + ColumnarToRow + FileScan parquet` with no CometScan or CometProject.

Override `getSparkSession` to set `spark.sql.extensions` on the SparkConf (plus the off-heap and shuffle-manager configs CometTestBase uses) so Comet planning rules actually fire.

The native Rust mode now shows up to 2.5x over Spark on patterns with many matches (e.g. regexp_extract_all / alternation), and 1.2-1.3x on the simpler shapes.
1 parent 1de8c0f commit ebd011c

1 file changed

Lines changed: 31 additions & 0 deletions

spark/src/test/scala/org/apache/spark/sql/benchmark/CometRegExpExtractBenchmark.scala

@@ -19,7 +19,10 @@
 
 package org.apache.spark.sql.benchmark
 
+import org.apache.spark.SparkConf
 import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.internal.SQLConf
 
 import org.apache.comet.CometConf
 
@@ -52,6 +55,34 @@ case class RegExpExtractPattern(name: String, pattern: String, idx: Int)
  */
 object CometRegExpExtractBenchmark extends CometBenchmarkBase {
 
+  // CometBenchmarkBase wires `CometSparkSessionExtensions` via `withExtensions`, but that call
+  // is silently dropped when `SparkSession.builder.getOrCreate()` returns an existing session
+  // (the `SqlBasedBenchmark.spark` field can construct one before the override runs). Setting
+  // `spark.sql.extensions` on the SparkConf forces extension registration regardless. The
+  // off-heap and shuffle-manager configs match what CometTestBase sets so Comet's planning
+  // rules don't bail out early.
+  override def getSparkSession: SparkSession = {
+    val conf = new SparkConf()
+      .setAppName("CometRegExpExtractBenchmark")
+      .set("spark.master", "local[1]")
+      .setIfMissing("spark.driver.memory", "3g")
+      .setIfMissing("spark.executor.memory", "3g")
+      .set("spark.sql.extensions", "org.apache.comet.CometSparkSessionExtensions")
+      .set("spark.memory.offHeap.enabled", "true")
+      .set("spark.memory.offHeap.size", "2g")
+      .set(
+        "spark.shuffle.manager",
+        "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager")
+
+    val sparkSession = SparkSession.builder.config(conf).getOrCreate()
+    sparkSession.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
+    sparkSession.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
+    sparkSession.conf.set(CometConf.COMET_ENABLED.key, "false")
+    sparkSession.conf.set(CometConf.COMET_EXEC_ENABLED.key, "false")
+    sparkSession.conf.set(SQLConf.ANSI_ENABLED.key, "false")
+    sparkSession
+  }
+
   // Patterns chosen to span common shapes that both engines accept. Avoid Java-only constructs
   // (backreferences, lookaround, possessive quantifiers, embedded flags) so the native (Rust)
   // path is actually exercised rather than falling through to the codegen dispatcher.
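
The commit message identifies the regression by reading the EXPLAIN plan and finding no CometScan or CometProject. A guard of that kind could be automated so a benchmark fails fast instead of silently measuring plain Spark. The following is a minimal sketch and not part of this commit: `PlanCheck`, `cometOperators`, and `requireComet` are hypothetical names, and the check assumes that Comet operators appear in the plan text with a `Comet` prefix, as the two operators named in the commit message do.

```scala
object PlanCheck {
  // Matches operator names such as CometScan or CometProject in plan text.
  // Assumption: Comet operators are recognizable by a "Comet" prefix.
  private val CometOperator = """\bComet[A-Z]\w*""".r

  /** Returns the distinct Comet operator names found in a plan's text form. */
  def cometOperators(planText: String): List[String] =
    CometOperator.findAllIn(planText).toList.distinct

  /** Throws IllegalArgumentException when the plan text contains no Comet operators. */
  def requireComet(planText: String): Unit =
    require(
      cometOperators(planText).nonEmpty,
      s"expected Comet operators in plan, found none:\n$planText")
}
```

In a benchmark, the plan text would likely come from something like `df.queryExecution.executedPlan.toString`, checked only in the modes where Comet is enabled.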
