Describe the issue
RegexComplexityEstimator.isValid always computes numRows as gpuTargetBatchSizeBytes / StringType.defaultSize (≈ 53,687,091 with the default 1 GiB batch and Spark's 20-byte string default). This is intended as a per-batch upper bound, but it is applied even when the actual input is far smaller than one batch — small lookup tables, small joins, post-aggregation outputs — which are exactly the cases where regex over-rejection is the most painful (no real memory risk, but the planner still falls back to CPU).
After #14849 fixed the long-standing Int overflow in numStates * numRows * 2, some patterns that previously slipped through (because the overflow wrapped to a negative number that compared <= 2 GiB) are now correctly rejected. The over-rejection becomes user-visible.
Reproduction
70k-row table written to ORC, then queried with a 7-alternation pattern. The pattern has numStates = 64, and at the per-batch ceiling the estimator reports 64 × 53,687,091 × 2 ≈ 6.4 GiB, above the default 2 GiB spark.rapids.sql.regexp.maxStateMemoryBytes. The query falls back to CPU even though the table has only 70,000 rows and the actual NFA scratch needed is in the tens of KB.
Pattern: (quick brown|brown fox|fox jumps|jumps over|over the|the lazy|lazy dog)
import datetime, os
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F
schema = StructType([
StructField("rlike_multiple_match", StringType(), True),
])
data = [(f"row-{i} The quick brown fox jumps over the lazy dog.",) for i in range(70000)]
spark = SparkSession.builder.appName("regex-fallback-repro").getOrCreate()
spark.conf.set("spark.rapids.sql.enabled", "true")
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.createDataFrame(data, schema).write.orc("/tmp/df_rlike_orc", mode="overwrite")
df = spark.read.orc("/tmp/df_rlike_orc").withColumn("uniqueID", F.monotonically_increasing_id())
df.createOrReplaceTempView("df_rlike_orc_table")
spark.sql("""
select rlike(rlike_multiple_match,
"(quick brown|brown fox|fox jumps|jumps over|over the|the lazy|lazy dog)") as literal,
uniqueID from df_rlike_orc_table order by literal, uniqueID
""").explain()
Observed: RLike runs on CPU with the message "estimated memory needed for regular expression exceeds the maximum. Set spark.rapids.sql.regexp.maxStateMemoryBytes to change it."
Root cause
RegexComplexityEstimator.estimateGpuMemory derives numRows purely from gpuTargetBatchSizeBytes / StringType.defaultSize. This is the worst-case rows-per-batch and never considers the actual input size. Catalyst already tracks Statistics.sizeInBytes (always populated) and Statistics.rowCount (when CBO or leaf relation provides it) on the LogicalPlan that the enclosing SparkPlan links to via logicalLink. We can read this through the meta hierarchy at planning time and pass it as a rowsHint to the estimator.
Proposed fix
Add an optional rowsHint: Option[Long] parameter to RegexComplexityEstimator.isValid, used as min(perBatchRows, rowsHint) — strictly tightens, never inflates, the estimate. Compute the hint in GpuRegExpUtils.validateRegExpComplexity by walking ExprMeta.parent up to the enclosing SparkPlanMeta, then reading wrapped.logicalLink.stats:
rowCount if present (exact)
sizeInBytes / StringType.defaultSize as upper bound otherwise
For the reproduction above, the catalyst-projected sizeInBytes is 5.5 KiB (Project output is boolean + bigint = 9 bytes per row), so bySize ≈ 281, and 64 × 281 × 2 ≈ 35 KB — comfortably below the 2 GiB budget. RLike (further rewritten to GpuContainsAny) runs on GPU.
Workaround in current releases
Either raise spark.rapids.sql.regexp.maxStateMemoryBytes (e.g. 8 GiB) or lower spark.rapids.sql.batchSizeBytes (e.g. 256 MiB) so the per-batch row ceiling shrinks proportionally. Both are safe for small inputs; neither is a great default.
Describe the issue
RegexComplexityEstimator.isValidalways computesnumRowsasgpuTargetBatchSizeBytes / StringType.defaultSize(≈ 53,687,091 with the default 1 GiB batch and Spark's 20-byte string default). This is intended as a per-batch upper bound, but it is applied even when the actual input is far smaller than one batch — small lookup tables, small joins, post-aggregation outputs — which are exactly the cases where regex over-rejection is the most painful (no real memory risk, but the planner still falls back to CPU).After #14849 fixed the long-standing
Intoverflow innumStates * numRows * 2, some patterns that previously slipped through (because the overflow wrapped to a negative number that compared<= 2 GiB) are now correctly rejected. The over-rejection becomes user-visible.Reproduction
70k-row table written to ORC, then queried with a 7-alternation pattern. The pattern has
numStates = 64, and at the per-batch ceiling the estimator reports64 × 53,687,091 × 2 ≈ 6.4 GiB, above the default 2 GiBspark.rapids.sql.regexp.maxStateMemoryBytes. The query falls back to CPU even though the table has only 70,000 rows and the actual NFA scratch needed is in the tens of KB.Pattern:
(quick brown|brown fox|fox jumps|jumps over|over the|the lazy|lazy dog)Observed:
RLikeruns on CPU with the message "estimated memory needed for regular expression exceeds the maximum. Set spark.rapids.sql.regexp.maxStateMemoryBytes to change it."Root cause
RegexComplexityEstimator.estimateGpuMemoryderivesnumRowspurely fromgpuTargetBatchSizeBytes / StringType.defaultSize. This is the worst-case rows-per-batch and never considers the actual input size. Catalyst already tracksStatistics.sizeInBytes(always populated) andStatistics.rowCount(when CBO or leaf relation provides it) on theLogicalPlanthat the enclosingSparkPlanlinks to vialogicalLink. We can read this through the meta hierarchy at planning time and pass it as arowsHintto the estimator.Proposed fix
Add an optional
rowsHint: Option[Long]parameter toRegexComplexityEstimator.isValid, used asmin(perBatchRows, rowsHint)— strictly tightens, never inflates, the estimate. Compute the hint inGpuRegExpUtils.validateRegExpComplexityby walkingExprMeta.parentup to the enclosingSparkPlanMeta, then readingwrapped.logicalLink.stats:rowCountif present (exact)sizeInBytes / StringType.defaultSizeas upper bound otherwiseFor the reproduction above, the catalyst-projected
sizeInBytesis 5.5 KiB (Project output isboolean + bigint= 9 bytes per row), sobySize ≈ 281, and64 × 281 × 2 ≈ 35 KB— comfortably below the 2 GiB budget.RLike(further rewritten toGpuContainsAny) runs on GPU.Workaround in current releases
Either raise
spark.rapids.sql.regexp.maxStateMemoryBytes(e.g. 8 GiB) or lowerspark.rapids.sql.batchSizeBytes(e.g. 256 MiB) so the per-batch row ceiling shrinks proportionally. Both are safe for small inputs; neither is a great default.