Skip to content

[FEA] Use catalyst stats as rowsHint in RegexComplexityEstimator to avoid CPU fallback on small inputs #14887

@wjxiz1992

Description

@wjxiz1992

Describe the issue

RegexComplexityEstimator.isValid always computes numRows as gpuTargetBatchSizeBytes / StringType.defaultSize (≈ 53,687,091 with the default 1 GiB batch and Spark's 20-byte string default). This is intended as a per-batch upper bound, but it is applied even when the actual input is far smaller than one batch — small lookup tables, small joins, post-aggregation outputs — which are exactly the cases where regex over-rejection is the most painful (no real memory risk, but the planner still falls back to CPU).

After #14849 fixed the long-standing Int overflow in numStates * numRows * 2, some patterns that previously slipped through (because the overflow wrapped to a negative number that compared <= 2 GiB) are now correctly rejected. The over-rejection becomes user-visible.

Reproduction

70k-row table written to ORC, then queried with a 7-alternation pattern. The pattern has numStates = 64, and at the per-batch ceiling the estimator reports 64 × 53,687,091 × 2 ≈ 6.4 GiB, above the default 2 GiB spark.rapids.sql.regexp.maxStateMemoryBytes. The query falls back to CPU even though the table has only 70,000 rows and the actual NFA scratch needed is in the tens of KB.

Pattern: (quick brown|brown fox|fox jumps|jumps over|over the|the lazy|lazy dog)

import datetime, os
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F

schema = StructType([
    StructField("rlike_multiple_match", StringType(), True),
])
data = [(f"row-{i} The quick brown fox jumps over the lazy dog.",) for i in range(70000)]
spark = SparkSession.builder.appName("regex-fallback-repro").getOrCreate()
spark.conf.set("spark.rapids.sql.enabled", "true")
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.createDataFrame(data, schema).write.orc("/tmp/df_rlike_orc", mode="overwrite")
df = spark.read.orc("/tmp/df_rlike_orc").withColumn("uniqueID", F.monotonically_increasing_id())
df.createOrReplaceTempView("df_rlike_orc_table")
spark.sql("""
  select rlike(rlike_multiple_match,
    "(quick brown|brown fox|fox jumps|jumps over|over the|the lazy|lazy dog)") as literal,
    uniqueID from df_rlike_orc_table order by literal, uniqueID
""").explain()

Observed: RLike runs on CPU with the message "estimated memory needed for regular expression exceeds the maximum. Set spark.rapids.sql.regexp.maxStateMemoryBytes to change it."

Root cause

RegexComplexityEstimator.estimateGpuMemory derives numRows purely from gpuTargetBatchSizeBytes / StringType.defaultSize. This is the worst-case rows-per-batch and never considers the actual input size. Catalyst already tracks Statistics.sizeInBytes (always populated) and Statistics.rowCount (when CBO or leaf relation provides it) on the LogicalPlan that the enclosing SparkPlan links to via logicalLink. We can read this through the meta hierarchy at planning time and pass it as a rowsHint to the estimator.

Proposed fix

Add an optional rowsHint: Option[Long] parameter to RegexComplexityEstimator.isValid, used as min(perBatchRows, rowsHint) — strictly tightens, never inflates, the estimate. Compute the hint in GpuRegExpUtils.validateRegExpComplexity by walking ExprMeta.parent up to the enclosing SparkPlanMeta, then reading wrapped.logicalLink.stats:

  • rowCount if present (exact)
  • sizeInBytes / StringType.defaultSize as upper bound otherwise

For the reproduction above, the catalyst-projected sizeInBytes is 5.5 KiB (Project output is boolean + bigint = 9 bytes per row), so bySize ≈ 281, and 64 × 281 × 2 ≈ 35 KB — comfortably below the 2 GiB budget. RLike (further rewritten to GpuContainsAny) runs on GPU.

Workaround in current releases

Either raise spark.rapids.sql.regexp.maxStateMemoryBytes (e.g. 8 GiB) or lower spark.rapids.sql.batchSizeBytes (e.g. 256 MiB) so the per-batch row ceiling shrinks proportionally. Both are safe for small inputs; neither is a great default.

Metadata

Metadata

Assignees

No one assigned

    Labels

    ? - Needs TriageNeed team to review and classifyperformanceA performance related task/issue

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions