[FEA] Use catalyst stats as rowsHint in RegexComplexityEstimator to avoid CPU fallback on small inputs

## Describe the issue

`RegexComplexityEstimator.isValid` always computes `numRows` as `gpuTargetBatchSizeBytes / StringType.defaultSize` (≈ **53,687,091** with the default 1 GiB batch and Spark's 20-byte string default). This is intended as a per-batch upper bound, but it is applied even when the actual input is far smaller than one batch — small lookup tables, small joins, post-aggregation outputs — which are exactly the cases where regex over-rejection is the most painful (no real memory risk, but the planner still falls back to CPU).

After https://github.com/NVIDIA/spark-rapids/pull/14849 fixed the long-standing `Int` overflow in `numStates * numRows * 2`, some patterns that previously slipped through (because the overflow wrapped to a negative number that compared `<= 2 GiB`) are now correctly rejected. The over-rejection becomes user-visible.

## Reproduction

70k-row table written to ORC, then queried with a 7-alternation pattern. The pattern has `numStates = 64`, and at the per-batch ceiling the estimator reports `64 × 53,687,091 × 2 ≈ 6.4 GiB`, above the default 2 GiB `spark.rapids.sql.regexp.maxStateMemoryBytes`. The query falls back to CPU even though the table has only 70,000 rows and the actual NFA scratch needed is in the tens of KB.

Pattern: `(quick brown|brown fox|fox jumps|jumps over|over the|the lazy|lazy dog)`

```python
import datetime, os
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as F

schema = StructType([
    StructField("rlike_multiple_match", StringType(), True),
])
data = [(f"row-{i} The quick brown fox jumps over the lazy dog.",) for i in range(70000)]
spark = SparkSession.builder.appName("regex-fallback-repro").getOrCreate()
spark.conf.set("spark.rapids.sql.enabled", "true")
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.createDataFrame(data, schema).write.orc("/tmp/df_rlike_orc", mode="overwrite")
df = spark.read.orc("/tmp/df_rlike_orc").withColumn("uniqueID", F.monotonically_increasing_id())
df.createOrReplaceTempView("df_rlike_orc_table")
spark.sql("""
  select rlike(rlike_multiple_match,
    "(quick brown|brown fox|fox jumps|jumps over|over the|the lazy|lazy dog)") as literal,
    uniqueID from df_rlike_orc_table order by literal, uniqueID
""").explain()
```

Observed: `RLike` runs on CPU with the message *"estimated memory needed for regular expression exceeds the maximum. Set spark.rapids.sql.regexp.maxStateMemoryBytes to change it."*

## Root cause

`RegexComplexityEstimator.estimateGpuMemory` derives `numRows` purely from `gpuTargetBatchSizeBytes / StringType.defaultSize`. This is the worst-case rows-per-batch and never considers the actual input size. Catalyst already tracks `Statistics.sizeInBytes` (always populated) and `Statistics.rowCount` (when CBO or leaf relation provides it) on the `LogicalPlan` that the enclosing `SparkPlan` links to via `logicalLink`. We can read this through the meta hierarchy at planning time and pass it as a `rowsHint` to the estimator.

## Proposed fix

Add an optional `rowsHint: Option[Long]` parameter to `RegexComplexityEstimator.isValid`, used as `min(perBatchRows, rowsHint)` — strictly tightens, never inflates, the estimate. Compute the hint in `GpuRegExpUtils.validateRegExpComplexity` by walking `ExprMeta.parent` up to the enclosing `SparkPlanMeta`, then reading `wrapped.logicalLink.stats`:

* `rowCount` if present (exact)
* `sizeInBytes / StringType.defaultSize` as upper bound otherwise

For the reproduction above, the catalyst-projected `sizeInBytes` is **5.5 KiB** (Project output is `boolean + bigint` = 9 bytes per row), so `bySize ≈ 281`, and `64 × 281 × 2 ≈ 35 KB` — comfortably below the 2 GiB budget. `RLike` (further rewritten to `GpuContainsAny`) runs on GPU.

## Workaround in current releases

Either raise `spark.rapids.sql.regexp.maxStateMemoryBytes` (e.g. 8 GiB) or lower `spark.rapids.sql.batchSizeBytes` (e.g. 256 MiB) so the per-batch row ceiling shrinks proportionally. Both are safe for small inputs; neither is a great default.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Use catalyst stats as rowsHint in RegexComplexityEstimator to avoid CPU fallback on small inputs #14887

Describe the issue

Reproduction

Root cause

Proposed fix

Workaround in current releases

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[FEA] Use catalyst stats as rowsHint in RegexComplexityEstimator to avoid CPU fallback on small inputs #14887

Description

Describe the issue

Reproduction

Root cause

Proposed fix

Workaround in current releases

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions