Skip to content

[BUG] PPL eval with non-deterministic functions (e.g. RAND) re-evaluates on each reference #4507

@alexey-temnikov

Description

@alexey-temnikov

Query Information

PPL Command/Query:

# Failing query - demonstrates the bug
source=test_rand | eval randomValue=CEIL(RAND() * 5) | where value=randomValue | fields id, value, randomValue

# Original user query
source=travel
| regex `Traveler nationality`='Korea'
| where YEAR(`Start date`)=2023
| eventstats COUNT() as TravelerCount
| eval randomInt=CEIL(RAND() * TravelerCount), EnddateDays=TO_DAYS(`End date`), Modulo=EnddateDays % TravelerCount
| where Modulo=randomInt

Expected Result:
The WHERE value=randomValue clause should filter rows where the value field equals the randomValue computed by the eval command. Once eval computes randomValue for a row, that value should remain stable for that row across all subsequent operations. Only rows where value truly equals randomValue should be returned.

Actual Result:
The query returns rows where value does NOT equal randomValue. Example output:

id=4:  value=4, randomValue=2  (4 ≠ 2, should be filtered out)
id=13: value=3, randomValue=4  (3 ≠ 4, should be filtered out)
id=14: value=4, randomValue=2  (4 ≠ 2, should be filtered out)

The RAND() function is being re-evaluated during the WHERE clause comparison, producing a different value than what's displayed in the output fields. This violates the fundamental expectation that a column value assigned to a record should be stable.

Dataset Information

Dataset/Schema Type

  • Custom (details below)

Index Mapping

{
  "mappings": {
    "properties": {
      "id": { "type": "integer" },
      "value": { "type": "integer" }
    }
  }
}

Sample Data

{"id": 1, "value": 1}
{"id": 2, "value": 2}
{"id": 3, "value": 3}
{"id": 4, "value": 4}
{"id": 5, "value": 0}

Bug Description

Issue Summary:
Non-deterministic functions (like RAND()) in eval expressions are re-evaluated each time the variable is referenced in subsequent PPL commands, rather than being computed once per row and frozen. This causes WHERE clauses to compare against different values than what appears in the output, breaking the fundamental contract that a column value should be stable once assigned to a record.

Steps to Reproduce:

  1. Create a test index: PUT /test_rand with integer fields id and value
  2. Insert test data with values 0-4
  3. Run: source=test_rand | eval randomValue=CEIL(RAND() * 5) | where value=randomValue | fields id, value, randomValue
  4. Observe that returned rows have value ≠ randomValue
  5. Get an explain plan: POST _plugins/_ppl/_explain with the same query
  6. Note that RAND() appears in the physical plan's EnumerableCalc alongside the filter condition

Physical Plan Analysis:

Buggy behaviour (without sort):

EnumerableCalc(expr#3=[RAND()], expr#4=[*($t3, $t1)], expr#5=[CEIL($t4)], expr#6=[=($t2, $t5)], ...)
  CalciteEnumerableIndexScan(...)

The RAND() (expr#3) is evaluated within the same EnumerableCalc that performs the filter comparison (expr#6), causing it to be called multiple times per row - once for the filter evaluation and once for the output projection.

Correct behaviour (with sort before where):

EnumerableCalc(expr#3=[=($t0, $t2)], ..., $condition=[$t3])
  EnumerableSort(sort0=[$2])
    EnumerableCalc(expr#2=[RAND()], expr#3=[5], expr#4=[*($t2, $t3)], expr#5=[CEIL($t4)], $f2=[$t5])
      CalciteEnumerableIndexScan(...)

The RAND() is evaluated in a separate EnumerableCalc, then the EnumerableSort forces materialization, and the subsequent filter uses the already-computed value ($t2).

Impact:

  • Cannot reliably use non-deterministic functions in eval expressions that are later referenced in WHERE clauses
  • Affects use cases like random sampling, random assignment, or any computation involving RAND()
  • The behaviour is unintuitive and violates the fundamental expectation that eval creates a computed field with a fixed value per row
  • Affects any non-deterministic function, not just RAND()

Environment Information

OpenSearch Version:
OpenSearch 3.3.0-SNAPSHOT

Additional Details:

  • Affects PPL queries using the Calcite engine
  • The eval command documentation does not mention this limitation
  • RAND is registered via SqlStdOperatorTable.RAND in PPLFuncImpTable.java (line 691)

Tentative Root Cause Analysis

This is a preliminary analysis and requires further investigation by the Calcite integration team.

Core Issue

The root cause is in how Apache Calcite's query optimizer handles projections containing non-deterministic functions. When translating PPL to Calcite logical plans and optimizing them to physical plans, Calcite merges the LogicalProject (from eval) with LogicalFilter (from where) into a single EnumerableCalc operation. This optimization is correct for deterministic expressions but incorrect for non-deterministic ones.

Code References

  1. RAND registration: /core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java (line 691)

    • registerOperator(RAND, SqlStdOperatorTable.RAND);
    • Uses Calcite's standard RAND operator
  2. Eval translation: /core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java (lines 813-842)

    • visitEval method uses projectPlusOverriding to add computed fields
    • Creates a LogicalProject node via context.relBuilder.projectPlus(newFields)
  3. RAND implementation: /core/src/main/java/org/opensearch/sql/expression/operator/arthmetic/MathematicalFunctions.java (lines 486-498)

    • impl(() -> new ExprFloatValue(new Random().nextFloat()), FLOAT)
    • Creates a new Random instance on each evaluation

Why This Happens

  1. Projection Merging: Calcite's optimizer applies rules like ProjectFilterTransposeRule or FilterCalcMergeRule that merge adjacent projections and filters for efficiency
  2. Expression Inlining: When merging, Calcite inlines the expression definition rather than treating it as a materialized value
  3. No Materialization Barrier: Without an operation that forces materialization (like sort, aggregation, or join), the optimizer freely merges operations
  4. Non-determinism Not Respected: Even though SqlStdOperatorTable.RAND should be marked as non-deterministic, the optimizer's merging rules don't check for this property before inlining expressions

Why Sort Works as Workaround

Sort operations require actual materialized values to perform ordering. This creates a barrier that prevents the projection from being merged with subsequent filters, ensuring RAND() is evaluated once and the result is reused.


Tentative Proposed Fix

This is a preliminary analysis and requires further investigation.

Possible Approach

The fix requires preventing Calcite from merging projections containing non-deterministic functions with subsequent operations. Several approaches are possible:

Option 1: Custom Calcite Rule
Create a custom Calcite optimization rule that detects non-deterministic functions in projections and prevents merging:

  • Check if a LogicalProject contains calls to non-deterministic functions
  • Mark such projections as non-mergeable or add a barrier
  • Insert this rule early in the optimization phase

Option 2: Force Materialization After Eval
Modify visitEval in CalciteRelNodeVisitor.java to insert a materialization hint or dummy operation after projections containing non-deterministic functions:

  • Detect non-deterministic function calls in eval expressions
  • Add a Calcite hint or create a barrier operation
  • This ensures values are computed once before subsequent operations

Implementation Considerations

  • Must verify that SqlStdOperatorTable.RAND is properly marked as non-deterministic in Calcite
  • Need to handle all non-deterministic functions, not just RAND
  • Must ensure the fix doesn't negatively impact performance for deterministic expressions
  • Requires thorough testing with various query patterns

Workaround

Temporary workaround: Insert a sort command on the eval-computed field before using it in a where clause:

source=test_rand 
| eval randomValue=CEIL(RAND() * 5) 
| sort randomValue 
| where value=randomValue 
| fields id, value, randomValue

This forces materialization of the random value before the filter is applied, ensuring consistent behaviour.

Alternative workaround (for random sampling): Use top command:

source=travel
| where match(`Traveler nationality`,'Korea')
| where YEAR(`Start date`)=2023
| eval randomInt=RAND()
| fields `Traveler name`, randomInt
| top 1 randomInt

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcalcitecalcite migration releatedpushdownpushdown related issues

    Type

    No type

    Projects

    Status

    Not Started

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions