-
Notifications
You must be signed in to change notification settings - Fork 175
Description
Query Information
PPL Command/Query:
# Failing query - demonstrates the bug
source=test_rand | eval randomValue=CEIL(RAND() * 5) | where value=randomValue | fields id, value, randomValue
# Original user query
source=travel
| regex `Traveler nationality`='Korea'
| where YEAR(`Start date`)=2023
| eventstats COUNT() as TravelerCount
| eval randomInt=CEIL(RAND() * TravelerCount), EnddateDays=TO_DAYS(`End date`), Modulo=EnddateDays % TravelerCount
| where Modulo=randomInt
Expected Result:
The WHERE value=randomValue
clause should filter rows where the value
field equals the randomValue
computed by the eval
command. Once eval
computes randomValue
for a row, that value should remain stable for that row across all subsequent operations. Only rows where value
truly equals randomValue
should be returned.
Actual Result:
The query returns rows where value
does NOT equal randomValue
. Example output:
id=4: value=4, randomValue=2 (4 ≠ 2, should be filtered out)
id=13: value=3, randomValue=4 (3 ≠ 4, should be filtered out)
id=14: value=4, randomValue=2 (4 ≠ 2, should be filtered out)
The RAND()
function is being re-evaluated during the WHERE clause comparison, producing a different value than what's displayed in the output fields. This violates the fundamental expectation that a column value assigned to a record should be stable.
Dataset Information
Dataset/Schema Type
- Custom (details below)
Index Mapping
{
"mappings": {
"properties": {
"id": { "type": "integer" },
"value": { "type": "integer" }
}
}
}
Sample Data
{"id": 1, "value": 1}
{"id": 2, "value": 2}
{"id": 3, "value": 3}
{"id": 4, "value": 4}
{"id": 5, "value": 0}
Bug Description
Issue Summary:
Non-deterministic functions (like RAND()
) in eval
expressions are re-evaluated each time the variable is referenced in subsequent PPL commands, rather than being computed once per row and frozen. This causes WHERE clauses to compare against different values than what appears in the output, breaking the fundamental contract that a column value should be stable once assigned to a record.
Steps to Reproduce:
- Create a test index:
PUT /test_rand
with integer fieldsid
andvalue
- Insert test data with values 0-4
- Run:
source=test_rand | eval randomValue=CEIL(RAND() * 5) | where value=randomValue | fields id, value, randomValue
- Observe that returned rows have
value ≠ randomValue
- Get an explain plan:
POST _plugins/_ppl/_explain
with the same query - Note that
RAND()
appears in the physical plan'sEnumerableCalc
alongside the filter condition
Physical Plan Analysis:
Buggy behaviour (without sort):
EnumerableCalc(expr#3=[RAND()], expr#4=[*($t3, $t1)], expr#5=[CEIL($t4)], expr#6=[=($t2, $t5)], ...)
CalciteEnumerableIndexScan(...)
The RAND()
(expr#3) is evaluated within the same EnumerableCalc
that performs the filter comparison (expr#6), causing it to be called multiple times per row - once for the filter evaluation and once for the output projection.
Correct behaviour (with sort before where):
EnumerableCalc(expr#3=[=($t0, $t2)], ..., $condition=[$t3])
EnumerableSort(sort0=[$2])
EnumerableCalc(expr#2=[RAND()], expr#3=[5], expr#4=[*($t2, $t3)], expr#5=[CEIL($t4)], $f2=[$t5])
CalciteEnumerableIndexScan(...)
The RAND()
is evaluated in a separate EnumerableCalc
, then the EnumerableSort
forces materialization, and the subsequent filter uses the already-computed value ($t2).
Impact:
- Cannot reliably use non-deterministic functions in eval expressions that are later referenced in WHERE clauses
- Affects use cases like random sampling, random assignment, or any computation involving
RAND()
- The behaviour is unintuitive and violates the fundamental expectation that eval creates a computed field with a fixed value per row
- Affects any non-deterministic function, not just
RAND()
Environment Information
OpenSearch Version:
OpenSearch 3.3.0-SNAPSHOT
Additional Details:
- Affects PPL queries using the Calcite engine
- The
eval
command documentation does not mention this limitation - RAND is registered via
SqlStdOperatorTable.RAND
in PPLFuncImpTable.java (line 691)
Tentative Root Cause Analysis
This is a preliminary analysis and requires further investigation by the Calcite integration team.
Core Issue
The root cause is in how Apache Calcite's query optimizer handles projections containing non-deterministic functions. When translating PPL to Calcite logical plans and optimizing them to physical plans, Calcite merges the LogicalProject (from eval
) with LogicalFilter (from where
) into a single EnumerableCalc
operation. This optimization is correct for deterministic expressions but incorrect for non-deterministic ones.
Code References
-
RAND registration:
/core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java
(line 691)registerOperator(RAND, SqlStdOperatorTable.RAND);
- Uses Calcite's standard RAND operator
-
Eval translation:
/core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
(lines 813-842)visitEval
method usesprojectPlusOverriding
to add computed fields- Creates a LogicalProject node via
context.relBuilder.projectPlus(newFields)
-
RAND implementation:
/core/src/main/java/org/opensearch/sql/expression/operator/arthmetic/MathematicalFunctions.java
(lines 486-498)impl(() -> new ExprFloatValue(new Random().nextFloat()), FLOAT)
- Creates a new Random instance on each evaluation
Why This Happens
- Projection Merging: Calcite's optimizer applies rules like
ProjectFilterTransposeRule
orFilterCalcMergeRule
that merge adjacent projections and filters for efficiency - Expression Inlining: When merging, Calcite inlines the expression definition rather than treating it as a materialized value
- No Materialization Barrier: Without an operation that forces materialization (like sort, aggregation, or join), the optimizer freely merges operations
- Non-determinism Not Respected: Even though
SqlStdOperatorTable.RAND
should be marked as non-deterministic, the optimizer's merging rules don't check for this property before inlining expressions
Why Sort Works as Workaround
Sort operations require actual materialized values to perform ordering. This creates a barrier that prevents the projection from being merged with subsequent filters, ensuring RAND() is evaluated once and the result is reused.
Tentative Proposed Fix
This is a preliminary analysis and requires further investigation.
Possible Approach
The fix requires preventing Calcite from merging projections containing non-deterministic functions with subsequent operations. Several approaches are possible:
Option 1: Custom Calcite Rule
Create a custom Calcite optimization rule that detects non-deterministic functions in projections and prevents merging:
- Check if a LogicalProject contains calls to non-deterministic functions
- Mark such projections as non-mergeable or add a barrier
- Insert this rule early in the optimization phase
Option 2: Force Materialization After Eval
Modify visitEval
in CalciteRelNodeVisitor.java
to insert a materialization hint or dummy operation after projections containing non-deterministic functions:
- Detect non-deterministic function calls in eval expressions
- Add a Calcite hint or create a barrier operation
- This ensures values are computed once before subsequent operations
Implementation Considerations
- Must verify that
SqlStdOperatorTable.RAND
is properly marked as non-deterministic in Calcite - Need to handle all non-deterministic functions, not just RAND
- Must ensure the fix doesn't negatively impact performance for deterministic expressions
- Requires thorough testing with various query patterns
Workaround
Temporary workaround: Insert a sort
command on the eval-computed field before using it in a where
clause:
source=test_rand
| eval randomValue=CEIL(RAND() * 5)
| sort randomValue
| where value=randomValue
| fields id, value, randomValue
This forces materialization of the random value before the filter is applied, ensuring consistent behaviour.
Alternative workaround (for random sampling): Use top
command:
source=travel
| where match(`Traveler nationality`,'Korea')
| where YEAR(`Start date`)=2023
| eval randomInt=RAND()
| fields `Traveler name`, randomInt
| top 1 randomInt
Metadata
Metadata
Assignees
Labels
Type
Projects
Status