[BUG] PPL eval with non-deterministic functions (e.g. `RAND`) re-evaluates on each reference

## Query Information
**PPL Command/Query:**
```ppl
# Failing query - demonstrates the bug
source=test_rand | eval randomValue=CEIL(RAND() * 5) | where value=randomValue | fields id, value, randomValue

# Original user query
source=travel
| regex `Traveler nationality`='Korea'
| where YEAR(`Start date`)=2023
| eventstats COUNT() as TravelerCount
| eval randomInt=CEIL(RAND() * TravelerCount), EnddateDays=TO_DAYS(`End date`), Modulo=EnddateDays % TravelerCount
| where Modulo=randomInt
```

**Expected Result:**
The `WHERE value=randomValue` clause should filter rows where the `value` field equals the `randomValue` computed by the `eval` command. Once `eval` computes `randomValue` for a row, that value should remain stable for that row across all subsequent operations. Only rows where `value` truly equals `randomValue` should be returned.

**Actual Result:**
The query returns rows where `value` does NOT equal `randomValue`. Example output:
```
id=4:  value=4, randomValue=2  (4 ≠ 2, should be filtered out)
id=13: value=3, randomValue=4  (3 ≠ 4, should be filtered out)
id=14: value=4, randomValue=2  (4 ≠ 2, should be filtered out)
```

The `RAND()` function is being re-evaluated during the WHERE clause comparison, producing a different value than what's displayed in the output fields. This violates the fundamental expectation that a column value assigned to a record should be stable.

## Dataset Information
**Dataset/Schema Type**
- [x] Custom (details below)

**Index Mapping**
```json
{
  "mappings": {
    "properties": {
      "id": { "type": "integer" },
      "value": { "type": "integer" }
    }
  }
}
```

**Sample Data**
```json
{"id": 1, "value": 1}
{"id": 2, "value": 2}
{"id": 3, "value": 3}
{"id": 4, "value": 4}
{"id": 5, "value": 0}
```

## Bug Description
**Issue Summary:**
Non-deterministic functions (like `RAND()`) in `eval` expressions are re-evaluated each time the variable is referenced in subsequent PPL commands, rather than being computed once per row and frozen. This causes WHERE clauses to compare against different values than what appears in the output, breaking the fundamental contract that a column value should be stable once assigned to a record.

**Steps to Reproduce:**
1. Create a test index: `PUT /test_rand` with integer fields `id` and `value`
2. Insert test data with values 0-4
3. Run: `source=test_rand | eval randomValue=CEIL(RAND() * 5) | where value=randomValue | fields id, value, randomValue`
4. Observe that returned rows have `value ≠ randomValue`
5. Get an explain plan: `POST _plugins/_ppl/_explain` with the same query
6. Note that `RAND()` appears in the physical plan's `EnumerableCalc` alongside the filter condition

**Physical Plan Analysis:**

*Buggy behaviour (without sort):*
```
EnumerableCalc(expr#3=[RAND()], expr#4=[*($t3, $t1)], expr#5=[CEIL($t4)], expr#6=[=($t2, $t5)], ...)
  CalciteEnumerableIndexScan(...)
```
The `RAND()` (expr#3) is evaluated within the same `EnumerableCalc` that performs the filter comparison (expr#6), causing it to be called multiple times per row - once for the filter evaluation and once for the output projection.

*Correct behaviour (with sort before where):*
```
EnumerableCalc(expr#3=[=($t0, $t2)], ..., $condition=[$t3])
  EnumerableSort(sort0=[$2])
    EnumerableCalc(expr#2=[RAND()], expr#3=[5], expr#4=[*($t2, $t3)], expr#5=[CEIL($t4)], $f2=[$t5])
      CalciteEnumerableIndexScan(...)
```
The `RAND()` is evaluated in a separate `EnumerableCalc`, then the `EnumerableSort` forces materialization, and the subsequent filter uses the already-computed value ($t2).

**Impact:**
- Cannot reliably use non-deterministic functions in eval expressions that are later referenced in WHERE clauses
- Affects use cases like random sampling, random assignment, or any computation involving `RAND()`
- The behaviour is unintuitive and violates the fundamental expectation that eval creates a computed field with a fixed value per row
- Affects any non-deterministic function, not just `RAND()`

## Environment Information
**OpenSearch Version:**
OpenSearch 3.3.0-SNAPSHOT

**Additional Details:**
- Affects PPL queries using the Calcite engine
- The `eval` command documentation does not mention this limitation
- RAND is registered via `SqlStdOperatorTable.RAND` in PPLFuncImpTable.java (line 691)

---

## Tentative Root Cause Analysis

**This is a preliminary analysis and requires further investigation by the Calcite integration team.**

### Core Issue
The root cause is in how Apache Calcite's query optimizer handles projections containing non-deterministic functions. When translating PPL to Calcite logical plans and optimizing them to physical plans, Calcite merges the LogicalProject (from `eval`) with LogicalFilter (from `where`) into a single `EnumerableCalc` operation. This optimization is correct for deterministic expressions but incorrect for non-deterministic ones.

### Code References
1. **RAND registration**: `/core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java` (line 691)
   - `registerOperator(RAND, SqlStdOperatorTable.RAND);`
   - Uses Calcite's standard RAND operator

2. **Eval translation**: `/core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java` (lines 813-842)
   - `visitEval` method uses `projectPlusOverriding` to add computed fields
   - Creates a LogicalProject node via `context.relBuilder.projectPlus(newFields)`

3. **RAND implementation**: `/core/src/main/java/org/opensearch/sql/expression/operator/arthmetic/MathematicalFunctions.java` (lines 486-498)
   - `impl(() -> new ExprFloatValue(new Random().nextFloat()), FLOAT)`
   - Creates a new Random instance on each evaluation

### Why This Happens
1. **Projection Merging**: Calcite's optimizer applies rules like `ProjectFilterTransposeRule` or `FilterCalcMergeRule` that merge adjacent projections and filters for efficiency
2. **Expression Inlining**: When merging, Calcite inlines the expression definition rather than treating it as a materialized value
3. **No Materialization Barrier**: Without an operation that forces materialization (like sort, aggregation, or join), the optimizer freely merges operations
4. **Non-determinism Not Respected**: Even though `SqlStdOperatorTable.RAND` should be marked as non-deterministic, the optimizer's merging rules don't check for this property before inlining expressions

### Why Sort Works as Workaround
Sort operations require actual materialized values to perform ordering. This creates a barrier that prevents the projection from being merged with subsequent filters, ensuring RAND() is evaluated once and the result is reused.

---

## Tentative Proposed Fix

**This is a preliminary analysis and requires further investigation.**

### Possible Approach
The fix requires preventing Calcite from merging projections containing non-deterministic functions with subsequent operations. Several approaches are possible:

**Option 1: Custom Calcite Rule**
Create a custom Calcite optimization rule that detects non-deterministic functions in projections and prevents merging:
- Check if a LogicalProject contains calls to non-deterministic functions
- Mark such projections as non-mergeable or add a barrier
- Insert this rule early in the optimization phase

**Option 2: Force Materialization After Eval**
Modify `visitEval` in `CalciteRelNodeVisitor.java` to insert a materialization hint or dummy operation after projections containing non-deterministic functions:
- Detect non-deterministic function calls in eval expressions
- Add a Calcite hint or create a barrier operation
- This ensures values are computed once before subsequent operations

### Implementation Considerations
- Must verify that `SqlStdOperatorTable.RAND` is properly marked as non-deterministic in Calcite
- Need to handle all non-deterministic functions, not just RAND
- Must ensure the fix doesn't negatively impact performance for deterministic expressions
- Requires thorough testing with various query patterns

---

## Workaround

**Temporary workaround**: Insert a `sort` command on the eval-computed field before using it in a `where` clause:

```ppl
source=test_rand 
| eval randomValue=CEIL(RAND() * 5) 
| sort randomValue 
| where value=randomValue 
| fields id, value, randomValue
```

This forces materialization of the random value before the filter is applied, ensuring consistent behaviour.

**Alternative workaround** (for random sampling): Use `top` command:
```ppl
source=travel
| where match(`Traveler nationality`,'Korea')
| where YEAR(`Start date`)=2023
| eval randomInt=RAND()
| fields `Traveler name`, randomInt
| top 1 randomInt
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] PPL eval with non-deterministic functions (e.g. `RAND`) re-evaluates on each reference #4507

Query Information

Dataset Information

Bug Description

Environment Information

Tentative Root Cause Analysis

Core Issue

Code References

Why This Happens

Why Sort Works as Workaround

Tentative Proposed Fix

Possible Approach

Implementation Considerations

Workaround

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] PPL eval with non-deterministic functions (e.g. RAND) re-evaluates on each reference #4507

Description

Query Information

Dataset Information

Bug Description

Environment Information

Tentative Root Cause Analysis

Core Issue

Code References

Why This Happens

Why Sort Works as Workaround

Tentative Proposed Fix

Possible Approach

Implementation Considerations

Workaround

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[BUG] PPL eval with non-deterministic functions (e.g. `RAND`) re-evaluates on each reference #4507