Skip to content

[BUG] PPL filter on computed fields from nested paths returns zero results #4508

@alexey-temnikov

Description

@alexey-temnikov

Query Information

PPL Command/Query:

# This works - returns 3 results with computed NameLen field
source=test-nested-eval-filter | eval NameLen=LENGTH(items.name) | fields id, items.name, NameLen

# This fails - returns 0 results (should return 2: banana and orange with length > 5)
source=test-nested-eval-filter | eval NameLen=LENGTH(items.name) | fields id, items.name, NameLen | where NameLen > 5

Expected Result:
The query should return 2 documents where the computed NameLen field (length of items.name) is greater than 5:

  • order2 with banana (length 6)
  • order3 with orange (length 6)

Actual Result:
The query returns 0 results:

{
  "schema": [...],
  "datarows": [],
  "total": 0,
  "size": 0
}

Dataset Information

Dataset/Schema Type

  • OpenTelemetry (OTEL)
  • Simple Schema for Observability (SS4O)
  • Open Cybersecurity Schema Framework (OCSF)
  • Custom (details below)

Index Mapping

{
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "items": {
        "type": "nested",
        "properties": {
          "name": {"type": "keyword"}
        }
      }
    }
  }
}

Sample Data

{"id":"order1","items":[{"name":"apple"}]}
{"id":"order2","items":[{"name":"banana"}]}
{"id":"order3","items":[{"name":"orange"}]}

Bug Description

Issue Summary:
Filtering on computed fields (created with eval) that reference nested field paths returns zero results, even though the computed field is correctly calculated when not filtered. The issue is specific to nested fields - the same pattern works correctly with regular (non-nested) fields.

Steps to Reproduce:

  1. Create an index with a nested field mapping
  2. Insert documents with nested objects
  3. Run eval to create a computed field based on a nested field path (e.g., LENGTH(items.name))
  4. Add a where clause to filter on the computed field
  5. Observe that the query returns 0 results instead of the expected filtered results

Minimal Reproduction:

# Create test index
curl -X PUT "localhost:9200/test-nested-eval-filter" -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "items": {
        "type": "nested",
        "properties": {
          "name": {"type": "keyword"}
        }
      }
    }
  }
}'

# Insert test data
curl -X POST "localhost:9200/test-nested-eval-filter/_bulk" -H 'Content-Type: application/json' -d '
{"index":{"_id":"1"}}
{"id":"order1","items":[{"name":"apple"}]}
{"index":{"_id":"2"}}
{"id":"order2","items":[{"name":"banana"}]}
{"index":{"_id":"3"}}
{"id":"order3","items":[{"name":"orange"}]}
'

curl -X POST "localhost:9200/test-nested-eval-filter/_refresh"

# This works - returns 3 results
curl -X POST "localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
  "query": "source=test-nested-eval-filter | eval NameLen=LENGTH(items.name) | fields id, items.name, NameLen"
}'

# This fails - returns 0 results (should return 2)
curl -X POST "localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
  "query": "source=test-nested-eval-filter | eval NameLen=LENGTH(items.name) | fields id, items.name, NameLen | where NameLen > 5"
}'

Verification with Regular Fields:
The same pattern works correctly with regular (non-nested) fields:

# Create index with regular field
curl -X PUT "localhost:9200/test-regular-eval-filter" -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "name": {"type": "keyword"}
    }
  }
}'

# Insert data
curl -X POST "localhost:9200/test-regular-eval-filter/_bulk" -H 'Content-Type: application/json' -d '
{"index":{"_id":"1"}}
{"id":"order1","name":"apple"}
{"index":{"_id":"2"}}
{"id":"order2","name":"banana"}
{"index":{"_id":"3"}}
{"id":"order3","name":"orange"}
'

curl -X POST "localhost:9200/test-regular-eval-filter/_refresh"

# This works correctly - returns 2 results
curl -X POST "localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d '{
  "query": "source=test-regular-eval-filter | eval NameLen=LENGTH(name) | fields id, name, NameLen | where NameLen > 5"
}'

Impact:
This bug prevents users from filtering on computed fields that are based on nested field values, which is a common use case in observability data (OTEL, Jaeger) and other structured data with nested objects.

Environment Information

OpenSearch Version:
OpenSearch 3.3.0-SNAPSHOT

Additional Details:

  • Affects PPL queries using Calcite engine (V3)
  • The issue is specific to nested fields; regular fields work correctly
  • The eval command successfully computes values from nested fields when no filter is applied

Tentative Root Cause Analysis

This is a preliminary analysis and requires further investigation.

Core Issue

When a where clause filters on a computed field that references a nested field, the filter is pushed down to OpenSearch as a script query. The script attempts to access the nested field value using doc values (via docProvider.get()), but nested fields in OpenSearch do not have doc values. This causes the script to return null for all documents, making the filter evaluate to false for every document.

Code References

  1. Script Execution: /opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/CalciteScriptEngine.java (lines 226-275)
    • ScriptDataContext.get(String name) method (line 263)
    • Attempts to retrieve field values from docProvider.get(name)
    • Returns null when doc values are empty or missing (line 267)
@Override
public Object get(String name) {
  // ...
  ScriptDocValues<?> docValue = this.docProvider.get(name);
  if (docValue == null || docValue.isEmpty()) {
    return null; // No way to differentiate null and missing from doc value
  }
  // ...
}
  1. Alternative Access Method: The same class has a getFromSource(String name) method that could access nested fields from _source, but it's not used by the generated Calcite code.

Why This Happens

  1. Doc Values Limitation: Nested fields in OpenSearch are stored differently and don't have doc values accessible via the standard doc[] API
  2. Script Generation: The Calcite-generated code always uses the DataContext.get() method, which relies on doc values
  3. No Nested Field Detection: The code doesn't detect when a field path references a nested field and switch to _source access
  4. Silent Failure: When doc values are unavailable, the method returns null rather than falling back to _source access

Direct OpenSearch Query Verification

Testing the underlying OpenSearch script query directly confirms the issue:

curl -X POST "localhost:9200/test-nested-eval-filter/_search" -H 'Content-Type: application/json' -d '{
  "query": {
    "script": {
      "script": {
        "source": "doc[\"items.name\"].value.length() > 5",
        "lang": "painless"
      }
    }
  }
}'

Returns error:

"caused_by": {
  "type": "illegal_state_exception",
  "reason": "A document doesn't have a value for a field! Use doc[<field>].size()==0 to check if a document is missing a field!"
}

Tentative Proposed Fix

This is a preliminary analysis and requires further investigation.

Possible Approach (Tentative - Needs Investiagtion)

The fix requires modifying how nested fields are accessed in filter scripts. Several approaches are possible:

Option 1: Fallback to Source Access
Modify ScriptDataContext.get() to detect nested fields and fall back to _source access:

  1. Detection: Check if the field path corresponds to a nested field in the index mapping
  2. Fallback: If doc values are unavailable and the field is nested, use sourceLookup.get(name) instead
  3. Implementation Location: /opensearch/src/main/java/org/opensearch/sql/opensearch/storage/script/CalciteScriptEngine.java (ScriptDataContext class)
@Override
public Object get(String name) {
  // ... existing UTC_TIMESTAMP handling ...
  
  ScriptDocValues<?> docValue = this.docProvider.get(name);
  if (docValue == null || docValue.isEmpty()) {
    // Try to get from source for nested fields
    Object sourceValue = this.sourceLookup.get(name);
    if (sourceValue != null) {
      return sourceValue;
    }
    return null;
  }
  // ... existing value processing ...
}

Option 2: Nested Field Detection in Code Generation
Modify the Calcite code generation to use getFromSource() for nested fields:

  • Detect nested fields during query planning
  • Generate code that calls getFromSource() instead of get() for nested field paths
  • More complex but potentially more efficient

Option 3: Nested Query Wrapper
Instead of using script queries for nested fields, generate proper nested queries:

  • Detect when filter expressions reference nested fields
  • Generate OpenSearch nested query DSL instead of script queries
  • Most efficient but requires significant refactoring

Implementation Considerations

  • Performance: _source access is slower than doc values, but it's the only option for nested fields
  • Mapping Access: Need access to index mapping to detect nested fields
  • Array Handling: Nested fields are arrays; need to handle multiple values correctly
  • Backward Compatibility: Ensure the fix doesn't break existing functionality for regular fields

Related Issues


Workaround

No easy workaround available. The issue is fundamental to how nested fields are accessed in filter scripts. Possible alternatives:

  1. Restructure Data: Flatten nested structures into regular fields (not always feasible)
  2. Pre-filter Before Eval: Apply filters before computing fields (doesn't work when filter depends on computed value)
  3. Use Regular Fields: If possible, use non-nested field mappings

None of these are practical workarounds for most use cases involving nested data structures.

Metadata

Metadata

Assignees

No one assigned

    Labels

    PPLPiped processing languagebugSomething isn't workingpushdownpushdown related issues

    Type

    No type

    Projects

    Status

    Not Started

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions