Background
PPL execution can occur in PPL paragraphs and in search index tools via the execution agent. The query, whether it is a user's custom input or generated by the agent, can potentially return hundreds or thousands of results. Using our SS4O logs as an example, even a return of 100 results could consume around 50k tokens if we feed the raw query results directly to the LLM.
Additionally, for multi-step analysis, the presence of raw data in the notebook context means we may send this amount of data to the LLM every time we trigger a paragraph or initiate a new round of investigation. This is not sustainable if we keep using raw query data, as we could easily reach the 200k input token limit per minute and experience throttling. Excessive tokens can also significantly increase response time and reduce LLM accuracy.
Possible Solutions
All of these approaches share the same goal: truncate or otherwise reduce the amount of data sent to the LLM as input. There are several possible ways to do this.
Data Formatting
The most straightforward method is to format our raw query results from JSON object strings to TSV (tab-separated values) or CSV (comma-separated values) by:
- Eliminating the JSON structural overhead (`{`, `}`, `[`, `]`, `"`, `:`)
- Removing repetitive field names from every record
Formatting from JSON to TSV was tested with sample flight data for 100 query results; the token count was reduced by about 50%.
(JSON and TSV samples of the flight data omitted.)
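A minimal sketch of the conversion, assuming the query results arrive as an array of flat JSON records and that every record shares the same top-level fields:

```typescript
// Sketch: convert an array of flat JSON records to TSV.
// Field names are emitted once as a header row instead of repeating per record.
function toTSV(records: Array<Record<string, unknown>>): string {
  if (records.length === 0) return '';
  const fields = Object.keys(records[0]);
  const header = fields.join('\t');
  const rows = records.map((record) =>
    fields.map((field) => String(record[field] ?? '')).join('\t')
  );
  return [header, ...rows].join('\n');
}
```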
Another formatting attempt was applied to SS4O log data. Due to the presence of nested JSON objects in fields such as log and resource, the token reduction is smaller, at about 20%. We may need to go further and flatten the nested JSON structure (a sketch follows the samples below).
(JSON and TSV samples of the SS4O log data omitted.)
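For the nested log and resource objects, one possible approach is to flatten them into dot-separated columns before the TSV conversion (the column-naming scheme here is an assumption for illustration):

```typescript
// Sketch: recursively flatten nested objects into dot-separated columns,
// e.g. { resource: { host: { name: 'a' } } } -> { 'resource.host.name': 'a' }.
function flatten(
  record: Record<string, unknown>,
  prefix = ''
): Record<string, unknown> {
  const flat: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    const column = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      Object.assign(flat, flatten(value as Record<string, unknown>, column));
    } else {
      flat[column] = value;
    }
  }
  return flat;
}

// Combined with the toTSV sketch above: toTSV(results.map((r) => flatten(r)))
```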
However, the reduction does not seem to be enough if we have a large amount of data. Imagine we have 200 records of SS4O log data; even if the token count is reduced from 100k to 50k, an input of 50k tokens is still not acceptable given the token limit of 200k tokens per minute.
Data Sampling
A possible way to cap the number of records at a fixed amount, regardless of the original result size, is to apply sampling to the query results. We can reduce the number of records by:
- Before retrieving the actual query result, execute a count query first:
source = ... | filter ... | stats count()
- Depending on the result count received, sample the data records to a fixed amount:
Sample the first 100 logs:
source = ... | sort - _id | head 100
Sample randomly:
source = ... | eval random_score=rand() | where random_score > 0.9 | head 100
Alternatively, we can also achieve sampling in JS memory by:
- Executing the actual query and receiving the result
- Performing the count in memory
- Sampling the data based on the length of a single record
In this way, we can achieve sampling based on the data size of individual records, which appears to be slightly more precise for reducing the token count.
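A minimal sketch of this in-memory approach, assuming a rough heuristic of about four characters per token and an illustrative token budget (both are assumptions, not measured values):

```typescript
// Sketch: sample query results in JS memory so the serialized output stays
// within an assumed token budget, using ~4 characters per token as a heuristic.
function sampleByTokenBudget<T>(records: T[], tokenBudget = 10_000): T[] {
  if (records.length === 0) return records;
  // Estimate tokens per record from the serialized length of the first record.
  const tokensPerRecord = Math.ceil(JSON.stringify(records[0]).length / 4);
  const maxRecords = Math.max(1, Math.floor(tokenBudget / tokensPerRecord));
  if (records.length <= maxRecords) return records;
  // Keep each record with probability maxRecords / records.length (random sampling).
  const rate = maxRecords / records.length;
  return records.filter(() => Math.random() < rate);
}
```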
Random sampling is intuitively more meaningful, as we assume a random distribution in the original data and it ensures we capture a statistically representative subset. We can implement conditional logic for sampling, for example (see the sketch after this list):
- If the count is below 100, no sampling is required
- If the count is 200, sample 50% to retrieve 100 records
- If the count is 1000, sample 10% to still retrieve 100 records
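Based on the count returned by the stats count() query, the sampling clause could be built along these lines; the target of 100 records and the rand() threshold mirror the examples above:

```typescript
// Sketch: given the result of `... | stats count()`, build a PPL suffix that
// samples roughly `target` records, mirroring the rand() example above.
function buildSamplingClause(count: number, target = 100): string {
  if (count <= target) {
    return ''; // below the target, no sampling required
  }
  // e.g. count = 1000, target = 100 -> keep ~10% -> random_score > 0.90
  const threshold = (1 - target / count).toFixed(2);
  return ` | eval random_score=rand() | where random_score > ${threshold} | head ${target}`;
}
```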
The downside of sampling is always the potential information loss, as we may filter out records that contain important information. In a payment failure scenario, if payment failure logs are occurring at large scale, we can assume the sampling would easily pick payment failure logs as representative results. However, if error logs are rare within a very large result set, sampling may easily exclude those critical insights from the original data. Ultimately, sampling sacrifices some data integrity whenever we cannot process the full dataset.
LLM Summary
Another valid solution is using an LLM to summarize the query results: we send the raw query results to the LLM only once, and use the resulting summary as context for the following paragraphs to continue the investigation. This method can reduce the input to roughly 1-5% of the original token count, as we no longer use the raw data but rather a structured summary of the original query results with examples. With the correct prompt, the LLM appears to be effective at identifying notable information, anomalies, and variations in the original dataset.
Since we are using the query results as context to provide insights for the following analytic steps in a notebook, theoretically the use of summaries instead of raw data should not cause significant information loss or reduce the final result's quality. An unavoidable concern would still be the input token limit when we have a large amount of raw data, and the solution could be to apply sampling before we trigger the LLM summary.
Example prompt for the LLM to perform a summary on the original data:
## PPL Query Results Summarization
## Context
You're a data analyst tasked with creating ultra-condensed summaries of PPL query results to maximize processing efficiency while preserving essential analytical value. Your expertise is needed to intelligently distill large datasets down to their most critical components.
## Core Objectives
- Reduce data volume by 80-95% while preserving the fundamental analytical insights
- Extract and retain only the most statistically significant and business-critical data points
- Create an extremely concise output optimized for downstream analytics
- Eliminate all non-essential fields and redundancies
## Important User Input Consideration
**CRITICAL**: The user's natural language request (${payload}) must be carefully analyzed to understand their specific summarization needs. Pay close attention to:
- What aspects of the data the user considers most important
- Any specific metrics, patterns, or anomalies they want highlighted
- The intended use case for the summary
- Any particular level of detail or format they require
Tailor your summarization approach based on these specific requirements rather than using a one-size-fits-all approach.
## Summarization Approach
1. **Aggressive information condensation**: Drastically summarize and aggregate data points, preserving only the most business-critical analytical value
2. **Minimal representative examples**: Include only 2-3 of the most illustrative examples that:
- Best represent the overall dataset patterns
- Highlight notable anomalies, errors, or critical business events
- Demonstrate boundary conditions
- **IMPORTANT**: These examples should clearly demonstrate key insights, trends, or outliers that would be otherwise lost in pure statistics
3. **Maximum field reduction**: Keep only absolute minimum fields needed for analysis:
- Primary keys and identifiers
- Essential metrics and KPIs
- Critical timestamps for temporal context
4. **Statistical representation**: Replace raw data with statistical summaries (mean, median, range, distribution patterns) whenever possible
5. **Anomaly highlighting**: Ensure any statistically significant outliers or business-critical edge cases are preserved
## Output Requirements
Deliver an extremely concise summary focusing exclusively on the most critical analytical insights. The output must:
- Provide key statistical measures of central tendency and variation
- Include representative examples, prioritizing anomalous or noteworthy records
- Highlight any data quality issues or patterns requiring attention
- Maintain just enough context for informed decision-making
## QUERY RESULTS TO SUMMARIZE:
Example response from the LLM (~500 tokens):
# Log Analysis Summary
## Overview
The dataset contains telemetry logs from a microservices e-commerce application spanning multiple services (checkout, shipping, product-catalog, cart, kafka, frontend-proxy, ad, load-generator). Logs collected between 2025-09-09 06:48:39 and 06:58:35.
## Key Statistics
- **Service distribution**: 9 distinct services (checkout, shipping, cart, kafka, product-catalog, frontend-proxy, ad, load-generator, quote)
- **Severity distribution**: 98% INFO logs, 2% ERROR logs
- **Time range**: ~10 minutes of operation
## Error Detection
**One critical error found**:
could not charge the card: rpc error: code = Unknown desc = Payment request failed. Invalid token. app.loyalty.level=gold
- Timestamp: 2025-09-09 06:58:35
- Service: checkout
- TraceID: 2927fc7311fc85d495344a4738fd3110
- Root cause: Payment processing failure due to invalid token for a gold loyalty level customer
## Representative Events
1. **Checkout workflow example** (TraceID: 2927fc7311fc85d495344a4738fd3110):
- Started checkout process for user_id="67ad40b8-8d4a-11f0-99ab-0242f00b011b"
- Prepared order for 20 items of product "Solar System Color Imager"
- Retrieved shipping quote
- Attempted card charge of $5,062 USD
- Failed with payment token error
2. **Shipping quote calculation** (TraceID: 36805c7f424de6fd146312249abc076e):
- Processed request with multiple cart items and zip code 94103
- Successfully calculated quote of $1,104.59
3. **Product catalog operations**:
- Regular catalog reloading (occurring approximately every 5 minutes)
- Multiple product lookups by ID with consistent "Product Found" responses
## System Health Indicators
- **Kafka operations**: Regular log segment rotation and snapshot management
- **Frontend proxy**: Consistent HTTP 200 responses (with one 500 error corresponding to the payment failure)
- **Load generator**: Continuous simulation of user browsing and cart interactions
- **Ad service**: Processing targeted ad requests across different categories
## Conclusion
The system is generally functioning normally with a single payment processing error detected. The error appears related to an invalid payment token specifically for a gold loyalty customer, suggesting potential integration issues with the payment or loyalty systems.
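A rough sketch of how the prompt above might be filled in and sent, assuming a hypothetical callLLM helper standing in for whatever LLM invocation the execution agent actually uses:

```typescript
// SUMMARIZATION_PROMPT stands for the full prompt text shown above, with the
// `${payload}` placeholder left literal; callLLM is a hypothetical helper.
declare const SUMMARIZATION_PROMPT: string;
declare function callLLM(prompt: string): Promise<string>;

async function summarizeQueryResults(
  userRequest: string,
  formattedResults: string // e.g. the TSV output from the formatting step
): Promise<string> {
  const prompt =
    SUMMARIZATION_PROMPT.replace('${payload}', userRequest) + '\n' + formattedResults;
  // Only the returned summary, not the raw results, is kept as notebook context.
  return callLLM(prompt);
}
```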
Data Distribution, Log Pattern and Log Sequence
Statistical and algorithmic tools, such as the data distribution, log pattern, and log sequence analyses we have already implemented in the notebook, could also be applied to the raw data to produce summarized results. The primary benefit is that these analyses can almost always be applied to the full raw dataset.
However, a limitation is that these existing log analysis tools can only be applied to log-related data inferred by log insight, not to other data types. Even for log-type data, the analysis only covers the information in the log field. There could also be other constraints, as these tools may not always be available or may not produce optimal insights from the original raw data; each tool we have may require further examination.
Implementation
The actual implementation to reduce the amount of data in the most optimized way could be a conditional combination of the above solutions, depending on how much data we actually have, for example (see the sketch after this list):
- Below 20 records: use raw data
- Between 20 and 100 records: LLM summary
- Log-related data:
  - Between 100 and 500 records: log analysis, then LLM summary
  - Above 500 records: sample the data, apply log analysis, then LLM summary
- Non-log-related data:
  - Above 100 records: LLM summary
The exact boundaries would require more investigation of the actual outcomes within a full notebook investigation.
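A minimal sketch of this dispatch logic, using the provisional boundaries above (strategy names are illustrative, not existing APIs):

```typescript
// Sketch: choose a reduction strategy from the (provisional) record-count boundaries.
type ReductionStrategy =
  | 'raw'
  | 'llm-summary'
  | 'log-analysis-then-summary'
  | 'sample-log-analysis-then-summary';

function chooseStrategy(recordCount: number, isLogData: boolean): ReductionStrategy {
  if (recordCount < 20) return 'raw';
  if (recordCount <= 100) return 'llm-summary';
  if (!isLogData) return 'llm-summary';
  if (recordCount <= 500) return 'log-analysis-then-summary';
  return 'sample-log-analysis-then-summary';
}
```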
Considerations
The most effective approach to reduce PPL query result token size is encouraging users to narrow results through aggregation or more restrictive time filters. For instance, when a user requests data anomalies within an hour but receives 1,000 records, we could prompt them to iteratively reduce the time range to five minutes through warning messages, significantly decreasing result volume. However, this approach cannot be guaranteed and may create overly complex user experiences. Therefore, when we cannot address the root cause of large data volumes, employing methods that sacrifice some data integrity while preserving key information remains a viable solution, given LLMs' hard token input limits.
Still, the PPL query results in the notebook serve primarily as context for subsequent paragraphs. While this context may not be a complete representation of the original data, it can still provide highlights that help agents determine their next steps.