
[Bug]: RetryGuidelineQueryEngine/GuidelineEvaluator causes ~400k token prompts when used with NLSQLTableQueryEngine(synthesize_response=False) #20300

@csttsn

Description


Bug Description

I have a working Text-to-SQL setup using NLSQLTableQueryEngine with synthesize_response=False (intentionally disabled to return raw database rows and avoid extra summarization cost/latency).

This setup worked perfectly and used reasonable token counts (~20k tokens max, coming from my custom text_to_sql_prompt).

Previously working code (minimal relevant part)

from llama_index.core.query_engine import NLSQLTableQueryEngine

# Create the base Text-to-SQL query engine
query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=tables_to_include,
    llm=llm,
    text_to_sql_prompt=prompt_template,
    synthesize_response=False,  # We need raw rows, not a summary
)

On its own, this engine behaved correctly: it generated SQL, executed it, returned the raw results, and token usage stayed within expected bounds.
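For reference, here is where the two pieces of data appear to live on the returned Response object (a quick check, not part of the original setup; the question string is made up for illustration):

response = query_engine.query("How many orders were placed last month?")

print(response.metadata["sql_query"])  # the generated SQL statement
print(response.response[:500])         # raw result rows, one large string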

What changed

I wanted to add safety by validating generated SQL against guidelines (e.g. correct PostgreSQL syntax, avoid dangerous patterns, and so on). I wrapped the engine like this:

from llama_index.core.evaluation import GuidelineEvaluator
from llama_index.core.query_engine import (
    NLSQLTableQueryEngine,
    RetryGuidelineQueryEngine,
)

# Create the base Text-to-SQL query engine
base_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=tables_to_include,
    llm=_llm,
    text_to_sql_prompt=prompt_template,
    synthesize_response=False,
)

# Create the GuidelineEvaluator using our LLM and guidelines from config
# This evaluator will check the generated SQL against rules semantically.
guideline_evaluator = GuidelineEvaluator(
    llm=_llm, 
    guidelines=GUIDELINES 
)

# Wrap the base engine with RetryGuidelineQueryEngine for retries
# This adds error recovery: If evaluation fails, it refines and retries.
retry_query_engine = RetryGuidelineQueryEngine(
    query_engine=base_query_engine,
    guideline_evaluator=guideline_evaluator, 
    resynthesize_query=True,  # Enable query refinement using feedback
    max_retries=1
)

As soon as a retry is triggered (i.e. the evaluator rejects the first SQL), the next LLM call during the rewrite phase receives an extremely large prompt: up to ~400,000 tokens in my logs, far exceeding my 20k-token custom prompt.
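For anyone trying to confirm the numbers, this is roughly how the prompt sizes can be measured (a sketch using LlamaIndex's TokenCountingHandler; the tiktoken encoding and the question string are assumptions, pick whatever matches your model):

import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Register the counter BEFORE constructing the engines so they pick it up.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-4o").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# ... build base_query_engine and retry_query_engine as above ...

retry_query_engine.query("a question whose first SQL fails the guidelines")
print(token_counter.prompt_llm_token_count)  # jumps to ~400k once a retry fires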

What appears to be happening (but I'm not 100% sure)

When synthesize_response=False:

  • The Response object returned by NLSQLTableQueryEngine contains the entire raw result set (thousands of rows) as a large string in response.response.
  • GuidelineEvaluator seems to send response.response to the LLM for evaluation.
  • Since my guideline is about SQL syntax, I would expect only the generated SQL query to be evaluated.
  • Instead, it looks like the evaluator receives the full database result set (CSV-like text of every returned row), causing massive token usage and immediately hitting rate limits (see the sketch after this list).
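A sketch that illustrates the suspected problem (it assumes the engines defined above; the question string is made up). The string handed to the evaluator is orders of magnitude larger than the SQL alone:

response = base_query_engine.query("list all orders")

full_text = response.response              # every returned row, as text
sql_only = response.metadata["sql_query"]  # just the generated SQL

print(len(full_text), len(sql_only))  # e.g. hundreds of thousands vs. hundreds of chars

# As far as I can tell, this is effectively what RetryGuidelineQueryEngine
# hands to the evaluator on a retry:
result = guideline_evaluator.evaluate(query="list all orders", response=full_text)
print(result.passing, result.feedback)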

Expected behavior

When a guideline is about the SQL query itself (syntax, safety, etc.), the evaluator should inspect only the generated SQL (available in response.metadata["sql_query"] or response.source_nodes[0].text), not the executed result data.

Or maybe I'm doing something wrong; in that case, I'd really appreciate guidance on how to validate the SQL queries generated by NLSQLTableQueryEngine.
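In the meantime, the workaround I'm considering (a sketch only; the function name and retry strategy are mine, not a library API) is to skip RetryGuidelineQueryEngine and run the evaluate/retry loop manually against the SQL string:

def query_with_sql_validation(question: str, max_retries: int = 1):
    """Run the Text-to-SQL engine, but validate only the generated SQL."""
    current_question = question
    for _ in range(max_retries + 1):
        response = base_query_engine.query(current_question)
        sql = response.metadata["sql_query"]

        # Evaluate just the SQL against the guidelines: a short string,
        # so token usage stays in the same range as the base engine.
        eval_result = guideline_evaluator.evaluate(
            query=current_question, response=sql
        )
        if eval_result.passing:
            return response

        # Fold the evaluator's feedback into a refined question and retry.
        current_question = (
            f"{question}\n\n"
            f"A previous attempt produced this SQL, which was rejected:\n{sql}\n"
            f"Feedback: {eval_result.feedback}"
        )
    return response  # out of retries; return the last attempt as-is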

Version

0.12.49

Steps to Reproduce

Minimal reproduction (copied from bug description)

from llama_index.core.evaluation import GuidelineEvaluator
from llama_index.core.query_engine import (
    NLSQLTableQueryEngine,
    RetryGuidelineQueryEngine,
)

# Create the base Text-to-SQL query engine
base_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=tables_to_include,
    llm=_llm,
    text_to_sql_prompt=prompt_template,
    synthesize_response=False,
)

# Create the GuidelineEvaluator using our LLM and guidelines from config
# This evaluator will check the generated SQL against rules semantically.
guideline_evaluator = GuidelineEvaluator(
    llm=_llm, 
    guidelines=GUIDELINES 
)

# Wrap the base engine with RetryGuidelineQueryEngine for retries
# This adds error recovery: If evaluation fails, it refines and retries.
retry_query_engine = RetryGuidelineQueryEngine(
    query_engine=base_query_engine,
    guideline_evaluator=guideline_evaluator,
    resynthesize_query=True,  # Enable query refinement using feedback
    max_retries=1
)

In short: adding GuidelineEvaluator and RetryGuidelineQueryEngine causes extremely large prompts as soon as RetryGuidelineQueryEngine takes the result from base_query_engine and rewrites the query.

Relevant Logs/Tracebacks
