Skip to content

DX-118394: Add configurable row and byte limits to RunSqlQuery#97

Open
aniket-s-kulkarni wants to merge 1 commit into
mainfrom
DX-118394-runsqlquery-result-limits
Open

DX-118394: Add configurable row and byte limits to RunSqlQuery#97
aniket-s-kulkarni wants to merge 1 commit into
mainfrom
DX-118394-runsqlquery-result-limits

Conversation

@aniket-s-kulkarni

Copy link
Copy Markdown
Contributor

Summary

  • Adds max_result_rows (default 500) and max_result_bytes (default 200 KB) fields to Dremio settings model, configurable via YAML and env vars (DREMIOAI_DREMIO__MAX_RESULT_ROWS, DREMIOAI_DREMIO__MAX_RESULT_BYTES)
  • Adds run_query_capped() to sql.py — fetches at most max_result_rows rows from Dremio (0 = unlimited), all existing callers of run_query/get_results are unaffected
  • Updates RunSqlQuery.invoke() to use run_query_capped(), enforce a byte cap via per-row JSON counting, and return structured truncation metadata when limits fire; response is unchanged when no truncation occurs

Test plan

  • Unit: test_run_query_capped_under_limit — no truncation, all rows returned
  • Unit: test_run_query_capped_row_limit_hittruncated=True, truncation_reason="row_limit"
  • Unit: test_run_query_capped_unlimitedmax_rows=0 fetches all rows
  • Unit: test_run_query_capped_empty_result — empty DataFrame, no truncation
  • Unit: test_run_query_capped_exact_boundaryrow_count == max_rows, no truncation
  • Unit: test_invoke_no_truncation — response has only "result" key (backward compat)
  • Unit: test_invoke_row_limit_hittruncation_reason="row_limit"
  • Unit: test_invoke_byte_limit_hittruncation_reason="byte_limit", fewer records returned
  • Unit: test_invoke_both_limits_zero — no truncation, full result
  • Unit: test_invoke_settings_override — correct max_rows passed from settings
  • All 316 existing tests pass (uv run pytest tests/ -q)

JIRA

https://dremio.atlassian.net/browse/DX-118394

Reviewer Verdict

APPROVE (after trivial fixes: itertools.chain bug in run_query_capped, MCP return type annotation, and updated mock targets in 2 existing tests)

🤖 Generated with Claude Code

Add max_result_rows (default 500) and max_result_bytes (default 200 KB)
settings to Dremio model. Add run_query_capped() to sql.py that fetches
at most max_result_rows rows. Update RunSqlQuery.invoke() to use
run_query_capped(), enforce byte cap, and return structured truncation
metadata (truncated, total_rows, returned_rows, truncation_reason) when
limits are hit. Both limits can be disabled by setting to 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@ssaumitra ssaumitra left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functionality already exists in the Dremio query engine and REST API. I request to use that instead.

If we use limit parameter in the Job Results API (https://docs.dremio.com/current/reference/api/job/job-results/) the limits will be applied at the query plan level. Query planner will optimize the query to read less number of Parquet files, the Dremio query execution cost will be lower and Dollar cost/compute cost for the end user will also be lower. On the other hand, if we use limit functionality in MCP server layer, Dremio query engine will run the full cost query. And later trim the output. So user will pay full dollar cost for running the entire query and receive truncated results.

Double implementation of the same functionality at two different services will also end up confusing the end users.

We already call Job results API over here so the MCP integration will also be straight forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants