[KPIP] Data Agent Engine — AI-Powered Autonomous Data Analysis for Kyuubi #7373
5 comments · 3 replies
@wangzhigang1999, this sounds like a very promising idea! BTW, you don't need to care about the chat engine; it's just a toy that was created overnight.
There are only a few people who subscribe to the GitHub Discussions channel; please share this on the mailing list to get broader awareness.
Strong proposal. A few observations from working in the text-to-SQL space:

**Schema discovery as the critical path.** The multi-signal approach (FK metadata + naming conventions + MinHash value overlap) is the right call. In practice, explicit FKs cover maybe 40% of real-world schemas — the rest rely on naming-convention inference. One addition worth considering: column-level statistics (cardinality, NULL ratio) fed to the LLM alongside schema metadata. This helps the agent avoid generating queries that join on high-cardinality columns without filters, which is one of the most common causes of cartesian-join-like blowups that the ResultVerification middleware would catch too late.

**The 60% BIRD target is realistic but worth nuancing.** BIRD includes questions that require domain knowledge not present in schema metadata (e.g., knowing that "big city" means population > 1M). The self-correction loop should help here — the agent can inspect result distributions and re-query when something looks off. Tracking accuracy separately for schema-answerable vs. domain-knowledge questions would give a clearer signal during development.

**Middleware ordering matters more than it seems.** Guardrails before ResultVerification means the agent can't accidentally run a write query during self-correction. But Compaction before ResultVerification means the agent might lose context about why a previous attempt failed. Documenting a recommended middleware ordering (or making it configurable per deployment) would help operators avoid subtle bugs.

**Dialect-aware SQL generation.** The proposal mentions Spark SQL, Trino, and Hive — these have meaningful syntax differences that the SQL generator must account for.

Disclosure: I work on ai2sql.io, a natural-language-to-SQL tool focused on the simpler end of this spectrum (single-turn query generation for learning and ad-hoc analysis). The agentic multi-turn approach described here is the natural next step for production-grade systems where single-shot accuracy isn't sufficient.
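The MinHash value-overlap signal mentioned above is cheap to compute per column. A minimal Java sketch, where the integer mixer and the k = 128 signature width are illustrative choices rather than anything from the proposal:

```java
import java.util.*;

public class MinHashOverlap {
    // Cheap integer mixer standing in for k independent hash functions (illustrative only).
    static int mix(int x) {
        x ^= x >>> 16; x *= 0x45d9f3b;
        x ^= x >>> 16; x *= 0x45d9f3b;
        x ^= x >>> 16;
        return x;
    }

    /** MinHash signature: for each simulated hash function i, the minimum hash over all values. */
    static int[] signature(Collection<String> values, int k) {
        int[] sig = new int[k];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String v : values) {
            int h = v.hashCode();
            for (int i = 0; i < k; i++) {
                int mixed = mix(h ^ (i * 0x9E3779B9));  // vary the seed per "hash function"
                if (mixed < sig[i]) sig[i] = mixed;
            }
        }
        return sig;
    }

    /** Jaccard estimate: the fraction of signature slots where the two columns agree. */
    static double estimatedJaccard(int[] a, int[] b) {
        int match = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) match++;
        return (double) match / a.length;
    }

    public static void main(String[] args) {
        // orders.user_id overlaps heavily with users.id -> likely join key (true Jaccard = 5/6)
        List<String> ordersUserId = List.of("u1", "u2", "u3", "u4", "u5");
        List<String> usersId = List.of("u1", "u2", "u3", "u4", "u5", "u6");
        double sim = estimatedJaccard(signature(ordersUserId, 128), signature(usersId, 128));
        System.out.println("estimated value overlap: " + sim);
    }
}
```

The appeal for schema discovery is that signatures are computed once per column and compared pairwise in O(k), so the agent never has to scan two columns jointly.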
Implementation started. Umbrella issue: #7379, first PR: #7385. Will submit remaining PRs incrementally after review.
[KPIP] Data Agent Engine — AI-Powered Autonomous Data Analysis for Kyuubi
Abstract
This proposal introduces a Data Agent Engine to Apache Kyuubi, enabling users to perform data analysis through natural language. Unlike the existing Chat Engine, which provides stateless LLM Q&A without data access, the Data Agent Engine bridges LLMs with Kyuubi's multi-engine SQL execution capabilities — allowing an AI agent to autonomously explore schemas, generate SQL, execute queries, verify results, and self-correct through multi-turn reasoning.
Q1. What are you trying to do?
Add a new engine type `DATA_AGENT` to Kyuubi that provides agentic, multi-turn data analysis via natural language. Users connect through standard JDBC/REST interfaces, ask questions in plain language, and the agent autonomously explores schemas, generates SQL, executes queries, verifies results, and self-corrects. This enables business users, analysts, and non-SQL-proficient engineers to query data warehouses and lakehouses through Kyuubi without writing SQL.
Q2. How is it done today, and what are the limits of current practice?
Current state in Kyuubi
Kyuubi has a Chat Engine (`externals/kyuubi-chat-engine/`) that integrates LLMs (ChatGPT, ErnieBot) via a pluggable `ChatProvider` interface.
Limitations of the current Chat Engine:
- Single-turn `ask()` interface — no iterative reasoning
Current state in the industry
Tools like text-to-SQL assistants generate a single SQL query from a natural language question. This works for simple lookups but fails on complex analytical questions that require:
Q3. What is new in your approach, and why do you think it will be successful?
Core innovation: Agentic SQL analysis within Kyuubi's multi-tenant gateway
Instead of single-shot text-to-SQL, we introduce a ReAct (Reasoning + Acting) agent loop that iterates toward correct answers:
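A minimal sketch of such a loop, with the LLM and the tools stubbed out as plain functions. The `Step` shape and the stub behavior here are hypothetical illustrations, not the proposal's actual interfaces:

```java
import java.util.*;
import java.util.function.Function;

/** Minimal sketch of a ReAct-style loop: reason, act (call a tool), observe, repeat. */
public class ReActLoop {
    // Hypothetical step emitted by the LLM: either a tool call or a final answer.
    record Step(String toolName, String toolInput, String finalAnswer) {
        boolean isFinal() { return finalAnswer != null; }
    }

    static String run(Function<List<String>, Step> model,
                      Map<String, Function<String, String>> tools,
                      String question, int maxTurns) {
        List<String> transcript = new ArrayList<>(List.of("Question: " + question));
        for (int turn = 0; turn < maxTurns; turn++) {
            Step step = model.apply(transcript);              // Reason over the transcript
            if (step.isFinal()) return step.finalAnswer();
            String observation = tools
                .getOrDefault(step.toolName(), in -> "unknown tool")
                .apply(step.toolInput());                     // Act: execute the tool
            transcript.add("Observation: " + observation);    // Observe, then loop
        }
        return "max turns exceeded";
    }

    public static void main(String[] args) {
        // Stub model: asks for a row count once, then answers from the observation.
        Function<List<String>, Step> model = t ->
            t.size() == 1
                ? new Step("sql_query", "SELECT COUNT(*) FROM orders", null)
                : new Step(null, null, "orders has " + t.get(1).replace("Observation: ", "") + " rows");
        Map<String, Function<String, String>> tools = Map.of("sql_query", sql -> "42");
        System.out.println(run(model, tools, "How many orders?", 5));  // prints "orders has 42 rows"
    }
}
```

The `maxTurns` bound is the important production detail: it keeps a confused model from looping indefinitely against the compute engine.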
Key design decisions
1. Reuse Kyuubi's engine infrastructure for SQL execution
The Data Agent does NOT connect to databases directly. Instead, its `sql_query` tool uses Kyuubi's JDBC driver to connect back to Kyuubi Server, which then routes the SQL to the appropriate compute engine (Spark SQL, Trino, Hive, etc.). This is similar to the existing JDBC Engine pattern, where an engine acts as a JDBC client to a backend data source — except here the "backend" is Kyuubi Server itself.
This design ensures that every SQL query the agent executes goes through the Server gateway, inheriting multi-tenant resource isolation, authentication/authorization (Kyuubi AuthZ / Apache Ranger), query auditing, and engine lifecycle management. The agent connects using the original user's credentials, so the same ACL rules are enforced.
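The loopback pattern can be sketched with plain `java.sql` against Kyuubi's HiveServer2-compatible JDBC endpoint. The host, port, and helper names below are illustrative; `kyuubi.engine.type` is a real Kyuubi session conf, used here to show how the Server would route the agent's SQL:

```java
import java.sql.*;

/** Sketch of a sql_query tool that routes the agent's SQL back through Kyuubi Server. */
public class SqlQueryTool {
    /** Kyuubi's JDBC URLs carry session confs after ";#"; values here are illustrative. */
    static String buildUrl(String host, int port, String engineType) {
        return "jdbc:hive2://" + host + ":" + port + "/;#kyuubi.engine.type=" + engineType;
    }

    /** Execute SQL as the original user, so AuthZ/Ranger rules still apply. */
    static String execute(String url, String user, String sql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            StringBuilder out = new StringBuilder();
            while (rs.next()) out.append(rs.getString(1)).append('\n');
            return out.toString();
        }
    }

    public static void main(String[] args) {
        // No live server needed to see the routing URL the tool would use:
        System.out.println(buildUrl("kyuubi-server", 10009, "SPARK_SQL"));
    }
}
```

Because the tool is just another JDBC client, every agent-issued query shows up in the same audit trail as a human-issued one.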
2. Pluggable middleware pipeline for reasoning control
An onion-model middleware system allows operators to customize agent behavior without modifying core logic:
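One way such an onion composition could look. The `Middleware` interface and the two example middlewares are a hypothetical sketch (the middleware names follow the proposal, the API does not):

```java
import java.util.*;
import java.util.function.UnaryOperator;

/** Sketch of an onion-model middleware chain around the agent's SQL execution path. */
public class MiddlewarePipeline {
    interface Middleware {
        // Wraps the next handler; may short-circuit, rewrite input, or post-process output.
        String invoke(String request, UnaryOperator<String> next);
    }

    /** Compose outermost-first: the first middleware in the list sees the request first. */
    static UnaryOperator<String> compose(List<Middleware> chain, UnaryOperator<String> terminal) {
        UnaryOperator<String> handler = terminal;
        for (int i = chain.size() - 1; i >= 0; i--) {
            Middleware mw = chain.get(i);
            UnaryOperator<String> next = handler;
            handler = req -> mw.invoke(req, next);
        }
        return handler;
    }

    public static void main(String[] args) {
        // Guardrail: block write queries before they ever reach execution.
        Middleware guardrails = (req, next) ->
            req.toUpperCase().startsWith("DROP") ? "BLOCKED: write query" : next.apply(req);
        // Verification: inspect the result on the way back out.
        Middleware verification = (req, next) -> {
            String result = next.apply(req);
            return result.isEmpty() ? "WARN: empty result" : result;
        };
        UnaryOperator<String> agent = compose(List.of(guardrails, verification), sql -> "3 rows");
        System.out.println(agent.apply("SELECT * FROM t"));  // prints "3 rows"
        System.out.println(agent.apply("DROP TABLE t"));     // prints "BLOCKED: write query"
    }
}
```

Note how the composition order encodes a guarantee: with the guardrail outermost, a blocked write query never reaches the inner layers at all, which is exactly why a documented default ordering matters for operators.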
3. Multi-signal schema understanding
The agent actively discovers schema structure through:
- explicit foreign-key metadata
- naming conventions (`*_id` columns matched to primary keys)
- value-overlap signals (MinHash sketches across column values)
4. Java/Scala implementation on the JVM
Implemented in Java/Scala, consistent with all existing Kyuubi engines. Uses LangChain4j for LLM interaction (tool calling, chat memory, streaming). This allows direct reuse of Kyuubi's `Serverable`, `SessionManager`, `OperationManager`, and Thrift service infrastructure.
Why it will be successful
Q4. Who cares? If you are successful, what difference will it make?
Strategic value for Kyuubi
Q5. How will you measure success?
Q6. What are the mid-term and final "exams" to check for success?
Phase 1: Foundation (Mid-term)
- `DATA_AGENT` engine type registered
- `DataAgentProcessBuilder` launches the agent engine process

Success criterion: A user connects via beeline/JDBC, types a natural language question, and receives a SQL-backed answer.
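The success criterion, sketched as a connection fragment (host and port are illustrative; `kyuubi.engine.type` is Kyuubi's standard session conf, carrying the new `DATA_AGENT` value proposed here):

```
# select the Data Agent engine for this session
beeline -u 'jdbc:hive2://kyuubi-host:10009/;#kyuubi.engine.type=DATA_AGENT'

# then ask in plain language instead of SQL
> which region had the highest revenue last quarter?
```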
Phase 2: Intelligent Analysis (Final)
Success criterion: Agent correctly answers multi-step analytical questions that require schema exploration, multi-table joins, and result verification.
Q7. What are the risks?
Implementation Roadmap
Rejected Alternatives
Alternative 1: Extend the existing Chat Engine
The Chat Engine's `ChatProvider.ask()` interface is fundamentally single-turn and synchronous. Adding multi-turn reasoning, tool execution, and streaming would require rewriting most of the internals. A new engine type is cleaner and avoids breaking existing Chat Engine users.
Alternative 2: Implement as a Kyuubi Server plugin / extension
The agent needs its own process lifecycle (LLM client, conversation memory, middleware pipeline). Running inside the Server JVM would compete for resources and complicate failure isolation.
Alternative 3: External service with REST integration
Would bypass Kyuubi's authentication, authorization, and audit pipeline. Users would need to manage a separate service.
Alternative 4: Implement in Python
While the AI/ML ecosystem is richer in Python, the agent's core operations (LLM API calls, SQL orchestration, schema introspection) are well-supported in Java via LangChain4j. A Python implementation would introduce a technology stack inconsistency, require re-implementing Kyuubi infrastructure, and add a runtime dependency that the existing community cannot effectively review or maintain.
References
- `externals/kyuubi-chat-engine/`