[KPIP] Data Agent Engine — AI-Powered Autonomous Data Analysis for Kyuubi #7373
5 comments · 3 replies
@wangzhigang1999, this sounds like a very promising idea! BTW, you don't need to care about the chat engine; it's just a toy that was created overnight.
There are only a few people who subscribe to the GitHub Discussions channel; please share this on the mailing list to get broader awareness.
Strong proposal. A few observations from working in the text-to-SQL space:

**Schema discovery as the critical path.** The multi-signal approach (FK metadata + naming conventions + MinHash value overlap) is the right call. In practice, explicit FKs cover maybe 40% of real-world schemas — the rest rely on naming-convention inference. One addition worth considering: column-level statistics (cardinality, NULL ratio) fed to the LLM alongside schema metadata. This helps the agent avoid generating queries that join on high-cardinality columns without filters, which is one of the most common causes of cartesian-join-like blowups that the ResultVerification middleware would catch too late.

**The 60% BIRD target is realistic but worth nuancing.** BIRD includes questions that require domain knowledge not present in schema metadata (e.g., knowing that "big city" means population > 1M). The self-correction loop should help here — the agent can inspect result distributions and re-query when something looks off. Tracking accuracy separately for schema-answerable vs. domain-knowledge questions would give a clearer signal during development.

**Middleware ordering matters more than it seems.** Guardrails before ResultVerification means the agent can't accidentally run a write query during self-correction. But Compaction before ResultVerification means the agent might lose context about why a previous attempt failed. Documenting a recommended middleware ordering (or making it configurable per deployment) would help operators avoid subtle bugs.

**Dialect-aware SQL generation.** The proposal mentions Spark SQL, Trino, and Hive — these have meaningful syntax differences that the SQL generator must account for.

Disclosure: I work on ai2sql.io, a natural-language-to-SQL tool focused on the simpler end of this spectrum (single-turn query generation for learning and ad-hoc analysis). The agentic multi-turn approach described here is the natural next step for production-grade systems where single-shot accuracy isn't sufficient.
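The MinHash value-overlap signal mentioned above is cheap to compute per column. A minimal Java sketch, where the integer mixer and the k = 128 signature width are illustrative choices rather than anything from the proposal:

```java
import java.util.*;

public class MinHashOverlap {
    // Cheap integer mixer standing in for k independent hash functions (illustrative only).
    static int mix(int x) {
        x ^= x >>> 16; x *= 0x45d9f3b;
        x ^= x >>> 16; x *= 0x45d9f3b;
        x ^= x >>> 16;
        return x;
    }

    /** MinHash signature: for each simulated hash function i, the minimum hash over all values. */
    static int[] signature(Collection<String> values, int k) {
        int[] sig = new int[k];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String v : values) {
            int h = v.hashCode();
            for (int i = 0; i < k; i++) {
                int mixed = mix(h ^ (i * 0x9E3779B9));  // vary the seed per "hash function"
                if (mixed < sig[i]) sig[i] = mixed;
            }
        }
        return sig;
    }

    /** Jaccard estimate: the fraction of signature slots where the two columns agree. */
    static double estimatedJaccard(int[] a, int[] b) {
        int match = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) match++;
        return (double) match / a.length;
    }

    public static void main(String[] args) {
        // orders.user_id overlaps heavily with users.id -> likely join key (true Jaccard = 5/6)
        List<String> ordersUserId = List.of("u1", "u2", "u3", "u4", "u5");
        List<String> usersId = List.of("u1", "u2", "u3", "u4", "u5", "u6");
        double sim = estimatedJaccard(signature(ordersUserId, 128), signature(usersId, 128));
        System.out.println("estimated value overlap: " + sim);
    }
}
```

The appeal for schema discovery is that signatures are computed once per column and compared pairwise in O(k), so the agent never has to scan two columns jointly.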
Implementation started. Umbrella issue: #7379, first PR: #7385. Will submit remaining PRs incrementally after review.
[KPIP] Data Agent Engine — AI-Powered Autonomous Data Analysis for Kyuubi
Abstract
This proposal introduces a Data Agent Engine to Apache Kyuubi, enabling users to perform data analysis through natural language. Unlike the existing Chat Engine, which provides stateless LLM Q&A without data access, the Data Agent Engine bridges LLMs with Kyuubi's multi-engine SQL execution capabilities — allowing an AI agent to autonomously explore schemas, generate SQL, execute queries, verify results, and self-correct through multi-turn reasoning.
Q1. What are you trying to do?
Add a new engine type `DATA_AGENT` to Kyuubi that provides agentic, multi-turn data analysis via natural language. Users connect through standard JDBC/REST interfaces, ask questions in plain language, and the agent autonomously explores schemas, generates SQL, executes queries, verifies results, and self-corrects. This enables business users, analysts, and non-SQL-proficient engineers to query data warehouses and lakehouses through Kyuubi without writing SQL.
Q2. How is it done today, and what are the limits of current practice?
Current state in Kyuubi
Kyuubi has a Chat Engine (`externals/kyuubi-chat-engine/`) that integrates LLMs (ChatGPT, ErnieBot) via a pluggable `ChatProvider` interface.
Limitations of the current Chat Engine:
- Single-turn `ask()` interface — no iterative reasoning
Current state in the industry
Tools like text-to-SQL assistants generate a single SQL query from a natural language question. This works for simple lookups but fails on complex analytical questions that require:
Q3. What is new in your approach, and why do you think it will be successful?
Core innovation: Agentic SQL analysis within Kyuubi's multi-tenant gateway
Instead of single-shot text-to-SQL, we introduce a ReAct (Reasoning + Acting) agent loop that iterates toward correct answers:
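A minimal sketch of such a loop, with the LLM and the tools stubbed out as plain functions. The `Step` shape and the stub behavior here are hypothetical illustrations, not the proposal's actual interfaces:

```java
import java.util.*;
import java.util.function.Function;

/** Minimal sketch of a ReAct-style loop: reason, act (call a tool), observe, repeat. */
public class ReActLoop {
    // Hypothetical step emitted by the LLM: either a tool call or a final answer.
    record Step(String toolName, String toolInput, String finalAnswer) {
        boolean isFinal() { return finalAnswer != null; }
    }

    static String run(Function<List<String>, Step> model,
                      Map<String, Function<String, String>> tools,
                      String question, int maxTurns) {
        List<String> transcript = new ArrayList<>(List.of("Question: " + question));
        for (int turn = 0; turn < maxTurns; turn++) {
            Step step = model.apply(transcript);              // Reason over the transcript
            if (step.isFinal()) return step.finalAnswer();
            String observation = tools
                .getOrDefault(step.toolName(), in -> "unknown tool")
                .apply(step.toolInput());                     // Act: execute the tool
            transcript.add("Observation: " + observation);    // Observe, then loop
        }
        return "max turns exceeded";
    }

    public static void main(String[] args) {
        // Stub model: asks for a row count once, then answers from the observation.
        Function<List<String>, Step> model = t ->
            t.size() == 1
                ? new Step("sql_query", "SELECT COUNT(*) FROM orders", null)
                : new Step(null, null, "orders has " + t.get(1).replace("Observation: ", "") + " rows");
        Map<String, Function<String, String>> tools = Map.of("sql_query", sql -> "42");
        System.out.println(run(model, tools, "How many orders?", 5));  // prints "orders has 42 rows"
    }
}
```

The `maxTurns` bound is the important production detail: it keeps a confused model from looping indefinitely against the compute engine.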
Key design decisions
1. Reuse Kyuubi's engine infrastructure for SQL execution
The Data Agent does NOT connect to databases directly. Instead, its `sql_query` tool uses Kyuubi's JDBC driver to connect back to Kyuubi Server, which then routes the SQL to the appropriate compute engine (Spark SQL, Trino, Hive, etc.). This is similar to the existing JDBC Engine pattern, where an engine acts as a JDBC client to a backend data source — except here the "backend" is Kyuubi Server itself.
This design ensures that every SQL query the agent executes goes through the Server gateway, inheriting multi-tenant resource isolation, authentication/authorization (Kyuubi AuthZ / Apache Ranger), query auditing, and engine lifecycle management. The agent connects using the original user's credentials, so the same ACL rules are enforced.
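The loopback pattern can be sketched with plain `java.sql` against Kyuubi's HiveServer2-compatible JDBC endpoint. The host, port, and helper names below are illustrative; `kyuubi.engine.type` is a real Kyuubi session conf, used here to show how the Server would route the agent's SQL:

```java
import java.sql.*;

/** Sketch of a sql_query tool that routes the agent's SQL back through Kyuubi Server. */
public class SqlQueryTool {
    /** Kyuubi's JDBC URLs carry session confs after ";#"; values here are illustrative. */
    static String buildUrl(String host, int port, String engineType) {
        return "jdbc:hive2://" + host + ":" + port + "/;#kyuubi.engine.type=" + engineType;
    }

    /** Execute SQL as the original user, so AuthZ/Ranger rules still apply. */
    static String execute(String url, String user, String sql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            StringBuilder out = new StringBuilder();
            while (rs.next()) out.append(rs.getString(1)).append('\n');
            return out.toString();
        }
    }

    public static void main(String[] args) {
        // No live server needed to see the routing URL the tool would use:
        System.out.println(buildUrl("kyuubi-server", 10009, "SPARK_SQL"));
    }
}
```

Because the tool is just another JDBC client, every agent-issued query shows up in the same audit trail as a human-issued one.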
2. Pluggable middleware pipeline for reasoning control
An onion-model middleware system allows operators to customize agent behavior without modifying core logic:
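One way such an onion composition could look. The `Middleware` interface and the two example middlewares are a hypothetical sketch (the middleware names follow the proposal, the API does not):

```java
import java.util.*;
import java.util.function.UnaryOperator;

/** Sketch of an onion-model middleware chain around the agent's SQL execution path. */
public class MiddlewarePipeline {
    interface Middleware {
        // Wraps the next handler; may short-circuit, rewrite input, or post-process output.
        String invoke(String request, UnaryOperator<String> next);
    }

    /** Compose outermost-first: the first middleware in the list sees the request first. */
    static UnaryOperator<String> compose(List<Middleware> chain, UnaryOperator<String> terminal) {
        UnaryOperator<String> handler = terminal;
        for (int i = chain.size() - 1; i >= 0; i--) {
            Middleware mw = chain.get(i);
            UnaryOperator<String> next = handler;
            handler = req -> mw.invoke(req, next);
        }
        return handler;
    }

    public static void main(String[] args) {
        // Guardrail: block write queries before they ever reach execution.
        Middleware guardrails = (req, next) ->
            req.toUpperCase().startsWith("DROP") ? "BLOCKED: write query" : next.apply(req);
        // Verification: inspect the result on the way back out.
        Middleware verification = (req, next) -> {
            String result = next.apply(req);
            return result.isEmpty() ? "WARN: empty result" : result;
        };
        UnaryOperator<String> agent = compose(List.of(guardrails, verification), sql -> "3 rows");
        System.out.println(agent.apply("SELECT * FROM t"));  // prints "3 rows"
        System.out.println(agent.apply("DROP TABLE t"));     // prints "BLOCKED: write query"
    }
}
```

Note how the composition order encodes a guarantee: with the guardrail outermost, a blocked write query never reaches the inner layers at all, which is exactly why a documented default ordering matters for operators.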
3. Multi-signal schema understanding
The agent actively discovers schema structure through:
- explicit foreign-key metadata
- naming conventions (`*_id` columns matched to primary keys)
- value-overlap signals (MinHash sketches across column values)
4. Java/Scala implementation on the JVM
Implemented in Java/Scala, consistent with all existing Kyuubi engines. Uses LangChain4j for LLM interaction (tool calling, chat memory, streaming). This allows direct reuse of Kyuubi's `Serverable`, `SessionManager`, `OperationManager`, and Thrift service infrastructure.
Why it will be successful
Q4. Who cares? If you are successful, what difference will it make?
Strategic value for Kyuubi
Q5. How will you measure success?
Q6. What are the mid-term and final "exams" to check for success?
Phase 1: Foundation (Mid-term)
- `DATA_AGENT` engine type registered
- `DataAgentProcessBuilder` launches the agent engine process

Success criterion: A user connects via beeline/JDBC, types a natural language question, and receives a SQL-backed answer.
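The success criterion, sketched as a connection fragment (host and port are illustrative; `kyuubi.engine.type` is Kyuubi's standard session conf, carrying the new `DATA_AGENT` value proposed here):

```
# select the Data Agent engine for this session
beeline -u 'jdbc:hive2://kyuubi-host:10009/;#kyuubi.engine.type=DATA_AGENT'

# then ask in plain language instead of SQL
> which region had the highest revenue last quarter?
```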
Phase 2: Intelligent Analysis (Final)
Success criterion: Agent correctly answers multi-step analytical questions that require schema exploration, multi-table joins, and result verification.
Q7. What are the risks?
Implementation Roadmap
Rejected Alternatives
Alternative 1: Extend the existing Chat Engine
The Chat Engine's `ChatProvider.ask()` interface is fundamentally single-turn and synchronous. Adding multi-turn reasoning, tool execution, and streaming would require rewriting most of the internals. A new engine type is cleaner and avoids breaking existing Chat Engine users.
Alternative 2: Implement as a Kyuubi Server plugin / extension
The agent needs its own process lifecycle (LLM client, conversation memory, middleware pipeline). Running inside the Server JVM would compete for resources and complicate failure isolation.
Alternative 3: External service with REST integration
Would bypass Kyuubi's authentication, authorization, and audit pipeline. Users would need to manage a separate service.
Alternative 4: Implement in Python
While the AI/ML ecosystem is richer in Python, the agent's core operations (LLM API calls, SQL orchestration, schema introspection) are well-supported in Java via LangChain4j. A Python implementation would introduce a technology stack inconsistency, require re-implementing Kyuubi infrastructure, and add a runtime dependency that the existing community cannot effectively review or maintain.
References
- `externals/kyuubi-chat-engine/`