feat: distributed EXPLAIN / EXPLAIN ANALYZE via logical extension codec by lukekim · Pull Request #31 · spiceai/datafusion-ballista

lukekim · 2026-04-22T17:46:25Z

Summary

Adds native Ballista support for EXPLAIN FORMAT <fmt> and EXPLAIN ANALYZE so that both execute through the distributed scheduler. Previously the requested format was silently lost because datafusion-proto's default ExplainNode drops explain_format, and EXPLAIN ANALYZE was blocked because AnalyzeExec has no distributed plan.

Approach

Logical extension codec

Two UserDefinedLogicalNodeCores in ballista/core/src/serde/logical_plan_ext.rs:

BallistaExplainNode — carries verbose + ExplainFormat (Indent/Tree/PostgresJSON/Graphviz) + inner plan.
BallistaAnalyzeNode — carries verbose + inner plan.

BallistaLogicalExtensionCodec::try_encode/try_decode serialize them through a new BallistaLogicalExtensionNode { oneof node { explain, analyze } } proto wrapper, falling back to the default codec for anything else.

The client planner wraps LogicalPlan::Explain / LogicalPlan::Analyze in the matching extension node before building DistributedQueryExec. The scheduler unwraps back to the native node in submit_job before optimize() / create_physical_plan. The previous "run FORMAT TREE locally" workaround in planner.rs is removed.
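The wrap/unwrap flow described above can be sketched with a toy plan type. This is a minimal model and not the PR's code: `Plan`, `Format`, `lossy_round_trip`, `wrap_on_client`, and `unwrap_on_scheduler` are hypothetical names, and the lossy transport is assumed to reset the format to `Indent`.

```rust
// Toy model of the client -> scheduler round trip. All types are local
// stand-ins (assumptions), not DataFusion or Ballista types.
#[derive(Debug, Clone, PartialEq)]
enum Format {
    Indent,
    Tree,
}

#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    // Native node: its format does not survive the default serialization.
    Explain { verbose: bool, format: Format, input: Box<Plan> },
    // Extension node: carries the same fields explicitly across the wire.
    BallistaExplain { verbose: bool, format: Format, input: Box<Plan> },
}

// Models the default transport: a native Explain comes back with the
// format reset (assumed here to be Indent); other nodes pass through.
fn lossy_round_trip(plan: Plan) -> Plan {
    match plan {
        Plan::Explain { verbose, input, .. } => Plan::Explain {
            verbose,
            format: Format::Indent,
            input,
        },
        other => other,
    }
}

// Client side: swap the native node for the extension node before sending.
fn wrap_on_client(plan: Plan) -> Plan {
    match plan {
        Plan::Explain { verbose, format, input } => {
            Plan::BallistaExplain { verbose, format, input }
        }
        other => other,
    }
}

// Scheduler side: restore the native node before optimization and planning.
fn unwrap_on_scheduler(plan: Plan) -> Plan {
    match plan {
        Plan::BallistaExplain { verbose, format, input } => {
            Plan::Explain { verbose, format, input }
        }
        other => other,
    }
}

fn main() {
    let plan = Plan::Explain {
        verbose: false,
        format: Format::Tree,
        input: Box::new(Plan::Scan("orders".to_string())),
    };
    // Without wrapping, the Tree format is lost in transit.
    assert_ne!(lossy_round_trip(plan.clone()), plan);
    // With wrapping, the scheduler recovers the original node.
    let restored = unwrap_on_scheduler(lossy_round_trip(wrap_on_client(plan.clone())));
    assert_eq!(restored, plan);
    println!("format preserved: {:?}", restored);
}
```

The point of the sketch is only that the extension node is opaque to the default transport, so its fields pass through untouched.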

Distributed `EXPLAIN ANALYZE`

For analyze jobs the scheduler:

Strips the `LogicalPlan::Analyze` wrapper and runs the inner plan as a regular distributed job.
Records an `AnalyzeJobInfo` on the `ExecutionGraph`.
In `ExecutionGraph::succeed_job()`, if analyze-tracked, iterates successful stages and renders each via `DisplayableBallistaExecutionPlan::new(plan, &stage_metrics).indent()` into a new `SuccessfulJob.analyzed_plan_text` proto field (reusing the existing per-stage metrics plumbing).

The client's `DistributedQueryExec`, on seeing `analyzed_plan_text = Some(..)`, skips partition fetching and synthesizes a single-row 2-column `RecordBatch` (`"Plan with Metrics"`, rendered text) matching the `Analyze` output schema.
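As a rough illustration of this hand-off, the sketch below models the scheduler joining per-stage text and the client branching on the optional field. It uses plain structs instead of Arrow and protobuf types; `StageReport`, `ClientOutput`, and the exact text layout are assumptions made for the example.

```rust
// Simplified model of the analyze hand-off. Names and text layout are
// illustrative assumptions, not the PR's actual types or output.
struct StageReport {
    stage_id: usize,
    plan_with_metrics: String,
}

// Scheduler side: join the rendered text of every successful stage.
fn render_analyzed_plan_text(stages: &[StageReport]) -> String {
    stages
        .iter()
        .map(|s| format!("Stage[stage_id={}]\n{}", s.stage_id, s.plan_with_metrics))
        .collect::<Vec<_>>()
        .join("\n")
}

struct SuccessfulJob {
    analyzed_plan_text: Option<String>,
    partition_count: usize,
}

#[derive(Debug, PartialEq)]
enum ClientOutput {
    // Stand-in for the single-row, two-column batch.
    AnalyzeRow { plan_type: String, plan: String },
    // Stand-in for the normal path that fetches result partitions.
    FetchPartitions(usize),
}

// Client side: when the text is present, skip partition fetching.
fn client_output(job: &SuccessfulJob) -> ClientOutput {
    match &job.analyzed_plan_text {
        Some(text) => ClientOutput::AnalyzeRow {
            plan_type: "Plan with Metrics".to_string(),
            plan: text.clone(),
        },
        None => ClientOutput::FetchPartitions(job.partition_count),
    }
}

fn main() {
    let stages = vec![StageReport {
        stage_id: 1,
        plan_with_metrics: "ScanExec, metrics=[output_rows=3]".to_string(),
    }];
    let job = SuccessfulJob {
        analyzed_plan_text: Some(render_analyzed_plan_text(&stages)),
        partition_count: 4,
    };
    println!("{:?}", client_output(&job));
}
```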

Test Coverage

Unit (ballista-core `serde::test`):

`test_ballista_explain_node_codec_roundtrip` — all 4 `ExplainFormat` variants × verbose flags survive encode→decode.
`test_ballista_analyze_node_codec_roundtrip` — verbose flag preserved.
`test_explain_format_str_stable` — stable format-string identifiers, unknown value rejected.
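The property these tests check can be sketched as a pair of mapping functions that must round-trip and reject unknown input. The enum below is a local stand-in for DataFusion's `ExplainFormat`, and the string identifiers are assumed values chosen for illustration, not the ones the PR uses.

```rust
// Local stand-in for DataFusion's ExplainFormat so the sketch has no
// dependencies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExplainFormat {
    Indent,
    Tree,
    PostgresJson,
    Graphviz,
}

static ALL_FORMATS: [ExplainFormat; 4] = [
    ExplainFormat::Indent,
    ExplainFormat::Tree,
    ExplainFormat::PostgresJson,
    ExplainFormat::Graphviz,
];

// Hypothetical wire identifiers (assumed values). Once published they must
// not change, because both ends of the connection rely on them.
fn format_to_str(format: ExplainFormat) -> &'static str {
    match format {
        ExplainFormat::Indent => "indent",
        ExplainFormat::Tree => "tree",
        ExplainFormat::PostgresJson => "pgjson",
        ExplainFormat::Graphviz => "graphviz",
    }
}

// Inverse mapping; anything that is not a known identifier is rejected.
fn format_from_str(s: &str) -> Result<ExplainFormat, String> {
    ALL_FORMATS
        .iter()
        .copied()
        .find(|f| format_to_str(*f) == s)
        .ok_or_else(|| format!("unknown explain format: {}", s))
}

fn main() {
    for format in ALL_FORMATS.iter().copied() {
        assert_eq!(format_from_str(format_to_str(format)), Ok(format));
    }
    assert!(format_from_str("yaml").is_err());
    println!("all formats round-trip");
}
```

Deriving the inverse from `format_to_str` keeps the two directions from drifting apart when a variant is added.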

Integration (ballista-client `context_checks`, standalone + remote):

`should_execute_explain_query_correctly`
`should_execute_explain_format_tree_query_correctly` — asserts tree box-drawing chars + `distributed_plan` row.
`should_execute_explain_analyze_query_correctly` — asserts `Stage[stage_id=`, `metrics=[`, `output_rows=` in rendered text.
`should_execute_explain_analyze_verbose_query_correctly` — new VERBOSE case.

Results: 54 client integration tests, 51 scheduler tests, 51 core lib tests — all green.

Copilot

Pull request overview

Adds end-to-end distributed support for EXPLAIN FORMAT <fmt> and EXPLAIN ANALYZE in Ballista by preserving logical-plan fields that would otherwise be lost in datafusion-proto serialization and by returning scheduler-rendered analyze output to the client.

Changes:

Introduces Ballista logical extension nodes + codec wrapper to round-trip ExplainFormat/verbose and Analyze verbose.
Scheduler unwraps extensions back into native DataFusion LogicalPlan::{Explain,Analyze}, distributes EXPLAIN, and implements distributed EXPLAIN ANALYZE by returning analyzed_plan_text.
Client detects analyzed_plan_text and synthesizes the expected EXPLAIN ANALYZE RecordBatch locally; adds integration tests for TREE format and ANALYZE(+VERBOSE).

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
ballista/scheduler/src/state/task_manager.rs	Threads optional analyze metadata into the created `ExecutionGraph`.
ballista/scheduler/src/state/mod.rs	Unwraps Ballista logical extensions, implements distributed `EXPLAIN` formatting handling, and strips top-level `Analyze` for distributed execution while tracking analyze info.
ballista/scheduler/src/state/execution_graph.rs	Tracks analyze jobs and renders per-stage “plan with metrics” text into `SuccessfulJob.analyzed_plan_text`.
ballista/scheduler/src/state/distributed_explain.rs	Adapts distributed explain output rows based on `ExplainFormat` (Tree vs non-Tree).
ballista/core/src/serde/mod.rs	Adds Ballista logical-extension codec wrapper encode/decode and unit tests for round-trips.
ballista/core/src/serde/logical_plan_ext.rs	New Ballista `UserDefinedLogicalNodeCore` wrappers for Explain/Analyze and format-string mapping helpers.
ballista/core/src/serde/generated/ballista.rs	Updates generated protobuf types for new logical-extension wrapper and `analyzed_plan_text`.
ballista/core/src/planner.rs	Wraps `LogicalPlan::{Explain,Analyze}` before sending to scheduler.
ballista/core/src/execution_plans/distributed_query.rs	Returns boxed streams and synthesizes `EXPLAIN ANALYZE` output when `analyzed_plan_text` is present.
ballista/core/proto/datafusion.proto	Adds `ExplainFormat` + `ExplainNode.format` to local proto definition.
ballista/core/proto/ballista.proto	Adds logical-extension wrapper protos and `SuccessfulJob.analyzed_plan_text`.
ballista/client/tests/context_checks.rs	Adds integration tests for `EXPLAIN FORMAT TREE` and `EXPLAIN ANALYZE` (incl. VERBOSE).

Splits the two features into complementary mechanisms:

1. EXPLAIN (including FORMAT TREE)
   - New BallistaExplainNode UserDefinedLogicalNode preserves LogicalPlan::Explain fields (verbose, explain_format) across the datafusion-proto round trip from client to scheduler.
   - BallistaLogicalExtensionCodec encodes/decodes it.
   - Scheduler unwraps it back into LogicalPlan::Explain before physical planning so DataFusion renders the requested format (Indent, TreeRender, or PostgresJSON) natively.

2. EXPLAIN ANALYZE (based on upstream apache#1567)
   - Client planner strips LogicalPlan::Analyze and wraps the inner plan in DistributedQueryExec, then wraps that in the new DistributedExplainAnalyzeExec.
   - DistributedQueryExec records its job_id after submission so the parent exec can retrieve it.
   - After the child query drains, the wrapper calls a new SchedulerGrpc.GetJobMetrics RPC that returns structured per-stage / per-operator metrics.
   - Metrics are formatted client-side into the familiar 'Plan with Metrics' single-row RecordBatch.

Tests:
- ballista-core: codec round-trip for BallistaExplainNode and unit test for format_metrics_as_record_batch.
- ballista client integration: EXPLAIN, EXPLAIN FORMAT TREE, and sanitized EXPLAIN ANALYZE end-to-end assertions.
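The job_id hand-off between the child query and its wrapper can be sketched with a shared `Arc<Mutex<Option<String>>>` slot. `ChildQuery`, `AnalyzeWrapper`, and `metrics_request` are hypothetical stand-ins for illustration; the real execs and the gRPC call are not modeled.

```rust
use std::sync::{Arc, Mutex};

// Shared slot the child fills in once the scheduler has assigned a job id.
type JobIdHandle = Arc<Mutex<Option<String>>>;

// Hypothetical stand-in for the child query exec.
struct ChildQuery {
    job_id: JobIdHandle,
}

impl ChildQuery {
    fn new() -> Self {
        ChildQuery { job_id: Arc::new(Mutex::new(None)) }
    }

    // The parent keeps a clone of the handle so it can read the id later.
    fn job_id_handle(&self) -> JobIdHandle {
        Arc::clone(&self.job_id)
    }

    // Simulates submission: the scheduler-assigned id is recorded.
    fn submit(&self, assigned: &str) {
        *self.job_id.lock().unwrap() = Some(assigned.to_string());
    }
}

// Hypothetical stand-in for the wrapping analyze exec.
struct AnalyzeWrapper {
    child_job_id: JobIdHandle,
}

impl AnalyzeWrapper {
    // Builds a description of the metrics request; fails if the child has
    // not been submitted, since there is no job to ask about yet.
    fn metrics_request(&self) -> Result<String, String> {
        match self.child_job_id.lock().unwrap().clone() {
            Some(id) => Ok(format!("GetJobMetrics(job_id={})", id)),
            None => Err("child query has not been submitted yet".to_string()),
        }
    }
}

fn main() {
    let child = ChildQuery::new();
    let wrapper = AnalyzeWrapper { child_job_id: child.job_id_handle() };
    assert!(wrapper.metrics_request().is_err());
    child.submit("job-42");
    println!("{}", wrapper.metrics_request().unwrap());
}
```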

…llista into lukim/explain-analyze-codec

- Remove unused ExplainFormat enum and ExplainNode.format from ballista/core/proto/datafusion.proto. The Rust types at runtime come from datafusion_proto via extern_path, so these additions were dead schema that could mislead external consumers.
- BallistaLogicalExtensionCodec::try_decode now falls through to the default codec when BallistaExplainNode::format_from_str cannot parse explain.explain_format, avoiding spurious errors when an unrelated extension payload decodes permissively into the wrapper shape.
- Fix pre-existing clippy 'iterating on a map's values' in execution_graph::running_tasks and ExecutorManager terminating heartbeat filter; switch to .values() iterators.
- cargo fmt.
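The fall-through behavior in the second item can be sketched as follows. Payloads are modeled as plain strings with an assumed `explain:<format>` shape, and `default_decode`, `Node`, and the format identifiers are hypothetical; the real codec works on protobuf bytes.

```rust
// Assumed identifiers for the known formats (illustrative values only).
static KNOWN_FORMATS: [&str; 4] = ["indent", "tree", "pgjson", "graphviz"];

#[derive(Debug, PartialEq)]
enum Node {
    BallistaExplain { format: &'static str },
    FromDefaultCodec(String),
}

// Stand-in for the wrapped default codec.
fn default_decode(payload: &str) -> Result<Node, String> {
    Ok(Node::FromDefaultCodec(payload.to_string()))
}

// Tries the Ballista wrapper shape first. If the payload looks like the
// wrapper but the format does not parse, it most likely belongs to another
// extension, so it is delegated instead of reported as an error.
fn try_decode(payload: &str) -> Result<Node, String> {
    if let Some(raw) = payload.strip_prefix("explain:") {
        if let Some(known) = KNOWN_FORMATS.iter().copied().find(|k| *k == raw) {
            return Ok(Node::BallistaExplain { format: known });
        }
    }
    default_decode(payload)
}

fn main() {
    println!("{:?}", try_decode("explain:tree"));
    println!("{:?}", try_decode("explain:???"));
    println!("{:?}", try_decode("custom-node"));
}
```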

Copilot

Pull request overview

Copilot reviewed 14 out of 15 changed files in this pull request and generated 3 comments.

- Remove unused verbose flag plumbing from DistributedExplainAnalyzeExec: the flag was never honored by format_metrics_as_record_batch so EXPLAIN ANALYZE VERBOSE rendered identically to plain EXPLAIN ANALYZE. Drop the field/constructor arg, call-site (planner), and unused job_id parameter; pass only the schema.
- Fix misleading inline comment in distributed_explain.rs that claimed the FinalLogicalPlan fallback was for 'non-indent formats' when the branch also handles Indent.
- Update PR description to match the final implementation (no more BallistaAnalyzeNode / analyzed_plan_text; client-side Analyze stripping via DistributedExplainAnalyzeExec + GetJobMetrics RPC).

phillipleblanc · 2026-04-23T06:19:29Z


  rpc GetJobStatus (GetJobStatusParams) returns (GetJobStatusResult) {}

+  rpc GetJobMetrics (GetJobMetricsParams) returns (GetJobMetricsResult) {}


This seems unrelated to the explain changes?

phillipleblanc

This was implemented on the wrong branch, I will re-implement this on spiceai-52.5

…rebased onto spiceai-52.5) (#34)

* feat: distributed EXPLAIN, EXPLAIN FORMAT TREE, and EXPLAIN ANALYZE

Reimplements PR #31 on top of spiceai-52.5 (DataFusion 52).

EXPLAIN

Round-trip the ExplainFormat through the client -> scheduler boundary by wrapping the LogicalPlan::Explain in a BallistaExplainNode logical extension before serialization. The scheduler unwraps it back to a native LogicalPlan::Explain so its existing physical-planning intercept can substitute a distributed-aware ExplainExec replacement.

EXPLAIN FORMAT TREE

Honored end-to-end by threading the ExplainFormat through extract_logical_and_physical_plans and construct_distributed_explain_exec in scheduler/state/distributed_explain.rs (Tree format omits the logical_plan row to match DataFusion's native behavior).

EXPLAIN ANALYZE

- Client (BallistaQueryPlanner): strips the LogicalPlan::Analyze and runs the inner plan via DistributedQueryExec, wrapped in a new DistributedExplainAnalyzeExec. After the child stream drains, the wrapper publishes the job_id (added Arc<Mutex<Option<String>>> handle on DistributedQueryExec) and calls the scheduler's GetJobMetrics RPC.
- Scheduler: new GetJobMetrics RPC walks the execution graph in the same pre-order DFS order as ballista_core::utils::collect_plan_metrics so per-operator metrics line up with the rendered plan text. Falls back from the active-job cache to the saved completed-job graph so the call still succeeds after succeed_job moves the graph out of active_job_cache.

Includes ballista/client tests covering all three forms in both standalone and remote modes.

* Support for Tree formatting + tests
* Fortmatting
* More formatting
* Add insta snapshots for FORMAT TREE integration tests
* Lint
* Improve
* Lint

---------

Co-authored-by: Sergei Grebnov <sergei.grebnov@gmail.com>
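The ordering requirement on GetJobMetrics can be illustrated with a small tree walk: if both sides traverse in pre-order, the n-th metric belongs to the n-th rendered operator. `Node`, `collect_metrics`, and `render` are hypothetical names, and the output layout is an assumption rather than Ballista's actual rendering.

```rust
// Generic tree used for both sides of the sketch (an assumption, not a
// Ballista type): the scheduler holds row counts, the client holds names.
struct Node<T> {
    value: T,
    children: Vec<Node<T>>,
}

fn leaf<T>(value: T) -> Node<T> {
    Node { value, children: Vec::new() }
}

// Scheduler side: flatten per-operator row counts in pre-order
// (parent first, then children left to right).
fn collect_metrics(node: &Node<u64>, out: &mut Vec<u64>) {
    out.push(node.value);
    for child in &node.children {
        collect_metrics(child, out);
    }
}

// Client side: walk the plan in the same order and consume one metric per
// operator. A missing metric is rendered as 0 in this sketch.
fn render(
    node: &Node<&str>,
    depth: usize,
    metrics: &mut std::slice::Iter<'_, u64>,
    out: &mut String,
) {
    let rows = metrics.next().copied().unwrap_or(0);
    out.push_str(&format!(
        "{}{}, metrics=[output_rows={}]\n",
        "  ".repeat(depth),
        node.value,
        rows
    ));
    for child in &node.children {
        render(child, depth + 1, metrics, out);
    }
}

fn main() {
    let executed = Node { value: 5, children: vec![leaf(100), leaf(7)] };
    let plan = Node {
        value: "HashJoinExec",
        children: vec![leaf("ScanExec(a)"), leaf("ScanExec(b)")],
    };
    let mut metrics = Vec::new();
    collect_metrics(&executed, &mut metrics);
    let mut text = String::new();
    render(&plan, 0, &mut metrics.iter(), &mut text);
    print!("{}", text);
}
```

If either side switched to a different traversal, such as post-order, the row counts would attach to the wrong operators without any error being raised, which is why the commit pins both to the same order.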

Copilot AI review requested due to automatic review settings April 22, 2026 17:46

lukekim force-pushed the lukim/explain-analyze-codec branch from 18bee66 to 6bb2bd4 Compare April 22, 2026 17:46

Copilot started reviewing on behalf of lukekim April 22, 2026 17:46

Copilot AI reviewed Apr 22, 2026

Comment thread ballista/core/src/serde/mod.rs Outdated

Comment thread ballista/core/proto/datafusion.proto Outdated

lukekim force-pushed the lukim/explain-analyze-codec branch from 6bb2bd4 to 7a80162 Compare April 22, 2026 18:48

lukekim self-assigned this Apr 22, 2026

lukekim added the enhancement New feature or request label Apr 22, 2026

lukekim added 2 commits April 22, 2026 12:04

Merge branch 'spiceai-51' of https://github.com/spiceai/datafusion-ba…

36b76b2

…llista into lukim/explain-analyze-codec

Copilot AI review requested due to automatic review settings April 22, 2026 19:09

Copilot started reviewing on behalf of lukekim April 22, 2026 19:09

Copilot AI reviewed Apr 22, 2026

Comment thread ballista/core/proto/ballista.proto

Comment thread ballista/core/src/execution_plans/distributed_explain_analyze.rs

Comment thread ballista/scheduler/src/state/distributed_explain.rs

lukekim requested review from peasee and phillipleblanc April 22, 2026 21:41

lukekim assigned phillipleblanc Apr 22, 2026

phillipleblanc reviewed Apr 23, 2026

phillipleblanc mentioned this pull request Apr 23, 2026

feat: distributed EXPLAIN, EXPLAIN FORMAT TREE, and EXPLAIN ANALYZE (rebased onto spiceai-52.5) #34

Merged

phillipleblanc closed this Apr 24, 2026

lukekim wants to merge 4 commits into spiceai-51 from lukim/explain-analyze-codec

3 participants

