feat: format-agnostic serde hook for PhysicalExpr (Column + NotExpr)#21966
Draft
adriangb wants to merge 5 commits intoapache:mainfrom
Draft
feat: format-agnostic serde hook for PhysicalExpr (Column + NotExpr)#21966adriangb wants to merge 5 commits intoapache:mainfrom
adriangb wants to merge 5 commits intoapache:mainfrom
Conversation
Introduces a format-agnostic serialization hook on PhysicalExpr:
- `serde_tag(&self) -> &'static str` (default empty = "not serializable")
- `erased_serialize(&self) -> Box<dyn erased_serde::Serialize + '_>`
(default returns an error sentinel)
Plus an `impl Serialize for dyn PhysicalExpr` that wraps the two into a
`{tag, data}` envelope. `Arc<dyn PhysicalExpr>` becomes serializable
automatically via serde's `rc` feature.
This is the producer half of an in-tree, format-agnostic serialization
story for PhysicalExpr (JSON for debugging, proto via a serde adapter
in a follow-up). No deserialization yet — split for review.
The trait method returns a Box<dyn erased_serde::Serialize> rather than
taking a &mut dyn erased_serde::Serializer because erased_serde 0.4
sealed its Serialize trait, so the obvious bridge (custom impl) is no
longer possible. Returning an erased Serialize value sidesteps that.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rRegistry Adds the deserialization half of the format-agnostic PhysicalExpr serde story: - `PhysicalExprDeserialize` trait — `const TAG: &'static str` plus a `deserialize(ctx, data: serde_json::Value)` constructor. Implementers recurse on trait-object children via `ctx.registry().deserialize_value`. - `DeserializeContext` — currently just bundles the registry; carries whatever child-decoding helpers grow over time. - `PhysicalExprRegistry` — builder-style (`empty().with::<T>()...`) map from tag to constructor, modeled on the optimizer's rule list. No globals, no inventory crate — explicit per-deployment construction. Duplicate-tag registration panics; the tag-collision behavior is finalized in a later commit. - `Registry::deserialize_json(&str)` — convenience entry point, parses to `serde_json::Value` and dispatches. The trait method takes `serde_json::Value` for the body, not a streaming serde Deserializer. erased_serde 0.4 sealed both Serialize and Deserializer, so the streaming bridge isn't reachable from outside the crate; using a buffered Value keeps things simple. A future PR can swap to a streaming/format-agnostic shape — the surface (Registry, trait, ctx) is designed to absorb that change. No built-in expressions are registered yet — that's C3+. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… serde Replaces the JSON-specific deserialize trait method with a format-agnostic one. The previous shape pinned `serde_json::Value` into the trait signature, which would have forced parallel methods for every other format. This commit reworks the surface so the trait is independent of the wire format: - `PhysicalExprDeserialize::deserialize(ctx: &mut DeserializeContext<'_, '_>)` — `ctx` exposes a `&mut dyn erased_serde::Deserializer<'de>` plus the registry. Leaf types call `ctx.deserialize::<Self>()`; expressions with trait-object children write a custom Visitor and use `ctx.registry().expr_seed()` for `next_value_seed`. - `PhysicalExprRegistry::deserialize` for any `serde::Deserializer`, with `deserialize_json` retained as a convenience. - The dispatch (read tag, look up in registry, drive constructor) lives in a `DeserializeSeed` chain so it composes with serde streaming. Adds the first round-trippable expression — `Column` — to validate the shape end-to-end, plus a `default_registry()` in `physical-expr` and unit tests for round-trip and unknown-tag errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the second built-in expression to the new format-agnostic serde
hook (Column was the first, in the previous commit):
- `NotExpr` opts in via `serde_tag` + `erased_serialize` and a
`PhysicalExprDeserialize` impl that uses a small new helper,
`DeserializeContext::deserialize_unary("arg")`. The helper covers
the unary-wrapper shape (a single `Arc<dyn PhysicalExpr>` field)
that's common across the built-ins, so this is the boilerplate
factored out for reuse later.
- `default_registry()` now ships with `Column` and `NotExpr`.
- Tag-collision finalization (open question from the plan): `with::<T>`
panics on duplicate registration; new `try_with::<T>` returns a
`Result` for the runtime-plugin case. Tests cover both paths.
Round-trip tests cover Column, NotExpr, and a nested NOT(NOT(a)) tree.
The remaining built-ins follow the same shape and will land in
follow-ups, in batches grouped by what they need from the rest of the
codebase (Operator, ScalarValue, arrow::DataType, etc., all need their
own serde impls before the expressions that hold them can be
registered).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New example `datafusion-examples/examples/physical_expr_serde` —
defines a custom `PhysicalExpr` (with a trait-object child),
registers it on top of the default registry, and round-trips it
through JSON. Covers the case that needs a hand-rolled Visitor.
- Library-user-guide page `physical-expr-serde.md` walks through the
trait surface, the `{tag, data}` envelope, child handling via
`expr_seed`, tag uniqueness, format choice (and how this differs
from `datafusion-proto`'s wire-stable path), and what's currently
registered. Linked from `index.rst`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Related to #21835 (no
Closes— this PR only does thePhysicalExprhalf and only registers two built-ins; full closure is multi-PR).Rationale for this change
DataFusion has no in-tree serialization on the
PhysicalExprtrait. The proto crate handles serialization with a downcast chain insidedatafusion-proto/src/physical_plan/{from_proto,to_proto}.rs, which forces every expression author to expose private state viapubaccessors so the proto crate can poke at it (#21835).PR #21929's draft branch added a
PhysicalExpr::to_protohook to fix that — but the hook is protobuf-specific. If we ever want JSON, bincode, or anything else, we'd need a parallelto_json,to_bincode, etc., one trait method per format.This PR introduces a single
serde-style hook onPhysicalExprthat works with anyserdeformat (JSON for debugging, bincode for binary, a serde adapter on top of protobuf, etc.). The shape is intentionally minimal so it can be reviewed in isolation:PhysicalExpr::serde_tag(&self) -> &'static str— default""means "not serializable".PhysicalExpr::erased_serialize(&self) -> Box<dyn erased_serde::Serialize + '_>— default returns an error sentinel.PhysicalExprDeserializetrait +PhysicalExprRegistrybuilder +DeserializeContextover&mut dyn erased_serde::Deserializer<'de>.The deserialization side runs through any
serde::Deserializer(Registry::deserialize) withdeserialize_jsonas a convenience entry point. Trait-object children recurse throughregistry.expr_seed()inside hand-rolled visitors.Wire stability across DataFusion versions is not a goal here — the
datafusion-protocrate stays the path for that. The serde hook is for in-process tooling, debugging, and intra-cluster ephemeral serialization. A future PR can write a serde adapter on top of prost so proto becomes another consumer oferased_serialize, at which point the proto crate's downcast chain can collapse.What changes are included in this PR?
PhysicalExpr(physical-expr-common):serde_taganderased_serialize, both with default impls so this is non-breaking for existing in-tree and out-of-tree implementers.impl Serialize for dyn PhysicalExprproducing a{tag, data}envelope.Arc<dyn PhysicalExpr>is automaticallySerializevia serde'srcfeature.PhysicalExprDeserializetrait +PhysicalExprRegistry(physical-expr-common). Builder-style (empty().with::<T>().with::<U>()), modeled on the optimizer's rule list. No globals, noinventorycrate (which has wasm/FFI gotchas DataFusion cares about).with::<T>()panics on duplicate tag;try_with::<T>()returnsResultfor the runtime-plugin case.Column(leaf) andNotExpr(single trait-object child, exercises theDeserializeContext::deserialize_unaryhelper).datafusion-examples/examples/physical_expr_serde— defines a customPhysicalExprwith a trait-object child, registers it on top ofdefault_registry(), and round-trips through JSON.datafusion-proto.The remaining built-ins follow the same shape and will land in follow-up PRs in batches grouped by what they need from the rest of the codebase (
Operatorfirst, thenScalarValue/arrow::DataTypeforLiteral/Cast*/Like/InList/Case).Are these changes tested?
Yes:
datafusion-physical-exprexercise: Column round-trip, NotExpr round-trip, NOT(NOT(a)) nested round-trip, unknown-tag error path,try_withduplicate-tag error,withduplicate-tag panic.cargo run --example physical_expr_serde) and prints the JSON, the round-tripped expression, and the expected error when the registry doesn't know about the custom tag.cargo check --workspaceandcargo clippy --all-targets --all-features -- -D warnings(on touched crates) clean.Are there any user-facing changes?
Yes, but additive:
PhysicalExprtrait, both with defaults. Existing implementers (in-tree and out-of-tree) compile unchanged; they just won't be serializable through the new path until they opt in.PhysicalExprRegistry,PhysicalExprDeserialize,DeserializeContext,ExprSeed, plusdefault_registry()indatafusion-physical-expr.🤖 Generated with Claude Code