Commit 924691c
feat(datafusion): flatten_json_properties + json_tree UDTFs (spiceai#10406)
* feat(datafusion): add flatten_json_properties and json_tree UDTFs (M1)
M1 skeleton of the `flatten_json_properties` table function from spiceai#10399 —
recursively walks a JSON-Schema-shaped document's `properties` tree and
emits one row per field with path, parent_path, name, description, type,
required, format, enum_values, and metadata columns.
Also adds `json_tree`, a schema-agnostic recursive JSON walker modeled on
DuckDB/SQLite's function of the same name (cols: key, value, type, atom,
id, parent, fullkey, path) so users have a generic alternative when their
input isn't JSON-Schema-shaped.
Both are experimental and gated behind `flatten-json-properties` and
`json-tree` Cargo features (off by default). M1 accepts only literal JSON
string arguments; per-row LATERAL invocation with a column reference lands
in M2 alongside `$ref` / `allOf` / `oneOf` / `anyOf` resolution,
`items.properties`, `additionalProperties` maps, the options struct,
cycle detection, and metrics.
Refs spiceai#10399
* feat(datafusion): complete M2-M4 for flatten_json_properties + json_tree
M1 shipped a `properties`-only skeleton behind a feature flag. This commit
lands the rest of the milestones for both functions.
M2 — Full shape coverage:
- `items.properties` — arrays of objects; leaves emit at `array.field`.
- `additionalProperties` — typed maps; `type = "map"` and children at
`map.child`.
- `allOf` / `oneOf` / `anyOf` — fields merged across branches with first-
declaration dedupe; `required` is union across branches.
- Local `$ref` resolution (JSON Pointer syntax, including `~0` / `~1`
escapes) with an active-ref set for cycle detection — cycles yield a
`kind=cycle` metric, no stack overflow.
- External `$ref` URIs — surfaced as `type = "ref"` rows with the URI
captured in `metadata`. Never dereferenced (no network / file IO).
- Options surface (named args on both UDTF and planning path):
`max_depth`, `max_rows`, `max_bytes`, `dialect`, `include_internal`,
`path_style` (`dot` or `json-pointer`).
- OpenTelemetry counters: `flatten_json_properties_invocations_total`,
`_rows_emitted_total`, `_errors_total{kind}` where kind ∈ {parse,
depth_exceeded, row_cap_hit, cycle, input_too_large}. Same set for
`json_tree` (with applicable kinds).
- Scalar UDF companion registered under the same name, returning
`List<Struct<...>>` — gives per-row / LATERAL semantics via
`UNNEST(flatten_json_properties(s.body))`.
`json_tree` brought to parity: max_depth / max_bytes options, scalar UDF
variant, cycle-independent depth cap, metrics.
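The `~0` / `~1` escape handling mentioned for local `$ref` resolution can be sketched with the standard library alone. This is a minimal illustration of RFC 6901 token decoding, not the walker's actual code; `unescape_token` and `local_ref_tokens` are hypothetical names, and percent-decoding of the URI fragment is deliberately left out.

```rust
/// Decode one JSON Pointer reference token (RFC 6901).
/// `~1` must be decoded before `~0`; otherwise `~01` would wrongly become `/`.
fn unescape_token(token: &str) -> String {
    token.replace("~1", "/").replace("~0", "~")
}

/// Split a local `$ref` such as `#/$defs/a~1b` into decoded tokens.
/// Returns `None` for anything that is not a fragment-only pointer,
/// which is how an external URI would be told apart in this sketch.
fn local_ref_tokens(reference: &str) -> Option<Vec<String>> {
    let pointer = reference.strip_prefix('#')?;
    if pointer.is_empty() {
        // A bare "#" points at the document root.
        return Some(Vec::new());
    }
    let rest = pointer.strip_prefix('/')?;
    Some(rest.split('/').map(unescape_token).collect())
}
```

A reference that yields `None` here corresponds to the external-URI case, which the commit reports as a `type = "ref"` row rather than dereferencing.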
M3 — UX + perf:
- Cookbook recipe at `examples/flatten-json-properties/` with a worked
spicepod.yaml (dataset → view via UNNEST → DuckDB acceleration →
column-level embeddings → vector_search) plus a 3-document sample.
- Bench harness at `crates/runtime/benches/flatten_json_properties.rs`
with Criterion groups for flat-schema fan-out, nested depth, and a
1k-schema catalog simulation.
M4 — Release decision:
- Feature flags dropped. Both UDTFs + UDFs register unconditionally on
every build.
Default behavior change vs M1: `include_internal` is now `false` (spec
default), so container rows (`object` / `array` / `map`) are suppressed
unless the caller opts in.
32 unit tests covering the full shape matrix, ref resolution, cycle
termination, option parsing, limit tripping, path-style variants, scalar
UDF per-row dispatch with NULLs, and UDTF plan integration.
Refs spiceai#10399
* refactor(datafusion/udtf): simplify walker per /simplify review
- Replace hand-rolled `resolve_local_ref` with `serde_json::Value::pointer`.
- Delete `collect_effective_owned` and the `Cow<'static>` lifetime-laundering
dance; everything walked lives under the walker's `&'a Value` root, so
`&'a Value` suffices. Removes two identical recursion paths and the deep
target clone on every `$ref` resolution.
- Drop the dead `depth` parameter from `collect_effective`.
- Hoist `property_fields` / `tree_fields` into static `LazyLock<Fields>`
handles so the schema isn't reallocated on every call.
- Extract `build_tree_arrays` in `json_tree` so `rows_to_batch` and the
scalar-UDF struct-array builder share one implementation.
- Borrow-not-clone for `HashSet<&str>` required / seen_names in the walker.
- Strip WHAT-style comments and task-references from the bench.
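The `LazyLock` hoist described above follows a common pattern: build the schema once, then hand out references to it. A minimal sketch, assuming a plain `Vec` of column names in place of Arrow `Fields` (the name `PROPERTY_COLUMNS` and the helper are hypothetical; the column list is the one given in the M1 description):

```rust
use std::sync::LazyLock;

/// Column names for the flattened-properties output, built once on first use.
/// The commit stores Arrow `Fields`; a plain `Vec` stands in for them here.
static PROPERTY_COLUMNS: LazyLock<Vec<&'static str>> = LazyLock::new(|| {
    vec![
        "path",
        "parent_path",
        "name",
        "description",
        "type",
        "required",
        "format",
        "enum_values",
        "metadata",
    ]
});

/// Every caller receives a view of the same allocation.
fn property_columns() -> &'static [&'static str] {
    PROPERTY_COLUMNS.as_slice()
}
```

`std::sync::LazyLock` requires Rust 1.80 or later; on older toolchains `OnceLock` gives the same effect with slightly more ceremony.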
* fix(datafusion/udtf): address PR review feedback
- Update copyright headers to 2024-2026 across the new UDTF files.
- Tighten scalar UDF signatures (`flatten_json_properties` / `json_tree`)
to accept Utf8 / LargeUtf8 / Utf8View; normalize via `cast` so non-Utf8
string columns no longer panic in `as_string_array`.
- Cap combinator / `$ref` expansion in `collect_effective` by threading a
ref-depth counter through recursion; prevents pathological chains from
bypassing `max_depth` / exhausting the stack.
- Clarify `dialect` option semantics in docs: currently only tags
invocation metrics; OpenAPI-specific walker behavior is future scope.
- `compute_type` no longer treats non-object `properties` / non-object-or-
array `items` as `object` / `array`.
- Collapse duplicate-row emission in `handle_field`: recurse once on the
original `spec` so `walk_schema`'s `seen_names` de-duplicates fields
across allOf/oneOf/anyOf / `$ref` branches.
- Document single-node-only scan for both UDTFs (cluster mode requires a
`UdtfArgs` proto variant + codec, tracked as follow-up).
- Fix three branch-local clippy `collapsible_if` errors and annotate
`emit_row`'s argument count.
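The first-declaration dedupe that `seen_names` provides across `allOf` / `oneOf` / `anyOf` branches, together with the union rule for `required`, might look roughly like the following. This is a sketch over bare field names; `merge_branches` and `union_required` are hypothetical helpers, not functions from the commit.

```rust
use std::collections::HashSet;

/// Merge field names from several combinator branches, keeping the first
/// declaration of each name and preserving declaration order.
fn merge_branches<'a>(branches: &[Vec<&'a str>]) -> Vec<&'a str> {
    let mut seen: HashSet<&'a str> = HashSet::new();
    let mut merged = Vec::new();
    for branch in branches {
        for &name in branch {
            // `insert` returns false when the name was already present.
            if seen.insert(name) {
                merged.push(name);
            }
        }
    }
    merged
}

/// `required` is the union of the names required by any branch.
fn union_required<'a>(branches: &[Vec<&'a str>]) -> HashSet<&'a str> {
    branches.iter().flatten().copied().collect()
}
```

Borrowing `&str` keys, as the refactor commit notes, avoids cloning every field name into the set.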
* fix(datafusion/udtf): address second round of PR review
- `json_tree`: root row now emits `path = NULL` (field is nullable) to
match DuckDB / SQLite `json_tree` semantics; children still carry the
parent fullkey as `path`.
- `json_tree`: array element rows now set `key = idx.to_string()` so
consumers can distinguish array siblings (previously NULL).
- `flatten_json_properties`: container fields with no walkable children
(array of primitives, map of primitives, empty object) are now emitted
as leaf rows in `include_internal = false` mode, so the field still
appears in output.
- Deny `flatten_json_properties` / `json_tree` scalar UDFs for federation
pushdown; add them to the existing `deny_list_blocks_spice_builtins`
test so regressions are caught.
README double-pipe comment was a false positive (the file already uses
single `|` with `\|` escapes inside cells).
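The two `json_tree` changes in this round (root `path = NULL`, array elements keyed by index) can be illustrated with a small stand-alone walker. The `Json` enum, `Row` struct, and function names are hypothetical stand-ins; only three of the eight columns are tracked, keys are assumed to be identifier-style, and the root `key` being NULL is an assumption carried over from SQLite's behaviour.

```rust
/// A tiny stand-in for a parsed JSON value (object keys keep insertion order).
enum Json {
    Atom(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// The subset of `json_tree` columns this sketch tracks.
#[derive(Debug, PartialEq)]
struct Row {
    key: Option<String>,
    fullkey: String,
    path: Option<String>,
}

fn walk_tree(
    value: &Json,
    key: Option<String>,
    fullkey: String,
    path: Option<String>,
    out: &mut Vec<Row>,
) {
    out.push(Row { key, fullkey: fullkey.clone(), path });
    match value {
        Json::Atom(_) => {}
        Json::Array(items) => {
            for (idx, item) in items.iter().enumerate() {
                // Array elements use their index as the key; `path` is the
                // parent's fullkey.
                let child = format!("{fullkey}[{idx}]");
                walk_tree(item, Some(idx.to_string()), child, Some(fullkey.clone()), out);
            }
        }
        Json::Object(fields) => {
            for (name, item) in fields {
                let child = format!("{fullkey}.{name}");
                walk_tree(item, Some(name.clone()), child, Some(fullkey.clone()), out);
            }
        }
    }
}

fn json_tree_rows(root: &Json) -> Vec<Row> {
    let mut out = Vec::new();
    // Root row: no key, and `path` is NULL (`None`).
    walk_tree(root, None, "$".to_string(), None, &mut out);
    out
}
```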
* fix(datafusion/udtf): address round-3 PR review
- `json_tree`: add `max_rows` option (default 1,000,000) so bounded
`max_bytes` input can't explode into unbounded row counts. Walker
records `row_cap_hit` metric when hit and truncates cleanly.
- `json_tree`: clarify module-level docs — named options are UDTF-form
only; the scalar UDF takes just the JSON argument with default caps.
- Both scalar UDFs now truncate deterministically at `i32::MAX` flattened
rows (with a `row_cap_hit` metric) instead of returning a query-level
`Execution` error on `List<Struct>` offset overflow. Preserves the
"never a query-level error" contract.
Not addressed: re-raised comments on `DataSourceExec` / cluster-mode
`UdtfExec` wrapping — documented as follow-up scope in the prior commit;
wrapping requires a new `UdtfArgs` proto variant + codec.
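The "truncate cleanly and record `row_cap_hit`" behaviour of `max_rows` can be sketched as a capped collector. `Capped` and `collect_capped` are hypothetical names, and the boolean flag stands in for the metric counter.

```rust
/// Outcome of a capped walk: the rows kept and whether the cap was hit.
struct Capped<T> {
    rows: Vec<T>,
    row_cap_hit: bool,
}

/// Collect at most `max_rows` items, stopping early instead of failing.
fn collect_capped<T>(items: impl IntoIterator<Item = T>, max_rows: usize) -> Capped<T> {
    let mut rows = Vec::new();
    let mut row_cap_hit = false;
    for item in items {
        if rows.len() == max_rows {
            // One more row would exceed the cap: record it and stop.
            row_cap_hit = true;
            break;
        }
        rows.push(item);
    }
    Capped { rows, row_cap_hit }
}
```

In this sketch an input that produces exactly `max_rows` rows does not set the flag; whether the real walker draws the boundary the same way is not stated in the commit.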
* style: cargo fmt line-wrap in flatten_json_properties scalar UDF
* fix(datafusion/udtf): bracket-quote JSON-path keys with hyphens
SQLite / DuckDB `json_tree` path shorthand only accepts identifier-style
keys; anything else must be bracket-quoted so consumers can re-parse the
`fullkey`. Previously a key like `a-b` was rendered as `$.a-b`,
which isn't a valid shorthand. Now forces bracket-quoting for keys with
any non-identifier character, and extends the existing special-character
test to cover hyphens.
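The quoting rule can be sketched as follows. The identifier test and the exact `["..."]` rendering are assumptions made for illustration (the commit only says such keys are bracket-quoted), and `is_identifier` / `push_key` are hypothetical names rather than the commit's `escape_object_key`.

```rust
/// True for keys usable in `$.key` shorthand: a letter or `_` first,
/// then letters, digits, or `_`.
fn is_identifier(key: &str) -> bool {
    let mut chars = key.chars();
    chars
        .next()
        .is_some_and(|c| c.is_ascii_alphabetic() || c == '_')
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Append one object key to a JSON path.
fn push_key(path: &str, key: &str) -> String {
    if is_identifier(key) {
        format!("{path}.{key}")
    } else {
        // Escape backslashes and quotes so the quoted key can be re-parsed.
        let escaped = key.replace('\\', "\\\\").replace('"', "\\\"");
        format!("{path}[\"{escaped}\"]")
    }
}
```

Under these assumptions, empty keys and keys starting with a digit also take the quoted form, since neither passes the identifier test.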
* fix(datafusion/udtf): switch scalar UDFs to LargeList<Struct<...>>
Copilot flagged that i32 ListArray offsets could silently truncate
results when the flattened row count across a batch exceeds i32::MAX
(only a metric signal was emitted). Silent incomplete results risk
query correctness.
Switching to LargeList (i64 offsets) makes overflow effectively
impossible with no behavior change — UNNEST works transparently on both
variants. Drops the `max_flattened_rows` truncation path entirely.
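The difference between `i32` and `i64` offsets can be shown without Arrow. The sketch below builds cumulative list offsets from per-row child counts and fails when the running total does not fit the offset type; `build_offsets` is a hypothetical helper, not an Arrow or DataFusion API.

```rust
/// Build list offsets from per-row child counts, returning `None` if the
/// running total does not fit the offset type (`i32` for `List`, `i64` for
/// `LargeList`).
fn build_offsets<O>(lengths: &[u64]) -> Option<Vec<O>>
where
    O: TryFrom<u64>,
{
    let mut offsets = Vec::with_capacity(lengths.len() + 1);
    let mut total: u64 = 0;
    // Offsets always start at zero.
    offsets.push(O::try_from(total).ok()?);
    for &len in lengths {
        total = total.checked_add(len)?;
        offsets.push(O::try_from(total).ok()?);
    }
    Some(offsets)
}
```

Two rows of two billion children each overflow `i32` offsets but fit comfortably in `i64`, which is the case the switch to `LargeList` is meant to cover.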
* style(datafusion/udtf): fix pedantic clippy + fmt errors
CI's `make lint-rust` uses `clippy::pedantic + clippy::allow_attributes +
clippy::unwrap_used + clippy::expect_used`, which surfaced:
- `#[allow(clippy::too_many_arguments)]` → `#[expect(...)]` with reason
(`#[expect]` is stable since Rust 1.81, and `clippy::allow_attributes` requires it in place of `#[allow]`).
- `doc_markdown`: backtick-wrap `UInt`, `Bool`, `Utf8`, numeric defaults,
`DuckDB`, `SQLite`, `DoS`, `DataFusion`, `OpenAPI` in module docs.
- `single_match_else` + `match_like_matches_macro`: rewrite the
`serde_json::from_str` match as `let Ok(root) = ... else { ... }`.
- `.unwrap()` on `key.chars().next()` in `escape_object_key` → `is_some_and`.
- `name.to_string()` on `&String` → `name.clone()`.
- `all_rows.len() as i64` → `i64::try_from(...).unwrap_or(i64::MAX)`
(walker caps bound the count well under i64::MAX; saturate instead of
unwrap since the lint config bans `.unwrap()`/`.expect()`).
* fix(datafusion/udtf): type-union ordering + fail-loud on offset overflow
- `compute_type`: when `"type"` is an array (JSON-Schema nullable syntax,
e.g. `["null", "string"]`), pick the first non-null entry so optional
fields classify as their real type instead of `"null"`. Falls back to
`"null"` only when it's the sole type. Extended test coverage.
- Both scalar UDFs: `i64::try_from(row_count)` now returns a
`DataFusionError::Execution` on overflow instead of saturating to
`i64::MAX`. Saturation would silently misalign `LargeList` offsets;
erroring surfaces the (unreachable-in-practice) condition loudly.
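The type-union rule for `compute_type` reduces to "first non-null entry, else `null`". A minimal sketch over string slices, where `pick_type` is a hypothetical helper and returning `None` for an empty array is an assumption of this example:

```rust
/// Choose the reported type from a JSON-Schema `type` array such as
/// `["null", "string"]`: the first non-null entry wins.
fn pick_type<'a>(types: &[&'a str]) -> Option<&'a str> {
    types
        .iter()
        .copied()
        .find(|t| *t != "null")
        // Fall back to "null" only when nothing else is listed.
        .or_else(|| types.first().copied())
}
```

Because the search is order-preserving, `["integer", "string", "null"]` classifies as `integer` in this sketch.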
* fix(datafusion/udtf): cross-walk cycle detection + batch row cap
- `walk_schema` now persists `$ref` insertion in `visited_refs` for the
duration of the tree-walk recursion, not just for a single
`collect_effective` pass. Fixes a leak where schemas like `{$defs:
{Node: {properties: {next: {$ref: #/$defs/Node}}}}, properties: {root:
{$ref: #/$defs/Node}}}` could descend past the first resolution
boundary. Tightened `local_ref_cycle_terminates` to assert stopping at
`root.next`.
- Both scalar UDFs now return a `DataFusionError::Execution` if the
accumulated cross-batch row count exceeds `SCALAR_BATCH_MAX_ROWS`
(10M). Per-document caps bound single rows, but a wide batch could
previously reach `number_rows * max_rows` in memory before returning.
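The cross-walk cycle fix amounts to keeping a `$ref` target in the active set for the whole descent beneath it. The sketch below models the `Node` example from the bullet above with a simplified schema representation; `Prop`, `Defs`, and `walk_props` are hypothetical, and only dotted paths are emitted.

```rust
use std::collections::{HashMap, HashSet};

/// A property is either a leaf or a local `$ref` to a named definition.
enum Prop {
    Leaf,
    Ref(&'static str),
}

/// Named definitions, each a list of (property name, property) pairs.
type Defs = HashMap<&'static str, Vec<(&'static str, Prop)>>;

/// Emit one dotted path per property. `active` holds every `$ref` target on
/// the current descent, so a target that is already being expanded is not
/// entered a second time.
fn walk_props(
    props: &[(&'static str, Prop)],
    prefix: &str,
    defs: &Defs,
    active: &mut HashSet<&'static str>,
    out: &mut Vec<String>,
) {
    for (name, prop) in props {
        let path = if prefix.is_empty() {
            (*name).to_string()
        } else {
            format!("{prefix}.{name}")
        };
        out.push(path.clone());
        if let Prop::Ref(target) = prop {
            // `insert` returns false when `target` is already active: a cycle.
            if active.insert(*target) {
                if let Some(children) = defs.get(target) {
                    walk_props(children, &path, defs, active, out);
                }
                // Leave the set only after the whole subtree is done.
                active.remove(target);
            }
        }
    }
}
```

With `Node.next` referring back to `Node` and a `root` property referring to `Node`, this walk emits `root` and `root.next` and then stops, matching the boundary the tightened test asserts.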
* fix(udtf): pass projection to MemorySourceConfig in json_properties and json_tree
Both UDTFs were ignoring the projection parameter in scan(), causing a
schema mismatch error when selecting specific columns (e.g. SELECT path,
name, type FROM flatten_json_properties(...)). Pass projection.cloned()
to MemorySourceConfig::try_new() so DataFusion can push column pruning
down into the scan.
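Why ignoring the projection causes a schema mismatch can be seen in a stand-alone model: the planner expects only the projected columns, so the scan must narrow both schema and data the same way. The sketch below uses plain vectors rather than DataFusion types; `project` is a hypothetical helper, and only the `Option<&Vec<usize>>` shape of the projection mirrors `scan()`.

```rust
/// Apply an optional column projection to a schema and its rows so both
/// always agree on which columns are present, and in which order.
/// Panics if an index is out of range, which a real scan would reject.
fn project(
    columns: &[&'static str],
    rows: &[Vec<String>],
    projection: Option<&Vec<usize>>,
) -> (Vec<&'static str>, Vec<Vec<String>>) {
    match projection {
        // No projection: every column, unchanged.
        None => (columns.to_vec(), rows.to_vec()),
        Some(indices) => {
            let schema = indices.iter().map(|&i| columns[i]).collect();
            let data = rows
                .iter()
                .map(|row| indices.iter().map(|&i| row[i].clone()).collect())
                .collect();
            (schema, data)
        }
    }
}
```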
* fix: format MemorySourceConfig initialization for better readability
* Tests + Lint
* fix(tests): improve error handling and assertions in JSON property tests
* fix(tests): update projection comments for clarity in JSON schema tests
---------
Co-authored-by: Viktor Yershov <viktor@spice.ai>
1 parent 4e679c0 commit 924691c
9 files changed
Lines changed: 2742 additions & 0 deletions
File tree
- crates/runtime
- benches
- src/datafusion
- udtf
- examples/flatten-json-properties