Skip to content

fix(queries): defer expensive projection + functions follow-up#67

Merged
tonyalaribe merged 2 commits into
masterfrom
fix/wal-replay-buffer-sharing
Jun 11, 2026
Merged

fix(queries): defer expensive projection + functions follow-up#67
tonyalaribe merged 2 commits into
masterfrom
fix/wal-replay-buffer-sharing

Conversation

@tonyalaribe

Copy link
Copy Markdown
Contributor

Follow-up to #65. Adds the deferred-expensive-projection optimizer and functions fixes that landed on the branch after #65 merged.

🤖 Generated with Claude Code

…er-sharing

# Conflicts:
#	src/functions.rs
#	src/optimizers/defer_expensive_projection.rs
@tonyalaribe tonyalaribe merged commit ae5a9ab into master Jun 11, 2026
8 checks passed
@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Code Review

Note: This PR is already merged. Leaving review feedback for the record and future reference.

Overview

This follow-up to #65 delivers two performance/correctness fixes:

  1. DeferExpensiveProjection optimizer — moves expensive scalar computations (JSON building, casts, string functions) above a TopK (Sort.fetch) node so they run only on surviving rows instead of every scanned row. Directly addresses a prod OOM on wide-window span-list queries.
  2. functions.rs fixes — resolves an O(n²) regression in json_build_array, a ->> operator correctness bug for non-string Variant leaves, and to_jsonb(text[]) array elements incorrectly being JSON-sniffed.

defer_expensive_projection.rs

Correctness

  • The rewrite is logically sound: Sort(fetch) → Projection(expensive) becomes Projection(expensive) → Sort(fetch) → Projection(raw cols). Schema is preserved by forwarding Arc::clone(&proj.schema) to try_new_with_schema, keeping parent column references valid.
  • apply_order: TopDown is correct — the outermost Sort needs to be matched first, before inner rewrites run.
  • The unalias() call in out_map correctly strips outer aliases before building the column substitution map.

Potential issue — needed.is_empty() guard is opaque (line 95)

if needed.is_empty() {
    return Ok(Transformed::no(LogicalPlan::Sort(sort)));
}

This guard fires when the projection has non-trivial expressions with zero column references (e.g. SELECT now() FROM t ORDER BY ts LIMIT 5). In that case we can't build a min_proj, so bailing is correct — but the condition looks like dead code unless you know about pure-constant projections. A one-line comment would make intent clear.

Minor — O(n) deduplication (line 90)

if !needed.contains(c) { needed.push(c.clone()); }

Vec::contains is O(n) per insertion. For realistic projection widths (< ~50 columns) this is negligible, but a HashSet<Column> would be more idiomatic and correct-by-construction. Not blocking.

Test coverage

Three focused tests: happy path (defers_expensive_projection_past_topk), sort-key inlining (inlines_expensive_sort_key), and no-fetch skip (leaves_unfetched_sort_alone). A test for a sort key that directly references an input column (bypassing the projection, so out_map returns None) would round out coverage and guard against accidental regression in the transform_up no-op path.


functions.rs

O(n²) fix in json_build_array (line 966–983)

The fix is correct — columns are converted once up-front, and broadcast scalars are handled cleanly via the c.len() == 1 check in the row loop. The accompanying performance test (test_json_build_array_linear_and_broadcast) pins both the O(n) guarantee and the scalar broadcast case. Good.

->> bridge (json_to_pg_text, lines 274–306)

Logic matches Postgres semantics precisely:

  • JSON string → raw text (escape sequences resolved via serde_json parse)
  • JSON null → SQL NULL
  • number / bool / object / array → JSON literal text

The args.args.into_iter().next().unwrap() is safe given the uniform(1, ...) signature guarantee enforced by DataFusion, but .expect("json_to_pg_text: requires 1 argument") would surface a clearer panic message if the invariant ever breaks.

to_jsonb(text[]) sniff fix

The sniff_json flag correctly passes false when recursing into list elements (push_list_json always calls array_to_json_values_inner(..., false)). The regression test (test_to_jsonb_text_array_elements_stay_strings) is thorough and covers edge cases (JSON-looking strings, plain strings, nested calls via to_json).

parse_pg_format / to_char (lines 681–792)

  • DY is correctly matched before bare D (explicit starts_with(b"DY") check precedes the TOKENS linear scan and the bare-D guard).
  • The TOKENS ordering comment ("ORDER IS LOAD-BEARING") is clear and the array respects it (YYYY before YY, Month before Mon before MM, etc.).
  • Known divergences (case-sensitivity, Month/Day padding) are documented as intentional.
  • The postgres-parity test suite (test_to_char_postgres_parity) covers 20 format patterns with expected values captured from real Postgres 16 — this is the right level of coverage for a format-string parser.

VariantAwareExprPlanner — hardcoded variant-producing function names (line ~160)

matches!(name, "json_to_variant" | "variant_get" | "cast_to_variant" | ...)

This list will silently miss new variant-producing functions unless manually kept in sync. A trait method on the UDF (or a shared const VARIANT_FN_NAMES: &[&str]) would be more robust. Low priority, but worth a follow-up.


CI / Workflow

  • autoformat.yml uses github.event.head_commit.author.email != 'noreply@github.com' to skip bot commits. GitHub Actions bot commits use 41898282+github-actions[bot]@users.noreply.github.com — the email suffix matches noreply@github.com, so this guard works, but it could also inadvertently skip commits from any user whose email ends in that domain. Using github.actor != 'github-actions[bot]' is more targeted.
  • The vendor-patch sentinel check in ci.yml (grep for was_pre_optimized) is a smart guard against silent revert-on-resync.

Summary

The optimizer rewrite is well-designed and the tests confirm the before/after plan shape at both logical and physical levels. The functions.rs fixes are correct with good test coverage. Main action items:

  1. Add a comment to the needed.is_empty() guard explaining the constant-expression edge case.
  2. Consider HashSet<Column> for deduplication in the optimizer (minor).
  3. Add a test for sort keys that bypass the projection (out_map miss path).
  4. Fix autoformat.yml's bot-skip condition to use github.actor instead of email matching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant