Skip to content

fix(insights): coerce nullable string args of array-returning functions#58704

Draft
sampennington wants to merge 1 commit into
masterfrom
sam/fix-split-functions-nullable-arg
Draft

fix(insights): coerce nullable string args of array-returning functions#58704
sampennington wants to merge 1 commit into
masterfrom
sam/fix-split-functions-nullable-arg

Conversation

@sampennington
Copy link
Copy Markdown
Contributor

Problem

An insight using splitByChar (or another string-splitting function) on a nullable value crashed the whole query with ClickHouse Nested type Array(String) cannot be inside Nullable type. Surfaced from system.query_log (exception_code = 43) as part of the effort to reduce deterministic query-builder bugs (dashboard).

splitByChar and friends return Array(...). A JSON property resolves to a Nullable(String), so e.g. a breakdown on splitByChar('@', properties.$distinct_id) produces Nullable(Array(String)) — a type ClickHouse forbids.

Changes

  • The ClickHouse printer wraps a nullable string argument of array-returning string functions (splitByChar, splitByString, splitByRegexp, splitByWhitespace, splitByNonAlpha, alphaTokens, extractAllGroups, ngrams, tokens) in ifNull(arg, ''), keeping the result a plain Array.

How did you test this code?

I'm an agent. Automated tests run locally:

  • New test test_split_function_on_nullable_property executing splitByChar('@', properties.email).
  • Full posthog/hogql/printer/test/test_printer.py suite — 608 passed, 175 snapshots unchanged.

Publish to changelog?

no

🤖 Agent context

Authored by Claude Code (Opus 4.7). Found via system.query_log analysis (exception_code = 43, Nested type Array(...) cannot be inside Nullable).

Only nullable string args are wrapped (checked via the resolved arg type), so non-nullable args and non-string args like the split-count are untouched — no snapshot churn for existing split-function usage.

Agent-authored; requires human review.

splitByChar (and the other string-splitting functions) return Array(...).
Given a Nullable string argument — e.g. a breakdown on splitByChar('@',
properties.$distinct_id) — ClickHouse would produce Nullable(Array(...)),
which it rejects, failing the whole insight query.

The printer now wraps a nullable string argument of these functions in
ifNull(arg, '') so the result type stays a plain Array.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

🎭 Playwright didn't run on this PR — your changes touch code that could affect E2E behavior, but Playwright is opt-in via label now to keep CI cost down.

Add the run-playwright label if you want an E2E sweep before merging — CI will pick it up automatically.

Most PRs don't need this. Real regressions still get caught on master and fix-forward.

@sampennington
Copy link
Copy Markdown
Contributor Author

Automated code review

Reviewed posthog/hogql/printer/clickhouse.py and the new test on branch sam/fix-split-functions-nullable-arg.

Summary

The fix is correct and minimal: array-returning string functions with a Nullable(String) argument produce an illegal Nullable(Array(...)), and wrapping nullable string args in ifNull(arg, '') keeps the result a plain Array. The pattern matches existing code in the same file (the concat branch directly above it, and the resolve_constant_type usage at L215), so it is idiomatic. The nullable detection via arg.type.resolve_constant_type(self.context) + isinstance(..., ast.StringType) + .nullable is the same approach used elsewhere in the printer and is reliable.

Function-list completeness (main concern)

The new _ARRAY_RETURNING_STRING_FUNCTIONS set covers every splitBy* function plus alphaTokens/tokens/ngrams/extractAllGroups. But posthog/hogql/functions/clickhouse/strings.py registers several other array-returning string functions that have the exact same bug and are not covered:

  • extractAll (L75) — returns Array(String)
  • extractAllGroupsHorizontal (L76) — returns Array(Array(String))
  • extractAllGroupsVertical (L77) — returns Array(Array(String))
  • extractGroups (L78) — returns Array(String)

extractAllGroups is included but its three siblings are not, which is an inconsistent and likely accidental omission. A regexpExtract with multiple groups can also return an array in some ClickHouse versions, though its registered HogQL form (L107) is the scalar variant, so it can reasonably be left out. Recommend adding the four extract* functions above so the fix is complete and the set is internally consistent.

Minor

  • clickhouse.py:181 — coercion applies to any nullable string arg regardless of position. For splitByChar(sep, str) the separator (arg 0) is normally a non-nullable constant, so in practice only the string is wrapped — correct. Worth a one-line note that position-agnostic coercion is intentional (a nullable separator would also be illegal), so a future reader does not "fix" it to only target the data arg.
  • The # These return Array(...) inline comment in the new branch duplicates the comment already on the frozenset definition (L74-75). Drop one of the two.

Test

test_split_function_on_nullable_property only asserts results is not None — it confirms the query no longer crashes but does not verify the emitted SQL contains ifNull(...) or that results are correct. Consider either (a) asserting on the printed SQL (the printer test file has many assert print_ast(...) == ... examples) to lock in the ifNull wrapping, and/or (b) inserting a row with a NULL email and a non-NULL email and asserting the split output, so a regression that drops the coercion is caught. Also add a case for a non-splitByChar function (e.g. alphaTokens or one of the extract* functions) so the broader set is exercised, and ideally a non-nullable-arg case to confirm no spurious ifNull is emitted.

Verdict

Needs changes — ship-quality fix, but add the missing extract* array-returning functions and strengthen the test to assert the actual coercion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant