Skip to content

fix(insights): reject invalid regex filters with a clear error#58706

Draft
sampennington wants to merge 1 commit into
masterfrom
sam/fix-invalid-regex-error
Draft

fix(insights): reject invalid regex filters with a clear error#58706
sampennington wants to merge 1 commit into
masterfrom
sam/fix-invalid-regex-error

Conversation

@sampennington
Copy link
Copy Markdown
Contributor

@sampennington sampennington commented May 17, 2026

Problem

A property filter with an invalid regex crashed the whole insight query with ClickHouse CANNOT_COMPILE_REGEXP, surfacing as a 500. Surfaced from system.query_log (exception_code = 427) as part of the effort to reduce deterministic query-builder bugs (dashboard).

Examples seen: a trailing backslash, and PCRE lookaheads that ClickHouse's RE2 engine doesn't support. These are user-input errors but were reaching ClickHouse and failing as 500s.

Changes

  • regex / not_regex property filters now validate the pattern with RE2 (the same engine ClickHouse uses) before building the query.
  • An invalid pattern raises a user-facing QueryError naming the bad pattern, instead of a ClickHouse 500.

How did you test this code?

I'm an agent. Automated tests run locally:

  • New parameterized test test_property_to_expr_invalid_regex_raises_query_error covering a trailing backslash, an unsupported lookahead, and an unbalanced paren — for both regex and not_regex.
  • Full posthog/hogql/test/test_property.py suite — 129 passed.

Publish to changelog?

no

🤖 Agent context

Authored by Claude Code (Opus 4.7). Found via system.query_log analysis (exception_code = 427, CANNOT_COMPILE_REGEXP).

Validation uses the google-re2 package (already a dependency) so it matches ClickHouse's RE2 behavior exactly — Python's re would wrongly accept PCRE-only constructs like lookahead. This converts the failure from a 500 (incident noise) to a 4xx user-input error; the query still can't run with an invalid pattern, but the classification is now correct.

Agent-authored; requires human review.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 17, 2026

🎭 Playwright didn't run on this PR — your changes touch code that could affect E2E behavior, but Playwright is opt-in via label now to keep CI cost down.

Add the run-playwright label if you want an E2E sweep before merging — CI will pick it up automatically.

Most PRs don't need this. Real regressions still get caught on master and fix-forward.

A property filter with an invalid regex (e.g. a trailing backslash, or a
PCRE lookahead that RE2 doesn't support) failed the whole insight query
with ClickHouse CANNOT_COMPILE_REGEXP — surfacing as a 500.

regex / not_regex property filters now validate the pattern with RE2 (the
same engine ClickHouse uses) and raise a user-facing QueryError naming the
bad pattern instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sampennington sampennington force-pushed the sam/fix-invalid-regex-error branch from 6106dd7 to 0ff9e49 Compare May 17, 2026 14:48
@sampennington
Copy link
Copy Markdown
Contributor Author

Automated code review

Reviewed posthog/hogql/property.py and the new test in test_property.py. The change is small, well-scoped, and does exactly what the PR description claims: converts a deterministic ClickHouse 500 (CANNOT_COMPILE_REGEXP) into a user-facing 4xx QueryError. Using google-re2 rather than Python re is the right call — it matches ClickHouse's engine so PCRE-only constructs like lookahead are correctly rejected rather than falsely accepted.

Strengths

  • Correct engine choice (google-re2, an existing dependency — confirmed in pyproject.toml:194).
  • Non-string guard in _validate_regex (property.py:430-431) means list values that survive collapsing won't blow up — they just skip validation rather than raising a confusing TypeError.
  • Parameterized test covers three distinct failure classes (trailing backslash, lookahead, unbalanced paren) across both operators.

Suggestions (non-blocking)

  • property.py:434 — error message echoes raw user input. f"Invalid regular expression: '{value}'" interpolates an arbitrary user-controlled string into the exception message. QueryError messages surface to API responses and into logs/error tracking. For a regex pattern this is low-risk, but consider truncating very long patterns (e.g. value[:200]) so a pathological multi-KB pattern can't bloat error payloads/log lines. Including the RE2 reason (str(err)) would also make the message more actionable than just echoing the pattern.

  • property.py:432 — catch is narrower than RE2 can throw. re2.compile on a malformed pattern raises re2.error, which is caught. But RE2 also rejects patterns that exceed its compiled-program size budget — the other realistic cause of CANNOT_COMPILE_REGEXP. Worth confirming an oversized-but-syntactically-valid pattern also raises re2.error; if it raises something else, that case still 500s and the fix is incomplete for it. A test with a deeply repeated pattern would confirm.

  • Performance: per-build compile. _validate_regex compiles the pattern every time _expr_to_compare_op runs and discards the compiled object. That's wasted work, but regex compilation is microseconds and query building is not a hot loop relative to the ClickHouse query that follows — fine as-is. No memoization needed; flagging only so it's a conscious choice.

  • RE2 stderr noise. google-re2 can write RE2 diagnostics to stderr on a failed compile in some builds. In practice the Python binding raises cleanly and is quiet, and any stray line is harmless log noise — not worth suppressing. Mentioning so it's not a surprise if a line appears in worker logs.

  • property.py:425-428 — comment length. Three-line docstring restating the PR rationale. Per repo convention, trim to one line (e.g. # Validate with RE2 so an invalid pattern is a 4xx, not a ClickHouse 500.); the "why" belongs in the commit/PR.

Functional note

A regex filter with a multi-element list value isn't collapsed to a scalar by the list-handling block (REGEX isn't in the EXACT/IN/ICONTAINS branches) — pre-existing behavior, untouched here, and _validate_regex safely skips it via the isinstance guard. Out of scope for this PR; noting only for awareness.

Verdict: Ship it — minor message-truncation and the oversized-pattern test would strengthen it, but nothing blocks merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant