fix(clickhouse): teach auto-materialization to see property group reads by andyzzhao · Pull Request #62585 · PostHog/posthog

andyzzhao · 2026-06-09T23:43:38Z

Problem

The weekly materialization analyzer (_analyze in ee/clickhouse/materialized_columns/analyze.py) finds candidate properties by regex-extracting JSONExtract* calls from slow queries in query_log. That predates property groups: with propertyGroupsMode=OPTIMIZED (the cloud default), the HogQL printer reads unmaterialized properties through map columns like properties_group_custom['foo'], which contain no JSONExtract at all. The queries that most need a dedicated materialized column are exactly the ones reading fat property maps, and they are structurally invisible to the analyzer, no matter how much they scan.
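
The blind spot can be reproduced in a few lines. This is a minimal sketch, not the analyzer's code: `JSON_EXTRACT_RE` is an assumed, simplified stand-in for the real pattern in analyze.py, and both query strings are made up for illustration.

```python
import re

# Assumed, simplified stand-in for the analyzer's JSONExtract matcher;
# the actual pattern in analyze.py may differ.
JSON_EXTRACT_RE = re.compile(r"JSONExtract\w*\((\w+), '([^']+)'")

# Illustrative queries: the first reads a property the pre-property-group way,
# the second reads the same property through a property group map column.
legacy_query = "SELECT count() FROM events WHERE JSONExtractString(properties, 'foo') = 'x'"
grouped_query = "SELECT count() FROM events WHERE properties_group_custom['foo'] = 'x'"


def json_extract_reads(query: str) -> list[tuple[str, str]]:
    """Return (source_column, property_name) pairs found via JSONExtract* calls."""
    return JSON_EXTRACT_RE.findall(query)


print(json_extract_reads(legacy_query))   # [('properties', 'foo')]
print(json_extract_reads(grouped_query))  # []
```

The second query reads the same property, but nothing in its text names `JSONExtract`, so a matcher of this shape returns nothing for it regardless of how much data the query scans.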

This is not hypothetical: one team's recurring user-written HogQL aggregations over a handful of custom properties read about 5 PiB per week (thousands of runs each scanning hundreds of GiB of the custom-properties map), with hundreds of those runs individually clearing the analyzer's 40s and 20GB gates. Running the fixed analysis against production query logs with unchanged thresholds immediately surfaces six properties for that team alone, the top one qualifying through 133 slow queries in a week. None of them could ever be suggested today.

Changes

- _analyze runs the same gating (bytes, rows, duration/timeout, occurrence count) over property group map accesses, extracted with a regex restricted to the group columns registered for the events table, and translates each group column back to its source column via a new PropertyGroupManager.get_group_columns_to_source_columns. Suggestions merge with the existing JSONExtract-based ones, deduplicated.
- The hardcoded 20GB / 5M-row floors become keyword parameters with unchanged defaults. This is what makes the analyzer testable: inserting synthetic rows into system.query_log (the old test's technique, skipped for a long time) is silently ignored on current ClickHouse versions, so the new test runs real queries and opens the floors instead.
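
The extraction and translation step can be sketched in Python outside ClickHouse. This is an illustration under assumptions, not the PR's implementation: `GROUP_TO_SOURCE` is a hypothetical hard-coded mapping (the PR obtains it from PropertyGroupManager.get_group_columns_to_source_columns, whose signature is not shown here), and the helper names are invented. The key character class mirrors the one in the PR's extractAllGroups pattern.

```python
import re

# Hypothetical registry contents: group column -> source column. The three
# pairs below are the ones the PR's test asserts on.
GROUP_TO_SOURCE = {
    "properties_group_custom": "properties",
    "properties_group_feature_flags": "properties",
    "person_properties_map_custom": "person_properties",
}


def build_group_access_regex(group_columns):
    """Match column['key'] only for the registered group columns."""
    alternation = "|".join(re.escape(column) for column in sorted(group_columns))
    return re.compile(r"(" + alternation + r")\['([a-zA-Z0-9_\-.$/ ]*?)'\]")


def suggest_from_query(query, pattern):
    """Translate each matched group column back to its source column."""
    return {
        ("events", GROUP_TO_SOURCE[column], key)
        for column, key in pattern.findall(query)
    }


pattern = build_group_access_regex(GROUP_TO_SOURCE)
ternary = (
    "SELECT count() FROM events WHERE "
    "if(has(properties_group_custom, 'mat_group_ternary'), "
    "properties_group_custom['mat_group_ternary'], NULL) != 'y'"
)
print(suggest_from_query(ternary, pattern))
# {('events', 'properties', 'mat_group_ternary')}
```

Restricting the alternation to registered columns keeps an unrelated map access such as `some_other_map['foo']` from producing a suggestion, and the set return type gives the deduplication for free.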

How did you test this code?

I am an agent; automated tests plus a production dry run:

- New end-to-end test: executes real property-group-reading queries (plain equality, the has(...) ? ... : null ternary the printer emits, person map, and feature flag group), flushes logs, and asserts _analyze suggests each with the right source column. Notable test detail: every probed key and value must exist on a seeded event, since the map bloom filter indexes otherwise prune reads to zero rows, failing the analyzer's gates. A unit test covers the new registry method.
- ee/clickhouse/materialized_columns/test/: 21 passed.
- Production dry run of the new extraction SQL with real thresholds (40s, 20GB, more than 9 occurrences, 7 days) against query_log_archive scoped to the team above: returns 6 properties with 14 to 133 qualifying queries each. The existing JSONExtract pass returns nothing for them.

Automatic notifications

Publish to changelog?
Alert Sales and Marketing teams?

Docs update

🤖 Agent context

Autonomy: Human-driven (agent-assisted)

- Claude Code session, continuing a production slow-query investigation: a sweep of slow user-written HogQL surfaced recurring aggregations reading hundreds of GiB per run via properties_group_custom, with the properties unmaterialized. Traced why auto-materialization never picked them up: the compiled queries contain zero JSONExtract, so the analyzer's regex (and its query LIKE '%JSONExtract%' prefilter) can never match.
- Verified before changing code: the weekly task and DDL machinery work (recent materializer DDL exists), the target queries clear every other analyzer gate, and the new extraction regex matches the printer's literal and ternary access shapes in query_log text.
- The old skipped test documented why this area went untested: ClickHouse now silently ignores INSERTs into system log tables. The replacement test exercises the analyzer against genuine query log entries.

The materialization analyzer finds candidate properties by extracting JSONExtract calls from slow queries in query_log. With property groups enabled the printer reads unmaterialized properties through map columns like properties_group_custom['foo'] instead, so exactly the reads that most need a dedicated column never match and can never be suggested, no matter how much they scan.

Run the same gating over property group map accesses, restricted to the group columns registered for the events table, and translate each group column back to its source column. The byte and row floors are now parameters with unchanged defaults so tests can exercise the analyzer against real query log entries: inserting synthetic rows into system.query_log is silently ignored on current ClickHouse.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

greptile-apps · 2026-06-09T23:46:57Z

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
ee/clickhouse/materialized_columns/analyze.py:109-164
**Duplicated SQL structure across both query passes**

The group-query block repeats the CTE definitions (`slow_query_minimum`, `exception_codes`, `min_bytes_read`, `min_read_rows`), the six WHERE filters that follow the prefilter, the HAVING clause (including the hardcoded `> 9`), and the ORDER BY / LIMIT. If a threshold or exception code needs updating — e.g. adding a third exception code, tightening the HAVING floor — it must be changed in two places. Extracting the shared gating into a CTE or a helper that emits a parameterised subquery fragment would satisfy OnceAndOnlyOnce and keep future changes localised.

### Issue 2 of 2
ee/clickhouse/materialized_columns/test/test_analyze.py:13-47
**Test covers four cases but is not parameterised**

`test_property_group_reads_suggest_materialization` exercises four distinct property group types (`custom`, ternary form of `custom`, `person_properties`, `feature_flags`) in a single assertion block. Following the team's preference for parameterised tests, each case (query pattern + expected `source_column` + expected `property_name`) could be a separate parameter. This would let each case fail independently, simplify debugging, and make it easier to add new group types later without growing one monolithic test.

Reviews (1): Last reviewed commit: "fix(clickhouse): teach auto-materializat..."

greptile-apps · 2026-06-09T23:47:01Z

+        raw_group_queries = sync_execute(
+            """
+WITH
+    {min_query_time} as slow_query_minimum,
+    (
+        159, -- TIMEOUT EXCEEDED
+        160, -- TOO SLOW (estimated query execution time)
+    ) as exception_codes,
+    {min_bytes_read} as min_bytes_read,
+    {min_read_rows} as min_read_rows
+SELECT
+    group_access[1] as column,
+    group_access[2] as prop_to_materialize
+FROM
+    (
+        SELECT
+            arrayJoin(
+                extractAllGroups(query, '({group_column_alternation})\\[\\'([a-zA-Z0-9_\\-\\.\\$\\/\\ ]*?)\\'\\]')
+            ) as group_access,
+            exception_code,
+            query_duration_ms
+        FROM
+            clusterAllReplicas({cluster}, system, query_log)
+        WHERE
+            query_start_time > now() - toIntervalHour({since})
+            and ({group_column_prefilter})
+            and type > 1
+            and is_initial_query
+            and JSONExtractString(log_comment, 'access_method') != 'personal_api_key' -- API requests failing is less painful than queries in the interface
+            and JSONExtractString(log_comment, 'kind') != 'celery'
+            and JSONExtractInt(log_comment, 'team_id') != 0
+            and query not like '%person_distinct_id2%' -- Old style person properties that are joined, no need to optimize those queries
+            and read_bytes > min_bytes_read
+            and (exception_code IN exception_codes OR query_duration_ms > slow_query_minimum)
+            and read_rows > min_read_rows
+            {team_id_filter}
+    )
+GROUP BY
+    1, 2
+HAVING
+    countIf(exception_code IN exception_codes) > 0 OR countIf(query_duration_ms > slow_query_minimum) > 9
+ORDER BY
+    countIf(exception_code IN exception_codes) DESC,
+    countIf(query_duration_ms > slow_query_minimum) DESC
+LIMIT 100 -- Make sure we don't add 100s of columns in one run
+            """.format(
+                since=since_hours_ago,
+                min_query_time=min_query_time,
+                team_id_filter=f"and JSONExtractInt(log_comment, 'team_id') = {team_id}" if team_id else "",
+                cluster=CLICKHOUSE_CLUSTER,
+                group_column_alternation=group_column_alternation,
+                group_column_prefilter=group_column_prefilter,
+                min_bytes_read=min_bytes_read,
+                min_read_rows=min_read_rows,
+            ),
+        )


Duplicated SQL structure across both query passes

The group-query block repeats the CTE definitions (slow_query_minimum, exception_codes, min_bytes_read, min_read_rows), the six WHERE filters that follow the prefilter, the HAVING clause (including the hardcoded > 9), and the ORDER BY / LIMIT. If a threshold or exception code needs updating — e.g. adding a third exception code, tightening the HAVING floor — it must be changed in two places. Extracting the shared gating into a CTE or a helper that emits a parameterised subquery fragment would satisfy OnceAndOnlyOnce and keep future changes localised.
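
One way to act on this suggestion is a small helper that emits the shared fragments once, so both passes interpolate the same text. The sketch below is hypothetical: `shared_gating` and `build_pass` are invented names, the query is heavily simplified (the log_comment filters, clusterAllReplicas, ORDER BY and LIMIT are omitted), and the threshold values in the usage lines are illustrative rather than the analyzer's defaults.

```python
def shared_gating(min_query_time, min_bytes_read, min_read_rows, min_slow_queries=9):
    """Build the SQL fragments both passes would share (hypothetical helper)."""
    ctes = (
        f"{int(min_query_time)} as slow_query_minimum, "
        "(159, 160) as exception_codes, "  # TIMEOUT EXCEEDED, TOO SLOW
        f"{int(min_bytes_read)} as min_bytes_read, "
        f"{int(min_read_rows)} as min_read_rows"
    )
    where = (
        "read_bytes > min_bytes_read "
        "and (exception_code IN exception_codes OR query_duration_ms > slow_query_minimum) "
        "and read_rows > min_read_rows"
    )
    having = (
        "countIf(exception_code IN exception_codes) > 0 "
        f"OR countIf(query_duration_ms > slow_query_minimum) > {int(min_slow_queries)}"
    )
    return {"ctes": ctes, "where": where, "having": having}


def build_pass(extract_expr, prefilter, gating):
    """Assemble one pass; only the extraction and the prefilter differ per pass."""
    return (
        f"WITH {gating['ctes']} "
        "SELECT access[1] as column, access[2] as prop_to_materialize "
        f"FROM (SELECT arrayJoin({extract_expr}) as access, exception_code, query_duration_ms "
        f"FROM system.query_log WHERE {prefilter} and {gating['where']}) "
        f"GROUP BY 1, 2 HAVING {gating['having']}"
    )


# Illustrative thresholds only; the extraction expressions are placeholders.
gating = shared_gating(min_query_time=40000, min_bytes_read=20 * 10**9, min_read_rows=5_000_000)
json_pass = build_pass("extractAllGroups(query, '<json pattern>')", "query like '%JSONExtract%'", gating)
group_pass = build_pass("extractAllGroups(query, '<group pattern>')", "query like '%properties_group_%'", gating)
```

With this shape, adding a third exception code or tightening the HAVING floor is a one-line change in `shared_gating`, and both passes pick it up.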


greptile-apps · 2026-06-09T23:47:02Z

+    def test_property_group_reads_suggest_materialization(self):
+        # Real queries against events whose log entries read via property group map columns. Inserting
+        # synthetic rows into system.query_log is silently ignored on current ClickHouse versions, so the
+        # analyzer is exercised against genuine query log entries with the gating thresholds opened up.
+        # Every probed key must exist on at least one event: the map-key bloom filter otherwise prunes the
+        # scan to zero rows read, which fails the analyzer's read_rows/read_bytes gates even when opened to 0.
+        _create_event(
+            team=self.team,
+            distinct_id="d1",
+            event="e",
+            properties={"materialize_me_group": "x", "mat_group_ternary": "y2", "$feature/my-flag": "true"},
+            person_properties={"mat_person_group": "z"},
+        )
+        flush_persons_and_events()
+
+        group_read_queries = [
+            f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND properties_group_custom['materialize_me_group'] = 'x'",
+            f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND if(has(properties_group_custom, 'mat_group_ternary'), properties_group_custom['mat_group_ternary'], NULL) != 'y'",
+            f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND person_properties_map_custom['mat_person_group'] = 'z'",
+            f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND properties_group_feature_flags['$feature/my-flag'] = 'true'",
+        ]
+        with tags_context(team_id=self.team.pk):
+            for query in group_read_queries:
+                for _ in range(10):
+                    sync_execute(query)
+        sync_execute("SYSTEM FLUSH LOGS")
+
+        suggestions = set(
+            _analyze(since_hours_ago=1, min_query_time=-1, team_id=self.team.pk, min_bytes_read=0, min_read_rows=0)
+        )
+
+        assert ("events", "properties", "materialize_me_group") in suggestions
+        assert ("events", "properties", "mat_group_ternary") in suggestions
+        assert ("events", "person_properties", "mat_person_group") in suggestions
+        assert ("events", "properties", "$feature/my-flag") in suggestions


Test covers four cases but is not parameterised

test_property_group_reads_suggest_materialization exercises four distinct property group types (custom, ternary form of custom, person_properties, feature_flags) in a single assertion block. Following the team's preference for parameterised tests, each case (query pattern + expected source_column + expected property_name) could be a separate parameter. This would let each case fail independently, simplify debugging, and make it easier to add new group types later without growing one monolithic test.
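
A case-table version of the test could look roughly like the sketch below. It is self-contained so it can run without ClickHouse: `fake_analyze` is a hypothetical regex-based stand-in for _analyze, and the standard library's `subTest` is used in place of whichever parameterisation decorator the team prefers. The queries and expected tuples are the ones from the test in the diff.

```python
import re
import unittest

# Group column -> source column, as asserted by the test in the diff.
GROUP_TO_SOURCE = {
    "properties_group_custom": "properties",
    "properties_group_feature_flags": "properties",
    "person_properties_map_custom": "person_properties",
}
ACCESS_RE = re.compile(r"(" + "|".join(GROUP_TO_SOURCE) + r")\['([^']*)'\]")


def fake_analyze(queries):
    """Hypothetical stand-in for _analyze: no query_log, no gating, regex only."""
    return {
        ("events", GROUP_TO_SOURCE[column], key)
        for query in queries
        for column, key in ACCESS_RE.findall(query)
    }


# One row per case: (name, WHERE predicate, expected suggestion).
CASES = [
    ("custom_equality", "properties_group_custom['materialize_me_group'] = 'x'",
     ("events", "properties", "materialize_me_group")),
    ("custom_ternary",
     "if(has(properties_group_custom, 'mat_group_ternary'), "
     "properties_group_custom['mat_group_ternary'], NULL) != 'y'",
     ("events", "properties", "mat_group_ternary")),
    ("person_map", "person_properties_map_custom['mat_person_group'] = 'z'",
     ("events", "person_properties", "mat_person_group")),
    ("feature_flag", "properties_group_feature_flags['$feature/my-flag'] = 'true'",
     ("events", "properties", "$feature/my-flag")),
]


class TestPropertyGroupSuggestions(unittest.TestCase):
    def test_each_group_read_is_suggested(self):
        for name, predicate, expected in CASES:
            # Each case is reported separately if it fails.
            with self.subTest(name):
                query = f"SELECT count() FROM events WHERE {predicate}"
                self.assertIn(expected, fake_analyze([query]))
```

Adding a new group type then means appending one row to `CASES`. In the real test the seeding, query execution and SYSTEM FLUSH LOGS steps would still be needed per case, which may make a fully split test slower than the single-block version.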


Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-06-10T00:03:42Z

🎭 Playwright report · View test results →

⚠️ 2 flaky tests:

Add a new person property (chromium)
Save an insight, make changes, discard them, and save a copy (chromium)

These issues are not necessarily caused by your changes.
Annoyed by this comment? Help fix flakies and failures and it'll disappear!

andyzzhao self-assigned this Jun 9, 2026

assign-reviewers-posthog Bot requested a review from a team June 9, 2026 23:43

greptile-apps Bot reviewed Jun 9, 2026


fix(clickhouse): satisfy ty on analyzer suggestion types

7c82602

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

andyzzhao closed this Jun 12, 2026

andyzzhao wants to merge 2 commits into master from andy/materialize-analyzer-property-groups
