Skip to content

fix(clickhouse): teach auto-materialization to see property group reads#62585

Closed
andyzzhao wants to merge 2 commits into
masterfrom
andy/materialize-analyzer-property-groups
Closed

fix(clickhouse): teach auto-materialization to see property group reads#62585
andyzzhao wants to merge 2 commits into
masterfrom
andy/materialize-analyzer-property-groups

Conversation

@andyzzhao

Copy link
Copy Markdown
Contributor

Problem

The weekly materialization analyzer (_analyze in ee/clickhouse/materialized_columns/analyze.py) finds candidate properties by regex-extracting JSONExtract* calls from slow queries in query_log. That predates property groups: with propertyGroupsMode=OPTIMIZED (the cloud default), the HogQL printer reads unmaterialized properties through map columns like properties_group_custom['foo'], which contain no JSONExtract at all. The queries that most need a dedicated materialized column are exactly the ones reading fat property maps, and they are structurally invisible to the analyzer, no matter how much they scan.

This is not hypothetical: one team's recurring user-written HogQL aggregations over a handful of custom properties read about 5 PiB per week (thousands of runs each scanning hundreds of GiB of the custom-properties map), with hundreds of those runs individually clearing the analyzer's 40s and 20GB gates. Running the fixed analysis against production query logs with unchanged thresholds immediately surfaces six properties for that team alone, the top one qualifying through 133 slow queries in a week. None of them could ever be suggested today.

Changes

  • _analyze runs the same gating (bytes, rows, duration/timeout, occurrence count) over property group map accesses, extracted with a regex restricted to the group columns registered for the events table, and translates each group column back to its source column via a new PropertyGroupManager.get_group_columns_to_source_columns. Suggestions merge with the existing JSONExtract-based ones, deduplicated.
  • The hardcoded 20GB / 5M-row floors become keyword parameters with unchanged defaults. This is what makes the analyzer testable: inserting synthetic rows into system.query_log (the old test's technique, skipped for a long time) is silently ignored on current ClickHouse versions, so the new test runs real queries and opens the floors instead.

How did you test this code?

I am an agent; automated tests plus a production dry run:

  • New end-to-end test: executes real property-group-reading queries (plain equality, the has(...) ? ... : null ternary the printer emits, person map, and feature flag group), flushes logs, and asserts _analyze suggests each with the right source column. Notable test detail: every probed key and value must exist on a seeded event, since the map bloom filter indexes otherwise prune reads to zero rows, failing the analyzer's gates. A unit test covers the new registry method.
  • ee/clickhouse/materialized_columns/test/: 21 passed.
  • Production dry run of the new extraction SQL with real thresholds (40s, 20GB, more than 9 occurrences, 7 days) against query_log_archive scoped to the team above: returns 6 properties with 14 to 133 qualifying queries each. The existing JSONExtract pass returns nothing for them.

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Docs update

🤖 Agent context

Autonomy: Human-driven (agent-assisted)

  • Claude Code session, continuing a production slow-query investigation: a sweep of slow user-written HogQL surfaced recurring aggregations reading hundreds of GiB per run via properties_group_custom, with the properties unmaterialized. Traced why auto-materialization never picked them up: the compiled queries contain zero JSONExtract, so the analyzer's regex (and its query LIKE '%JSONExtract%' prefilter) can never match.
  • Verified before changing code: the weekly task and DDL machinery work (recent materializer DDL exists), the target queries clear every other analyzer gate, and the new extraction regex matches the printer's literal and ternary access shapes in query_log text.
  • The old skipped test documented why this area went untested: ClickHouse now silently ignores INSERTs into system log tables. The replacement test exercises the analyzer against genuine query log entries.

The materialization analyzer finds candidate properties by extracting
JSONExtract calls from slow queries in query_log. With property groups
enabled the printer reads unmaterialized properties through map columns
like properties_group_custom['foo'] instead, so exactly the reads that
most need a dedicated column never match and can never be suggested,
no matter how much they scan.

Run the same gating over property group map accesses, restricted to the
group columns registered for the events table, and translate each group
column back to its source column. The byte and row floors are now
parameters with unchanged defaults so tests can exercise the analyzer
against real query log entries: inserting synthetic rows into
system.query_log is silently ignored on current ClickHouse.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@andyzzhao andyzzhao self-assigned this Jun 9, 2026
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team June 9, 2026 23:43
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
ee/clickhouse/materialized_columns/analyze.py:109-164
**Duplicated SQL structure across both query passes**

The group-query block repeats the CTE definitions (`slow_query_minimum`, `exception_codes`, `min_bytes_read`, `min_read_rows`), the six WHERE filters that follow the prefilter, the HAVING clause (including the hardcoded `> 9`), and the ORDER BY / LIMIT. If a threshold or exception code needs updating — e.g. adding a third exception code, tightening the HAVING floor — it must be changed in two places. Extracting the shared gating into a CTE or a helper that emits a parameterised subquery fragment would satisfy OnceAndOnlyOnce and keep future changes localised.

### Issue 2 of 2
ee/clickhouse/materialized_columns/test/test_analyze.py:13-47
**Test covers four cases but is not parameterised**

`test_property_group_reads_suggest_materialization` exercises four distinct property group types (`custom`, ternary form of `custom`, `person_properties`, `feature_flags`) in a single assertion block. Following the team's preference for parameterised tests, each case (query pattern + expected `source_column` + expected `property_name`) could be a separate parameter. This would let each case fail independently, simplify debugging, and make it easier to add new group types later without growing one monolithic test.

Reviews (1): Last reviewed commit: "fix(clickhouse): teach auto-materializat..." | Re-trigger Greptile

Comment on lines +109 to +164
raw_group_queries = sync_execute(
"""
WITH
{min_query_time} as slow_query_minimum,
(
159, -- TIMEOUT EXCEEDED
160, -- TOO SLOW (estimated query execution time)
) as exception_codes,
{min_bytes_read} as min_bytes_read,
{min_read_rows} as min_read_rows
SELECT
group_access[1] as column,
group_access[2] as prop_to_materialize
FROM
(
SELECT
arrayJoin(
extractAllGroups(query, '({group_column_alternation})\\[\\'([a-zA-Z0-9_\\-\\.\\$\\/\\ ]*?)\\'\\]')
) as group_access,
exception_code,
query_duration_ms
FROM
clusterAllReplicas({cluster}, system, query_log)
WHERE
query_start_time > now() - toIntervalHour({since})
and ({group_column_prefilter})
and type > 1
and is_initial_query
and JSONExtractString(log_comment, 'access_method') != 'personal_api_key' -- API requests failing is less painful than queries in the interface
and JSONExtractString(log_comment, 'kind') != 'celery'
and JSONExtractInt(log_comment, 'team_id') != 0
and query not like '%person_distinct_id2%' -- Old style person properties that are joined, no need to optimize those queries
and read_bytes > min_bytes_read
and (exception_code IN exception_codes OR query_duration_ms > slow_query_minimum)
and read_rows > min_read_rows
{team_id_filter}
)
GROUP BY
1, 2
HAVING
countIf(exception_code IN exception_codes) > 0 OR countIf(query_duration_ms > slow_query_minimum) > 9
ORDER BY
countIf(exception_code IN exception_codes) DESC,
countIf(query_duration_ms > slow_query_minimum) DESC
LIMIT 100 -- Make sure we don't add 100s of columns in one run
""".format(
since=since_hours_ago,
min_query_time=min_query_time,
team_id_filter=f"and JSONExtractInt(log_comment, 'team_id') = {team_id}" if team_id else "",
cluster=CLICKHOUSE_CLUSTER,
group_column_alternation=group_column_alternation,
group_column_prefilter=group_column_prefilter,
min_bytes_read=min_bytes_read,
min_read_rows=min_read_rows,
),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicated SQL structure across both query passes

The group-query block repeats the CTE definitions (slow_query_minimum, exception_codes, min_bytes_read, min_read_rows), the six WHERE filters that follow the prefilter, the HAVING clause (including the hardcoded > 9), and the ORDER BY / LIMIT. If a threshold or exception code needs updating — e.g. adding a third exception code, tightening the HAVING floor — it must be changed in two places. Extracting the shared gating into a CTE or a helper that emits a parameterised subquery fragment would satisfy OnceAndOnlyOnce and keep future changes localised.

Prompt To Fix With AI
This is a comment left during a code review.
Path: ee/clickhouse/materialized_columns/analyze.py
Line: 109-164

Comment:
**Duplicated SQL structure across both query passes**

The group-query block repeats the CTE definitions (`slow_query_minimum`, `exception_codes`, `min_bytes_read`, `min_read_rows`), the six WHERE filters that follow the prefilter, the HAVING clause (including the hardcoded `> 9`), and the ORDER BY / LIMIT. If a threshold or exception code needs updating — e.g. adding a third exception code, tightening the HAVING floor — it must be changed in two places. Extracting the shared gating into a CTE or a helper that emits a parameterised subquery fragment would satisfy OnceAndOnlyOnce and keep future changes localised.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +13 to +47
def test_property_group_reads_suggest_materialization(self):
# Real queries against events whose log entries read via property group map columns. Inserting
# synthetic rows into system.query_log is silently ignored on current ClickHouse versions, so the
# analyzer is exercised against genuine query log entries with the gating thresholds opened up.
# Every probed key must exist on at least one event: the map-key bloom filter otherwise prunes the
# scan to zero rows read, which fails the analyzer's read_rows/read_bytes gates even when opened to 0.
_create_event(
team=self.team,
distinct_id="d1",
event="e",
properties={"materialize_me_group": "x", "mat_group_ternary": "y2", "$feature/my-flag": "true"},
person_properties={"mat_person_group": "z"},
)
flush_persons_and_events()

group_read_queries = [
f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND properties_group_custom['materialize_me_group'] = 'x'",
f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND if(has(properties_group_custom, 'mat_group_ternary'), properties_group_custom['mat_group_ternary'], NULL) != 'y'",
f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND person_properties_map_custom['mat_person_group'] = 'z'",
f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND properties_group_feature_flags['$feature/my-flag'] = 'true'",
]
with tags_context(team_id=self.team.pk):
for query in group_read_queries:
for _ in range(10):
sync_execute(query)
sync_execute("SYSTEM FLUSH LOGS")

suggestions = set(
_analyze(since_hours_ago=1, min_query_time=-1, team_id=self.team.pk, min_bytes_read=0, min_read_rows=0)
)

assert ("events", "properties", "materialize_me_group") in suggestions
assert ("events", "properties", "mat_group_ternary") in suggestions
assert ("events", "person_properties", "mat_person_group") in suggestions
assert ("events", "properties", "$feature/my-flag") in suggestions

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Test covers four cases but is not parameterised

test_property_group_reads_suggest_materialization exercises four distinct property group types (custom, ternary form of custom, person_properties, feature_flags) in a single assertion block. Following the team's preference for parameterised tests, each case (query pattern + expected source_column + expected property_name) could be a separate parameter. This would let each case fail independently, simplify debugging, and make it easier to add new group types later without growing one monolithic test.

Prompt To Fix With AI
This is a comment left during a code review.
Path: ee/clickhouse/materialized_columns/test/test_analyze.py
Line: 13-47

Comment:
**Test covers four cases but is not parameterised**

`test_property_group_reads_suggest_materialization` exercises four distinct property group types (`custom`, ternary form of `custom`, `person_properties`, `feature_flags`) in a single assertion block. Following the team's preference for parameterised tests, each case (query pattern + expected `source_column` + expected `property_name`) could be a separate parameter. This would let each case fail independently, simplify debugging, and make it easier to add new group types later without growing one monolithic test.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

🎭 Playwright report · View test results →

⚠️ 2 flaky tests:

  • Add a new person property (chromium)
  • Save an insight, make changes, discard them, and save a copy (chromium)

These issues are not necessarily caused by your changes.
Annoyed by this comment? Help fix flakies and failures and it'll disappear!

@andyzzhao andyzzhao closed this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant