fix(clickhouse): teach auto-materialization to see property group reads#62585
fix(clickhouse): teach auto-materialization to see property group reads#62585andyzzhao wants to merge 2 commits into
Conversation
The materialization analyzer finds candidate properties by extracting JSONExtract calls from slow queries in query_log. With property groups enabled the printer reads unmaterialized properties through map columns like properties_group_custom['foo'] instead, so exactly the reads that most need a dedicated column never match and can never be suggested, no matter how much they scan. Run the same gating over property group map accesses, restricted to the group columns registered for the events table, and translate each group column back to its source column. The byte and row floors are now parameters with unchanged defaults so tests can exercise the analyzer against real query log entries: inserting synthetic rows into system.query_log is silently ignored on current ClickHouse. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
ee/clickhouse/materialized_columns/analyze.py:109-164
**Duplicated SQL structure across both query passes**
The group-query block repeats the CTE definitions (`slow_query_minimum`, `exception_codes`, `min_bytes_read`, `min_read_rows`), the six WHERE filters that follow the prefilter, the HAVING clause (including the hardcoded `> 9`), and the ORDER BY / LIMIT. If a threshold or exception code needs updating — e.g. adding a third exception code, tightening the HAVING floor — it must be changed in two places. Extracting the shared gating into a CTE or a helper that emits a parameterised subquery fragment would satisfy OnceAndOnlyOnce and keep future changes localised.
### Issue 2 of 2
ee/clickhouse/materialized_columns/test/test_analyze.py:13-47
**Test covers four cases but is not parameterised**
`test_property_group_reads_suggest_materialization` exercises four distinct property group types (`custom`, ternary form of `custom`, `person_properties`, `feature_flags`) in a single assertion block. Following the team's preference for parameterised tests, each case (query pattern + expected `source_column` + expected `property_name`) could be a separate parameter. This would let each case fail independently, simplify debugging, and make it easier to add new group types later without growing one monolithic test.
Reviews (1): Last reviewed commit: "fix(clickhouse): teach auto-materializat..." | Re-trigger Greptile |
| raw_group_queries = sync_execute( | ||
| """ | ||
| WITH | ||
| {min_query_time} as slow_query_minimum, | ||
| ( | ||
| 159, -- TIMEOUT EXCEEDED | ||
| 160, -- TOO SLOW (estimated query execution time) | ||
| ) as exception_codes, | ||
| {min_bytes_read} as min_bytes_read, | ||
| {min_read_rows} as min_read_rows | ||
| SELECT | ||
| group_access[1] as column, | ||
| group_access[2] as prop_to_materialize | ||
| FROM | ||
| ( | ||
| SELECT | ||
| arrayJoin( | ||
| extractAllGroups(query, '({group_column_alternation})\\[\\'([a-zA-Z0-9_\\-\\.\\$\\/\\ ]*?)\\'\\]') | ||
| ) as group_access, | ||
| exception_code, | ||
| query_duration_ms | ||
| FROM | ||
| clusterAllReplicas({cluster}, system, query_log) | ||
| WHERE | ||
| query_start_time > now() - toIntervalHour({since}) | ||
| and ({group_column_prefilter}) | ||
| and type > 1 | ||
| and is_initial_query | ||
| and JSONExtractString(log_comment, 'access_method') != 'personal_api_key' -- API requests failing is less painful than queries in the interface | ||
| and JSONExtractString(log_comment, 'kind') != 'celery' | ||
| and JSONExtractInt(log_comment, 'team_id') != 0 | ||
| and query not like '%person_distinct_id2%' -- Old style person properties that are joined, no need to optimize those queries | ||
| and read_bytes > min_bytes_read | ||
| and (exception_code IN exception_codes OR query_duration_ms > slow_query_minimum) | ||
| and read_rows > min_read_rows | ||
| {team_id_filter} | ||
| ) | ||
| GROUP BY | ||
| 1, 2 | ||
| HAVING | ||
| countIf(exception_code IN exception_codes) > 0 OR countIf(query_duration_ms > slow_query_minimum) > 9 | ||
| ORDER BY | ||
| countIf(exception_code IN exception_codes) DESC, | ||
| countIf(query_duration_ms > slow_query_minimum) DESC | ||
| LIMIT 100 -- Make sure we don't add 100s of columns in one run | ||
| """.format( | ||
| since=since_hours_ago, | ||
| min_query_time=min_query_time, | ||
| team_id_filter=f"and JSONExtractInt(log_comment, 'team_id') = {team_id}" if team_id else "", | ||
| cluster=CLICKHOUSE_CLUSTER, | ||
| group_column_alternation=group_column_alternation, | ||
| group_column_prefilter=group_column_prefilter, | ||
| min_bytes_read=min_bytes_read, | ||
| min_read_rows=min_read_rows, | ||
| ), | ||
| ) |
There was a problem hiding this comment.
Duplicated SQL structure across both query passes
The group-query block repeats the CTE definitions (slow_query_minimum, exception_codes, min_bytes_read, min_read_rows), the six WHERE filters that follow the prefilter, the HAVING clause (including the hardcoded > 9), and the ORDER BY / LIMIT. If a threshold or exception code needs updating — e.g. adding a third exception code, tightening the HAVING floor — it must be changed in two places. Extracting the shared gating into a CTE or a helper that emits a parameterised subquery fragment would satisfy OnceAndOnlyOnce and keep future changes localised.
Prompt To Fix With AI
This is a comment left during a code review.
Path: ee/clickhouse/materialized_columns/analyze.py
Line: 109-164
Comment:
**Duplicated SQL structure across both query passes**
The group-query block repeats the CTE definitions (`slow_query_minimum`, `exception_codes`, `min_bytes_read`, `min_read_rows`), the six WHERE filters that follow the prefilter, the HAVING clause (including the hardcoded `> 9`), and the ORDER BY / LIMIT. If a threshold or exception code needs updating — e.g. adding a third exception code, tightening the HAVING floor — it must be changed in two places. Extracting the shared gating into a CTE or a helper that emits a parameterised subquery fragment would satisfy OnceAndOnlyOnce and keep future changes localised.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| def test_property_group_reads_suggest_materialization(self): | ||
| # Real queries against events whose log entries read via property group map columns. Inserting | ||
| # synthetic rows into system.query_log is silently ignored on current ClickHouse versions, so the | ||
| # analyzer is exercised against genuine query log entries with the gating thresholds opened up. | ||
| # Every probed key must exist on at least one event: the map-key bloom filter otherwise prunes the | ||
| # scan to zero rows read, which fails the analyzer's read_rows/read_bytes gates even when opened to 0. | ||
| _create_event( | ||
| team=self.team, | ||
| distinct_id="d1", | ||
| event="e", | ||
| properties={"materialize_me_group": "x", "mat_group_ternary": "y2", "$feature/my-flag": "true"}, | ||
| person_properties={"mat_person_group": "z"}, | ||
| ) | ||
| flush_persons_and_events() | ||
|
|
||
| group_read_queries = [ | ||
| f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND properties_group_custom['materialize_me_group'] = 'x'", | ||
| f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND if(has(properties_group_custom, 'mat_group_ternary'), properties_group_custom['mat_group_ternary'], NULL) != 'y'", | ||
| f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND person_properties_map_custom['mat_person_group'] = 'z'", | ||
| f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND properties_group_feature_flags['$feature/my-flag'] = 'true'", | ||
| ] | ||
| with tags_context(team_id=self.team.pk): | ||
| for query in group_read_queries: | ||
| for _ in range(10): | ||
| sync_execute(query) | ||
| sync_execute("SYSTEM FLUSH LOGS") | ||
|
|
||
| suggestions = set( | ||
| _analyze(since_hours_ago=1, min_query_time=-1, team_id=self.team.pk, min_bytes_read=0, min_read_rows=0) | ||
| ) | ||
|
|
||
| assert ("events", "properties", "materialize_me_group") in suggestions | ||
| assert ("events", "properties", "mat_group_ternary") in suggestions | ||
| assert ("events", "person_properties", "mat_person_group") in suggestions | ||
| assert ("events", "properties", "$feature/my-flag") in suggestions |
There was a problem hiding this comment.
Test covers four cases but is not parameterised
test_property_group_reads_suggest_materialization exercises four distinct property group types (custom, ternary form of custom, person_properties, feature_flags) in a single assertion block. Following the team's preference for parameterised tests, each case (query pattern + expected source_column + expected property_name) could be a separate parameter. This would let each case fail independently, simplify debugging, and make it easier to add new group types later without growing one monolithic test.
Prompt To Fix With AI
This is a comment left during a code review.
Path: ee/clickhouse/materialized_columns/test/test_analyze.py
Line: 13-47
Comment:
**Test covers four cases but is not parameterised**
`test_property_group_reads_suggest_materialization` exercises four distinct property group types (`custom`, ternary form of `custom`, `person_properties`, `feature_flags`) in a single assertion block. Following the team's preference for parameterised tests, each case (query pattern + expected `source_column` + expected `property_name`) could be a separate parameter. This would let each case fail independently, simplify debugging, and make it easier to add new group types later without growing one monolithic test.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
🎭 Playwright report · View test results →
These issues are not necessarily caused by your changes. |
Problem
The weekly materialization analyzer (
_analyzeinee/clickhouse/materialized_columns/analyze.py) finds candidate properties by regex-extractingJSONExtract*calls from slow queries inquery_log. That predates property groups: withpropertyGroupsMode=OPTIMIZED(the cloud default), the HogQL printer reads unmaterialized properties through map columns likeproperties_group_custom['foo'], which contain noJSONExtractat all. The queries that most need a dedicated materialized column are exactly the ones reading fat property maps, and they are structurally invisible to the analyzer, no matter how much they scan.This is not hypothetical: one team's recurring user-written HogQL aggregations over a handful of custom properties read about 5 PiB per week (thousands of runs each scanning hundreds of GiB of the custom-properties map), with hundreds of those runs individually clearing the analyzer's 40s and 20GB gates. Running the fixed analysis against production query logs with unchanged thresholds immediately surfaces six properties for that team alone, the top one qualifying through 133 slow queries in a week. None of them could ever be suggested today.
Changes
_analyzeruns the same gating (bytes, rows, duration/timeout, occurrence count) over property group map accesses, extracted with a regex restricted to the group columns registered for the events table, and translates each group column back to its source column via a newPropertyGroupManager.get_group_columns_to_source_columns. Suggestions merge with the existing JSONExtract-based ones, deduplicated.system.query_log(the old test's technique, skipped for a long time) is silently ignored on current ClickHouse versions, so the new test runs real queries and opens the floors instead.How did you test this code?
I am an agent; automated tests plus a production dry run:
has(...) ? ... : nullternary the printer emits, person map, and feature flag group), flushes logs, and asserts_analyzesuggests each with the right source column. Notable test detail: every probed key and value must exist on a seeded event, since the map bloom filter indexes otherwise prune reads to zero rows, failing the analyzer's gates. A unit test covers the new registry method.ee/clickhouse/materialized_columns/test/: 21 passed.query_log_archivescoped to the team above: returns 6 properties with 14 to 133 qualifying queries each. The existing JSONExtract pass returns nothing for them.Automatic notifications
Docs update
🤖 Agent context
Autonomy: Human-driven (agent-assisted)
properties_group_custom, with the properties unmaterialized. Traced why auto-materialization never picked them up: the compiled queries contain zeroJSONExtract, so the analyzer's regex (and itsquery LIKE '%JSONExtract%'prefilter) can never match.query_logtext.