-
Notifications
You must be signed in to change notification settings - Fork 2.8k
fix(clickhouse): teach auto-materialization to see property group reads #62585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,13 +1,56 @@ | ||
| import pytest | ||
| from posthog.test.base import BaseTest, ClickhouseTestMixin | ||
| from posthog.test.base import BaseTest, ClickhouseTestMixin, _create_event, flush_persons_and_events | ||
| from unittest.mock import call, patch | ||
|
|
||
| from posthog.clickhouse.client import sync_execute | ||
| from posthog.clickhouse.property_groups import property_groups | ||
| from posthog.clickhouse.query_tagging import tags_context | ||
|
|
||
| from ee.clickhouse.materialized_columns.analyze import materialize_properties_task | ||
| from ee.clickhouse.materialized_columns.analyze import _analyze, materialize_properties_task | ||
|
|
||
|
|
||
| class TestMaterializedColumnsAnalyze(ClickhouseTestMixin, BaseTest): | ||
| def test_property_group_reads_suggest_materialization(self): | ||
| # Real queries against events whose log entries read via property group map columns. Inserting | ||
| # synthetic rows into system.query_log is silently ignored on current ClickHouse versions, so the | ||
| # analyzer is exercised against genuine query log entries with the gating thresholds opened up. | ||
| # Every probed key must exist on at least one event: the map-key bloom filter otherwise prunes the | ||
| # scan to zero rows read, which fails the analyzer's read_rows/read_bytes gates even when opened to 0. | ||
| _create_event( | ||
| team=self.team, | ||
| distinct_id="d1", | ||
| event="e", | ||
| properties={"materialize_me_group": "x", "mat_group_ternary": "y2", "$feature/my-flag": "true"}, | ||
| person_properties={"mat_person_group": "z"}, | ||
| ) | ||
| flush_persons_and_events() | ||
|
|
||
| group_read_queries = [ | ||
| f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND properties_group_custom['materialize_me_group'] = 'x'", | ||
| f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND if(has(properties_group_custom, 'mat_group_ternary'), properties_group_custom['mat_group_ternary'], NULL) != 'y'", | ||
| f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND person_properties_map_custom['mat_person_group'] = 'z'", | ||
| f"SELECT count() FROM events WHERE team_id = {self.team.pk} AND properties_group_feature_flags['$feature/my-flag'] = 'true'", | ||
| ] | ||
| with tags_context(team_id=self.team.pk): | ||
| for query in group_read_queries: | ||
| for _ in range(10): | ||
| sync_execute(query) | ||
| sync_execute("SYSTEM FLUSH LOGS") | ||
|
|
||
| suggestions = set( | ||
| _analyze(since_hours_ago=1, min_query_time=-1, team_id=self.team.pk, min_bytes_read=0, min_read_rows=0) | ||
| ) | ||
|
|
||
| assert ("events", "properties", "materialize_me_group") in suggestions | ||
| assert ("events", "properties", "mat_group_ternary") in suggestions | ||
| assert ("events", "person_properties", "mat_person_group") in suggestions | ||
| assert ("events", "properties", "$feature/my-flag") in suggestions | ||
|
Comment on lines
+13
to
+47
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Prompt To Fix With AIThis is a comment left during a code review.
Path: ee/clickhouse/materialized_columns/test/test_analyze.py
Line: 13-47
Comment:
**Test covers four cases but is not parameterised**
`test_property_group_reads_suggest_materialization` exercises four distinct property group types (`custom`, ternary form of `custom`, `person_properties`, `feature_flags`) in a single assertion block. Following the team's preference for parameterised tests, each case (query pattern + expected `source_column` + expected `property_name`) could be a separate parameter. This would let each case fail independently, simplify debugging, and make it easier to add new group types later without growing one monolithic test.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time! |
||
|
|
||
| def test_group_columns_to_source_columns(self): | ||
| mapping = property_groups.get_group_columns_to_source_columns("events") | ||
| assert mapping["properties_group_custom"] == "properties" | ||
| assert mapping["person_properties_map_custom"] == "person_properties" | ||
|
|
||
| @pytest.mark.skip(reason="Test is failing for some reason") | ||
| @patch("ee.clickhouse.materialized_columns.analyze.materialize") | ||
| @patch("ee.clickhouse.materialized_columns.analyze.backfill_materialized_columns") | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The group-query block repeats the CTE definitions (
slow_query_minimum,exception_codes,min_bytes_read,min_read_rows), the six WHERE filters that follow the prefilter, the HAVING clause (including the hardcoded> 9), and the ORDER BY / LIMIT. If a threshold or exception code needs updating — e.g. adding a third exception code, tightening the HAVING floor — it must be changed in two places. Extracting the shared gating into a CTE or a helper that emits a parameterised subquery fragment would satisfy OnceAndOnlyOnce and keep future changes localised.Prompt To Fix With AI
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!