@orian orian commented Oct 13, 2025

Follow up to: #39568

Problem

Some queries we run that contain multiple JOINs fail to use indexes on the events table. As a result, they try to read multiple terabytes of data, clogging the cluster.

Internal context:
https://posthog.slack.com/archives/C019RAX2XBN/p1760353830494289

Failing query example:

EXPLAIN PLAN indexes = 1
SELECT feature AS feature, count(DISTINCT email) AS unique_users
FROM (
    SELECT ...
        FROM events LEFT OUTER JOIN (
            SELECT argMax(person_distinct_id_overrides.person_id, person_distinct_id_overrides.version) AS person_id, person_distinct_id_overrides.distinct_id AS distinct_id
            FROM person_distinct_id_overrides
            WHERE equals(person_distinct_id_overrides.team_id, 2)
            GROUP BY person_distinct_id_overrides.distinct_id 
            HAVING ifNull(equals(argMax(person_distinct_id_overrides.is_deleted, person_distinct_id_overrides.version), 0), 0) 
            SETTINGS optimize_aggregation_in_order=1) AS events__override ON equals(events.distinct_id, events__override.distinct_id)
    WHERE and(
        equals(events.team_id, 2),
        or(
            equals(events.event, '$autocapture'), 
            equals(events.event, 'drilldown')), 
        in(
            if(not(empty(events__override.distinct_id)), events__override.person_id, events.person_id), 
            (
                SELECT cohortpeople.person_id AS person_id
                FROM cohortpeople 
                WHERE <some super long expression>)),
        1, -- THIS KILLS INDEXES
        not(match(nullIf(nullIf(events.mat_pp_email, ''), 'null'), ''))
        )
    )
WHERE isNotNull(feature)
GROUP BY feature
ORDER BY unique_users DESC
LIMIT 101 OFFSET 0
SETTINGS readonly=2, max_execution_time=600, allow_experimental_object_type=1, format_csv_allow_double_quotes=0, max_ast_elements=4000000, max_expanded_ast_elements=4000000, max_bytes_before_external_group_by=0, transform_null_in=1, optimize_min_equality_disjunction_chain_length=4294967295, allow_experimental_join_condition=1;

Solution

Skip redundant parts of boolean OR expressions.

Optimizations:

  • or(expr, 0) <=> expr
  • or(expr, 0, ...) <=> or(expr, ...)
  • or(expr, 1, ...) <=> 1
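The three OR rules above can be sketched as a small constant-folding pass. This is only an illustration under stated assumptions, not the code in this PR: the `Const`, `Raw`, and `Or` classes and the `simplify_or` function are hypothetical stand-ins for the real HogQL AST, and any non-zero constant is assumed to count as true.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Const:
    """Hypothetical AST node for a numeric literal such as 0 or 1."""
    value: int


@dataclass(frozen=True)
class Raw:
    """Hypothetical AST node standing in for any non-constant expression."""
    sql: str


@dataclass(frozen=True)
class Or:
    """Hypothetical AST node for or(arg0, arg1, ...)."""
    args: Tuple["Expr", ...]


Expr = Union[Const, Raw, Or]


def simplify_or(expr: Or) -> Expr:
    """Drop constant-false operands; collapse to 1 on a constant-true operand."""
    kept = []
    for arg in expr.args:
        if isinstance(arg, Const):
            if arg.value:
                # or(expr, 1, ...) <=> 1
                return Const(1)
            # or(expr, 0, ...) <=> or(expr, ...)
            continue
        kept.append(arg)
    if not kept:
        # Every operand was a constant 0.
        return Const(0)
    if len(kept) == 1:
        # or(expr, 0) <=> expr
        return kept[0]
    return Or(tuple(kept))
```

For instance, `simplify_or(Or((Raw("a"), Const(0))))` returns `Raw("a")`, and any call whose operands include `Const(1)` returns `Const(1)`.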

Example

Input HogQL:
SELECT event FROM events WHERE 1 OR 2

Before:
SELECT event FROM events WHERE and(equals(events.team_id, 2), or(1, 1)) LIMIT 100

After:
SELECT event FROM events WHERE equals(events.team_id, 2) LIMIT 100
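To see why the whole `or(1, 1)` operand disappears from the WHERE clause, here is a rough sketch that folds constants in nested `and`/`or` calls from the inside out. It is a simplified model, not the PR's implementation: expressions are assumed to be `(op, [args])` tuples, leaves are either integer constants or opaque SQL strings, and the `simplify` function is hypothetical.

```python
def simplify(node):
    """Fold constant operands of nested ("and" | "or", [args]) nodes."""
    if not isinstance(node, tuple):
        # Leaf: an integer constant or an opaque SQL fragment string.
        return node
    op, args = node
    # The absorbing constant decides the whole expression:
    # and(..., 0) <=> 0 and or(..., 1) <=> 1.
    absorbing = 0 if op == "and" else 1
    kept = []
    for arg in map(simplify, args):
        if isinstance(arg, int):
            if bool(arg) == bool(absorbing):
                return absorbing
            # Identity constant (1 for and, 0 for or): drop it.
            continue
        kept.append(arg)
    if not kept:
        # Only identity constants were present.
        return 1 - absorbing
    return kept[0] if len(kept) == 1 else (op, kept)


# Model of the "Before" WHERE clause from the example above.
before = ("and", ["equals(events.team_id, 2)", ("or", [1, 1])])
after = simplify(before)
```

Here the inner `or(1, 1)` folds to `1`, which is then dropped as a redundant operand of the outer `and`, leaving only `equals(events.team_id, 2)` as in the "After" query.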

@orian orian requested a review from a team as a code owner October 13, 2025 20:40
@posthog-bot

Hey @orian! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.

@posthog-bot posthog-bot requested a review from a team October 13, 2025 20:41

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments


@orian orian changed the title from "feat: optimize redundant OR expressions" to "feat: optimize HogQL generated OR expressions" Oct 13, 2025
@orian orian force-pushed the pawel/feat/optimize-OR branch from 7414a78 to e98e0cd Compare October 13, 2025 20:57
orian added 3 commits October 13, 2025 22:58
expr AND 1 <=> expr
expr AND 0 <=> 0
expr OR 1 <=> 1
expr OR 0 <=> expr
and(expr, 1) <=> expr
or(expr, 1) <=> 1
and(expr0, 1, ...) <=> and(expr0, ...)
@orian orian force-pushed the pawel/feat/optimize-OR branch from e98e0cd to bd4a7fb Compare October 13, 2025 20:58
@orian orian requested a review from Gilbert09 October 13, 2025 22:58

2 participants