-
Notifications
You must be signed in to change notification settings - Fork 124
feat(DENG-10365): add mobile to events_first_seen #8647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
remove null as special criteria name and simplify join condition encode criteria as dictionary update tests
This comment has been minimized.
This comment has been minimized.
… need to overwrite entire table) update tests and metadata
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Integration report for "Merge branch 'main' into DENG-10365-add-mobile-to-events-first-seen"
|
| - sql/moz-fx-data-shared-prod/org_mozilla_firefox_derived/events_first_seen_v1 | ||
| - sql/moz-fx-data-shared-prod/org_mozilla_ios_firefox_derived/events_first_seen_v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (blocking): These two ETLs you're adding to the retention_exclusion_list config here are for the release channels of Firefox Android and Firefox iOS, but there are additional ETLs for their other channels which should presumably also be added here.
| sql: event_category = 'tabgroup' AND ((event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%') OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save')) | ||
| - name: "'linkpreview_ai_consent'" | ||
| sql: event_category = 'genai.linkpreview' AND event_name = 'card_ai_consent' AND JSON_VALUE(event_extra.option) = 'continue' | ||
| any_event: | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: Changing this criteria name value from null to any_event at this point means you'll need to manually update existing firefox_desktop_derived.events_first_seen_v1 records that currently have null criteria values when this is deployed to avoid getting extraneous any_event records for existing clients when that ETL runs with this new logic.
| depends_on_past: true | ||
| date_partition_parameter: null | ||
|
|
||
| bigquery: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): I'd recommend adding criteria as the primary clustering field, removing normalized_country_code as a clustering field (since you can only have four clustering columns), and maybe moving event_category to be the last clustering field.
| AND event_category NOT IN ( | ||
| 'media.playback', | ||
| 'nimbus_events', | ||
| 'uptake.remotecontent.result' | ||
| ) -- remove unnecessary high-volume categories to reduce cost |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quibble: Having the trailing comment was reasonable when this was all on one line, but when the expression is formatted as multiple lines IMO the trailing comment is less readable.
| AND event_category NOT IN ( | |
| 'media.playback', | |
| 'nimbus_events', | |
| 'uptake.remotecontent.result' | |
| ) -- remove unnecessary high-volume categories to reduce cost | |
| -- remove unnecessary high-volume categories to reduce cost | |
| AND event_category NOT IN ( | |
| 'media.playback', | |
| 'nimbus_events', | |
| 'uptake.remotecontent.result' | |
| ) |
| ) | ||
| ORDER BY | ||
| submission_timestamp, | ||
| COALESCE(event_timestamp, '9999-12-31 23:59:59') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): I added a document_event_number column to events_stream_v1 tables in #8596, which could be used here to break ties if multiple events have the same timestamp.
| COALESCE(event_timestamp, '9999-12-31 23:59:59') | |
| COALESCE(event_timestamp, '9999-12-31 23:59:59'), | |
| document_event_number |
| 'uptake.remotecontent.result' | ||
| ) -- remove unnecessary high-volume categories to reduce cost | ||
| {% if app_id_dataset == 'firefox_desktop' -%} | ||
| AND profile_group_id IS NOT NULL -- only include non-null IDs so as not to create repeats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Why would including null profile_group_id values create repeats? The query isn't grouping by profile_group_id, so it doesn't seem like including older event records with null profile_group_id values would cause repeats.
In fact, doesn't excluding older event records with null profile_group_id values mean this isn't returning entirely accurate "events first seen" data for Firefox Desktop? It's more like "events first seen since the profile group ID was set".
| , | ||
| _previous_cte AS ( | ||
| SELECT | ||
| first_submission_timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): These _previous_cte queries don't need to select all the data fields anymore since they're only being used for comparison purposes. I think they only need to select client_id, event, and criteria.
| AND _current.event = _previous.event | ||
| AND (_current.criteria = _previous.criteria | ||
| OR (_current.criteria IS NULL AND _previous.criteria IS NULL)) | ||
| AND criteria IS NOT DISTINCT FROM {{ ("'" ~ criteria_name ~ "'") }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): Now that criteria won't have null values, this could be written more naturally using !=:
| AND criteria IS NOT DISTINCT FROM {{ ("'" ~ criteria_name ~ "'") }} | |
| AND criteria != {{ "'" ~ criteria_name ~ "'" }} |
However, I actually think it would instead be better overall to move the UNION ALL criteria loop into the events_stream_cte CTE and have a single not-criteria-specific _previous_cte CTE and final query that selects from events_stream_cte where there isn't a matching _previous_cte record (rather than repeating that code for each criteria).
| # - firefox_ios | ||
| # - fenix | ||
| - fenix | ||
| - firefox_ios |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: How long do you estimate the init process for these new events_first_seen_v1 ETL tables will take?
| WHERE | ||
| {% raw %} | ||
| {% if is_init() %} | ||
| {% endraw %} | ||
| DATE(submission_timestamp) >= '2023-01-01' -- initialize by looking over all of history | ||
| AND sample_id >= @sample_id | ||
| AND sample_id < @sample_id + @sampling_batch_size | ||
| {% raw %} | ||
| {% else %} | ||
| {% endraw %} | ||
| DATE(submission_timestamp) = @submission_date | ||
| {% raw %} | ||
| {% endif %} | ||
| {% endraw %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quibble: I think this would make it easier to see the structure of the Jinja code being generated:
| WHERE | |
| {% raw %} | |
| {% if is_init() %} | |
| {% endraw %} | |
| DATE(submission_timestamp) >= '2023-01-01' -- initialize by looking over all of history | |
| AND sample_id >= @sample_id | |
| AND sample_id < @sample_id + @sampling_batch_size | |
| {% raw %} | |
| {% else %} | |
| {% endraw %} | |
| DATE(submission_timestamp) = @submission_date | |
| {% raw %} | |
| {% endif %} | |
| {% endraw %} | |
| WHERE | |
| {% raw %}{% if is_init() %}{% endraw %} | |
| DATE(submission_timestamp) >= '2023-01-01' -- initialize by looking over all of history | |
| AND sample_id >= @sample_id | |
| AND sample_id < @sample_id + @sampling_batch_size | |
| {% raw %}{% else %}{% endraw %} | |
| DATE(submission_timestamp) = @submission_date | |
| {% raw %}{% endif %}{% endraw %} |
Description
This PR adds mobile to
events_first_seen, optimizes the query and refactors codeRelated Tickets & Documents
Reviewer, please follow this checklist