
Conversation

@curtismorales (Contributor) commented Dec 28, 2025

Description

This PR adds two new flags to the bqetl query create CLI command that will generate standard ETL for queries that run more frequently than daily (and pull from live tables):
--use-live: Creates an additional _use_live table and a view that unions the two (sketched below)
--hourly: A special case of --use-live for queries that run hourly
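
Roughly, the generated pieces fit together like this -- a sketch only, using hypothetical names my_table_v1 / my_table_use_live_v1 for the generated tables; the actual templates are in the diff below:

CREATE OR REPLACE VIEW
  `{project_id}.{dataset}.my_table`
AS
-- historical data from the daily table built off stable tables
SELECT * FROM
  `{project_id}.{dataset}.my_table_v1`
UNION ALL
-- recent data from the more frequently updated table built off live tables
SELECT * FROM
  `{project_id}.{dataset}.my_table_use_live_v1`
WHERE submission_date > (
  SELECT MAX(submission_date)
  FROM `{project_id}.{dataset}.my_table_v1`
)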

There is some minor refactoring for DRY, but most of the logic here is new and should not affect existing behavior.

A few notes:

  • This PR does not include the other portions of implementing this feature. These other pieces are almost entirely orthogonal to this one and are in separate forthcoming PRs:
    1. Correctly calculating airflow dependencies between DAGs that run at different cadences
    2. The logic for calculating and passing @submission_date-ish parameters based on the schedule
    3. Updating managed backfills to support tables at cadences other than daily
  • The generated metadata for the use_live table includes the somewhat gnarly parameter calculation that is currently in use -- this will be removed in (2)
  • I haven't added functionality here to support specifying DAGs, so for now these flags implicitly also set the --no-schedule flag. I plan to deal with this later; I don't think it's a huge deal that the user now has to add the DAG names in the metadata files
  • The logic in the various sql and metadata templates has gotten pretty long with this addition, so I would like to move all this templated stuff out of this function. I intend to do this in a later PR, but I would like to get this change in first
  • I'm adding a few more inline notes


@curtismorales requested a review from a team as a code owner on December 28, 2025 at 00:20
derived_path = path / project_id / dataset / table_name
derived_path.mkdir(parents=True)
if use_live:
    use_live_table_name = name + use_live_slug + version
Contributor Author

This results in names like my_table_v1 and my_table_use_live_v1. This is not my favorite, because the two tables can end up alphabetically separated, but having anything other than the version at the end of a table name would break our current conventions. Improving our naming standards is a larger issue that I'd prefer not to get involved in here.

Contributor

quibble: IMO _use_live reads very awkwardly. I'd be inclined to have the slug just be _live, or maybe _from_live.

Contributor Author

I considered _live but thought it could lead to confusion with the actual live tables. I like _from_live; I'll make that change.

Collaborator

fwiw there are already some queries that follow a similar pattern and do have the _live_v<n> suffix. E.g. experiment_events_live_v1, monitoring_derived.topsites_rate_fenix_release_live_v1 and more

Comment on lines +247 to +253
UNION ALL
SELECT * FROM
`{project_id}.{dataset}.{use_live_table_name}`
WHERE submission_date > (
SELECT MAX(submission_date)
FROM `{project_id}.{dataset}.{table_name}`
)"""
Contributor Author

This felt to me like the best way to combine these in the view. This does mean that if the main (daily) table is missing certain days before the most recent, that data will not be filled in by the _use_live table, but that feels appropriate -- a missing day in the stable table is a genuine problem in the data; the _use_live data is ephemeral and should not be used to paper over missing historical data

else:
    macro_file.write_text(
        reformat(
            f"""{{% macro {table_name}(use_live) %}}
Contributor Author

Because we can only partition by day or by hour, there isn't really a simple standard SQL for this case. This is what I came up with, but I'm open to other suggestions

Comment on lines +399 to +402
parameters = [
"submission_hour:DATE:{{(execution_date - macros.timedelta(hours=1)).strftime('%Y-%m-%d %h:00:00')}}",
]
destination_table = f"{use_live_table_name}${{{{(execution_date - macros.timedelta(hours=1)).strftime('%Y%m%d%h)}}}}"
Contributor Author

These will go away once the logic for calculating these parameters is in place

Contributor

note: Subtracting an hour like this isn't always necessary, especially if the hourly ETLs aren't scheduled at the top of the hour.

Comment on lines +410 to +414
parameters = [
"interval_start:DATE:{{}}",
"interval_end:DATE:{{(execution_date - macros.timedelta(hours=1).strftime('%Y-%m-%d %h:%m:%s'))}}",
]
destination_table = f"{use_live_table_name}${{{{(execution_date - macros.timedelta(hours=1)).strftime('%Y%m%d)}}}}"
Contributor Author

Same as above

Comment on lines +250 to +251
WHERE submission_date > (
SELECT MAX(submission_date)
Collaborator

This could potentially create an invalid query if the source table doesn't have submission_date. I think this is fine though. CI will show a failure in the dryrun task and it's reasonable to expect the PR creator to manually adjust the query.

Contributor

@curtismorales I suspect you used submission_date in these queries because submission_date was being used in the existing query.sql template, but I don't think that's a good approach to continue, as live/stable tables only contain submission_timestamp columns (submission_date columns are only present in derived tables).

I'd suggest taking this opportunity to consistently use submission_timestamp in the generated ETL queries and as the default partitioning column (i.e. presuming the ETL will pass through submission_timestamp, which I think is a safer bet than presuming the ETL is going to have a derived submission_date column).
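
For illustration, the live branch of the generated view might then read along these lines (a sketch only, mirroring the view template quoted above but with submission_timestamp):

UNION ALL
SELECT * FROM
  `{project_id}.{dataset}.{use_live_table_name}`
WHERE submission_timestamp > (
  SELECT MAX(submission_timestamp)
  FROM `{project_id}.{dataset}.{table_name}`
)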

+ "\n"
)
click.echo(f"Created base query in {macro_file}")
query_file = derived_path / "query.sql"
Collaborator

might be worth creating a QUERY_FILE = "query.sql" constant here, since there are multiple usages -- similar to what is done for MATERIALIZED_VIEW = "materialized_view.sql" etc.

expiration_days=30,
)
parameters = [
"interval_start:DATE:{{}}",
Collaborator

Should interval_start be set to something for now?

Comment on lines +155 to +167
@click.option(
    "--use_live",
    "--use-live",
    help=(
        """Using this option creates a query that consists of two tables with
        different schedules based on a single base query, one that runs daily
        and pulls from stable tables and another that runs more frequently and
        pulls from live tables, plus a view that unions the two tables.
        """
    ),
    default=False,
    is_flag=True,
)
Contributor

quibble: While the description explains the nuance/meaning of this --use-live option, IMO the option name by itself implies "use live rather than stable".

I'd be inclined to name the option something like --use-live-and-stable or --from-live-and-stable.

Comment on lines +168 to +182
@click.option(
    "--hourly",
    help=(
        """This option is a special case of the --use-live option for
        tables that update hourly.
        Using this option creates a query that consists of two tables with
        different schedules based on a single base query, one that runs daily
        and pulls from stable tables and another that runs hourly and
        pulls from live tables, plus a view that unions the two tables.
        """
    ),
    default=False,
    is_flag=True,
)
Contributor

issues (blocking):

  • While the description explains the nuance/meaning of this --hourly option, IMO it's unnecessarily confusing, as just based on the name I think people could reasonably expect to be able to run bqetl query create --hourly <name> and get an hourly ETL that doesn't necessarily involve a combination of live and stable table sources.
  • This --hourly option is currently making the live ETL table use hourly partitioning, but to date hourly partitioning has been used very rarely (I currently only see one ETL using hourly partitioning), and I don't think it'd be advisable to encourage hourly partitioning due to some drawbacks:
    • It precludes the ETL doing aggregation at the daily level, which is a very common use case.
    • It makes it critical to ensure the success of all hourly ETL tasks to avoid missing any data, which is problematic because frequently running ETLs are more likely to have failures (e.g. due to transient issues with BigQuery or GKE), and such ETL failures are prone to getting missed during Airflow triage, as by the time triage happens there could easily have been subsequent successful hourly DAG runs.

I'd be inclined to remove this --hourly option and have the generated live ETL simply assume an hourly schedule with daily partitioning, as that's the most common sub-daily ETL setup being used thus far.

    no_schedule = True
elif use_live:
    use_live_slug = "_use_live"
    no_schedule = True
Contributor

suggestion (non-blocking): I think it would make sense to respect the --dag argument if it's specified (for the main ETL).

AS SELECT * FROM
`{project_id}.{dataset}.{table_name}`"""

view_file.write_text(reformat(view_text) + "\n")
Contributor

suggestion (non-blocking): I know this approach of manually appending a newline is just following the convention of what the existing code was already doing, but it's worth noting that reformat() accepts a trailing_newline boolean argument specifically for that purpose.

Suggested change
- view_file.write_text(reformat(view_text) + "\n")
+ view_file.write_text(reformat(view_text, trailing_newline=True))

(ditto for the other similar usages)

if use_live:
    use_live_metadata_file = use_live_path / "metadata.yaml"
    if hourly:
        labels = {"incremental": True, "schedule": hourly}
Contributor

suggestion (blocking):

Suggested change
- labels = {"incremental": True, "schedule": hourly}
+ labels = {"incremental": True, "schedule": "hourly"}

Comment on lines +411 to +412
"interval_start:DATE:{{}}",
"interval_end:DATE:{{(execution_date - macros.timedelta(hours=1).strftime('%Y-%m-%d %h:%m:%s'))}}",
Contributor

suggestion (blocking):

Suggested change
- "interval_start:DATE:{{}}",
- "interval_end:DATE:{{(execution_date - macros.timedelta(hours=1).strftime('%Y-%m-%d %h:%m:%s'))}}",
+ "interval_start:TIMESTAMP:{{}}",
+ "interval_end:TIMESTAMP:{{(execution_date - macros.timedelta(hours=1).strftime('%Y-%m-%d %H:%M:%S'))}}",

Though I'm not sure subtracting an extra hour would necessarily always be advisable for arbitrary intervals.

"date_partition_parameter": None,
"parameters": parameters,
"destination_table": destination_table,
"query_file_path": use_live_path / "query.sql",
Contributor

suggestion (blocking): Specifying query_file_path like this isn't actually necessary and makes maintenance harder.

Suggested change
- "query_file_path": use_live_path / "query.sql",

Comment on lines +399 to +402
parameters = [
"submission_hour:DATE:{{(execution_date - macros.timedelta(hours=1)).strftime('%Y-%m-%d %h:00:00')}}",
]
destination_table = f"{use_live_table_name}${{{{(execution_date - macros.timedelta(hours=1)).strftime('%Y%m%d%h)}}}}"
Contributor

suggestion (blocking):

Suggested change
- parameters = [
- "submission_hour:DATE:{{(execution_date - macros.timedelta(hours=1)).strftime('%Y-%m-%d %h:00:00')}}",
- ]
- destination_table = f"{use_live_table_name}${{{{(execution_date - macros.timedelta(hours=1)).strftime('%Y%m%d%h)}}}}"
+ parameters = [
+ "submission_hour:TIMESTAMP:{{(execution_date - macros.timedelta(hours=1)).strftime('%Y-%m-%d %H:00:00')}}",
+ ]
+ destination_table = f"{use_live_table_name}${{{{(execution_date - macros.timedelta(hours=1)).strftime('%Y%m%d%H)}}}}"

Comment on lines +312 to +334
WHERE
  {{% if use_live %}}
  submission_timestamp >= @interval_start
  AND submission_timestamp < @interval_end
  {{% else %}}
  TIMESTAMP_TRUNC(submission_date, DAY) = @submission_date
  {{% endif %}}
{{% if use_live %}}
-- Overwrite the daily partition with a combination of new records for
-- the given interval (above) and existing records outside the given
-- interval (below)
UNION ALL
SELECT
  *
FROM
  {project_id}.{dataset}.{use_live_table_name}
WHERE
  TIMESTAMP_TRUNC(submission_date, DAY) = TIMESTAMP_TRUNC(@interval_start, DAY)
  AND (
    submission_timestamp < @interval_start
    OR submission_timestamp >= @interval_end
  )
{{% endif %}}
Contributor

  • issue: submission_date wouldn't be a timestamp.
  • suggestion (non-blocking): It would be best to quote the live ETL table name.
  • note: While this approach of only running the live ETL on new records and reusing existing records is likely better for performance, it unfortunately has the same drawbacks as when using hourly partitioning that I mentioned in my comment about the --hourly command line option. I'd be inclined to run the live ETL on all records for the date in question (see the sketch below).
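
For illustration, if the live ETL simply rebuilt the whole daily partition on each run, the live branch of the template could reduce to something like this (a sketch only; {source_live_table} is a made-up placeholder, and it assumes @interval_start is a TIMESTAMP as suggested above and that submission_timestamp is passed through):

SELECT
  *
FROM
  -- placeholder for whatever live source the base query reads from
  `{project_id}.{dataset}.{source_live_table}`
WHERE
  -- process every record for the date being rebuilt, not just the new interval
  DATE(submission_timestamp) = DATE(@interval_start)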

Comment on lines +284 to +286
FROM
{{% if use_live %}}
table_live
Contributor

suggestion (non-blocking): IMO it would be best for the generated SQL for selecting from the live table to attempt to deduplicate the records for that date.
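
For example, the live branch (the table_live placeholder above) could deduplicate roughly like this (a sketch only; it assumes the live source exposes a document_id column to deduplicate on and that @interval_start is a TIMESTAMP):

SELECT
  *
FROM
  table_live
WHERE
  DATE(submission_timestamp) = DATE(@interval_start)
-- assumption: document_id identifies a ping; keep one row per document for the date
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY document_id
  ORDER BY submission_timestamp
) = 1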

Comment on lines +285 to +289
{{% if use_live %}}
table_live
{{% else %}}
table_stable
{{% endif %}}
Contributor

quibble: I think it'd be more readable to have these placeholders match how we refer to live/stable tables.

Suggested change
- {{% if use_live %}}
- table_live
- {{% else %}}
- table_stable
- {{% endif %}}
+ {{% if use_live %}}
+ live_table
+ {{% else %}}
+ stable_table
+ {{% endif %}}
