[b/390590500] Airflow dates params #766
base: main
Conversation
// Airflow v1.6.0
addFullTable(out, "dag_run.csv", "select * from dag_run;");
addTableTaskWithIntervals(
I'm not sure whether we should split the query into intervals or just limit the dates. The answer depends on the data size in the DB.
In theory it can be big; I'm not sure whether that's the case in practice.
I will keep it in one file for now, because consuming multiple files for metadata may require more work.
If there is a need to split it into multiple smaller, interval-based files, I'd consider ingesting it as time-series data. We have a process for that, and the Airflow connector could be used by both the metadata and time-series loaders.
ZonedDateTime startDate;
ZonedDateTime endDate;
if (arguments.getLookbackDays() != null) {
  endDate = arguments.getStartDate(ZonedDateTime.now(ZoneOffset.UTC));
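The snippet above assigns the start-date argument to endDate: when --lookback-days is combined with --start-date, the start date becomes the end of the interval. A minimal standalone sketch of that resolution logic (class and method names here are illustrative, not the connector's actual API):

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Illustrative sketch only: resolves the three supported parameter
// combinations into a [start, end) pair with an exclusive end,
// mirroring the semantics discussed in this PR.
class DateRangeResolver {

  /** Returns {start, end}; endDate is ignored when lookbackDays is set. */
  static ZonedDateTime[] resolve(
      ZonedDateTime startDate, ZonedDateTime endDate, Integer lookbackDays, ZonedDateTime now) {
    if (lookbackDays != null) {
      // With --start-date: [start - lookback, start); otherwise [now - lookback, now).
      ZonedDateTime end = (startDate != null) ? startDate : now;
      return new ZonedDateTime[] {end.minusDays(lookbackDays), end};
    }
    // --start-date=x --end-date=y -> [x, y)
    return new ZonedDateTime[] {startDate, endDate};
  }
}
```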
why is the start date used here and not the end date?
It's combined with lookbackDays, not with lookforward*. I have a unit test covering this logic, validate_startDateAndLookbackDays_success; feel free to check it.
It is, but it's inconsistent: the interval is sometimes defined as [start, end) and sometimes as [start - x, start), so "start" can mean either the start or the end of the interval.
That's true. However, start is always used with an additional parameter:
- start + end
- start + lookback
So the combination of parameters doesn't look misleading to me. But let's keep the conversation open to collect more opinions, if any.
I'm starting to wonder if we need this case at all. I can see a use case for --lookback=7 days from now, or --start-date=last tuesday because that's when I ran the previous dump, or --start-date=last tuesday and --end-date=last saturday because I want a specific range. But why would I want something like 7 days before last tuesday?
- Can the connector be used without this case? Yes; start/end is enough to cover all ranges.
- Is there an additional benefit to this case? Yes, it's a simpler way to specify the range.
- Is the benefit significant? That's hard to say or measure (later on we will be able to measure it).

> --start-date=last tuesday because that's when I ran the previous dump

But you actually raised an interesting point. What if I run the dumper 3 times and want to cover the ranges [t1 - t2 - t3 - t4]? That doesn't look like a real use case; we just need to run the dumper once for [t1 - t4], where t4 can be now.
We could go further and think about an incremental assessment: run the dumper each day/week and append data to the existing assessment... It sounds interesting, but it's not the case for now.
"is there additional benefit for such use case? - yes, it's simpler way to specify the range."
It's not simpler if it's confusing ;-)
--start-date=x --lookback-days=y -> date range: [x-y, x). This usage is not intuitive, at least for me: usually when I provide a start date, I want all the data to be later than it, I think.
addQueryTask(
    out,
    "task_instance.csv",
    dateRange == null
        ? "select * from task_instance;"
        : "select * from task_instance "
            + String.format(
                " where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
                dateToSqlFormat(dateRange.getLeft()), dateToSqlFormat(dateRange.getRight())));
If we pre-build the filter query separately, it might be easier to read and debug.
Suggested change:

String queryFilter = "";
if (dateRange != null) {
  queryFilter =
      String.format(
          "where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP)",
          dateToSqlFormat(dateRange.getLeft()),
          dateToSqlFormat(dateRange.getRight()));
}
addQueryTask(
    out,
    "task_instance.csv",
    String.format("select * from task_instance %s;", queryFilter));
String filter =
dateRange != null
? String.format(
" where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
dateToSqlFormat(dateRange.getLeft()), dateToSqlFormat(dateRange.getRight()))
: "";
addQueryTask(out, "task_instance.csv", "select * from task_instance " + filter);
vs
addQueryTask(
out,
"task_instance.csv",
"select * from task_instance "
+ (dateRange != null
? String.format(
" where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
dateToSqlFormat(dateRange.getLeft()), dateToSqlFormat(dateRange.getRight()))
: ""));
vs
addQueryTask(
out,
"task_instance.csv",
dateRange == null
? "select * from task_instance;"
: "select * from task_instance "
+ String.format(
" where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
dateToSqlFormat(dateRange.getLeft()), dateToSqlFormat(dateRange.getRight())));
Not a big difference from my point of view. The variable looks good, but it would require introducing such a variable in several methods, or using code blocks { ... }.
So I will keep it as is.
The main difference is that you can log your filter or set a breakpoint for debugging purposes (right after the filter string is generated). With the current implementation you have to do it in addQueryTask, which is called multiple times (from different places), so it will be harder to identify the required call.
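To illustrate the point, the extraction could look like this sketch, where buildDateFilter is a hypothetical helper and the date strings stand in for the connector's dateToSqlFormat output; the returned string can be logged or inspected at a breakpoint before addQueryTask is called:

```java
// Hypothetical helper: builds the WHERE clause once so it can be logged or
// inspected before the query task is registered. Not the connector's real code.
class QueryFilterBuilder {

  static String buildDateFilter(String fromDate, String toDate) {
    if (fromDate == null || toDate == null) {
      return ""; // no date range: select everything
    }
    return String.format(
        " where end_date >= CAST('%s' as TIMESTAMP) and end_date < CAST('%s' as TIMESTAMP)",
        fromDate, toDate);
  }
}
```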
I do the calculation and log the actual time range to show the date range explicitly. That said, I'm open to ideas. There is a good example: https://grafana.com/docs/grafana/latest/dashboards/use-dashboards/#set-dashboard-time-range - does it look better?
Add date range support to the Airflow connector.
Usage:
--lookback-days=x
-> date range: [now-x, now)
--start-date=x --end-date=y
-> date range: [x, y)
--start-date=x --lookback-days=y
-> date range: [x-y, x)
If a date range is provided, SQL queries will be generated for each day in the interval.
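The per-day generation described above can be sketched as a standalone illustration; the query shape mirrors the task_instance snippets in the review, but the class and method names are hypothetical, not the connector's actual implementation:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: one query per day in [start, end), each covering
// the half-open interval [day, day + 1).
class DailyQueryGenerator {

  static List<String> dailyQueries(LocalDate start, LocalDate end) {
    List<String> queries = new ArrayList<>();
    for (LocalDate day = start; day.isBefore(end); day = day.plusDays(1)) {
      queries.add(
          String.format(
              "select * from task_instance where end_date >= CAST('%s' as TIMESTAMP)"
                  + " and end_date < CAST('%s' as TIMESTAMP);",
              day, day.plusDays(1)));
    }
    return queries;
  }
}
```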