Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[b/390590500] Airflow dates params #766

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

vladislav-sidorovich
Copy link
Collaborator

Add date range support to airflow connector.

Usage:

  1. --lookback-days=x -> date range: [now-x, now)
  2. --start-date=x --end-date=y -> date range: [x, y)
  3. -start-date=x --lookback-days=y -> date range: [x-y, x)

If date range is provided, sql queries will be generate for each day in the interval.

// Airflow v1.6.0
addFullTable(out, "dag_run.csv", "select * from dag_run;");
addTableTaskWithIntervals(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we should split the query into intervals or just limit the dates. The answer is depends on the data size in DB.
In theory it can be big, not sure if it's a case in practice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will keep it in one file for now, because consuming multiple files for metadata may require more work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there will be a need to split it into multiple smaller, interval based files I'd consider ingesting it as a time-series data. We have a process for it. Airflow connector could be use by both metadata and time-series loaders.

ZonedDateTime startDate;
ZonedDateTime endDate;
if (arguments.getLookbackDays() != null) {
endDate = arguments.getStartDate(ZonedDateTime.now(ZoneOffset.UTC));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the start date used here and not the end date?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's combined with lookbackDays not with lookforward*. I have unit-test to cover this logic validate_startDateAndLookbackDays_success, feel free to check.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is but it's inconsistent since the interval is sometimes defined as [start, end) and sometimes [start -x, start) so "start" can mean the start or end of the interval.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's true. However start alway used with an additional parameter:

  1. start + end
  2. start + lookback

So, the combination of the parameters doesn't look misleading to me. But let's keep the conversation open to collect more opinions if any.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to wonder if we need this case at all. I can see a usecase for --lookback=7 days from now, or --start-date=last tuesday because that's when I run the previous dump, or --start-date=last tuesday and --end-date=last-saturday because I want a specific range, but why would I want something like 7 days before last tuesday?

Copy link
Collaborator Author

@vladislav-sidorovich vladislav-sidorovich Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can be connector used without this case? - yes, start/end is enough to cover all ranges.
  2. is there additional benefit for such use case? - yes, it's simpler way to specify the range.
  3. is the benefit significant? - it's hard to say/measure. (later on we will be able to measure)

--start-date=last tuesday because that's when I run the previous dump

But you actually raised an interesting use case. What if I will run the dumper 3 times and I want cover the ranges [t1 - t2 - t3 - t4]? It doesn't look like a use case, we just need to run dumper once for [t1 - t4] where t4 can be now.
We can go forward and think about an incremental assessment, run dumper each day/week and append data to the existing assessment... It sounds interesting, but it's not the case for now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"is there additional benefit for such use case? - yes, it's simpler way to specify the range."
It's not simpler if it's confusing ;-)

@misolt misolt removed their request for review March 3, 2025 13:21
@kpiotrowski
Copy link
Collaborator

-start-date=x --lookback-days=y -> date range: [x-y, x)

This usage is not intuitive, at least for me. Usually when I provide start date I want all the data to be later than this. I think end-date works better with a lookback-days because lookback refers to searching N days in the past.

Comment on lines 126 to 134
addQueryTask(
out,
"task_instance.csv",
dateRange == null
? "select * from task_instance;"
: "select * from task_instance "
+ String.format(
" where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
dateToSqlFormat(dateRange.getLeft()), dateToSqlFormat(dateRange.getRight())));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If pre-build the filter query separately it might be easier to read and debug.

Suggested change
addQueryTask(
out,
"task_instance.csv",
dateRange == null
? "select * from task_instance;"
: "select * from task_instance "
+ String.format(
" where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
dateToSqlFormat(dateRange.getLeft()), dateToSqlFormat(dateRange.getRight())));
String queryFilter = "";
if (dateRange != null) {
queryFilter = String.format(
"where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
dateToSqlFormat(dateRange.getLeft()),
dateToSqlFormat(dateRange.getRight())
)
}
addQueryTask(
out,
"task_instance.csv",
String.format("select * from task_instance %s;", queryFilter)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String filter =
        dateRange != null
            ? String.format(
                " where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
                dateToSqlFormat(dateRange.getLeft()), dateToSqlFormat(dateRange.getRight()))
            : "";
addQueryTask(out, "task_instance.csv", "select * from task_instance " + filter);

vs

    addQueryTask(
        out,
        "task_instance.csv",
        "select * from task_instance "
            + (dateRange != null
                ? String.format(
                    " where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
                    dateToSqlFormat(dateRange.getLeft()), dateToSqlFormat(dateRange.getRight()))
                : ""));

vs

  addQueryTask(
        out,
        "task_instance.csv",
        dateRange == null
            ? "select * from task_instance;"
            : "select * from task_instance "
                + String.format(
                    " where end_date >= CAST( '%s' as TIMESTAMP) and end_date < CAST( '%s' as TIMESTAMP) ;",
                    dateToSqlFormat(dateRange.getLeft()), dateToSqlFormat(dateRange.getRight())));

Not a big difference from my point of view. The variable looks good, but it will require to introduce such variable for several method or use code blocks { ... }.

So, I will keep it as is.

Copy link
Collaborator

@zaldis zaldis Mar 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main difference that you can log your filter or set a breakpoint for debugging purposes (right after your filter string will be generated). With current implementation you have to do it in addQueryTask which is called multiple times (from different places). Hence it will be harder to identify required call.

@vladislav-sidorovich
Copy link
Collaborator Author

-start-date=x --lookback-days=y -> date range: [x-y, x)

This usage is not intuitive, at least for me. Usually when I provide start date I want all the data to be later than this. I think end-date works better with a lookback-days because lookback refers to searching N days in the past.

I do the calculation and log the actual time range to show the date range explicitly. I agree that start-date might be confusing. However the logic behind it is the following: I chose X date in the calendar and go back for Y days, as the result it will be a date range. In other words, there is an ability to move now.
With such logic go back from end is confusing as well.

I'm open to ideas.

There is a good example: https://grafana.com/docs/grafana/latest/dashboards/use-dashboards/#set-dashboard-time-range
For example we can rename start -> from and end - to. So, it will be: from date X look back for Y days.

does it look better?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants