Skip to content

curate format-dates: Support lower/upper bounds on incomplete dates#1773

Closed
victorlin wants to merge 5 commits into
victorlin/update-format-datesfrom
victorlin/curate-format-dates-interval
Closed

curate format-dates: Support lower/upper bounds on incomplete dates#1773
victorlin wants to merge 5 commits into
victorlin/update-format-datesfrom
victorlin/curate-format-dates-interval

Conversation

@victorlin

@victorlin victorlin commented Mar 14, 2025

Copy link
Copy Markdown
Member

Description of proposed changes

See #1494 for design description

Related issue(s)

#1494

Checklist

@victorlin victorlin self-assigned this Mar 14, 2025
@codecov

codecov Bot commented Mar 14, 2025

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 70.10309% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.57%. Comparing base (0023bbc) to head (620021c).

Files with missing lines Patch % Lines
augur/curate/format_dates.py 70.10% 17 Missing and 12 partials ⚠️
Additional details and impacted files
@@                        Coverage Diff                        @@
##           victorlin/update-format-dates    #1773      +/-   ##
=================================================================
- Coverage                          73.63%   73.57%   -0.06%     
=================================================================
  Files                                 81       81              
  Lines                               8580     8663      +83     
  Branches                            1750     1781      +31     
=================================================================
+ Hits                                6318     6374      +56     
- Misses                              1968     1984      +16     
- Partials                             294      305      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread augur/curate/format_dates.py
@victorlin victorlin force-pushed the victorlin/curate-format-dates-interval branch 2 times, most recently from c5e9eda to 3390ab8 Compare March 17, 2025 22:15

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This made me wonder how the command would handle the field if --target-date-field did not have a valid date, e.g. if the date was unknown. I pushed up a test file in 0634ea1 to show the current behavior, but unclear if the behavior is desired based comments on Slack.

@victorlin victorlin Jun 4, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, empty target date values (XXXX-XX-XX after normalization) could end up being a common use case if there are reasonable min/max bounds such as (clade root date, collection date), so it's good to think about the scenarios. I've added your test file to this PR.

My thinking is that it would make sense to apply bounds to empty target date values only if min and max are both set (i.e. 620021c). If only one is set, the intended interval would be very broad as either (-∞,max] or [min,+∞) which, for downstream usage, doesn't seem that much different from leaving it as XXXX-XX-XX.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking is that it would make sense to apply bounds to empty target date values only if min and max are both set (i.e. 620021c). If only one is set, the intended interval would be very broad as either (-∞,max] or [min,+∞) which, for downstream usage, doesn't seem that much different from leaving it as XXXX-XX-XX.

Sounds reasonable to me. I briefly wondered if only min is defined if we should automatically set max to today, but probably better for user to explicitly set max to today as desired.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for making it an explicit setting. While AmbiguousDate and everything using it will impose the current date as a maximum, I'm not a fan of the behavior and see it as out of scope and prone to bugs such as #1171.

victorlin and others added 4 commits June 4, 2025 15:58
Preparing for the upcoming bounds feature which will be fundamentally
different from the current behavior.
This is fundamentally different from date format normalization and is
done using a separate code path, but can still be beneficial to include
in the format-dates command because:

1. Applying date format normalization before applying bounds ensures
   that any existing ambiguity is in a resolvable format. This is still
   achievable with a separate command but it would be up to the user to
   get the order right:

      augur curate format-dates … |
      augur curate restrict-dates …

2. With the existing functionality re-scoped to "normalizing date
   formats", the new operation of "applying bounds" arguably falls
   within the scope of date formatting.

The trade-off off with the current implementation is that more
complicated failure reporting logic is necessary to bridge the two code
paths.
Both are marked as optional when registering the arguments, however
passing neither will effectively be a no-op which was previously not
allowed when --date-fields was required.
@victorlin victorlin force-pushed the victorlin/curate-format-dates-interval branch from 3390ab8 to 2248e2a Compare June 4, 2025 23:49
@victorlin victorlin changed the base branch from master to victorlin/update-format-dates June 4, 2025 23:49
@victorlin victorlin marked this pull request as ready for review June 4, 2025 23:59

@victorlin victorlin left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR is ready for review, but I'm feeling unsure about the overall design. Starting a comment thread below, looking for opinions!

@@ -1,15 +1,30 @@
"""
Format date fields to ISO 8601 dates (YYYY-MM-DD).
Format date fields to ISO 8601-like formats.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I work on this feature, the more I feel like it should be implemented separately from the existing augur curate format-dates. In the commit message of 45dd11d I present an argument for keeping them together, but left a note at the end about the internal trade off. I'm also unsure about if the new scoping to differentiate the two functionalities now provided by augur curate format-dates is clear.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without having looked at the PR (yet), just this comment, I think I agree that this feels like a distinct operation from format-dates; one that is probably worthy of its own sub-augur curate command, something like augur curate apply-date-bounds.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that splitting this into a separate command seems clearer and a good separation of concerns, especially if we might want to add other bound options (just starting thinking that we might want something like --target-min-date "2019-12-01" to provide a specific min date rather than a date field).

The new command can always provide hint to run augur curate format-dates before applying bounds.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both! I'll close this PR, re-scope the issue, and create a separate PR with the same functionality in a new command.

@victorlin victorlin changed the title curate format-dates: Add support for date intervals curate format-dates: Support lower/upper bounds on incomplete dates Jun 5, 2025
original_date = get_numerical_date_from_value(record[args.target_date_field], fmt="%Y-%m-%d")

# Keep exact dates as-is
# Maybe worth adding a warning if the date is out of bounds?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the errors in the cases where an incomplete date is out-of-bounds, I would be surprised that an out-of-bounds complete date didn't provoke similar behavior. I think a warning is a minimum, and would ask, why not an error?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for error/warning if out of bounds even if the date is an exact date.

(Made me think that it would be useful to have a option such as --target-max-date today to warn about future dates in the metadata.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll update to error in the new PR.

@joverlee521 joverlee521 self-requested a review June 5, 2025 19:15
@victorlin

Copy link
Copy Markdown
Member Author

Closing per #1773 (comment). Will create a separate PR with the same functionality in a new command.

@victorlin victorlin closed this Jun 5, 2025
@victorlin victorlin deleted the victorlin/curate-format-dates-interval branch June 5, 2025 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

curate: Support date bounds

4 participants