feat: add TTL strategy to Argo WorkflowTemplates #3104
… use schedules

Add a configurable ttlStrategy to Argo WorkflowTemplates so completed workflows are automatically cleaned up after 7 days (matching the default Kubernetes job TTL). Configurable via the METAFLOW_ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION env var; set to 0 to disable.

Also update CronWorkflow creation to use the `schedules` array field instead of the deprecated singular `schedule` field, per the Argo Workflows deprecation notice.

Closes #1231, closes #2351
The `schedules` array field requires Argo Workflows v3.6+ and breaks older installations. Keep the singular `schedule` field, which is supported across all versions.
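The two commit messages above touch two manifest fields. As a rough sketch only (plain dicts rather than Metaflow's actual builder classes; the helper names `with_ttl` and `cron_workflow_spec` are hypothetical and not part of this PR), the resulting fragments might look like this:

```python
SEVEN_DAYS = 7 * 24 * 60 * 60  # 604800 seconds, the default named in the PR


def with_ttl(workflow_spec, seconds_after_completion=SEVEN_DAYS):
    """Return a copy of the spec with ttlStrategy set; 0 leaves it unset."""
    spec = dict(workflow_spec)
    if seconds_after_completion > 0:
        spec["ttlStrategy"] = {"secondsAfterCompletion": seconds_after_completion}
    return spec


def cron_workflow_spec(cron_expression, template_name):
    """Build a CronWorkflow spec using the singular `schedule` string field."""
    return {
        # Singular string field, kept for compatibility with older Argo versions.
        "schedule": cron_expression,
        "workflowSpec": {"workflowTemplateRef": {"name": template_name}},
    }
```

Using the singular `schedule` string trades the newer array form for compatibility with Argo Workflows installations older than v3.6, which is the concern raised in the second commit message.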
Greptile Summary

This PR adds …

Confidence Score: 5/5

Safe to merge; all findings are P2 style/suggestion level.

The implementation is correct and tested. The two findings are both P2: one is a design preference (default TTL value), and the other is a minor code quality improvement (double int() call). Neither blocks correctness.

Review the default value in metaflow/metaflow_config.py: consider whether 0 (TTL disabled, cleanup is opt-in) is a safer default than 7 days (cleanup enabled unless the user opts out).
Reviews (1): Last reviewed commit: "test: add unit tests for WorkflowSpec.tt..."
    ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION = from_conf(
        "ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION", 7 * 24 * 60 * 60
    )
Default-on TTL is a behavior change for existing users
The default of 7 days means every existing Metaflow user who upgrades will silently start having their completed Argo Workflow resources auto-deleted after 7 days. Users who rely on workflow history beyond that window (e.g. for audit, debugging, or custom tooling that queries old workflow objects) will lose it without any explicit action. A safer default might be None or 0 (disable TTL) to preserve the existing behavior, with opt-in for cleanup.
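To make the concern concrete, here is a small sketch of how the chosen default plays out for a user who upgrades without setting the variable. The function `resolve_ttl` is a hypothetical stand-in for illustration; it does not reproduce Metaflow's `from_conf` behavior.

```python
ENV_VAR = "METAFLOW_ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION"


def resolve_ttl(env, default):
    """Return the TTL in seconds: the env value if present, else the default."""
    raw = env.get(ENV_VAR)
    return int(raw) if raw is not None else default


upgrader_env = {}  # an existing user who never set the variable

# With the PR's default, cleanup starts silently after upgrading.
print(resolve_ttl(upgrader_env, 7 * 24 * 60 * 60))  # 604800

# With a default of 0, nothing changes until the user opts in.
print(resolve_ttl(upgrader_env, 0))  # 0
print(resolve_ttl({ENV_VAR: "86400"}, 0))  # 86400
```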
    def ttl_strategy(self, seconds_after_completion):
        # https://argoproj.github.io/argo-workflows/fields/#ttlstrategy
        if seconds_after_completion is not None and int(seconds_after_completion) > 0:
            self.payload["ttlStrategy"] = {
                "secondsAfterCompletion": int(seconds_after_completion)
            }
        return self
int() called twice and raises unhandled ValueError on bad env-var input
int(seconds_after_completion) is evaluated twice (once in the condition, once in the payload). More importantly, if a user sets METAFLOW_ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION=invalid in their environment, int("invalid") will raise an unhandled ValueError at deploy time with no helpful message. Converting and validating once at the top is cleaner:
    def ttl_strategy(self, seconds_after_completion):
        # https://argoproj.github.io/argo-workflows/fields/#ttlstrategy
        if seconds_after_completion is not None:
            try:
                seconds = int(seconds_after_completion)
            except (TypeError, ValueError):
                seconds = 0
            if seconds > 0:
                self.payload["ttlStrategy"] = {
                    "secondsAfterCompletion": seconds
                }
        return self
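Note that the suggestion above falls back to 0 on bad input, which silently disables the TTL rather than producing the helpful message the comment asks for. One possible alternative, sketched here as an assumption and not as code from the PR, is to parse once and fail with a descriptive error. The helper name `parse_ttl_seconds` is hypothetical.

```python
ENV_VAR = "METAFLOW_ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION"


def parse_ttl_seconds(value, name=ENV_VAR):
    """Convert a raw config value to whole seconds; 0 means TTL disabled."""
    if value is None:
        return 0
    try:
        seconds = int(value)
    except (TypeError, ValueError):
        # Surface the offending setting instead of a bare int() traceback.
        raise ValueError(
            "%s must be an integer number of seconds, got %r" % (name, value)
        ) from None
    # Zero and negative values are both treated as "disabled".
    return max(seconds, 0)
```

A caller such as `ttl_strategy` could then set `ttlStrategy` only when the parsed value is greater than zero. Whether to fail loudly or fall back quietly is a design choice; failing loudly at deploy time makes a mistyped setting visible immediately.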
Summary
- … `METAFLOW_ARGO_WORKFLOW_TTL_SECONDS` config
- … `schedules` array back to `schedule` string field for broader Argo version compatibility

Test plan
- test/unit/test_argo_ttl_strategy.py: unit tests for TTL strategy configuration

🤖 Generated with Claude Code
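The contents of the test file are not shown on this page. As an illustration of what such tests could cover, the sketch below uses a minimal stand-in `WorkflowSpec` class that contains only the `payload` dict and the `ttl_strategy` method from the diff above. Both the stand-in class and the test names are assumptions, not the PR's actual tests.

```python
import unittest


class WorkflowSpec:
    """Stand-in with only the pieces shown in the diff (not Metaflow's class)."""

    def __init__(self):
        self.payload = {}

    def ttl_strategy(self, seconds_after_completion):
        if seconds_after_completion is not None and int(seconds_after_completion) > 0:
            self.payload["ttlStrategy"] = {
                "secondsAfterCompletion": int(seconds_after_completion)
            }
        return self


class TtlStrategyTest(unittest.TestCase):
    def test_positive_value_sets_ttl(self):
        spec = WorkflowSpec().ttl_strategy(604800)
        self.assertEqual(
            spec.payload["ttlStrategy"], {"secondsAfterCompletion": 604800}
        )

    def test_zero_disables_ttl(self):
        self.assertNotIn("ttlStrategy", WorkflowSpec().ttl_strategy(0).payload)

    def test_none_disables_ttl(self):
        self.assertNotIn("ttlStrategy", WorkflowSpec().ttl_strategy(None).payload)

    def test_string_value_is_converted(self):
        # Values read from the environment arrive as strings.
        spec = WorkflowSpec().ttl_strategy("3600")
        self.assertEqual(spec.payload["ttlStrategy"]["secondsAfterCompletion"], 3600)

    def test_returns_self_for_chaining(self):
        spec = WorkflowSpec()
        self.assertIs(spec.ttl_strategy(60), spec)
```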