Skip to content

feat: add TTL strategy to Argo WorkflowTemplates#3104

Open
npow wants to merge 4 commits intomasterfrom
npow/argo-ttl-schedules-clean
Open

feat: add TTL strategy to Argo WorkflowTemplates#3104
npow wants to merge 4 commits intomasterfrom
npow/argo-ttl-schedules-clean

Conversation

@npow
Copy link
Copy Markdown
Collaborator

@npow npow commented Apr 13, 2026

Summary

  • Add TTL strategy support to Argo WorkflowTemplates via METAFLOW_ARGO_WORKFLOW_TTL_SECONDS config
  • Automatically cleans up completed workflow pods after the configured TTL, reducing cluster resource usage
  • Fix: revert CronWorkflow schedules array back to schedule string field for broader Argo version compatibility

Test plan

  • test/unit/test_argo_ttl_strategy.py — unit tests for TTL strategy configuration

🤖 Generated with Claude Code

Nissan Pow added 4 commits April 13, 2026 14:55
… use schedules

Add a configurable ttlStrategy to Argo WorkflowTemplates so completed
workflows are automatically cleaned up after 7 days (matching the default
Kubernetes job TTL). Configurable via METAFLOW_ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION
env var; set to 0 to disable.

Also update CronWorkflow creation to use the `schedules` array field instead
of the deprecated singular `schedule` field, per Argo Workflows deprecation
notice.

Closes #1231, closes #2351
The schedules array field requires Argo Workflows v3.6+ and breaks
older installations. Keep the singular schedule field which is
supported across all versions.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

This PR adds ttlStrategy support to Argo WorkflowTemplate specs, controlled by a new METAFLOW_ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION config variable (default: 7 days). The implementation is clean and well-tested, but the non-zero default is a silent behavior change — existing users who upgrade will have completed workflow resources auto-deleted after 7 days without any explicit opt-in.

Confidence Score: 5/5

Safe to merge; all findings are P2 style/suggestion level.

The implementation is correct and tested. The two findings are both P2: one is a design preference (default TTL value), and the other is a minor code quality improvement (double int() call). Neither blocks correctness.

Review the default value in metaflow/metaflow_config.py — consider whether 0 (opt-out) is safer than 7 days (opt-in) for the default TTL.

Important Files Changed

Filename Overview
metaflow/metaflow_config.py Adds ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION config; default of 7 days is a silent behavior change for existing users.
metaflow/plugins/argo/argo_workflows.py Adds ttl_strategy() to WorkflowSpec and wires it into the WorkflowTemplate builder; logic is correct, minor int() redundancy.
test/unit/test_argo_ttl_strategy.py Good coverage of positive, zero, None, negative, string-input, and chaining cases for ttl_strategy().

Reviews (1): Last reviewed commit: "test: add unit tests for WorkflowSpec.tt..." | Re-trigger Greptile

Comment on lines +482 to +484
ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION = from_conf(
"ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION", 7 * 24 * 60 * 60
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Opt-in default TTL is a behavior change for existing users

The default of 7 days means every existing Metaflow user who upgrades will silently start having their completed Argo Workflow resources auto-deleted after 7 days. Users who rely on workflow history beyond that window (e.g. for audit, debugging, or custom tooling that queries old workflow objects) will lose it without any explicit action. A safer default might be None or 0 (disable TTL) to preserve the existing behavior, with opt-in for cleanup.

Comment on lines +4301 to +4307
def ttl_strategy(self, seconds_after_completion):
# https://argoproj.github.io/argo-workflows/fields/#ttlstrategy
if seconds_after_completion is not None and int(seconds_after_completion) > 0:
self.payload["ttlStrategy"] = {
"secondsAfterCompletion": int(seconds_after_completion)
}
return self
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 int() called twice and raises unhandled ValueError on bad env-var input

int(seconds_after_completion) is evaluated twice (once in the condition, once in the payload). More importantly, if a user sets METAFLOW_ARGO_WORKFLOWS_TTL_SECONDS_AFTER_COMPLETION=invalid in their environment, int("invalid") will raise an unhandled ValueError at deploy time with no helpful message. Converting and validating once at the top is cleaner:

Suggested change
def ttl_strategy(self, seconds_after_completion):
# https://argoproj.github.io/argo-workflows/fields/#ttlstrategy
if seconds_after_completion is not None and int(seconds_after_completion) > 0:
self.payload["ttlStrategy"] = {
"secondsAfterCompletion": int(seconds_after_completion)
}
return self
def ttl_strategy(self, seconds_after_completion):
# https://argoproj.github.io/argo-workflows/fields/#ttlstrategy
if seconds_after_completion is not None:
try:
seconds = int(seconds_after_completion)
except (TypeError, ValueError):
seconds = 0
if seconds > 0:
self.payload["ttlStrategy"] = {
"secondsAfterCompletion": seconds
}
return self

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant