Problem
Terminal TaskAction custom resources (Succeeded/Failed) remain in the cluster indefinitely after completion. This wastes etcd storage and slows down list operations, and both costs grow with the number of completed tasks.
Proposed Solution
Add a label-based TTL garbage collector for terminal TaskActions, modeled after the propeller FlyteWorkflow GC (flytepropeller/pkg/controller/garbage_collector.go).
Design
- Terminal labeling: When a TaskAction reaches a terminal state (Succeeded/Failed), stamp it with flyte.org/termination-status=terminated and flyte.org/completed-time=<UTC hour> labels
- Background GC loop: A manager.Runnable that periodically lists terminated TaskActions, filters by completed-time label (lexicographically ordered), and deletes expired ones
- Configuration: GCConfig with Interval (how often GC runs) and MaxTTL (time-to-live for terminal TaskActions)
Key Design Decisions
- List + filter + delete (not DeleteAllOf): K8s label selectors don't support "less than" on string values, so we list all terminated TaskActions and filter client-side by hour label
- Separate metadata update: Labels require r.Update() (not r.Status().Update()). One extra API call per terminal transition, but only once per TaskAction lifetime
- Upgrade path: Pre-existing terminal TaskActions get labels on next reconcile via the terminal short-circuit path
Implementation
PR: #6994