[Core feature] Vertical Pod scaling to handle OOMs #2234

Description

@apatel-fn

Motivation: Why do you think this is important?

It would be clean and useful for Flyte to handle vertical pod scaling when a task fails from an OOMKill (or from a defined set of recoverable, resource-related exceptions). The feature could be exposed through the task's resource settings or the pod_spec parameter. It would be especially effective where the workflow writer's users modify and override an existing set of workflows. Experimental compute currently requires manually monitoring running workflows, which creates unnecessary overhead.

Goal: What should the final outcome look like, ideally?

Ideally, this would be a simple field on the Task definition (similar to pod specs) that defines which exceptions the task should be rerun on, and with what monotonic back-off. Configuration could also live on flytepropeller to describe limits and other system-level constraints.
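To make the request concrete, here is a minimal sketch of what such a policy could look like. Nothing in it is an existing Flyte API: `ResourceRetryPolicy`, `OOMKilled`, `next_memory_mib`, and `run_with_scaling` are hypothetical names, and reading "monotonic back-off" as a memory request that only grows between attempts is my assumption. The `max_memory_mib` cap stands in for the system-level limits that could live on flytepropeller.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Tuple


class OOMKilled(Exception):
    """Stand-in for a task failing with an OOMKill (hypothetical)."""


@dataclass(frozen=True)
class ResourceRetryPolicy:
    """Hypothetical task-level policy; not an existing Flyte field."""
    retry_on: FrozenSet[str] = frozenset({"OOMKilled"})  # exception names to retry
    memory_factor: float = 2.0    # multiply the memory request on each retry
    max_attempts: int = 3         # total attempts, including the first
    max_memory_mib: int = 16384   # system-level cap (assumed platform setting)


def next_memory_mib(policy: ResourceRetryPolicy, current_mib: int) -> Optional[int]:
    """Return a strictly larger memory request, or None once the cap is hit."""
    proposed = min(int(current_mib * policy.memory_factor), policy.max_memory_mib)
    return proposed if proposed > current_mib else None


def run_with_scaling(
    task: Callable[[int], str],
    policy: ResourceRetryPolicy,
    initial_mib: int,
) -> Tuple[str, List[int]]:
    """Run `task(memory_mib)`, growing memory after each retryable failure."""
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    memory = initial_mib
    attempts: List[int] = []
    for attempt in range(1, policy.max_attempts + 1):
        attempts.append(memory)
        try:
            return task(memory), attempts
        except Exception as exc:
            retryable = type(exc).__name__ in policy.retry_on
            larger = next_memory_mib(policy, memory)
            # Give up on non-retryable errors, the last attempt, or the cap.
            if not retryable or attempt == policy.max_attempts or larger is None:
                raise
            memory = larger
    raise AssertionError("unreachable")
```

In this sketch only the failed task is rerun, with a larger request, rather than the whole workflow. A real implementation would presumably apply the new request to the pod spec instead of passing it as a function argument.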

Describe alternatives you've considered

A naive but complex alternative is a server that acts as a long-polling listener running FlyteRemote. The listener would monitor the workflow that needs to be relaunched on OOM, wait for the running nodes to either succeed or error, and then rerun the workflow from the start. This method has a few drawbacks. First, long-polling listeners built on FlyteRemote do not seem efficient, and can be an anti-pattern when many workflows with heterogeneous inputs are expected to run in parallel. Second, relaunching workflows can be costly, especially for workflows that are intentionally not cached.
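For reference, the listener alternative boils down to a loop like the sketch below. The callables `launch`, `get_phase`, and `is_oom` are hypothetical placeholders for whatever FlyteRemote calls the listener would wrap, and the phase strings are illustrative rather than Flyte's actual phase names.

```python
import time
from typing import Callable, Tuple

# Illustrative terminal phases; not Flyte's actual phase enum.
TERMINAL_PHASES = {"SUCCEEDED", "FAILED"}


def poll_and_relaunch(
    launch: Callable[[], str],        # starts the workflow, returns an execution id
    get_phase: Callable[[str], str],  # returns the current phase of an execution
    is_oom: Callable[[str], bool],    # True if the failure was an OOMKill
    max_relaunches: int = 2,
    poll_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str, int]:
    """Poll one workflow until it ends; relaunch it from the start on OOM."""
    execution_id = launch()
    relaunches = 0
    polls = 0
    while True:
        phase = get_phase(execution_id)
        polls += 1
        if phase not in TERMINAL_PHASES:
            sleep(poll_interval_s)  # one blocked listener per workflow
            continue
        if phase == "SUCCEEDED" or not is_oom(execution_id) or relaunches >= max_relaunches:
            return execution_id, phase, polls
        # The whole workflow restarts, so uncached work is recomputed.
        execution_id = launch()
        relaunches += 1
```

Both drawbacks are visible here: each monitored workflow needs its own polling loop, and every OOM triggers a full relaunch instead of a retry of the failed task.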

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
