feat: isolate core ado ray actors during explore operation #1001

@michael-johnston

Is your feature request related to a problem? Please describe.

The core ado ray actors started for an explore operation can be scheduled to any ray worker. However, if an experiment is scheduled to the same worker and, for example, goes OOM, the core actor will also be killed and the full ado ray job will crash or hang, with no chance of recovery.

Describe the solution you'd like.

Core ado ray actors are scheduled to a node isolated from experiments. This could be the head node.

This requires tagging a ray cluster worker with a resource that can be requested when starting the actors, e.g. "operation-actors" or "cluster-head-node", and tagging a kuberay worker with the same resource label.

The easiest option would be the head node.

Additional context.

  • The operation must still run even if the resource is not available, e.g. probe for the resource "cluster-head-node" and, if it is not available, fall back to the current behaviour.
  • The core actors are: discovery-space-manager, actuators, operator.
  • There is a case where the actuators do not spawn experiments remotely. In that case the isolated node itself could go OOM, killing all core actors. If this node is also the head node, the cluster will go down.
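
The probe-and-fall-back behaviour in the first bullet could be sketched roughly as below. This is a minimal sketch, not ado's implementation: `core_actor_options`, `start_core_actor`, `ISOLATION_RESOURCE` and the `0.001` request amount are hypothetical names and values chosen for illustration. It assumes the isolated node has been started with a Ray custom resource called "cluster-head-node", and it relies only on `ray.cluster_resources()` and the `resources=` argument of `.options(...)`.

```python
from typing import Any, Mapping

# Hypothetical resource name, taken from the example in this issue.
ISOLATION_RESOURCE = "cluster-head-node"


def core_actor_options(
    cluster_resources: Mapping[str, float],
    resource_name: str = ISOLATION_RESOURCE,
    amount: float = 0.001,
) -> dict[str, Any]:
    """Return keyword arguments for `ActorClass.options(...)`.

    If the cluster advertises the isolation resource, request a small
    fraction of it so the actor is pinned to a node carrying that resource.
    Otherwise return no extra options, which keeps the current behaviour
    (the actor can be scheduled to any worker).
    """
    if cluster_resources.get(resource_name, 0) > 0:
        return {"resources": {resource_name: amount}}
    return {}


def start_core_actor(actor_cls, *args, **kwargs):
    """Start a Ray actor class on the isolated node when one exists."""
    import ray  # imported lazily so the helper above has no Ray dependency

    # cluster_resources() reports totals for the whole cluster, not what is
    # currently free, so this only checks that the resource is defined.
    options = core_actor_options(ray.cluster_resources())
    if options:
        return actor_cls.options(**options).remote(*args, **kwargs)
    return actor_cls.remote(*args, **kwargs)
```

Requesting only a small fraction of the resource lets all three core actors share one tagged node. Note that this does nothing to keep experiments off that node; that still depends on how experiment tasks and actors declare their own resource requirements.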

Labels: enhancement (New feature or request)
