Worker pods not cleaned up upon MPIJobEvicted event #647

@shaowei-su

If a worker pod is evicted, the entire MPIJob goes into the Failed state:

status:
  conditions:
  - lastTransitionTime: "2024-08-14T19:45:39Z"
    lastUpdateTime: "2024-08-14T19:45:39Z"
    message: MPIJob xxx is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2024-08-14T19:48:02Z"
    lastUpdateTime: "2024-08-14T19:48:02Z"
    message: MPIJob xxx is running.
    reason: MPIJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2024-08-15T04:01:42Z"
    lastUpdateTime: "2024-08-15T04:01:42Z"
    message: 1/8 workers are evicted
    reason: MPIJobEvicted
    status: "True"
    type: Failed
  replicaStatuses:
    Launcher:
      failed: 1
    Worker:
      active: 7
      failed: 1
  startTime: "2024-08-14T19:45:39Z"

However, the run policy is not honored once the job has failed this way: the remaining worker pods are not cleaned up and keep running.

  runPolicy:
    backoffLimit: 1
    cleanPodPolicy: Running
    ttlSecondsAfterFinished: 10800
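
To make the mismatch concrete, here is a minimal sketch that checks the status and run policy above for this condition. It assumes both have already been loaded into plain Python dicts (for example from `kubectl get mpijob -o yaml`); the function name `leaked_workers` is hypothetical and not part of the operator, and treating `Running` and `All` as the policies that should remove running pods is my assumption.

```python
from typing import Any


def leaked_workers(status: dict[str, Any], run_policy: dict[str, Any]) -> int:
    """Return how many workers are still active on a failed job whose
    cleanPodPolicy asks for running pods to be removed (hypothetical helper)."""
    # The job counts as failed if a Failed condition has status "True".
    failed = any(
        c.get("type") == "Failed" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    active = status.get("replicaStatuses", {}).get("Worker", {}).get("active", 0)
    # Assumption: "Running" and "All" both imply running pods get cleaned up.
    wants_cleanup = run_policy.get("cleanPodPolicy") in ("Running", "All")
    return active if failed and wants_cleanup else 0


# Values copied from the status and runPolicy shown in this issue.
status = {
    "conditions": [
        {"type": "Created", "status": "True", "reason": "MPIJobCreated"},
        {"type": "Running", "status": "False", "reason": "MPIJobRunning"},
        {"type": "Failed", "status": "True", "reason": "MPIJobEvicted"},
    ],
    "replicaStatuses": {
        "Launcher": {"failed": 1},
        "Worker": {"active": 7, "failed": 1},
    },
}
run_policy = {
    "backoffLimit": 1,
    "cleanPodPolicy": "Running",
    "ttlSecondsAfterFinished": 10800,
}

print(leaked_workers(status, run_policy))  # prints 7
```

A non-zero result here corresponds to what I am seeing: the job is marked Failed, yet 7 workers are still reported as active.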
