Terminate failed repair jobs #3806

Open
@Michal-Leszczynski

Description

Right now we don't terminate failed repair jobs by default. The problem is that a job might have been marked as failed only because of a timeout on our side, while in fact it is still running. This causes two problems:

  • In case of a timeout, SM believes that the task has failed and stopped running, so it schedules new repair jobs for the "released" hosts. This can break the rule of 1 job per host.
  • Repair jobs that were not terminated and keep running after the SM task has ended might make it impossible to retry the SM task until they finish (see https://github.com/scylladb/scylla-enterprise/issues/4055).
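
To illustrate the proposed behavior, here is a minimal sketch in Python, not SM's actual code. `FakeNode`, `JobTimeout`, `run_repair_job` and all method names are hypothetical stand-ins invented for this example; they do not correspond to the real SM or Scylla API. The idea is only that a job reported as failed gets terminated explicitly before its host is treated as free.

```python
from dataclasses import dataclass, field


class JobTimeout(Exception):
    """Hypothetical error: the manager gave up waiting for a job status."""


@dataclass
class FakeNode:
    """Hypothetical stand-in for a node's repair API (not the real one)."""
    running: set = field(default_factory=set)

    def start_repair(self, job_id):
        self.running.add(job_id)

    def wait_for_repair(self, job_id):
        # Simulate a manager-side timeout: the node keeps running the job.
        raise JobTimeout(job_id)

    def terminate_repair(self, job_id):
        self.running.discard(job_id)


def run_repair_job(node, job_id):
    """Run one job; on failure, terminate it before the host is released."""
    node.start_repair(job_id)
    try:
        node.wait_for_repair(job_id)
        return True
    except Exception:
        # A failure seen by the manager does not prove the job has stopped,
        # so stop it explicitly before reporting the host as free.
        node.terminate_repair(job_id)
        return False


node = FakeNode()
ok = run_repair_job(node, "job-1")
```

With this ordering, the host is only handed back to the scheduler once no job from the failed attempt can still be running on it, which addresses both problems listed above.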
