Description
Right now we don't terminate failed repair jobs by default. The problem is that a job might have "failed" only because of a timeout on our side, while it is in fact still running on the node. This causes two problems:
- in case of a timeout, SM believes that the job has failed and stopped running, so it schedules new repair jobs on the "released" hosts. This can break the one-job-per-host rule.
- repair jobs that were not terminated and keep running after the SM task has ended might make it impossible to retry the SM task until they finish (see https://github.com/scylladb/scylla-enterprise/issues/4055)
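
A minimal sketch of the intended flow, assuming a hypothetical per-node client (`jobClient`, `WaitForJob`, `TerminateJob` are placeholders, not the actual SM API): when waiting for a job times out on our side, explicitly ask the node to terminate it before the host is released for new jobs.

```go
package repair

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// jobClient abstracts the per-node repair API used in this sketch.
type jobClient interface {
	// WaitForJob blocks until the job finishes or ctx is done.
	WaitForJob(ctx context.Context, host string, jobID int64) error
	// TerminateJob asks the node to stop the job (hypothetical call).
	TerminateJob(ctx context.Context, host string, jobID int64) error
}

// runJob waits for a repair job and, on timeout, terminates it before
// returning, so the host is not "released" while the job may still be running.
func runJob(ctx context.Context, c jobClient, host string, jobID int64, timeout time.Duration) error {
	waitCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	err := c.WaitForJob(waitCtx, host, jobID)
	if err == nil {
		return nil
	}
	if errors.Is(err, context.DeadlineExceeded) {
		// The job may still be running on the node: terminate it so that
		// scheduling another job on this host cannot break the
		// one-job-per-host rule, and so a later retry is not blocked.
		termCtx, termCancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer termCancel()
		if termErr := c.TerminateJob(termCtx, host, jobID); termErr != nil {
			return fmt.Errorf("wait timed out and terminate failed: %w", termErr)
		}
		return fmt.Errorf("repair job %d on %s timed out and was terminated: %w", jobID, host, err)
	}
	return err
}
```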