Revisit checking when SM should batch vnode token ranges during repair

By default, SM batches (with `ranges_parallelism`) the token ranges it sends during vnode repair in order to improve repair performance.
Batching is skipped if any execution in the task's execution chain ended because it ran out of the maintenance window or hit an error:
```go
func shouldBatchRanges(session gocqlx.Session, clusterID, taskID, runID uuid.UUID) (bool, error) {
	prevIDs, err := getAllPrevRunIDs(session, clusterID, taskID, runID)
	// ...
	var status string
	for _, id := range prevIDs {
		err := q.BindMap(qb.M{
			"cluster_id": clusterID,
			"type":       "repair",
			"task_id":    taskID,
			"id":         id,
		}).Scan(&status)
		if err != nil {
			return false, errors.Wrap(err, "get prev run status")
		}
		// Fall back to no-batching when some of the previous runs:
		// - finished with error
		// - got out of scheduler window
		if status == "WAITING" || status == "ERROR" {
			return false, nil
		}
	}

	return true, nil
}
```
The problem with batching is that it might reduce granularity, and with it the progress that a repair task can make in a single, short execution. For example, it is theoretically possible for a repair task scheduled with short maintenance window spans to make progress without batching and fail to make any progress with batching. Here is the [initial conversation](https://github.com/scylladb/scylla-manager/issues/3792#issuecomment-2116473707) about this topic.

In general, we should really prefer batching, as:
- we gain a lot of performance from it
- re-repairing already repaired data is faster
- nobody schedules maintenance windows with spans that short

So there are some reasons to believe that not batching might be better in case of retries or running out of the maintenance window, but it's difficult to say whether they hold in practice.

One thing is certain: the current implementation stops batching too often, because it only checks whether any of the previous runs finished with an error. This error might be unrelated to a failed vnode repair (e.g. the run might fail on the initial check that there are no repairs running on the cluster). Also, a single failed batch disables batching for all other token ranges, which isn't optimal either.
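
One possible direction, sketched under assumptions rather than proposed as the implementation: decide per token range instead of per task. The `tokenRange` type, the `failed` set, and `splitForBatching` are hypothetical; how SM would record which ranges belonged to a failed job is left open here.

```go
package main

import "fmt"

// tokenRange is a hypothetical stand-in for a vnode token range.
type tokenRange struct {
	Start, End int64
}

// splitForBatching is a sketch, not SM code. Ranges that were part of a
// failed job in a previous run are retried one by one, while all other
// ranges stay eligible for batching.
func splitForBatching(all []tokenRange, failed map[tokenRange]bool) (batch, single []tokenRange) {
	for _, r := range all {
		if failed[r] {
			single = append(single, r)
		} else {
			batch = append(batch, r)
		}
	}
	return batch, single
}

func main() {
	all := []tokenRange{{0, 10}, {10, 20}, {20, 30}, {30, 40}}
	// Assumption: only the job covering (10, 20] failed in a previous run.
	failed := map[tokenRange]bool{{10, 20}: true}

	batch, single := splitForBatching(all, failed)
	fmt.Println(len(batch), len(single))
}
```

Under this scheme, an error unrelated to vnode repair leaves the `failed` set empty, so batching stays fully enabled.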

Revisit checking when SM should batch vnode token ranges during repair #4752
