Revisit checking when SM should batch vnode token ranges during repair #4752

@Michal-Leszczynski

By default, SM batches the token ranges sent during vnode repair (using ranges_parallelism) to improve repair performance.
Batching is skipped if any execution in the task's execution chain ended with an error or by running out of the maintenance window:

func shouldBatchRanges(session gocqlx.Session, clusterID, taskID, runID uuid.UUID) (bool, error) {
	prevIDs, err := getAllPrevRunIDs(session, clusterID, taskID, runID)
	...
	var status string
	for _, id := range prevIDs {
		err := q.BindMap(qb.M{
			"cluster_id": clusterID,
			"type":       "repair",
			"task_id":    taskID,
			"id":         id,
		}).Scan(&status)
		if err != nil {
			return false, errors.Wrap(err, "get prev run status")
		}
		// Fall back to no-batching when some of the previous runs:
		// - finished with error
		// - got out of scheduler window
		if status == "WAITING" || status == "ERROR" {
			return false, nil
		}
	}

	return true, nil
}

The problem with batching is that it makes repair coarser-grained, which can reduce the progress a repair task makes in a single, short execution. E.g., a repair task scheduled with short maintenance window spans could theoretically make progress without batching but fail to make any with batching. Here is the initial conversation about this topic.
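
To make the granularity concern concrete, here is a toy model in Go. It is not SM code, and every name in it is hypothetical. It assumes that one token range costs a fixed amount of time, that a job covering a batch costs that amount times the batch size (so it ignores the speedup batching normally brings), and that progress is only recorded for jobs that finish before the window closes.

```go
package main

import "fmt"

// rangesDonePerWindow is a toy model, not SM code.
// Assumptions: repairing one range takes rangeCost time units, a job covering
// batchSize ranges takes batchSize*rangeCost, and only jobs that complete
// within the window count as progress.
func rangesDonePerWindow(window, rangeCost, batchSize int) int {
	jobCost := rangeCost * batchSize
	// Number of whole jobs that fit in the window, times ranges per job.
	return (window / jobCost) * batchSize
}

func main() {
	// A 10-unit window with ranges costing 3 units each.
	fmt.Println(rangesDonePerWindow(10, 3, 1)) // no batching: three jobs fit
	fmt.Println(rangesDonePerWindow(10, 3, 4)) // batch of 4 costs 12 units: no job fits
}
```

Under these assumptions the unbatched task repairs 3 ranges per window, while the batched task repairs none. Whether real workloads ever hit this case is exactly the open question.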

In general, we should really prefer batching, as:

  • we gain a lot of performance from it
  • re-repairing already repaired data is faster
  • nobody schedules maintenance windows with such short spans

So there are some reasons to believe that not batching might be better after retries or after running out of the maintenance window, but it's difficult to say whether they hold in practice.

One thing is certain: the current implementation stops batching too often. It only checks whether any of the previous runs finished with an error. That error might be unrelated to a failed vnode repair (e.g. the repair might fail on the initial check that no other repairs are running on the cluster). Also, a single failed batch disables batching for other, unrelated token ranges, which isn't optimal either.
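
The over-triggering can be shown with a small, self-contained Go sketch. It is only an illustration: `prevRun`, its `RangeRepairFailed` field, and `batchNarrow` are hypothetical and do not exist in SM. `batchCurrent` mirrors the status check from `shouldBatchRanges` above, and `batchNarrow` is one possible refinement, not a decided design.

```go
package main

import "fmt"

// prevRun is a hypothetical, simplified view of a previous execution.
type prevRun struct {
	// Status as stored for the run, e.g. "WAITING" or "ERROR".
	Status string
	// RangeRepairFailed is an assumed field: true only when the run failed
	// because a vnode range repair job failed.
	RangeRepairFailed bool
}

// batchCurrent mirrors the current check: any previous run that ended
// with WAITING or ERROR disables batching.
func batchCurrent(runs []prevRun) bool {
	for _, r := range runs {
		if r.Status == "WAITING" || r.Status == "ERROR" {
			return false
		}
	}
	return true
}

// batchNarrow is a hypothetical refinement: an ERROR only disables
// batching when it came from a failed range repair.
func batchNarrow(runs []prevRun) bool {
	for _, r := range runs {
		if r.Status == "WAITING" {
			return false
		}
		if r.Status == "ERROR" && r.RangeRepairFailed {
			return false
		}
	}
	return true
}

func main() {
	// A previous run that failed on a pre-repair check, not on a range.
	runs := []prevRun{{Status: "ERROR", RangeRepairFailed: false}}
	fmt.Println(batchCurrent(runs)) // false: batching is disabled
	fmt.Println(batchNarrow(runs))  // true: batching is kept
}
```

This sketch does not address the second point, that a failed batch should ideally affect only its own token ranges. That would need per-range state rather than a single per-task decision.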
