Description
This is a duplicate of #3611, but since that issue has a long comment history and went stale, I decided to create a new one with updated argumentation.
One of the more important changes in the SM 3.2 repair was sticking to the "one job per host" rule. I believe that for a bigger cluster this might kill any parallelism on the node level. Let's analyze a big cluster: 2 DCs with 30 nodes each, where all nodes have `max_repair_ranges_in_parallel = 7`. By default, each keyspace in such a cluster consists of 60 * 256 = 15360 token ranges. Assuming that the keyspace has replication `{'dc1': 3, 'dc2': 3}`, there are (30!/(3! * 27!))^2 = 4060^2 = 16,483,600 possible replica sets. Assuming that token ranges are distributed uniformly across all possible replica sets, it is rather unlikely that a single replica set owns more than 1 token range. Combined with the fact that SM sends repair jobs only for a single replica set at a time, this results in SM sending only a single token range per repair job, despite `max_repair_ranges_in_parallel = 7`.
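The arithmetic above can be double-checked with a short sketch (the cluster numbers are the hypothetical ones from this issue, not measured from any real deployment):

```python
from math import comb

# Hypothetical cluster from the argument above:
# 2 DCs, 30 nodes each, 256 vnodes per node, RF 3 in each DC.
nodes_per_dc = 30
dcs = 2
vnodes_per_node = 256
rf = 3

# Total token ranges per keyspace: 60 nodes * 256 vnodes.
token_ranges = dcs * nodes_per_dc * vnodes_per_node

# Possible replica sets: choose 3 of 30 nodes in each DC independently.
replica_sets = comb(nodes_per_dc, rf) ** dcs

print(token_ranges)   # 15360
print(replica_sets)   # 16483600

# Expected token ranges per replica set under a uniform distribution:
print(token_ranges / replica_sets)  # ~0.00093, i.e. almost always 0 or 1
```

With ~1000x more replica sets than token ranges, a repair job scoped to a single replica set effectively always carries just one range.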
This behavior could be controlled by an additional flag or a repair config option in `scylla-manager.yaml`.
In terms of testing, it would be good to see a performance improvement on a big cluster, e.g.: 2 DCs with 15 nodes each, a keyspace with RF 3 in each DC, and a setup in which the repair indeed has to do some work (missing rows on some nodes). This bigger setup would definitely require help from QA.