-
Notifications
You must be signed in to change notification settings - Fork 47
Tablet repair: core implementation #4662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
89cc37f to
b331693
Compare
b331693 to
f44dece
Compare
|
This failure is actually something that can be improved.
I will add this functionality to this PR shortly. |
Some task manager tasks are local to a given node, meaning that we need to check progress of this task on the node on which it was scheduled. Tablet repair is cluster wide, meaning that we can query any node for its progress. This commit makes it possible to not specify host when scheduling or checking progress of tablet repair.
…ient This way it can be accessed from different pkgs without importing repair pkg.
This way it can be accessed from different pkgs without importing repair pkg.
This error is non retryable, and we use it to safely skip colocated table tablet repair.
This way it can be accessed from different pkgs without importing repair pkg.
…able_tablets enable_tablets has been deprecated in favor of tablets_mode_for_new_keyspaces. We can use both for simplicity when using the same scylla.yaml for multiple scylla versions. This allows for controlling default alternator table replication (vnode/tablets) in the same as for the CQL tables.
It will be used for storing progress of the new tablet repair task. This commit also runs `make generate`.
Tablet repair svc is supposed to handle tablet repair in a light weighted way. Since tablet repair works fine with topology changes on scylla side, tablet repair svc also makes sure to support it. It does not contain any vnode related bloat, which is not needed for tablets, and could cause problems with parallel topology changes. Refs #4644
Since it also changes svc interfaces, it also runs `make generate`.
We need to make sure that leftover scylla tablet repair tasks are not running, as scheduling new scylla tablet repair tasks on a table with an ongoing tablet repair ends with an error. On the other hand, this is just best effort, because we can't ensure that new scylla tablet repair tasks won't be created in the meantime or that the task that we were trying to abort has just finished on its own.
adc9684 to
ad81961
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 44 out of 44 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| } | ||
| }, | ||
| "TabletRepairTableProgress": { |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TabletRepairTableProgress definition is missing "type": "object" at the beginning of its definition. This should be added for consistency with other schema definitions like TabletRepairProgress (line 279) and Schedule (line 313). The definition should start with:
"TabletRepairTableProgress": {
"type": "object",
"properties": {|
|
||
| func (w *worker) abortRepairTask(ctx context.Context, id string) { | ||
| if err := w.client.ScyllaAbortTask(ctx, "", id); err != nil { | ||
| w.logger.Error(context.Background(), "Failed to abort scylla repair task", |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error logging is using context.Background() instead of the ctx parameter that was passed to the function. This means error logs won't include any context-specific information (like trace IDs, deadlines, etc.) that might be in the original context. The parameter ctx should be used here consistently:
w.logger.Error(ctx, "Failed to abort scylla repair task",| w.logger.Error(context.Background(), "Failed to abort scylla repair task", | |
| w.logger.Error(ctx, "Failed to abort scylla repair task", |
| @@ -0,0 +1,51 @@ | |||
| // Copyright (C) 2017 ScyllaDB | |||
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The copyright year should be 2025 since this is a new file created in 2025, not 2017. Please update to:
// Copyright (C) 2025 ScyllaDB
| // Copyright (C) 2017 ScyllaDB | |
| // Copyright (C) 2025 ScyllaDB |
...ger/client/operations/get_cluster_cluster_id_task_tablet_repair_task_id_run_id_parameters.go
Show resolved
Hide resolved
...ger/client/operations/get_cluster_cluster_id_task_tablet_repair_task_id_run_id_parameters.go
Show resolved
Hide resolved
...ger/client/operations/get_cluster_cluster_id_task_tablet_repair_task_id_run_id_parameters.go
Show resolved
Hide resolved
...ger/client/operations/get_cluster_cluster_id_task_tablet_repair_task_id_run_id_parameters.go
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through the PR.
One comment related to the timeout on waiting for Scylla response.
Let's go through the integration-tests on Monday.
Not approving yet, although I love this thin tablet repair service. It's easy to read.
| g := gaugeVecCreator("tablet_repair") | ||
|
|
||
| return TabletRepairMetrics{ | ||
| tableProgress: g("Tablet repair progress in percents (0-100).", "progress", "cluster", "keyspace", "table"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update list of labels with the task-id. Otherwise, metrics are mixed when there is more than one tablet repair task defined per cluster (corner-case, but supported)
| w.logger.Info(ctx, "Scheduled tablet repair", "keyspace", ks, "table", tab, "task ID", id) | ||
| w.upsertTableProgress(ctx, pr) | ||
|
|
||
| status, err := client.ScyllaWaitTask(ctx, "", id, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need your help here.
Is this call blocking? Does it wait indefinitely until Scylla returns a response?
As far as I can see, we don’t set any timeout on the HTTP client, right? If Scylla neither responds nor closes the connection, this call would hang forever.
Could you please set a reasonable client timeout? Do we have any data on how long tablet repair for a single table typically takes, so we can pick a sensible value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this call blocking? Does it wait indefinitely until Scylla returns a response?
Yes
As far as I can see, we don’t set any timeout on the HTTP client, right? If Scylla neither responds nor closes the connection, this call would hang forever.
Right, we rely on correct scylla behavior
Could you please set a reasonable client timeout? Do we have any data on how long tablet repair for a single table typically takes, so we can pick a sensible value?
It depends on table and cluster size. We don't have a single value that could be sanely used here.
In general, there were similar discussions in the past (#4434 - long polling is bad, #4419 - big timeout is bad) and the conclusion was to not set the timeout.
|
|
||
| status, err := client.ScyllaWaitTask(ctx, "", id, 0) | ||
| if err != nil { | ||
| w.abortRepairTask(context.Background(), id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why context.Background() here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we might have failed due to SM task being stopped which results in context cancel.
In such case, we still want to perform the tablet repair task abort.
This PR contains most of the implementation of the new tablet repair task.
It starts with refactor/improvements to existing codebase, which are needed for the tablet repair task.
This task is supposed to be a light weighted version of regular repair task for tablets.
It puts less pressure on SM DB and is resilient to parallel topology changes.
It also adds the possibility to schedule tablet repair with more frequent schedule than
the regular repair in order to utilize incremental tablet repair.
For now it does not have any flags or configuration - it is as simple as possible, but
some of the regular repair flags can be added to it later on (if we think that they are needed).
This PR also defines SM DB schema and progress swagger, but managerclient and sctool
commands need to be added in separate PRs (managerclient is a vendored submodule
and sctool commands depend on it).
Full implementation (made with go work) can be checked here: #4661
Refs #4644