Skip to content

Conversation

@Michal-Leszczynski
Copy link
Collaborator

@Michal-Leszczynski Michal-Leszczynski commented Nov 27, 2025

This PR contains most of the implementation of the new tablet repair task.
It starts with refactor/improvements to existing codebase, which are needed for the tablet repair task.

This task is supposed to be a light weighted version of regular repair task for tablets.
It puts less pressure on SM DB and is resilient to parallel topology changes.
It also adds the possibility to schedule tablet repair with more frequent schedule than
the regular repair in order to utilize incremental tablet repair.
For now it does not have any flags or configuration - it is as simple as possible, but
some of the regular repair flags can be added to it later on (if we think that they are needed).

This PR also defines SM DB schema and progress swagger, but managerclient and sctool
commands need to be added in separate PRs (managerclient is a vendored submodule
and sctool commands depend on it).

Full implementation (made with go work) can be checked here: #4661

Refs #4644

@Michal-Leszczynski Michal-Leszczynski force-pushed the ml/4644-tablet-repair-core branch 6 times, most recently from 89cc37f to b331693 Compare November 28, 2025 10:33
@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review November 28, 2025 11:17
@Michal-Leszczynski Michal-Leszczynski force-pushed the ml/4644-tablet-repair-core branch from b331693 to f44dece Compare December 2, 2025 10:06
@Michal-Leszczynski
Copy link
Collaborator Author

This failure is actually something that can be improved.
It happened because we don't abort scylla tablet repair tasks on SM task pause and trying to schedule scylla tablet repair tasks on a table which is already repaired ends with an error. Because of that, we should both:

  • abort scylla tablet repair task if we need to exit before it is finished
  • abort all ongoing scylla tablet repair tasks at the beginning of SM tablet repair task

I will add this functionality to this PR shortly.

Some task manager tasks are local to a given node, meaning that
we need to check progress of this task on the node on which it
was scheduled. Tablet repair is cluster wide, meaning that we
can query any node for its progress. This commit makes it possible
to not specify host when scheduling or checking progress of tablet repair.
…ient

This way it can be accessed from different pkgs without importing repair pkg.
This way it can be accessed from different pkgs without importing repair pkg.
This error is non retryable, and we use it
to safely skip colocated table tablet repair.
This way it can be accessed from different pkgs without importing repair pkg.
…able_tablets

enable_tablets has been deprecated in favor of tablets_mode_for_new_keyspaces.
We can use both for simplicity when using the same scylla.yaml for multiple
scylla versions. This allows for controlling default alternator table replication
(vnode/tablets) in the same as for the CQL tables.
It will be used for storing progress of the new tablet repair task.
This commit also runs `make generate`.
Tablet repair svc is supposed to handle tablet repair
in a light weighted way. Since tablet repair works
fine with topology changes on scylla side, tablet repair
svc also makes sure to support it. It does not contain
any vnode related bloat, which is not needed for tablets,
and could cause problems with parallel topology changes.

Refs #4644
Since it also changes svc interfaces, it also runs `make generate`.
We need to make sure that leftover scylla tablet repair tasks are not running,
as scheduling new scylla tablet repair tasks on a table with an ongoing tablet repair
ends with an error. On the other hand, this is just best effort, because we can't
ensure that new scylla tablet repair tasks won't be created in the meantime or that
the task that we were trying to abort has just finished on its own.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot finished reviewing on behalf of karol-kokoszka December 2, 2025 16:44
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
}
},
"TabletRepairTableProgress": {
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TabletRepairTableProgress definition is missing "type": "object" at the beginning of its definition. This should be added for consistency with other schema definitions like TabletRepairProgress (line 279) and Schedule (line 313). The definition should start with:

"TabletRepairTableProgress": {
  "type": "object",
  "properties": {

Copilot uses AI. Check for mistakes.

func (w *worker) abortRepairTask(ctx context.Context, id string) {
if err := w.client.ScyllaAbortTask(ctx, "", id); err != nil {
w.logger.Error(context.Background(), "Failed to abort scylla repair task",
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error logging is using context.Background() instead of the ctx parameter that was passed to the function. This means error logs won't include any context-specific information (like trace IDs, deadlines, etc.) that might be in the original context. The parameter ctx should be used here consistently:

w.logger.Error(ctx, "Failed to abort scylla repair task",
Suggested change
w.logger.Error(context.Background(), "Failed to abort scylla repair task",
w.logger.Error(ctx, "Failed to abort scylla repair task",

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,51 @@
// Copyright (C) 2017 ScyllaDB
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright year should be 2025 since this is a new file created in 2025, not 2017. Please update to:

// Copyright (C) 2025 ScyllaDB
Suggested change
// Copyright (C) 2017 ScyllaDB
// Copyright (C) 2025 ScyllaDB

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@karol-kokoszka karol-kokoszka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the PR.
One comment related to the timeout on waiting for Scylla response.

Let's go through the integration-tests on Monday.

Not approving yet, although I love this thin tablet repair service. It's easy to read.

g := gaugeVecCreator("tablet_repair")

return TabletRepairMetrics{
tableProgress: g("Tablet repair progress in percents (0-100).", "progress", "cluster", "keyspace", "table"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update list of labels with the task-id. Otherwise, metrics are mixed when there is more than one tablet repair task defined per cluster (corner-case, but supported)

w.logger.Info(ctx, "Scheduled tablet repair", "keyspace", ks, "table", tab, "task ID", id)
w.upsertTableProgress(ctx, pr)

status, err := client.ScyllaWaitTask(ctx, "", id, 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need your help here.

Is this call blocking? Does it wait indefinitely until Scylla returns a response?

As far as I can see, we don’t set any timeout on the HTTP client, right? If Scylla neither responds nor closes the connection, this call would hang forever.

Could you please set a reasonable client timeout? Do we have any data on how long tablet repair for a single table typically takes, so we can pick a sensible value?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this call blocking? Does it wait indefinitely until Scylla returns a response?

Yes

As far as I can see, we don’t set any timeout on the HTTP client, right? If Scylla neither responds nor closes the connection, this call would hang forever.

Right, we rely on correct scylla behavior

Could you please set a reasonable client timeout? Do we have any data on how long tablet repair for a single table typically takes, so we can pick a sensible value?

It depends on table and cluster size. We don't have a single value that could be sanely used here.
In general, there were similar discussions in the past (#4434 - long polling is bad, #4419 - big timeout is bad) and the conclusion was to not set the timeout.


status, err := client.ScyllaWaitTask(ctx, "", id, 0)
if err != nil {
w.abortRepairTask(context.Background(), id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why context.Background() here ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we might have failed due to SM task being stopped which results in context cancel.
In such case, we still want to perform the tablet repair task abort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants