Tablet repair: core implementation #4662

Michal-Leszczynski · 2025-11-27T22:50:24Z

This PR contains most of the implementation of the new tablet repair task.
It starts with refactor/improvements to existing codebase, which are needed for the tablet repair task.

This task is supposed to be a light weighted version of regular repair task for tablets.
It puts less pressure on SM DB and is resilient to parallel topology changes.
It also adds the possibility to schedule tablet repair with more frequent schedule than
the regular repair in order to utilize incremental tablet repair.
For now it does not have any flags or configuration - it is as simple as possible, but
some of the regular repair flags can be added to it later on (if we think that they are needed).

This PR also defines SM DB schema and progress swagger, but managerclient and sctool
commands need to be added in separate PRs (managerclient is a vendored submodule
and sctool commands depend on it).

Full implementation (made with go work) can be checked here: #4661

Refs #4644

Michal-Leszczynski · 2025-12-02T12:35:19Z

This failure is actually something that can be improved.
It happened because we don't abort scylla tablet repair tasks on SM task pause and trying to schedule scylla tablet repair tasks on a table which is already repaired ends with an error. Because of that, we should both:

abort scylla tablet repair task if we need to exit before it is finished
abort all ongoing scylla tablet repair tasks at the beginning of SM tablet repair task

I will add this functionality to this PR shortly.

Some task manager tasks are local to a given node, meaning that we need to check progress of this task on the node on which it was scheduled. Tablet repair is cluster wide, meaning that we can query any node for its progress. This commit makes it possible to not specify host when scheduling or checking progress of tablet repair.

…ient This way it can be accessed from different pkgs without importing repair pkg.

This way it can be accessed from different pkgs without importing repair pkg.

This error is non retryable, and we use it to safely skip colocated table tablet repair.

This way it can be accessed from different pkgs without importing repair pkg.

…able_tablets enable_tablets has been deprecated in favor of tablets_mode_for_new_keyspaces. We can use both for simplicity when using the same scylla.yaml for multiple scylla versions. This allows for controlling default alternator table replication (vnode/tablets) in the same as for the CQL tables.

It will be used for storing progress of the new tablet repair task. This commit also runs `make generate`.

Tablet repair svc is supposed to handle tablet repair in a light weighted way. Since tablet repair works fine with topology changes on scylla side, tablet repair svc also makes sure to support it. It does not contain any vnode related bloat, which is not needed for tablets, and could cause problems with parallel topology changes. Refs #4644

Since it also changes svc interfaces, it also runs `make generate`.

We need to make sure that leftover scylla tablet repair tasks are not running, as scheduling new scylla tablet repair tasks on a table with an ongoing tablet repair ends with an error. On the other hand, this is just best effort, because we can't ensure that new scylla tablet repair tasks won't be created in the meantime or that the task that we were trying to abort has just finished on its own.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 7 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-02T16:45:33Z

v3/swagger/scylla-manager.json

+        }
+      }
+    },
+    "TabletRepairTableProgress": {


The TabletRepairTableProgress definition is missing "type": "object" at the beginning of its definition. This should be added for consistency with other schema definitions like TabletRepairProgress (line 279) and Schedule (line 313). The definition should start with:

"TabletRepairTableProgress": { "type": "object", "properties": {

Copilot · 2025-12-02T16:45:33Z

pkg/service/repair/tablet/worker.go

+
+func (w *worker) abortRepairTask(ctx context.Context, id string) {
+	if err := w.client.ScyllaAbortTask(ctx, "", id); err != nil {
+		w.logger.Error(context.Background(), "Failed to abort scylla repair task",


The error logging is using context.Background() instead of the ctx parameter that was passed to the function. This means error logs won't include any context-specific information (like trace IDs, deadlines, etc.) that might be in the original context. The parameter ctx should be used here consistently:

w.logger.Error(ctx, "Failed to abort scylla repair task",

Suggested change

w.logger.Error(context.Background(), "Failed to abort scylla repair task",

w.logger.Error(ctx, "Failed to abort scylla repair task",

Copilot · 2025-12-02T16:45:34Z

pkg/metrics/tabletrepair.go

@@ -0,0 +1,51 @@
+// Copyright (C) 2017 ScyllaDB


The copyright year should be 2025 since this is a new file created in 2025, not 2017. Please update to:

// Copyright (C) 2025 ScyllaDB

Suggested change

// Copyright (C) 2017 ScyllaDB

// Copyright (C) 2025 ScyllaDB

...ger/client/operations/get_cluster_cluster_id_task_tablet_repair_task_id_run_id_parameters.go

karol-kokoszka

I went through the PR.
One comment related to the timeout on waiting for Scylla response.

Let's go through the integration-tests on Monday.

Not approving yet, although I love this thin tablet repair service. It's easy to read.

karol-kokoszka · 2025-12-02T16:45:30Z

pkg/metrics/tabletrepair.go

+	g := gaugeVecCreator("tablet_repair")
+
+	return TabletRepairMetrics{
+		tableProgress: g("Tablet repair progress in percents (0-100).", "progress", "cluster", "keyspace", "table"),


Please update list of labels with the task-id. Otherwise, metrics are mixed when there is more than one tablet repair task defined per cluster (corner-case, but supported)

karol-kokoszka · 2025-12-02T18:15:02Z

pkg/service/repair/tablet/worker.go

+	w.logger.Info(ctx, "Scheduled tablet repair", "keyspace", ks, "table", tab, "task ID", id)
+	w.upsertTableProgress(ctx, pr)
+
+	status, err := client.ScyllaWaitTask(ctx, "", id, 0)


Need your help here.

Is this call blocking? Does it wait indefinitely until Scylla returns a response?

As far as I can see, we don’t set any timeout on the HTTP client, right? If Scylla neither responds nor closes the connection, this call would hang forever.

Could you please set a reasonable client timeout? Do we have any data on how long tablet repair for a single table typically takes, so we can pick a sensible value?

Is this call blocking? Does it wait indefinitely until Scylla returns a response?

Yes

As far as I can see, we don’t set any timeout on the HTTP client, right? If Scylla neither responds nor closes the connection, this call would hang forever.

Right, we rely on correct scylla behavior

Could you please set a reasonable client timeout? Do we have any data on how long tablet repair for a single table typically takes, so we can pick a sensible value?

It depends on table and cluster size. We don't have a single value that could be sanely used here.
In general, there were similar discussions in the past (#4434 - long polling is bad, #4419 - big timeout is bad) and the conclusion was to not set the timeout.

karol-kokoszka · 2025-12-02T18:17:23Z

pkg/service/repair/tablet/worker.go

+
+	status, err := client.ScyllaWaitTask(ctx, "", id, 0)
+	if err != nil {
+		w.abortRepairTask(context.Background(), id)


Why context.Background() here ?

Because we might have failed due to SM task being stopped which results in context cancel.
In such case, we still want to perform the tablet repair task abort.

Michal-Leszczynski force-pushed the ml/4644-tablet-repair-core branch 6 times, most recently from 89cc37f to b331693 Compare November 28, 2025 10:33

Michal-Leszczynski marked this pull request as ready for review November 28, 2025 11:17

Michal-Leszczynski requested review from VAveryanov8 and karol-kokoszka as code owners November 28, 2025 11:17

Michal-Leszczynski force-pushed the ml/4644-tablet-repair-core branch from b331693 to f44dece Compare December 2, 2025 10:06

Michal-Leszczynski added 12 commits December 2, 2025 13:54

refactor(repair): move incremental repair mode definition to scyllacl…

0ba2db9

…ient This way it can be accessed from different pkgs without importing repair pkg.

refactor(repair): move colocated repair error check to scyllaclient

2943926

This way it can be accessed from different pkgs without importing repair pkg.

feat(scyllaclient): don't retry on colocated table tablet repair err

ac1b0e7

This error is non retryable, and we use it to safely skip colocated table tablet repair.

refactor(repair_test): move stop/startNode to testhelper pkg

fe18fd7

This way it can be accessed from different pkgs without importing repair pkg.

feat(schema): add tablet_repair_run_progress

fdf9d4d

It will be used for storing progress of the new tablet repair task. This commit also runs `make generate`.

feat(swagger): manager, add tablet repair progress definition

79afeff

feat(tablet_repair): plug tablet repair into scheduler and api

38b8986

Since it also changes svc interfaces, it also runs `make generate`.

feat(.github): run tablet repair tests in CI

bba0159

Michal-Leszczynski force-pushed the ml/4644-tablet-repair-core branch from adc9684 to ad81961 Compare December 2, 2025 12:56

karol-kokoszka requested a review from Copilot December 2, 2025 13:08

Copilot AI reviewed Dec 2, 2025

View reviewed changes

Copilot started reviewing on behalf of karol-kokoszka December 2, 2025 13:45 View session

Copilot finished reviewing on behalf of karol-kokoszka December 2, 2025 13:46

karol-kokoszka requested a review from Copilot December 2, 2025 16:34

Copilot started reviewing on behalf of karol-kokoszka December 2, 2025 16:35 View session

Copilot finished reviewing on behalf of karol-kokoszka December 2, 2025 16:44

Copilot AI reviewed Dec 2, 2025

View reviewed changes

karol-kokoszka reviewed Dec 2, 2025

View reviewed changes

	w.logger.Error(context.Background(), "Failed to abort scylla repair task",
	w.logger.Error(ctx, "Failed to abort scylla repair task",

	// Copyright (C) 2017 ScyllaDB
	// Copyright (C) 2025 ScyllaDB

Tablet repair: core implementation #4662

Are you sure you want to change the base?

Tablet repair: core implementation #4662

Conversation

Michal-Leszczynski commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Michal-Leszczynski commented Dec 2, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

karol-kokoszka left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

karol-kokoszka Dec 2, 2025

Choose a reason for hiding this comment

Uh oh!

karol-kokoszka Dec 2, 2025

Choose a reason for hiding this comment

Uh oh!

Michal-Leszczynski Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

karol-kokoszka Dec 2, 2025

Choose a reason for hiding this comment

Uh oh!

Michal-Leszczynski Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Michal-Leszczynski commented Nov 27, 2025 •

edited

Loading

karol-kokoszka left a comment •

edited

Loading