fix(nodetool rebuild): use repair instead of rebuild if no tablets support #9073

Draft
wants to merge 1 commit into
base: master

Conversation

yarongilor
Contributor

If nodetool rebuild is not supported with tablets, the test should use repair as an alternative action. It should then disable load balancing and repair all nodes in this datacenter.
refs: scylladb/scylladb#17575
refs: scylladb/scylladb#20084 (comment)
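
A minimal sketch of the fallback described above, for illustration only. It is not the actual sdcm/nemesis.py change: `Node` is a hypothetical stand-in for a cluster node, `rebuild_or_repair` is a hypothetical helper, and the `tablets_enabled` / `rebuild_supported` flags are assumed to be computed elsewhere. How load balancing gets disabled is not shown in this PR text, so it is passed in as a placeholder callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node (not the SCT node class)."""
    name: str
    dc_idx: int
    commands: List[str] = field(default_factory=list)

    def run_nodetool(self, sub_cmd: str) -> None:
        # Record the sub-command instead of invoking a real nodetool binary.
        self.commands.append(sub_cmd)


def rebuild_or_repair(
    nodes: List[Node],
    target: Node,
    tablets_enabled: bool,
    rebuild_supported: bool,
    disable_load_balancing: Callable[[], None],
) -> str:
    """Rebuild the target node, or repair its whole datacenter as a fallback."""
    if tablets_enabled and not rebuild_supported:
        # Placeholder hook: the real mechanism is an assumption here.
        disable_load_balancing()
        for node in nodes:
            if node.dc_idx == target.dc_idx:
                node.run_nodetool(sub_cmd="repair")
        return "repair"
    target.run_nodetool(sub_cmd="rebuild")
    return "rebuild"
```

With two nodes in datacenter 0 and one in datacenter 1, calling the helper with `tablets_enabled=True, rebuild_supported=False` and a target in datacenter 0 repairs only the two datacenter-0 nodes.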

Testing

  • [ ]

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@yarongilor yarongilor added the backport/2024.2 Need backport to 2024.2 label Oct 28, 2024
@yarongilor
Contributor Author

yarongilor commented Oct 28, 2024

@bhalevy, can you please advise, following scylladb/scylladb#20084 (comment):

IIUC, as long as scylladb/scylladb#17852 is open, all DC nodes should be repaired manually.
Otherwise, is there no need for a repair at all, or should only the target node be repaired?

Secondly, I'm not sure whether it is right to backport this fix to 2024/6.x (it may have an extensive impact on longevities and on testing for this PR).

@yarongilor yarongilor added area/tablets and removed backport/2024.2 Need backport to 2024.2 labels Oct 28, 2024
@yarongilor yarongilor force-pushed the skip_rebuild_streaming_err_with_tablets branch 2 times, most recently from 5340448 to f24debe Compare October 30, 2024 15:19
sdcm/nemesis.py Outdated
with self.cluster.cql_connection_patient(self.target_node) as session:
    if is_tablets_feature_enabled(session=session) and not is_rebuild_supported:
        for node in [n for n in self.cluster.nodes if n.dc_idx == self.target_node.dc_idx]:
            node.run_nodetool(sub_cmd="repair")
Contributor


I would recommend passing long_running=True, retry=0.

Also, maybe consider a hard timeout.
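
To illustrate what a hard timeout with no retries could look like in general, here is a minimal sketch using only the Python standard library. `run_with_hard_timeout` is a hypothetical helper, not part of SCT or of `run_nodetool`; it runs an arbitrary command once and gives up when the deadline passes.

```python
import subprocess
import sys


def run_with_hard_timeout(cmd, timeout_s):
    """Run cmd exactly once; return True on exit code 0, False on timeout or failure."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True, check=False)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # A trivial child process that exits immediately.
    print(run_with_hard_timeout([sys.executable, "-c", "pass"], timeout_s=10))
```

The single attempt mirrors the retry=0 suggestion: a long repair that times out is reported as a failure rather than silently restarted.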

Contributor


Also, I'm not sure you have a guarantee that all the nodes in this DC are up and running...

Contributor Author


Fixed like this:

                for node in [n for n in self.cluster.nodes if n.dc_idx == self.target_node.dc_idx and n.db_up()]:
                    node.run_nodetool(sub_cmd="repair", long_running=True, retry=0)

@fruch, since long_running support was already mentioned, how about using the Scylla task manager to monitor the progress and results of such commands?
There's already a dtest covering the task manager in https://github.com/scylladb/scylla-dtest/pull/4957
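
A rough sketch of what such monitoring could look like: a generic polling loop that waits for a task to reach a terminal state. Everything here is an assumption for illustration: `wait_for_task` is a hypothetical helper, `get_state` is a caller-supplied callable that would query the task manager, and the state names "done" and "failed" are assumed rather than taken from this PR.

```python
import time
from typing import Callable


def wait_for_task(
    get_state: Callable[[], str],
    timeout_s: float,
    poll_interval_s: float = 1.0,
) -> str:
    """Poll get_state until it reports a terminal state or the deadline passes."""
    terminal_states = ("done", "failed")  # assumed state names
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still in state {state!r} after {timeout_s}s")
        time.sleep(poll_interval_s)
```

The caller decides how `get_state` talks to the cluster, which keeps the loop itself easy to unit-test with a fake status source.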

@yarongilor
Contributor Author

@bhalevy, @pehala, please advise:
I see that nodetool rebuild continues to be tested in SCT with no complaints or reported issues.
On the other hand, it is indeed skipped in dtest, as in:
https://github.com/scylladb/scylla-dtest/blob/1f70dde42ceeb2f2c45bbd773bb00bc7d8bb56ad/rebuild_test.py#L47
So if this workaround is still relevant, I'll address the comments and run tests for it.

@yarongilor yarongilor force-pushed the skip_rebuild_streaming_err_with_tablets branch 2 times, most recently from 3635f1a to 98b7009 Compare February 27, 2025 17:19
…pport

If nodetool rebuild is not supported with tablets, the test should use repair as an alternative action.
It should then disable load balancing and repair all nodes in this datacenter.
refs: scylladb/scylladb#17575
refs: scylladb/scylladb#20084 (comment)
@yarongilor yarongilor force-pushed the skip_rebuild_streaming_err_with_tablets branch from 98b7009 to 9ab928c Compare February 27, 2025 18:32
@pehala
Contributor

pehala commented Feb 28, 2025

@bhalevy, @pehala, please advise:
I see that nodetool rebuild continues to be tested in SCT with no complaints or reported issues.
On the other hand, it is indeed skipped in dtest, as in:

I believe a new tablet repair is being worked on; I'm not sure about the timelines.

3 participants