
Manager 3.5.0 release updates #10586


Merged
merged 4 commits into scylladb:master on Apr 7, 2025

Conversation

mikliapko
Contributor

@mikliapko mikliapko commented Apr 3, 2025

Closes scylladb/scylla-manager#4343

Manager tests, configuration files, and CI updates to introduce the new Manager version 3.5.0 into SCT.

Manager 3.5.0 is the default version used in tests

As the new minor release of Manager, 3.5.0, is here, it is set as the default version used in tests.
Together with that, Scylla 2025.1 is added to the manager_versions.yaml config and will be the default version to test with Manager 3.5.0.

Changed the Manager version to upgrade from to 3.4.*

Since Manager 3.5 is out, we now need to cover upgrades from 3.4.* versions.
Manager versions 3.4.1 and 3.4.2 are the two versions currently used in production in Cloud.

Covered Scylla 2024.1/2024.2 in some of the Manager jobs

Scylla 2023.1 is not officially supported by the latest Manager release. Because of that, older enterprise jobs were switched to run with 2024.1. All Debian jobs are set to run with 2024.2 to keep the coverage for this release after switching the majority of jobs to 2025.1.

Get rf dynamically for the repair test with a multi-DC cluster

In the previous implementation, the test tried to set rf=2 for each DC while one of the DCs had only one node.
As a result, the test failed with an error like:

Datacenter us-west-2scylla_node_west doesn't have enough token-owning nodes for replication_factor=2

The new approach dynamically determines the number of nodes per DC and uses that value as the DC's replication factor.
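
For illustration, a minimal sketch of this approach in Python (the nodes_per_dc mapping and the helper names are assumptions for the example, not the actual SCT code):

def build_replication_map(nodes_per_dc):
    # nodes_per_dc: e.g. {"eu-west-1": 3, "us-west-2scylla_node_west": 1}
    # Use each DC's node count as its replication factor so no DC is asked
    # for more replicas than it has token-owning nodes.
    return {dc: count for dc, count in nodes_per_dc.items()}

def keyspace_cql(name, nodes_per_dc):
    # Build a CREATE KEYSPACE statement with a per-DC replication factor.
    options = {"class": "NetworkTopologyStrategy"}
    options.update({dc: str(rf) for dc, rf in build_replication_map(nodes_per_dc).items()})
    rendered = ", ".join(f"'{k}': '{v}'" for k, v in options.items())
    return f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = {{{rendered}}}"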

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

@mikliapko mikliapko self-assigned this Apr 3, 2025
Since Manager 3.5 is out, we now need to cover upgrades from 3.4.*
versions.

Manager versions 3.4.1 and 3.4.2 are the two versions currently used
in production in Cloud.
Scylla 2023.1 is not officially supported by the latest Manager release.
Because of that, older enterprise jobs were switched to run with 2024.1.

All Debian jobs are set to run with 2024.2 to keep the coverage for this
release after switching the majority of jobs to 2025.1.
@mikliapko mikliapko force-pushed the manager-release-3.5 branch from 5d86e88 to 183c4c3 on April 3, 2025 11:46
@pehala
Contributor

pehala commented Apr 3, 2025

I would add 2025.1 backport, due to the repair issue we encountered

@mikliapko mikliapko force-pushed the manager-release-3.5 branch 2 times, most recently from 8dbc99a to 8a13330 on April 3, 2025 11:58
As the new minor release of Manager, 3.5.0, is here, it is set as
the default version used in tests.

Together with that, Scylla 2025.1 is added to the manager_versions.yaml
config and will be the default version to test with Manager 3.5.0.

test_sdcm_mgmt_common.py tests have been updated accordingly.
In the previous implementation, the test tried to set rf=2 for each DC
while one of the DCs had only one node. As a result, the test failed with
an error (1) like

 "Datacenter us-west-2scylla_node_west doesn't have enough token-owning
 nodes for replication_factor=2"

The new approach dynamically determines the number of nodes per DC and
uses that value as the DC's replication factor.

refs:
1. https://jenkins.scylladb.com/job/manager-3.5/job/ubuntu22-sanity-test/1/
@mikliapko mikliapko force-pushed the manager-release-3.5 branch from 8a13330 to 1abac5d on April 3, 2025 12:09
@mikliapko mikliapko marked this pull request as ready for review April 3, 2025 13:17
@mikliapko mikliapko requested a review from rayakurl as a code owner April 3, 2025 13:17
@mikliapko mikliapko requested review from karol-kokoszka, Michal-Leszczynski and a team and removed request for rayakurl April 3, 2025 13:17
@fruch
Contributor

fruch commented Apr 3, 2025

what about dtest ? scylladb/scylla-dtest@62e922c

@fruch
Contributor

fruch commented Apr 3, 2025

also, can you run longevity 4h with the manager nemesis, i.e. repair, backup, and restore?
the code in those is a bit different than the manager tests...

and if backporting, it should be tested at least in this PR, and maybe even again on the backport PR.

@mikliapko
Contributor Author

@mikliapko

what about dtest ? scylladb/scylla-dtest@62e922c

In addition to the version, we need to update some error messages that changed in 3.5.0.
Should be ready tomorrow.

@mikliapko
Contributor Author

also, can you run longevity 4h with the manager nemesis, i.e. repair, backup, and restore? the code in those is a bit different than the manager tests...

and if backporting, it should be tested at least in this PR, and maybe even again on the backport PR.

Triggered Manager ops specific Nemesis:
https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/longevity-100gb-manager-ops-4h/3

@mikliapko
Contributor Author

mikliapko commented Apr 4, 2025

also, can you run longevity 4h with the manager nemesis, i.e. repair, backup, and restore? the code in those is a bit different than the manager tests...
and if backporting, it should be tested at least in this PR, and maybe even again on the backport PR.

Triggered Manager ops specific Nemesis: https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/longevity-100gb-manager-ops-4h/3

disrupt_mgmt_restore fails for this particular run. The reason is described here.

SCT redefines SnitchConfiguration here if simulated_racks > 1, which was the case for the run above.

Rerunning with simulated_racks: 0 here.

@pehala
Contributor

pehala commented Apr 7, 2025

disrupt_mgmt_restore fails for this particular run. The reason is described here.

SCT redefines SnitchConfiguration here if simulated_racks > 1, which was the case for the run above.

Why is this an issue? Is backup bound to a rack as well as the DC?

Rerunning with simulated_racks: 0 here.

So Manager 3.5 won't work with simulated racks?

@mikliapko
Contributor Author

disrupt_mgmt_restore fails for this particular run. The reason is described here.
SCT redefines SnitchConfiguration here if simulated_racks > 1, which was the case for the run above.

Why is this an issue? Is backup bound to a rack as well as the DC?

When Manager does a schema restore, it applies the schema from the backup snapshot (the schema.json file).
It contains the following keyspace cql_stmt:

"cql_stmt":"CREATE KEYSPACE \"10gb_sizetiered_2024_2\" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '3'} AND durable_writes = true AND tablets = {'enabled': false};"

where replication is defined as {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '3'} (dc name is eu-west).

At the same time, the cluster under test has the dc/rack configuration defined in cassandra-rackdc.properties:

#
# cassandra-rackdc.properties
# The lines may include white spaces at the beginning and the end.
# The rack and data center names may also include white spaces.
# All trailing and leading white spaces will be trimmed.
#  
# dc=my_data_center
# rack=my_rack
# prefer_local=<false | true>
# dc_suffix=<Data Center name suffix, used by EC2SnitchXXX snitches>

 
dc = eu-west-1
rack = RACK2
prefer_local = true

where the dc name is eu-west-1.

As a result, there is a dc name mismatch, and Manager fails to restore the schema, reporting: Unrecognized strategy option {eu-west} passed to org.apache.cassandra.locator.NetworkTopologyStrategy for keyspace 10gb_sizetiered_2024_2
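
For illustration only, a minimal sketch (not SCT or Manager code; the file path and schema parsing are assumptions) of how the mismatch can be detected by comparing the DC names used in the backed-up keyspace replication with the DC name configured in cassandra-rackdc.properties:

import re

def dc_names_from_cql(cql_stmt):
    # Extract DC names from the NetworkTopologyStrategy replication map,
    # e.g. {'class': '...NetworkTopologyStrategy', 'eu-west': '3'} -> {'eu-west'}
    match = re.search(r"replication\s*=\s*\{(.*?)\}", cql_stmt)
    keys = re.findall(r"'([^']+)'\s*:\s*'[^']*'", match.group(1)) if match else []
    return {key for key in keys if key != "class"}

def dc_name_from_rackdc(path="/etc/scylla/cassandra-rackdc.properties"):
    # Read the dc= entry from cassandra-rackdc.properties, skipping comments.
    for line in open(path):
        line = line.strip()
        if not line.startswith("#") and re.match(r"dc\s*=", line):
            return line.split("=", 1)[1].strip()
    return None

# With the values from this thread, the difference is non-empty, so restore fails:
snapshot_dcs = dc_names_from_cql(
    "CREATE KEYSPACE ks WITH replication = "
    "{'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '3'}"
)
print(snapshot_dcs - {"eu-west-1"})  # -> {'eu-west'}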

Rerunning with simulated_racks: 0 here.

So Manager 3.5 won't work with simulated racks?

It's not related to this particular Manager 3.5 release but to whether we use simulated_racks in the tests or not.
Going through the SCT code and looking for the places where we change cassandra-rackdc.properties, I found only one that might be applicable (see the code).
It takes effect when the condition self.test_config.MULTI_REGION or simulated_regions_num > 1 or self.params.get('simulated_racks') > 1 is true.

To confirm it, I'm rerunning the previous test with simulated_racks: 0 here. (The previous run 4 I mentioned above failed because I hadn't applied the change properly and the test was run with simulated_racks: 3 again.)
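
As a rough sketch of the kind of guard described above (the function and argument names are assumptions; only the quoted condition itself is taken from the SCT code):

def should_override_rackdc(test_config, params, simulated_regions_num):
    # Mirrors the quoted condition under which SCT redefines the snitch
    # configuration and rewrites cassandra-rackdc.properties.
    return (
        test_config.MULTI_REGION
        or simulated_regions_num > 1
        or params.get("simulated_racks") > 1
    )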

@mikliapko
Contributor Author

@pehala Could you please send me some links to Nemesis jobs used in release testing where we run disrupt_mgmt_restore?
Want to check why we don't get into the same issue there. Perhaps, I've just "luckily" picked up the configuration with simulated_racks for my test run.

@pehala
Contributor

pehala commented Apr 7, 2025

@pehala Could you please send me some links to Nemesis jobs used in release testing where we run disrupt_mgmt_restore?
Want to check why we don't get into the same issue there. Perhaps, I've just "luckily" picked up the configuration with simulated_racks for my test run.

This is a problem regardless; we will switch to using simulated_racks as the default very soon, so we need to fix this incompatibility with a high degree of importance. Please open an issue for it.

But given it is not tied to Manager 3.5, I think we can continue with merging this and resolve the simulated_racks issue separately.

@mikliapko
Contributor Author

@pehala Could you please send me some links to Nemesis jobs used in release testing where we run disrupt_mgmt_restore?
Want to check why we don't get into the same issue there. Perhaps, I've just "luckily" picked up the configuration with simulated_racks for my test run.

This is a problem regardless; we will switch to using simulated_racks as the default very soon, so we need to fix this incompatibility with a high degree of importance. Please open an issue for it.

Just to prioritize it properly, when are you going to switch: in one week, one month, or a couple of months?

@pehala
Contributor

pehala commented Apr 7, 2025

Just to prioritize it properly, when are you going to switch: in one week, one month, or a couple of months?

I was aiming for this week. We can discuss how to proceed once you create the issue and we know what the actual problem is.

@mikliapko
Contributor Author

Just to prioritize it properly, when are you going to switch: in one week, one month, or a couple of months?

I was aiming for this week. We can discuss how to proceed once you create the issue and we know what the actual problem is.

The issue is here (scylladb/scylla-manager#4346).
Since you need it ASAP, we can think about some workarounds (e.g., restoring the schema outside of Manager via direct CQL calls).
We will have a Manager team sync later today; I'll raise this topic.
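
As a rough illustration of that workaround idea (a sketch only, assuming the cassandra-driver package; the DC rename mapping comes from the mismatch discussed above, and none of this is the agreed solution):

from cassandra.cluster import Cluster  # cassandra-driver

def restore_keyspace_schema(contact_points, cql_stmt, dc_rename):
    # Apply a keyspace CQL statement taken from the backup snapshot after
    # rewriting its DC names to match the cluster under test,
    # e.g. dc_rename={"eu-west": "eu-west-1"}.
    for old, new in dc_rename.items():
        cql_stmt = cql_stmt.replace(f"'{old}':", f"'{new}':")
    cluster = Cluster(contact_points)
    try:
        cluster.connect().execute(cql_stmt)
    finally:
        cluster.shutdown()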

@vponomaryov
Contributor

@vponomaryov vponomaryov left a comment


LGTM

@pehala
Contributor

pehala commented Apr 7, 2025

@vponomaryov Could you please merge given it is approved and passing all the checks?

@roydahan roydahan merged commit dd22413 into scylladb:master Apr 7, 2025
7 checks passed

Successfully merging this pull request may close these issues.

Introduce Manager 3.5.0 into SCT
7 participants