
Manager 3.5.0 release updates #10586


Merged
merged 4 commits into scylladb:master on Apr 7, 2025

Conversation

mikliapko
Contributor

@mikliapko mikliapko commented Apr 3, 2025

Closes scylladb/scylla-manager#4343

Manager tests, configuration files, and CI updates to introduce the new Manager version 3.5.0 into SCT.

Manager 3.5.0 is the default version used in tests

As the new minor release of Manager, 3.5.0, is here, it is set as the default version used in tests.
Together with that, Scylla 2025.1 is added to the manager_versions.yaml config and will be the default version to test with Manager 3.5.0.

Changed the Manager version to upgrade from to 3.4.*

Since Manager 3.5 is out, we now need to cover upgrades from 3.4.* versions.
Manager versions 3.4.1 and 3.4.2 are the two versions currently used in production in Cloud.

Covered Scylla 2024.1/2024.2 in some of the Manager jobs

Scylla 2023.1 is not officially supported by the latest Manager release. Because of that, older enterprise jobs were switched to run with 2024.1. All Debian jobs are set to run with 2024.2 to keep the coverage for this release after switching the majority of jobs to 2025.1.

Get rf dynamically for the repair test with a multi-DC cluster

In the previous implementation, the test tried to set rf=2 for each DC while one of the DCs had only one node.
As a result, the test failed with an error like:

Datacenter us-west-2scylla_node_west doesn't have enough token-owning nodes for replication_factor=2

The new approach dynamically determines the number of nodes per DC and uses that value as the DC's replication factor.
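
For illustration, a minimal sketch of this approach in Python (the nodes_per_dc mapping and the helper names are assumptions for the example, not the actual SCT code):

def build_replication_map(nodes_per_dc):
    # nodes_per_dc: e.g. {"eu-west-1": 3, "us-west-2scylla_node_west": 1}
    # Use each DC's node count as its replication factor so no DC is asked
    # for more replicas than it has token-owning nodes.
    return {dc: count for dc, count in nodes_per_dc.items()}

def keyspace_cql(name, nodes_per_dc):
    # Build a CREATE KEYSPACE statement with a per-DC replication factor.
    options = {"class": "NetworkTopologyStrategy"}
    options.update({dc: str(rf) for dc, rf in build_replication_map(nodes_per_dc).items()})
    rendered = ", ".join(f"'{k}': '{v}'" for k, v in options.items())
    return f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = {{{rendered}}}"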

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

@mikliapko mikliapko self-assigned this Apr 3, 2025
Since Manager 3.5 is out, we now need to cover upgrades from 3.4.*
versions.

Manager versions 3.4.1 and 3.4.2 are the two versions currently used
in production in Cloud.
Scylla 2023.1 is not officially supported by the latest Manager release.
Because of that, older enterprise jobs were switched to run with 2024.1.

All Debian jobs are set to run with 2024.2 to keep the coverage for this
release after switching the majority of jobs to 2025.1.
@mikliapko mikliapko force-pushed the manager-release-3.5 branch from 5d86e88 to 183c4c3 on April 3, 2025 11:46
@pehala
Contributor

pehala commented Apr 3, 2025

I would add 2025.1 backport, due to the repair issue we encountered

@mikliapko mikliapko force-pushed the manager-release-3.5 branch 2 times, most recently from 8dbc99a to 8a13330 on April 3, 2025 11:58
As the new minor release of Manager, 3.5.0, is here, it is set as
the default version used in tests.

Together with that, Scylla 2025.1 is added to the manager_versions.yaml
config and will be the default version to test with Manager 3.5.0.

test_sdcm_mgmt_common.py tests have been updated accordingly.
In the previous implementation, the test tried to set rf=2 for each DC
while one of the DCs had only one node. As a result, the test failed with
an error (1) like

 "Datacenter us-west-2scylla_node_west doesn't have enough token-owning
 nodes for replication_factor=2"

The new approach dynamically determines the number of nodes per DC and
uses that value as the DC's replication factor.

refs:
1. https://jenkins.scylladb.com/job/manager-3.5/job/ubuntu22-sanity-test/1/
@mikliapko mikliapko force-pushed the manager-release-3.5 branch from 8a13330 to 1abac5d on April 3, 2025 12:09
@mikliapko mikliapko marked this pull request as ready for review April 3, 2025 13:17
@mikliapko mikliapko requested a review from rayakurl as a code owner April 3, 2025 13:17
@mikliapko mikliapko requested review from karol-kokoszka, Michal-Leszczynski and a team and removed request for rayakurl April 3, 2025 13:17
@fruch
Contributor

fruch commented Apr 3, 2025

what about dtest ? scylladb/scylla-dtest@62e922c

@fruch
Contributor

fruch commented Apr 3, 2025

also, can you run longevity 4h with the manager nemesis, i.e. repair, backup, and restore?
the code in those is a bit different than the manager tests...

and if backporting, it should be tested at least in this PR, and maybe even again on the backport PR.

@mikliapko
Contributor Author

@mikliapko

what about dtest ? scylladb/scylla-dtest@62e922c

In addition to the version, we need to update some error messages that changed in 3.5.0.
Should be ready tomorrow.

@mikliapko
Contributor Author

also, can you run longevity 4h with the manager nemesis, i.e. repair, backup, and restore? the code in those is a bit different than the manager tests...

and if backporting, it should be tested at least in this PR, and maybe even again on the backport PR.

Triggered Manager ops specific Nemesis:
https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/longevity-100gb-manager-ops-4h/3

@mikliapko
Contributor Author

mikliapko commented Apr 4, 2025

also, can you run longevity 4h with the manager nemesis, i.e. repair, backup, and restore? the code in those is a bit different than the manager tests...
and if backporting, it should be tested at least in this PR, and maybe even again on the backport PR.

Triggered Manager ops specific Nemesis: https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/longevity-100gb-manager-ops-4h/3

disrupt_mgmt_restore fails for this particular run. The reason is described here.

SCT redefines SnitchConfiguration here if simulated_racks > 1, which was the case for the run above.

Rerunning with simulated_racks: 0 here.

@pehala
Contributor

pehala commented Apr 7, 2025

disrupt_mgmt_restore fails for this particular run. The reason is described here.

SCT redefines SnitchConfiguration here if simulated_racks > 1, which was the case for the run above.

Why is this an issue? Is backup bound to a rack as well as the DC?

Rerunning with simulated_racks: 0 here.

So Manager 3.5 won't work with simulated racks?

@mikliapko
Contributor Author

disrupt_mgmt_restore fails for this particular run. The reason is described here.
SCT redefines SnitchConfiguration here if simulated_racks > 1, which was the case for the run above.

Why is this an issue? Is backup bound to a rack as well as the DC?

When Manager does a schema restore, it applies the schema from the backup snapshot (the schema.json file).
It contains the following keyspace cql_stmt:

"cql_stmt":"CREATE KEYSPACE \"10gb_sizetiered_2024_2\" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '3'} AND durable_writes = true AND tablets = {'enabled': false};"

where replication is defined as {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '3'} (dc name is eu-west).

At the same time, the cluster under test has the dc/rack configuration defined in cassandra-rackdc.properties:

#
# cassandra-rackdc.properties
# The lines may include white spaces at the beginning and the end.
# The rack and data center names may also include white spaces.
# All trailing and leading white spaces will be trimmed.
#  
# dc=my_data_center
# rack=my_rack
# prefer_local=<false | true>
# dc_suffix=<Data Center name suffix, used by EC2SnitchXXX snitches>

 
dc = eu-west-1
rack = RACK2
prefer_local = true

where the dc name is eu-west-1.

As a result, there is a dc name mismatch, and Manager fails to restore the schema, reporting: Unrecognized strategy option {eu-west} passed to org.apache.cassandra.locator.NetworkTopologyStrategy for keyspace 10gb_sizetiered_2024_2
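
For illustration only, a minimal sketch (not SCT or Manager code; the file path and schema parsing are assumptions) of how the mismatch can be detected by comparing the DC names used in the backed-up keyspace replication with the DC name configured in cassandra-rackdc.properties:

import re

def dc_names_from_cql(cql_stmt):
    # Extract DC names from the NetworkTopologyStrategy replication map,
    # e.g. {'class': '...NetworkTopologyStrategy', 'eu-west': '3'} -> {'eu-west'}
    match = re.search(r"replication\s*=\s*\{(.*?)\}", cql_stmt)
    keys = re.findall(r"'([^']+)'\s*:\s*'[^']*'", match.group(1)) if match else []
    return {key for key in keys if key != "class"}

def dc_name_from_rackdc(path="/etc/scylla/cassandra-rackdc.properties"):
    # Read the dc= entry from cassandra-rackdc.properties, skipping comments.
    for line in open(path):
        line = line.strip()
        if not line.startswith("#") and re.match(r"dc\s*=", line):
            return line.split("=", 1)[1].strip()
    return None

# With the values from this thread, the difference is non-empty, so restore fails:
snapshot_dcs = dc_names_from_cql(
    "CREATE KEYSPACE ks WITH replication = "
    "{'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'eu-west': '3'}"
)
print(snapshot_dcs - {"eu-west-1"})  # -> {'eu-west'}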

Rerunning with simulated_racks: 0 here.

So Manager 3.5 won't work with simulated racks?

It's not related to this particular Manager 3.5 release but to whether we use simulated_racks in the tests or not.
Going through the SCT code and looking for the places where we change cassandra-rackdc.properties, I found only one that might be applicable (see the code).
It takes effect when the condition self.test_config.MULTI_REGION or simulated_regions_num > 1 or self.params.get('simulated_racks') > 1 is true.

To confirm it, I'm rerunning the previous test with simulated_racks: 0 here. (The previous run 4 I mentioned above failed because I hadn't applied the change properly and the test was run with simulated_racks: 3 again.)
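
As a rough sketch of the kind of guard described above (the function and argument names are assumptions; only the quoted condition itself is taken from the SCT code):

def should_override_rackdc(test_config, params, simulated_regions_num):
    # Mirrors the quoted condition under which SCT redefines the snitch
    # configuration and rewrites cassandra-rackdc.properties.
    return (
        test_config.MULTI_REGION
        or simulated_regions_num > 1
        or params.get("simulated_racks") > 1
    )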

@mikliapko
Contributor Author

@pehala Could you please send me some links to Nemesis jobs used in release testing where we run disrupt_mgmt_restore?
Want to check why we don't get into the same issue there. Perhaps, I've just "luckily" picked up the configuration with simulated_racks for my test run.

@pehala
Contributor

pehala commented Apr 7, 2025

@pehala Could you please send me some links to Nemesis jobs used in release testing where we run disrupt_mgmt_restore?
Want to check why we don't get into the same issue there. Perhaps, I've just "luckily" picked up the configuration with simulated_racks for my test run.

This is a problem regardless; we will switch to using simulated_racks as the default very soon, so we need to fix this incompatibility with a high degree of importance. Please open an issue for it.

But given it is not tied to Manager 3.5, I think we can continue with merging this and resolve the simulated_racks issue separately.

@mikliapko
Contributor Author

@pehala Could you please send me some links to Nemesis jobs used in release testing where we run disrupt_mgmt_restore?
Want to check why we don't get into the same issue there. Perhaps, I've just "luckily" picked up the configuration with simulated_racks for my test run.

This is a problem regardless; we will switch to using simulated_racks as the default very soon, so we need to fix this incompatibility with a high degree of importance. Please open an issue for it.

Just to prioritize it properly, when are you going to switch: in one week, one month, or a couple of months?

@pehala
Contributor

pehala commented Apr 7, 2025

Just to prioritize it properly, when are you going to switch: in one week, one month, or a couple of months?

I was aiming for this week. We can discuss how to proceed once you create the issue and we know what the actual problem is.

@mikliapko
Contributor Author

Just to prioritize it properly, when are you going to switch: in one week, one month, or a couple of months?

I was aiming for this week. We can discuss how to proceed once you create the issue and we know what the actual problem is.

The issue is here (scylladb/scylla-manager#4346).
Since you need it ASAP, we can think about some workarounds (e.g., restoring the schema outside of Manager via direct CQL calls).
We will have a Manager team sync later today; I'll raise this topic.
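
As a rough illustration of that workaround idea (a sketch only, assuming the cassandra-driver package; the DC rename mapping comes from the mismatch discussed above, and none of this is the agreed solution):

from cassandra.cluster import Cluster  # cassandra-driver

def restore_keyspace_schema(contact_points, cql_stmt, dc_rename):
    # Apply a keyspace CQL statement taken from the backup snapshot after
    # rewriting its DC names to match the cluster under test,
    # e.g. dc_rename={"eu-west": "eu-west-1"}.
    for old, new in dc_rename.items():
        cql_stmt = cql_stmt.replace(f"'{old}':", f"'{new}':")
    cluster = Cluster(contact_points)
    try:
        cluster.connect().execute(cql_stmt)
    finally:
        cluster.shutdown()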

@vponomaryov
Contributor

@vponomaryov vponomaryov left a comment


LGTM

@pehala
Contributor

pehala commented Apr 7, 2025

@vponomaryov Could you please merge given it is approved and passing all the checks?

@roydahan roydahan merged commit dd22413 into scylladb:master Apr 7, 2025
7 checks passed

Successfully merging this pull request may close these issues.

Introduce Manager 3.5.0 into SCT
7 participants