Skip to content

test/tablets: Unmark RF-changing test with xfail #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: tablets_alter_keyspace
Choose a base branch
from

Conversation

xemul
Copy link

@xemul xemul commented May 13, 2024

Now the scailing works and test must check it does

Paweł Zakrzewski and others added 16 commits May 9, 2024 13:49
This is needed, because the same name cannot be used for 2 separate
entities, because we're getting double-metrics-registration error, thus
the names have to be configurable, not hardcoded.
Note we're suppressing a UBSanitizer overflow error in UTs. That's
because our linter complains about a possible overflow, which never
happens, but tests are still failing because of it.
Query processor needs to access storage service to check if global
topology request is still ongoing and to be able to wait until it
completes.
Allows executing combined topology & schema mutations under a single RAFT command
…blets transition kind

These will be used when processing ALTER KS statement, but also to
create a separate processing path for a KS with tablets (as opposed to
a vnode KS).
With current implementation only 1 global topo req can be executed at a
time, so when ALTER KS is executed, we'll have to check if any other
global topo req is ongoing and fail the req if that's the case.
…rocess alter ks global topo req

Because ALTER KS will result in creating a global topo req, we'll have
to pass the req data to topology coordinator's state machine, and the
easiest way to do it is through sytem.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within topology coordinator.
…ifferent raft commands

Since ALTER KS requires creating topology_change raft command, some
functions need to be extended to handle it. RAFT commands are recognized
by types, so some functions are just going to be parameterized by type,
i.e. made into templates.
These templates are instantiated already, so that only 1 instances of
each template exists across the whole code base, to avoid compiling it
in each translation unit.
This commit adds support for executing ALTER KS for keyspaces with
tablets and utilizes all the previous commits.
The ALTER KS is handled in alter_keyspace_statement, where a global
topology request in generated with data attached to system.topology
table. Then, once topology state machine is ready, it starts to handle
this global topology event, which results in producing mutations
required to change the schema of the keyspace, delete the
system.topology's global req, produce tablets mutations and additional
mutations for a table tracking the lifetime of the whole req. Tracking
the lifetime is necessary to not return the control to the user too
early, so the query processor only returns the response while the
mutations are sent.
This patch removes the support for the "wildcard" replication_factor
option for ALTER KEYSPACE when the keyspace supports tablets.
It will still be supported for CREATE KEYSPACE so that a user doesn't
have to know all datacenter names when creating the keyspace,
but ALTER KEYSPACE will require that and the user will have to
specify the exact change in replication factors they wish to make by
explicitly specifying the datacenter names.
Expanding the replication_factor option in the ALTER case is
unintuitive and it's a trap many users fell into.

See scylladb#8881, scylladb#15391, scylladb#16115
…than 1

We want to ensure that when the replication factor
of a keyspace changes, it changes by at most 1 per DC
if it uses tablets. The rationale for that is to make
sure that the old and new quorums overlap by at least
one node.

After these changes, attempts to change the RF of
a keyspace in any DC by more than 1 will fail.
This commit adds a test verifying that we can only
change the RF of a keyspace for any DC by at most 1
when using tablets.

Fixes scylladb#18029
Up until now we waited until mutations are in place and then returned
directly to the caller of the ALTER statement, but that doesn't imply
that tablets were deleted/created, so we must wait until the whole
processing is done and return only then.
Now the scailing works and test must check it does

Signed-off-by: Pavel Emelyanov <[email protected]>
@xemul
Copy link
Author

xemul commented May 13, 2024

@ptrsmrn , please, consider merging it into your next scylladb#16723 update

@ptrsmrn ptrsmrn force-pushed the tablets_alter_keyspace branch 5 times, most recently from 9a6b05c to d25d5df Compare May 14, 2024 21:47
@ptrsmrn ptrsmrn force-pushed the tablets_alter_keyspace branch 7 times, most recently from 1aa9941 to 128c122 Compare May 17, 2024 18:33
@ptrsmrn ptrsmrn force-pushed the tablets_alter_keyspace branch 10 times, most recently from ec0b88a to 6fabacf Compare May 23, 2024 12:22
@ptrsmrn ptrsmrn force-pushed the tablets_alter_keyspace branch 2 times, most recently from 5bb4ffe to 66f6001 Compare May 28, 2024 12:08
ptrsmrn pushed a commit that referenced this pull request Jun 18, 2024
Enriches the output of `scylla fiber` with resolved names of coroutine resume functions.

Before:

```
[shard  2] #0  (task*) 0x0000602004c9fbf0 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16
[shard  2] #1  (task*) 0x0000602000344c90 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16
[shard  2] #2  (task*) 0x0000602004b30c50 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16
```

After:

```
[shard  2] #0  (task*) 0x0000602004c9fbf0 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is seastar::future<void> sstables::parse<unsigned int, std::pair<sstables::metadata_type, unsigned int> >(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::disk_array<unsigned int, std::pair<sstables::metadata_type, unsigned int> >&) [clone .resume] )
[shard  2] #1  (task*) 0x0000602000344c90 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&) [clone .resume] )
[shard  2] #2  (task*) 0x0000602004b30c50 0x0000000000642880 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is sstables::sstable::read_simple<(sstables::component_type)8, sstables::statistics>(sstables::statistics&)::{lambda(sstables::sstable_version_types, seastar::file&&, unsigned long)#1}::operator()(sstables::sstable_version_types, seastar::file&&, unsigned long) const [clone .resume] )
```

Closes scylladb#19091
ptrsmrn pushed a commit that referenced this pull request Jun 28, 2024
For convenience.

Note that this line info only points to the function as a whole, not to the
current suspend point.
I think there's no facility for converting the `__coro_index` to the current suspend point automatically.

Before:

```
(gdb) scylla fiber seastar::local_engine->_current_task
[shard  1] #0  (task*) 0x0000601008e8e970 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is seastar::future<void> sstables::parse<unsigned int, std::pair<sstables::metadata_type, unsigned int> >(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::disk_array<unsigned int, std::pair<sstables::metadata_type, unsigned int> >&) [clone .resume] )
[shard  1] #1  (task*) 0x00006010092acf10 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&) [clone .resume] )
[shard  1] #2  (task*) 0x0000601008e648d0 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (.resume is sstables::sstable::read_simple<(sstables::component_type)8, sstables::statistics>(sstables::statistics&)::{lambda(sstables::sstable_version_types, seastar::file&&, unsigned long)#1}::operator()(sstables::sstable_version_types, seastar::file&&, unsigned long) const [clone .resume] )
```

After:

```
(gdb) scylla fiber seastar::local_engine->_current_task
[shard  1] #0  (task*) 0x0000601008e8e970 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (sstables::parse<unsigned int, std::pair<sstables::metadata_type, unsigned int> >(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::disk_array<unsigned int, std::pair<sstables::metadata_type, unsigned int> >&) at sstables/sstables.cc:352)
[shard  1] #1  (task*) 0x00006010092acf10 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (sstables::parse(schema const&, sstables::sstable_version_types, sstables::random_access_reader&, sstables::statistics&) at sstables/sstables.cc:570)
[shard  1] #2  (task*) 0x0000601008e648d0 0x000000000047aae0 vtable for seastar::internal::coroutine_traits_base<void>::promise_type + 16  (sstables::sstable::read_simple<(sstables::component_type)8, sstables::statistics>(sstables::statistics&)::{lambda(sstables::sstable_version_types, seastar::file&&, unsigned long)#1}::operator()(sstables::sstable_version_types, seastar::file&&, unsigned long) const at sstables/sstables.cc:992)

```

Closes scylladb#19478
ptrsmrn pushed a commit that referenced this pull request Sep 24, 2024
By turning server::shutdown() into a coroutine, we need not dynamically
allocate "nr_conn".

Verified as follows:

(1) In terminal #1:

    build/Dev/scylla --overprovisioned --developer-mode=yes \
        --memory=2G --smp=1 --default-log-level error \
        --logger-log-level cql_server=debug:cql_server_controller=debug

> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:9042 (unencrypted,
>                                     non-shard-aware)
> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:19042 (unencrypted,
>                                     shard-aware)

(2) In terminals #2 and #3:

    tools/cqlsh/bin/cqlsh.py

(3) Press ^C in terminal #1:

> DEBUG [...] cql_server - abort accept nr_total=2
> DEBUG [...] cql_server - abort accept 1 out of 2 done
> DEBUG [...] cql_server - abort accept 2 out of 2 done
> DEBUG [...] cql_server - shutdown connection nr_total=4
> DEBUG [...] cql_server - shutdown connection 1 out of 4 done
> DEBUG [...] cql_server - shutdown connection 2 out of 4 done
> DEBUG [...] cql_server - shutdown connection 3 out of 4 done
> DEBUG [...] cql_server - shutdown connection 4 out of 4 done
> INFO  [...] cql_server_controller - CQL server stopped

This patch is best viewed with "git show --word-diff=color".

Suggested-by: Benny Halevy <[email protected]>
Signed-off-by: Laszlo Ersek <[email protected]>
ptrsmrn pushed a commit that referenced this pull request Oct 10, 2024
By turning server::shutdown() into a coroutine, we need not dynamically
allocate "nr_conn".

Verified as follows:

(1) In terminal #1:

    build/Dev/scylla --overprovisioned --developer-mode=yes \
        --memory=2G --smp=1 --default-log-level error \
        --logger-log-level cql_server=debug:cql_server_controller=debug

> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:9042 (unencrypted,
>                                     non-shard-aware)
> INFO  [...] cql_server_controller - Starting listening for CQL clients
>                                     on 127.0.0.1:19042 (unencrypted,
>                                     shard-aware)

(2) In terminals #2 and #3:

    tools/cqlsh/bin/cqlsh.py

(3) Press ^C in terminal #1:

> DEBUG [...] cql_server - abort accept nr_total=2
> DEBUG [...] cql_server - abort accept 1 out of 2 done
> DEBUG [...] cql_server - abort accept 2 out of 2 done
> DEBUG [...] cql_server - shutdown connection nr_total=4
> DEBUG [...] cql_server - shutdown connection 1 out of 4 done
> DEBUG [...] cql_server - shutdown connection 2 out of 4 done
> DEBUG [...] cql_server - shutdown connection 3 out of 4 done
> DEBUG [...] cql_server - shutdown connection 4 out of 4 done
> INFO  [...] cql_server_controller - CQL server stopped

This patch is best viewed with "git show --word-diff=color".

Suggested-by: Benny Halevy <[email protected]>
Signed-off-by: Laszlo Ersek <[email protected]>
(cherry picked from commit 1138347)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants