@Deexie Deexie commented Aug 26, 2024

Reproducer for mixed-shard repair, used to choose the best solution for scylladb/scylladb#18269.

Sets up a 3-node cluster on AWS with 1 TB of data and runs repair.

It will be run on Jenkins with the following configurations:

@Deexie Deexie requested review from asias and denesb August 26, 2024 14:54
@Deexie Deexie force-pushed the mixed-shard-repair branch 6 times, most recently from 05d9461 to 3060c15 Compare August 27, 2024 14:52

denesb commented Aug 29, 2024

I am not familiar with the SCT code, but the description looks good to me.
Did you get a chance to run the test? How do the numbers look?

@Deexie Deexie force-pushed the mixed-shard-repair branch from 3060c15 to 7f38a55 Compare September 2, 2024 15:42

Deexie commented Sep 2, 2024

  • change instance type to i3.16xlarge

@Deexie Deexie force-pushed the mixed-shard-repair branch from 7f38a55 to 9502ed2 Compare September 3, 2024 12:50

Deexie commented Sep 3, 2024

  • change shard count

@Deexie Deexie force-pushed the mixed-shard-repair branch from 9502ed2 to d09d884 Compare September 4, 2024 07:21

Deexie commented Sep 4, 2024

  • change loaders instance
  • split data population

@Deexie Deexie force-pushed the mixed-shard-repair branch 4 times, most recently from 4b62220 to 13c631d Compare September 5, 2024 10:51

Deexie commented Sep 5, 2024

master-60-59-58
test duration: 1h51m
repair time: 936.2321102619171s (15min)
argus: https://argus.scylladb.com/test_runs?state=WyI4YTI3MGEyYS05OWE2LTQwYzYtODkxNS1lMzhiOGFiOGQ2OGMiXQ
non-LSA memory: image


Deexie commented Sep 6, 2024

master-60-60-60
test duration: 1h 24min
repair time: 553.5129013061523s (9 min)
image


Deexie commented Sep 6, 2024

poc1-60-59-58
test duration: 3h 2min
repair time: 5473.398208618164s (1.5h)
image
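
For a quick side-by-side of the three repair times reported so far, a minimal sketch follows. The numbers are copied from the comments above; the `summarize` helper and the choice of master-60-60-60 as the baseline are illustrative assumptions, not part of the test code.

```python
# Repair times in seconds, copied from the results posted above.
repair_seconds = {
    "master-60-59-58": 936.2321102619171,
    "master-60-60-60": 553.5129013061523,
    "poc1-60-59-58": 5473.398208618164,
}

# Assumption for illustration: use the master-60-60-60 run as the baseline.
baseline = repair_seconds["master-60-60-60"]


def summarize(times, baseline):
    """Return (label, minutes, ratio to baseline) rows, rounded for display."""
    return [
        (label, round(seconds / 60, 1), round(seconds / baseline, 2))
        for label, seconds in times.items()
    ]


for label, minutes, ratio in summarize(repair_seconds, baseline):
    print(f"{label}: {minutes} min, {ratio}x the master-60-60-60 repair time")
```

On these numbers, the master-60-59-58 repair took about 1.69x and the poc1-60-59-58 repair about 9.89x as long as the master-60-60-60 repair.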


Deexie commented Sep 6, 2024

poc2-60-59-58
failed after: 7h 45min

02:09:38  error running operation: std::system_error (error system:104, recv: Connection reset by peer)
02:09:38  ----- LAST WARNING EVENT -----------------------------------------------------
02:09:38  2024-09-05 19:38:43.928 <2024-09-05 19:38:43.699>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=c571cbf1-5131-4da8-8563-3d5ed86ec7bb: type=WARNING regex=(^WARNING|!\s*?WARNING).*\[shard.*\] line_number=109442 node=ubuntu-mixed-sh-db-node-a7786369-2
02:09:38  2024-09-05T19:38:43.699+00:00 ubuntu-mixed-sh-db-node-a7786369-2  !WARNING | scylla[15842]:  [shard  0: gms] seastar_memory - oversized allocation: 1069056 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x6102d3e 0x6103350 0x6103658 0x5bafde2 0x5bb2585 0x45e9158 0x45e255a 0x5c0591f 0x5c06e9a 0x5c08077 0x5c07428 0x5b97593 0x5b968f3 0x13cf2f5 0x13d0cb0 0x13cd713 /opt/scylladb/libreloc/libc.so.6+0x2a087 /opt/scylladb/libreloc/libc.so.6+0x2a14a 0x13cad94
02:09:38  ----- LAST NORMAL EVENT ------------------------------------------------------
02:09:38  2024-09-05 19:38:10.473: (PrometheusAlertManagerEvent Severity.NORMAL) period_type=end event_id=d69143fc-2503-4eb1-b248-61a7a9171077 duration=1h35m59s: alert_name=InstanceDown node=10.4.1.235 start=2024-09-05T18:02:07.408Z end=2024-09-05T18:06:07.408Z description=10.4.1.235 has been down for more than 30 seconds. updated=2024-09-05T18:02:07.412Z state=active fingerprint=45469a7e312b47e8 labels={'alertname': 'InstanceDown', 'cluster': 'my-cluster', 'dc': 'eu-west-1', 'instance': '10.4.1.235', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '3'}
02:09:38  ================================================================================

decoded:

[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:68
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:97
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:148
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:181
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:849
 (inlined by) seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:912
 (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:919
 (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1542
 (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1688
malloc at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1707
service::raft_sys_table_storage::load_log() at ././seastar/include/seastar/core/sstring.hh:167
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<boost::container::deque<seastar::lw_shared_ptr<raft::log_entry const>, void, void> >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<boost::container::deque<seastar::lw_shared_ptr<raft::log_entry const>, void, void> >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2577
seastar::reactor::run_some_tasks() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3043
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3211
seastar::reactor::run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:3101
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ././main.cc:700
std::function<int (int, char**)>::operator()(int, char**) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591
main at ././main.cc:2246
/data/scylla-s3-reloc.cache/by-build-id/f8ada775ee7b1210127d4237f218442ce59c3ae3/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8f53abaad945a669f2bdcd25f471d80e077568ef, for GNU/Linux 3.2.0, not stripped

__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?

@Deexie Deexie force-pushed the mixed-shard-repair branch 2 times, most recently from 3f3986d to b3929c0 Compare September 13, 2024 16:28
@Deexie Deexie force-pushed the mixed-shard-repair branch 2 times, most recently from 7fca2ff to a948bb9 Compare September 19, 2024 13:05

asias commented Sep 20, 2024

@Deexie How did you execute the new SCT test introduced in this PR? Did you run it through Jenkins? Could you share the details?

"""),

dict(name="nodes_smp", env="SCT_NODES_SMP", type=list,
help="List of shard numbers of nodes in Scylla cluster; list of int, like [4, 5, 3]"),
Contributor

Suggested change
- help="List of shard numbers of nodes in Scylla cluster; list of int, like [4, 5, 3]"),
+ help="List of shard number to set per node in Scylla cluster; list of int, like [4, 5, 3]"),

I wonder how it would work with multi-dc cases:

region_name: 'eu-west-1 us-east-1'
n_db_nodes: '2 1'
nodes_smp: [12, 12, 15]

Author

The number is based on node_index, and I think it does not depend on the DC.
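
To make the multi-DC question concrete, here is a minimal sketch of index-based assignment. It assumes node indexes run consecutively across regions in the order the regions are listed; `assign_smp` is a hypothetical helper for illustration, not the SCT implementation.

```python
def assign_smp(region_name, n_db_nodes, nodes_smp):
    """Pair each node with a shard count purely by its cluster-wide index.

    Hypothetical helper: assumes indexes are assigned region by region,
    in the order the regions appear in `region_name`.
    """
    regions = region_name.split()
    counts = [int(count) for count in n_db_nodes.split()]
    if len(regions) != len(counts):
        raise ValueError("region_name and n_db_nodes must have the same number of entries")
    if sum(counts) != len(nodes_smp):
        raise ValueError("nodes_smp needs exactly one entry per DB node")

    assignments = []
    node_index = 0
    for region, count in zip(regions, counts):
        for _ in range(count):
            # The shard count is looked up by node index only; the region is informational.
            assignments.append((region, node_index, nodes_smp[node_index]))
            node_index += 1
    return assignments


for row in assign_smp("eu-west-1 us-east-1", "2 1", [12, 12, 15]):
    print(row)
```

Under that assumption, the example configuration above gives 12 shards to both eu-west-1 nodes (indexes 0 and 1) and 15 shards to the single us-east-1 node (index 2).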

@fruch fruch added backport/none Backport is not required test-provision-aws Run provision test on AWS test-provision-gce Run provision test on GCE test-provision-docker labels Jan 20, 2025
fruch previously approved these changes Jan 20, 2025
@fruch fruch left a comment

LGTM

  • we might be able to find a better name for the configuration option
  • default argument values shouldn't be mutable
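
The note about mutable arguments presumably refers to Python's mutable default argument pitfall. A minimal sketch of the problem and the usual `None` fix follows; the function names are made up for illustration and do not come from the PR.

```python
def add_node_buggy(smp, nodes_smp=[]):
    # The default list is created once, when the function is defined,
    # and is shared by every call that omits the argument.
    nodes_smp.append(smp)
    return nodes_smp


def add_node(smp, nodes_smp=None):
    # With None as the default, a fresh list is created on each call
    # that omits the argument.
    if nodes_smp is None:
        nodes_smp = []
    nodes_smp.append(smp)
    return nodes_smp


first = add_node_buggy(4)
second = add_node_buggy(5)
print(second)                     # [4, 5]
print(first is second)            # True
print(add_node(4), add_node(5))   # [4] [5]
```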


Deexie commented Jan 24, 2025

  • use None as a default param value
  • rename nodes_smp to smp_per_db_node_mapping
  • use str_or_list_or_eval type for smp_per_db_node_mapping
  • add pipelines with custom shard number for some tests that run with random shard number
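
As a rough illustration of accepting either a list or its string form for this option, here is a minimal sketch. `parse_smp_mapping` is a hypothetical stand-in, not SCT's `str_or_list_or_eval`, and it assumes the value keeps the list-of-int shape from the earlier help text.

```python
import ast


def parse_smp_mapping(value):
    """Accept a list of ints, or a string such as "[4, 5, 3]", and return a list of ints.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, str):
        # literal_eval only evaluates Python literals, so arbitrary code is not executed.
        value = ast.literal_eval(value)
    is_int_list = isinstance(value, list) and all(
        isinstance(item, int) and not isinstance(item, bool) for item in value
    )
    if not is_int_list:
        raise ValueError(f"expected a list of ints, got {value!r}")
    return value


print(parse_smp_mapping("[4, 5, 3]"))
print(parse_smp_mapping([4, 5, 3]))
```

Both calls return the same list, so the option could be set from a YAML list or from an environment variable holding the string form.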

@Deexie Deexie force-pushed the mixed-shard-repair branch from 3ba3fec to 3d55f23 Compare January 24, 2025 13:33

Deexie commented Jan 24, 2025

  • modify smp_per_db_node_mapping description

@scylladbbot

@Deexie new branch branch-2025.1 was added, please add backport label if needed

@Deexie Deexie force-pushed the mixed-shard-repair branch from 3d55f23 to 8482007 Compare January 28, 2025 15:17

Deexie commented Jan 28, 2025

  • drop excessive self arg

@Deexie Deexie force-pushed the mixed-shard-repair branch from 8482007 to 3a1912d Compare March 4, 2025 15:56

denesb commented Apr 14, 2025

@Deexie what is the status of this?


mykaul commented May 25, 2025

@Deexie , @denesb - this was forgotten, but we still need it. Let's see if we can revive it.

@Deexie Deexie force-pushed the mixed-shard-repair branch from 3a1912d to 4eeff18 Compare May 26, 2025 09:24
@github-actions

identified changes in generated code

diff found by running:
bash ./docker/env/hydra.sh update-conf-docs:

 docs/configuration_options.md | 9 +++++++++
 1 file changed, 9 insertions(+)


Deexie commented May 26, 2025

  • rebase

@Deexie Deexie force-pushed the mixed-shard-repair branch from 4eeff18 to 3e7055e Compare May 26, 2025 10:23
@github-actions

identified changes in generated code

diff found by running:
bash ./docker/env/hydra.sh update-conf-docs:

 docs/configuration_options.md | 9 +++++++++
 1 file changed, 9 insertions(+)


denesb commented May 26, 2025

@Deexie should this PR be un-marked as Draft?

@Deexie Deexie marked this pull request as ready for review May 27, 2025 07:24
Deexie added 3 commits June 17, 2025 17:01
Add custom shard number config for Scylla clusters.
…es with custom shard number

Copy asymmetric Jenkins longevity pipelines and set custom shard
number for them.
@Deexie Deexie force-pushed the mixed-shard-repair branch from 3e7055e to 66a8f7f Compare June 17, 2025 15:02

Deexie commented Jun 17, 2025

  • rebase

@github-actions

identified changes in generated code

diff found by running:
bash ./docker/env/hydra.sh update-conf-docs:

 docs/configuration_options.md | 9 +++++++++
 1 file changed, 9 insertions(+)
