Description
scylla-bench: scylladb/hydra-loaders:scylla-bench-v0.1.16 (docker)
scylla: 2023.1.0~rc4-0.20230402.f90af81a13cc with build-id 4db7799d283c7ca63c7639b04db34c301dbcfb66 (non-docker)
Platform: AWS / eu-west-1
Number of DB nodes: 4
DB node instance type: i3en.3xlarge
Number of loaders: 3
Loader node instance type: c5n.2xlarge
Loader command (the same command was run on each of the three loaders; see the payload estimate sketch right below):

```
scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=60 -clustering-row-count=10000000 -retry-number=0 -clustering-row-size=2048 -rows-per-request=2000 -timeout=180s -concurrency=700 -max-rate=64000 -duration=5760m -connection-count 500
```
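For scale, here is a back-of-envelope estimate (a rough sketch, not a measurement) of the raw payload these flags imply, assuming every worker can hold one full response in memory at once; frame buffers, goroutine stacks, and HDR histograms come on top of this:

```go
package main

import "fmt"

// Rough lower bound on the in-flight payload implied by the loader flags:
// 700 concurrent workers, each fetching up to 2000 rows of 2048 bytes.
// This ignores gocql frame buffers, goroutine stacks, and HDR histograms,
// so real usage will be higher -- but it sets the scale to compare RSS against.
func main() {
	const (
		concurrency    = 700  // -concurrency
		rowsPerRequest = 2000 // -rows-per-request
		rowSize        = 2048 // -clustering-row-size, bytes
	)
	payload := int64(concurrency) * rowsPerRequest * rowSize
	fmt.Printf("max in-flight payload: ~%.1f GiB\n", float64(payload)/(1<<30))
	// prints: max in-flight payload: ~2.7 GiB
}
```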
Running this command, one instance per loader, we got one of the loaders (loader-3) OOM-killed:
```
Apr 19 18:23:31 kernel: scylla-bench invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Apr 19 18:23:31 kernel: scylla-bench cpuset=8ea88c75ae805f18fb862b1ec82e301af32c4e910b0eb2b87364ad7e62d831d5 mems_allowed=0
Apr 19 18:23:31 kernel: CPU: 5 PID: 21662 Comm: scylla-bench Not tainted 4.20.4-1.el7.elrepo.x86_64 #1
Apr 19 18:23:31 kernel: Hardware name: Amazon EC2 c5n.2xlarge/, BIOS 1.0 10/16/2017
Apr 19 18:23:31 kernel: Call Trace:
Apr 19 18:23:31 kernel: dump_stack+0x63/0x88
Apr 19 18:23:31 kernel: dump_header+0x78/0x2a4
Apr 19 18:23:31 kernel: ? cpuset_mems_allowed_intersects+0x21/0x30
Apr 19 18:23:31 kernel: oom_kill_process+0x262/0x290
Apr 19 18:23:31 kernel: out_of_memory+0x140/0x4b0
Apr 19 18:23:31 kernel: __alloc_pages_slowpath+0x863/0xaf9
Apr 19 18:23:31 kernel: __alloc_pages_nodemask+0x266/0x2a0
Apr 19 18:23:31 kernel: alloc_pages_current+0x6a/0xb0
Apr 19 18:23:31 kernel: __page_cache_alloc+0x73/0x90
Apr 19 18:23:31 kernel: filemap_fault+0x40c/0x7c0
Apr 19 18:23:31 kernel: ? unlock_page_memcg+0x12/0x20
Apr 19 18:23:31 kernel: ? page_add_file_rmap+0x187/0x220
Apr 19 18:23:31 kernel: ? _cond_resched+0x19/0x30
Apr 19 18:23:31 kernel: ? down_read+0x12/0x40
Apr 19 18:23:31 kernel: __xfs_filemap_fault+0x72/0x1c0 [xfs]
Apr 19 18:23:31 kernel: xfs_filemap_fault+0x2c/0x30 [xfs]
Apr 19 18:23:31 kernel: __do_fault+0x3e/0xc0
Apr 19 18:23:31 kernel: __handle_mm_fault+0xbd6/0xe80
Apr 19 18:23:31 kernel: handle_mm_fault+0x102/0x220
Apr 19 18:23:31 kernel: __do_page_fault+0x21c/0x4c0
Apr 19 18:23:31 kernel: do_page_fault+0x37/0x140
Apr 19 18:23:31 kernel: ? async_page_fault+0x8/0x30
Apr 19 18:23:31 kernel: do_async_page_fault+0x22/0xd0
Apr 19 18:23:31 kernel: ? async_page_fault+0x8/0x30
Apr 19 18:23:31 kernel: async_page_fault+0x1e/0x30
Apr 19 18:23:31 kernel: RIP: 0033:0x464200
Apr 19 18:23:31 kernel: Code: Bad RIP value.
Apr 19 18:23:31 kernel: RSP: 002b:000000c000653288 EFLAGS: 00010202
Apr 19 18:23:31 kernel: RAX: 000000c0004991e0 RBX: 000000c000499380 RCX: 000000c000499380
Apr 19 18:23:31 kernel: RDX: 000000c000653380 RSI: 0000000000000000 RDI: 0000000000000101
Apr 19 18:23:31 kernel: RBP: 000000c0006532f8 R08: 00000000009bef80 R09: 0000024360fc5245
Apr 19 18:23:31 kernel: R10: 000000000098e160 R11: 000000000000c744 R12: 000000000043e1a0
Apr 19 18:23:31 kernel: R13: 0000000000000004 R14: 000000c000499380 R15: 000141ff4f158308
Apr 19 18:23:31 kernel: Mem-Info:
Apr 19 18:23:31 kernel: active_anon:4748116 inactive_anon:371183 isolated_anon:0
active_file:144 inactive_file:77 isolated_file:0
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:11569 slab_unreclaimable:23773
mapped:341 shmem:209 pagetables:17282 bounce:0
free:38281 free_pcp:2151 free_cma:0
Apr 19 18:23:31 kernel: Node 0 active_anon:18992464kB inactive_anon:1484732kB active_file:576kB inactive_file:308kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1364kB dirty:0kB writeback:0kB shmem:836kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 5292032kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
Apr 19 18:23:31 kernel: Node 0 DMA free:15908kB min:48kB low:60kB high:72kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Apr 19 18:23:31 kernel: lowmem_reserve[]: 0 2915 20441 20441 20441
Apr 19 18:23:31 kernel: Node 0 DMA32 free:79344kB min:9628kB low:12612kB high:15596kB active_anon:2756648kB inactive_anon:187876kB active_file:556kB inactive_file:224kB unevictable:0kB writepending:0kB present:3129252kB managed:3041788kB mlocked:0kB kernel_stack:80kB pagetables:6544kB bounce:0kB free_pcp:3008kB local_pcp:156kB free_cma:0kB
Apr 19 18:23:31 kernel: lowmem_reserve[]: 0 0 17525 17525 17525
Apr 19 18:23:31 kernel: Node 0 Normal free:57872kB min:57900kB low:75844kB high:93788kB active_anon:16235816kB inactive_anon:1296856kB active_file:1072kB inactive_file:0kB unevictable:0kB writepending:0kB present:18284544kB managed:17946372kB mlocked:0kB kernel_stack:4512kB pagetables:62584kB bounce:0kB free_pcp:5696kB local_pcp:168kB free_cma:0kB
Apr 19 18:23:31 kernel: lowmem_reserve[]: 0 0 0 0 0
Apr 19 18:23:31 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Apr 19 18:23:31 kernel: Node 0 DMA32: 330*4kB (UE) 635*8kB (UME) 496*16kB (UME) 278*32kB (UME) 79*64kB (UME) 18*128kB (UM) 19*256kB (UE) 51*512kB (UM) 17*1024kB (UME) 0*2048kB 0*4096kB = 78976kB
Apr 19 18:23:31 kernel: Node 0 Normal: 272*4kB (UME) 249*8kB (ME) 259*16kB (UME) 331*32kB (UME) 98*64kB (UME) 39*128kB (UME) 15*256kB (UME) 4*512kB (UM) 22*1024kB (ME) 0*2048kB 0*4096kB = 57496kB
Apr 19 18:23:31 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Apr 19 18:23:31 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 19 18:23:31 kernel: 96385 total pagecache pages
Apr 19 18:23:31 kernel: 96037 pages in swap cache
Apr 19 18:23:31 kernel: Swap cache stats: add 8936640, delete 8844227, find 4194314/4735878
Apr 19 18:23:31 kernel: Free swap = 0kB
Apr 19 18:23:31 kernel: Total swap = 10485756kB
Apr 19 18:23:31 kernel: 5357447 pages RAM
Apr 19 18:23:31 kernel: 0 pages HighMem/MovableOnly
Apr 19 18:23:31 kernel: 106430 pages reserved
Apr 19 18:23:31 kernel: 0 pages hwpoisoned
Apr 19 18:23:31 kernel: Tasks state (memory values in pages):
Apr 19 18:23:31 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Apr 19 18:23:31 kernel: [ 875] 0 875 10059 222 118784 85 0 systemd-journal
Apr 19 18:23:31 kernel: [ 903] 0 903 12055 2 122880 459 -1000 systemd-udevd
Apr 19 18:23:31 kernel: [ 949] 0 949 13886 10 131072 101 -1000 auditd
Apr 19 18:23:31 kernel: [ 1052] 81 1052 16580 5 155648 143 -900 dbus-daemon
Apr 19 18:23:31 kernel: [ 1056] 0 1056 4162 666 86016 189 0 collectd_export
Apr 19 18:23:31 kernel: [ 1058] 0 1058 5389 24 86016 36 0 irqbalance
Apr 19 18:23:31 kernel: [ 1060] 0 1060 6599 14 94208 61 0 systemd-logind
Apr 19 18:23:31 kernel: [ 1061] 998 1061 153065 0 274432 1895 0 polkitd
Apr 19 18:23:31 kernel: [ 1078] 997 1078 29455 2 143360 112 0 chronyd
Apr 19 18:23:31 kernel: [ 1133] 0 1133 50361 10 184320 125 0 gssproxy
Apr 19 18:23:31 kernel: [ 1341] 0 1341 25729 0 233472 514 0 dhclient
Apr 19 18:23:31 kernel: [ 1437] 0 1437 25729 17 225280 506 0 dhclient
Apr 19 18:23:31 kernel: [ 1479] 0 1479 143579 102 434176 3216 0 tuned
Apr 19 18:23:31 kernel: [ 1484] 0 1484 301877 1671 323584 4005 -999 containerd
Apr 19 18:23:31 kernel: [ 1601] 0 1601 22429 16 204800 242 0 master
Apr 19 18:23:31 kernel: [ 1615] 89 1615 22472 1 217088 252 0 qmgr
Apr 19 18:23:31 kernel: [ 1642] 0 1642 318682 3177 483328 5117 -500 dockerd
Apr 19 18:23:31 kernel: [ 1651] 0 1651 254861 223 274432 603 0 collectd
Apr 19 18:23:31 kernel: [ 1661] 0 1661 31601 17 94208 141 0 crond
Apr 19 18:23:31 kernel: [ 1662] 0 1662 27555 2 65536 31 0 agetty
Apr 19 18:23:31 kernel: [ 1665] 0 1665 27555 3 61440 31 0 agetty
Apr 19 18:23:31 kernel: [ 2090] 0 2090 38685 2 344064 336 0 sshd
Apr 19 18:23:31 kernel: [ 2093] 1000 2093 38685 1 331776 339 0 sshd
Apr 19 18:23:31 kernel: [ 2180] 0 2180 27056 123 253952 373 0 syslog-ng
Apr 19 18:23:31 kernel: [ 2191] 0 2191 28234 25 253952 232 -1000 sshd
Apr 19 18:23:31 kernel: [ 4766] 995 4766 181775 1361 143360 1521 0 node_exporter
Apr 19 18:23:31 kernel: [ 16560] 89 16560 22455 13 208896 238 0 pickup
Apr 19 18:23:31 kernel: [ 21525] 0 21525 38685 2 339968 330 0 sshd
Apr 19 18:23:31 kernel: [ 21528] 1000 21528 38685 0 335872 357 0 sshd
Apr 19 18:23:31 kernel: [ 21580] 0 21580 178111 501 102400 1488 -998 containerd-shim
Apr 19 18:23:31 kernel: [ 21600] 0 21600 606 8 45056 12 0 tail
Apr 19 18:23:31 kernel: [ 21613] 0 21613 59821 2 311296 289 0 sudo
Apr 19 18:23:31 kernel: [ 21625] 0 21625 310307 1177 294912 3412 0 docker
Apr 19 18:23:31 kernel: [ 21644] 0 21644 620 0 40960 20 0 sh
Apr 19 18:23:31 kernel: [ 21650] 0 21650 8193649 5011664 62324736 2590712 0 scylla-bench
Apr 19 18:23:31 kernel: [ 24370] 0 24370 28781 118 262144 176 0 sshd
Apr 19 18:23:31 kernel: [ 24989] 0 24989 2991 105 69632 7 0 dhclient-script
Apr 19 18:23:31 kernel: [ 25009] 0 25009 27674 122 237568 0 0 sshd
Apr 19 18:23:31 kernel: [ 25012] 0 25012 27672 120 241664 1 0 sshd
Apr 19 18:23:31 kernel: [ 25019] 0 25019 27672 119 237568 1 0 sshd
Apr 19 18:23:31 kernel: [ 25025] 0 25025 27672 119 241664 1 0 sshd
Apr 19 18:23:31 kernel: [ 25031] 0 25031 27672 120 241664 1 0 sshd
Apr 19 18:23:31 kernel: [ 25038] 0 25038 27143 112 229376 3 0 sshd
Apr 19 18:23:31 kernel: [ 25043] 0 25043 26090 114 221184 0 0 sshd
Apr 19 18:23:31 kernel: [ 25048] 0 25048 25045 107 217088 0 0 sshd
Apr 19 18:23:31 kernel: [ 25051] 0 25051 19766 87 184320 0 0 sshd
Apr 19 18:23:31 kernel: [ 25057] 0 25057 15314 67 147456 0 0 sshd
Apr 19 18:23:31 kernel: [ 25063] 0 25063 6295 33 81920 0 0 sshd
Apr 19 18:23:31 kernel: Out of memory: Kill process 21650 (scylla-bench) score 967 or sacrifice child
Apr 19 18:23:31 kernel: Killed process 21650 (scylla-bench) total-vm:32774596kB, anon-rss:20046656kB, file-rss:0kB, shmem-rss:0kB
```
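To put the task-table numbers in human-readable units: the columns are counted in 4 KiB pages (confirmed by the kernel's own summary line, since 8193649 × 4 = 32774596 kB). A small conversion sketch:

```go
package main

import "fmt"

// Convert the scylla-bench row of the oom-killer task table from 4 KiB
// pages to GiB; the results match the kernel summary line above
// (total-vm:32774596kB, anon-rss:20046656kB).
func main() {
	toGiB := func(pages int64) float64 { return float64(pages) * 4 / (1 << 20) }
	fmt.Printf("total_vm: %5.1f GiB\n", toGiB(8193649)) // ~31.3 GiB virtual
	fmt.Printf("rss:      %5.1f GiB\n", toGiB(5011664)) // ~19.1 GiB resident
	fmt.Printf("swapents: %5.1f GiB\n", toGiB(2590712)) // ~9.9 GiB swapped out
}
```

So at kill time scylla-bench held roughly 19 GiB resident plus 10 GiB in swap on a c5n.2xlarge with 21 GiB of RAM, about an order of magnitude above the ~2.7 GiB payload estimate above.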
And here is what the loader-3 log looked like:
```
Configuration
Mode: read
Workload: uniform
Timeout: 3m0s
Max error number at row: 1000
Max error number: unlimited
Retries:
number: 0
min interval: 80ms
max interval: 1s
handler: sb
Consistency level: quorum
Partition count: 60
Clustering rows: 10000000
Clustering row size: Fixed(2048)
Rows per request: 2000
Provide upper bound: false
IN queries: false
Order by: none
No lower bound: false
Page size: 1000
Concurrency: 700
Connections: 500
Maximum rate: 64000 op/s
Client compression: true
Hdr memory consumption: 344254400 bytes
2023/04/19 17:44:56 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[43 3491014] consistency=QUORUM] || ERROR: 0 attempts applied: Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
2023/04/19 17:44:56 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[39 3977063] consistency=QUORUM] || ERROR: 0 attempts applied: Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
2023/04/19 17:44:56 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[15 7113516] consistency=QUORUM] || ERROR: 0 attempts applied: Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
2023/04/19 17:44:56 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[55 3968802] consistency=QUORUM] || ERROR: 0 attempts applied: Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
...
<lots of similar timeout messages>
6s 1264 2528000 328 4.8s 4.7s 4.7s 4.4s 4.2s 370ms 1.1s
...
<lots of similar timeout messages>
...
2m0.1s 565 1131503 138 9.2s 8.7s 8.5s 8.3s 6s 18ms 1.3s
2023/04/19 17:44:57 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[58 5460126] consistency=QUORUM] || ERROR: 0 attempts applied: Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
...
<lots of similar timeout messages>
...
2023/04/19 17:52:43 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[35 5246275] consistency=QUORUM] || ERROR: 0 attempts applied: EOF
2023/04/19 17:52:44 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[35 1570023] consistency=QUORUM] || ERROR: 0 attempts applied: EOF
2023/04/19 17:52:44 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[40 1738314] consistency=QUORUM] || ERROR: 0 attempts applied: Operation failed for scylla_bench.test - received 0 responses and 1 failures from 2 CL=QUORUM.
2023/04/19 17:52:44 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[54 4939889] consistency=QUORUM] || ERROR: 0 attempts applied: gocql: connection closed waiting for response
2023/04/19 17:52:44 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[9 440632] consistency=QUORUM] || ERROR: 0 attempts applied: gocql: connection closed waiting for response
...
2023/04/19 17:54:44 [query statement="SELECT pk, ck, v FROM scylla_bench.test WHERE pk = ? AND ck >= ? LIMIT 2000 " values=[19 4463825] consistency=QUORUM] || ERROR: 0 attempts applied: Operation failed for scylla_bench.test - received 0 responses and 1 failures from 2 CL=QUORUM.
2023/04/19 17:56:26 error: failed to connect to "[HostInfo hostname=\"10.4.2.156\" connectAddress=\"10.4.2.156\" peer=\"10.4.2.156\" rpc_address=\"10.4.2.156\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.4.2.156\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"eu-west\" rack=\"1a\" host_id=\"71ffdcba-9a8e-4cfd-8d68-ec36e4250afa\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response to connection startup within timeout
2023/04/19 17:56:59 error: failed to connect to "[HostInfo hostname=\"10.4.2.156\" connectAddress=\"10.4.2.156\" peer=\"10.4.2.156\" rpc_address=\"10.4.2.156\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.4.2.156\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"eu-west\" rack=\"1a\" host_id=\"71ffdcba-9a8e-4cfd-8d68-ec36e4250afa\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response to connection startup within timeout
Killed
```
Note that losing the connection to the 10.4.2.156 node is expected, since it is the result of the soft_reboot_node nemesis:
```
2361909 < t:2023-04-19 17:45:48,389 f:nemesis.py l:1742 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method soft_reboot_node
2361910 < t:2023-04-19 17:45:48,390 f:nemesis.py l:371 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Current Target: Node longevity-large-partitions-4d-vp-br-db-node-4743601d-4 [3.252.162.48 | 10.4.2.156] (seed: False) with running nemesis: None
```
Nemesis list:

| Nemesis Name | Target node | Status | Start | End |
|---|---|---|---|---|
| disrupt_remove_node_then_add_node | longevity-large-partitions-4d-vp-br-db-node-4743601d-2 | Succeeded | 2023-04-19 19:00:19 | 2023-04-19 22:05:08 |
| disrupt_resetlocalschema | longevity-large-partitions-4d-vp-br-db-node-4743601d-4 | Succeeded | 2023-04-19 18:28:37 | 2023-04-19 18:29:40 |
| disrupt_nodetool_seed_decommission | longevity-large-partitions-4d-vp-br-db-node-4743601d-2 | Skipped | 2023-04-19 18:28:18 | 2023-04-19 18:28:18 |
| disrupt_network_random_interruptions | longevity-large-partitions-4d-vp-br-db-node-4743601d-4 | Skipped | 2023-04-19 18:27:57 | 2023-04-19 18:27:58 |
| disrupt_soft_reboot_node | longevity-large-partitions-4d-vp-br-db-node-4743601d-4 | Succeeded | 2023-04-19 17:52:39 | 2023-04-19 17:57:16 |
So we probably have some kind of memory leak in scylla-bench here.
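If this reproduces, one way to narrow it down would be to pull heap profiles from the loader while it runs. A minimal sketch, assuming we patch scylla-bench's main() for the investigation (as far as I can tell, the tool does not expose pprof endpoints out of the box):

```go
// Hypothetical patch to scylla-bench's main(), for illustration only:
// expose Go's pprof endpoints so heap profiles can be pulled from a
// running loader, e.g.
//   go tool pprof http://<loader>:6060/debug/pprof/heap
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the actual benchmark loop
}
```

Comparing two heap profiles taken an hour apart should show whether allocations accumulate in the client (e.g. gocql buffers or result slices) rather than staying flat.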
Logs:
- CI Job: https://jenkins.scylladb.com/job/enterprise-2023.1/job/Reproducers/job/longevity-large-partition-4days-test/3
- db-cluster: https://cloudius-jenkins-test.s3.amazonaws.com/4743601d-4f90-4c6a-bcf3-0ec5e81b9ed7/20230419_233242/db-cluster-4743601d.tar.gz
- sct-runner-events: https://cloudius-jenkins-test.s3.amazonaws.com/4743601d-4f90-4c6a-bcf3-0ec5e81b9ed7/20230419_233242/sct-runner-events-4743601d.tar.gz
- sct-runner: https://cloudius-jenkins-test.s3.amazonaws.com/4743601d-4f90-4c6a-bcf3-0ec5e81b9ed7/20230419_233242/sct-4743601d.log.tar.gz
- loaders: https://cloudius-jenkins-test.s3.amazonaws.com/4743601d-4f90-4c6a-bcf3-0ec5e81b9ed7/20230419_233242/loader-set-4743601d.tar.gz