Skip to content

backup timeout when uploading to s3 (end up with failure Indexing snapshot files failed on host) #3195

Open
@fruch

Description

@fruch

Installation details

Kernel Version: 5.13.0-1031-aws
Scylla version (or git commit hash): 5.1.dev-20220706.a0ffbf3291b7 with build-id 3490fa9f14da510e97a1d0f53f693cac13a70494
Cluster size: 6 nodes (i3.4xlarge)
Manager Version: 3.0.0-0.20220523.5501e5d7f53

Scylla Nodes used in this run:

  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-9 (63.35.163.81 | 10.0.3.187) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-8 (52.31.41.157 | 10.0.0.95) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-7 (34.244.145.19 | 10.0.0.195) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-6 (34.241.120.110 | 10.0.0.111) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-5 (34.244.16.3 | 10.0.1.141) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-4 (34.253.116.115 | 10.0.0.12) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-3 (3.251.71.251 | 10.0.1.19) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-2 (52.16.243.180 | 10.0.2.34) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-10 (34.243.184.173 | 10.0.0.66) (shards: 14)
  • longevity-tls-50gb-3d-master-db-node-fb8b36a7-1 (63.33.208.165 | 10.0.1.209) (shards: 14)

OS / Image: ami-07d73e5ea1fc772eb (aws: eu-west-1)

Test: longevity-50gb-3days
Test id: fb8b36a7-c818-4b0b-8ae3-f3ee2cafa53a
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):

Issue description

backup was working o.k. for ~2.5 hours:

< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > sctool output: Run:               68480f6c-0074-11ed-8c00-0ada645d1fc3
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Status:           RUNNING (uploading data)
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Start time:       10 Jul 22 17:19:08 UTC
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Duration: 2h27m46s
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Progress: 99%/0%
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Snapshot Tag:     sm_20220710171912UTC
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > Datacenters:
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG >   - eu-west
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG >
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > ╭────────────┬──────────┬─────────┬─────────┬──────────────┬──────────╮
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ Host       │ Progress │    Size │ Success │ Deduplicated │   Failed │
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > ├────────────┼──────────┼─────────┼─────────┼──────────────┼──────────┤
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.0.0.111 │     100% │ 21.992G │ 21.992G │            0 │        0 │
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.0.0.12  │     100% │ 21.511G │ 21.511G │            0 │        0 │
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.0.0.195 │   99%/0% │ 22.922G │ 22.753G │            0 │ 172.830M │
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.0.0.95  │      99% │ 29.205G │ 29.205G │            0 │        0 │
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.0.1.141 │     100% │ 18.556G │ 18.556G │            0 │        0 │
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > │ 10.0.1.209 │     100% │ 22.173G │ 22.173G │            0 │        0 │
< t:2022-07-10 19:46:55,290 f:cli.py          l:1064 c:sdcm.mgmt.cli        p:DEBUG > ╰────────────┴──────────┴─────────┴─────────┴──────────────┴──────────╯

and then failed with the following:

< t:2022-07-10 19:47:05,215 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Run:            68480f6c-0074-11ed-8c00-0ada645d1fc3
< t:2022-07-10 19:47:05,215 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status:         ERROR (uploading data)
< t:2022-07-10 19:47:05,220 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Cause:          upload
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >  10.0.0.195: keyspace1.standard1: upload snapshot: copy "data:keyspace1/standard1-45183ad0002111edb97479c04c879a65/snapshots/sm_20220710171912UTC" to "s3:manager-backup-tests-us-east-1/backup/sst/cluster/f2972720-cb85-4ab3-b5fb-d72d4d195d38/dc/eu-west/node/a846b4ec-4eeb-43ad-b7
89-df7dabfa66a6/keyspace/keyspace1/table/standard1/45183ad0002111edb97479c04c879a65": job error (1657446887): s3 upload: 400 Bad Request: <?xml version="1.0" encoding="UTF-8"?>
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > <Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>D88KJC301DEEVFEQ</RequestId><HostId>HX3vytjgqgJDYE4AlPQpFrhOEs0WDpnSbYeK2m2gxdxNnUCeNQmcJCzvLq4wTA0NXcI0CAQy7XI=</HostId></Error>; standard1-45183ad0002111edb97479c04c879a65/snapshots/sm_20220710171912UTC/me-13418-big-Data.db s3 upload: 400 Bad Request: <?xml version="
1.0" encoding="UTF-8"?>
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > <Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>D88KJC301DEEVFEQ</RequestId><HostId>HX3vytjgqgJDYE4AlPQpFrhOEs0WDpnSbYeK2m2gxdxNnUCeNQmcJCzvLq4wTA0NXcI0CAQy7XI=</HostId></Error>; standard1-45183ad0002111edb97479c04c879a65/snapshots/sm_20220710171912UTC/me-13418-big-Data.db s3 upload: 400 Bad Request: <?xml version="1.0" encoding="UTF-8"?>
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > <Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>D88KJC301DEEVFEQ</RequestId><HostId>HX3vytjgqgJDYE4AlPQpFrhOEs0WDpnSbYeK2m2gxdxNnUCeNQmcJCzvLq4wTA0NXcI0CAQy7XI=</HostId></Error>
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Start time:     10 Jul 22 17:19:08 UTC
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > End time:       10 Jul 22 19:46:55 UTC
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Duration:       2h27m46s
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Progress:       99%/0%
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Snapshot Tag:   sm_20220710171912UTC
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Datacenters:
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   - eu-west
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Errors:
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   - standard1-45183ad0002111edb97479c04c879a65/snapshots/sm_20220710171912UTC/me-13418-big-Data.db s3 upload: 400 Bad Request: <?xml version="1.0" encoding="UTF-8"?>
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > <Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>D88KJC301DEEVFEQ</RequestId><HostId>HX3vytjgqgJDYE4AlPQpFrhOEs0WDpnSbYeK2m2gxdxNnUCeNQmcJCzvLq4wTA0NXcI0CAQy7XI=</HostId></Error>; standard1-45183ad0002111edb97479c04c879a65/snapshots/sm_20220710171912UTC/me-13418-big-Data.db s3 upload: 400 Bad Request: <?xml version="1.0" encoding="UTF-8"?>
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > <Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>D88KJC301DEEVFEQ</RequestId><HostId>HX3vytjgqgJDYE4AlPQpFrhOEs0WDpnSbYeK2m2gxdxNnUCeNQmcJCzvLq4wTA0NXcI0CAQy7XI=</HostId></Error>
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ╭────────────┬──────────┬─────────┬─────────┬──────────────┬──────────╮
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ Host       │ Progress │    Size │ Success │ Deduplicated │   Failed │
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ├────────────┼──────────┼─────────┼─────────┼──────────────┼──────────┤
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.0.0.111 │     100% │ 21.992G │ 21.992G │            0 │        0 │
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.0.0.12  │     100% │ 21.511G │ 21.511G │            0 │        0 │
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.0.0.195 │   99%/0% │ 22.922G │ 22.753G │            0 │ 172.830M │
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.0.0.95  │     100% │ 29.205G │ 29.205G │            0 │        0 │
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.0.1.141 │     100% │ 18.556G │ 18.556G │            0 │        0 │
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > │ 10.0.1.209 │     100% │ 22.173G │ 22.173G │            0 │        0 │
< t:2022-07-10 19:47:05,221 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > ╰────────────┴──────────┴─────────┴─────────┴──────────────┴──────────╯

the tasks retries:

< t:2022-07-10 19:57:00,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Run:            72d4a86c-008a-11ed-8c01-0ada645d1fc3
< t:2022-07-10 19:57:00,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status:         RUNNING (indexing files)
< t:2022-07-10 19:57:00,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Start time:     10 Jul 22 19:56:55 UTC
< t:2022-07-10 19:57:00,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Duration:       4s
< t:2022-07-10 19:57:00,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Progress:       -
< t:2022-07-10 19:57:00,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Snapshot Tag:   sm_20220710171912UTC
< t:2022-07-10 19:57:00,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Datacenters:
< t:2022-07-10 19:57:00,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   - eu-west
< t:2022-07-10 19:57:00,007 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >

and fails straight away:

< t:2022-07-10 19:57:10,339 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Run:            72d4a86c-008a-11ed-8c01-0ada645d1fc3
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Status:         ERROR (indexing files)
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Cause:          index
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >  10.0.1.209: could not find any files
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >  10.0.0.12: could not find any files
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >  10.0.0.111: could not find any files
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >  10.0.0.95: could not find any files
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >  10.0.1.141: could not find any files
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Start time:     10 Jul 22 19:56:55 UTC
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > End time:       10 Jul 22 19:57:00 UTC
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Duration:       5s
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Progress:       -
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Snapshot Tag:   sm_20220710171912UTC
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Datacenters:
< t:2022-07-10 19:57:10,340 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   - eu-west
3020695:< t:2022-07-10 19:57:00,932 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Jul 10 19:57:00 longevity-tls-50gb-3d-master-monitor-node-fb8b36a7-1 scylla-manager[166
48]: {"L":"ERROR","T":"2022-07-10T19:57:00.927Z","N":"backup.index","M":"Indexing snapshot files failed see exact errors above","duration":"1.146242062s","_trace_id":"3Ug8flv9SBS_vuto
hyT1YA","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/log
ger.go:84\ngithub.com/scylladb/scylla-manager/v3/pkg/service/backup.(*worker).Index.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/backup/worker_index.go:24\ngithub.com/sc
ylladb/scylla-manager/v3/pkg/service/backup.(*worker).Index\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/backup/worker_index.go:30\ngithub.com/scylladb/scylla-manager/v3/pkg/s
ervice/backup.(*Service).Backup.func5\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/backup/service.go:755\ngithub.com/scylladb/scylla-manager/v3/pkg/service/backup.(*Service).B
ackup.func11\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/backup/service.go:808\ngithub.com/scylladb/scylla-manager/v3/pkg/service/backup.(*Service).Backup\n\tgithub.com/scyll
adb/scylla-manager/v3/pkg/service/backup/service.go:812\ngithub.com/scylladb/scylla-manager/v3/pkg/service/backup.Runner.Run\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/backu
p/runner.go:26\ngithub.com/scylladb/scylla-manager/v3/pkg/service/scheduler.PolicyRunner.Run\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/scheduler/policy.go:32\ngithub.com/sc
ylladb/scylla-manager/v3/pkg/service/scheduler.(*Service).run\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/scheduler/service.go:432\ngithub.com/scylladb/scylla-manager/v3/pkg/
scheduler.(*Scheduler).asyncRun.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/scheduler/scheduler.go:412"}
3020696:< t:2022-07-10 19:57:00,944 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Jul 10 19:57:00 longevity-tls-50gb-3d-master-monitor-node-fb8b36a7-1 scylla-manager[166
48]: {"L":"INFO","T":"2022-07-10T19:57:00.943Z","N":"scheduler.f2972720","M":"Run ended with ERROR","task":"backup/0c14e13b-7f7e-4298-83a5-adf564aacc29","status":"ERROR","cause":"inde
x: 10.0.1.209: could not find any files; 10.0.0.12: could not find any files; 10.0.0.111: could not find any files; 10.0.0.95: could not find any files; 10.0.1.141: could not find any
 files","duration":"5.901123544s","_trace_id":"3Ug8flv9SBS_vutohyT1YA"}

Logs:

Jenkins job URL

Metadata

Metadata

Assignees

No one assigned

    Labels

    backupbugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions