Description
I run the following test:
- Create a backup task (keyspace1) and wait for DONE status;
- Trigger ENOSPC (no space left on device) on one of the nodes;
- Create a restore task and wait for its final status.
The results of this test are unstable. There were three attempts in total:
Attempt 1
Restore task status PASS.
Environment: Scylla 6.0 (tablets enabled), Manager 3.3.0-dev
CI - https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/manager-master/job/test_enospc_before_restore/5/
Attempt 2
Restore task status PASS.
Environment: Scylla 6.0 (tablets disabled), Manager 3.3.0-dev
CI - https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/manager-master/job/test_enospc_before_restore/6/
Attempt 3
Restore task status ERROR.
Environment: Scylla 6.0 (tablets enabled), Manager 3.3.0-dev
CI - https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/manager-master/job/test_enospc_before_restore/8/
Command "sudo sctool -c 3970cebd-91ae-4b54-8209-60361b254972 progress restore/f58bc267-cc7b-4f52-a7e9-7d46363f9a96" finished with status 0
Restore progress
Run: b3d0dda5-339e-11ef-b94d-026cea77bb6b
Status: ERROR (restoring backed-up data)
Cause: not restored bundles [3ghb_0ph8_0vi9c21pobsvuli4m0 3ghb_0ph8_12sls2wklal4ocf8ig]: create run progress: validate free disk space: not enough disk space
Start time: 26 Jun 24 09:30:18 UTC
End time: 26 Jun 24 09:36:50 UTC
Duration: 6m31s
Progress: 0% | 0%
Snapshot Tag: sm_20240626092405UTC
╭───────────┬──────────┬────────┬─────────┬────────────┬────────╮
│ Keyspace │ Progress │ Size │ Success │ Downloaded │ Failed │
├───────────┼──────────┼────────┼─────────┼────────────┼────────┤
│ keyspace1 │ 0% | 0% │ 2.882G │ 0 │ 0 │ 0 │
╰───────────┴──────────┴────────┴─────────┴────────────┴────────╯
Some background:
Before testing with Scylla release 6.0, we had a test verifying that a restore performed under the described conditions ends with ERROR status.
Once we started testing with Scylla 6.0, we observed that the restore passed every time we ran the test.
According to @karol-kokoszka, Manager checks the available disk space on the nodes that are expected to participate in the restore. If a node's free space is below the limit, that node won't participate in the restore. It looks like only one node reported not enough space, so the restore was performed by the other (working) nodes. Maybe some change in 6.0 makes load&stream handle this situation better than in 2024.1.
Thus, it was decided to rework the test to expect a successful restore when testing with Scylla 6.0. However, during validation of the reworked test, the restore failed again.
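To make the expectation in the explanation above easier to reason about, here is a minimal Python sketch of the described behavior: nodes below a free-space limit are excluded, and the restore proceeds on the remaining ones. This is not Manager's code. The `NodeDisk` type, the function names, and the 10% threshold are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NodeDisk:
    """Hypothetical per-node disk report (not a Manager type)."""
    host: str
    free_bytes: int
    total_bytes: int


def split_by_free_space(
    nodes: List[NodeDisk], min_free_percent: float = 10.0
) -> Tuple[List[str], List[str]]:
    """Split hosts into (eligible, skipped) by free disk space.

    min_free_percent is an assumed limit, not a value taken from Manager.
    """
    eligible: List[str] = []
    skipped: List[str] = []
    for node in nodes:
        if node.total_bytes <= 0:
            # No usable disk information: treat the node as not eligible.
            skipped.append(node.host)
            continue
        free_percent = 100.0 * node.free_bytes / node.total_bytes
        if free_percent >= min_free_percent:
            eligible.append(node.host)
        else:
            skipped.append(node.host)
    return eligible, skipped


def restore_can_proceed(
    nodes: List[NodeDisk], min_free_percent: float = 10.0
) -> bool:
    """Under the model described above, at least one eligible node is enough."""
    eligible, _ = split_by_free_space(nodes, min_free_percent)
    return len(eligible) > 0
```

Under this model, a cluster with a single node at ENOSPC would still be expected to restore successfully, which matches Attempts 1 and 2. Attempt 3 failed with "validate free disk space: not enough disk space" even though the same conditions applied, so either the model is incomplete or something else differed in that run.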