Flaky restore behavior in case of ENOSPC on one of the nodes #3907

I ran the following test:
1. Create a backup task (keyspace1) and wait for DONE status;
2. Reach ENOSPC on one of the nodes;
3. Create a restore task and wait for its final status.
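
The three steps above can be sketched as a small Python driver. This is only an illustration, not the actual test code: `create_backup`, `reach_enospc` and `create_restore` are hypothetical hooks supplied by the caller, and the set of final statuses contains only the two named in this report (a real Manager may expose more).

```python
import time

# Statuses named in this report; a real Manager may expose more final states.
FINAL_STATUSES = frozenset({"DONE", "ERROR"})


def wait_for_final_status(get_status, poll_interval=5.0, max_polls=120, sleep=time.sleep):
    """Poll get_status() until it returns a final status, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in FINAL_STATUSES:
            return status
        sleep(poll_interval)
    raise TimeoutError(f"no final status after {max_polls} polls")


def run_scenario(create_backup, reach_enospc, create_restore, **wait_kwargs):
    """Run the three steps; each create_* hook returns a status getter."""
    # Step 1: back up and require DONE before going further.
    backup_status = wait_for_final_status(create_backup(), **wait_kwargs)
    if backup_status != "DONE":
        raise RuntimeError(f"backup ended with {backup_status}")
    # Step 2: fill the disk on one node (hypothetical hook).
    reach_enospc()
    # Step 3: restore and report whatever final status it reaches.
    return wait_for_final_status(create_restore(), **wait_kwargs)
```

Returning the restore status instead of asserting on it keeps the driver usable for both expectations discussed in this issue (ERROR before 6.0, success on 6.0).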

The results of this test are **_unstable_**. In total, there were **_three attempts_**:

**Attempt 1**
Restore task status **_PASS_**.
Environment: Scylla 6.0 (tablets enabled), Manager 3.3.0-dev
CI - https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/manager-master/job/test_enospc_before_restore/5/

**Attempt 2**
Restore task status **_PASS_**.
Environment: Scylla 6.0 (tablets disabled), Manager 3.3.0-dev
CI - https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/manager-master/job/test_enospc_before_restore/6/

**Attempt 3**
Restore task status **_ERROR_**.
Environment: Scylla 6.0 (tablets enabled), Manager 3.3.0-dev
CI - https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/manager-master/job/test_enospc_before_restore/8/
```
Command "sudo sctool  -c 3970cebd-91ae-4b54-8209-60361b254972 progress restore/f58bc267-cc7b-4f52-a7e9-7d46363f9a96" finished with status 0

Restore progress
Run:		b3d0dda5-339e-11ef-b94d-026cea77bb6b
Status:		ERROR (restoring backed-up data)
Cause:		not restored bundles [3ghb_0ph8_0vi9c21pobsvuli4m0 3ghb_0ph8_12sls2wklal4ocf8ig]: create run progress: validate free disk space: not enough disk space
Start time:	26 Jun 24 09:30:18 UTC
End time:	26 Jun 24 09:36:50 UTC
Duration:	6m31s
Progress:	0% | 0%
Snapshot Tag:	sm_20240626092405UTC

╭───────────┬──────────┬────────┬─────────┬────────────┬────────╮
│ Keyspace  │ Progress │   Size │ Success │ Downloaded │ Failed │
├───────────┼──────────┼────────┼─────────┼────────────┼────────┤
│ keyspace1 │  0% | 0% │ 2.882G │       0 │          0 │      0 │
╰───────────┴──────────┴────────┴─────────┴────────────┴────────╯
```
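
To tell a disk-space failure apart from other restore errors in automation, the header of the progress output above could be parsed roughly as follows. This is a minimal sketch that assumes the `Key: value` layout shown in this particular log; the function names are made up for illustration and the format is not guaranteed to be stable across Manager versions.

```python
import re

# Matches header lines such as "Status:\t\tERROR (restoring backed-up data)".
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z ]*):\s+(.*)$")


def parse_progress_header(output):
    """Collect the 'Key: value' lines of the progress output into a dict."""
    fields = {}
    for line in output.splitlines():
        match = HEADER_LINE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def is_disk_space_failure(fields):
    """True if the run ended in ERROR and the cause mentions disk space."""
    status = fields.get("Status", "").split(" ", 1)[0]
    return status == "ERROR" and "not enough disk space" in fields.get("Cause", "")
```

For the output of Attempt 3, `is_disk_space_failure(parse_progress_header(text))` would be true, because the `Status` line starts with `ERROR` and the `Cause` line ends with `not enough disk space`.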

**Some pre-history:**
Before testing with Scylla release 6.0, we had a test that checked that a restore performed under the described conditions ends with ERROR status.
After switching to Scylla 6.0, we observed that the restore passed every time we ran the test.
According to @karol-kokoszka, `Manager checks available disk space on the nodes that are expected to participate in the restore. If it's below the limit, then the node won't participate in the restore. It looks that only one node reported not enough space. So the restore was performed by other (working) ones. Maybe there was some change in 6.0 that load&stream handles the situation better than 2024.1.`
Thus, it was decided to rework the test to expect a successful restore when testing with Scylla 6.0, but during validation of the reworked test the restore failed again.
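
The node selection described in the quote above can be modelled like this. It is a sketch of the described behavior only, not Manager's actual implementation: the `NodeDisk` type, the function name and the 10% default threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeDisk:
    host: str
    free_bytes: int
    total_bytes: int


def split_nodes_by_free_space(nodes, min_free_percent=10.0):
    """Split nodes into (eligible, skipped) by free disk space percentage.

    min_free_percent is an assumed threshold, not a documented Manager default.
    """
    eligible, skipped = [], []
    for node in nodes:
        if node.total_bytes <= 0:
            # No usable capacity information: do not restore to this node.
            skipped.append(node)
            continue
        free_percent = 100.0 * node.free_bytes / node.total_bytes
        if free_percent >= min_free_percent:
            eligible.append(node)
        else:
            skipped.append(node)
    return eligible, skipped
```

Under this model, a single full node ends up in `skipped` while the remaining nodes carry the restore, which matches Attempts 1 and 2. It does not by itself explain why Attempt 3 failed with `not enough disk space`.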

