
Add possibility to stop execution after specific number of errors #36


Open

piodul wants to merge 10 commits into master from global-error-limit-flag

Conversation

@piodul (Collaborator) commented Nov 18, 2022

This PR adds support for the -error-limit flag from the original scylla-bench. The flag makes it possible to limit the total number of errors that can occur during a run before the run is stopped.

@piodul piodul requested a review from cvybhu November 18, 2022 11:16
src/run.rs Outdated
Comment on lines 326 to 327
max_retries_per_op: 0,
max_retries_in_total: usize::MAX,
Contributor:

By default retries_in_total is unlimited, but retries_per_op is set to 0. This is a bit inconsistent.

Collaborator Author:

I didn't use max_retries_in_total here because I didn't want to change the existing tests that used max_retries_per_op. However, changing them isn't that bad after all, so ultimately I changed max_retries_in_total here to 0 and adjusted the affected tests.

@piodul piodul force-pushed the global-error-limit-flag branch from 26aa0db to 12c63f0 Compare November 24, 2022 12:00
@piodul (Collaborator, Author) commented Nov 24, 2022

v2:

  • Fixed review comments
  • The s-b frontend now prints all errors returned from tasks instead of only the first one
  • Slightly changed the run task's behavior in case the run stops in the middle of the retry loop

I rebased the PR on top of #37 for the CI to pass, so please review that one first. As for the commit that changes the behavior on error (the last one), I found the issue when improving the error printing logic and included the fix here instead of sending it separately, in order to reduce the chance of merge conflicts.

@piodul piodul requested a review from cvybhu December 1, 2022 12:03
Comment on lines +287 to +289
for err in errors {
    // The {:#} syntax makes sure that the error is printed in one line
    writeln!(out, " {:#}", err)?;
Contributor:

Is formatting using {:#} documented somewhere? I took a look at https://doc.rust-lang.org/std/fmt/index.html, but it only mentions # with some specifier after it, never just # alone.

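For reference: `#` on its own selects std::fmt's "alternate" form, and what it does is up to each type's Display implementation; anyhow::Error uses it to print the whole context chain on a single line. A minimal sketch of that behavior (not code from this PR):

    use anyhow::{anyhow, Context};

    fn main() {
        // Build an error with one level of added context.
        let err = anyhow!("connection refused").context("failed to contact node");

        // `{}` prints only the outermost context:
        //   failed to contact node
        println!("{}", err);

        // `{:#}` prints the full chain on one line:
        //   failed to contact node: connection refused
        println!("{:#}", err);
    }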

@@ -28,7 +28,7 @@ pub fn parse_int(mut s: &str, bit_size: u32) -> Result<i64> {

 // Detect the sign
 let negative = s.starts_with('-');
-s = s.strip_prefix(&['-', '+']).unwrap_or(s);
+s = s.strip_prefix(['-', '+']).unwrap_or(s);
Contributor:

#37 has been merged, so let's rebase.

Collaborator Author:

Done, but clippy started to complain again. I will rebase again after #38 is merged, as it contains the necessary fixes.

stats: &Stats,
errors: &[anyhow::Error],
out: &mut impl Write,
) -> Result<()> {
Contributor:

Maybe it would be good to limit the maximum number of errors that are printed.
We don't want to output millions of lines on a big benchmark.

Collaborator Author:

There will be at most as many errors as the concurrency parameter. So, in the worst case, it will be thousands.

Besides, this is what scylla-bench does, and I'm striving for compatibility - maybe not with regard to exact error strings, but at least formatting...

src/run.rs Outdated
Comment on lines 278 to 293
 while let Some(worker_result) = worker_handles.next().await {
     if let Err(err) = worker_result {
-        result = Err(err);
-        ctx.ask_to_stop();
+        errors.push(err);
     }
 }
Contributor:

Why was the ask_to_stop removed?

If one worker fails, it will stop, but the other workers will continue to send queries indefinitely.

Collaborator Author:

Right, it doesn't make sense. Good catch. ask_to_stop is now only called when the global number of retries is exhausted, but not when the local number of retries is exhausted. I'll fix it and add a test.

Collaborator Author:

Done. I also added a regression test which checks that an error from one task stops the others.

src/run.rs Outdated
Comment on lines 153 to 158
Err(err.context(format!(
    "Maximum number of retries per operation exceeded ({})",
    self.context.max_retries_per_op,
)))
}
Contributor:

The argument that the user passed talks about the maximum number of errors, so maybe it would be better to mention errors in the message as well.

Collaborator Author:

I didn't want to do it because the terminology used in the context talks about retries, not errors. However, error counts might make more sense to users, so I'll use that.

@piodul piodul force-pushed the global-error-limit-flag branch 2 times, most recently from 4c2a1db to 912b419 Compare May 27, 2023 11:20
@piodul (Collaborator, Author) commented May 27, 2023

v3:

  • Rebased (will need to do it once again after "Update driver version from 0.5 to 0.8.1" #38)
  • Fixed a bug that prevented the benchmark from stopping when one task fails with an error, added a test for it
  • Added a bit more explanation in the comment for decrement_global_retry_counter

@piodul (Collaborator, Author) commented May 27, 2023

Sorry for the long delay, I completely forgot about this PR...

@piodul piodul requested a review from cvybhu May 27, 2023 11:24
@piodul (Collaborator, Author) commented May 29, 2023

It looks like I slightly misunderstood the meaning of the original max-errors-at-row flag. In the original scylla-bench, it does not cause the operation to be retried for the same piece of data; instead, the data is skipped. I'll update the PR to match those semantics.

There are more flags in the original scylla-bench that configure proper retries, but they will be implemented in a separate PR.

@piodul piodul force-pushed the global-error-limit-flag branch from 912b419 to 7f917d2 Compare May 30, 2023 14:21
piodul added 10 commits May 31, 2023 10:10
The value for this configuration option is parsed as u64 by the s-b
frontend, so changing this field's type from usize to u64 makes it
possible to get rid of some casts.

The `max_retries_per_op` field was introduced together with s-b's
`error-at-row-limit` option; however, the semantics of that option were
misunderstood. The option defines the maximum number of errors that
can happen for a single operation _in a row_, not the number of
retries. This commit changes the semantics and the name of that option
to `max_consecutive_errors_per_op`, which better aligns with the
original implementation. The `test_retrying` function is adjusted
accordingly.

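A minimal sketch of the "errors in a row" counting this commit switches to; the type and names below are made up for illustration, the real logic lives in the run module:

    // Hypothetical illustration: consecutive errors reset on success.
    struct ErrorStreak {
        consecutive_errors: u64,
        max_consecutive_errors_per_op: u64,
    }

    impl ErrorStreak {
        /// Returns false once the in-a-row limit is exceeded.
        fn record(&mut self, succeeded: bool) -> bool {
            if succeeded {
                self.consecutive_errors = 0; // a success breaks the streak
                true
            } else {
                self.consecutive_errors += 1;
                // Within the limit: skip the data and move on; beyond it: give up.
                self.consecutive_errors <= self.max_consecutive_errors_per_op
            }
        }
    }

    fn main() {
        let mut s = ErrorStreak { consecutive_errors: 0, max_consecutive_errors_per_op: 2 };
        assert!(s.record(false)); // 1 error in a row
        assert!(s.record(true));  // success resets the counter
        assert!(s.record(false)); // 1 error in a row again, still fine
    }
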
Implements a parameter which allows limiting the total number of errors
that will be tolerated before the run is stopped with a failure. This
parameter will be used to implement a CLI flag with similar semantics
in the scylla-bench frontend (`error-limit`).

Implements the "error-limit" flag which was recently added in
scylladb/scylla-bench#110. The flag allows to limit the total number
of errors allowed before a run is stopped. Support for that flag
is implemented via the `max_errors_in_total` parameter added
in the previous commit.
Now, the values of the "error-limit" and "error-at-row-limit" flags are
printed along with the other parameters when the scylla-bench frontend
starts.

Now the error message returned from a single task contains information
about the maximum number of allowed errors being exceeded.

Previously, when the run was signalled to stop, a task that failed an
operation and noticed that it should stop returned an error. Now it
doesn't; it only returns an error if either all global retries are
exhausted or an error occurred more than `max_retries_per_op` times.

The new semantics make more sense. For example, if the test is stopped
manually while 10 consecutive retries are allowed, a task that has
performed 5 retries and notices that it should stop shouldn't return
an error, because the operation might have succeeded had it tried one
more time.

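A sketch of the decision described above; the shape and names are hypothetical, but it captures the rule that a stop signal observed mid-retry should not be reported as a failure:

    // Hypothetical reduction of the retry-loop exit rule.
    fn after_failed_try(asked_to_stop: bool, retries_left: u64) -> Result<(), &'static str> {
        if asked_to_stop {
            // The run is ending for an unrelated reason; the operation
            // might have succeeded on a later retry, so report no error.
            return Ok(());
        }
        if retries_left == 0 {
            return Err("maximum number of errors per operation exceeded");
        }
        Ok(()) // keep retrying
    }

    fn main() {
        // Stopped mid-retry with budget remaining: not an error.
        assert!(after_failed_try(true, 5).is_ok());
        // Budget exhausted without a stop signal: an error.
        assert!(after_failed_try(false, 0).is_err());
    }
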
Adds a regression test which verifies that if an operation is stopped
due to an error, then other operations will be stopped as well. This
issue was found during the review of this pull request; it was not
present in the master version this PR was based on.
@piodul piodul force-pushed the global-error-limit-flag branch from 7f917d2 to d39e3f9 Compare May 31, 2023 08:10
@piodul (Collaborator, Author) commented May 31, 2023

v4:

  • Aligned the semantics of the error-at-row-limit flag with what happens in scylla-bench
  • Rebased on top of master to get rid of clippy complaints.

@piodul (Collaborator, Author) commented Jun 6, 2023

@cvybhu review ping

@cvybhu (Contributor) left a comment:

Reviewed, sorry it took so long.

Comment on lines 330 to +333
 // Zero means unlimited tries,
 // and #tries == #retries + 1,
 // therefore just subtract with wraparound and treat u64::MAX as infinity
-let max_retries_per_op = max_errors_at_row.get().wrapping_sub(1);
+let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);
Contributor:

Is this sub(1) correct?

I think that the maximum number of consecutive errors should be equal to the value of error-at-row-limit, not error-at-row-limit - 1.

Another comment says:

/// The maximum number of consecutive errors allowed before giving up.
pub max_consecutive_errors_per_op: u64,

So AFAIU, the definition of max_consecutive_errors_per_op is the same as that of error-at-row-limit, not one less.

Contributor:

I don't understand what the comment about retries is trying to say. Previously it made some sense, as the variable used to specify the number of retries, but now it would be good to update it so that it explains the new variable.

Contributor:

The test test_retrying() sets max_consecutive_errors_per_op = 0 and then runs an operation which fails the first time it's executed. It's expected that the run will fail.

With sub(1) this would correspond to error-at-row-limit = 1. That would mean that error-at-row-limit = 1 doesn't allow any errors, which sounds wrong.
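A tiny worked reduction of the off-by-one being pointed out here (the guard shape follows the `>=` comparison quoted later in the review):

    fn main() {
        // The frontend maps error-at-row-limit = 1 through wrapping_sub(1):
        let max_consecutive_errors_per_op: u64 = 1u64.wrapping_sub(1); // == 0

        // On the very first error, the number of errors seen so far is 0,
        // and the `consecutive_errors >= max_...` guard already fires:
        let consecutive_errors: u64 = 0;
        assert!(consecutive_errors >= max_consecutive_errors_per_op);

        // So error-at-row-limit = 1 would tolerate no errors at all,
        // which is the inconsistency the reviewers describe.
    }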

@@ -174,9 +174,9 @@ where
let max_errors_at_row = flag.u64_var(
Contributor:

This name is a bit confusing; I think it could just be max_consecutive_errors_per_op, no?

-    let next_op_id = self.context.issue_operation_id()?;
-    self.op_id = next_op_id;
-}
+self.op_id = self.context.issue_operation_id()?;
Contributor:

Is it ok to increase op_id on failure?

AFAIU, previously the op_id was a unique identifier for each operation, so if I wanted to insert values 0..100 I could just insert op_id. Then, in the case where inserting 13 fails, I know that the executor will try to insert 13 again, not 14.
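A sketch of the invariant being described, with hypothetical names: a fresh op_id should be issued only when the previous operation actually finished, so a failed insert of 13 is retried as 13, not 14:

    // Hypothetical id source illustrating the retry invariant.
    struct IdSource {
        next: u64,
    }

    impl IdSource {
        fn issue(&mut self) -> u64 {
            let id = self.next;
            self.next += 1;
            id
        }
    }

    fn main() {
        let mut ids = IdSource { next: 13 };
        let op_id = ids.issue(); // the operation inserting value 13
        // If insert(op_id) fails, the executor should retry with the same
        // op_id it already holds, and only call ids.issue() after success.
        assert_eq!(op_id, 13);
        assert_eq!(ids.issue(), 14); // issued only for the *next* operation
    }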

let mut cfg = make_test_cfg(move || Op::new(shared_counter.clone()));
cfg.concurrency = error_count;
cfg.max_consecutive_errors_per_op = 1; // We need to allow the runner to retry individual failures
cfg.max_errors_in_total = retry_limit;
Contributor:

I think it would be good to rename retry_limit to max_errors_allowed; it would be easier to understand.

} else {
println!(
"Max error number at row: {}",
self.max_consecutive_errors_per_op as u128 + 1,
Contributor:

This +1 could be removed after removing sub(1)

Contributor:

Also, there's no \t in this particular print; is this intentional?

-Err(err)
+Err(err.context(format!(
+    "Maximum number of errors allowed per operation exceeded ({})",
+    self.context.max_consecutive_errors_per_op as u128 + 1,
Contributor:

And this +1 as well

@cvybhu (Contributor) commented Jun 6, 2023

My current understanding of these flags is:

error-at-row-limit - The maximum number of consecutive errors allowed. After exceeding it, the workflow will terminate with an error.

For example, with error-at-row-limit = 3:

This is ok (3 consecutive errors):

initial query - ERROR
retry 1 - ERROR
retry 2 - ERROR
retry 3 - OK

This is not ok (4 consecutive errors):

initial query - ERROR
retry 1 - ERROR
retry 2 - ERROR
retry 3 - ERROR
> The whole run fails <

error-limit - The number of errors after which the run should fail. Having error-limit - 1 errors is fine, but one more and the run fails.
For example, with error-limit = 3, max_errors_in_total = 2 (retry_countdown is initialized to 2):

This is ok:

initial query - ERROR (`retry_countdown.fetch_sub(1) = 2`)
retry 1 - ERROR (`retry_countdown.fetch_sub(1) = 1`)
retry 2 - OK

This is not ok:

initial query - ERROR (`retry_countdown.fetch_sub(1) = 2`)
retry 1 - ERROR (`retry_countdown.fetch_sub(1) = 1`)
retry 2 - ERROR (`retry_countdown.fetch_sub(1) = 0`)
> The whole run fails <
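A minimal sketch of the countdown mechanics described above, assuming a shared atomic counter named as in the comment (the real code may differ):

    use std::sync::atomic::{AtomicU64, Ordering};

    // error-limit = 3 => max_errors_in_total = 2 => countdown starts at 2.
    static RETRY_COUNTDOWN: AtomicU64 = AtomicU64::new(2);

    /// Called on every error; false means the error budget is exhausted.
    /// (This sketch ignores the wraparound after the budget hits zero.)
    fn on_error() -> bool {
        // fetch_sub returns the previous value: 2, then 1, then 0.
        RETRY_COUNTDOWN.fetch_sub(1, Ordering::Relaxed) > 0
    }

    fn main() {
        assert!(on_error());  // 1st error: previous value 2, still ok
        assert!(on_error());  // 2nd error: previous value 1, still ok
        assert!(!on_error()); // 3rd error: previous value 0, run fails
    }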

@muzarski (Collaborator) left a comment:

There is also the matter of rebasing on top of master and adjusting the cassandra-stress frontend to these changes.

After analyzing the code of cassandra-stress:

  • the max_consecutive_errors_per_op config option should be set to 9 by default (as it is now). It's defined via the -errors retries parameter (to be introduced soon). This means that we should try to perform each operation 1 + retries times.
  • there is no corresponding option for max_errors_in_total in c-s. It can be set to u64::MAX.

     }
 }

 // Should be called before starting an operation.
 pub async fn start_operation(&mut self) -> Option<OperationContext> {
-    if self.trial_idx == 0 {
Collaborator:

Why is this removed? The op_id is now incremented even when the current operation failed and should be retried (using the same operation_id).

@@ -330,7 +330,7 @@ where
 // Zero means unlimited tries,
 // and #tries == #retries + 1,
 // therefore just subtract with wraparound and treat u64::MAX as infinity
-let max_retries_per_op = max_errors_at_row.get().wrapping_sub(1);
+let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);
Collaborator:

I agree with the above comments. From what I understood, error-at-row-limit defines the number of errors that can occur before failing. In other words, it defines the number of retries: in the worst-case scenario we should try to perform the operation error-at-row-limit + 1 times (the first try plus error-at-row-limit retries).

I think we should either change it to Option<NonZeroU64> (set to None when error-at-row-limit is 0) and remove the wrapping_sub, or change the >= to > in WorkerSession::end_operation here:

Err(err) if self.consecutive_errors >= self.context.max_consecutive_errors_per_op => {
    Err(err)
}
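A sketch of the first alternative (Option<NonZeroU64> with the wrapping_sub removed); the surrounding names are hypothetical:

    use std::num::NonZeroU64;

    // None encodes error-at-row-limit = 0, i.e. unlimited tries.
    fn parse_limit(error_at_row_limit: u64) -> Option<NonZeroU64> {
        NonZeroU64::new(error_at_row_limit)
    }

    // `prior_errors` is the number of consecutive errors seen *before* the
    // one currently being handled, mirroring the `>=` guard quoted above.
    fn should_give_up(prior_errors: u64, limit: Option<NonZeroU64>) -> bool {
        match limit {
            None => false, // unlimited
            Some(l) => prior_errors >= l.get(),
        }
    }

    fn main() {
        let limit = parse_limit(1); // error-at-row-limit = 1
        assert!(!should_give_up(0, limit)); // 1st error: retry is allowed
        assert!(should_give_up(1, limit));  // 2nd consecutive error: give up
    }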

Collaborator:

It would also be nice to add tests for edge cases such as error-at-row-limit = 1. In this case, the following scenario would be OK:

1st try: ERROR
2nd try: SUCCESS

but this should fail:

1st try: ERROR
2nd try: ERROR
... there is no 3rd try; the benchmark terminates

As of now, I believe that the first scenario wouldn't pass, since we use wrapping_sub(1) together with the >= comparison.

@@ -332,6 +339,9 @@ where
 // therefore just subtract with wraparound and treat u64::MAX as infinity
 let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);

+// Similar to above
+let max_errors_in_total = max_errors.get().wrapping_sub(1);
Collaborator:

Just to make sure I understand correctly - this option (unlike error-at-row-limit) defines the number of errors after which the whole run fails. In other words, when error-limit = 1, we immediately fail the run after the first error occurs. Am I right?

If so, then this wrapping_sub(1) looks semantically correct to me (looking at the WorkerContext::decrement_global_retry_counter).

@roydahan roydahan requested a review from vponomaryov April 3, 2024 20:14
@roydahan (Collaborator) commented Apr 3, 2024

@muzarski (cc @fruch) this is one of the items that must be completed before we can switch SCT cases to use cql-stress.

@roydahan (Collaborator) commented Apr 3, 2024

Actually, I see it's specific to s-b, which is the second phase.
