
Add possibility to stop execution after specific number of errors #36


Open

piodul wants to merge 10 commits into master from global-error-limit-flag

Conversation

@piodul (Collaborator) commented Nov 18, 2022

This PR adds support for the -error-limit flag from the original scylla-bench. The flag makes it possible to limit the total number of errors that can occur during a run before the run is stopped.

@piodul piodul requested a review from cvybhu November 18, 2022 11:16
src/run.rs Outdated
Comment on lines 326 to 327
max_retries_per_op: 0,
max_retries_in_total: usize::MAX,
Contributor:

By default retries_in_total is unlimited, but retries_per_op is set to 0. This is a bit inconsistent.

Collaborator Author:

I didn't use max_retries_in_total here because I didn't want to change the existing tests that used max_retries_per_op. However, changing them isn't that bad after all, so ultimately I changed max_retries_in_total here to 0 and adjusted the affected tests.

@piodul piodul force-pushed the global-error-limit-flag branch from 26aa0db to 12c63f0 Compare November 24, 2022 12:00
@piodul (Collaborator, Author) commented Nov 24, 2022

v2:

  • Fixed review comments
  • The s-b frontend now prints all errors returned from tasks instead of only the first one
  • Slightly changed the run task's behavior in case the run stops in the middle of the retry loop

I rebased the PR on top of #37 for the CI to pass, so please review that one first. As for the commit that changes the behavior on error (the last one), I found the issue when improving the error printing logic and included the fix here instead of sending it separately, in order to reduce the chance of merge conflicts.

@piodul piodul requested a review from cvybhu December 1, 2022 12:03
Comment on lines +287 to +289
for err in errors {
    // The {:#} syntax makes sure that the error is printed in one line
    writeln!(out, " {:#}", err)?;
Contributor:

Is formatting using {:#} documented somewhere? I took a look at https://doc.rust-lang.org/std/fmt/index.html, but it only mentions # with some specifier after it, never just # alone.

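For reference: `#` on its own selects std::fmt's "alternate" form, and what it does is up to each type's Display implementation; anyhow::Error uses it to print the whole context chain on a single line. A minimal sketch of that behavior (not code from this PR):

    use anyhow::{anyhow, Context};

    fn main() {
        // Build an error with one level of added context.
        let err = anyhow!("connection refused").context("failed to contact node");

        // `{}` prints only the outermost context:
        //   failed to contact node
        println!("{}", err);

        // `{:#}` prints the full chain on one line:
        //   failed to contact node: connection refused
        println!("{:#}", err);
    }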

@@ -28,7 +28,7 @@ pub fn parse_int(mut s: &str, bit_size: u32) -> Result<i64> {

 // Detect the sign
 let negative = s.starts_with('-');
-s = s.strip_prefix(&['-', '+']).unwrap_or(s);
+s = s.strip_prefix(['-', '+']).unwrap_or(s);
Contributor:

#37 has been merged, so let's rebase.

Collaborator Author:

Done, but clippy started to complain again. I will rebase again after #38 is merged, as it contains the necessary fixes.

stats: &Stats,
errors: &[anyhow::Error],
out: &mut impl Write,
) -> Result<()> {
Contributor:

Maybe it would be good to limit the maximum number of errors that are printed.
We don't want to output millions of lines on a big benchmark.

Collaborator Author:

There will be at most as many errors as the concurrency parameter. So, in the worst case, it will be thousands.

Besides, this is what scylla-bench does, and I'm striving for compatibility - maybe not with regard to exact error strings, but at least formatting...

src/run.rs Outdated
Comment on lines 278 to 293
 while let Some(worker_result) = worker_handles.next().await {
     if let Err(err) = worker_result {
-        result = Err(err);
-        ctx.ask_to_stop();
+        errors.push(err);
     }
 }
Contributor:

Why was the ask_to_stop removed?

If one worker fails, it will stop, but the other workers will continue to send queries indefinitely.

Collaborator Author:

Right, it doesn't make sense. Good catch. ask_to_stop is now only called when the global number of retries is exhausted, but not when the local number of retries is exhausted. I'll fix it and add a test.

Collaborator Author:

Done. I also added a regression test which checks that an error from one task stops the others.

src/run.rs Outdated
Comment on lines 153 to 158
Err(err.context(format!(
    "Maximum number of retries per operation exceeded ({})",
    self.context.max_retries_per_op,
)))
}
Contributor:

The argument that the user passed talks about the maximum number of errors, so maybe it would be better to mention errors in the message as well.

Collaborator Author:

I didn't want to do it because the terminology used in the context talks about retries, not errors. However, error counts might make more sense to users, so I'll use that.

@piodul piodul force-pushed the global-error-limit-flag branch 2 times, most recently from 4c2a1db to 912b419 Compare May 27, 2023 11:20
@piodul (Collaborator, Author) commented May 27, 2023

v3:

  • Rebased (will need to do it once again after "Update driver version from 0.5 to 0.8.1" #38)
  • Fixed a bug that prevented the benchmark from stopping when one task fails with an error, added a test for it
  • Added a bit more explanation in the comment for decrement_global_retry_counter

@piodul (Collaborator, Author) commented May 27, 2023

Sorry for the long delay, I completely forgot about this PR...

@piodul piodul requested a review from cvybhu May 27, 2023 11:24
@piodul (Collaborator, Author) commented May 29, 2023

It looks like I slightly misunderstood the meaning of the original max-errors-at-row flag. In the original scylla-bench, it does not cause the operation to be retried for the same piece of data; instead, the data is skipped. I'll update the PR to match those semantics.

There are more flags in the original scylla-bench that configure proper retries, but they will be implemented in a separate PR.

@piodul piodul force-pushed the global-error-limit-flag branch from 912b419 to 7f917d2 Compare May 30, 2023 14:21
piodul added 10 commits May 31, 2023 10:10
The value for this configuration option is parsed as u64 by the s-b
frontend, so changing this field's type from usize to u64 makes it
possible to get rid of some casts.

The `max_retries_per_op` field was introduced together with s-b's
`error-at-row-limit` option; however, the semantics of that option were
misunderstood. The option defines the maximum number of errors that
can happen for a single operation _in a row_, not the number of
retries. This commit changes the semantics and the name of that option
to `max_consecutive_errors_per_op`, which better aligns with the
original implementation. The `test_retrying` function is adjusted
accordingly.

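A minimal sketch of the "errors in a row" counting this commit switches to; the type and names below are made up for illustration, the real logic lives in the run module:

    // Hypothetical illustration: consecutive errors reset on success.
    struct ErrorStreak {
        consecutive_errors: u64,
        max_consecutive_errors_per_op: u64,
    }

    impl ErrorStreak {
        /// Returns false once the in-a-row limit is exceeded.
        fn record(&mut self, succeeded: bool) -> bool {
            if succeeded {
                self.consecutive_errors = 0; // a success breaks the streak
                true
            } else {
                self.consecutive_errors += 1;
                // Within the limit: skip the data and move on; beyond it: give up.
                self.consecutive_errors <= self.max_consecutive_errors_per_op
            }
        }
    }

    fn main() {
        let mut s = ErrorStreak { consecutive_errors: 0, max_consecutive_errors_per_op: 2 };
        assert!(s.record(false)); // 1 error in a row
        assert!(s.record(true));  // success resets the counter
        assert!(s.record(false)); // 1 error in a row again, still fine
    }
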
Implements a parameter which allows limiting the total number of errors
that will be tolerated before the run is stopped with a failure. This
parameter will be used to implement a CLI flag with similar semantics
in the scylla-bench frontend (`error-limit`).

Implements the "error-limit" flag which was recently added in
scylladb/scylla-bench#110. The flag allows to limit the total number
of errors allowed before a run is stopped. Support for that flag
is implemented via the `max_errors_in_total` parameter added
in the previous commit.
Now, the values of the "error-limit" and "error-at-row-limit" flags are
printed along with the other parameters when the scylla-bench frontend
starts.

Now the error message returned from a single task contains information
about the maximum number of allowed errors being exceeded.

Previously, when the run was signalled to stop, a task that failed an
operation and noticed that it should stop returned an error. Now it
doesn't; it only returns an error if either all global retries are
exhausted or an error occurred more than `max_retries_per_op` times.

The new semantics make more sense. For example, if the test is stopped
manually while 10 consecutive retries are allowed, a task that has
performed 5 retries and notices that it should stop shouldn't return
an error, because the operation might have succeeded had it tried one
more time.

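A sketch of the decision described above; the shape and names are hypothetical, but it captures the rule that a stop signal observed mid-retry should not be reported as a failure:

    // Hypothetical reduction of the retry-loop exit rule.
    fn after_failed_try(asked_to_stop: bool, retries_left: u64) -> Result<(), &'static str> {
        if asked_to_stop {
            // The run is ending for an unrelated reason; the operation
            // might have succeeded on a later retry, so report no error.
            return Ok(());
        }
        if retries_left == 0 {
            return Err("maximum number of errors per operation exceeded");
        }
        Ok(()) // keep retrying
    }

    fn main() {
        // Stopped mid-retry with budget remaining: not an error.
        assert!(after_failed_try(true, 5).is_ok());
        // Budget exhausted without a stop signal: an error.
        assert!(after_failed_try(false, 0).is_err());
    }
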
Adds a regression test which verifies that if an operation is stopped
due to an error, then other operations will be stopped as well. This
issue was found during the review of this pull request; it was not
present in the master version this PR was based on.
@piodul piodul force-pushed the global-error-limit-flag branch from 7f917d2 to d39e3f9 Compare May 31, 2023 08:10
@piodul (Collaborator, Author) commented May 31, 2023

v4:

  • Aligned the semantics of the error-at-row-limit flag with what happens in scylla-bench
  • Rebased on top of master to get rid of clippy complaints.

@piodul (Collaborator, Author) commented Jun 6, 2023

@cvybhu review ping

@cvybhu (Contributor) left a comment:

Reviewed, sorry it took so long.

Comment on lines 330 to +333
 // Zero means unlimited tries,
 // and #tries == #retries + 1,
 // therefore just subtract with wraparound and treat u64::MAX as infinity
-let max_retries_per_op = max_errors_at_row.get().wrapping_sub(1);
+let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);
Contributor:

Is this sub(1) correct?

I think that the maximum number of consecutive errors should be equal to the value of error-at-row-limit, not error-at-row-limit - 1.

Another comment says:

/// The maximum number of consecutive errors allowed before giving up.
pub max_consecutive_errors_per_op: u64,

So AFAIU, the definition of max_consecutive_errors_per_op is the same as that of error-at-row-limit, not one less.

Contributor:

I don't understand what the comment about retries is trying to say. Previously it made some sense, as the variable used to specify the number of retries, but now it would be good to update it so that it explains the new variable.

Contributor:

The test test_retrying() sets max_consecutive_errors_per_op = 0 and then runs an operation which fails the first time it's executed. It's expected that the run will fail.

With sub(1) this would correspond to error-at-row-limit = 1. That would mean that error-at-row-limit = 1 doesn't allow any errors, which sounds wrong.
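A tiny worked reduction of the off-by-one being pointed out here (the guard shape follows the `>=` comparison quoted later in the review):

    fn main() {
        // The frontend maps error-at-row-limit = 1 through wrapping_sub(1):
        let max_consecutive_errors_per_op: u64 = 1u64.wrapping_sub(1); // == 0

        // On the very first error, the number of errors seen so far is 0,
        // and the `consecutive_errors >= max_...` guard already fires:
        let consecutive_errors: u64 = 0;
        assert!(consecutive_errors >= max_consecutive_errors_per_op);

        // So error-at-row-limit = 1 would tolerate no errors at all,
        // which is the inconsistency the reviewers describe.
    }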

@@ -174,9 +174,9 @@ where
let max_errors_at_row = flag.u64_var(
Contributor:

This name is a bit confusing; I think it could just be max_consecutive_errors_per_op, no?

-    let next_op_id = self.context.issue_operation_id()?;
-    self.op_id = next_op_id;
-}
+self.op_id = self.context.issue_operation_id()?;
Contributor:

Is it ok to increase op_id on failure?

AFAIU, previously the op_id was a unique identifier for each operation, so if I wanted to insert values 0..100 I could just insert op_id. Then, in the case where inserting 13 fails, I know that the executor will try to insert 13 again, not 14.
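A sketch of the invariant being described, with hypothetical names: a fresh op_id should be issued only when the previous operation actually finished, so a failed insert of 13 is retried as 13, not 14:

    // Hypothetical id source illustrating the retry invariant.
    struct IdSource {
        next: u64,
    }

    impl IdSource {
        fn issue(&mut self) -> u64 {
            let id = self.next;
            self.next += 1;
            id
        }
    }

    fn main() {
        let mut ids = IdSource { next: 13 };
        let op_id = ids.issue(); // the operation inserting value 13
        // If insert(op_id) fails, the executor should retry with the same
        // op_id it already holds, and only call ids.issue() after success.
        assert_eq!(op_id, 13);
        assert_eq!(ids.issue(), 14); // issued only for the *next* operation
    }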

let mut cfg = make_test_cfg(move || Op::new(shared_counter.clone()));
cfg.concurrency = error_count;
cfg.max_consecutive_errors_per_op = 1; // We need to allow the runner to retry individual failures
cfg.max_errors_in_total = retry_limit;
Contributor:

I think it would be good to rename retry_limit to max_errors_allowed; it would be easier to understand.

} else {
println!(
"Max error number at row: {}",
self.max_consecutive_errors_per_op as u128 + 1,
Contributor:

This +1 could be removed after removing sub(1)

Contributor:

Also, there's no \t in this particular print; is this intentional?

-Err(err)
+Err(err.context(format!(
+    "Maximum number of errors allowed per operation exceeded ({})",
+    self.context.max_consecutive_errors_per_op as u128 + 1,
Contributor:

And this +1 as well

@cvybhu (Contributor) commented Jun 6, 2023

My current understanding of these flags is:

error-at-row-limit - The maximum number of consecutive errors allowed. After exceeding it, the workflow will terminate with an error.

For example, with error-at-row-limit = 3:

This is ok (3 consecutive errors):

initial query - ERROR
retry 1 - ERROR
retry 2 - ERROR
retry 3 - OK

This is not ok (4 consecutive errors):

initial query - ERROR
retry 1 - ERROR
retry 2 - ERROR
retry 3 - ERROR
> The whole run fails <

error-limit - The number of errors after which the run should fail. Having error-limit - 1 errors is fine, but one more and the run fails.
For example, with error-limit = 3, max_errors_in_total = 2 (retry_countdown is initialized to 2):

This is ok:

initial query - ERROR (`retry_countdown.fetch_sub(1) = 2`)
retry 1 - ERROR (`retry_countdown.fetch_sub(1) = 1`)
retry 2 - OK

This is not ok:

initial query - ERROR (`retry_countdown.fetch_sub(1) = 2`)
retry 1 - ERROR (`retry_countdown.fetch_sub(1) = 1`)
retry 2 - ERROR (`retry_countdown.fetch_sub(1) = 0`)
> The whole run fails <
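A minimal sketch of the countdown mechanics described above, assuming a shared atomic counter named as in the comment (the real code may differ):

    use std::sync::atomic::{AtomicU64, Ordering};

    // error-limit = 3 => max_errors_in_total = 2 => countdown starts at 2.
    static RETRY_COUNTDOWN: AtomicU64 = AtomicU64::new(2);

    /// Called on every error; false means the error budget is exhausted.
    /// (This sketch ignores the wraparound after the budget hits zero.)
    fn on_error() -> bool {
        // fetch_sub returns the previous value: 2, then 1, then 0.
        RETRY_COUNTDOWN.fetch_sub(1, Ordering::Relaxed) > 0
    }

    fn main() {
        assert!(on_error());  // 1st error: previous value 2, still ok
        assert!(on_error());  // 2nd error: previous value 1, still ok
        assert!(!on_error()); // 3rd error: previous value 0, run fails
    }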

@muzarski (Collaborator) left a comment:

There is also the matter of rebasing on top of master and adjusting the cassandra-stress frontend to these changes.

After analyzing the code of cassandra-stress:

  • the max_consecutive_errors_per_op config option should be set to 9 by default (as it is now). It's defined via the -errors retries parameter (to be introduced soon). This means that we should try to perform each operation 1 + retries times.
  • there is no corresponding option for max_errors_in_total in c-s. It can be set to u64::MAX.

     }
 }

 // Should be called before starting an operation.
 pub async fn start_operation(&mut self) -> Option<OperationContext> {
-    if self.trial_idx == 0 {
Collaborator:

Why is this removed? The op_id is now incremented even when the current operation failed and should be retried (using the same operation_id).

@@ -330,7 +330,7 @@ where
 // Zero means unlimited tries,
 // and #tries == #retries + 1,
 // therefore just subtract with wraparound and treat u64::MAX as infinity
-let max_retries_per_op = max_errors_at_row.get().wrapping_sub(1);
+let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);
Collaborator:

I agree with the above comments. From what I understood, error-at-row-limit defines the number of errors that can occur before failing. In other words, it defines the number of retries: in the worst-case scenario we should try to perform the operation error-at-row-limit + 1 times (the first try plus error-at-row-limit retries).

I think we should either change it to Option<NonZeroU64> (set to None when error-at-row-limit is 0) and remove the wrapping_sub, or change the >= to > in WorkerSession::end_operation here:

Err(err) if self.consecutive_errors >= self.context.max_consecutive_errors_per_op => {
    Err(err)
}
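A sketch of the first alternative (Option<NonZeroU64> with the wrapping_sub removed); the surrounding names are hypothetical:

    use std::num::NonZeroU64;

    // None encodes error-at-row-limit = 0, i.e. unlimited tries.
    fn parse_limit(error_at_row_limit: u64) -> Option<NonZeroU64> {
        NonZeroU64::new(error_at_row_limit)
    }

    // `prior_errors` is the number of consecutive errors seen *before* the
    // one currently being handled, mirroring the `>=` guard quoted above.
    fn should_give_up(prior_errors: u64, limit: Option<NonZeroU64>) -> bool {
        match limit {
            None => false, // unlimited
            Some(l) => prior_errors >= l.get(),
        }
    }

    fn main() {
        let limit = parse_limit(1); // error-at-row-limit = 1
        assert!(!should_give_up(0, limit)); // 1st error: retry is allowed
        assert!(should_give_up(1, limit));  // 2nd consecutive error: give up
    }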

Collaborator:

It would also be nice to add tests for edge cases such as error-at-row-limit = 1. In this case, the following scenario would be OK:

1st try: ERROR
2nd try: SUCCESS

but this should fail:

1st try: ERROR
2nd try: ERROR
... there is no 3rd try; the benchmark terminates

As of now, I believe that the first scenario wouldn't pass, since we use wrapping_sub(1) together with the >= comparison.

@@ -332,6 +339,9 @@ where
 // therefore just subtract with wraparound and treat u64::MAX as infinity
 let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);

+// Similar to above
+let max_errors_in_total = max_errors.get().wrapping_sub(1);
Collaborator:

Just to make sure I understand correctly - this option (unlike error-at-row-limit) defines the number of errors after which the whole run fails. In other words, when error-limit = 1, we immediately fail the run after the first error occurs. Am I right?

If so, then this wrapping_sub(1) looks semantically correct to me (looking at the WorkerContext::decrement_global_retry_counter).

@roydahan roydahan requested a review from vponomaryov April 3, 2024 20:14
@roydahan (Collaborator) commented Apr 3, 2024

@muzarski (cc @fruch) this is one of the items that must be completed before we can switch SCT cases to use cql-stress.

@roydahan (Collaborator) commented Apr 3, 2024

Actually, I see it's specific to s-b, which is the second phase.
