Add possibility to stop execution after specific number of errors #36
Conversation
src/run.rs
Outdated
max_retries_per_op: 0,
max_retries_in_total: usize::MAX,
By default retries_in_total is unlimited, but retries_per_op is set to 0. This is a bit inconsistent.
I didn't use max_retries_in_total here because I didn't want to change the existing tests that used max_retries_per_op; however, I don't think it's very bad, so ultimately I changed max_retries_in_total here to 0 and adjusted the affected tests.
Force-pushed 26aa0db to 12c63f0 (v2):
I rebased the PR on top of #37 for the CI to pass, so please review that one first. As for the commit that changes the behavior on error (the last one), I found it while improving the error printing logic and included it here instead of sending it separately, in order to reduce the chance of merge conflicts.
for err in errors {
    // The {:#} syntax makes sure that the error is printed in one line
    writeln!(out, " {:#}", err)?;
Is formatting using {:#} documented somewhere? I took a look at https://doc.rust-lang.org/std/fmt/index.html, but it only mentions # with some specifier after it, never just # alone.
anyhow's documentation suggests using {:#} here: https://docs.rs/anyhow/latest/anyhow/struct.Error.html#display-representations
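A minimal sketch of the difference, assuming only the anyhow crate; the messages are illustrative:

use anyhow::anyhow;

fn main() {
    let err = anyhow!("connection refused")
        .context("failed to reach node")
        .context("benchmark aborted");
    // `{}` prints only the outermost context:
    //   benchmark aborted
    println!("{}", err);
    // `{:#}` prints the whole cause chain on one line:
    //   benchmark aborted: failed to reach node: connection refused
    println!("{:#}", err);
}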
@@ -28,7 +28,7 @@ pub fn parse_int(mut s: &str, bit_size: u32) -> Result<i64> {

    // Detect the sign
    let negative = s.starts_with('-');
-   s = s.strip_prefix(&['-', '+']).unwrap_or(s);
+   s = s.strip_prefix(['-', '+']).unwrap_or(s);
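For context, a minimal standalone version of the changed line: str::strip_prefix takes any Pattern, and recent Rust versions implement Pattern for char arrays, so the borrow that clippy flags as needless can be dropped.

fn main() {
    let mut s = "-42";
    let negative = s.starts_with('-');
    // Strip either sign character if present; the array needs no leading &.
    s = s.strip_prefix(['-', '+']).unwrap_or(s);
    assert!(negative);
    assert_eq!(s, "42");
}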
#37 has been merged, so let's rebase.
Done, but clippy started to complain again. I will rebase again after #38 is merged, as it contains the necessary fixes.
    stats: &Stats,
    errors: &[anyhow::Error],
    out: &mut impl Write,
) -> Result<()> {
Maybe it would be good to limit the maximum number of errors that are printed. We don't want to output millions of lines on a big benchmark.
There will be at most as many errors as the concurrency parameter, so in the worst case it will be thousands. Besides, this is what scylla-bench does, and I'm striving for compatibility - maybe not with regard to exact error strings, but at least with the formatting...
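If a cap were wanted anyway, a hypothetical variant of the printing loop could look like this (MAX_PRINTED_ERRORS and its value are illustrative, not part of the PR):

use std::io::Write;

const MAX_PRINTED_ERRORS: usize = 100; // hypothetical cap

fn print_errors(errors: &[anyhow::Error], out: &mut impl Write) -> std::io::Result<()> {
    for err in errors.iter().take(MAX_PRINTED_ERRORS) {
        // The {:#} syntax makes sure that the error is printed in one line
        writeln!(out, " {:#}", err)?;
    }
    if errors.len() > MAX_PRINTED_ERRORS {
        writeln!(out, " ... and {} more", errors.len() - MAX_PRINTED_ERRORS)?;
    }
    Ok(())
}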
src/run.rs
Outdated
while let Some(worker_result) = worker_handles.next().await {
    if let Err(err) = worker_result {
-       result = Err(err);
-       ctx.ask_to_stop();
+       errors.push(err);
    }
}
Why was the ask_to_stop removed? If one worker fails it will stop, but the other workers will continue to send queries indefinitely.
Right, it doesn't make sense. Good catch. ask_to_stop is now only called when the global number of retries is exhausted, but not when the local number of retries is exhausted. I'll fix it and add a test.
Done, I also added a regression test which checks that an error from one task stops others.
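A minimal sketch of the stop signal being discussed, with hypothetical names (the PR's actual WorkerContext is more involved): one failing worker flips a shared flag that the other workers poll between operations.

use std::sync::atomic::{AtomicBool, Ordering};

struct WorkerContext {
    stop_requested: AtomicBool,
}

impl WorkerContext {
    // Called by a failing worker so that the remaining workers wind down too.
    fn ask_to_stop(&self) {
        self.stop_requested.store(true, Ordering::Relaxed);
    }

    // Polled by every worker between operations.
    fn should_stop(&self) -> bool {
        self.stop_requested.load(Ordering::Relaxed)
    }
}

fn main() {
    let ctx = WorkerContext { stop_requested: AtomicBool::new(false) };
    ctx.ask_to_stop();
    assert!(ctx.should_stop());
}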
src/run.rs
Outdated
Err(err.context(format!(
    "Maximum number of retries per operation exceeded ({})",
    self.context.max_retries_per_op,
)))
}
The argument that the user passed talked about the maximum number of errors, so maybe it would be better to mention errors in the message as well.
I didn't want to do it because the terminology used in the context talks about retries, not errors - however the error counts might make more sense to users, so I'll use that.
Force-pushed 4c2a1db to 912b419 (v3):
Sorry for the long delay, I completely forgot about this PR...
It looks like I slightly misunderstood the meaning of the original error-at-row-limit option. There are more flags in the original scylla-bench that configure proper retries, but they will be implemented in a separate PR.
Force-pushed 912b419 to 7f917d2
The value for this configuration option is parsed as u64 by the s-b frontend, so changing this field's type from usize to u64 makes it possible to get rid of some casts.
The max_retries_per_op option was introduced with s-b's error-at-row-limit option; however, the semantics of that option were misunderstood. This option defines the maximum number of errors that can happen for a single operation _in a row_, not the number of retries. This commit changes the semantics and the name of that option to max_consecutive_errors_per_op, which better aligns with the original implementation. The test_retrying function is adjusted accordingly.
Implements a parameter which allows limiting the number of total errors that will be tolerated before the run is stopped with a failure. This parameter will be used to implement a CLI flag with similar semantics in the scylla-bench frontend (error-limit).
Implements the "error-limit" flag which was recently added in scylladb/scylla-bench#110. The flag allows limiting the total number of errors allowed before a run is stopped. Support for that flag is implemented via the max_errors_in_total parameter added in the previous commit.
Now, the values of the "error-limit" and "error-at-row-limit" flags are printed along with other parameters when the scylla-bench frontend starts.
Now the error message returned from a single task contains information about the maximum number of allowed errors being exceeded.
Previously, when the run was signalled to stop, if a task failed an operation and noticed that it should stop, it returned an error. Now it doesn't, and it only returns an error if either all global retries are exhausted or an error occurred more than max_retries_per_op times. The new semantics make more sense. For example, if the test is stopped manually with 10 consecutive retries allowed, and there is a task which performed 5 retries and notices that it should stop, the task shouldn't return an error, because it might have succeeded if it had tried one more time.
Adds a regression test which verifies that if an operation is stopped due to an error, then other operations will be stopped as well. This issue was found during the review of this pull request; it was not present in the master version this PR was based on.
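The renamed option's semantics, in a minimal sketch with hypothetical field names (the real WorkerSession differs): only errors in a row count, and any success resets the streak.

struct Session {
    consecutive_errors: u64,
    max_consecutive_errors_per_op: u64,
}

impl Session {
    // Returns true if the failed operation may still be retried.
    fn on_error(&mut self) -> bool {
        self.consecutive_errors += 1;
        self.consecutive_errors <= self.max_consecutive_errors_per_op
    }

    // A success resets the streak.
    fn on_success(&mut self) {
        self.consecutive_errors = 0;
    }
}

fn main() {
    let mut s = Session { consecutive_errors: 0, max_consecutive_errors_per_op: 2 };
    assert!(s.on_error()); // 1st consecutive error: retry
    assert!(s.on_error()); // 2nd consecutive error: retry
    s.on_success();        // streak resets
    assert!(s.on_error()); // counting starts over
}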
Force-pushed 7f917d2 to d39e3f9 (v4):
@cvybhu review ping
Reviewed, sorry it took so long.
// Zero means unlimited tries,
// and #tries == #retries + 1,
// therefore just subtract with wraparound and treat u64::MAX as infinity
- let max_retries_per_op = max_errors_at_row.get().wrapping_sub(1);
+ let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);
Is this sub(1) correct? I think that the maximum number of consecutive errors should be equal to the value of error-at-row-limit, not error-at-row-limit - 1.

Another comment says:

/// The maximum number of consecutive errors allowed before giving up.
pub max_consecutive_errors_per_op: u64,

So AFAIU, the definition of max_consecutive_errors_per_op is the same as that of error-at-row-limit, not one less.
I don't understand what the comment about retries is trying to say. Previously it made some sense, as the variable used to specify the number of retries, but now it would be good to update it so that it explains the new variable.
The test test_retrying() sets max_consecutive_errors_per_op = 0, and then runs an operation which fails the first time it's executed. It's expected that the run will fail. With sub(1) this would correspond to error-at-row-limit = 1. That would mean that error-at-row-limit = 1 doesn't allow any errors; this sounds wrong.
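A tiny demonstration of the off-by-one being questioned, assuming the flag value arrives as a NonZeroU64 (as the get() call suggests):

use std::num::NonZeroU64;

fn main() {
    // error-at-row-limit = 1 ...
    let max_errors_at_row = NonZeroU64::new(1).unwrap();
    // ... maps to 0 after wrapping_sub(1) ...
    let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);
    assert_eq!(max_consecutive_errors_per_op, 0);
    // ... so with the >= comparison even a single error fails the run,
    // which contradicts "1 consecutive error allowed".
}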
@@ -174,9 +174,9 @@ where
let max_errors_at_row = flag.u64_var(
This name is a bit confusing; I think it could just be max_consecutive_errors_per_op, no?
-   let next_op_id = self.context.issue_operation_id()?;
-   self.op_id = next_op_id;
- }
+ self.op_id = self.context.issue_operation_id()?;
Is it ok to increase op_id on failure? AFAIU, previously the op_id was a unique identifier for each operation, so if I wanted to insert values 0..100 I could just insert op_id. Then, in the case where inserting 13 fails, I know that the executor will try to insert 13 again, not 14.
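A self-contained sketch of the contract described above (all names hypothetical): a failed operation is retried under the same op_id, and the id advances only on success.

fn main() {
    let mut op_id = 0u64;
    let mut failed_once = false;
    while op_id < 100 {
        // Simulate a single transient failure at operation 13.
        let result: Result<(), ()> = if op_id == 13 && !failed_once {
            failed_once = true;
            Err(())
        } else {
            Ok(())
        };
        if result.is_err() {
            continue; // retry the SAME op_id, so row 13 is not skipped
        }
        op_id += 1; // advance only after success
    }
    println!("all 100 operations completed");
}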
let mut cfg = make_test_cfg(move || Op::new(shared_counter.clone()));
cfg.concurrency = error_count;
cfg.max_consecutive_errors_per_op = 1; // We need to allow the runner to retry individual failures
cfg.max_errors_in_total = retry_limit;
I think it would be good to rename retry_limit to max_errors_allowed; it would be easier to understand.
} else {
    println!(
        "Max error number at row: {}",
        self.max_consecutive_errors_per_op as u128 + 1,
This +1 could be removed after removing sub(1).
Also, there's no \t in this particular print. Is this intentional?
- Err(err)
+ Err(err.context(format!(
+     "Maximum number of errors allowed per operation exceeded ({})",
+     self.context.max_consecutive_errors_per_op as u128 + 1,
And this +1 as well.
My current understanding of these flags is:
- max_consecutive_errors_per_op (s-b's error-at-row-limit) bounds how many errors a single operation may hit in a row;
- max_errors_in_total (s-b's error-limit) bounds the total number of errors across the whole run.

For example, with error-at-row-limit = 3:

This is ok (3 consecutive errors):
ERROR, ERROR, ERROR, SUCCESS

This is not ok (4 consecutive errors):
ERROR, ERROR, ERROR, ERROR

This is ok:

This is not ok:
There is also the matter of rebasing on top of master, and adjusting the cassandra-stress frontend to these changes.

After analyzing the code of cassandra-stress:

- the max_consecutive_errors_per_op config option should be set to 9 by default (as it is now). It's defined via the -errors retries parameter (which will be introduced soon). This means that we should try to perform each operation 1 + retries times (see the sketch after this list).
- there is no corresponding option for max_errors_in_total in c-s. It can be set to u64::MAX.
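A minimal sketch of those defaults, assuming a retries value taken from the future -errors retries flag:

fn main() {
    let retries: u64 = 9; // the c-s default for -errors retries
    // Each operation gets 1 + retries attempts, i.e. up to `retries`
    // consecutive errors are tolerated before giving up.
    let max_consecutive_errors_per_op = retries;
    // c-s has no total error budget, so leave it effectively unlimited.
    let max_errors_in_total = u64::MAX;
    println!("{max_consecutive_errors_per_op} {max_errors_in_total}");
}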
    }
}

// Should be called before starting an operation.
pub async fn start_operation(&mut self) -> Option<OperationContext> {
    if self.trial_idx == 0 {
Why is this removed? The op_id is now incremented even when the current operation failed and should be retried (using the same operation_id).
@@ -330,7 +330,7 @@ where
// Zero means unlimited tries,
// and #tries == #retries + 1,
// therefore just subtract with wraparound and treat u64::MAX as infinity
- let max_retries_per_op = max_errors_at_row.get().wrapping_sub(1);
+ let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);
I agree with the above comments. From what I understood, the error-at-row-limit defines the number of errors that can occur before failing. In other words, it defines the number of retries: in the worst-case scenario we should try to perform the operation error-at-row-limit + 1 times (the first try plus error-at-row-limit retries).

I think we should either change it to Option<NonZeroU64> (set to None when error-at-row-limit is 0) and remove the wrapping_sub, or change the >= to > in WorkerSession::end_operation here:

Err(err) if self.consecutive_errors >= self.context.max_consecutive_errors_per_op => {
    Err(err)
}
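A sketch folding both alternatives into one predicate (should_give_up is a hypothetical helper, not the PR's API); the counter here includes the current error, and None encodes the unlimited case that 0 selects on the command line:

use std::num::NonZeroU64;

// Give up only once the streak EXCEEDS the limit, so error-at-row-limit = 1
// still allows one error followed by a retry.
fn should_give_up(consecutive_errors: u64, limit: Option<NonZeroU64>) -> bool {
    match limit {
        None => false, // unlimited
        Some(limit) => consecutive_errors > limit.get(),
    }
}

fn main() {
    let limit = NonZeroU64::new(1);
    assert!(!should_give_up(1, limit)); // 1st error: retry allowed
    assert!(should_give_up(2, limit));  // 2nd consecutive error: give up
    assert!(!should_give_up(1_000_000, None)); // 0 on the CLI = unlimited
}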
It would also be nice to add tests for edge cases such as error-at-row-limit = 1. In this case, the following scenario would be OK:

1st try: ERROR
2nd try: SUCCESS

but this should fail:

1st try: ERROR
2nd try: ERROR
... there is no 3rd try, the benchmark terminates

As of now, I believe that the first scenario wouldn't pass, since we use wrapping_sub(1) and we make use of the >= comparison.
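A self-contained simulation of both scenarios under the intended semantics (the run helper and its signature are illustrative, not the PR's API):

// Returns true when the run survives, i.e. some try succeeds before the
// number of consecutive errors exceeds error_at_row_limit.
fn run(outcomes: &[bool], error_at_row_limit: u64) -> bool {
    let mut consecutive_errors = 0u64;
    for &ok in outcomes {
        if ok {
            return true;
        }
        consecutive_errors += 1;
        if consecutive_errors > error_at_row_limit {
            return false; // no further tries, the benchmark terminates
        }
    }
    false
}

fn main() {
    // 1st try: ERROR, 2nd try: SUCCESS -> OK
    assert!(run(&[false, true], 1));
    // 1st try: ERROR, 2nd try: ERROR -> there is no 3rd try
    assert!(!run(&[false, false], 1));
}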
@@ -332,6 +339,9 @@ where
// therefore just subtract with wraparound and treat u64::MAX as infinity
let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);

+ // Similar to above
+ let max_errors_in_total = max_errors.get().wrapping_sub(1);
Just to make sure I understand correctly: this option (unlike error-at-row-limit) defines the number of errors after which the whole run fails. In other words, when error-limit = 1, we immediately fail the run after the first error occurs. Am I right?

If so, then this wrapping_sub(1) looks semantically correct to me (looking at WorkerContext::decrement_global_retry_counter).
Actually, I see it's specific to s-b, which is the second phase.
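Under that reading, a minimal sketch of a global error budget (GlobalBudget is hypothetical; the PR uses WorkerContext::decrement_global_retry_counter): with error-limit = 1, the budget after wrapping_sub(1) is 0, so the first error already stops the run.

use std::sync::atomic::{AtomicU64, Ordering};

struct GlobalBudget {
    remaining: AtomicU64, // error-limit - 1, shared by all workers
}

impl GlobalBudget {
    // Returns false when the budget is exhausted and the run must stop.
    fn on_error(&self) -> bool {
        self.remaining
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |r| r.checked_sub(1))
            .is_ok()
    }
}

fn main() {
    // error-limit = 1 -> remaining budget of 0 -> the first error stops the run.
    let budget = GlobalBudget { remaining: AtomicU64::new(0) };
    assert!(!budget.on_error());
}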
Force-pushed 8702378 to e831e30
This PR adds support for the -error-limit flag from the original scylla-bench. The flag allows limiting the total number of errors that can occur during a run before the run is stopped.