Skip to content

Add possibility to stop execution after specific number of errors #36

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
37 changes: 32 additions & 5 deletions src/bin/cql-stress-scylla-bench/args.rs
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,8 @@ pub(crate) struct ScyllaBenchArgs {
pub password: String,
pub mode: Mode,
pub latency_type: LatencyType,
pub max_retries_per_op: u64,
pub max_consecutive_errors_per_op: u64,
pub max_errors_in_total: u64,
pub concurrency: u64,
pub maximum_rate: u64,

Expand Down Expand Up @@ -174,9 +175,15 @@ where
let max_errors_at_row = flag.u64_var(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is a bit confusing, I think it could just be max_consecutive_errors_per_op, no?.

"error-at-row-limit",
0,
"the maximum number of attempts allowed for a single operation. \
"the maximum number of consecutive errors allowed. \
After exceeding it, the workflow will terminate with an error. \
Set to 0 if you want to have unlimited retries",
Set to 0 if you want to disable this limit",
);
let max_errors = flag.u64_var(
"error-limit",
0,
"the number of total errors after which the workflow should stop and fail; \
set it to 0 (the default) to disable this limit",
);
let concurrency = flag.u64_var("concurrency", 16, "number of used tasks");
let maximum_rate = flag.u64_var(
Expand Down Expand Up @@ -330,7 +337,10 @@ where
// Zero means unlimited tries,
// and #tries == #retries + 1,
// therefore just subtract with wraparound and treat u64::MAX as infinity
let max_retries_per_op = max_errors_at_row.get().wrapping_sub(1);
let max_consecutive_errors_per_op = max_errors_at_row.get().wrapping_sub(1);
Comment on lines 337 to +340
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sub(1) correct?

I think that the maximum number of consecutive errors should be equal to the value of error-at-row-limit, not error-at-row-limit - 1.

Another comment says:

/// The maximum number of consecutive errors allowed before giving up.
pub max_consecutive_errors_per_op: u64,

So AFAIU, the definition of max_consecutive_errors_per_op is the same as that of error-at-row-limit, not one less.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what the comment about retries is trying to say, previously it made some sense, as the variable used to specifiy the number of retries, but now it would be good to update it so that it explains the new variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test test_retrying() sets max_consecutive_errors_per_op = 0, and then runs an operation which fails the first time it's executed. It's expected that the run will fail.

With sub(1) this would correspond to error-at-row-limit = 1. That would mean that error-at-row-limit = 1 doesn't allow any errors, this sounds wrong.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the above comments. From what I understood, the error-at-row-limit defines the number of errors that can occur before failing. In other words, it defines the number of retries - in the worst case scenario we should try to perform the operation error-at-row-limit + 1 times (first try + error-at-row-limit retries).

I think we should either change it to Option<NonZeroU64> (set to None when error-at-row-limit is 0) and remove the wrapping_sub or change the >= to > in WorkerSession::end_operation here:

Err(err) if self.consecutive_errors >= self.context.max_consecutive_errors_per_op => {
    Err(err)
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be also nice to add tests for edge cases such as error-at-row-limit = 1. In this case, the following scenario would be OK:

1st try: ERROR
2nd try: SUCCESS

, but this should fail:

1st try: ERROR
2nd try: ERROR
... there is no 3rd try, benchmarks terminates

As of now, I believe that the first scenario wouldn't pass since we use wrapping_sub(1) and we make use of >= comparison.


// Similar to above
let max_errors_in_total = max_errors.get().wrapping_sub(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure i understand correctly - this option (unlike error-at-row-limit) defines the number of errors after which the whole run fails. In other words, when error-limit = 1, we immediately fail the run after the first error occurs. Am i right?

If so, then this wrapping_sub(1) looks semantically correct to me (looking at the WorkerContext::decrement_global_retry_counter).


let hdr_latency_resolution = match hdr_latency_units.get().as_str() {
"ns" => 1,
Expand Down Expand Up @@ -376,7 +386,8 @@ where
mode,
concurrency,
latency_type,
max_retries_per_op,
max_consecutive_errors_per_op,
max_errors_in_total,
maximum_rate,
test_duration: test_duration.get(),
partition_count,
Expand Down Expand Up @@ -418,6 +429,22 @@ impl ScyllaBenchArgs {
println!("Mode:\t\t\t {}", show_mode(&self.mode));
println!("Workload:\t\t {}", show_workload(&self.workload));
println!("Timeout:\t\t {}", format_duration(self.timeout));
if self.max_consecutive_errors_per_op == u64::MAX {
println!("Max error number at row: unlimited");
} else {
println!(
"Max error number at row: {}",
self.max_consecutive_errors_per_op as u128 + 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This +1 could be removed after removing sub(1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, there's no \t in this particular print, is this intentional?

);
}
if self.max_errors_in_total == u64::MAX {
println!("Max error number:\t unlimited");
} else {
println!(
"Max error number:\t {}",
self.max_errors_in_total as u128 + 1,
);
}
println!(
"Consistency level:\t {}",
show_consistency_level(&self.consistency_level)
Expand Down
19 changes: 13 additions & 6 deletions src/bin/cql-stress-scylla-bench/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -93,13 +93,19 @@ async fn main() -> Result<()> {
combined_stats.combine(&partial_stats);
}
result = &mut run_finished => {
let errors = match &result {
Ok(_) => &[],
Err(err) => err.errors.as_slice(),
};
// Combine stats for the last time
let partial_stats = sharded_stats.get_combined_and_clear();
combined_stats.combine(&partial_stats);
printer.print_final(&combined_stats, errors, &mut std::io::stdout())?;
if result.is_ok() {
// Combine stats for the last time
let partial_stats = sharded_stats.get_combined_and_clear();
combined_stats.combine(&partial_stats);
printer.print_final(&combined_stats, &mut std::io::stdout())?;
return Ok(());
} else {
return Err(anyhow::anyhow!("Benchmark failed"));
}
return result.context("An error occurred during the benchmark");
}
}
}
Expand Down Expand Up @@ -145,7 +151,8 @@ async fn prepare(args: Arc<ScyllaBenchArgs>, stats: Arc<ShardedStats>) -> Result
concurrency: args.concurrency,
rate_limit_per_second,
operation_factory,
max_retries_per_op: args.max_retries_per_op as usize,
max_consecutive_errors_per_op: args.max_consecutive_errors_per_op,
max_errors_in_total: args.max_errors_in_total,
})
}

Expand Down
18 changes: 16 additions & 2 deletions src/bin/cql-stress-scylla-bench/stats.rs
Original file line number Diff line number Diff line change
Expand Up @@ -252,7 +252,12 @@ impl StatsPrinter {
Ok(())
}

pub fn print_final(&self, stats: &Stats, out: &mut impl Write) -> Result<()> {
pub fn print_final(
&self,
stats: &Stats,
errors: &[anyhow::Error],
out: &mut impl Write,
) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be good to limit the maximum number of errors that are be printed.
We don't want to output millions of lines on a big benchmark.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be at most as many error as the concurrency parameter. So, in the worst case, it will be thousands.

Besides, this is what scylla-bench does, and I'm striving for compatibility - maybe not with regards to exact error strings, but formatting at least...

let time = Instant::now() - self.start_time;
writeln!(out)?;
writeln!(out, "Results:")?;
Expand All @@ -274,7 +279,16 @@ impl StatsPrinter {
self.print_final_latency_histogram("c-o fixed latency", &ls.co_fixed, out)?;
}

// TODO: "critical errors"
if !errors.is_empty() {
writeln!(
out,
"\nFollowing critical errors were caught during the run:"
)?;
for err in errors {
// The {:#} syntax makes sure that the error is printed in one line
writeln!(out, " {:#}", err)?;
Comment on lines +287 to +289
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is formatting using {:#} documented somewhere? I took look at https://doc.rust-lang.org/std/fmt/index.html, but the only mention # with some specifier after, never just the # alone.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

Ok(())
}
Expand Down
9 changes: 6 additions & 3 deletions src/configuration.rs
Original file line number Diff line number Diff line change
Expand Up @@ -33,9 +33,12 @@ pub struct Configuration {
/// during the stress.
pub operation_factory: Arc<dyn OperationFactory>,

/// The maximum number of attempts an operation should be retried
/// before giving up.
pub max_retries_per_op: usize,
/// The maximum number of consecutive errors allowed before giving up.
pub max_consecutive_errors_per_op: u64,

/// The maximum, global number of errors allowed during the test.
/// After exceeding this number, the bench will be stopped.
pub max_errors_in_total: u64,
}

/// Contains all necessary context needed to execute an Operation.
Expand Down
Loading