Commit 08f8e62

reactor: Disable hot polling if wakeup granularity is too high
We are seeing an issue where random write performance drops massively (100x). This is caused by the Linux DIO kernel thread/workqueue being starved, so AIO write completions aren't served in a timely manner. This doesn't happen with "default" Linux settings, only if `/proc/sys/kernel/sched_wakeup_granularity_ns` (or `/sys/kernel/debug/sched/wakeup_granularity_ns` on newer kernels) is raised. Specifically, the effect can be observed on RHEL 8, as the `tuned` version that ships with it sets this value to 15000000, but it can be reproduced on any other system by bumping that value.

This patch tries to detect this case and, if it applies, warns and disables hot polling (both `--poll-aio` and `--idle-poll-time-us`), which gives back the majority of the performance.

Note that because this setting has moved to debugfs on newer kernels, which requires root rights to read, it's not very likely that we will be able to detect it there. However, RHEL 8 uses an older kernel and is likely the major offender to run into this bug (we have had multiple customers run into this).

Ref scylladb#2696

(cherry picked from commit 099cf61)
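The detection described above can be sketched outside of Seastar as a small probe. This is a minimal illustration, not the patch itself: `read_granularity_ns` and `probe_wakeup_granularity_ns` are hypothetical names, and the sketch reports "not readable" with `std::optional` where the patch uses a value of 0.

```cpp
#include <cstdint>
#include <fstream>
#include <initializer_list>
#include <optional>
#include <string>

// Hypothetical helper (not part of Seastar): parse the first token of `path`
// as an unsigned integer. Returns std::nullopt if the file is missing,
// unreadable (e.g. debugfs without root) or does not start with a number.
std::optional<uint64_t> read_granularity_ns(const std::string& path) {
    std::ifstream in(path);
    uint64_t value = 0;
    if (!(in >> value)) {
        return std::nullopt;
    }
    return value;
}

// Probe the two locations named in the commit message, legacy path first.
std::optional<uint64_t> probe_wakeup_granularity_ns() {
    for (const char* path : {"/proc/sys/kernel/sched_wakeup_granularity_ns",
                             "/sys/kernel/debug/sched/wakeup_granularity_ns"}) {
        if (auto value = read_granularity_ns(path)) {
            return value;
        }
    }
    return std::nullopt;
}
```

A caller would compare the returned value against 15000000, the value the commit message attributes to `tuned` on RHEL 8, and treat `std::nullopt` as "unknown" rather than "safe".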
1 parent 52ca3bf commit 08f8e62

1 file changed: +40 -0 lines changed

src/core/reactor.cc

Lines changed: 40 additions & 0 deletions
@@ -3834,6 +3834,30 @@ static bool kernel_supports_aio_fsync() {
     return internal::kernel_uname().whitelisted({"4.18"});
 }
 
+static std::tuple<std::filesystem::path, uint64_t> wakeup_granularity() {
+    auto try_read = [] (auto path) -> uint64_t {
+        try {
+            return read_first_line_as<uint64_t>(path);
+        } catch (...) {
+            return 0;
+        }
+    };
+
+    auto legacy_path = "/proc/sys/kernel/sched_wakeup_granularity_ns";
+    if (auto val = try_read(legacy_path); val) {
+        return {legacy_path, val};
+    }
+
+    // This will in practice almost always fail because debug fs requires root
+    // perms to read so we are out of luck
+    auto debug_fs_path = "/sys/kernel/debug/sched/wakeup_granularity_ns";
+    if (auto val = try_read(debug_fs_path); val) {
+        return {debug_fs_path, val};
+    }
+
+    return {"", 0};
+}
+
 static program_options::selection_value<network_stack_factory> create_network_stacks_option(reactor_options& zis) {
     using value_type = program_options::selection_value<network_stack_factory>;
     value_type::candidates candidates;
@@ -4535,6 +4559,22 @@ void smp::configure(const smp_options& smp_opts, const reactor_options& reactor_
         .no_poll_aio = !reactor_opts.poll_aio.get_value() || (reactor_opts.poll_aio.defaulted() && reactor_opts.overprovisioned),
     };
 
+    // Disable hot polling if sched wakeup granularity is too high
+    // dio thread will be starved otherwise
+    // see https://github.com/scylladb/seastar/issues/2696
+    if (!reactor_cfg.no_poll_aio || reactor_cfg.max_poll_time != 0us) {
+        auto [wakeup_file, granularity] = wakeup_granularity();
+        // 15M is chosen as it's what tuned sets. Though you probably already
+        // see an adverse effect earlier.
+        if (granularity >= 15000000) {
+            reactor_cfg.no_poll_aio = true;
+            reactor_cfg.max_poll_time = 0us;
+            seastar_logger.warn(
+                "Setting --poll-aio=0 and --idle-poll-time-us=0 due to too high sched_wakeup_granularity of {} in {}",
+                granularity, wakeup_file.string());
+        }
+    }
+
     aio_nowait_supported = reactor_opts.linux_aio_nowait.get_value();
     std::mutex mtx;
 
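The override in the second hunk can be read as a pure decision on two settings. The following is a minimal sketch under assumptions, not Seastar code: `poll_settings` is a hypothetical stand-in for the two `reactor_cfg` fields the patch touches, and `disable_hot_polling_if_needed` is an illustrative name.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the two reactor_cfg fields touched by the patch.
struct poll_settings {
    bool no_poll_aio;
    std::chrono::microseconds max_poll_time;
};

// Threshold used by the patch: the value tuned sets on RHEL 8.
constexpr uint64_t granularity_threshold_ns = 15'000'000;

// Turns off both forms of hot polling when the granularity is at or above
// the threshold. Returns true only if hot polling was on and got disabled.
bool disable_hot_polling_if_needed(poll_settings& cfg, uint64_t granularity_ns) {
    using namespace std::chrono_literals;
    const bool hot_polling = !cfg.no_poll_aio || cfg.max_poll_time != 0us;
    if (!hot_polling || granularity_ns < granularity_threshold_ns) {
        return false;
    }
    cfg.no_poll_aio = true;
    cfg.max_poll_time = 0us;
    return true;
}
```

A granularity of 0, which the patch uses for "could not be read", falls below the threshold and leaves the settings unchanged, so an unreadable debugfs file never disables polling.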