Commit 099cf61

StephanDollberg authored and avikivity committed
reactor: Disable hot polling if wakeup granularity is too high
We are seeing an issue where random write performance degrades massively (100x). It is caused by the Linux DIO kernel thread/workqueue being starved, so AIO write completions aren't served in a timely manner.

This doesn't happen with "default" Linux settings, only if `/proc/sys/kernel/sched_wakeup_granularity_ns` (or `/sys/kernel/debug/sched/wakeup_granularity_ns` on newer kernels) is raised. Specifically, the effect can be observed on RHEL 8, as the `tuned` version that ships with it sets this value to 15000000, but it can be reproduced on any other system by bumping that value.

This patch tries to detect this case and, if it applies, warns and disables hot polling (both `--poll-aio` and `--idle-poll-time-us`), which gives back the majority of the performance.

Note that because this setting has moved to debugfs on newer kernels, which requires root rights to read, it's not very likely that we will be able to detect it there. However, RHEL 8 uses an older kernel and is likely the major offender for this bug (we have had multiple customers run into it).

Ref #2696
1 parent 2a446a7 commit 099cf61

1 file changed: +40 -0 lines changed

src/core/reactor.cc

Lines changed: 40 additions & 0 deletions
@@ -3752,6 +3752,30 @@ static bool kernel_supports_aio_fsync() {
     return internal::kernel_uname().whitelisted({"4.18"});
 }
 
+static std::tuple<std::filesystem::path, uint64_t> wakeup_granularity() {
+    auto try_read = [] (auto path) -> uint64_t {
+        try {
+            return read_first_line_as<uint64_t>(path);
+        } catch (...) {
+            return 0;
+        }
+    };
+
+    auto legacy_path = "/proc/sys/kernel/sched_wakeup_granularity_ns";
+    if (auto val = try_read(legacy_path); val) {
+        return {legacy_path, val};
+    }
+
+    // This will in practice almost always fail because debugfs requires root
+    // permissions to read, so we are out of luck
+    auto debug_fs_path = "/sys/kernel/debug/sched/wakeup_granularity_ns";
+    if (auto val = try_read(debug_fs_path); val) {
+        return {debug_fs_path, val};
+    }
+
+    return {"", 0};
+}
+
 static program_options::selection_value<network_stack_factory> create_network_stacks_option(reactor_options& zis) {
     using value_type = program_options::selection_value<network_stack_factory>;
     value_type::candidates candidates;
@@ -4493,6 +4517,22 @@ void smp::configure(const smp_options& smp_opts, const reactor_options& reactor_
         .no_poll_aio = !reactor_opts.poll_aio.get_value() || (reactor_opts.poll_aio.defaulted() && reactor_opts.overprovisioned),
     };
 
+    // Disable hot polling if sched wakeup granularity is too high
+    // dio thread will be starved otherwise
+    // see https://github.com/scylladb/seastar/issues/2696
+    if (!reactor_cfg.no_poll_aio || reactor_cfg.max_poll_time != 0us) {
+        auto [wakeup_file, granularity] = wakeup_granularity();
+        // 15M is chosen as it's what tuned sets. Though you probably already
+        // see an adverse effect earlier.
+        if (granularity >= 15000000) {
+            reactor_cfg.no_poll_aio = true;
+            reactor_cfg.max_poll_time = 0us;
+            seastar_logger.warn(
+                "Setting --poll-aio=0 and --idle-poll-time-us=0 due to too high sched_wakeup_granularity of {} in {}",
+                granularity, wakeup_file.string());
+        }
+    }
+
     aio_nowait_supported = reactor_opts.linux_aio_nowait.get_value();
     std::mutex mtx;
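The detection described in the commit message can also be tried outside Seastar. Below is a minimal standalone sketch, not the code in this patch: `read_u64` and `should_disable_hot_polling` are hypothetical names, `read_u64` is a plain `std::ifstream` stand-in for Seastar's `read_first_line_as`, and the 15000000 threshold is the value the patch uses.

```cpp
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>

// Hypothetical stand-in for Seastar's read_first_line_as<uint64_t>():
// parse the leading unsigned integer in `path`. Returns std::nullopt if the
// file cannot be opened (e.g. debugfs without root) or holds no number.
std::optional<std::uint64_t> read_u64(const std::string& path) {
    std::ifstream in(path);
    std::uint64_t value = 0;
    if (in >> value) {
        return value;
    }
    return std::nullopt;
}

// Threshold used by the patch: the value `tuned` sets on RHEL 8.
constexpr std::uint64_t starvation_threshold_ns = 15'000'000;

// Hypothetical helper mirroring the patch's condition: hot polling is
// disabled only when a reading exists and is at or above the threshold.
bool should_disable_hot_polling(std::optional<std::uint64_t> granularity_ns) {
    return granularity_ns.has_value() && *granularity_ns >= starvation_threshold_ns;
}
```

Calling `should_disable_hot_polling(read_u64("/proc/sys/kernel/sched_wakeup_granularity_ns"))` and then the same with the debugfs path reproduces the order of checks in the patch. Unlike the patch, this sketch reports an unreadable file as `std::nullopt` rather than 0, which keeps "could not read" distinct from "value is zero".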
