sched: Support POSIX's SCHED_RR scheduling policy #1338

Draft · wants to merge 7 commits into base: master
46 changes: 35 additions & 11 deletions core/sched.cc
Expand Up @@ -276,6 +276,10 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
}
thread* p = thread::current();

if (p->_realtime.has_slice()) {
Contributor:
I think that today it is possible to set the realtime time slice without setting realtime priority yet, and I think in this case you don't want to increment _run_time. So maybe you also need to check if realtime.priority is > 0?

nmeum (Author), Dec 4, 2024:
I think that today it is possible to set the realtime time slice without setting realtime priority yet […]

Do we want to support setting a time slice without providing a realtime priority? If so I can also adjust the code accordingly.

So maybe you also need to check if realtime.priority is > 0?

Added this for now in 8db3444

p->_realtime._run_time += interval;
}

const auto p_status = p->_detached_state->st.load();
assert(p_status != thread::status::queued);

Expand All @@ -295,8 +299,17 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
} else if (!called_from_yield) {
auto &t = *runqueue.begin();
if (p->_realtime._priority > 0) {
// Only switch to a higher-priority realtime thread
if (t._realtime._priority <= p->_realtime._priority) {
// If the threads have the same realtime priority, then only reschedule
// if the currently executed thread has exceeded its time slice (if any).
if (t._realtime._priority == p->_realtime._priority &&
((!p->_realtime.has_slice() || p->_realtime.has_remaining()))) {
Collaborator:
So all this means we are going to keep the current thread p running — that is, it should stay running if _time_slice is 0 (run 'forever' until it yields or waits) OR there is still time left to run per its _time_slice, right?

Author:

[…] should stay running if _time_slice is 0 (run 'forever' until it yields or waits) OR there is still time left to run per its _time_slice, right?

Yes: p->_realtime.has_slice() checks whether it has a time slice (i.e. _time_slice != 0), and p->_realtime.has_remaining() checks whether there is still time remaining on the slice (if it has one).

Note that, even if the thread has exceeded its time slice it may still be selected to run again if there is no thread with a higher priority. Hence, the priority comparison in the if condition.

#ifdef __aarch64__
return switch_data;
#else
return;
#endif
// Otherwise, only switch to a higher-priority realtime thread.
} else if (t._realtime._priority < p->_realtime._priority) {
#ifdef __aarch64__
return switch_data;
#else
Expand Down Expand Up @@ -336,15 +349,27 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
p->_detached_state->st.store(thread::status::queued);

if (!called_from_yield) {
// POSIX requires that if a real-time thread doesn't yield but
// rather is preempted by a higher-priority thread, it be
// reinserted into the runqueue first, not last, among its equals.
enqueue_first_equal(*p);
if (p->_realtime.has_slice() && !p->_realtime.has_remaining()) {
Contributor:
Again, maybe we need to check if realtime.priority>0 because maybe has_slice() just records some old setting and it's not in use now?

Or, maybe, for simplicity, we should just ensure that if the real-time priority is ever set to 0, then slice is also set to 0?

// The real-time thread has exceeded its time slice, enqueue
// it at the end among its equals and reset the time slice.
enqueue(*p);
p->_realtime.reset_slice();
} else {
// POSIX requires that if a real-time thread doesn't yield but
Collaborator:
So this means that if the current thread p _time_slice is 0 OR p still has some remaining time to run, we will call enqueue_first_equal(). Is this correct?

Contributor:

Yes, I think it's correct. If we got here it means p was preempted. If it still has remaining time, it means it was preempted by a higher-priority realtime thread, but when that higher-priority thread doesn't want to run, this thread p should continue running and finish its current time slice. The documentation says: "A SCHED_RR thread that has been preempted by a higher priority thread and subsequently resumes execution as a running thread will complete the unexpired portion of its round-robin time quantum." It should be the first one in its priority group to run (and therefore enqueue_first_equal()), just like when no time slices existed.

// rather is preempted by a higher-priority thread, it be
// reinserted into the runqueue first, not last, among its equals.
enqueue_first_equal(*p);
}
}

trace_sched_preempt();
p->stat_preemptions.incr();
} else {
// p is no longer running, if it has a realtime slice reset it.
if (p->_realtime.has_slice()) {
p->_realtime.reset_slice();
}
Author (commenting on lines +369 to +372):

It is not entirely clear to me if the slice should be reset if the thread is no longer runnable (e.g. because of blocking I/O). POSIX does not explicitly describe when the slice should be reset.

Contributor:
Yes, I'm also not sure, but I think this if is right. I think the idea of the time slice is to make sure that a single thread in its priority group never runs more than 1ms (for example) without letting other threads in its group run. But if the thread blocks or yields voluntarily (I believe this if covers both cases, right?), then it gives some other thread a chance to run, and that thread too has a chance to run for a whole time slice, so it's only fair that this thread's time slice is reset to zero. I think.
I tried searching if anybody mentions this question, and couldn't find such a discussion.


// p is no longer running, so we'll switch to a different thread.
// Return the runtime p borrowed for hysteresis.
p->_runtime.hysteresis_run_stop();
Expand Down Expand Up @@ -380,7 +405,10 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
n->_runtime.add_context_switch_penalty();
}
preemption_timer.cancel();
if (n->_realtime._priority == 0) {
if (n->_realtime.has_slice()) {
assert(n->_realtime.has_remaining());
preemption_timer.set_with_irq_disabled(now + n->_realtime.remaining());
Contributor:
Please make sure (I think I understood it, but not 100% sure) that the logic above guarantees that if !has_remaining we don't reach here at all.

In particular, what happens if some thread is the only thread in its priority group, so it continues to run? Does the code above recognize this case and remember to reset its time slice?

nmeum (Author), Dec 4, 2024:
Please make sure (I think I understood it, but not 100% sure) that the logic above guarantees that if !has_remaining we don't reach here at all.

Unfortunately, I of course don't have a proof that the assertion always holds. However, from my understanding of the scheduler code, I believe that it should. My thinking was to add it and, if we discover a path where it fails during testing, then we might have found a bug. I can also remove the assertion if you want me to.

In particular, what happens if some thread is the only thread in its priority group, so it continues to run? Does the code above recognize this case and remember to reset its time slice?

If I understand your comment correctly, shouldn't this case be covered by the priority_precedence test that I added? If not, we should expand the test suite to cover this.

} else if (n->_realtime._priority == 0) {
if (!called_from_yield) {
if (!runqueue.empty()) {
auto& t = *runqueue.begin();
Expand Down Expand Up @@ -920,10 +948,6 @@ unsigned thread::realtime_priority() const

void thread::set_realtime_time_slice(thread_realtime::duration time_slice)
{
if (time_slice > 0) {
WARN_ONCE("set_realtime_time_slice() used but real-time time slices"
" not yet supported");
}
_realtime._time_slice = time_slice;
}

Expand Down
19 changes: 19 additions & 0 deletions include/osv/sched.hh
Expand Up @@ -348,6 +348,25 @@ public:
using duration = thread_runtime::duration;
unsigned _priority = 0;
duration _time_slice = duration::zero();

// Total time this thread ran since starting this slice.
duration _run_time = duration::zero();

void reset_slice() {
_run_time = duration::zero();
}

bool has_slice() const {
return _time_slice != duration::zero();
}

bool has_remaining() const {
return _time_slice > _run_time;
}

duration remaining() const {
return _time_slice - _run_time;
}
};

// "tau" controls the length of the history we consider for scheduling,
Expand Down
2 changes: 1 addition & 1 deletion modules/tests/Makefile
Expand Up @@ -146,7 +146,7 @@ tests := tst-pthread.so misc-ramdisk.so tst-vblk.so tst-bsd-evh.so \
tst-sigaction.so tst-syscall.so tst-ifaddrs.so tst-getdents.so \
tst-netlink.so misc-zfs-io.so misc-zfs-arc.so tst-pthread-create.so \
misc-futex-perf.so misc-syscall-perf.so tst-brk.so tst-reloc.so \
misc-vdso-perf.so tst-string-utils.so
misc-vdso-perf.so tst-string-utils.so tst-thread-realtime.so
# libstatic-thread-variable.so tst-static-thread-variable.so \
# tst-f128.so \

Expand Down
128 changes: 128 additions & 0 deletions tests/tst-thread-realtime.cc
@@ -0,0 +1,128 @@
#include <atomic>
#include <string>
#include <iostream>
#include <osv/sched.hh>

#define TIME_SLICE 100000

static std::string name = "tst-thr-wrk";
static int tests = 0, fails = 0;

static void report(bool ok, std::string msg)
{
++tests;
fails += !ok;
std::cout << (ok ? "PASS" : "FAIL") << ": " << msg << "\n";
}

static int fac(int n)
{
    // Note: despite the name, this computes the Fibonacci sequence; it is
    // used purely as CPU-bound busy work to keep the threads runnable.
    if (n == 0) {
        return 0;
    } else if (n == 1) {
        return 1;
    } else {
        return fac(n - 1) + fac(n - 2);
    }
}

bool runtime_equalized()
{
constexpr int num_threads = 5;
constexpr int switches = 3;
static std::atomic<bool> stop_threads;
sched::thread *threads[num_threads];

sched::cpu *c = sched::cpu::current();
for (int i = 0; i < num_threads; i++) {
threads[i] = sched::thread::make([&]{
while (!stop_threads.load()) {
fac(10);
}
}, sched::thread::attr().name(name));

threads[i]->pin(c);
threads[i]->set_realtime_priority(1);
threads[i]->set_realtime_time_slice(sched::thread_realtime::duration(TIME_SLICE));
threads[i]->start();
}

auto runtime = sched::thread_realtime::duration(TIME_SLICE * num_threads * switches);
sched::thread::sleep(runtime);
stop_threads = true;

bool ok = true;
long prev_switches = -1;
for (int i = 0; i < num_threads; i++) {
long switches = threads[i]->stat_switches.get();
if (prev_switches != -1 && prev_switches != switches) {
Contributor:

Am I correct that you want all threads to have exactly the same number of context switches? How can we be confident of this - can't one happen to have one more than the others because of some inaccuracy or something?

Author:

Am I correct that you want all threads to have exactly the same number of context switches?

Yes. My thinking was: if we assign each thread a time slice of size N and then wait N * NUM_THREADS * EXPECTED_SWITCHES time units, then (on a single-core machine under a realtime scheduling policy) we would expect each thread to be preempted every N time units. As such, each thread should have EXPECTED_SWITCHES context switches.

Maybe I am missing something obvious, but I haven't seen this test fail yet. However, since this is not a hard-realtime operating system, I assume we could see delays here and there in the scheduler? We can also make the comparison fuzzy, allowing the number of expected context switches to be off by 1 or 2?

ok = false;
break;
}
prev_switches = switches;
}

for (int i = 0; i < num_threads; i++) {
delete threads[i];
}

return ok;
}

bool priority_precedence()
{
static std::atomic<bool> high_prio_stop;
sched::thread *high_prio = sched::thread::make([&]{
while (!high_prio_stop.load()) {
fac(10);
}
});

static std::atomic<bool> low_prio_stop;
sched::thread *low_prio = sched::thread::make([&]{
while (!low_prio_stop.load()) {
fac(10);
}
});

sched::cpu *c = sched::cpu::current();
high_prio->pin(c);
low_prio->pin(c);

high_prio->set_realtime_priority(2);
low_prio->set_realtime_priority(1);

// The higher priority thread has a time slice, but since there is no other thread
// with the same priority, it should monopolize the CPU and lower_prio shouldn't run.
high_prio->set_realtime_time_slice(sched::thread_realtime::duration(TIME_SLICE));

high_prio->start();
low_prio->start();

auto runtime = sched::thread_realtime::duration(TIME_SLICE * 3);
sched::thread::sleep(runtime);

// Since both threads are pinned to the CPU and the higher priority
// thread is always runnable, the lower priority thread should starve.
bool ok = high_prio->thread_clock().count() > 0 &&
low_prio->thread_clock().count() == 0;
Contributor:

I'm a bit worried that it's theoretically possible that, although you start()ed the high_prio thread first and only then the low_prio one, the low-prio one got to run for a microsecond before the high-prio one, so its thread clock will not be exactly zero. But maybe in practice it doesn't happen...

Contributor:

This test can also have the opposite problem. If I understand correctly, your TIME_SLICE is absolutely tiny (0.1 milliseconds), and after starting high_prio and low_prio you only sleep 3 times that, i.e., 0.3 milliseconds, so it is theoretically possible that the test will pass even without any realtime priorities or anything, just because we let the first-run high_prio thread run for 0.3 milliseconds straight.

Author:

If I understand correctly your TIME_SLICE is absolutely tiny, 0.1 milliseconds

I increased it further in 3c27113

maybe the low prio one got to run for a microsecond before the high prio one

Note: they are both pinned to the same CPU, so the higher-priority one should also be first in the CPU runqueue and should always be runnable, so I am not sure under which scenario the lower-priority one would get the CPU. However, I believe you have more expertise with the scheduling code. We can also discard this test case.

What test cases would you like to see instead for SCHED_RR?


low_prio_stop = true;
high_prio_stop = true;

delete high_prio;
delete low_prio;

return ok;
}

int main(int ac, char** av)
{
// Ensure that the main thread doesn't starve.
sched::thread *cur = sched::thread::current();
cur->set_realtime_priority(10);

report(runtime_equalized(), "runtime_equalized");
report(priority_precedence(), "priority_precedence");
std::cout << "SUMMARY: " << tests << " tests, " << fails << " failures\n";
}