Thread Scheduler
This page explains the basic mechanics of the OSv thread scheduler in terms of what happens in which parts of the code. Before reading this page, one is encouraged to familiarize oneself with the "The Thread Scheduler" chapter of the OSv Usenix paper.
The core part of the scheduler implementation is `cpu::reschedule_from_interrupt()`. At a high level, this method determines whether the current thread `p` running on this CPU should be switched out for a new thread `n` or should continue running. The key observation is that every kernel or application thread at some point invokes `reschedule_from_interrupt()`, either involuntarily (aka preemption) or voluntarily - waiting for some condition to occur, sleeping, or yielding. This method is called many times during a thread's lifetime, from the moment it starts until it is terminated.
The terms voluntary and involuntary may be misleading, as one might expect a thread to voluntarily give up running, which is only the case when yielding. An alternative nomenclature would be controlled and uncontrolled, respectively - preemption while handling an interrupt is an uncontrolled re-scheduling, and all other cases are controlled, or synchronous, ones. However, in the remaining part of this page we will use the terms voluntary and involuntary.
The state machine of the thread scheduler is described here.
Each `sched::cpu` holds an array of incoming wakeup queues - `incoming_wakeups`. The length of the `incoming_wakeups` array equals the number of CPUs that can source wake-ups, and each entry, of type `incoming_wakeup_queue`, is a lock-less single-producer single-consumer (SPSC) queue holding the threads in the waking state that are to be woken on this CPU and that originate from the same or a different CPU. The queue at a given index `c` is typically populated by `thread::wake_impl()` executing on the originating CPU `c` (the same or a different one) when waking a thread `t`. Before a thread `t` is woken up and appended to its CPU's incoming wakeup queue, its state changes from the typical waiting (or sending_lock) to waking.
The incoming wakeup queues are consumed by the `cpu::handle_incoming_wakeups()` method, which iterates over all of the CPU's queues and enqueues the waking threads onto this CPU's `runqueue` (see below). Threads placed on the `runqueue` change state from waking to queued. In the rare case when a dequeued thread is the currently running one, its state changes to running and it skips the `runqueue`. `handle_incoming_wakeups()` is called by the idle thread (see `do_idle()`), `reschedule_from_interrupt()`, and `thread::yield()`.
In summary, the `incoming_wakeups` queues serve as an intermediary step for threads between the waiting and queued states, and `handle_incoming_wakeups()` processes those queues to populate the given CPU's `runqueue`.
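To make the mechanics concrete, here is a minimal sketch, not the actual OSv code, of how a per-CPU consumer might drain lock-less wakeup queues into a runqueue; `Thread`, `SpscQueue`, `Cpu`, and `max_cpus` are hypothetical stand-ins, and the queue is a simplified LIFO rather than OSv's real queue:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t max_cpus = 64;  // hypothetical fixed CPU count

enum class ThreadState { waiting, waking, queued, running };

struct Thread {
    ThreadState state = ThreadState::waiting;
    Thread* next = nullptr;  // intrusive link used by the wakeup queue
};

// Simplified lock-less queue: one producer CPU, one consumer CPU.
struct SpscQueue {
    std::atomic<Thread*> head{nullptr};
    void push(Thread* t) {  // producer side (the waking CPU)
        t->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(t->next, t,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {}
    }
    Thread* drain() {       // consumer side (this CPU)
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};

struct Cpu {
    std::array<SpscQueue, max_cpus> incoming_wakeups;  // one slot per source CPU
    void handle_incoming_wakeups() {
        for (auto& q : incoming_wakeups) {
            for (Thread* t = q.drain(); t; t = t->next) {
                t->state = ThreadState::queued;  // waking -> queued
                // enqueue on this CPU's runqueue, ordered by runtime (below)
            }
        }
    }
};
```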
Unlike the incoming wakeup queues, each CPU holds a single `runqueue`. The `runqueue` is implemented with a Boost intrusive red-black tree and holds the threads in the queued state, enqueued by `handle_incoming_wakeups()` (see above) and destined to run next on this CPU. The threads are ordered by priority and runtime as described in the thread scheduler chapter of the OSv paper (also see the definitions of `runtime` and `realtime`).
Threads are dequeued from the `runqueue` mainly by `reschedule_from_interrupt()` when it decides to switch out the currently running thread. The other consumers are the load balancer thread and `thread::pin()`, which is used to pin a thread to a given CPU. In the load-balancing case, a thread is dequeued from the `runqueue` of one CPU and enqueued onto the `runqueue` of another; in other words, it is migrated between CPUs as part of load balancing. The pinning case is considerably more complicated.
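As an illustration only, not the OSv data structure, a runtime-ordered runqueue can be modeled with a sorted container whose key is the thread's scaled runtime; the real implementation uses an intrusive red-black tree:

```cpp
#include <cstdint>
#include <set>

// Hypothetical model of a runtime-ordered runqueue.
struct ThreadInfo {
    uint64_t id;
    float runtime;  // decaying, priority-scaled "recent runtime"
};

struct RuntimeOrder {
    bool operator()(const ThreadInfo& a, const ThreadInfo& b) const {
        return a.runtime < b.runtime;  // least-run thread sorts first
    }
};

using RunqueueModel = std::multiset<ThreadInfo, RuntimeOrder>;

// Picking the next thread to run is simply taking the head of the tree.
inline ThreadInfo pick_next(RunqueueModel& rq) {
    auto it = rq.begin();
    ThreadInfo next = *it;
    rq.erase(it);
    return next;
}
```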
Even though the OSv thread scheduler is tick-less, it still needs to employ CPU timers to properly switch out a thread when its runtime is due. The timer mechanism provided by each individual CPU, in essence, allows setting a point in the future at which a timer interrupt should be delivered. On the x86_64 architecture it is implemented by `apic_clock_events`, and on arm64 by `arm_clock_events`. Both implement the `clock_event_driver` interface with two key methods - `void set(std::chrono::nanoseconds nanos)` and `set_callback()` - which respectively set the time at which to deliver an interrupt and register a callback whose `fired()` method is called when the interrupt arrives.
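The shape of this layering, as a simplified sketch (the exact OSv declarations may differ in detail):

```cpp
#include <chrono>

// Simplified sketch of the clock event layering described above.
class clock_event_callback {
public:
    virtual ~clock_event_callback() = default;
    virtual void fired() = 0;  // invoked when the timer interrupt arrives
};

class clock_event_driver {
public:
    virtual ~clock_event_driver() = default;
    // Program the hardware to raise a timer interrupt 'nanos' from now.
    virtual void set(std::chrono::nanoseconds nanos) = 0;
    void set_callback(clock_event_callback* cb) { _callback = cb; }
protected:
    clock_event_callback* _callback = nullptr;
};
```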
The `clock_event_driver` is a lower-level layer capable of setting a single timer event, but the thread scheduler needs a higher-level mechanism to remember the many timer events to be scheduled and executed. To accommodate the latter, OSv provides the `timer_base`, `timer_list`, and `timer_set` abstractions.
The `timer_base` is a base class that implements a timer event and holds:

- `_time` - the point in time when this timer should fire,
- `_state` - one of `free`, `armed`, or `expired`, and
- `_t` of type `timer_base::client`, with:
  - the key method `timer_fired()`, which is called when the timer fires, and
  - `_active_timers`, which holds the list of timers to be `resume()`-d or `suspend()`-ed.
A timer event (or timer) - an instance of `timer_base` - can be `set()`, `reset()`, or `cancel()`-ed.
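A hypothetical skeleton of `timer_base` and its client, reconstructed from the description above (the real OSv classes carry more state and logic):

```cpp
#include <chrono>

// Hypothetical skeleton modeling timer_base; illustrative only.
class timer_base_sketch {
public:
    class client {
    public:
        virtual ~client() = default;
        virtual void timer_fired() = 0;  // notified when the timer expires
        // _active_timers: the client's timers to suspend()/resume()
    };

    explicit timer_base_sketch(client& t) : _t(t) {}

    void set(std::chrono::steady_clock::time_point time) {
        _time = time;
        _state = state::armed;
        // ...insert into the current CPU's timer list and rearm it...
    }
    void cancel() {
        _state = state::free;
        // ...remove from the current CPU's timer list...
    }
    bool expired() const { return _state == state::expired; }

private:
    enum class state { free, armed, expired };
    std::chrono::steady_clock::time_point _time{};
    state _state = state::free;
    client& _t;
};
```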
Each CPU has its own `preemption_timer` of type `timer_base`, with the client being the CPU itself. It is used by `reschedule_from_interrupt()` to schedule the point in time at which to preempt the current or newly switched-to thread, based on its runtime. The class `timer` - a child of `timer_base` with a `sched::thread` as its `timer_base::client` - implements a timer event that wakes a specific thread. `thread::sleep()` is a good example of registering such a thread-client `timer`.
In addition, each CPU holds `timers` - an instance of `timer_list`, with an attribute `_list` of type `timer_set` - to remember the list of all timers, including the `preemption_timer`, scheduled to be set on this CPU. When an instance of `timer` or `timer_base` is `set()`, it is added to the current CPU's `timers._list`, and it is removed when `cancel()`-ed.
When a new timer is `set()` or `reset()`, the `timer_list` is `rearm()`-ed to find and set the next earliest timer by calling `clock_event->set()`.
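Conceptually, the rearm step reduces to programming the hardware with the delta to the earliest pending deadline; a minimal sketch reusing the `clock_event_driver` sketch above (hypothetical names, not the OSv code):

```cpp
#include <chrono>
#include <set>

using time_point = std::chrono::steady_clock::time_point;

// Hypothetical stand-in for the per-CPU state involved in rearming.
struct TimerListSketch {
    std::multiset<time_point> deadlines;  // models the ordered timer_set
    clock_event_driver* clock_event;      // from the sketch above

    void rearm() {
        if (deadlines.empty()) {
            return;  // nothing pending; leave the hardware timer alone
        }
        auto now = std::chrono::steady_clock::now();
        auto earliest = *deadlines.begin();
        // Clamp at zero so an already-due deadline fires immediately.
        auto delta = earliest > now ? earliest - now : time_point::duration{0};
        clock_event->set(
            std::chrono::duration_cast<std::chrono::nanoseconds>(delta));
    }
};
```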
The `timer_list` provides a `callback_dispatch` class implementing the `clock_event_callback` interface, with a static method `fired()` that is invoked by `apic_clock_events` or `arm_clock_events` when the next timer interrupt arrives. `timer_list::callback_dispatch::fired()` delegates by calling `timer_list::fired()` on the current CPU's `timers`. `timer_list::fired()` iterates over the `_list` of all registered timers to find the expired ones and calls `expire()` on each. At the end, each corresponding timer client instance - a CPU or a thread - is notified by a call to its `timer_fired()`. In the case of a thread, this ends up calling `wake()`.
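The expiry pass on a timer interrupt, continuing the hypothetical sketch above:

```cpp
// Hypothetical sketch of the work done when the timer interrupt fires.
void fired_sketch(TimerListSketch& timers) {
    auto now = std::chrono::steady_clock::now();
    // Expire every deadline that is already due and notify its client:
    // expire() marks the timer 'expired' and calls client->timer_fired();
    // for a thread client that ends up calling wake().
    while (!timers.deadlines.empty() && *timers.deadlines.begin() <= now) {
        timers.deadlines.erase(timers.deadlines.begin());
    }
    timers.rearm();  // program the hardware for the next pending deadline
}
```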
Each CPU has a special thread, called the idle thread, that gets to run when the `runqueue` is empty. The idle thread has the lowest priority (numerically, infinite) and can never be preempted.
In essence, the idle thread runs an infinite loop in which it calls `cpu::do_idle()` followed by `cpu::schedule()` (a wrapper around `reschedule_from_interrupt()`). `cpu::do_idle()` tries to populate this CPU's `runqueue`, which is empty at this point, with any waking threads by calling `handle_incoming_wakeups()` described above. `cpu::do_idle()` exits as soon as it enqueues at least one thread; otherwise, after exhausting all 10,000 attempts to `handle_incoming_wakeups()`, it puts the CPU to 'sleep' by executing the `hlt` instruction on x86_64 or `wfi` on arm64.
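A hypothetical sketch of that loop, reusing the `Cpu` type from the first sketch (the halt and the runqueue check are shown as comments):

```cpp
// Hypothetical sketch of the idle thread's body; illustrative only.
void idle_loop_sketch(Cpu& cpu) {
    for (;;) {
        // do_idle(): poll the incoming wakeup queues up to 10,000 times...
        bool enqueued = false;
        for (int attempt = 0; attempt < 10000 && !enqueued; ++attempt) {
            cpu.handle_incoming_wakeups();
            // enqueued = !runqueue_empty(cpu);  // exit as soon as one arrives
        }
        if (!enqueued) {
            // ...then halt until the next interrupt:
            // asm volatile("hlt");  // x86_64 ("wfi" on arm64)
        }
        // schedule(): reschedule_from_interrupt() switches to the new thread.
    }
}
```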
`cpu::reschedule_from_interrupt()` is at the heart of the OSv thread scheduler implementation. Every thread invokes this method many times, either voluntarily or involuntarily, during its lifetime. Now that we better understand the handling of the incoming wakeups, the `runqueue`, and the CPU timers, it should be easier to appreciate what `cpu::reschedule_from_interrupt()` does. In essence, it determines whether the current thread `p` running on this CPU should be switched out for a new thread `n` or should continue running. In detail, it executes the following steps in this order (a condensed sketch of the decision logic follows the list):
- calls `handle_incoming_wakeups()` to add any waking threads to this CPU's `runqueue`,
- does some runtime accounting for the current running thread `p`,
- if the thread `p` is running (most likely the involuntary case), it determines whether `p` should continue to run or be switched out:
  - `p` will continue running if:
    - the `runqueue` is empty (the `preemption_timer` is then cancelled), or
    - the realtime priority of `p` is higher than the priority of the thread `t` at the head of the `runqueue`, or
    - the runtime of `p` is less than the runtime of the thread `t` at the head of the `runqueue` (the `preemption_timer` is then cancelled and optionally set to the delta between the runtimes of `p` and `t`), or
    - the realtime priority of `p` is higher than the priority of the thread `t` and we were called from `yield()`; the two preceding rules apply only when NOT called from `yield()`,
  - otherwise `p` is changed to queued and added back to the `runqueue`, unless called from `yield()`,
- if `p` is NOT running (most likely waiting), it gets switched out but is NOT added to the `runqueue`,
- if we are still here (`p` should stop running), the next thread `n` is selected to run - it is simply the one at the head of the `runqueue`; `n` is dequeued from the `runqueue`,
- the next thread `n` is changed to running,
- if `n` is not a realtime thread, the `preemption_timer` is set to preempt `n`, with the timeout depending on whether we were called from `yield()` or not,
- finally, the new thread `n` is switched to, and from this moment `n` is running.
  - Switching to a new thread involves switching the regular stack, the exception and interrupt stacks if applicable (x86_64 only), the thread pointer register, and finally the instruction pointer itself (which is the same for all threads); this can be more or less tricky depending on the architecture - on x86_64 the switching is implemented by the method `thread::switch_to()` with some inlined assembly; on arm64 it is implemented entirely in assembly to have full control over which registers are used.
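A condensed, self-contained model of the keep-running decision (hypothetical names and types; the real method also handles detached threads, FPU state, and more):

```cpp
// Hypothetical model of the keep-running decision described above.
struct SchedThreadModel {
    float realtime_prio;  // 0 means "not a realtime thread"
    float runtime;        // priority-scaled recent runtime
};

bool should_keep_running(const SchedThreadModel& p,
                         const SchedThreadModel* head,  // runqueue head, or null
                         bool called_from_yield) {
    if (!head) {
        return true;  // runqueue empty: cancel the preemption_timer
    }
    if (p.realtime_prio > head->realtime_prio) {
        return true;  // realtime priority beats the queued thread
    }
    if (!called_from_yield && p.runtime < head->runtime) {
        return true;  // rearm preemption_timer for the runtime delta
    }
    return false;     // p is switched out; head will be picked next
}
```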
`cpu::reschedule_from_interrupt()` is most of the time called indirectly through these two wrapper functions, which behave slightly differently and are used in different scenarios:

- `preempt()` - when an interrupt is handled (the involuntary case),
- `cpu::schedule()` - in most other, voluntary cases.
`cpu::reschedule_from_interrupt()` is also called by:

- `thread::pin(cpu *target_cpu)` - when pinning a thread to a CPU,
- `thread::yield()` - when a thread yields itself.
(TODO: mention the realtime scheduler work in progress.)
Preemption is an involuntary, or uncontrolled, way to reschedule the current thread, and it happens only when handling an interrupt. Types of interrupts include those triggered by the CPU timers (the `preemption_timer` and others), inter-processor interrupts (so-called IPIs), and interrupts signaled by devices.
The generic interrupt handler - in both its aarch64 and x86_64 versions - calls a specific interrupt handler routine first and afterward executes the `preempt()` function. It is common for a device interrupt handler to wake another thread and delegate the work to it.
Please note that the `preempt()` function will only call `reschedule_from_interrupt()` if the current thread is preemptable; otherwise, it sets `needs_reschedule` to true. The idle thread is an example of a thread that can never be preempted. Other threads can temporarily disable preemption by using the `preempt_lock`. The `preempt_lock` is a cheap way to provide thread safety for variables accessed by a single CPU, because a thread in a `preempt_lock` critical section, even if interrupted, will resume where it left off. `sched::preempt_enable()`, called when unlocking the `preempt_lock`, invokes `cpu::schedule()` if `needs_reschedule` is set and interrupts are enabled.
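A usage sketch of the pattern (hypothetical types; OSv's real idiom is `WITH_LOCK(preempt_lock) { ... }`):

```cpp
#include <mutex>  // std::lock_guard

// Hypothetical stand-in for the preemption lock described above.
struct PreemptLockSketch {
    void lock()   { /* ++preempt_counter: preemption now disabled */ }
    void unlock() { /* --preempt_counter; if needs_reschedule and
                       interrupts are enabled, call cpu::schedule() */ }
};

inline PreemptLockSketch preempt_lock_sketch;
inline int per_cpu_counter = 0;  // touched only by code on this CPU

void bump_counter() {
    // The thread may still be interrupted inside the critical section,
    // but it cannot be switched out, so a plain increment is safe for
    // CPU-local data - no atomics or mutexes needed.
    std::lock_guard<PreemptLockSketch> guard(preempt_lock_sketch);
    ++per_cpu_counter;
}
```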
Voluntary, or controlled, rescheduling happens mostly when a given thread `p` needs to wait:
- for some time to pass, or
- for a condition to occur, or
- trying to acquire a mutex in the contended case.
In all the cases above, `cpu::schedule()` is called to switch to another thread `n` while waiting for the condition to occur, so that the original thread `p` can later be switched back in and resumed.
The voluntary re-scheduling is also executed by the scheduler explicitly when completing, pinning, or yielding a thread.
Most threads, many times during their lifetime, reach a point where they need to pause to wait for some condition to occur or for some time to pass. To accommodate this, in summary, the scheduler changes the running thread to waiting, registers it as waiting for a specified condition, switches to another thread, and, hopefully, eventually switches it back in and makes it running again once the condition is met. It should also be obvious that some other thread must wake the waiting thread, so most of the time waiting is followed by waking. Some threads, by design, stay waiting forever until they are terminated.
The most common sequence of state transitions handled by the thread scheduler is: running -> waiting -> waking -> queued -> running. The first transition, from running to waiting, is handled by the `thread::wait*()` methods, and the second, from waiting to waking, by the `thread::wake*()` methods described in the waking section below.
The process of making a thread wait is quite involved, and there are many flavors of it. It relies on the following building blocks: `thread::prepare_wait()`, which prepares to wait; `thread::wait()`, which reschedules to the next thread; and `thread::stop_wait()`, which makes this thread running again. `thread::prepare_wait()` is the one that changes the thread from running to waiting.
`thread::prepare_wait()` and `thread::stop_wait()` are used by the `wait_guard`. All forms of the waiting logic involve the `wait_guard` and `thread::wait()` and are implemented by various templates, with `thread::do_wait_until(Mutex& mtx, Pred pred)` and `thread::do_wait_for(Mutex& mtx, wait_object&&... wait_objects)` being the foundational ones:

- `thread::wait_until(Pred pred)`
- `thread::wait_until(mutex_t& mtx, Pred pred)`
- `thread::wait_until(mutex_t* mtx, Pred pred)`
- `thread::wait_until_interruptible(Pred pred)`
- `thread::wait_for(mutex& mtx, waitable&&... waitables)`
- `thread::wait_for(waitable&&... waitables)`
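The general shape of the predicate-wait pattern as a hypothetical sketch: a RAII guard around `prepare_wait()`/`stop_wait()`, with the actual rescheduling inside the loop:

```cpp
// Hypothetical sketch of wait_until(); Thread must provide
// prepare_wait(), wait(), and stop_wait() as described above.
template <class Thread, class Pred>
void wait_until_sketch(Thread& current, Pred pred) {
    // Models wait_guard: prepare_wait() marks the thread 'waiting';
    // stop_wait() makes it 'running' again on scope exit.
    struct WaitGuard {
        Thread& t;
        explicit WaitGuard(Thread& t_) : t(t_) { t.prepare_wait(); }
        ~WaitGuard() { t.stop_wait(); }
    } guard(current);

    while (!pred()) {
        current.wait();  // reschedule to another thread until woken
    }
}
```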
Waiting for some time to pass (aka sleeping) is implemented by `thread::sleep()` and `thread::sleep_impl()`, which in essence set a timer for the specified duration and then wait for the timer interrupt - `wait_until([&] { return t.expired(); })`.
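Combining the earlier `timer_base_sketch` and `wait_until_sketch`, a hypothetical sleep could look like this (assuming `Thread` derives from `timer_base_sketch::client`):

```cpp
#include <chrono>

// Hypothetical composition of the sketches above: sleep by arming a
// timer whose client is the current thread, then waiting for expiry.
template <class Thread>
void sleep_sketch(Thread& current,
                  std::chrono::steady_clock::duration duration) {
    timer_base_sketch t(current);  // the thread is the timer's client
    t.set(std::chrono::steady_clock::now() + duration);
    // The timer interrupt will call timer_fired() -> wake() on 'current'.
    wait_until_sketch(current, [&] { return t.expired(); });
}
```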
The waiting threads are woken either by an interrupt (especially the "sleeping" ones, but also the "worker" threads used to implement device drivers) or by another thread explicitly calling some version of `thread::wake*()`. `thread::do_wake_with()` and most `thread::wake*` methods delegate to the underlying `thread::wake_impl()`, which in essence changes the waiting thread to a waking one and puts a wakeup on the incoming wakeup queue as described above:

- `thread::do_wake_with(Action action, unsigned allowed_initial_states_mask)`
- `thread::wake()`
- `thread::wake_with_irq_disabled()`
- `thread::wake_lock()`
- `thread_handle::wake()`
- `thread_handle::wake_from_kernel_or_with_irq_disabled()`
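A hypothetical sketch of that wake path, reusing the queue sketch from the beginning of the page (state change plus a push to the target CPU's slot, then an IPI if that CPU is halted):

```cpp
#include <cstddef>

// Hypothetical sketch of wake_impl(): move the thread from waiting to
// waking and hand it over to the target CPU's incoming wakeup queue.
void wake_impl_sketch(Cpu& target, Thread* t, std::size_t source_cpu_id) {
    t->state = ThreadState::waking;                  // waiting -> waking
    target.incoming_wakeups[source_cpu_id].push(t);  // slot owned by source CPU
    // If the target CPU may be idle (halted in hlt/wfi), send it an
    // inter-processor interrupt so it drains the queue promptly.
}
```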
The following methods delegate to `thread::do_wake_with()`: