WALDEMAR KOZACZUK edited this page Mar 17, 2025 · 51 revisions

This page explains the basic mechanics of the OSv thread scheduler in terms of what happens in which parts of the code. Before reading this page, one is encouraged to familiarize oneself with the "Thread Scheduler" chapter of the OSv Usenix paper.

Introduction

The core of the scheduler implementation is the method cpu::reschedule_from_interrupt(). At a high level, this method determines if the current thread p running on this CPU should switch to a new thread n or if p should continue running. The key observation is that every kernel or application thread at some point invokes reschedule_from_interrupt(), either involuntarily (aka preemption) or voluntarily - when waiting for some condition to occur, sleeping, or yielding. This method is called many times during a thread's lifetime, from the moment it starts until it is terminated.

The terms voluntary and involuntary may be misleading, as one may expect a thread to voluntarily give up running, which is only the case when yielding. An alternative nomenclature would call them controlled and uncontrolled, respectively - meaning that preemption when handling an interrupt is an uncontrolled re-scheduling, and in all other cases it is a controlled or synchronous one. However, in the remaining part of this page, we will use the terms voluntary and involuntary.

The state machine of the thread scheduler is described here.

Incoming wakeups

Each sched::cpu holds an array of incoming wakeup queues - incoming_wakeups. The length of the incoming_wakeups array equals the number of cpus that can source wake-ups, and each entry, of the incoming_wakeup_queue type, is a lock-less single-producer single-consumer (spsc) queue that holds a list of threads in the waking state to be woken up on this cpu, originating from the same or a different cpu.

The queue at a given index c is populated by thread::wake_impl() executed on the originating cpu c - which may be the same as or different from this cpu - when waking a thread t. Before thread t is appended to this cpu's incoming wakeup queue, its state changes from the typical waiting (or sending_lock) to waking.

The incoming wakeup queues are consumed by the cpu::handle_incoming_wakeups() method, which iterates over all cpu queues to enqueue the waking threads onto this cpu's runqueue (see below). The threads placed on the runqueue change state from waking to queued. In the rare case when the dequeued thread is the currently running one, its state is changed to running and it skips the runqueue. The handle_incoming_wakeups() is called by the idle thread (see do_idle()), reschedule_from_interrupt(), and thread::yield().

In summary, the incoming_wakeups queues serve as an intermediary step for threads between waiting and queued, and handle_incoming_wakeups() processes those queues to populate the given cpu's runqueue.
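The producer/consumer flow above can be sketched with a toy model. All names here (toy_cpu, toy_thread, wake_from) are illustrative, and a std::deque stands in for OSv's lock-less spsc queue:

```cpp
#include <deque>
#include <string>
#include <vector>

// Hypothetical, simplified model of the per-cpu incoming wakeup path.
enum class state { waiting, waking, queued, running };

struct toy_thread { std::string name; state st = state::waiting; };

struct toy_cpu {
    std::vector<std::deque<toy_thread*>> incoming_wakeups; // one queue per source cpu
    std::deque<toy_thread*> runqueue;

    explicit toy_cpu(size_t ncpus) : incoming_wakeups(ncpus) {}

    // Producer side (thread::wake_impl analogue): mark waking, then enqueue.
    void wake_from(size_t src_cpu, toy_thread* t) {
        t->st = state::waking;
        incoming_wakeups[src_cpu].push_back(t);
    }

    // Consumer side (handle_incoming_wakeups analogue): drain all queues
    // and move waking threads onto the runqueue as queued.
    void handle_incoming_wakeups() {
        for (auto& q : incoming_wakeups) {
            while (!q.empty()) {
                toy_thread* t = q.front();
                q.pop_front();
                t->st = state::queued;
                runqueue.push_back(t);
            }
        }
    }
};
```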

Runqueue

Unlike the incoming wakeup queues, each cpu holds a single runqueue. The runqueue is implemented using a Boost intrusive red-black tree and holds the threads in the queued state, enqueued by handle_incoming_wakeups() (see above) and destined to run next on this cpu. The threads are ordered by priority and runtime as described in the thread scheduler chapter of the OSv paper (also see the definitions of runtime and realtime).
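A minimal sketch of this ordering, assuming a std::multiset in place of the Boost intrusive rbtree; the field names (rt_priority, runtime) are illustrative, not OSv's:

```cpp
#include <set>

// Sketch of runqueue ordering: realtime threads first (higher priority
// wins), then non-realtime threads by smallest accumulated runtime.
struct rq_thread {
    int id;
    unsigned rt_priority; // higher runs first; 0 means "not realtime"
    double runtime;       // lower (less cpu consumed) runs first
};

struct rq_less {
    bool operator()(const rq_thread* a, const rq_thread* b) const {
        if (a->rt_priority != b->rt_priority)
            return a->rt_priority > b->rt_priority; // realtime first
        return a->runtime < b->runtime;             // then smallest runtime
    }
};

using runqueue = std::multiset<rq_thread*, rq_less>;
```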

The threads are dequeued from the runqueue mainly by reschedule_from_interrupt() when it decides to switch out the currently running thread. The other consumers are the load balancer thread and thread::pin(), used to pin a thread to a given cpu. In the load-balancing case, a thread is dequeued from the runqueue of one cpu and enqueued onto the runqueue of another; in other words, it gets migrated between cpus as part of load balancing. The pinning case is considerably more complicated.

Timers

Even though the OSv thread scheduler is tick-less, it still needs to employ cpu timers to properly switch out a thread when its runtime is due. The timer mechanism provided by each individual cpu, in essence, allows setting a point in time in the future when a timer interrupt should be delivered. On the x86_64 architecture, it is implemented by the apic_clock_events and on arm64 by the arm_clock_events. Both apic_clock_events and arm_clock_events implement the clock_event_driver interface with two key methods - void set(std::chrono::nanoseconds nanos) and set_callback() - to respectively set the time to deliver an interrupt and register a callback whose method fired() will be called when the interrupt arrives.
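The interface shape described above can be sketched as follows; the fake_clock_events driver is purely illustrative and "delivers" the interrupt immediately instead of on a real hardware event:

```cpp
#include <chrono>

// Simplified shape of the clock_event_driver interface described above.
struct clock_event_callback {
    virtual void fired() = 0;
    virtual ~clock_event_callback() = default;
};

struct clock_event_driver {
    virtual void set(std::chrono::nanoseconds nanos) = 0;
    void set_callback(clock_event_callback* cb) { _callback = cb; }
    virtual ~clock_event_driver() = default;
protected:
    clock_event_callback* _callback = nullptr;
};

// A fake driver for testing: real hardware would fire the callback later,
// from the timer interrupt, not synchronously inside set().
struct fake_clock_events : clock_event_driver {
    std::chrono::nanoseconds last_set{0};
    void set(std::chrono::nanoseconds nanos) override {
        last_set = nanos;
        if (_callback) _callback->fired();
    }
};
```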

The clock_event_driver is a lower-level layer capable of setting a single timer event, but the thread scheduler needs a higher-level mechanism to memorize many timer events to be scheduled and executed. To accommodate the latter, OSv provides the timer_base, timer_list, and timer_set abstractions.

The timer_base is a base class that implements a timer event and holds:

  • _time - point in time when this timer should be fired,
  • _state that can be free, armed or expired, and
  • _t of type timer_base::client, which provides:
    • the key method timer_fired(), called when the timer fires, and
    • _active_timers, a list of timers to be resume()-ed or suspend()-ed

A timer event (or timer) - an instance of timer_base - can be set(), reset(), or cancel()-ed.

Each cpu has its preemption_timer of type timer_base, with the client being the cpu itself. It is used by reschedule_from_interrupt() to schedule a point in time to preempt the current or newly switched-to thread based on its runtime. The class timer - a child of timer_base - with a sched::thread as the timer_base::client implements a timer event to wake a specific thread. The thread::sleep() is a good example of registering such a thread-client timer.

In addition, each cpu holds timers - an instance of timer_list, with an attribute _list of type timer_set - to memorize the list of all timers, including the preemption_timer, scheduled to be set on this cpu. When an instance of timer or timer_base is set(), it is added to the current cpu's timers._list, and it is removed from it when cancel()-ed. When a new timer is set() or reset(), the timer_list is rearm()-ed to find and set the next earliest timer by calling clock_event->set().

The timer_list provides a callback_dispatch class implementing the clock_event_callback interface with a static method fired() invoked by the apic_clock_events or arm_clock_events when the next timer interrupt arrives. The timer_list::callback_dispatch::fired() delegates by calling the method timer_list::fired() on the current cpu's timers. The timer_list::fired() iterates over the _list of all registered timers to find the expired ones and calls expire() on each. At the end, the corresponding timer client instance - a cpu or a thread - is notified by calling its timer_fired(). In the case of a thread, this ends up calling wake().
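The set()/rearm()/fired() interplay can be modeled with a toy timer list. A std::multimap keyed by deadline stands in for timer_set, and plain integer client ids stand in for timer_base::client; all names here are illustrative:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Toy analogue of timer_list/timer_set: keep timers ordered by deadline,
// rearm to the earliest, and expire everything due on fired().
struct toy_timer_list {
    std::multimap<uint64_t, int> timers; // deadline -> client id
    uint64_t armed = 0;                  // what clock_event->set() was given

    void set(uint64_t deadline, int client) {
        timers.emplace(deadline, client);
        rearm();
    }
    void rearm() {
        if (!timers.empty())
            armed = timers.begin()->first; // the earliest deadline wins
    }
    // Called when the timer interrupt arrives: expire all due timers and
    // return the notified clients (timer_fired() analogue).
    std::vector<int> fired(uint64_t now) {
        std::vector<int> notified;
        while (!timers.empty() && timers.begin()->first <= now) {
            notified.push_back(timers.begin()->second);
            timers.erase(timers.begin());
        }
        rearm(); // set the hardware timer for the next pending deadline
        return notified;
    }
};
```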

Idle thread

Each cpu has a special thread called the idle thread that gets to run when the runqueue is empty. The idle thread has the lowest possible priority (numerically, infinity) and can never be preempted.

In essence, the idle thread runs an infinite loop where it calls cpu::do_idle() followed by cpu::schedule() (a wrapper around reschedule_from_interrupt()). The cpu::do_idle() tries to populate this cpu's runqueue, which is empty at this point, with any waking threads by calling handle_incoming_wakeups() as described above. The cpu::do_idle() exits as soon as it enqueues at least one thread; otherwise, after exhausting all 10,000 attempts to handle_incoming_wakeups(), it makes the cpu go to 'sleep' by executing the hlt instruction on x86_64 or wfi on arm64.
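The bounded-polling pattern of do_idle() can be sketched as below. The callbacks and the 10,000-attempt bound mirror the description above, while the halt hook stands in for hlt/wfi; this is a sketch of the pattern, not OSv's actual code:

```cpp
#include <functional>

// Sketch of the do_idle() polling pattern: retry handle_incoming_wakeups()
// a bounded number of times, then "halt" until the next interrupt.
// Returns true if a wakeup was found before halting.
bool do_idle(const std::function<bool()>& handle_incoming_wakeups,
             const std::function<void()>& halt,
             int max_attempts = 10000) {
    for (int i = 0; i < max_attempts; ++i) {
        if (handle_incoming_wakeups())
            return true; // enqueued at least one thread; stop idling
    }
    halt();              // nothing arrived: sleep until the next interrupt
    return false;
}
```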

Rescheduling

The cpu::reschedule_from_interrupt() is at the heart of the OSv thread scheduler implementation. Every thread invokes this method many times, either voluntarily or involuntarily, during its lifetime.

Now that we better understand the handling of the incoming wakeups, runqueue, and cpu timers, it should be easier to appreciate what the cpu::reschedule_from_interrupt() does. In essence, it determines if the current thread p running on this CPU should switch to a new thread n or if p should continue running. In detail, it executes the following steps in this order:

  • calls handle_incoming_wakeups() to add any waking threads to this cpu's runqueue,
  • does some runtime accounting for the current running thread p,
  • if the thread p is running (most likely involuntary case), it determines if p should continue to run or get switched out;
    • the p will continue running if:
      • the runqueue is empty (the preemption_timer is cancelled then), or
      • the realtime priority of p is higher than the priority of the thread t at the head of the runqueue, or
      • the runtime of p is less than the runtime of thread t at the head of the runqueue (the preemption_timer is cancelled then and optionally set for delta between runtimes of p and t), or
      • the realtime priority of p is higher than the priority of thread t and we were called from yield() (the two preceding conditions apply only when NOT called from yield()),
    • otherwise p's state is changed to queued and p is added back to the runqueue if not called from yield()
  • if p is NOT running (most likely waiting), it gets switched out but is NOT added to the runqueue
  • if we are still here (p should stop running), the next thread n is selected to run - it is simply the one at the head of the runqueue; the n is dequeued from the runqueue,
  • the next thread n is changed to running,
  • if n is not a realtime thread, the preemption_timer is set to preempt the n accordingly depending if it is called from yield() or not,
  • finally, the new thread n is switched to and from this moment new thread n is running.
    • switching to a new thread involves switching the regular stack (plus the exception and interrupt stacks where applicable, on x86_64 only), the thread pointer register, and finally the instruction pointer itself (which is the same for all threads); this can be more or less tricky depending on the architecture - on x86_64 the switching is implemented by the method thread::switch_to() with some inlined assembly, while on arm64 it is implemented entirely in assembly to have full control over which registers are used.
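The core "should p keep running?" comparison for the common non-yield, non-realtime case can be sketched as follows (the type and field names are illustrative):

```cpp
// Sketch of the decision described above: the current thread p keeps
// running if the runqueue is empty or if p has consumed less runtime
// than the thread at the head of the runqueue.
struct sched_thread { double runtime; };

// true  -> p continues running
// false -> p is switched out for the runqueue head
bool keep_running(const sched_thread& p, const sched_thread* runqueue_head) {
    if (!runqueue_head)
        return true; // empty runqueue: nothing to switch to
    return p.runtime < runqueue_head->runtime;
}
```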

The cpu::reschedule_from_interrupt() is most of the time called indirectly through two wrapper functions - cpu::schedule() and thread::yield() - which invoke it slightly differently and are used in different scenarios. It is also called by the preempt() function when handling an interrupt (see the Preemption section below).

Work on a real-time scheduler is in progress (see the Real-time scheduling section below).

Preemption

Preemption is an involuntary or uncontrolled way to reschedule the current thread which happens only when handling an interrupt. Types of interrupts include those triggered by the CPU timers (preemption_timer and others), inter-processor ones (so-called IPIs) and others signaled by devices.

The generic interrupt handler - in both its aarch64 and x86_64 versions - calls the specific interrupt handler routine first and afterward executes the preempt() function. For a device interrupt handler, it is common to wake another thread to delegate the work to.

Preemption lock

Please note that the preempt() function will only call reschedule_from_interrupt() if the current thread is preemptable; otherwise it sets the needs_reschedule flag to true. The idle thread is an example of a thread that can never be preempted. Other threads can temporarily disable preemption by using the preempt_lock. The preempt_lock is a cheap way to provide thread safety for variables accessed by a single cpu, because a thread in a preempt_lock critical section, even if interrupted, will resume where it left off. The sched::preempt_enable() called when unlocking the preempt_lock invokes cpu::schedule() if needs_reschedule is set and interrupts are enabled.
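A minimal model of the preempt_lock/needs_reschedule interplay described above; in OSv the counter and flag are per-cpu variables, so plain members suffice for this single-threaded sketch, and the names here are illustrative:

```cpp
// Toy model: preemption is disabled while counter > 0; a preemption
// request arriving in that window is deferred via needs_reschedule and
// honored when the lock is released.
struct preempt_state {
    int  counter = 0;          // >0 means preemption disabled
    bool needs_reschedule = false;
    int  reschedules = 0;      // counts cpu::schedule() calls (stand-in)

    bool preemptable() const { return counter == 0; }

    void preempt_disable() { ++counter; }

    void preempt_enable() {
        --counter;
        // On unlock, run the deferred reschedule if one was requested.
        if (counter == 0 && needs_reschedule) {
            needs_reschedule = false;
            ++reschedules;     // cpu::schedule() would run here
        }
    }

    // preempt() from the interrupt path: reschedule now, or defer.
    void preempt() {
        if (preemptable())
            ++reschedules;
        else
            needs_reschedule = true;
    }
};
```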

Voluntary switch

The voluntary or controlled rescheduling happens mostly when a given thread p needs to wait:

  • for some time to pass, or
  • for a condition to occur, or
  • for a contended mutex to become available.

In all cases above the cpu::schedule() is called to switch to another thread n while waiting for the condition to occur so that the original thread p can be resumed and switched back in.

The voluntary re-scheduling is also executed by the scheduler explicitly when completing, pinning, or yielding a thread.

Waiting

Most threads, many times in their lifetime, reach a point when they need to pause to wait for some condition to occur or for some time to pass. To accommodate this, in summary, the scheduler changes the running thread to waiting, registers it as waiting for a specified condition, switches to another thread, and eventually switches it back in and makes it running again after the condition is met. It should also be obvious that some other thread must wake the waiting thread, so most of the time waiting is followed by waking. Some threads, by design, stay waiting forever until they are terminated.

The most common sequence of state transitions handled by the thread scheduler is this: running -> waiting -> waking -> queued -> running. The 1st transition from running to waiting is handled by the thread::wait*() methods and the 2nd from waiting to waking by the thread::wake*() methods described in the section waking below.
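The transition cycle above can be captured in a small table. This sketch deliberately restricts itself to the common running -> waiting -> waking -> queued -> running cycle and omits other states such as sending_lock:

```cpp
// Sketch of the common scheduler state transitions, with the code path
// responsible for each transition noted in a comment.
enum class tstate { running, waiting, waking, queued };

bool valid_transition(tstate from, tstate to) {
    switch (from) {
    case tstate::running: return to == tstate::waiting; // thread::wait*()
    case tstate::waiting: return to == tstate::waking;  // thread::wake*()
    case tstate::waking:  return to == tstate::queued;  // handle_incoming_wakeups()
    case tstate::queued:  return to == tstate::running; // reschedule_from_interrupt()
    }
    return false;
}
```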

The process of making a thread wait is fairly involved and comes in many flavors. It involves the following building blocks: thread::prepare_wait() which prepares to wait, thread::wait() which reschedules to the next thread, and thread::stop_wait() which makes this thread running again. The thread::prepare_wait() is the one that changes the thread from running to waiting.

The thread::prepare_wait() and thread::stop_wait() are used by the wait_guard. All forms of the waiting logic involve the wait_guard and thread::wait(), and are implemented by various templates, with thread::do_wait_until(Mutex& mtx, Pred pred) and thread::do_wait_for(Mutex& mtx, wait_object&&... wait_objects) being the foundational ones.
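The wait_guard pattern - prepare_wait() on entry, stop_wait() on exit - can be sketched as an RAII guard. The waiter type here is illustrative; the real wait_guard calls thread::prepare_wait() and thread::stop_wait() on a sched::thread:

```cpp
// RAII sketch: construction moves the thread toward waiting, destruction
// (including during stack unwinding) restores it to running.
struct waiter {
    bool prepared = false;
    void prepare_wait() { prepared = true; }  // running -> waiting
    void stop_wait()    { prepared = false; } // back to running
};

class wait_guard {
    waiter& _w;
public:
    explicit wait_guard(waiter& w) : _w(w) { _w.prepare_wait(); }
    ~wait_guard() { _w.stop_wait(); }
};
```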

Waiting for some time to pass (aka sleeping) is implemented by thread::sleep() and thread::sleep_impl() which in essence involves setting a timer for a specified duration and then waiting for the timer interrupt - wait_until([&] { return t.expired(); }).
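The sleep-as-timer-plus-wait idea can be modeled with a fake clock advanced by hand. All names here are illustrative; in OSv the waiting thread is woken by the timer interrupt rather than by polling a clock:

```cpp
#include <cstdint>

// Toy model of sleep_impl(): arm a one-shot timer, then wait until it
// has expired.
struct toy_timer {
    uint64_t deadline = 0;
    bool expired(uint64_t now) const { return now >= deadline; }
};

// Returns the time at which the sleeper wakes up. The `tick` parameter
// stands in for how far the clock advances before the sleeper is
// re-checked (in reality, the wake-up comes from the timer interrupt).
uint64_t sleep_until(uint64_t now, uint64_t duration, uint64_t tick) {
    toy_timer t;
    t.deadline = now + duration;  // timer.set(duration) analogue
    while (!t.expired(now))       // wait_until([&]{ return t.expired(); })
        now += tick;
    return now;
}
```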

Waking

The waiting threads are either woken by an interrupt (especially the "sleeping" ones, but also the "worker" threads used to implement device drivers) or by another thread explicitly calling some version of thread::wake*(). The thread::do_wake_with() and most thread::wake* methods delegate to the underlying thread::wake_impl(), which in essence changes the waiting thread to a waking one and puts a wakeup on the incoming wakeup queue as described above.
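The state flip performed by wake_impl() can be sketched with an atomic compare-and-swap. This models only the waiting -> waking transition and elides the push onto the incoming wakeup queue; the names are illustrative:

```cpp
#include <atomic>

// Sketch of the wake_impl() state flip: atomically move a thread from
// waiting to waking before handing it to the incoming wakeup queue.
enum class wstate : int { waiting, waking, running };

// Returns true if this call actually performed the wake-up.
bool wake_impl(std::atomic<wstate>& st) {
    wstate expected = wstate::waiting;
    // compare_exchange lets concurrent wakers race safely: only one wins,
    // and a thread that is not waiting is left alone.
    return st.compare_exchange_strong(expected, wstate::waking);
}
```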

A number of other methods delegate to thread::do_wake_with().

Load balancing

Pinning

Real-time scheduling
