Commit 82a38e7 (1 parent: d340136)

runtime: improve scheduler documentation

2 files changed: 142 additions & 7 deletions

File tree

std/runtime/blocking.jule

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ struct blockingJob {
 }
 
 // Environment for the blocking thread pool.
+// Implements a worker thread pool for blocking tasks.
+// See "Thread synchronization" comment of the scheduler.
 struct blockingenv {
 	maxWorkers: i32 // Constant once initialized.

std/runtime/proc.jule

Lines changed: 140 additions & 7 deletions
@@ -25,8 +25,8 @@
 // It behaves similarly to a typical thread.
 //
 // M (Machine)
-// A real operating system thread.
-// It is responsible for executing a coroutine.
+// A real operating system thread, aka a worker thread.
+// It is responsible for executing coroutines, timers, and other work.
 // Only as many M instances may be created as permitted by COMAXPROCS.
 //
 // P (Processor)
// P (Processor)
@@ -44,13 +44,130 @@
 // When a parked coroutine is woken up, it must be enqueued into the runnable
 // coroutine queue. The scheduler decides when it will actually execute.
 //
+// WORKER THREAD PARKING/UNPARKING
+//
+// We need to balance between keeping enough running worker threads to utilize
+// available hardware parallelism and parking excessive running worker threads
+// to conserve CPU resources and power. This is not simple for two reasons:
+// (1) scheduler state is intentionally distributed (in particular, per-P work
+// queues), so it is not possible to compute global predicates on fast paths;
+// (2) for optimal thread management we would need to know the future (don't
+// park a worker thread when a new coroutine will be readied in the near
+// future).
+//
+// The current approach applies to two primary sources of potential work:
+// readying a coroutine and new/modified-earlier timers.
+// See below for additional details.
+//
+// We unpark an additional thread when we submit work if (this is wakep()):
+// 1. There is an idle P, and
+// 2. There are no "spinning" worker threads.
+//
+// A worker thread is considered spinning if it is out of local work and did
+// not find work in the global run queue or eventpoller; the spinning state
+// is denoted in m.spinning and in sched.nmspinning. Threads unparked this
+// way are also considered spinning; we don't do coroutine handoff, so such
+// threads are out of work initially. Spinning threads spin looking for work
+// in per-P run queues and timer heaps, or from the GC, before parking. If a
+// spinning thread finds work, it takes itself out of the spinning state and
+// proceeds to execution. If it does not find work, it takes itself out of
+// the spinning state and then parks.
+//
+// If there is at least one spinning thread (sched.nmspinning>1), we don't
+// unpark new threads when submitting work. To compensate for that, if the
+// last spinning thread finds work and stops spinning, it must unpark a new
+// spinning thread. This approach smooths out unjustified spikes of thread
+// unparking, but at the same time guarantees eventual maximal CPU
+// parallelism utilization.
+//
+// The main implementation complication is that we need to be very careful
+// during spinning->non-spinning thread transition. This transition can race
+// with submission of new work, and either one part or another needs to
+// unpark another worker thread. If they both fail to do that, we can end up
+// with semi-persistent CPU underutilization.
+//
+// The general pattern for submission is:
+// 1. Submit work to the local or global run queue, or timer heap.
+// 2. #StoreLoad-style memory barrier.
+// 3. Check sched.nmspinning.
+//
+// The general pattern for spinning->non-spinning transition is:
+// 1. Decrement nmspinning.
+// 2. #StoreLoad-style memory barrier.
+// 3. Check all per-P work queues for new work.
+//
+// Note that all this complexity does not apply to the global run queue, as
+// we are not sloppy about thread unparking when submitting to the global
+// queue. Also see comments for nmspinning manipulation.
+//
+// How these different sources of work behave varies, though it doesn't
+// affect the synchronization approach:
+// * Ready coroutine: this is an obvious source of work; the coroutine is
+//   immediately ready and must run on some thread eventually.
+// * New/modified-earlier timer: the current timer implementation uses
+//   eventpoll in a thread with no work available to wait for the soonest
+//   timer. If there is no thread waiting, we want a new spinning thread to
+//   go wait.
+//
 // STACK-ORIENTED COROUTINE HANDLING
 //
 // Jule runtime prioritizes avoiding heap allocations for coroutines.
 // Each coroutine instance is used from the stack. When necessary,
 // it can be copied or passed around via references/pointers.
 // Since it is stack-oriented, it must be handled carefully.
 // Otherwise, a stale coroutine copy may be used and cause critical issues.
+//
+// SYSCALLS/BLOCKING OPERATIONS
+//
+// Syscalls that cannot be scheduled by the scheduler (for example, blocking
+// read/write operations or waiting for a process) and other blocking tasks
+// block/occupy the worker thread (M) executing them. The scheduler cannot
+// detect this situation and therefore assumes that the worker thread is
+// still executing work, so no new M is spawned to replace it. This can lead
+// to a loss of parallelism.
+//
+// Neither the scheduler nor the standard library explicitly tries to prevent
+// this; the responsibility is delegated to the developer. The scheduler
+// always assumes that coroutines are schedulable. This follows a "force
+// correct usage" philosophy rather than a "save everything" philosophy.
+// Blocking tasks must be handled by the developer outside of the scheduler.
+//
+// Go addresses this at the scheduler level: when a worker thread is blocked
+// (for example, in a syscall), it releases its P, allowing the scheduler to
+// pair the released P with another M and continue executing ready work.
+// While this solves the lost worker thread problem, it can lead to a
+// significant increase in thread creation. In such cases, an approach
+// similar to "one thread per blocking task" is not sufficiently efficient
+// or performant.
+//
+// The Jule runtime does not attempt to automatically tolerate blocking
+// tasks, but it does not completely ignore them either. A runtime-managed
+// blocking task pool is provided for developers. This pool creates a
+// reasonable number of worker threads and distributes all blocking tasks
+// among them. Developers are expected to intentionally manage blocking
+// tasks using this pool.
+// The implementation is located at: `runtime/blocking.jule`
+//
+// THREAD SYNCHRONIZATION
+//
+// Various mechanisms are used together to synchronize threads. For example,
+// eventpoll uses operating-system-specific facilities such as kqueue,
+// epoll, or IOCP in the background. This ensures that the thread does not
+// consume CPU while waiting in eventpoll, and it is also used to wait for
+// the soonest timer, if any.
+//
+// In cases that require thread park/unpark, a `parker` is used. The parker
+// relies on synchronization primitives such as futex, depending on the
+// underlying operating system.
+//
+// CREDITS
+//
+// The overall structure and several scheduling concepts of this scheduler
+// are heavily inspired by the Go runtime scheduler. In particular, the
+// C:M:P model, work distribution strategy, and worker thread
+// parking/unparking logic follow the same high-level design philosophy.
+//
+// This is NOT a reimplementation of Go's scheduler, nor a line-by-line
+// port. The implementation is purpose-built for the Jule runtime, adapted
+// to its coroutine model, ABI, and constraints.
+//
+// See Go's runtime scheduler in its source code: `runtime/proc.go`
 
 use "std/internal/runtime"
 use "std/internal/runtime/atomic"
@@ -279,7 +396,8 @@ fn pidleget(): &p {
 fn pidlegetSpinning(): &p {
 	mut pp := pidleget()
 	if pp == nil {
-		// We found work that we cannot take, we must synchronize with non-spinning
+		// See "Delicate dance" comment in findRunnable. We found work
+		// that we cannot take, we must synchronize with non-spinning
 		// Ms that may be preparing to drop their P.
 		atomic::Store(&sched.needspinning, 1, atomic::Release)
 		ret nil
@@ -522,8 +640,20 @@ fn injectclist(mut &batch: *[prunqsize]c, batchStart: u32, bsize: u32) {
 		runqputbatch(m.pp, batch, batchStart+n, bsize)
 	}
 
+	// Some P's might have become idle after we loaded `sched.npidle`
+	// but before any coroutines were added to the queue, which could
+	// lead to idle P's when there is work available in the global queue.
+	// That could potentially last until other coroutines become ready
+	// to run. That said, we need to find a way to hedge.
+	//
+	// Calling wakep() here is the best bet; it will do nothing in the
+	// common case (no racing on `sched.npidle`), while it could wake one
+	// more P to execute C's, which might end up with >1 P's: the first one
+	// wakes another P and so forth until there is no more work, but this
+	// ought to be an extremely rare case.
+	//
+	// Also see "Worker thread parking/unparking" comment at the top of the
+	// file for details.
 	wakep()
-	ret
 }
 
 // Gets a coroutine from local runnable queue and writes to cp.
@@ -849,11 +979,13 @@ top:
 	// behalf. If we are not racing and the system is truly fully loaded
 	// then no spinning threads are required, and the next thread to
 	// naturally become spinning will clear the flag.
+	//
+	// Also see "Worker thread parking/unparking" comment at the top of the file.
 	wasSpinning := m.spinning
 	if m.spinning {
 		m.spinning = false
 		if atomic::Add(&sched.nmspinning, -1, atomic::Relaxed) < 0 {
-			panic("findrunnable: negative nmspinning")
+			panic("findRunnable: negative nmspinning")
 		}
 
 		// Note that for correctness, only the last M transitioning from
@@ -1286,9 +1418,10 @@ fn resetspinning() {
 	m.spinning = false
 	nmspinning := atomic::Add(&sched.nmspinning, -1, atomic::Release)
 	if nmspinning < 0 {
-		panic("findrunnable: negative nmspinning")
+		panic("findRunnable: negative nmspinning")
 	}
 	// M wakeup policy is deliberately somewhat conservative, so check if we
-	// need to wakeup another P here.
+	// need to wakeup another P here. See "Worker thread parking/unparking"
+	// comment at the top of the file for details.
 	wakep()
 }
