33Compares fiber/goroutine channel performance across Go, Crystal, Rust (Tokio
44`current_thread`), and Tin.
55
6- ## Benchmarks
7-
8- | Name | Description |
9- | --------------| ------------------------------------------------------------------|
10- | ` bench ` | Pingpong - 2 fibers, 1M round trips (SPSC latency baseline) |
11- | ` pipeline ` | 4 relay fibers in series, 1M passes (multi-hop latency) |
12- | ` mpmc ` | 4 producers + 4 consumers, 1M msgs (MPMC throughput) |
13- | ` jitter ` | 8 workers, variable 0-3 yields/task, 1M tasks (scheduler stress) |
14- | ` pipeline10 ` | 10 relay fibers in series, 500K passes (deep pipeline latency) |
15- | ` fanout ` | 1 producer fans out to 8 worker fibers, 1M items (dispatch) |
16-
176## Running
187
198``` bash
@@ -24,14 +13,6 @@ Requires `go`, `crystal`, `cargo`, and `tin` (built at `../tin`).
2413Uses ` hyperfine ` (2 warmup runs + statistical multi-run) if available.
2514Install via ` yay -S hyperfine ` or ` cargo install hyperfine ` .
2615
27- ## Hardware
28-
29- | | |
30- | -| -|
31- | ** CPU** | Intel Core i7-9700K @ 3.60GHz (8 cores, no HT) |
32- | ** RAM** | 32 GB DDR4 |
33- | ** OS** | Arch Linux, kernel 6.19.11-arch1-1 |
34-
3516## Compiler versions
3617
3718| Language | Version |
@@ -45,96 +26,211 @@ Install via `yay -S hyperfine` or `cargo install hyperfine`.
4526
4627Hyperfine wall-clock means (2 warmup + statistical multi-run). Latency and
4728throughput are derived from the wall-clock time / message count internal to
48- each benchmark.
29+ each benchmark. The **bold** row marks Tin's result in each table.
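
The derivation described above can be sketched as follows. This is a minimal illustration, not code from the benchmark suite: the helper names `latency_ns` and `throughput_m_per_s` are hypothetical, and the wall times are copied from the M4 Pro tables in this README.

```python
def latency_ns(wall_ms: float, ops: int) -> float:
    """Average nanoseconds per operation: wall time divided by op count."""
    return wall_ms * 1_000_000 / ops


def throughput_m_per_s(wall_ms: float, ops: int) -> float:
    """Millions of operations per second over the whole run."""
    return ops / (wall_ms / 1_000) / 1_000_000


if __name__ == "__main__":
    # Inputs: hyperfine wall-clock means and the per-benchmark op counts.
    print(f"pingpong:   ~{latency_ns(38.6, 1_000_000):.0f} ns / round trip")
    print(f"pipeline10: ~{latency_ns(95.8, 500_000):.0f} ns / pass")
    print(f"mpmc:       ~{throughput_m_per_s(9.2, 1_000_000):.1f}M msgs/s")
```

Because the input is whole-process wall time, the derived figures are averages over the full run rather than per-message measurements.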
30+
31+ <details open >
32+ <summary ><b >Benchmarks</b > (what each pattern measures)</summary >
33+
34+ | Name | Description |
35+ | --------------| ------------------------------------------------------------------|
36+ | ` bench ` | Pingpong - 2 fibers, 1M round trips (SPSC latency baseline) |
37+ | ` pipeline ` | 4 relay fibers in series, 1M passes (multi-hop latency) |
38+ | ` mpmc ` | 4 producers + 4 consumers, 1M msgs (MPMC throughput) |
39+ | ` jitter ` | 8 workers, variable 0-3 yields/task, 1M tasks (scheduler stress) |
40+ | ` pipeline10 ` | 10 relay fibers in series, 500K passes (deep pipeline latency) |
41+ | ` fanout ` | 1 producer fans out to 8 worker fibers, 1M items (dispatch) |
42+
43+ ** Pingpong** is the cheapest possible channel exercise: two fibers, one
44+ unbuffered channel, bouncing a message back and forth for 1M round trips.
45+ It isolates SPSC latency / context-switch cost more than anything else.
46+
47+ ** Pipeline** chains 4 (or 10) fibers in series so each message hops through
48+ the whole chain before the next one enters. Stresses many cheap wakeups
49+ per message; the deep variant amplifies any per-hop overhead.
50+
51+ ** MPMC** is the contention test: four producers all racing into one shared
52+ buffered channel while four consumers race to drain. Single-runqueue
53+ schedulers shine here because they can dispatch from one ready queue
54+ without a per-thread lock.
55+
56+ ** Jitter** is the irregular-yield stress: each task yields 0-3 times
57+ before it finishes, so wakeups are bursty. Schedulers that can keep every
58+ worker hot win.
59+
60+ ** Fanout** is the dispatch test: one producer hands work to 8 worker fibers
61+ through a single channel. Throughput is gated by how fast the runtime can
62+ hand off ready tasks to idle workers.
63+
64+ The two host tabs below are independent runs of the same suite; absolute
65+ numbers vary by CPU + OS, so the ordering within a host is the relevant
66+ signal.
67+
68+ | Host | Machine | CPU | OS |
69+ | ------| ---------| -----| ----|
70+ | ** M4 Pro** | Apple MacBook Pro 16 (2024) | Apple M4 Pro (10P + 4E) | macOS 26.4.1 (arm64) |
71+ | ** i7-9700K** | Custom desktop | Intel i7-9700K @ 4.9 GHz (8C / 8T) | Arch Linux, kernel 7.0.5-arch1-1 |
72+
73+ </details >
74+
75+ <details >
76+ <summary ><b >M4 Pro</b > (MacBook Pro 16, macOS 26.4.1)</summary >
77+
78+ ### Pingpong - 1M round trips (lower is better)
79+
80+ | Language | Wall time | Latency / round trip |
81+ | ----------| ----------:| ---------------------:|
82+ | ** Tin** | ** 38.6 ms** | ** ~ 39 ns** |
83+ | Crystal | 69.2 ms | ~ 69 ns |
84+ | Rust | 120.3 ms | ~ 120 ns |
85+ | Go | 216.4 ms | ~ 216 ns |
86+
87+ ### Pipeline - 1M passes, 4 stages (lower is better)
88+
89+ | Language | Wall time | Latency / pass |
90+ | ----------| ----------:| ---------------:|
91+ | ** Tin** | ** 97.9 ms** | ** ~ 98 ns** |
92+ | Crystal | 171.0 ms | ~ 171 ns |
93+ | Rust | 249.6 ms | ~ 250 ns |
94+ | Go | 538.1 ms | ~ 538 ns |
95+
96+ ### MPMC - 1M messages, 4 producers + 4 consumers (higher is better)
97+
98+ | Language | Wall time | Throughput |
99+ | ----------| ----------:| -----------:|
100+ | Crystal | 9.2 ms | ~ 108.7M msgs/s |
101+ | Go | 44.7 ms | ~ 22.4M msgs/s |
102+ | ** Tin** | ** 46.6 ms** | ** ~ 21.5M msgs/s** |
103+ | Rust | 102.7 ms | ~ 9.7M msgs/s |
104+
105+ ### Jitter - 1M tasks, 8 workers, 0-3 yields (higher is better)
106+
107+ | Language | Wall time | Throughput |
108+ | ----------| ----------:| -----------:|
109+ | ** Tin** | ** 32.5 ms** | ** ~ 30.8M tasks/s** |
110+ | Rust | 111.8 ms | ~ 8.94M tasks/s |
111+ | Crystal | 253.1 ms | ~ 3.95M tasks/s |
112+ | Go | 393.8 ms | ~ 2.54M tasks/s |
113+
114+ ### Pipeline10 - 500K passes, 10 stages (lower is better)
115+
116+ | Language | Wall time | Latency / pass |
117+ | ----------| ----------:| ---------------:|
118+ | ** Tin** | ** 95.8 ms** | ** ~ 192 ns** |
119+ | Crystal | 257.8 ms | ~ 516 ns |
120+ | Rust | 262.6 ms | ~ 525 ns |
121+ | Go | 586.2 ms | ~ 1172 ns |
122+
123+ ### Fanout - 1M items, 1 producer + 8 workers (higher is better)
124+
125+ | Language | Wall time | Throughput |
126+ | ----------| ----------:| -----------:|
127+ | ** Tin** | ** 49.1 ms** | ** ~ 20.4M items/s** |
128+ | Crystal | 71.6 ms | ~ 14.0M items/s |
129+ | Rust | 164.9 ms | ~ 6.06M items/s |
130+ | Go | 211.1 ms | ~ 4.74M items/s |
131+
132+ </details >
133+
134+ <details >
135+ <summary ><b >i7-9700K</b > (Arch Linux, kernel 7.0.5)</summary >
49136
50137### Pingpong - 1M round trips (lower is better)
51138
52139| Language | Wall time | Latency / round trip |
53140| ----------| ----------:| ---------------------:|
54- | Crystal | 72.3 ms | ~ 72 ns |
55- | ** Tin** | ** 103.6 ms** | ** ~ 104 ns** |
56- | Rust | 297.3 ms | ~ 297 ns |
57- | Go | 537.1 ms | ~ 537 ns |
141+ | Crystal | 72.8 ms | ~ 73 ns |
142+ | ** Tin** | ** 103.7 ms** | ** ~ 104 ns** |
143+ | Rust | 297.8 ms | ~ 298 ns |
144+ | Go | 749.6 ms ± 305 ms | host-noisy |
58145
59146### Pipeline - 1M passes, 4 stages (lower is better)
60147
61148| Language | Wall time | Latency / pass |
62149| ----------| ----------:| ---------------:|
63- | Crystal | 157.6 ms | ~ 158 ns |
64- | ** Tin** | ** 257.1 ms** | ** ~ 257 ns** |
65- | Rust | 694.0 ms | ~ 694 ns |
66- | Go | 1122 ms | ~ 1122 ns |
150+ | Crystal | 155.6 ms | ~ 156 ns |
151+ | ** Tin** | ** 259.1 ms** | ** ~ 259 ns** |
152+ | Rust | 698.7 ms | ~ 699 ns |
153+ | Go | 1317 ms ± 511 ms | host-noisy |
67154
68155### MPMC - 1M messages, 4 producers + 4 consumers (higher is better)
69156
70157| Language | Wall time | Throughput |
71158| ----------| ----------:| -----------:|
72- | Crystal | 11.2 ms | ~ 89.3M msgs/s |
73- | ** Tin** | ** 34.7 ms** | ** ~ 28.8M msgs/s** |
74- | Go | 56.5 ms | ~ 17.7M msgs/s |
75- | Rust | 303.0 ms | ~ 3.30M msgs/s |
159+ | Crystal | 11.3 ms | ~ 88.5M msgs/s |
160+ | ** Tin** | ** 36.1 ms** | ** ~ 27.7M msgs/s** |
161+ | Go | 56.2 ms | ~ 17.8M msgs/s |
162+ | Rust | 304.4 ms | ~ 3.29M msgs/s |
76163
77164### Jitter - 1M tasks, 8 workers, 0-3 yields (higher is better)
78165
79166| Language | Wall time | Throughput |
80167| ----------| ----------:| -----------:|
81- | ** Tin** | ** 67.3 ms** | ** ~ 14.86M tasks/s** |
82- | Rust | 284.3 ms | ~ 3.52M tasks/s |
83- | Go | 407.5 ms | ~ 2.45M tasks/s |
84- | Crystal | 850.5 ms | ~ 1.18M tasks/s |
168+ | ** Tin** | ** 67.6 ms** | ** ~ 14.8M tasks/s** |
169+ | Rust | 289.0 ms | ~ 3.46M tasks/s |
170+ | Go | 409.4 ms | ~ 2.44M tasks/s |
171+ | Crystal | 489.6 ms | ~ 2.04M tasks/s |
85172
86173### Pipeline10 - 500K passes, 10 stages (lower is better)
87174
88175| Language | Wall time | Latency / pass |
89176| ----------| ----------:| ---------------:|
90- | Crystal | 180.0 ms | ~ 360 ns |
91- | ** Tin** | ** 282.2 ms** | ** ~ 564 ns** |
92- | Rust | 752.8 ms | ~ 1506 ns |
93- | Go | 757.6 ms | ~ 1515 ns |
177+ | Crystal | 180.8 ms | ~ 362 ns |
178+ | ** Tin** | ** 287.2 ms** | ** ~ 574 ns** |
179+ | Rust | 748.8 ms | ~ 1498 ns |
180+ | Go | 761.7 ms | ~ 1523 ns |
94181
95182### Fanout - 1M items, 1 producer + 8 workers (higher is better)
96183
97184| Language | Wall time | Throughput |
98185| ----------| ----------:| -----------:|
99- | Crystal | 71.1 ms | ~ 14.06M items/s |
100- | ** Tin** | ** 164.0 ms** | ** ~ 6.10M items/s** |
101- | Rust | 432.7 ms | ~ 2.31M items/s |
102- | Go | 651.2 ms | ~ 1.54M items/s |
186+ | Crystal | 70.6 ms | ~ 14.2M items/s |
187+ | ** Tin** | ** 164.9 ms** | ** ~ 6.06M items/s** |
188+ | Rust | 428.8 ms | ~ 2.33M items/s |
189+ | Go | 744.1 ms ± 343 ms | host-noisy |
190+
191+ </details >
103192
104193## Summary
105194
106- Across the 6 benchmarks Crystal wins 5 (latency-bound channel patterns) and
107- Tin wins 1 (jitter / irregular yield patterns). Tin places second in the
108- remaining 5 and beats Go and Rust on every single benchmark.
195+ Tin leads 5 of the 6 benchmarks on M4 Pro but only jitter on the i7-9700K,
196+ where Crystal wins the other five (pingpong, both pipelines, MPMC, and
197+ fanout). MPMC is the only benchmark Tin loses on both hosts; Crystal wins
198+ it on each.
109199
110- | Benchmark | Tin vs leader | Tin vs runner-up |
111- | -------------| --------------: | -----------------: |
112- | Pingpong | 1.43x slower than Crystal | 2.87x faster than Rust |
113- | Pipeline-4 | 1.63x slower than Crystal | 2.70x faster than Rust |
114- | MPMC | 3.10x slower than Crystal | 1.63x faster than Go |
115- | Jitter | ** leader** | 4.22x faster than Rust |
116- | Pipeline-10 | 1.57x slower than Crystal | 2.67x faster than Rust |
117- | Fanout | 2.31x slower than Crystal | 2.64x faster than Rust |
200+ | Benchmark | Tin on M4 Pro | Tin on i7-9700K |
201+ | -------------| --------------------------- | --------------------------- |
202+ | Pingpong | ** leader ** (1.80x ahead of Crystal) | 1.42x slower than Crystal |
203+ | Pipeline-4 | ** leader ** (1.75x ahead of Crystal) | 1.67x slower than Crystal |
204+ | MPMC | 5.05x slower than Crystal | 3.19x slower than Crystal |
205+ | Jitter | ** leader** (3.44x ahead of Rust) | ** leader ** (4.28x ahead of Rust) |
206+ | Pipeline-10 | ** leader ** (2.69x ahead of Crystal) | 1.59x slower than Crystal |
207+ | Fanout | ** leader ** (1.46x ahead of Crystal) | 2.34x slower than Crystal |
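
The ratios in this table can be reproduced from the wall times in the per-host tables. The sketch below assumes each cell compares Tin with the fastest non-Tin runtime on that host; `tin_vs_rival` is a hypothetical helper, not part of the suite. Ratios recomputed from the rounded table values may differ from the published ones in the last digit.

```python
def tin_vs_rival(tin_ms: float, rival_ms: float, rival: str) -> str:
    """Describe Tin against the fastest non-Tin runtime on one benchmark."""
    if tin_ms <= rival_ms:
        # Tin is fastest: report how far ahead of the runner-up it is.
        return f"leader ({rival_ms / tin_ms:.2f}x ahead of {rival})"
    # Tin is behind: report how much slower it is than the leader.
    return f"{tin_ms / rival_ms:.2f}x slower than {rival}"


if __name__ == "__main__":
    # Pipeline-4: M4 Pro (Tin 97.9 ms, Crystal 171.0 ms),
    # then i7-9700K (Tin 259.1 ms, Crystal 155.6 ms).
    print(tin_vs_rival(97.9, 171.0, "Crystal"))
    print(tin_vs_rival(259.1, 155.6, "Crystal"))
    # Jitter on i7-9700K: Tin 67.6 ms, Rust 289.0 ms.
    print(tin_vs_rival(67.6, 289.0, "Rust"))
```

Wall-time ratios work for both the latency and the throughput benchmarks, since each runtime processes the same message count.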
118208
119209## Notes
120210
121211- ** Scheduler model.** Tin uses M:N scheduling: M fiber coroutines multiplexed
122212 onto N OS worker threads via a single shared run queue. Go uses M:N with
123213 per-P work-stealing queues; Crystal uses M:N green threads. Rust Tokio
124- ` current_thread ` is single-threaded. Tin's shared-queue design avoids
125- per-thread queue overhead at the cost of higher lock contention under
126- MPMC/fanout workloads.
127- - ** Pingpong / pipeline.** Crystal's lead comes from lower green-thread
128- context-switch cost vs OS threads. Tin's ~ 104 ns pingpong is ~ 3x faster than
129- Go/Rust; the gap to Crystal is the cost Tin pays for real OS threads under
130- the channel.
131- - ** Jitter.** Tin now leads this benchmark by a wide margin (~ 12.6x ahead of
132- Crystal, ~ 4.2x ahead of Rust). The irregular yield pattern stresses
133- scheduler-wake throughput; Tin's autoyield + worker-stealing keeps every
134- worker hot, while Crystal's single-runqueue serializes wakeups.
135- - ** MPMC.** Tin's throughput now sits between Go and Crystal. Variance under
136- contention is moderate; multiple runs cluster within ~ 25-40 ms.
137- - ** Fanout / pipeline10.** Both are dispatch-density benchmarks; Crystal's
138- green-thread cheapness dominates. Tin holds a clean 2nd place.
214+ ` current_thread ` is single-threaded. Tin's shared-queue design pays off
215+ on dispatch-heavy patterns but suffers under MPMC contention.
216+ - ** Pingpong / pipeline.** Tin wins both outright on M4 Pro; on the older
217+ 8-core 9700K Crystal's green-thread cheapness still leads by ~ 1.4-1.7x.
218+ Tin and Crystal both land at ~ 40-260 ns per round trip or pass, in the
219+ same order of magnitude and ahead of Go/Rust.
220+ - ** MPMC.** Crystal's single-runqueue green threads dominate on both hosts
221+ (~ 89-109M msgs/s). Tin near-ties Go on M4 Pro (~ 21.5M vs ~ 22.4M msgs/s) and
222+ leads it on the 9700K (~ 27.7M vs ~ 17.8M). Multi-worker channel contention is
223+ where Tin's shared queue hurts most -- a known trade-off, not a regression.
224+ - ** Jitter.** Tin leads on both hosts (3.4-4.3x ahead of the runner-up, Rust).
225+ The irregular yield pattern stresses scheduler-wake throughput; Tin's
226+ autoyield + multi-worker pool keeps every worker hot, while Crystal's
227+ single runqueue serializes wakeups.
228+ - ** Pipeline10 / fanout.** Dispatch-density patterns. M4 Pro's higher core
229+ count + faster IPC tips them to Tin; on the 8-core 9700K Crystal's
230+ green-thread cheapness still wins these two.
231+ - ** Go variance on the 9700K box.** Three Go runs (pingpong, pipeline,
232+ fanout) show standard deviations in the hundreds of ms (±305 to ±511 ms).
233+ The host had background load during the sweep; Go's worker-startup is
234+ most sensitive to that. Treat those numbers as upper bounds.
139235- ** Rust Tokio current_thread** is single-threaded by construction, so the
140236 multi-worker patterns penalize it heavily.
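
One way to decide which rows deserve a "host-noisy" label is to compare each run's standard deviation with its mean. The sketch below applies that check to the three Go results from the i7-9700K tables; the 10% cutoff and the names `relative_stddev` and `NOISY_THRESHOLD` are illustrative assumptions, not something the suite defines.

```python
NOISY_THRESHOLD = 0.10  # assumed cutoff: stddev above 10% of the mean


def relative_stddev(mean_ms: float, stddev_ms: float) -> float:
    """Standard deviation as a fraction of the mean wall time."""
    return stddev_ms / mean_ms


# (mean ms, stddev ms) for Go on the i7-9700K, copied from the tables.
go_runs = {
    "pingpong": (749.6, 305.0),
    "pipeline": (1317.0, 511.0),
    "fanout": (744.1, 343.0),
}

if __name__ == "__main__":
    for name, (mean_ms, stddev_ms) in go_runs.items():
        rel = relative_stddev(mean_ms, stddev_ms)
        label = "host-noisy" if rel > NOISY_THRESHOLD else "ok"
        print(f"{name:9s} {rel:6.1%}  {label}")
```

All three runs come out far above the assumed cutoff, which is consistent with reading those means as upper bounds.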