Commit 6ac5294

docs: update README with detailed benchmark results and analysis
1 parent 9b626e9 commit 6ac5294

1 file changed

Lines changed: 166 additions & 70 deletions

bench/README.md

@@ -3,17 +3,6 @@
33
Compares fiber/goroutine channel performance across Go, Crystal, Rust (Tokio
44
`current_thread`), and Tin.
55

6-
## Benchmarks
7-
8-
| Name | Description |
9-
|--------------|------------------------------------------------------------------|
10-
| `bench` | Pingpong - 2 fibers, 1M round trips (SPSC latency baseline) |
11-
| `pipeline` | 4 relay fibers in series, 1M passes (multi-hop latency) |
12-
| `mpmc` | 4 producers + 4 consumers, 1M msgs (MPMC throughput) |
13-
| `jitter` | 8 workers, variable 0-3 yields/task, 1M tasks (scheduler stress) |
14-
| `pipeline10` | 10 relay fibers in series, 500K passes (deep pipeline latency) |
15-
| `fanout` | 1 producer fans out to 8 worker fibers, 1M items (dispatch) |
16-
176
## Running
187

198
```bash
@@ -24,14 +13,6 @@ Requires `go`, `crystal`, `cargo`, and `tin` (built at `../tin`).
2413
Uses `hyperfine` (2 warmup runs + statistical multi-run) if available.
2514
Install via `yay -S hyperfine` or `cargo install hyperfine`.
2615

27-
## Hardware
28-
29-
| | |
30-
|-|-|
31-
| **CPU** | Intel Core i7-9700K @ 3.60GHz (8 cores, no HT) |
32-
| **RAM** | 32 GB DDR4 |
33-
| **OS** | Arch Linux, kernel 6.19.11-arch1-1 |
34-
3516
## Compiler versions
3617

3718
| Language | Version |
@@ -45,96 +26,211 @@ Install via `yay -S hyperfine` or `cargo install hyperfine`.
4526

4627
Hyperfine wall-clock means (2 warmup + statistical multi-run). Latency and
4728
throughput are derived from the wall-clock time / message count internal to
48-
each benchmark.
29+
each benchmark. The **bold** row in each table marks Tin, whether or not it is the fastest runtime on that host.
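
As a worked illustration of that derivation, the sketch below recomputes a few table entries from their wall times. The helper names `latency_ns` and `throughput_per_s` are hypothetical (they do not come from the benchmark scripts), and the sketch assumes the derived columns are plain quotients of the hyperfine mean and the benchmark's operation count.

```python
def latency_ns(wall_ms: float, count: int) -> float:
    """Mean wall time per operation in nanoseconds (hypothetical helper)."""
    return wall_ms * 1e6 / count


def throughput_per_s(wall_ms: float, count: int) -> float:
    """Operations completed per second of wall time (hypothetical helper)."""
    return count / (wall_ms / 1e3)


if __name__ == "__main__":
    # M4 Pro pingpong, Tin: 38.6 ms for 1M round trips
    print(f"{latency_ns(38.6, 1_000_000):.1f} ns per round trip")
    # M4 Pro pipeline10, Tin: 95.8 ms for 500K passes
    print(f"{latency_ns(95.8, 500_000):.1f} ns per pass")
    # M4 Pro MPMC, Crystal: 9.2 ms for 1M messages
    print(f"{throughput_per_s(9.2, 1_000_000) / 1e6:.1f}M msgs/s")
```

Because hyperfine times the whole command, figures derived this way also include process startup and teardown, which matters most for the shortest runs.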
30+
31+
<details open>
32+
<summary><b>Benchmarks</b> (what each pattern measures)</summary>
33+
34+
| Name | Description |
35+
|--------------|------------------------------------------------------------------|
36+
| `bench` | Pingpong - 2 fibers, 1M round trips (SPSC latency baseline) |
37+
| `pipeline` | 4 relay fibers in series, 1M passes (multi-hop latency) |
38+
| `mpmc` | 4 producers + 4 consumers, 1M msgs (MPMC throughput) |
39+
| `jitter` | 8 workers, variable 0-3 yields/task, 1M tasks (scheduler stress) |
40+
| `pipeline10` | 10 relay fibers in series, 500K passes (deep pipeline latency) |
41+
| `fanout` | 1 producer fans out to 8 worker fibers, 1M items (dispatch) |
42+
43+
**Pingpong** is the cheapest possible channel exercise: two fibers bounce a
44+
message across one unbuffered channel for 1M round trips. It isolates SPSC
45+
latency and context-switch cost.
46+
47+
**Pipeline** chains 4 (or 10) fibers in series so each message hops through
48+
the whole chain before the next one enters. Stresses many cheap wakeups
49+
per message; the deep variant amplifies any per-hop overhead.
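
The shape of the pipeline pattern can be sketched as follows. This is not the benchmark source: it is a minimal illustration that assumes `asyncio` tasks stand in for fibers and a `maxsize=1` `asyncio.Queue` stands in for a channel, and the names `relay` and `run_pipeline` are hypothetical.

```python
import asyncio


async def relay(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Forward every item downstream; stop after forwarding the None sentinel."""
    while True:
        item = await inbox.get()
        await outbox.put(item)
        if item is None:
            return


async def run_pipeline(stages: int, passes: int) -> int:
    # `stages` relays need `stages + 1` queues (assumed stand-ins for channels).
    queues = [asyncio.Queue(maxsize=1) for _ in range(stages + 1)]
    relays = [
        asyncio.create_task(relay(queues[i], queues[i + 1]))
        for i in range(stages)
    ]
    completed = 0
    for n in range(passes):
        await queues[0].put(n)        # inject one message at the head
        out = await queues[-1].get()  # wait until it has crossed every stage
        if out != n:
            raise RuntimeError("message reordered or lost")
        completed += 1
    await queues[0].put(None)         # sentinel shuts the relays down in order
    await queues[-1].get()
    await asyncio.gather(*relays)
    return completed


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(stages=4, passes=1000)))
```

Each pass wakes every relay once, so the per-pass cost grows with the number of stages; that is why the 10-stage variant magnifies per-hop overhead.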
50+
51+
**MPMC** is the contention test: four producers all racing into one shared
52+
buffered channel while four consumers race to drain. Single-runqueue
53+
schedulers shine here because they can dispatch from one ready queue
54+
without a per-thread lock.
55+
56+
**Jitter** is the irregular-yield stress: each task yields 0-3 times
57+
before it finishes, so wakeups are bursty. Schedulers that can keep every
58+
worker hot win.
59+
60+
**Fanout** is the dispatch test: one producer hands work to 8 worker fibers
61+
through a single channel. Throughput is gated by how fast the runtime can
62+
hand off ready tasks to idle workers.
63+
64+
The two host sections below are independent runs of the same suite; absolute
65+
numbers vary by CPU and OS, so the ordering within a host is the relevant
66+
signal.
67+
68+
| Host | Machine | CPU | OS |
69+
|------|---------|-----|----|
70+
| **M4 Pro** | Apple MacBook Pro 16 (2024) | Apple M4 Pro (10P + 4E) | macOS 26.4.1 (arm64) |
71+
| **i7-9700K** | Custom desktop | Intel i7-9700K @ 4.9 GHz (8C / 8T) | Arch Linux, kernel 7.0.5-arch1-1 |
72+
73+
</details>
74+
75+
<details>
76+
<summary><b>M4 Pro</b> (MacBook Pro 16, macOS 26.4.1)</summary>
77+
78+
### Pingpong - 1M round trips (lower is better)
79+
80+
| Language | Wall time | Latency / round trip |
81+
|----------|----------:|---------------------:|
82+
| **Tin** | **38.6 ms** | **~39 ns** |
83+
| Crystal | 69.2 ms | ~69 ns |
84+
| Rust | 120.3 ms | ~120 ns |
85+
| Go | 216.4 ms | ~216 ns |
86+
87+
### Pipeline - 1M passes, 4 stages (lower is better)
88+
89+
| Language | Wall time | Latency / pass |
90+
|----------|----------:|---------------:|
91+
| **Tin** | **97.9 ms** | **~98 ns** |
92+
| Crystal | 171.0 ms | ~171 ns |
93+
| Rust | 249.6 ms | ~250 ns |
94+
| Go | 538.1 ms | ~538 ns |
95+
96+
### MPMC - 1M messages, 4 producers + 4 consumers (higher is better)
97+
98+
| Language | Wall time | Throughput |
99+
|----------|----------:|-----------:|
100+
| Crystal | 9.2 ms | ~108.7M msgs/s |
101+
| Go | 44.7 ms | ~22.4M msgs/s |
102+
| **Tin** | **46.6 ms** | **~21.5M msgs/s** |
103+
| Rust | 102.7 ms | ~9.7M msgs/s |
104+
105+
### Jitter - 1M tasks, 8 workers, 0-3 yields (higher is better)
106+
107+
| Language | Wall time | Throughput |
108+
|----------|----------:|-----------:|
109+
| **Tin** | **32.5 ms** | **~30.8M tasks/s** |
110+
| Rust | 111.8 ms | ~8.94M tasks/s |
111+
| Crystal | 253.1 ms | ~3.95M tasks/s |
112+
| Go | 393.8 ms | ~2.54M tasks/s |
113+
114+
### Pipeline10 - 500K passes, 10 stages (lower is better)
115+
116+
| Language | Wall time | Latency / pass |
117+
|----------|----------:|---------------:|
118+
| **Tin** | **95.8 ms** | **~192 ns** |
119+
| Crystal | 257.8 ms | ~516 ns |
120+
| Rust | 262.6 ms | ~525 ns |
121+
| Go | 586.2 ms | ~1172 ns |
122+
123+
### Fanout - 1M items, 1 producer + 8 workers (higher is better)
124+
125+
| Language | Wall time | Throughput |
126+
|----------|----------:|-----------:|
127+
| **Tin** | **49.1 ms** | **~20.4M items/s** |
128+
| Crystal | 71.6 ms | ~14.0M items/s |
129+
| Rust | 164.9 ms | ~6.06M items/s |
130+
| Go | 211.1 ms | ~4.74M items/s |
131+
132+
</details>
133+
134+
<details>
135+
<summary><b>i7-9700K</b> (Arch Linux, kernel 7.0.5)</summary>
49136

50137
### Pingpong - 1M round trips (lower is better)
51138

52139
| Language | Wall time | Latency / round trip |
53140
|----------|----------:|---------------------:|
54-
| Crystal | 72.3 ms | ~72 ns |
55-
| **Tin** | **103.6 ms** | **~104 ns** |
56-
| Rust | 297.3 ms | ~297 ns |
57-
| Go | 537.1 ms | ~537 ns |
141+
| Crystal | 72.8 ms | ~73 ns |
142+
| **Tin** | **103.7 ms** | **~104 ns** |
143+
| Rust | 297.8 ms | ~298 ns |
144+
| Go | 749.6 ms ± 305 ms | host-noisy |
58145

59146
### Pipeline - 1M passes, 4 stages (lower is better)
60147

61148
| Language | Wall time | Latency / pass |
62149
|----------|----------:|---------------:|
63-
| Crystal | 157.6 ms | ~158 ns |
64-
| **Tin** | **257.1 ms** | **~257 ns** |
65-
| Rust | 694.0 ms | ~694 ns |
66-
| Go | 1122 ms | ~1122 ns |
150+
| Crystal | 155.6 ms | ~156 ns |
151+
| **Tin** | **259.1 ms** | **~259 ns** |
152+
| Rust | 698.7 ms | ~699 ns |
153+
| Go | 1317 ms ± 511 ms | host-noisy |
67154

68155
### MPMC - 1M messages, 4 producers + 4 consumers (higher is better)
69156

70157
| Language | Wall time | Throughput |
71158
|----------|----------:|-----------:|
72-
| Crystal | 11.2 ms | ~89.3M msgs/s |
73-
| **Tin** | **34.7 ms** | **~28.8M msgs/s** |
74-
| Go | 56.5 ms | ~17.7M msgs/s |
75-
| Rust | 303.0 ms | ~3.30M msgs/s |
159+
| Crystal | 11.3 ms | ~88.5M msgs/s |
160+
| **Tin** | **36.1 ms** | **~27.7M msgs/s** |
161+
| Go | 56.2 ms | ~17.8M msgs/s |
162+
| Rust | 304.4 ms | ~3.29M msgs/s |
76163

77164
### Jitter - 1M tasks, 8 workers, 0-3 yields (higher is better)
78165

79166
| Language | Wall time | Throughput |
80167
|----------|----------:|-----------:|
81-
| **Tin** | **67.3 ms** | **~14.86M tasks/s** |
82-
| Rust | 284.3 ms | ~3.52M tasks/s |
83-
| Go | 407.5 ms | ~2.45M tasks/s |
84-
| Crystal | 850.5 ms | ~1.18M tasks/s |
168+
| **Tin** | **67.6 ms** | **~14.8M tasks/s** |
169+
| Rust | 289.0 ms | ~3.46M tasks/s |
170+
| Go | 409.4 ms | ~2.44M tasks/s |
171+
| Crystal | 489.6 ms | ~2.04M tasks/s |
85172

86173
### Pipeline10 - 500K passes, 10 stages (lower is better)
87174

88175
| Language | Wall time | Latency / pass |
89176
|----------|----------:|---------------:|
90-
| Crystal | 180.0 ms | ~360 ns |
91-
| **Tin** | **282.2 ms** | **~564 ns** |
92-
| Rust | 752.8 ms | ~1506 ns |
93-
| Go | 757.6 ms | ~1515 ns |
177+
| Crystal | 180.8 ms | ~362 ns |
178+
| **Tin** | **287.2 ms** | **~574 ns** |
179+
| Rust | 748.8 ms | ~1498 ns |
180+
| Go | 761.7 ms | ~1523 ns |
94181

95182
### Fanout - 1M items, 1 producer + 8 workers (higher is better)
96183

97184
| Language | Wall time | Throughput |
98185
|----------|----------:|-----------:|
99-
| Crystal | 71.1 ms | ~14.06M items/s |
100-
| **Tin** | **164.0 ms** | **~6.10M items/s** |
101-
| Rust | 432.7 ms | ~2.31M items/s |
102-
| Go | 651.2 ms | ~1.54M items/s |
186+
| Crystal | 70.6 ms | ~14.2M items/s |
187+
| **Tin** | **164.9 ms** | **~6.06M items/s** |
188+
| Rust | 428.8 ms | ~2.33M items/s |
189+
| Go | 744.1 ms ± 343 ms | host-noisy |
190+
191+
</details>
103192

104193
## Summary
105194

106-
Across the 6 benchmarks Crystal wins 5 (latency-bound channel patterns) and
107-
Tin wins 1 (jitter / irregular yield patterns). Tin places second in the
108-
remaining 5 and beats Go and Rust on every single benchmark.
195+
Tin leads 5 of the 6 benchmarks on M4 Pro and the jitter benchmark on the
196+
9700K. Crystal wins the other five on the older 8-core Linux box and takes
197+
MPMC on both hosts; on M4 Pro, MPMC is the only benchmark where Tin trails,
198+
finishing behind Crystal and just behind Go.
109199

110-
| Benchmark | Tin vs leader | Tin vs runner-up |
111-
|-------------|--------------:|-----------------:|
112-
| Pingpong | 1.43x slower than Crystal | 2.87x faster than Rust |
113-
| Pipeline-4 | 1.63x slower than Crystal | 2.70x faster than Rust |
114-
| MPMC | 3.10x slower than Crystal | 1.63x faster than Go |
115-
| Jitter | **leader** | 4.22x faster than Rust |
116-
| Pipeline-10 | 1.57x slower than Crystal | 2.67x faster than Rust |
117-
| Fanout | 2.31x slower than Crystal | 2.64x faster than Rust |
200+
| Benchmark | Tin on M4 Pro | Tin on i7-9700K |
201+
|-------------|---------------------------|---------------------------|
202+
| Pingpong | **leader** (1.80x ahead of Crystal) | 1.42x slower than Crystal |
203+
| Pipeline-4 | **leader** (1.75x ahead of Crystal) | 1.67x slower than Crystal |
204+
| MPMC | 5.05x slower than Crystal | 3.19x slower than Crystal |
205+
| Jitter | **leader** (3.44x ahead of Rust) | **leader** (4.28x ahead of Rust) |
206+
| Pipeline-10 | **leader** (2.69x ahead of Crystal) | 1.59x slower than Crystal |
207+
| Fanout | **leader** (1.46x ahead of Crystal) | 2.34x slower than Crystal |
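
The ratios in this table can be reproduced from the per-host wall times. The sketch below assumes each entry is a simple quotient of mean wall times; `speed_ratio` is a hypothetical helper, not part of the benchmark tooling.

```python
def speed_ratio(tin_ms: float, other_ms: float) -> str:
    """Describe Tin's wall time relative to another runtime's wall time."""
    if tin_ms <= other_ms:
        return f"{other_ms / tin_ms:.2f}x ahead"
    return f"{tin_ms / other_ms:.2f}x slower"


if __name__ == "__main__":
    # M4 Pro pipeline: Tin 97.9 ms vs Crystal 171.0 ms
    print("Pipeline-4, M4 Pro:", speed_ratio(97.9, 171.0))
    # i7-9700K MPMC: Tin 36.1 ms vs Crystal 11.3 ms
    print("MPMC, i7-9700K:", speed_ratio(36.1, 11.3))
    # i7-9700K jitter: Tin 67.6 ms vs Rust 289.0 ms
    print("Jitter, i7-9700K:", speed_ratio(67.6, 289.0))
```

Ratios recomputed from the rounded wall times shown above may differ from a table entry in the last digit.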
118208

119209
## Notes
120210

121211
- **Scheduler model.** Tin uses M:N scheduling: M fiber coroutines multiplexed
122212
onto N OS worker threads via a single shared run queue. Go uses M:N with
123213
per-P work-stealing queues; Crystal uses M:N green threads. Rust Tokio
124-
`current_thread` is single-threaded. Tin's shared-queue design avoids
125-
per-thread queue overhead at the cost of higher lock contention under
126-
MPMC/fanout workloads.
127-
- **Pingpong / pipeline.** Crystal's lead comes from lower green-thread
128-
context-switch cost vs OS threads. Tin's ~104 ns pingpong is ~3x faster than
129-
Go/Rust; the gap to Crystal is the cost Tin pays for real OS threads under
130-
the channel.
131-
- **Jitter.** Tin now leads this benchmark by a wide margin (~12.6x ahead of
132-
Crystal, ~4.2x ahead of Rust). The irregular yield pattern stresses
133-
scheduler-wake throughput; Tin's autoyield + worker-stealing keeps every
134-
worker hot, while Crystal's single-runqueue serializes wakeups.
135-
- **MPMC.** Tin's throughput now sits between Go and Crystal. Variance under
136-
contention is moderate; multiple runs cluster within ~25-40 ms.
137-
- **Fanout / pipeline10.** Both are dispatch-density benchmarks; Crystal's
138-
green-thread cheapness dominates. Tin holds a clean 2nd place.
214+
`current_thread` is single-threaded. Tin's shared-queue design pays for
215+
itself on dispatch-heavy patterns but struggles under MPMC contention.
216+
- **Pingpong / pipeline.** Tin wins both outright on M4 Pro; on the older
217+
8-core 9700K Crystal's green-thread cheapness still leads by ~1.4-1.7x.
218+
Across both hosts that is ~40-260 ns per round trip or pass -- Tin and Crystal sit
219+
in the same order of magnitude, well ahead of Go/Rust.
220+
- **MPMC.** Crystal's single-runqueue green threads dominate this one on
221+
both hosts (~89-109M msgs/s). Tin roughly ties Go on M4 Pro (~21.5M vs ~22.4M
222+
msgs/s) and leads it on the 9700K (~27.7M vs ~17.8M). Multi-worker channel contention is the regime where Tin's shared
223+
queue hurts most -- a known trade-off, not a regression.
224+
- **Jitter.** Tin leads on both hosts (3.4-4.3x ahead of Rust, the next runtime).
225+
The irregular yield pattern stresses scheduler-wake throughput; Tin's
226+
autoyield + multi-worker pool keeps every worker hot, while Crystal's
227+
single runqueue serializes wakeups.
228+
- **Pipeline10 / fanout.** Dispatch-density patterns. M4 Pro's higher core
229+
count + faster IPC tips them to Tin; on the 8-core 9700K Crystal's
230+
green-thread cheapness still wins these two.
231+
- **Go variance on the 9700K box.** Several Go runs (pingpong, pipeline,
232+
fanout) show standard deviations in the hundreds of ms (±305 to ±511 ms).
233+
The host had background load during the sweep; Go's worker-startup is
234+
most sensitive to that. Treat those numbers as upper bounds.
139235
- **Rust Tokio current_thread** is single-threaded by construction, so the
140236
multi-worker patterns penalize it heavily.
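
To put the Go variance note in relative terms, the sketch below divides each reported standard deviation by its mean. The 25% cutoff in `is_host_noisy` is an assumed threshold chosen for illustration; the README does not state how the "host-noisy" label was decided, and both helper names are hypothetical.

```python
def relative_spread(mean_ms: float, stddev_ms: float) -> float:
    """Standard deviation as a fraction of the mean (coefficient of variation)."""
    return stddev_ms / mean_ms


def is_host_noisy(mean_ms: float, stddev_ms: float, threshold: float = 0.25) -> bool:
    """Flag a run whose relative spread exceeds the assumed threshold."""
    return relative_spread(mean_ms, stddev_ms) > threshold


if __name__ == "__main__":
    # Go on the i7-9700K, mean and standard deviation as reported in the tables
    go_runs = {
        "pingpong": (749.6, 305.0),
        "pipeline": (1317.0, 511.0),
        "fanout": (744.1, 343.0),
    }
    for name, (mean_ms, stddev_ms) in go_runs.items():
        spread = relative_spread(mean_ms, stddev_ms)
        print(f"{name}: {spread:.0%} of mean, noisy={is_host_noisy(mean_ms, stddev_ms)}")
```

All three runs spread by roughly 40% of their mean or more, which is consistent with treating them as upper bounds rather than point estimates.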
