Commit 82d2359
[zephyr] Replace Parquet shuffle with zstd-chunk format (#4782)
## Summary
- Replaces Parquet-based scatter/reduce shuffle with flat zstd frames +
byte-range sidecar. Drops Arrow from the shuffle data plane.
- Adds memory-bounded external-sort fan-in and byte-budgeted pass-1
spill batch so skewed and large-item shuffles don't OOM the worker.
- Net **−165 lines** across `shuffle.py` / `external_sort.py` /
`plan.py` / `execution.py`.
## Format
Each scatter source writes a single binary file: a concatenation of zstd
frames. Each frame is one sorted chunk, written as repeated
`pickle.dump(sub_batch)` calls into a single zstd stream (sub-batch size
= 1024 items by default). A JSON `.scatter_meta` sidecar maps
`target_shard → [(offset, length)]`. Sidecars aggregate into one
`scatter_metadata` manifest per stage (wire format unchanged from
before).
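A minimal sketch of the writer side, assuming a `chunks_by_shard` mapping of pre-sorted chunks per target shard. All names here are illustrative, not the actual `shuffle.py` API, and gzip stands in for zstd to keep the sketch stdlib-only (gzip members concatenate in a file the same way zstd frames do):

```python
import gzip
import json
import pickle

def write_scatter_file(path, chunks_by_shard, sub_batch_size=1024):
    """Write one compressed frame per sorted chunk; return the sidecar mapping.

    chunks_by_shard: {target_shard: [sorted_chunk_items, ...]} (hypothetical shape).
    """
    meta = {}  # target_shard -> [(offset, length), ...]
    with open(path, "wb") as f:
        for shard, chunks in chunks_by_shard.items():
            spans = []
            for chunk in chunks:
                start = f.tell()
                # One frame per chunk: repeated pickle.dump() calls into a
                # single compression stream (zstd in the real format).
                with gzip.GzipFile(fileobj=f, mode="wb") as zf:
                    for i in range(0, len(chunk), sub_batch_size):
                        pickle.dump(chunk[i : i + sub_batch_size], zf)
                spans.append((start, f.tell() - start))
            meta[shard] = spans
    with open(path + ".scatter_meta", "w") as sidecar:
        json.dump(meta, sidecar)
    return meta
```

Because each frame is a self-delimiting compressed stream, a reader holding only the sidecar's `(offset, length)` pair can decompress a chunk without touching the rest of the file.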
On read, `ScatterFileIterator` fetches each chunk via one `cat_file`
range GET and streams sub-batches with `pickle.load`. Per-iterator
memory is bounded by `sub_batch_size * avg_item_bytes +
chunk_compressed_bytes` — independent of chunk row count.
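The read path can be sketched the same way, with hypothetical names: `cat_file` is modeled as a callable performing one range read, and gzip again stands in for zstd. The point of the sketch is the memory bound, since decoding streams sub-batch by sub-batch rather than materializing the chunk:

```python
import gzip
import io
import pickle

def iter_chunk_sub_batches(cat_file, path, offset, length):
    """Yield sub-batches from one chunk via a single range read."""
    # One range GET for the whole compressed chunk; decoding then streams
    # sub-batch by sub-batch, so resident memory stays around
    # chunk_compressed_bytes + one decoded sub-batch.
    raw = cat_file(path, offset, offset + length)
    with gzip.GzipFile(fileobj=io.BytesIO(raw), mode="rb") as zf:
        while True:
            try:
                yield pickle.load(zf)
            except EOFError:  # clean end of frame
                return
```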
Gone:
- Segment-rotation for schema evolution (`_ensure_writer`, `seg_idx`,
`pa.unify_schemas`)
- Arrow-vs-pickle envelope peek (`use_pickle_envelope`,
`pa.RecordBatch.from_pylist`)
- Row-group statistics + predicate pushdown (`equality_predicates`,
`iter_parquet_row_groups`)
- PyArrow dataset memory-leak workaround (`_get_scatter_read_fs`,
block-size budgeting)
- `pyarrow` on the shuffle data plane
## External-sort scaling
Two knobs now scale with the workload instead of being hardcoded:
- `compute_fan_in(per_iter_bytes, mem_limit)` — pass-1 fan-in floored at
4, capped at `EXTERNAL_SORT_FAN_IN=500`, otherwise sized to fit 50% of
worker memory given `max_chunk_rows * avg_item_bytes` per open chunk.
- `compute_write_batch_size(avg_item_bytes)` — pass-1 `pending` buffer
sized to ~64 MB of items (capped at 10k). The previous hardcoded
`_WRITE_BATCH_SIZE=10_000` could buffer 10 GB with 1 MB items and OOM
the worker.
`_merge_sorted_chunks` reads `shard.max_chunk_rows` and
`shard.avg_item_bytes` from the manifest and passes both values through
`external_sort_merge`.
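Both knobs reduce to simple arithmetic. A sketch under the constants stated above; the exact rounding and guard behavior here are assumptions, not the actual implementation:

```python
EXTERNAL_SORT_FAN_IN = 500      # hard fan-in cap
_FAN_IN_MEM_FRACTION = 0.5      # fit open iterators in 50% of worker memory
_WRITE_BATCH_BUDGET = 64 << 20  # ~64 MB pass-1 pending-buffer budget
_MAX_WRITE_BATCH = 10_000

def compute_fan_in(per_iter_bytes, mem_limit):
    """Pass-1 fan-in: floored at 4, capped at 500, otherwise sized so the
    open iterators (max_chunk_rows * avg_item_bytes each) fit in 50% of
    worker memory."""
    if per_iter_bytes <= 0:
        return EXTERNAL_SORT_FAN_IN
    fitted = int(mem_limit * _FAN_IN_MEM_FRACTION // per_iter_bytes)
    return max(4, min(EXTERNAL_SORT_FAN_IN, fitted))

def compute_write_batch_size(avg_item_bytes):
    """Pass-1 pending-buffer size: ~64 MB worth of items, capped at 10k."""
    if avg_item_bytes <= 0:
        return _MAX_WRITE_BATCH
    return max(1, min(_MAX_WRITE_BATCH, _WRITE_BATCH_BUDGET // avg_item_bytes))
```

With 1 MB items the write batch drops from 10k to 64, bounding the pass-1 buffer at the 64 MB budget instead of 10 GB.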
## Benchmarks on marin-dev (4 workers × 8 GB RAM)
| Workload | Baseline (Parquet) | New | Hot-worker peak mem |
|---|---|---|---|
| Uniform 10 GB, 250 B items | 736 s | **392 s** | 551 MB |
| Uniform 10 GB, 1 MB items | — | **352 s** | 621 MB |
| Skew 90% 10 GB, 250 B items | **OOM** | 800 s | 3.09 GB |
| Skew 90% 10 GB, 1 MB items | **OOM** | 1349 s | 7.18 GB |
| Skew 90% 50 GB, 250 B items | **OOM** | 7796 s* | 3.58 GB |
| Skew 90% 50 GB, 1 MB items | **OOM** | 2996 s | 7.33 GB |
*Includes ~35 min lost to a mid-run coordinator preemption + automatic
pipeline retry.
Uniform throughput is 1.88× faster than Parquet at 10 GB with small
items. Every skewed case that OOMed on the baseline now completes, with
peak memory bounded below the worker limit.
## Tests
- `lib/zephyr/tests/test_shuffle.py` rewritten for the new API (13
tests).
- `lib/zephyr/tests/test_groupby.py` pickle-roundtrip test updated.
- `lib/zephyr/tests/benchmark_shuffle.py` new — synthetic 10-50 GB
shuffle with `--hot-shard-frac`/`--hot-key-pool`.
- `test_shuffle.py` (13), `test_groupby.py` (23), `test_execution.py`
(40) all pass locally.
## Test plan
- [x] Unit tests pass locally
- [x] Uniform shuffle (10 GB, small + large items) on marin-dev
- [x] Skewed shuffle (10 GB + 50 GB, small + large items) on marin-dev
- [ ] Datakit ferry on marin — deferred (not necessary given direct
shuffle benchmarks)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Rafal Wojdyla <ravwojdyla@gmail.com>
8 files changed: +520 −642 lines