[zephyr] Use msgspec msgpack for scatter chunks, fall back to pickle#5091
hsuhanooi wants to merge 2 commits into marin-community:main
Conversation
Replace cloudpickle sub-batches with msgspec msgpack for scatter chunk serialization. msgpack is 2-5x faster on write and ~1.5x faster on read for plain dicts. A one-byte format tag prefixes each frame so readers can dispatch to msgpack or pickle. Items with frozenset/set values are detected before encoding and routed to the pickle path to avoid silent data loss from msgspec's frozenset-to-list coercion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8d462891b4
The previous implementation called _has_set_types only on items[0], which silently corrupted heterogeneous chunks where later items contained frozenset or set values (msgspec coerces them to list without raising). Scan every item now. Also removes the list[:5] slice on nested lists so frozensets inside longer nested containers are not missed. Adds a regression test for the heterogeneous-chunk case. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🤖 Micro-benchmark: msgpack vs cloudpickle serde (isolated, no I/O) Setup: 5 runs each, N items per chunk, zstd level 3.
The string-heavy dataset is the only case where msgpack produces larger output (+15%), because cloudpickle can intern repeated short strings. For numeric and mixed data msgpack is equal or smaller.
🤖 End-to-end benchmark: benchmark_shuffle.py (8 input shards x 50K items x 200 bytes, 8 output shards, 3 warm repeats, local executor)
Warmed throughput: 31.8K vs 29.0K items/s → +9.6% end-to-end. The serde gain is larger in isolation (2-5x) but scatter/reduce is only one slice of total pipeline time alongside item generation, file I/O, coordinator scheduling, and k-way merge.
🤖 The local benchmark shows +10% end-to-end but the benchmark data is 97% string bytes by volume (3 int routing keys + 168-char random payload per item), which is the worst case for msgpack: the micro-benchmark showed +15% compressed size for string-heavy chunks and only 1.2x read speedup. On GCS that size penalty is not free. For a string-heavy chunk of 50K items x 200 bytes (~10 MB uncompressed), pickle compresses to roughly 5 MB and msgpack to ~5.75 MB. At 100-200 MB/s GCS throughput that is 4-7 ms extra per chunk on both the scatter write and reduce read sides. At 1,000 mapper chunks in a large text job that is 8-14 seconds of extra network time, which exceeds the entire CPU-side serde gain measured locally. Summary by workload type:
Marin's actual shuffle workloads are almost entirely in the third row. Recommending we keep the format-tag infrastructure (the \x00/\x01 frame prefix and dispatch in _iter_chunk) for future use, but not default to msgpack until there is a numeric-heavy pipeline to target. Closing in favor of merging the OOM-proof scatter work first.
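The per-chunk and per-job network overhead above can be sanity-checked with a quick back-of-envelope calculation; after rounding it reproduces the quoted 4-7 ms per chunk side and roughly 8-15 s per 1,000-chunk job:

```python
# Back-of-envelope check of the GCS overhead estimate: +15% compressed size
# on ~5 MB pickle chunks, paid on both the scatter write and reduce read.
extra_mb = 5.75 - 5.0    # msgpack vs pickle compressed size per string-heavy chunk
chunks, sides = 1_000, 2  # mapper chunks in a large text job; write + read sides

for gcs_mb_per_s in (100, 200):
    per_side_ms = extra_mb / gcs_mb_per_s * 1_000
    job_s = per_side_ms * chunks * sides / 1_000
    print(f"{gcs_mb_per_s} MB/s: {per_side_ms:.2f} ms extra per chunk side, "
          f"{job_s:.1f} s extra per {chunks}-chunk job")
```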
Yeah, not sure whether this is worthwhile; it depends on the input.
Nice experiment! Thanks @hsuhanooi !
Agreed. Would prefer to not introduce extra/non-default knobs at this time to avoid complexity 🙇
🤖 Closing — the format-tag infrastructure (\x00/\x01 frame prefix and dispatch in _iter_chunk) is a useful foundation, but defaulting to msgpack is not net-positive for Marin's string-heavy workloads on GCS. See benchmark comments above.