Commit 3c0d43b
zephyr: vortex^H^H^H^H^H^H parquet shuffle (#3482)
* Shuffle via Parquet, with Pickle fallback
* Closes: #3478
* Scatter operations now write intermediate shuffle data as Parquet
columnar files instead of per-chunk pickle blobs (previously we wrote
mapper × reducer × chunk files, which caused problems at scale). Each
mapper shard writes Parquet file(s) [^1] with `shard_idx` and
`chunk_idx` columns; reducers use predicate pushdown to filter down to
their own chunks and k-way merge the sorted chunks (as before)
* Add `ParquetDiskChunk` - references a slice of a shared Parquet file,
filtered to target shard and chunk
* Rename `DiskChunk` → `PickleDiskChunk` to clarify its role as fallback
* Fall back to pickle with a warning when an item is not Arrow-serializable
[^1]: a single mapper may produce more than one Parquet file iff the
schema of the chunks changes. That is possible e.g. when a null field
becomes a concrete type, i.e. evolves to an optional field.
1 parent 7afc138
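
The reducer side described in the commit message (filter one mapper file down to a target shard, then k-way merge the sorted chunks) can be illustrated with a minimal pure-Python sketch. This is not the zephyr implementation: plain lists of dicts stand in for Parquet files, an in-memory `if` stands in for predicate pushdown, and `mapper_rows`, `reduce_shard`, and the `key` column are hypothetical names chosen for illustration. Only the `shard_idx` and `chunk_idx` column names come from the commit message.

```python
import heapq
from collections import defaultdict


def mapper_rows(chunks_by_shard):
    """Flatten one mapper's output into rows tagged with shard_idx and chunk_idx.

    chunks_by_shard maps a target shard to a list of already-sorted chunks
    (a hypothetical layout; a list of dicts stands in for one Parquet file).
    """
    rows = []
    for shard_idx, chunks in chunks_by_shard.items():
        for chunk_idx, chunk in enumerate(chunks):
            for key in chunk:
                rows.append({"shard_idx": shard_idx, "chunk_idx": chunk_idx, "key": key})
    return rows


def reduce_shard(mapper_files, shard_idx):
    """Collect the sorted chunks destined for shard_idx and k-way merge them."""
    sorted_chunks = []
    for rows in mapper_files:
        by_chunk = defaultdict(list)
        for row in rows:
            # Stand-in for predicate pushdown: keep only this reducer's rows.
            if row["shard_idx"] == shard_idx:
                by_chunk[row["chunk_idx"]].append(row["key"])
        sorted_chunks.extend(by_chunk[i] for i in sorted(by_chunk))
    # heapq.merge lazily merges inputs that are each already sorted.
    return list(heapq.merge(*sorted_chunks))


# Two mappers, each writing a single shared "file" for all target shards.
mapper_a = mapper_rows({0: [[1, 4, 9], [2, 3]], 1: [[5, 8]]})
mapper_b = mapper_rows({0: [[0, 7]], 1: [[6], [10, 11]]})

merged_shard_0 = reduce_shard([mapper_a, mapper_b], shard_idx=0)
merged_shard_1 = reduce_shard([mapper_a, mapper_b], shard_idx=1)
```

In this sketch each mapper produces one shared file regardless of how many reducers and chunks there are, which is the file-count reduction the commit message describes; the merge step only relies on each chunk being sorted on its own.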
6 files changed (+392 -102 lines changed)
File tree
- lib/zephyr
- src/zephyr
- tests