
[zephyr] External sort spill: Parquet instead of pickle+zstd#4695

Merged
rjpower merged 3 commits into main from rjpower/20260413-parquet-scatter
Apr 15, 2026
Conversation

@rjpower
Collaborator

@rjpower rjpower commented Apr 13, 2026

Summary

  • Replaces .pkl.zst spill files in external_sort_merge with Parquet files written via a new SpillWriter (byte-budgeted pq.ParquetWriter with a background I/O thread).
  • Items are pickled into a single _zephyr_payload binary column. Python heapq.merge semantics on both passes are unchanged, so behavior is identical — this is a format swap only.
  • Pass 2 reads spills back with pq.ParquetFile.iter_batches and unpickles one row group at a time to feed the heap merge.
  • Pass-2 read-batch-size estimation now reads row-group metadata directly from the parquet file instead of probing a pickled sample.

This is the minimal slice cherry-picked from #4178 that removes raw pickle files from zephyr's shuffle data plane. The scatter envelope and reduce merge are untouched; follow-ups will promote the sort key to a first-class column (Tier 2) and move reduce to columnar Arrow merge (Tier 3).
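Since the PR emphasizes that `heapq.merge` semantics are unchanged on both passes, the merge step it preserves looks essentially like this generic sketch (not the PR's actual code):

```python
import heapq


def merge_sorted_runs(runs, key=None):
    """Lazily merge already-sorted spill runs. heapq.merge keeps only one
    head item per run in memory, so memory is bounded by the number of
    runs rather than by the total data size."""
    return heapq.merge(*runs, key=key)
```

For example, `list(merge_sorted_runs([[1, 4, 7], [2, 5], [3, 6, 8]]))` yields `[1, 2, 3, 4, 5, 6, 7, 8]`. Swapping the on-disk run format from pickle+zstd to Parquet changes only how each `runs` iterator is produced, not this merge.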

SpillWriter is added as lib/zephyr/src/zephyr/spill_writer.py verbatim from #4178. external_sort.py is its only caller in this PR.

@rjpower rjpower added the agent-generated Created by automation/agent label Apr 13, 2026
@claude
Contributor

claude bot commented Apr 13, 2026

Claude finished @rjpower's task in 3m 14s


PR Review: [zephyr] External sort spill: Parquet instead of pickle+zstd

  • Read changed files and diff
  • Review external_sort.py changes
  • Review new spill_writer.py
  • Post review findings

The format swap from pickle+zstd to Parquet is clean — heapq.merge semantics are preserved, the pass-1 write path is simplified with SpillWriter, and pass-2 read-batch estimation is improved by reading parquet metadata directly instead of probing a pickled sample. The makedirs fix for pq.ParquetWriter not creating parent directories is a good catch.

I left 4 inline comments. Here's the summary:

Issues

1. SpillWriter.close() deadlock when background thread has errored (inline)

If _background_writer_loop catches a write error and exits, close() tries to queue.put(remaining) then queue.put(_SENTINEL) into a maxsize=1 queue with no consumer. The second put blocks forever. Fix: check self._error_box before putting to the queue.

2. __exit__ flushes on error paths (inline)

When the with SpillWriter(...) block raises, __exit__ still flushes remaining accumulated data — wasted I/O for a partial run, and a second vector into the deadlock above. Consider short-circuiting when exc_type is not None.

Nits

3. Bare except Exception in _safe_read_batch_size (inline)

Silently swallows metadata read failures. A logger.warning would help debuggability. (Pre-existing pattern, not a regression.)

4. _background_writer_loop doesn't catch errors from queue.get() (inline)

Minor robustness improvement — wrap the entire loop body including get() in the try/except.

Not flagged (looks good)

  • Pass-2 _read_run using pf.iter_batches(batch_size=read_batch_size) is a clean replacement for the old progressive-delete pattern — parquet row-group streaming naturally bounds memory.
  • Cleanup in finally block now calls url_to_fs(path) per file instead of reusing a single fs — slightly more work but more correct.
  • _items_to_table serialization approach is straightforward; the _zephyr_payload binary column is a sensible envelope for the pickle-per-item strategy.
  • SpillWriter landing in isolation (no other imports) is a clean way to stage it for later use by scatter/reduce.

No unit tests for SpillWriter itself — it's exercised indirectly through shuffle/groupby/execution tests, but edge cases (empty writes, background thread error propagation, byte threshold boundaries) aren't covered. Worth adding, especially given the threading.
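The threading edge case flagged in issue 1 is easy to reproduce without SpillWriter itself: model the single-slot queue with a dead consumer directly. This is a sketch of the failure mode, not a test of the actual class:

```python
import queue


def test_sentinel_put_hangs_when_consumer_is_dead():
    """After the background thread dies, the first put() fills the
    maxsize=1 queue; a second blocking put() would then hang forever.
    Using a timeout turns the hang into a visible queue.Full."""
    q = queue.Queue(maxsize=1)
    q.put("remaining")  # close() enqueues the flushed data
    try:
        q.put("_SENTINEL", timeout=0.1)  # nobody is draining the queue
        hung = False
    except queue.Full:
        hung = True
    assert hung


test_sentinel_put_hangs_when_consumer_is_dead()
```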


Comment thread lib/zephyr/src/zephyr/spill_writer.py Outdated
Comment on lines +141 to +152
def close(self) -> None:
    """Flush remaining accumulated data and wait for the background thread to finish."""
    if self._closed:
        return
    self._closed = True
    remaining = self._accumulator.flush()
    if remaining is not None:
        self._queue.put(remaining)
    self._queue.put(_SENTINEL)
    self._thread.join()
    self._writer.close()
    self._check_error()

Potential deadlock when background thread exits with an error and there's remaining data.

If _background_writer_loop catches a BaseException from writer.write_table(item) and returns early, the thread dies. Then close() runs:

  1. self._accumulator.flush() returns remaining data
  2. self._queue.put(remaining) — succeeds (the queue was empty after the failed item was dequeued)
  3. self._queue.put(_SENTINEL) blocks forever: maxsize=1, the queue is full from step 2, and the thread is dead so nobody is draining it

The _check_error() call is at the end of close() and never reached.

Suggestion: check for errors before putting to the queue, or use put with a timeout, or check self._thread.is_alive() before attempting puts:

def close(self) -> None:
    if self._closed:
        return
    self._closed = True
    remaining = self._accumulator.flush()
    if remaining is not None and not self._error_box:
        self._queue.put(remaining)
    if not self._error_box:
        self._queue.put(_SENTINEL)
    self._thread.join()
    self._writer.close()
    self._check_error()

Comment thread lib/zephyr/src/zephyr/spill_writer.py Outdated
Comment on lines +157 to +158
def __exit__(self, exc_type, exc_val, exc_tb) -> None:
    self.close()

__exit__ flushes and writes on error paths too.

When the with SpillWriter(...) block raises (e.g., heapq.merge hits a bad key), __exit__ still calls close(), which flushes remaining accumulated data and writes it — wasted I/O for a partial run that will never be read. More importantly, this interacts with the deadlock above: if the write path errored, close() could hang on the queue put.

Consider short-circuiting when entering from an exception:

Suggested change
def __exit__(self, exc_type, exc_val, exc_tb) -> None:
    self.close()
def __exit__(self, exc_type, exc_val, exc_tb) -> None:
    if exc_type is not None:
        # Error path: don't flush remaining data, just tear down the thread.
        self._closed = True
        self._queue.put(_SENTINEL)
        self._thread.join(timeout=5)
        self._writer.close()
    else:
        self.close()

Comment thread lib/zephyr/src/zephyr/external_sort.py
Comment thread lib/zephyr/src/zephyr/spill_writer.py Outdated
Comment on lines +67 to +81
def _background_writer_loop(
    write_queue: "queue.Queue[pa.Table | object]",
    writer: pq.ParquetWriter,
    error_box: list[BaseException],
) -> None:
    """Drain write_queue, writing each table as a row group. Stops on _SENTINEL."""
    while True:
        item = write_queue.get()
        if item is _SENTINEL:
            return
        try:
            writer.write_table(item)
        except BaseException as exc:
            error_box.append(exc)
            return

Minor: if write_queue.get() itself raises (e.g., the thread is interrupted), the exception propagates uncaught and the thread dies silently. The caller would then deadlock on close(). This is very unlikely in practice, but a catch-all around the entire loop body (including get()) would make it robust:

def _background_writer_loop(write_queue, writer, error_box):
    while True:
        try:
            item = write_queue.get()
            if item is _SENTINEL:
                return
            writer.write_table(item)
        except BaseException as exc:
            error_box.append(exc)
            return


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b1bf78812


Comment thread lib/zephyr/src/zephyr/spill_writer.py Outdated
Comment on lines +147 to +150
if remaining is not None:
self._queue.put(remaining)
self._queue.put(_SENTINEL)
self._thread.join()

P1: Prevent close from hanging after background write failure

close() unconditionally does blocking put() calls (remaining and then _SENTINEL) before checking whether the writer thread has already failed. If writer.write_table() raises in _background_writer_loop, the thread exits, and a subsequent put() can block forever on the single-slot queue because no consumer remains. In that failure mode (e.g., transient GCS/Parquet write error), external sort can hang instead of surfacing the exception.


@rjpower rjpower requested a review from ravwojdyla April 14, 2026 00:55
Contributor

@ravwojdyla ravwojdyla left a comment


Have you run this on a reasonable size shuffle? What is the perf delta?

Replaces the .pkl.zst spill files in external_sort_merge with Parquet
files written through a new SpillWriter (byte-budgeted ParquetWriter with
a background I/O thread). Items are cloudpickle-style serialized into a
single `_zephyr_payload` binary column; the Python heapq.merge semantics
on both passes are unchanged, so behavior is identical.

This is the minimal slice cherry-picked from #4178 (arrow-scatter-reduce)
that removes raw pickle files from zephyr's shuffle data plane without
touching the scatter envelope or reduce merge. Follow-ups will promote
the sort key to a first-class column and move reduce to columnar Arrow
merge.

Pass-2 read-batch-size estimation now reads row-group metadata directly
from the parquet file instead of probing a pickled sample.
@rjpower
Copy link
Copy Markdown
Collaborator Author

rjpower commented Apr 14, 2026

Have you run this on a reasonable size shuffle? What is the perf delta?

It's a minor improvement, though the focus was more on using the same file format everywhere:

Phase-1 results (~770MT, one shard, one run each)

Step-level wall-clock from step_runner / stage completion logs:

| Phase | main (pickle+zstd spill) | branch (parquet spill) | Δ |
| --- | --- | --- | --- |
| Dedup step total | 541.0s | 508.7s | −32.3s (−6.0%) |
| Stage 0 — Map→Scatter | 122.7s | 117.9s | −4.8s (no spill; noise) |
| Stage 1 — Reduce→Scatter (sort + spill + merge) | 309.4s | 281.2s | −28.3s (−9.1%) |
| Stage 2 — Reduce | 89.9s | 87.6s | −2.3s (no spill; noise) |

Contributor

@ravwojdyla ravwojdyla left a comment


As much as the Parquet shuffle was an improvement, I've become increasingly convinced it's not the right path forward [1]. Parquet adds aspects that are not easy to control (Arrow) and requirements that we don't care about (e.g. a consistent schema and row groups).

A simpler approach could be to use zstd chunks in a binary file with a shard -> range sidecar of metadata. Effectively a deconstructed Parquet, tailored to our use case. At least in my quick tests it's more efficient and less memory-intensive than Parquet, while the code is less complicated (no schema dance, no Parquet-reader shenanigans).

Footnotes

  1. unless maybe we go full Arrow/rust.

Comment thread lib/zephyr/src/zephyr/spill_writer.py Outdated
    return sum(len(t) for t in self._tables)


def _background_writer_loop(
Contributor


should we reuse/abstract the zephyr.writers.ThreadedBatchWriter?

@rjpower
Collaborator Author

rjpower commented Apr 15, 2026

Hrm, I'm a bit loath to create our own file format. I agree with you that Parquet adds some additional complexity, but as you note, blocked zstd + sidecar is kind of like Parquet.

I agree with you on the "full Parquet" - trying to infer a schema from arbitrary Python - but in this case our schema is a single column, so unlikely to be a fuss. How about we restructure this:

  • we hide the representation behind the SpillWriter/SpillReader
  • we can use the dumb parquet for now
  • if it turns out we get another 10% from using our own writer format, this has already given us a nice box to put it in?

The rest of the Zephyr code won't see any schemas, just use the reader/writer interfaces.

@ravwojdyla
Contributor

@rjpower box around shuffle format sounds great!

rjpower added 2 commits April 15, 2026 12:13
Rename spill_writer.py -> spill.py. Make SpillWriter/SpillReader an
opaque chunked row-format API: no pa.Schema, pa.Table, or parquet types
in the public surface. Callers write Python items and iterate back
items (or chunks of items); the Parquet single-payload-column backing
is an implementation detail.

external_sort.py now imports only zephyr.spill and drops the
_zephyr_payload / _SPILL_SCHEMA / _items_to_table plumbing. Pass-2
memory budgeting uses SpillReader.approx_item_bytes instead of reading
parquet row-group metadata directly. Run files are now *.spill.

SpillWriter is rewritten on top of zephyr.writers.ThreadedBatchWriter
(per ravwojdyla's PR review) instead of open-coding a background thread
+ single-slot queue + sentinel handshake. This drops ~50 lines of
threading plumbing and picks up ThreadedBatchWriter's polling submit
(fixes the P1 close-hang-after-write-error), error propagation on
submit/close, and error-path __exit__ that doesn't block.
@rjpower rjpower force-pushed the rjpower/20260413-parquet-scatter branch from 2b1bf78 to e18849b on April 15, 2026 19:28
@rjpower
Collaborator Author

rjpower commented Apr 15, 2026

Okay, rewrote so it's just the SpillWriter & SpillReader, and we can replace the internals with our own format at our leisure.

@rjpower rjpower merged commit 8f2b98a into main Apr 15, 2026
37 checks passed
@rjpower rjpower deleted the rjpower/20260413-parquet-scatter branch April 15, 2026 19:36
@rjpower
Collaborator Author

rjpower commented Apr 15, 2026

I thought about trying to use the same thing for the reducer spill & the scatter outputs, but decided it would be a bigger change than we wanted here. Probably better to do that at the "new file format" level instead, and just update the SpillWriter/Reader to do it.
