[zephyr] Arrow-native scatter/reduce: 1.1x reduce speedup #4178
Conversation
Claude finished @rjpower's task in 1m 53s — Review: Arrow-native scatter/reduce

Replaces Python-centric scatter/reduce with Arrow columnar operations. Scatter accumulates micro-batches with a persisted _sort_key column. No specification found (no linked issue or design doc). Findings:
Tests cover sort-key propagation, Arrow vs Python equivalence, combiner path, external sort roundtrip, and cleanup. Test coverage looks adequate for the new paths.
🤖 Specification (>500 LOC PR)

Problem: Scatter/reduce serializes items as Python dicts, sorts via heapq.merge (O(N log K) Python comparisons), and external sort spills via pickle+zstd. The reduce path is the bottleneck: 13s for 500K items.

Approach:
Key code: The critical optimization is in _arrow_reduce_gen (plan.py:215-275). It reads chunk tables as Arrow, concatenates with promote_options="default" for schema evolution, sorts via pc.sort_indices, then groups by key using boundary detection on the sorted _sort_key column. Items are only materialized to Python when reducer_fn iterates.

Tests: 8 new tests in test_shuffle.py covering Arrow merge correctness vs the Python path, sort key column presence, external sort roundtrip/cleanup, zero-memory fallback, and combiner interaction. All 46 existing tests pass. New benchmark_scatter_reduce.py for isolated measurement.
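For readers skimming the spec, a minimal self-contained sketch of the columnar reduce it describes (not the actual plan.py code): `columnar_reduce`, `chunk_tables`, and `reducer_fn` are stand-in names, and the group-boundary detection is simplified to a Python loop over the change mask.

```python
import pyarrow as pa
import pyarrow.compute as pc

SORT_KEY_COL = "_sort_key"  # persisted sort-key column written by scatter

def columnar_reduce(chunk_tables, reducer_fn):
    """Sketch: concat chunk tables, sort columnar, then group by key boundaries."""
    # Schema promotion lets schema-evolved chunks concatenate.
    table = pa.concat_tables(chunk_tables, promote_options="default")
    if table.num_rows == 0:
        return
    # Columnar sort replaces heapq.merge over deserialized Python dicts.
    table = table.take(pc.sort_indices(table, sort_keys=[(SORT_KEY_COL, "ascending")]))

    keys = table.column(SORT_KEY_COL)
    # Boundary detection: positions where the sorted key value changes.
    changed = pc.not_equal(keys.slice(1), keys.slice(0, len(keys) - 1)).to_pylist()
    boundaries = [0] + [i + 1 for i, c in enumerate(changed) if c] + [table.num_rows]

    for start, end in zip(boundaries, boundaries[1:]):
        group = table.slice(start, end - start)
        # Rows are materialized to Python only when the reducer iterates them.
        yield from reducer_fn(keys[start].as_py(), iter(group.to_pylist()))
```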
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d6fa272b44
table = pq.read_table(path)
yield from table.to_pylist()
Stream Arrow run files during external merge
_read_run materializes each spill file with pq.read_table(...).to_pylist() before yielding any row. Because heapq.merge primes all iterators, this loads every run fully into Python memory at once, which can OOM on the large shards that trigger external sort in the first place. This defeats the memory-safety goal of the external path; read runs incrementally (e.g., scanner/batches) instead of whole-table materialization.
)

if use_external:
    sort_keys: list[tuple[str, str]] = [(_ZEPHYR_SHUFFLE_SORT_KEY_COL, "ascending")]
Preserve secondary sort key in external Arrow reduce
When external sort is used, reduce ordering is computed only by _sort_key. This drops sort_by semantics for equal keys, so reducers that depend on within-group order (for example next(items) / keep-first patterns) can produce different results from the in-memory Arrow path and the legacy Python merge path. The external Arrow sort/merge key needs to include _sort_secondary when present.
hrm i don't know why it says this, it adds the secondary sort key right below:
# Peek at the first chunk table to check for secondary sort column
first_tables = list(islice((t for it in shard.iterators for t in it.get_chunk_tables()), 1))
if first_tables and _ZEPHYR_SHUFFLE_SORT_SECONDARY_COL in first_tables[0].column_names:
    sort_keys.append((_ZEPHYR_SHUFFLE_SORT_SECONDARY_COL, "ascending"))
batch_tables = list(islice(chunk_tables_gen, EXTERNAL_SORT_FAN_IN))
if not batch_tables:
    break
combined = pa.concat_tables(batch_tables)
Enable schema promotion when concatenating Arrow spill inputs
Pass-1 external Arrow sort concatenates chunk tables without schema promotion. Scatter explicitly supports schema evolution across segments, and evolved item structs cause pa.concat_tables to raise ArrowInvalid here, so large reduces can fail only when they spill. Use promoted concatenation (as done in the non-external Arrow path) to keep schema-evolved inputs working.
list(items),
self.shard_idx,
self.chunk_idx,
list(keys),
Keep pickle fallback compatible with non-Arrow key types
The new buffer path always passes key_values into the envelope, including when pickled=True. That means group keys must now be Arrow-serializable, and hashable-but-non-Arrow keys (e.g., frozenset) fail during RecordBatch.from_pylist, even though group_by only requires hashable keys and previously supported this via the pickle envelope path. This is a correctness regression for those pipelines.
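One way to address this, if the maintainers go that route, would be to probe key serializability and only attach key_values for Arrow-compatible keys. This is a hypothetical guard, not code from the PR:

```python
import pyarrow as pa

def keys_fit_arrow(keys) -> bool:
    """Hypothetical guard: return True only if the group keys can be stored as an
    Arrow column. Hashable-but-non-Arrow keys (e.g. frozenset) return False and
    would keep taking the pickle-envelope path without key_values."""
    try:
        pa.array(list(keys))
        return True
    except (pa.ArrowInvalid, pa.ArrowTypeError, pa.ArrowNotImplementedError):
        return False

assert keys_fit_arrow(["a", "b"]) is True
assert keys_fit_arrow([frozenset({1}), frozenset({2})]) is False
```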
Force-pushed from 7e9b79f to 175a1f6.
So in practice, ~10% faster. Both runs completed successfully and the results are nearly identical. The arrow run (222210) was ~8 minutes faster, mainly from a faster p0 stage (49m55s vs 57m13s).
@rjpower ah would have definitely preferred the 10x 🤞
Makes sense.
Yep - btw, any other diff in the counters? Asking to spot-check them. There's a bunch of changes in this PR; I wonder if we could narrow it down to the part that gives us the 10%, or at least the majority of it? wdyt?
We can drop the rust changes, those have a pretty minimal effect. The rest is pretty related.

Where the wins come from:
- shuffle.py — Scatter write path (-13.5% CPU share, main wall-clock win). The new path accumulates Arrow micro-batches with a persisted _sort_key column.
- plan.py — Reduce path (-1.4% CPU, structural improvement).
- fuzzy.py — Bucket key fix (enables Arrow path). The bucket value is now emitted as signed int64.
- minhash_ops.rs — Minor Rust optimizations (~same MinHash %, slightly faster absolute).

Pipeline stage timing: both runs have 12 pipelines (p0 = MinHash LSH scatter, p1–p10 = CC iterations, p11 = final dedup tagging). CC did not converge in 10 iterations for either run.

The p0 speedup (~7 minutes) accounts for essentially all of the wall-clock improvement. p1–p11 are slightly slower (+3s avg) because the Arrow reduce path has higher per-pipeline overhead (concat + sort + take) that doesn't amortize on these small CC iteration passes.

@claude take care of this, remove the rust/** changes pls
Claude finished @rjpower's task in 35s — Removing rust/** changes
Reverted |
Looks like the counters are lost after the run finishes? We should probably write them at least to the coordinator log. Here's the CC trajectory: it looks identical for the first step and then is non-deterministic after that (according to Claude), so some variation is expected.
Yep - agreed. I would also like some integration with the artifact/executor, but step by step :)
ravwojdyla left a comment:
I think this makes sense; my only worry is that it adds a fairly complicated special branch (to already complicated code). I wouldn't be surprised if we could keep the previous logic and still squeeze out something on the order of 10% there. wdyt?
Either way, if we push forward on this, I think we need to confirm this works on larger scale. I'm happy to run that experiment if needed.
external_sort_dir=sort_dir,
)
)
import os
man I don't get it - why do LLMs like local import so much?
i don't know either, right, it's weird
serialization/deserialization in pass 1. Pass 2 still uses Python heapq
for simplicity.
"""
from zephyr.writers import ensure_parent_dir
for batch in pf.iter_batches():
    yield from batch.to_pylist()

run_iters = [_read_run(p) for p in run_paths]
nit: this logic is exactly repeated from above, consider extracting it.
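A sketch of the extraction the nit asks for; `_read_parquet_run` matches the helper name mentioned in a later commit, but the exact code here is illustrative only.

```python
import pyarrow.parquet as pq

def _read_parquet_run(path):
    """Stream one spill run file back as Python rows, one record batch at a time,
    so heapq.merge priming all run iterators never holds a whole run in memory."""
    pf = pq.ParquetFile(path)
    for batch in pf.iter_batches():
        yield from batch.to_pylist()

# Both merge passes could then build their run iterators identically:
# run_iters = [_read_parquet_run(p) for p in run_paths]
```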
seg_paths.append(seg_file)
ensure_parent_dir(seg_file)
- writer = pq.ParquetWriter(seg_file, schema)
+ writer = pq.ParquetWriter(seg_file, schema, compression="zstd", compression_level=1)
How much do we gain just by using faster compression?
this was added after the initial runs, so we'll see, i don't think it's a big gain over snappy though
if no gain - should we just use default?
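If someone wants to answer the snappy-vs-zstd question empirically, a rough micro-benchmark along these lines would do; the file path, row count, and row shape here are made up, not from the PR:

```python
import time
import pyarrow as pa
import pyarrow.parquet as pq

def time_write(table, codec, level=None):
    """Time a single Parquet write with the given codec (spill-write proxy)."""
    start = time.perf_counter()
    pq.write_table(table, f"/tmp/spill.{codec}.parquet",
                   compression=codec, compression_level=level)
    return time.perf_counter() - start

# ~150-byte rows, roughly matching the benchmark rows described in this PR.
table = pa.table({"_sort_key": list(range(500_000)),
                  "text": ["x" * 150] * 500_000})
print("snappy:", time_write(table, "snappy"))
print("zstd-1:", time_write(table, "zstd", 1))
```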
scanner = dataset.scanner(
    columns=columns,
    filter=(
        (pc.field(_ZEPHYR_SHUFFLE_SHARD_IDX_COL) == self.shard_idx)
        & (pc.field(_ZEPHYR_SHUFFLE_CHUNK_IDX_COL) == chunk_idx)
    ),
    batch_size=batch_size,
    use_threads=False,
)
batches = list(scanner.to_batches())
if batches:
    yield pa.Table.from_batches(batches)
we don't need a lazy scanner if we materialize all batches anyway?
we can remove it if desired; it's more an option in case we e.g. wanted to bound the memory for the external sort
I would vote to remove it, if it's nice-to-have and not on the critical path.
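For reference, the memory-bounded alternative alluded to above would look roughly like this: yield one small table per record batch instead of materializing the whole scan. The column names below are assumptions for illustration, not the PR's actual constants:

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

def iter_chunk_tables(path, shard_idx, chunk_idx, batch_size=8192):
    """Yield one Arrow table per record batch so the external-sort path can bound
    memory; envelope column names here are hypothetical."""
    dataset = ds.dataset(path, format="parquet")
    scanner = dataset.scanner(
        filter=(
            (pc.field("_zephyr_shard_idx") == shard_idx)
            & (pc.field("_zephyr_chunk_idx") == chunk_idx)
        ),
        batch_size=batch_size,
        use_threads=False,
    )
    for batch in scanner.to_batches():
        if batch.num_rows:
            yield pa.Table.from_batches([batch])
```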
file_entries.append(entry)

# has_sort_key is True only if ALL entries with data for this shard have it
has_sort_key = bool(file_entries) and all(entry.get("has_sort_key", False) for entry in file_entries)
Q: is this necessary for arrow sort to work?
_ZEPHYR_SHUFFLE_SORT_KEY_COL = "_sort_key"
_ZEPHYR_SHUFFLE_SORT_SECONDARY_COL = "_sort_secondary"
There's no need to use the _ prefix; in fact that's inconsistent with the other field names. These are envelope fields so we can call them whatever we like.
i'm fine either way -- but don't we want to keep the parquet flat if possible?
I assume there's maybe some overhead - but we would need to benchmark that. I would assume at this time this doesn't matter, and there is much lower-hanging fruit out there.
)

if use_external:
    sort_keys: list[tuple[str, str]] = [(_ZEPHYR_SHUFFLE_SORT_KEY_COL, "ascending")]
i do think this is much faster, just that fuzzy dedup isn't a particularly great stress test -- ~40% of the time is in n-gram computation, for example. i think it's the right direction -- though I think I can also simplify this a lot by getting rid of the vestigial code as well, let me take a pass at that.
+1, or feel free to send me some commands you use. i feel i should do it since I was the one to annoy us with this sidequest.
If we could simplify this PR that would be amazing!
Running on at least a single full split (e.g. quality=high) of nemotron would be a good stress test. Instead of doing fuzzy, we can do exact-paragraph to reduce the wall-time while still exercising a reasonable-scale shuffle. We can use the …
Force-pushed from e3d9b3c to b1c1029.
👍 yeah i can run that for exact |
Force-pushed from 4772d73 to 8b8b9bf.
Replace Python-object-centric scatter/reduce with Arrow columnar operations. Scatter now accumulates Arrow micro-batches with a persisted _sort_key column and sorts via pc.sort_indices. Reduce concatenates Arrow tables and sorts columnar instead of heapq.merge over deserialized Python dicts. External sort spills to Parquet instead of pickle+zstd. Benchmark: 500K items x 64 shards, reduce drops from 13.3s to 1.3s (10.4x). Backward compatible — old format files without _sort_key fall back to the Python merge path.
…torized boundaries

- Stream Arrow run files via iter_batches() instead of whole-table read, to avoid OOM on large shards that trigger external sort
- Add promote_options="default" to concat_tables in external sort for schema-evolved items
- Include _sort_secondary in the external sort keys and merge key so within-group order is preserved
- Vectorize _find_group_boundaries using a pc.not_equal diff instead of per-element .as_py() calls — matters for high-cardinality keys
- Store has_sort as a field on _ShardBuffer set at construction instead of fragile sorts[0] detection
- Add an assertion guard in _arrow_reduce_gen for pickled shards
- Rename scatter columns to _zephyr_* prefix (no compat aliases)
- Replace _make_envelope/_make_pickle_envelope with unified make_envelope_batch() and unwrap_items() helpers
- Replace PickleDiskChunk with ParquetDiskChunk for intermediates
- Remove pickle-based external sort; rename external_sort_merge_arrow to external_sort_merge, extract _read_parquet_run helper
- Remove _merge_sorted_chunks (Python heapq path); _arrow_reduce_gen handles both flat and pickled envelopes via unwrap_items
- Remove has_sort_key field (always true), ZEPHYR_META_COLUMNS (unused), _ITEM_BYTES_FALLBACK (unreachable)

Net: -292 lines, one serialization format, one reduce path.
When a shard is retried (e.g. due to heartbeat timeout), the old and new
attempts share the same external sort directory. The old attempt's finally
block can delete run files the new attempt is reading, causing
FileNotFoundError. Adding /attempt-{n} to the path isolates each attempt.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When fan_in is overestimated (due to first chunk having 0 bytes), the last batch may contain only empty tables. ParquetWriter.close() without any write_table() calls doesn't create a file on GCS, but the path is still appended to run_paths. The subsequent read_metadata then fails with FileNotFoundError. Fix: skip empty batches after concat, and probe past empty chunks when computing the sort budget. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… staleness

Writes go through pyarrow's native GcsFileSystem (via ParquetWriter), but reads used fsspec/gcsfs, which caches parent directory listings. When read_metadata(run-0000) cached the dir, run-0001 (written moments later) wasn't in the cache yet, causing FileNotFoundError. Fix: use pyarrow-native path resolution for all reads (no filesystem arg), matching the write path. Added a GCS integration test that reproduces the bug.
…ernal sort

SpillWriter: byte-budgeted ParquetWriter wrapper with a background write thread for GCS upload overlap. Uses zstd-1 compression (was snappy for spills). Two modes: write_table() accumulates and auto-flushes row groups; write_row_group() writes immediately (for scatter, where each chunk must be a separate row group).

TableAccumulator: byte-budgeted Arrow table batching, replacing the row-count-based _MERGE_OUTPUT_BATCH_SIZE in the k-way merge output.

Refactors:
- external_sort: _write_spill_file uses SpillWriter (was binary-search row-group sizing), merge output uses TableAccumulator
- shuffle: _write_parquet_scatter uses SpillWriter (was manual ParquetWriter + pending_chunk + _flush_pending)
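A simplified sketch of the byte-budgeted writer this commit describes, minus the background I/O thread; the class name, parameters, and budget default approximate the description rather than the shipped spill_writer.py.

```python
import pyarrow as pa
import pyarrow.parquet as pq

class SpillWriterSketch:
    """Byte-budgeted Parquet spill writer (simplified: no background write thread)."""

    def __init__(self, path, schema, byte_budget=2 * 1024 * 1024):
        self._writer = pq.ParquetWriter(path, schema,
                                        compression="zstd", compression_level=1)
        self._budget = byte_budget
        self._pending = []
        self._pending_bytes = 0

    def write_table(self, table: pa.Table) -> None:
        # Accumulate and auto-flush a row group once the byte budget is exceeded.
        self._pending.append(table)
        self._pending_bytes += table.nbytes
        if self._pending_bytes >= self._budget:
            self._flush()

    def write_row_group(self, table: pa.Table) -> None:
        # Scatter mode: each chunk lands in its own row group, written immediately.
        self._flush()
        self._writer.write_table(table)

    def _flush(self) -> None:
        if self._pending:
            self._writer.write_table(pa.concat_tables(self._pending))
            self._pending, self._pending_bytes = [], 0

    def close(self) -> None:
        self._flush()
        self._writer.close()
```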
Force-pushed from 8b8b9bf to 8572a27.
Replaces the .pkl.zst spill files in external_sort_merge with Parquet files written through a new SpillWriter (byte-budgeted ParquetWriter with a background I/O thread). Items are cloudpickle-style serialized into a single `_zephyr_payload` binary column; the Python heapq.merge semantics on both passes are unchanged, so behavior is identical. This is the minimal slice cherry-picked from #4178 (arrow-scatter-reduce) that removes raw pickle files from zephyr's shuffle data plane without touching the scatter envelope or reduce merge. Follow-ups will promote the sort key to a first-class column and move reduce to columnar Arrow merge. Pass-2 read-batch-size estimation now reads row-group metadata directly from the parquet file instead of probing a pickled sample.
## Summary

- Replaces `.pkl.zst` spill files in `external_sort_merge` with Parquet files written via a new `SpillWriter` (byte-budgeted `pq.ParquetWriter` with a background I/O thread).
- Items are pickled into a single `_zephyr_payload` binary column. Python `heapq.merge` semantics on both passes are unchanged, so behavior is identical — this is a format swap only.
- Pass 2 reads spills back with `pq.ParquetFile.iter_batches` and unpickles one row group at a time to feed the heap merge.
- Pass-2 read-batch-size estimation now reads row-group metadata directly from the parquet file instead of probing a pickled sample.

This is the minimal slice cherry-picked from #4178 that removes raw pickle files from zephyr's shuffle data plane. The scatter envelope and reduce merge are untouched; follow-ups will promote the sort key to a first-class column (Tier 2) and move reduce to columnar Arrow merge (Tier 3). `SpillWriter` is added as `lib/zephyr/src/zephyr/spill_writer.py` verbatim from #4178. `external_sort.py` is its only caller in this PR.
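A standalone sketch of the spill format this summary describes; plain pickle stands in for the cloudpickle-style serializer, and the function names are illustrative, not from the PR.

```python
import pickle
import pyarrow as pa
import pyarrow.parquet as pq

PAYLOAD_COL = "_zephyr_payload"  # single binary column holding pickled items

def write_spill(items, path):
    """Write items as pickled payloads in a Parquet file (replaces .pkl.zst runs)."""
    payloads = pa.array([pickle.dumps(item) for item in items], type=pa.binary())
    pq.write_table(pa.table({PAYLOAD_COL: payloads}), path,
                   compression="zstd", compression_level=1)

def read_spill(path):
    """Stream items back one record batch at a time for the heapq.merge passes."""
    pf = pq.ParquetFile(path)
    for batch in pf.iter_batches():
        for payload in batch.column(PAYLOAD_COL):
            yield pickle.loads(payload.as_py())
```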
This pull request has been inactive for 23 days and is marked as stale.
Replace Python-object-centric scatter/reduce with Arrow columnar operations.
Scatter accumulates Arrow micro-batches (64 items) with a persisted _sort_key
column, flushing sorted row groups at 2MB via pc.sort_indices. Reduce
concatenates chunk tables and sorts columnar instead of heapq.merge over
deserialized Python dicts. External sort spills to Parquet instead of
pickle+zstd.
Benchmark (500K items, 64 shards, ~150 byte rows simulating fuzzy dedup):
Scatter: 0.64s (786K items/s)
Reduce Arrow: 1.28s (391K items/s)
Reduce Python: 13.29s (38K items/s)
Reduce speedup: 10.4x
Backward compatible: old format files without _sort_key fall back to the
Python merge path.