perf(join): vectorize high-fanout inner join build takes#7143
perf(join): vectorize high-fanout inner join build takes#7143RitwijParmar wants to merge 2 commits into
Conversation
Greptile SummaryThis PR optimises the inner-join probe path for high-fanout workloads by first collecting all
Confidence Score: 4/5Safe to merge; the change is a pure performance optimisation on the hash-join probe path with a correct growable fallback, and the new Python test exercises the targeted scenario. The vectorised path produces correct output and the output-order guarantee holds because both the build-side concat and the probe-side index extraction walk src/daft-local-execution/src/join/inner_join.rs — specifically the threshold logic in Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[probe_inner: iterate input tables] --> B[For each input_table: collect all matches into Vec BuildMatch]
B --> C{should_use_vectorized_take?}
C -->|matches >= 1024 AND fanout >= 4x probe_rows AND avg run len >= 8| D[build_side_with_vectorized_take]
C -->|otherwise| E[build_side_with_growable]
D --> F[Group consecutive same-build-table matches into runs]
F --> G[For each run: table.take row_idxs]
G --> H[RecordBatch::concat taken_tables]
E --> I[GrowableRecordBatch::extend per match]
I --> J[GrowableRecordBatch::build]
H --> K[Extract probe indices from matches]
J --> K
K --> L[input_table.take probe_indices]
L --> M[Assemble final joined RecordBatch]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[probe_inner: iterate input tables] --> B[For each input_table: collect all matches into Vec BuildMatch]
B --> C{should_use_vectorized_take?}
C -->|matches >= 1024 AND fanout >= 4x probe_rows AND avg run len >= 8| D[build_side_with_vectorized_take]
C -->|otherwise| E[build_side_with_growable]
D --> F[Group consecutive same-build-table matches into runs]
F --> G[For each run: table.take row_idxs]
G --> H[RecordBatch::concat taken_tables]
E --> I[GrowableRecordBatch::extend per match]
I --> J[GrowableRecordBatch::build]
H --> K[Extract probe indices from matches]
J --> K
K --> L[input_table.take probe_indices]
L --> M[Assemble final joined RecordBatch]
Reviews (1): Last reviewed commit: "perf(join): vectorize high-fanout inner ..." | Re-trigger Greptile |
| return build_side_with_growable(build_side_tables, matches); | ||
| } | ||
|
|
||
| let mut taken_tables = Vec::with_capacity(count_build_table_runs(matches)); |
There was a problem hiding this comment.
Redundant O(n) traversal of
matches
count_build_table_runs is called here to size taken_tables, but the same traversal was already performed inside should_use_vectorized_take when deciding to take this path. For large match sets (≥1024 entries, which is when vectorized take is even attempted), this doubles the scan cost just to compute capacity. Passing the pre-computed run count through from the dispatch site would eliminate the redundancy without changing behaviour.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| fn should_use_vectorized_take(matches: &[BuildMatch], probe_rows: usize) -> bool { | ||
| if matches.len() < MIN_MATCHES_FOR_VECTORIZED_TAKE || probe_rows == 0 { | ||
| return false; | ||
| } | ||
|
|
||
| if matches.len() < probe_rows.saturating_mul(MIN_FANOUT_FOR_VECTORIZED_TAKE) { | ||
| return false; | ||
| } | ||
|
|
||
| let run_count = count_build_table_runs(matches); | ||
| matches.len() >= run_count.saturating_mul(MIN_AVG_RUN_LEN_FOR_VECTORIZED_TAKE) | ||
| } |
There was a problem hiding this comment.
Fanout threshold uses total probe rows, not matched probe rows
probe_rows is input_table.len(), the total number of probe-side rows in the current micro-batch, including rows that produced zero matches. In sparse-match scenarios (e.g. 1 000 probe rows, 10 of which each match 200 build rows), matches.len() is 2 000 but probe_rows * MIN_FANOUT_FOR_VECTORIZED_TAKE is 4 000, so the fanout guard short-circuits and the growable path is always taken — even though average run length easily clears MIN_AVG_RUN_LEN_FOR_VECTORIZED_TAKE. This won't produce wrong output, but the optimization won't fire in some high-fanout, low-selectivity workloads where it would help most.
Summary
takebatches instead of per-rowGrowableRecordBatch::extendFixes #7076
Tests
cargo fmt --checkcargo check -p daft-local-executiongit diff --check