partially implement FULL OUTER + LEFT and RIGHT hash joins #5096

Open
PThorpe92 wants to merge 33 commits into tursodatabase:main from PThorpe92:full_outer_hash_joins

@PThorpe92
Collaborator

@PThorpe92 PThorpe92 commented Feb 9, 2026

Description

Implements LEFT OUTER, RIGHT, and FULL OUTER hash joins. Previously, hash joins only supported inner join semantics; any outer join fell back to nested-loop. This enables hash join acceleration for the most common outer join patterns.

RIGHT JOIN is implemented as table swap/rewrite + LEFT JOIN semantics. FULL OUTER sets both outer and full_outer flags on JoinInfo.
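The RIGHT-to-LEFT rewrite can be illustrated with a minimal sketch. The `JoinKind` and `Join` types below are hypothetical stand-ins invented for this example, not the PR's actual planner types; the sketch only shows the operand swap and ignores details a real rewrite has to handle, such as preserving the original column order for `SELECT *`.

```rust
/// Hypothetical join kinds, for illustration only (not the PR's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
}

/// Hypothetical two-table join description.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Join {
    left: String,
    right: String,
    kind: JoinKind,
}

/// Rewrite `a RIGHT JOIN b` as `b LEFT JOIN a`.
/// Every other join kind is returned unchanged.
fn normalize(join: Join) -> Join {
    match join.kind {
        JoinKind::Right => Join {
            left: join.right,
            right: join.left,
            kind: JoinKind::Left,
        },
        _ => join,
    }
}
```

After this step, only LEFT and FULL semantics need dedicated handling downstream, which matches the three variants of HashJoinType listed under Optimizer.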

New opcodes:
HashMarkMatched, HashScanUnmatched, and HashNextUnmatched track which build-side entries matched during probing, then iterate over the unmatched entries so that outer joins can emit their NULL-padded rows.

track_matched flag on HashTableConfig enables matched_bits vectors (one bool per entry per bucket). Entries with NULL keys are now kept when track_matched is set so they appear in the unmatched scan. Matched bits survive partition eviction/reload for spilled tables.
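As a rough illustration of the matched-bits idea, here is a toy in-memory FULL OUTER hash join. Everything in it (`BuildTable`, `full_outer_join`, the flat `matched` vector) is a simplified assumption made for this sketch; the PR's real hash table keeps its bits per bucket and also handles spilling, which is not modeled here. Setting a flag during the probe loop plays the role of HashMarkMatched, and the final loop plays the role of HashScanUnmatched/HashNextUnmatched.

```rust
use std::collections::HashMap;

/// One row: an optional (nullable) join key plus a payload.
type Row = (Option<i64>, &'static str);

/// Toy build-side hash table with one "matched" flag per entry.
struct BuildTable {
    rows: Vec<Row>,
    buckets: HashMap<i64, Vec<usize>>, // key -> indices into `rows`
    matched: Vec<bool>,                // parallel to `rows`
}

impl BuildTable {
    fn new(rows: Vec<Row>) -> Self {
        let mut buckets: HashMap<i64, Vec<usize>> = HashMap::new();
        for (i, (key, _)) in rows.iter().enumerate() {
            // NULL keys can never match, so they are not bucketed,
            // but the row is kept so the unmatched scan still sees it.
            if let Some(k) = key {
                buckets.entry(*k).or_default().push(i);
            }
        }
        let matched = vec![false; rows.len()];
        BuildTable { rows, buckets, matched }
    }
}

/// Returns (build payload, probe payload) pairs; `None` stands for NULL padding.
fn full_outer_join(
    build: Vec<Row>,
    probe: Vec<Row>,
) -> Vec<(Option<&'static str>, Option<&'static str>)> {
    let mut table = BuildTable::new(build);
    let mut out = Vec::new();

    // Probe phase: emit matches and remember which build entries were hit.
    for (key, probe_val) in probe {
        let hits: Vec<usize> = key
            .and_then(|k| table.buckets.get(&k).cloned())
            .unwrap_or_default();
        if hits.is_empty() {
            out.push((None, Some(probe_val))); // unmatched probe row
        }
        for i in hits {
            table.matched[i] = true;
            out.push((Some(table.rows[i].1), Some(probe_val)));
        }
    }

    // Unmatched scan: emit build rows that no probe row ever matched.
    for (i, (_, build_val)) in table.rows.iter().enumerate() {
        if !table.matched[i] {
            out.push((Some(*build_val), None));
        }
    }
    out
}
```

Note that the unmatched build rows come out last and in table order rather than in any query-defined order, which is the same reason the optimizer has to give up sort elimination for outer hash joins.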

Optimizer:

  • HashJoinType enum (Inner, LeftOuter, FullOuter) propagated through access method selection.
  • FULL OUTER forces hash join regardless of cost (needed for unmatched build scanning).
  • Index-preference bypass skipped for FULL OUTER (can't fall back to nested-loop).
  • Sort elimination disabled when outer hash joins are present (unmatched scan produces hash-bucket order).
  • Build tables with join_info.outer are rejected to prevent incorrect matches when cursors are in NullRow mode.
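The rules above could be summarized in a small decision function. This is a sketch under stated assumptions, not the PR's optimizer code: only the `HashJoinType` enum and its three variants come from the description, while `AccessMethod`, `Candidate`, `choose`, and `can_eliminate_sort` are hypothetical names invented here, and returning `None` for an unsupported plan is likewise an assumption about how rejection might be modeled.

```rust
/// Join types propagated through access method selection (from the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashJoinType {
    Inner,
    LeftOuter,
    FullOuter,
}

/// Hypothetical result of access method selection.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AccessMethod {
    NestedLoop,
    HashJoin(HashJoinType),
}

/// Hypothetical inputs to the decision; field names are assumptions.
struct Candidate {
    join_type: HashJoinType,
    hash_cost: f64,
    nested_loop_cost: f64,
    /// True when the would-be build table is itself the outer side of an
    /// earlier join (the `join_info.outer` case).
    build_is_outer: bool,
}

/// Returns `None` when no supported plan exists.
fn choose(c: &Candidate) -> Option<AccessMethod> {
    if c.build_is_outer {
        // Hash join is rejected for such build tables. FULL OUTER has no
        // nested-loop fallback, so the join is unsupported in that case.
        return match c.join_type {
            HashJoinType::FullOuter => None,
            _ => Some(AccessMethod::NestedLoop),
        };
    }
    match c.join_type {
        // FULL OUTER forces hash join regardless of cost.
        HashJoinType::FullOuter => Some(AccessMethod::HashJoin(HashJoinType::FullOuter)),
        // Other join types pick the cheaper option.
        t if c.hash_cost < c.nested_loop_cost => Some(AccessMethod::HashJoin(t)),
        _ => Some(AccessMethod::NestedLoop),
    }
}

/// Sort elimination is only safe when no outer hash join is in the plan,
/// because the unmatched scan emits rows in hash-bucket order.
fn can_eliminate_sort(plan: &[AccessMethod]) -> bool {
    !plan.iter().any(|m| {
        matches!(
            m,
            AccessMethod::HashJoin(HashJoinType::LeftOuter | HashJoinType::FullOuter)
        )
    })
}
```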

Why we need subroutines for chained outer hash joins:

When a FULL OUTER (or LEFT OUTER) hash join is followed by additional inner table loops in a multi-way join chain, the unmatched emission paths (both the unmatched build scan and the unmatched probe path) need to re-enter those subsequent inner loops to produce correct results. For example, in t1 FULL OUTER JOIN t2 ON ... LEFT JOIN t3 ON ..., when we emit a t2 row that has no match in t1, we still need to scan t3 to find any rows that join with t2.

Without the GoSub wrapper, the inner loop cursors (e.g. t3's cursor) may not be open or rewound at the point where unmatched rows are emitted, because the unmatched scan happens after the main probe loop has finished.

The GoSub/Return pattern solves this by wrapping the inner table loops in a subroutine. During normal execution the main loop calls GoSub to enter the subroutine. When the unmatched scan later needs to emit a row, it calls the same GoSub to re-enter the inner loops from scratch: rewinding cursors, evaluating join conditions, and dispatching through the correct emit path (ORDER BY sorter, GROUP BY, aggregates, etc.) without duplicating any of that codegen. The Return instruction at the end of the subroutine jumps back to whichever call site invoked it, so the same subroutine body serves both the matched and unmatched paths.
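The control flow can be sketched with a toy bytecode interpreter. The `Insn` enum, `run`, and `demo_program` are hypothetical and exist only for this illustration; they are not the engine's actual instruction set or VM loop, and the single return-address register is a simplification.

```rust
/// Toy instruction set, invented for this sketch.
#[derive(Debug, Clone, Copy)]
enum Insn {
    /// Save the address of the next instruction, then jump to `target`.
    Gosub { target: usize },
    /// Jump back to the address saved by the most recent Gosub.
    Return,
    /// Stand-in for real work (opening cursors, emitting rows, ...).
    Emit(&'static str),
    Halt,
}

/// Runs the program and returns the trace of `Emit` labels.
fn run(program: &[Insn]) -> Vec<&'static str> {
    let mut out = Vec::new();
    let mut return_addr = 0usize; // single return-address register
    let mut pc = 0usize;
    loop {
        match program[pc] {
            Insn::Gosub { target } => {
                return_addr = pc + 1;
                pc = target;
            }
            Insn::Return => pc = return_addr,
            Insn::Emit(label) => {
                out.push(label);
                pc += 1;
            }
            Insn::Halt => return out,
        }
    }
}

/// Two call sites share one subroutine body.
fn demo_program() -> Vec<Insn> {
    vec![
        Insn::Emit("matched row"),      // 0: main probe loop found a match
        Insn::Gosub { target: 5 },      // 1: run the inner loops for it
        Insn::Emit("unmatched row"),    // 2: unmatched scan, after the probe loop
        Insn::Gosub { target: 5 },      // 3: re-enter the same inner loops
        Insn::Halt,                     // 4
        Insn::Emit("inner loops (t3)"), // 5: subroutine body, generated once
        Insn::Return,                   // 6: back to whichever call site ran Gosub
    ]
}
```

Running `run(&demo_program())` visits the subroutine body twice, once from each call site, even though its instructions appear only once in the program.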

Limitations

  • Chained FULL OUTER (A FULL OUTER B FULL OUTER C): rejected because the second join's build table has join_info.outer from the first.
  • FULL OUTER self-join: rejected (same root page for build and probe).
  • FULL OUTER with USING/NATURAL: not supported (hash join skipped for USING clauses).
  • FULL OUTER JOIN with no equijoin condition: not supported. It is still not implemented for the normal nested-loop (NLJ) path, and hash join requires an equality predicate.

Perf?

TPC-H Query 13

Before: 12.58s
After: 3.56s
SQLite: 8.65s


@turso-bot turso-bot bot left a comment

Please review @pereman2

@codspeed-hq

codspeed-hq bot commented Feb 11, 2026

Merging this PR will not alter performance

✅ 279 untouched benchmarks
⏩ 105 skipped benchmarks [1]


Comparing PThorpe92:full_outer_hash_joins (dedb1c9) with main (c58b2f6)


Footnotes

  1. 105 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@PThorpe92 PThorpe92 force-pushed the full_outer_hash_joins branch 2 times, most recently from ba64aa5 to 36d9123, February 13, 2026 17:36
@PThorpe92 PThorpe92 marked this pull request as ready for review February 13, 2026 20:10
@turso-bot turso-bot bot left a comment

Please review @pereman2

@PThorpe92 PThorpe92 force-pushed the full_outer_hash_joins branch 2 times, most recently from 00da6b4 to c195d45, February 20, 2026 20:57
@PThorpe92 PThorpe92 force-pushed the full_outer_hash_joins branch from 87094fc to 52e9a65, February 21, 2026 20:48