feat: `HashJoinOp` mlir implementation #129

Trion129 · 2025-04-28T15:11:22Z

This commit adds an MLIR implementation for `HashJoinOp`.
It tries to follow an MLIR pattern similar to the one in issue #97.

Substrait.cpp: Verifier implementations for the HashJoin and KeyComparison operations. For key comparisons, the verifier tries to allow similar types. This might not be the best way.
Import.cpp: Converts HashJoinRel into Filter->HashJoin->ConditionRegion. The post-join filter is lifted into a separate filter, and the key comparison operator is added to the nested region.
Export.cpp: Import and export are not a 1:1 mapping, because export currently does not convert Filter->HashJoin back into a single HashJoin. I have assumed this might not be needed and that the two can be exported as separate relations.
*.td files: Added enum and Op implementations.

Skipped work:
Doesn't implement custom_function_id. We don't need it at the moment, so I left a TODO for now.

CLAassistant · 2025-04-28T15:11:28Z

All committers have signed the CLA.

mortbopet

Very nice work! Left a bunch of comments. Obviously there need to be import/export/IR tests, but given the draft state, I assume those are in the pipeline :).

The post-join filter is definitely a funny thing here, and in principle I agree with your approach (creating a filter node after the hash join). It's a normalization of the IR, and it avoids introducing redundant operations (i.e. if we actually created a post-join filter inside the hash-join op).
However, it does present an issue for round-tripping through the IR, which this would break. It's obviously not impossible to fix (one way would be to add extra logic in the export pass to detect how a filter op is being used, and infer the post-join filter from there)... @ingomueller-net thoughts?

EDIT: guess we aren't the first to pose this question!
https://github.com/substrait-io/substrait/blob/413c7c8c8ea149ea1596c9c3b2e57151d6ce63f7/site/docs/faq.md?plain=1#L7-L12

One way we could add it in there is to have an optional filter region, containing a substrait.filter node:

%2 = substrait.hash_join %0, %1 on {
  ^bb0(%arg0: tuple<si32, si32>, %arg1: tuple<si32, si32, si32>):
    %3 = field_reference %arg0[0] : tuple<si32, si32> // corresponds to `left`
    %4 = field_reference %arg1[0] : tuple<si32, si32, si32>  // corresponds to `right`
    %5 = call @cmp(%3, %4) : (si32, si32) -> si1  // corresponds to `custom_function_reference`
    // ... or ...
    %5 = compare not_distinct_from %3, %4 : (si32, si32) -> si1  // corresponds to `simple`
    yield %5 : si1
  } filter {
    ^bb0(%arg0: tuple<si32, si32>, %arg1: tuple<si32, si32, si32>):
    %res = substrait.filter ...
    yield %res : si1
  }
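
For comparison, the export-side alternative mentioned above (detecting how a filter op is used and inferring the post-join filter from that) could look roughly like the following. This is only a sketch on a toy model: `Rel`, `Kind`, and `findFoldableJoin` are hypothetical names and not part of substrait-mlir, and the "sole consumer" condition is an assumption about when folding would be safe.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for IR operations; not the substrait-mlir API.
enum class Kind { HashJoin, Filter, Other };

struct Rel {
  Kind kind;
  const Rel *input;      // producer of the (first) operand, or nullptr
  int numUses;           // number of consumers of this relation's result
  std::string predicate; // only meaningful for Filter
};

// Returns the hash join that `filter` could be folded into on export, or
// nullptr if the filter has to be exported as a separate FilterRel.
const Rel *findFoldableJoin(const Rel &filter) {
  if (filter.kind != Kind::Filter || filter.input == nullptr)
    return nullptr;
  const Rel *join = filter.input;
  // Assumption: folding is only safe if the filter is the sole consumer of
  // the join; any other consumer would otherwise see the filtered result.
  if (join->kind != Kind::HashJoin || join->numUses != 1)
    return nullptr;
  return join;
}
```

Note that this would not give an exact round trip either: a plan that originally contained a separate FilterRel directly after a HashJoinRel would come back with the filter folded into the join.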

mortbopet · 2025-04-29T07:03:50Z

include/substrait-mlir/Dialect/Substrait/IR/SubstraitOps.td

+    Substrait_ExpressionType:$lhs,
+    Substrait_ExpressionType:$rhs,
+    OptionalAttr<SimpleComparisonType>:$comparison_type,
+    OptionalAttr<UI32Attr>:$custom_function_id


If custom_function_id is deliberately not implemented, i'd say remove this argument and write a TODO in the description field.

mortbopet · 2025-04-29T07:06:48Z

include/substrait-mlir/Dialect/Substrait/IR/SubstraitOps.td

+  let results = (outs SI1:$result);
+
+  let assemblyFormat = [{
+    $comparison_type $lhs `,` $rhs attr-dict `:` `(` type($lhs) `,` type($rhs) `)` `->` type($result)


`(` type($lhs) `,` type($rhs) `)` `->` type($result)

could be

functional-type(operands, $result)

mortbopet · 2025-04-29T07:08:04Z

include/substrait-mlir/Dialect/Substrait/IR/SubstraitOps.td

+  let assemblyFormat = [{
+    $join_type $left `,` $right
+    (`advanced_extension` `` $advanced_extension^)?
+    attr-dict `:` type($left) `,` type($right) `->` type($result) $condition


as above (functional-type).

mortbopet · 2025-04-29T07:12:23Z

lib/Dialect/Substrait/IR/Substrait.cpp

+}
+
+LogicalResult KeyComparisonOp::verify() {
+  auto &res =


Since this may be a bit unintuitive at first (i.e. one might expect that the comparison op should always have identical types for both operands), a comment might be warranted here.

Do we have some documentation about Substrait's stance here? That is, this code assumes that the comparison op can compare types that are "cast-compatible" (int <=> decimal, string <=> varchar).
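
To make the "cast-compatible" assumption concrete, here is a minimal sketch of such a check on a toy type model. `TypeKind`, `Family`, and `areKeysComparable` are hypothetical and only mirror the two pairs named above; the real verifier works on MLIR types, and whether Substrait actually permits these comparisons is exactly the open question.

```cpp
#include <cassert>

// Hypothetical type categories; the real dialect uses MLIR types.
enum class TypeKind { Int, Decimal, String, VarChar, Boolean };

// Types in the same family are treated as comparable with each other.
enum class Family { Numeric, Text, Other };

Family familyOf(TypeKind t) {
  switch (t) {
  case TypeKind::Int:
  case TypeKind::Decimal:
    return Family::Numeric;
  case TypeKind::String:
  case TypeKind::VarChar:
    return Family::Text;
  case TypeKind::Boolean:
    return Family::Other;
  }
  return Family::Other;
}

// Identical types always compare; otherwise both sides must belong to the
// same numeric or text family (int <=> decimal, string <=> varchar).
bool areKeysComparable(TypeKind lhs, TypeKind rhs) {
  if (lhs == rhs)
    return true;
  Family family = familyOf(lhs);
  return family != Family::Other && family == familyOf(rhs);
}
```

A comment next to the verifier could state the same rule in one sentence, which may be enough to address the "unintuitive at first" concern.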

mortbopet · 2025-04-29T07:12:52Z

lib/Dialect/Substrait/IR/Substrait.cpp

+  if (failed(res))
+    return res;
+
+  return success();


Could just return res; here.

mortbopet · 2025-04-29T07:16:35Z

lib/Target/SubstraitPB/Export.cpp

+    return op->emitOpError("missing join condition");
+  }
+
+  Block &conditionBlock = op.getCondition().front();


As above (op.getBody()).

mortbopet · 2025-04-29T07:17:04Z

lib/Target/SubstraitPB/Export.cpp

+  if (!compareOp) {
+    return op->emitOpError("join condition must be a KeyComparisonOp");
+  }


This could be removed, since you're already checking for this in your verifier.
nit: remove braces.

mortbopet · 2025-04-29T07:18:57Z

lib/Target/SubstraitPB/Export.cpp

+  if (auto leftFieldRef =
+          dyn_cast_or_null<FieldReferenceOp>(leftKey.getDefiningOp())) {
+    leftKeyExpr = exportOperation(leftFieldRef);
+  } else {
+    return op->emitOpError() << "left key must be a field reference";
+  }
+
+  FailureOr<std::unique_ptr<Expression>> rightKeyExpr;
+  if (auto rightFieldRef =
+          dyn_cast_or_null<FieldReferenceOp>(rightKey.getDefiningOp())) {
+    rightKeyExpr = exportOperation(rightFieldRef);
+  } else {
+    return op->emitOpError() << "right key must be a field reference";
+  }


This logic should be moved to a verifier of the KeyComparisonOp (i.e. set hasVerifier = 1 for the op).
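
As an illustration of the idea, the two checks from the export code could be expressed as a single verification step that runs before export. The sketch below uses a toy model: `OpKind`, `KeyComparison`, and `verifyKeyComparison` are hypothetical, and a real verifier would inspect `getDefiningOp()` and report through `emitOpError()` as in the diff above.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of what produces each key operand; `None` stands for a
// value without a defining op (e.g. a block argument).
enum class OpKind { None, FieldReference, Literal, Call };

struct KeyComparison {
  OpKind lhsProducer;
  OpKind rhsProducer;
};

// Returns true if both keys are produced by field references. On failure,
// `error` holds the same messages the export code currently emits.
bool verifyKeyComparison(const KeyComparison &op, std::string &error) {
  if (op.lhsProducer != OpKind::FieldReference) {
    error = "left key must be a field reference";
    return false;
  }
  if (op.rhsProducer != OpKind::FieldReference) {
    error = "right key must be a field reference";
    return false;
  }
  error.clear();
  return true;
}
```

With the invariant enforced at verification time, the export code could assume both keys are field references and drop its own error paths.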

mortbopet · 2025-04-29T07:20:58Z

lib/Target/SubstraitPB/Import.cpp

+  if (failed(hashJoinOp)) {
+    return failure();
+  }


mortbopet · 2025-04-29T07:22:09Z

lib/Target/SubstraitPB/Import.cpp

+    return mlir::emitError(builder.getLoc(),
+                           "custom comparison functions not yet supported");


Nit: change to "custom comparison functions for hash_join not yet supported" to make it clearer. With a very large Substrait input file, it can be hard to tell what the error message refers to.

Trion129 added 4 commits April 28, 2025 14:12

initial changes

5ebb426

Remove left and right keys to avoid redundancy

af9aded

Fix Yield MLIR generation issues

934f8cd

Fix for comparison of similar key types

ab30453

Trion129 changed the title ~~HashJoinOp MLIR implementation~~ feat: HashJoinOp MLIR implementation Apr 28, 2025

Trion129 changed the title ~~feat: HashJoinOp MLIR implementation~~ feat: HashJoinOp mlir implementation Apr 28, 2025

Trion129 marked this pull request as draft April 28, 2025 17:58

mortbopet reviewed Apr 29, 2025
