@Trion129

This commit adds an MLIR implementation for HashJoinOp.
It follows an MLIR pattern similar to the one in issue #97.

  1. Substrait.cpp: Verifier implementations for the HashJoin and KeyComparison operations. For key comparisons, the verifier tries to allow similar types. This might not be the best way.
  2. Import.cpp: Converts HashJoinRel into Filter->HashJoin->ConditionRegion. The post-join filter is lifted into a filter op, and the key comparison operator is added to the nested region.
  3. Export.cpp: Export is not a 1->1 mapping from import, as it currently does not convert Filter->HashJoin back into HashJoin. I have assumed this might not be needed and that these two can be exported as separate relations.
  4. *.td files: Added enum and Op implementations.

Skipped work:
This does not implement custom_function_id. We do not need it at the moment, so I left a TODO for now.

@CLAassistant

CLAassistant commented Apr 28, 2025

CLA assistant check
All committers have signed the CLA.

@Trion129 Trion129 changed the title HashJoinOp MLIR implementation feat: HashJoinOp MLIR implementation Apr 28, 2025
@Trion129 Trion129 changed the title feat: HashJoinOp MLIR implementation feat: HashJoinOp mlir implementation Apr 28, 2025
@Trion129 Trion129 marked this pull request as draft April 28, 2025 17:58
Collaborator

@mortbopet mortbopet left a comment

Very nice work! Left a bunch of comments. Obviously there need to be import/export/IR tests, but given the draft state, I assume those are in the pipeline :).

The post-join filter is definitely a funny thing here, and in principle I agree with your approach (creating a filter node after the hash-join). It's a normalization of the IR, and it avoids introducing redundant operations (i.e. if we actually created a post-join filter in the hash-join op).
However, it does present an issue for round-tripping through the IR, which this would break. It's obviously not impossible to fix (one way would be to add extra logic in the export pass to detect how a filter op is being used, and infer the post-join filter from there)... @ingomueller-net thoughts?
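
To make the export-side idea concrete, here is a rough sketch of what "detect how a filter op is being used" could look like. It uses toy stand-in structs rather than the real MLIR ops or Substrait protobuf messages; `Rel`, `RelKind`, `foldPostJoinFilter`, and `checkFoldExample` are hypothetical names that do not appear in this PR.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy stand-ins for relations. None of these names come from the PR.
enum class RelKind { Read, Filter, HashJoin };

struct Rel {
  RelKind kind;
  std::string condition;       // Filter only: predicate text (placeholder)
  std::string postJoinFilter;  // HashJoin only: empty means "not set"
  std::shared_ptr<Rel> input;  // Filter only: the relation being filtered
};

// If `rel` is a Filter sitting directly on top of a HashJoin that has no
// post-join filter yet, return a copy of the join with the predicate attached.
// In every other case, return `rel` unchanged.
std::shared_ptr<Rel> foldPostJoinFilter(const std::shared_ptr<Rel> &rel) {
  if (!rel || rel->kind != RelKind::Filter || !rel->input)
    return rel;
  const Rel &join = *rel->input;
  if (join.kind != RelKind::HashJoin || !join.postJoinFilter.empty())
    return rel;
  auto folded = std::make_shared<Rel>(join);  // copy; the input stays intact
  folded->postJoinFilter = rel->condition;
  return folded;
}

// Small self-check showing the intended behaviour.
void checkFoldExample() {
  auto join = std::make_shared<Rel>(Rel{RelKind::HashJoin, "", "", nullptr});
  auto filter = std::make_shared<Rel>(Rel{RelKind::Filter, "a > 1", "", join});
  auto folded = foldPostJoinFilter(filter);
  assert(folded->kind == RelKind::HashJoin);
  assert(folded->postJoinFilter == "a > 1");
}
```

Note that a heuristic like this folds every Filter-over-HashJoin pair, so a plan that originally contained a separate filter relation after a hash join would come back with a post-join filter instead. Round-tripping would then be exact in one direction only.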

EDIT: guess we aren't the first to pose this question!
https://github.com/substrait-io/substrait/blob/413c7c8c8ea149ea1596c9c3b2e57151d6ce63f7/site/docs/faq.md?plain=1#L7-L12

One way we could add it in there is to have an optional filter region, containing a substrait.filter node:

%2 = substrait.hash_join %0, %1 on {
  ^bb0(%arg0: tuple<si32, si32>, %arg1: tuple<si32, si32, si32>):
    %3 = field_reference %arg0[0] : tuple<si32, si32> // corresponds to `left`
    %4 = field_reference %arg1[0] : tuple<si32, si32, si32>  // corresponds to `right`
    %5 = call @cmp(%3, %4) : (si32, si32) -> si1  // corresponds to `custom_function_reference`
    // ... or ...
    %5 = compare not_distinct_from %3, %4 : (si32, si32) -> si1  // corresponds to `simple`
    yield %5 : si1
  } filter {
    ^bb0(%arg0: tuple<si32, si32>, %arg1: tuple<si32, si32, si32>):
    %res = substrait.filter ...
    yield %res : si1
  }

Substrait_ExpressionType:$lhs,
Substrait_ExpressionType:$rhs,
OptionalAttr<SimpleComparisonType>:$comparison_type,
OptionalAttr<UI32Attr>:$custom_function_id

If custom_function_id is deliberately not implemented, I'd say remove this argument and write a TODO in the description field.

let results = (outs SI1:$result);

let assemblyFormat = [{
$comparison_type $lhs `,` $rhs attr-dict `:` `(` type($lhs) `,` type($rhs) `)` `->` type($result)

`(` type($lhs) `,` type($rhs) `)` `->` type($result)

could be

functional-type(operands, $result)

let assemblyFormat = [{
$join_type $left `,` $right
(`advanced_extension` `` $advanced_extension^)?
attr-dict `:` type($left) `,` type($right) `->` type($result) $condition

as above (functional-type).

}

LogicalResult KeyComparisonOp::verify() {
auto &res =

Since this may be a bit unintuitive at first (i.e. one might expect that the comparison op should always have identical types for both operands), a comment might be warranted here.

Do we have some documentation about Substrait's stance here? That is, this code assumes that the comparison op is able to compare types that are "cast-compatible" (int <=> decimal, string <=> varchar).
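
For reference, the "cast-compatible" assumption could be spelled out roughly as below. This is only a sketch of the idea, not the verifier code in this PR: `TypeKind`, `TypeFamily`, `familyOf`, `areKeyTypesComparable`, and `checkComparableExample` are hypothetical names, and the real verifier works on MLIR types rather than an enum. Whether Substrait actually permits such comparisons is the open question above.

```cpp
#include <cassert>

// Hypothetical type kinds; the real dialect uses MLIR types instead.
enum class TypeKind { SI32, SI64, Decimal, String, VarChar, Bool };

// Coarse families within which a comparison is assumed to be allowed.
enum class TypeFamily { Numeric, Text, Boolean };

TypeFamily familyOf(TypeKind kind) {
  switch (kind) {
  case TypeKind::SI32:
  case TypeKind::SI64:
  case TypeKind::Decimal:
    return TypeFamily::Numeric;
  case TypeKind::String:
  case TypeKind::VarChar:
    return TypeFamily::Text;
  case TypeKind::Bool:
    return TypeFamily::Boolean;
  }
  return TypeFamily::Boolean;  // not reached for valid enum values
}

// Two key types are treated as comparable when they share a family, e.g.
// int <=> decimal or string <=> varchar. Identical types trivially qualify.
bool areKeyTypesComparable(TypeKind lhs, TypeKind rhs) {
  return familyOf(lhs) == familyOf(rhs);
}

// Small self-check showing the intended behaviour.
void checkComparableExample() {
  assert(areKeyTypesComparable(TypeKind::SI32, TypeKind::Decimal));
  assert(!areKeyTypesComparable(TypeKind::SI32, TypeKind::VarChar));
}
```

Grouping types into families keeps the rule symmetric and easy to document in a single comment on the verifier, which is the main thing asked for here.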

Comment on lines +1098 to +1101
if (failed(res))
return res;

return success();

Could just return res; here.

return op->emitOpError("missing join condition");
}

Block &conditionBlock = op.getCondition().front();

As above (op.getBody()).

Comment on lines +803 to +805
if (!compareOp) {
return op->emitOpError("join condition must be a KeyComparisonOp");
}

This could be removed, since you're already checking for this in your verifier.
nit: remove braces.

Comment on lines +810 to +823
if (auto leftFieldRef =
dyn_cast_or_null<FieldReferenceOp>(leftKey.getDefiningOp())) {
leftKeyExpr = exportOperation(leftFieldRef);
} else {
return op->emitOpError() << "left key must be a field reference";
}

FailureOr<std::unique_ptr<Expression>> rightKeyExpr;
if (auto rightFieldRef =
dyn_cast_or_null<FieldReferenceOp>(rightKey.getDefiningOp())) {
rightKeyExpr = exportOperation(rightFieldRef);
} else {
return op->emitOpError() << "right key must be a field reference";
}

This logic should be moved to a verifier of the KeyComparisonOp (i.e. set hasVerifier = 1 for the op).

Comment on lines +698 to +700
if (failed(hashJoinOp)) {
return failure();
}

as above

Comment on lines +733 to +734
return mlir::emitError(builder.getLoc(),
"custom comparison functions not yet supported");

Nit: change to "custom comparison functions for hash_join not yet supported" to make it a bit clearer. With a very large Substrait input file, it may otherwise be hard to decipher what the error message refers to.


3 participants