Revamp the generation of runtime division checks on ARM64 #111543
Conversation
This is WIP. I've taken a different approach to adding new nodes, instead adding a pass that modifies the HIR. The pass runs through all of the code in the function looking for GT_DIV/GT_UDIV nodes. The added HIR looks like this for the signed overflow check, for example. This is checking for (dividend < 0 && divisor == -1).
Here's the example @kunalspathak mentioned in #64795:
Before the change:
After the change:
The main difference is at label IG04: rather than a fixed sequence of compare and branch instructions chosen at the emit stage, the compiler has decided to build a logical expression for the overflow check and emit a more compact sequence.

The approach is working well when optimizations are enabled. It seems to have an adverse effect on MinOpts though, because splitting the tree will often spill and there aren't any optimization passes running to clean up these spills.

At the moment I haven't focused on the efficiency of the pass itself, but I believe it could be improved. I could borrow the recursive traversals in the earlier morph phase to build a work-list for where checks need to be added. Then the pass can be linear over a pre-built list of nodes rather than a search in a loop. I would just have to be careful to update all of the locations of the nodes after any trees are split, but I think this should be possible.

I've also had to make a temporary fix for a problem with the tree-splitting code where it wasn't correctly updating the node flags after splitting out side effects. After splitting the tree I traverse it post-order to update all of the flags. There might be a more efficient way of doing this.
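As a sketch of what such a post-order fix-up could look like (an assumption on my part, not the patch's actual code; fgWalkTreePost, gtUpdateNodeSideEffects and Statement::GetRootNodePointer are existing JIT utilities, and stmt stands for the statement that was just split):

// Hypothetical fix-up after gtSplitTree: recompute each node's side-effect flags
// bottom-up, so every parent's GTF_* summary reflects the effects that were
// split out of its subtree.
static Compiler::fgWalkResult UpdateFlagsVisitor(GenTree** use, Compiler::fgWalkData* data)
{
    data->compiler->gtUpdateNodeSideEffects(*use);
    return Compiler::WALK_CONTINUE;
}

// ... after splitting the statement's tree:
fgWalkTreePost(stmt->GetRootNodePointer(), UpdateFlagsVisitor);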
I think the build is failing in Release mode due to use of a debug-only node member.
What do you need this for? Increasing the size of the nodes is undesirable.
I need a way of uniquely identifying a node, so the pass can tell which ones it has already visited.
I think many of the regressions are caused by spilling in the tree-splitting step. My options for continuing this are:
I think it's sensible to allow the compiler to have a view of all of this code being added early on in the pipeline, but this might not make sense in the tiering model. So option 2 could be a good compromise. I'll need to look into individual regression cases for both options regardless of choice. I would be grateful for any opinions on this and the approach in general.
src/coreclr/jit/arithchecks.cpp
Outdated
impCloneExpr(divisor, &divisorCopy, CHECK_SPILL_NONE, nullptr DEBUGARG("cloned for runtime check"));
impCloneExpr(dividend, &dividendCopy, CHECK_SPILL_NONE, nullptr DEBUGARG("cloned for runtime check"));

// (dividend < 0 && divisor == -1)
Shouldn't this be dividend == MinValue and divisor == -1?
Yes, I misinterpreted the exception case and this likely explains some of the issues I'm seeing. Thanks.
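As a small standalone illustration (my own example, not from the patch) of why the condition must test the dividend against MinValue rather than any negative value:

#include <climits>
#include <cstdio>

int main()
{
    // Any negative dividend other than INT_MIN divides by -1 without overflow:
    printf("-5 / -1 = %d\n", -5 / -1); // prints 5; no exception required

    // INT_MIN / -1 is the single overflowing pair: the true quotient, 2147483648,
    // does not fit in a 32-bit int, so .NET must raise the overflow exception
    // (CORINFO_HELP_OVERFLOW). Evaluating it here would be undefined behavior in
    // C++, so we only print the operands.
    printf("overflowing pair: %d / %d\n", INT_MIN, -1);
    return 0;
}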
src/coreclr/jit/arithchecks.cpp
Outdated
code.block = divBlock;
code.stmt  = divBlock->firstStmt();
code.tree  = tree;
}
Seems like something here should be setting GTF_DIV_MOD_NO_BY_ZERO and GTF_DIV_MOD_NO_OVERFLOW on the DIV node.
I'm relying on the setting of these flags in the morph stage, e.g. in fgMorphSmpOp (morph.cpp:8584). When I call GenTree::OperExceptions I'm implicitly checking these flags too. As I'm not doing any further processing of the types of the operands, I don't think I can make that decision in this pass, unless I'm missing something here?
Yes, the address of nodes can be used. However, most likely there is no need for any form of "visited" check at all; instead you can shape your pass so that it visits all IR exactly once. See e.g. the various helper expansions in helperexpansion.cpp; those are shaped so that they visit all IR once while allowing for expansion of internal nodes into control flow.
I agree that (2) would be most reasonable. You may want to experiment with some alternative simpler and cheaper ways of accomplishing what this pass is doing. One thing that comes to mind is expanding the checks as QMARKs during import. That is, instead of creating a bare division node, create

QMARK(dividend == MinValue & divisor == -1,
      CALL CORINFO_HELP_OVERFLOW,
      QMARK(divisor == 0,
            CALL CORINFO_HELP_THROWDIVZERO,
            DIV(dividend, divisor)))

(marking the division node with GTF_DIV_MOD_NO_BY_ZERO and GTF_DIV_MOD_NO_OVERFLOW so that it does not get its own checks).
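A sketch of that import-time expansion using the tree-construction helpers that appear later in this thread (simplified, not the final patch; divisorCopy and overflowCondition stand for operands assumed to have been cloned/built beforehand):

// Innermost: the division itself, flagged so codegen emits no checks of its own.
GenTree* result = gtNewOperNode(oper, resultType, dividend, divisor);
result->gtFlags |= (GTF_DIV_MOD_NO_BY_ZERO | GTF_DIV_MOD_NO_OVERFLOW);

// divisor == 0 ? CALL CORINFO_HELP_THROWDIVZERO : DIV
GenTree* isZero = gtNewOperNode(GT_EQ, TYP_INT, divisorCopy, gtNewIconNode(0));
result = gtNewQmarkNode(resultType, isZero,
                        gtNewColonNode(resultType,
                                       gtNewHelperCallNode(CORINFO_HELP_THROWDIVZERO, resultType),
                                       result));

// (dividend == MinValue && divisor == -1) ? CALL CORINFO_HELP_OVERFLOW : ...
result = gtNewQmarkNode(resultType, overflowCondition,
                        gtNewColonNode(resultType,
                                       gtNewHelperCallNode(CORINFO_HELP_OVERFLOW, resultType),
                                       result));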
Fixes dotnet#64795

This patch wraps GT_DIV/GT_UDIV nodes on integral types with GT_QMARK trees that contain the necessary conformance checks (overflow/divide-by-zero) when compiling with FullOpts enabled. Currently these checks are added during the Emit phase; this is still the case for MinOpts. The aim is to allow the compiler to make decisions on code position and instruction selection for these checks. For example, on ARM64 this enables certain scenarios to choose the cbz instruction over cmp/beq, which can lead to more compact code. It also allows some of the comparisons in the checks to be hoisted out of loops.
Hi @jakobbotsch, thanks for the help. I've updated the pull request now with the implementation introducing the QMARK-based checks at import time. It's still not fully clear from the diffs exactly how much impact is caused by the checks themselves. It's much clearer now where the spills are occurring; quite a few are related to trying to clone a divisor that has side effects. I'll also need to think about a way of making sure the spills are avoided where possible.
Added some comments. Let us merge #111797 and make this change Arm64-only; I can take a look after that. Also, I don't expect to see MinOpts diffs, so please double check why they are coming.
src/coreclr/jit/compiler.h
Outdated
@@ -4681,6 +4681,8 @@ class Compiler

    GenTree* impThrowIfNull(GenTreeCall* call);

    void impImportDivision(bool isSigned);
can you add this under #ifdef TARGET_ARM64?
@@ -7292,12 +7292,9 @@ void Compiler::impImportBlockCode(BasicBlock* block)
    // Other binary math operations

    case CEE_DIV:
        oper = GT_DIV;
same here, just do it for #ifdef TARGET_ARM64.
src/coreclr/jit/importer.cpp
Outdated
@@ -13823,3 +13820,137 @@ methodPointerInfo* Compiler::impAllocateMethodPointerInfo(const CORINFO_RESOLVED
    memory->m_tokenConstraint = tokenConstrained;
    return memory;
}

void Compiler::impImportDivision(bool isSigned)
please add the method docs.
There shouldn't be any MinOpts diff in this latest version (4b58921) as there aren't any on my local run, but if they are still there after the CI run I'll take a look again.
src/coreclr/jit/importer.cpp
Outdated
// That said, as of now it *is* a large node, so we'll do this with an assert rather
// than an "if".
assert(GenTree::s_gtNodeSizes[GT_CALL] == TREE_NODE_SZ_LARGE);
GenTree* divNode = new (this, GT_CALL) GenTreeOp(oper, resultType, dividend, divisor DEBUGARG(/*largeNode*/ true));
Can all of this logic be written so that we delegate to the standard div logic unless the QMARK handling logic actually kicks in? E.g. this weird GT_CALL-sized node does not need to be allocated like this for your integral-only cases.
I've tried removing this; it hits assertions in Lowering where some code has tried to change the node type to GT_CAST and it can't because the nodes are different sizes. I think it would be best to keep this logic for all DIV nodes for now? And there would be some cleanup later involving removing any implicit assumptions about the size of DIV nodes.
Can you point me to the code in lowering that is changing an integer division to a cast?
See here on current main branch:

runtime/src/coreclr/jit/lower.cpp, line 7329 in da4f0a3:

divMod->ChangeOper(GT_CAST);

The assertion fires in SetOper, so I'm not sure if all of the failures I'm seeing are from this exact code path.
I see, that should probably be cleaned up.
In any case this code should still be refactored so that it falls back to the existing code for the cases it is not trying to handle differently. But it is ok to keep this code allocating a large node (it should rather use gtNewLargeOperNode, though; the comment above this is wrong since these divisions do not get transformed into helper calls).
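For illustration, the suggested change would reduce the allocation to something like this (a sketch using the gtNewLargeOperNode helper named above):

// Allocates the division as a large node up front, without borrowing GT_CALL's
// placement-new machinery or the misleading helper-call comment.
GenTree* divNode = gtNewLargeOperNode(oper, resultType, dividend, divisor);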
src/coreclr/jit/importer.cpp
Outdated
{
    // Spill the divisor, as (divisor == 0) is always checked.
    GenTree* divisorCopy = nullptr;
    impCloneExpr(divisor, &divisorCopy, CHECK_SPILL_NONE, nullptr DEBUGARG("divisor used in runtime checks"));
This looks wrong -- it reorders the evaluation of divisor with dividend and other things on the IL stack. I would suggest you avoid popping the dividend and divisor until after this point by using impStackTop above. Then you can pop the divisor, clone it with impCloneExpr(divisor, CHECK_SPILL_ALL, ...) and then afterwards you can pop the dividend. Note that this function is returning the tree that is one of the copies of the divisor, so you should use it instead of calling gtClone below.
Just to clarify, is the problem that the cloning of the expressions is in the wrong order? If I make sure the dividend check is added first and order my clones correctly, that will make sure the dividend is always cloned first if it needs to be. I'm not familiar with the import stage, so I can't see how calling impPopStack affects the execution order?
The importer works as follows: it keeps the current list of basic block statements (impStmtList + impLastStmt) and it keeps values representing each IL stack entry (accessed with e.g. impPopStack() and impStackTop). At all times, there is an invariant that the current basic block will evaluate first the basic block statements and then the IL evaluation stack from bottom to the top. That is the defined evaluation order.
If you ever pop a value from the stack, then the side effects of that value need to be evaluated after everything that is in the list of basic block statements and after everything that was below it on the IL stack. You are responsible for ensuring that this ends up being the evaluation order. The importer provides some tools to ensure that this happens, like the notion of spilling: when you want to append a side effect to the basic block list, spilling takes care to check interference between the side effect and things that are currently on the IL stack. When it detects interference, it spills what is on the IL stack by appending its side effect (and everything before it) to the basic block list of statements.
To clone an expression impCloneExpr sometimes may need to store it into a temporary value and append that store to the statement list. By passing CHECK_SPILL_NONE here you are telling it that it is ok to evaluate the value before everything currently on the IL stack. That's not correct; it will mean that side effects in the divisor will be reordered to happen before things that were below it on the IL stack.

If you pass CHECK_SPILL_ALL you still have a problem that the divisor will potentially end up executing before the dividend. Hence why I suggested the stepwise pop/cloning process above.
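Putting the suggested ordering together, a rough sketch (my reading of the suggestion, assuming the impCloneExpr signature used elsewhere in this PR):

// Peek first; nothing is popped until we are ready to commit to an evaluation order.
GenTree* divisor  = impStackTop().val;  // top of stack
GenTree* dividend = impStackTop(1).val; // next entry down -- must evaluate first

// Pop and clone the divisor. CHECK_SPILL_ALL spills any interfering entries still
// on the IL stack (including the dividend) to the statement list first, so the
// clone's temp store cannot be reordered before them.
divisor = impPopStack().val;
GenTree* divisorCopy = nullptr;
divisor = impCloneExpr(divisor, &divisorCopy, CHECK_SPILL_ALL,
                       nullptr DEBUGARG("divisor used in runtime checks"));

// Only now pop the dividend; its side effects stay ordered before the divisor's.
dividend = impPopStack().val;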
Thanks for the explanation, hopefully the latest change now reflects this properly.
The implementation is now regressing #63905; I think I can see why. The snippet is coming from this IL produced by the regression test:

The spilling required for the divide-by-zero check has caused it to be evaluated before the null-check on the indirection to the struct field. I would need to refactor the code to make sure the original exception ordering is preserved.
I still see MinOpts diffs.
src/coreclr/jit/importer.cpp
Outdated
    impImportDivisionWithChecks(oper, type, op1, op2);
    break;
}
#endif
I think what you had before looked better. It should be shaped something like this:
case CEE_DIV:
#ifdef TARGET_ARM64
if (opts.OptimizationEnabled() && impImportDivisionWithChecks(oper, type, op1, op2))
{
break;
}
#endif
oper = GT_DIV;
goto MATH_MAYBE_CALL_NO_OVF;
Then just have impImportDivisionWithChecks return false for the cases it does not handle.
To do this I would have to evaluate type, op1 and op2 in this block, which might make it look repetitive again, especially if copied across CEE_DIV/CEE_DIV_UN. It's using the original behavior as a fallback now, so it makes sense to try and merge this with the original logic in my opinion.
I'm not sure I follow. I prefer a solution where the changes are restricted to be inside the new function you introduce. I don't see what duplication we would have.
I mean that this block (from the label MATH_OP2_FLAGS) needs to be evaluated before I have access to those parameters to the function:
op2 = impStackTop().val;
op1 = impStackTop(1).val;
/* Can't do arithmetic with references */
assertImp(genActualType(op1->TypeGet()) != TYP_REF && genActualType(op2->TypeGet()) != TYP_REF);
// Change both to TYP_I_IMPL (impBashVarAddrsToI won't change if its a true byref, only
// if it is in the stack)
impBashVarAddrsToI(op1, op2);
type = impGetByRefResultType(oper, uns, &op1, &op2);
So I could refactor the function again to not take these as arguments but that would mean copying this block into the function.
Division is not supported with byrefs, so this block is unnecessary to have in your function.
I tried removing this, but removing impGetByRefResultType loses some important side effects and I think it causes the recent set of failures. It's inserting upcasts for integers and processing native int types. I've renamed it to impProcessResultType as it doesn't just handle byrefs and added it back to my function.
src/coreclr/jit/importer.cpp
Outdated
divNode->AsOp()->gtOp1 = impCloneExpr(dividend, &dividendCopy, CHECK_SPILL_NONE,
                                      nullptr DEBUGARG("dividend used in runtime checks"));
This CHECK_SPILL_NONE should be CHECK_SPILL_ALL. Similar below.
This seems to be passing now, with some positive PerfScore improvements in the diffs. I've made sure checks are also added this way for modulo operations, as they are morphed to divisions on ARM64.

There are some opportunities for improvement that @a74nh and I have noticed in the diffs. Some compilations contain duplicated throw code, where more than one check occurs in the function. This can increase the code size for some functions but shouldn't have much impact on performance.

Another thing to note is that these throw blocks are being assigned block weights; previously they were marked as run rarely. It's not clear to me if this is going to have any impact.

We've also noticed that some checks are now preferring CCMP over consecutive branches, see the sequence below. This should be a good thing as it reduces branch density. The sequence could be improved by removing the unnecessary extra comparison.
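In source terms the fused check is equivalent to the following (illustrative C++ rather than JIT IR; RaiseDivisionException is a hypothetical stand-in for the shared throw blocks):

// One combined test that ARM64 can lower to a cmp/ccmp chain with a single branch,
// instead of a separate compare-and-branch pair per condition (INT_MIN from <climits>):
if (divisor == 0 || (dividend == INT_MIN && divisor == -1))
    RaiseDivisionException();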
I'm curious if you compared this with the approach suggested by @SingleAccretion in #64795 (comment).
I didn't manage to get this working initially, down to needing more experience across the code base, but I think trying that could follow on from this patch. The problem with hoisting conditional logic isn't unique to code added by the compiler, though. For example, this kind of code doesn't result in any optimization in the same way:
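(A hypothetical C++ analogue of the pattern, with assumed names; ThrowDivideByZero stands in for the runtime's divide-by-zero throw helper.)

[[noreturn]] void ThrowDivideByZero(); // hypothetical throw helper

int SumOfQuotients(const int* data, int n, int divisor)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        // `divisor` is loop-invariant, yet the check is re-evaluated every iteration.
        if (divisor == 0)
            ThrowDivideByZero();
        sum += data[i] / divisor;
    }
    return sum;
}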
Could this be improved by implementing loop unswitching, which would hoist the condition and duplicate the loop, producing something like the sketch below?
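(Continuing the hypothetical example above.)

// After unswitching: the invariant condition is tested once and the loop duplicated,
// so each specialized copy can be simplified independently (guard on n > 0 elided).
if (divisor == 0)
{
    // "true" loop: every iteration would throw, so it collapses to a single throw.
    ThrowDivideByZero();
}
else
{
    for (int i = 0; i < n; i++)
        sum += data[i] / divisor; // "false" loop: the check disappears entirely.
}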
Then the true loop is optimized to a single throw statement, and the false loop is optimized away. On ARM64 I'm seeing this loop block generated at the moment:

which could be simplified to just the first compare-and-branch. Has this been discussed before? A quick search shows this issue: #65342
Yes, the hoisting implementation in the JIT is notoriously poor, so you kind of have to compensate for that. Introducing the checks as explicit IR at least gives the optimizer the chance to do so.
I think the other approach is incompatible with this PR, so we would need to revert this PR. That's why I would be curious to see an experiment to do that instead. (Wrt. the "dividend == MinValue && divisor == -1" check: I would suggest just leaving that check up to the div node for now instead of materializing it explicitly, given that the dividend is not expected to be invariant.)
    result =
        gtNewQmarkNode(resultType, condition,
                       gtNewColonNode(resultType, gtNewHelperCallNode(CORINFO_HELP_OVERFLOW, resultType), result));
}

divisor = gtClone(divisor, false);
result =
    gtNewQmarkNode(resultType,
                   // (divisor == 0)
                   gtNewOperNode(GT_EQ, TYP_INT, divisor, gtNewIconNode(0, genActualType(divisor))),
                   gtNewColonNode(resultType, gtNewHelperCallNode(CORINFO_HELP_THROWDIVZERO, resultType), result));
gtThenLikelihood should be set for these qmark nodes to indicate that we never expect the exceptions to be thrown. I do not think the current perfscore diffs show the right picture because this is missing.
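A one-line sketch of the fix (assuming the GenTreeQmark setter corresponding to the gtThenLikelihood field mentioned above):

// Mark the throw arm as never expected to execute, so block weights and perfscore
// treat the exception paths as cold.
qmark->SetThenNodeLikelihood(0);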
Notably, since #64795 (comment) I've realized that it should probably be shaped like DIV(dividend, CKZERO(divisor)), where CKZERO is a node that throws the divide-by-zero exception when its operand is zero and otherwise produces the operand unchanged.
That will work too, though I am unsure how well hoisting is going to handle it (see runtime/src/coreclr/jit/optimizer.cpp, lines 4425 to 4454 in d2650b6), and then furthermore some assertion prop work will be needed to generate assertions from CKZERO and eliminate it based on them. That will also largely follow the handling for GT_NULLCHECK, however.
@snickolls-arm are you up for giving it a go?
Yes, I'll have a go at this. I'll open another draft PR when it's close to completion and we can compare with this one.
Fixes #64795

This patch introduces a new compilation phase that passes over the GenTrees looking for GT_DIV/GT_UDIV nodes on integral types, and morphs the code to introduce the necessary conformance checks (overflow/divide-by-zero) early in the compilation pipeline. Currently these are added during the Emit phase, meaning optimizations don't run on any of the introduced code. The aim is to allow the compiler to make decisions on code position and instruction selection for these checks. For example, on ARM64 this enables certain scenarios to choose the cbz instruction over cmp/beq, which can lead to more compact code. It also allows some of the comparisons in the checks to be hoisted out of loops.