Feat: Add loop support to the optimise-relinearization pass #2758
akashmadhu4 wants to merge 5 commits into google:main
Conversation
j2kun
left a comment
So I like this approach, and reviewing the PR gave me some ideas for making it even better. I will describe those ideas and, if you don't feel up for it, we can merge this PR in (modulo the one comment and linter failure) and move on.
The main reason I think it would eventually need improvement is that, in the looped linalg kernels I've been working on, specifically the baby-step giant-step kernel, there are necessary if/else statements in the loop body.
So one improvement I see is that this approach can be generalized beyond loop support to support any region-bearing op. The loopBoundaryDegrees could be generalized to be something like fixedResultDegrees that signals to the solver that, for any operation present in the map, the solver must hard-code its result degrees as a constraint. One could also enforce that, at any basic block boundary, the block args (both entering and exiting) must always have linear degree. Then you can walk<WalkOrder::PostOrder>([&] (Block *block) {...}) to go block by block, and populate the map after each solve as you do here. In this way, I think you can remove most of the branches that specialize to loops/yields except for the step that populates the map from the solver output.
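The block-by-block scheme described above can be sketched with a toy model. This is Python, not the actual HEIR C++ pass; `solve_block`, `walk_post_order`, and the data layout are hypothetical stand-ins for the real ILP solver and the MLIR walk:

```python
# Toy model of a post-order, block-by-block solve. `fixed` plays the role
# of the generalized loopBoundaryDegrees / fixedResultDegrees map: any op
# present in it has its result degree hard-coded as a constraint.

def solve_block(ops, fixed):
    """Stand-in 'solver': a ciphertext-ciphertext multiply adds key-basis
    degrees; every other op takes the max of its operand degrees."""
    degrees = {}
    def deg(v):  # operands default to linear (degree 1)
        return fixed.get(v, degrees.get(v, 1))
    for name, kind, operands in ops:
        if name in fixed:
            degrees[name] = fixed[name]  # hard constraint from the map
        elif kind == "mul":
            degrees[name] = sum(deg(v) for v in operands)
        else:
            degrees[name] = max((deg(v) for v in operands), default=1)
    return degrees

def walk_post_order(region, fixed):
    # Visit nested regions first (PostOrder), then pin each region-bearing
    # op's result to linear degree before solving the parent block.
    for child_name, child_region in region.get("children", []):
        walk_post_order(child_region, fixed)
        fixed[child_name] = 1  # block-boundary values forced linear
    return solve_block(region["ops"], fixed)

# A loop whose body squares a value (degree 2 inside), used by a multiply
# in the parent block. The loop result is pinned to 1, so m1 gets degree 2.
inner = {"ops": [("m0", "mul", ["x", "x"])]}
outer = {"children": [("loop", inner)],
         "ops": [("m1", "mul", ["loop", "loop"])]}
fixed = {}
degrees = walk_post_order(outer, fixed)
```

The point of the sketch is only the shape of the traversal: the inner solve runs first, its boundary values are pinned to degree 1 in the shared map, and the parent solve then treats them as constants rather than free ILP variables.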
Which brings me to my next improvement: use RegionBranchOpInterface to avoid having to know about the op type at all. Though the terminology around that interface always confuses me, in practice RegionBranchOpInterface allows you to take an op's result and (via getPredecessorValues, I believe) get the program points that forward control flow to the op result (or conversely, getSuccessorRegions). This would allow you to connect, say, all three of an iter_arg, its loop-yielded value, and the corresponding op result without hard-coding anything about affine.for/scf.for or affine.yield/scf.yield.
But even better, it would allow you to use one code path to support all for loops and if statements (and scf.while!).
Even further, you could use this interface technique to make a single global ILP that handles all ops and nested regions in a single formulation. You would use the connection between the region branching points described above to create constraints that effectively say "the degree of an iter_arg == the degree of the init == the degree of the yielded operand == the degree of the op result", but in code you would just loop over predecessors/successors and agnostically add constraints making all of them equal. And that would allow the ILP to find a solution in which the relinearization is delayed across a region boundary.
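The "agnostically add constraints making all of them equal" step can be illustrated with a tiny union-find over value names. The names are hypothetical; in the real formulation these would be ILP variables tied together by equality constraints, with the groups enumerated via the interface's predecessor/successor queries:

```python
# Union-find over SSA value names: each region branch point contributes
# one equivalence group, so the init, the iter_arg, the yielded value,
# and the op result all share a single degree variable in the global ILP.

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# One loop-carried value's branch points, as RegionBranchOpInterface
# would enumerate them (names here are illustrative, not MLIR API).
group = ["init0", "iter_arg0", "yielded0", "result0"]
for a, b in zip(group, group[1:]):
    union(a, b)
```

Collapsing each class to one variable is what would let the global ILP keep a non-linear degree alive across the region boundary, instead of forcing a relinearize at every yield.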
All that said, I am still not sure a global ILP is worth it here. Lazy relinearization is not the most important optimization, IMO, and having one ILP per basic block would have better compile-time performance (i.e., HEIR runtime) and not sacrifice all that much latency because most of the optimization opportunity is inside a loop's body. So in that case, the use of the interface would mainly be to allow you to support any kind of nested region-holding op with control flow (in particular, scf.if) and use it to populate the loopBoundaryDegrees (/ fixedResultDegrees) map without having to switch over a list of supported op types.
Ok, after that huge wall of text, I will also answer your specific questions:
I constrained the initial iter_arg to degree 1
As mentioned in the comment, I think all iter_args should be forced to have linear degree. Partly because of my next answer...
Handling iter_args that was first analysed as non-secret but becomes secret via yield
The way loop support is handled before this pass in the pipeline is to (a) peel the first iteration of a loop when an iter_arg is initialized with a cleartext value, so that iter_args are always invariantly ciphertexts, and (b) the loop is partially unrolled. This means that you should be able to safely ignore secretness discrepancies in the iter_args.
It also adds some context to my thoughts above: since the loops are partially unrolled, you should assume this pass will have sufficiently large blocks to work with and opportunities to do lazy relinearization. This reduces the marginal benefit of deferring relinearization across blocks, and hence the benefit of a global solve vs a block-local solve.
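Step (a), the peeling, can be mimicked on a plain Python loop. This is illustration only; `summed`/`summed_peeled` are hypothetical names, and this is not the MLIR transformation itself:

```python
def summed(xs):
    acc = 0            # cleartext init: the iter_arg would start non-secret
    for x in xs:
        acc = acc + x  # ...and only become ciphertext after one iteration
    return acc

def summed_peeled(xs):
    if not xs:
        return 0       # zero-trip-count case handled outside the loop
    acc = xs[0]        # peeled first iteration: acc starts from a
    for x in xs[1:]:   # (ciphertext) element, never the cleartext 0,
        acc = acc + x  # so the iter_arg is invariantly a ciphertext
    return acc
```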
});
})
.Case<affine::AffineYieldOp, scf::YieldOp>([&](auto op) {
  // For loop yield ops, the degree returned must not exceed the degree
I actually think the degrees should be equal. In particular, when this is lowered to the CKKS scheme, having degrees that are not equal across iter_args will produce a type error.
I think it would be a good and simplifying assumption to enforce by fiat that all iter_args have a linear key basis.
Thank you for explaining your ideas @j2kun and for clarifying my questions. This was really helpful. I agree with you on extending the approach to generalise beyond loop support, especially in cases like the baby-step giant-step kernel you mentioned. I'll take a closer look at RegionBranchOpInterface, since it seems like a clean way to unify handling for loops, conditionals, and other control-flow constructs. Regarding the global vs. block-level ILP, my understanding is that a global ILP could enable more optimal decisions (like delaying relinearization across blocks). However, as you mentioned, partial unrolling creates sufficiently large blocks, so most of the useful lazy relinearization opportunities can be captured locally. Therefore, a block-level ILP seems like a practical approach. I'll continue working on incorporating these ideas and follow up with updates before we merge this PR.
Summary of the changes:
Previously, loops had to be unrolled before running the pass, which optimally inserts mgmt::RelinearizeOp's. Now loops can be processed without unrolling. If there is a nested loop, the pass treats the inner loop as a self-contained ILP problem, solves it to find its output degree, and uses this solved degree as a fixed constraint in the parent loop's ILP solver.
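As a rough model of what the pass optimizes, here is a greedy stand-in for the ILP objective. `lazy_relin` and the degree arithmetic (multiplies add key-basis degrees; relinearization resets to 1) are deliberate simplifications, not the pass's actual formulation:

```python
def lazy_relin(num_muls, max_degree=3):
    """Chain of ciphertext-ciphertext multiplies, each against a fresh
    linear (degree-1) operand. A relinearization (back to degree 1) is
    inserted only when the next multiply would exceed max_degree.
    Returns (relinearize_count, final_degree)."""
    deg, relins = 1, 0
    for _ in range(num_muls):
        if deg + 1 > max_degree:
            deg = 1      # insert a mgmt::RelinearizeOp here
            relins += 1
        deg += 1         # the multiply adds key-basis degrees
    return relins, deg
```

Eager relinearization after every multiply would insert one op per multiply; the lazy placement above needs far fewer, and the ILP generalizes this to arbitrary op graphs. Solving the inner loop first and lifting its boundary degree into the parent solve is what preserves this laziness without unrolling.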
Notes / Open Questions for Discussion:
The initial iter_arg was constrained to degree 1 because getDimension(iter_arg, solver) returns the maximum accumulated degree across all loop iterations.
Is it better to do this handling in SecretnessAnalysis rather than handling it while creating variables?
Fixes #2600