-
Notifications
You must be signed in to change notification settings - Fork 950
[otbn] Define SIMD instructions #29231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f0c84b5 to
81053bf
Compare
nasahlpa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Pascal, this looks really good to me. Just a couple of nits/questions.
5997928 to
1c3cca1
Compare
nasahlpa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the feedback!
h-filali
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @etterli this looks very good. I just have some minor comments and questions.
hw/ip/otbn/data/bignum-insns.yml
Outdated
| The vectors are multiplied elementwise and each product is truncated to the element length. | ||
| The final vector is stored in `wrd`. | ||
|
|
||
| This instruction stalls OTBN for 4 cycles and writes the final result to the ACC WSR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the final result added to the accumulate WSR or just stored there? Or why are we storing to wrd and ACC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instruction operates over 4 cycles and after each cycle there is one vector chunk result which must be stored somewhere. It cannot be written to the WDR before the instruction finishes (otherwise the lane version won't work properly) so we store the partial results in ACC. This way we "construct" the result over multiple cycles and use ACC as buffer. At the end we still must perform a write to the desired WDR.
I agree this sentence should be extended similar to how the mulvm instructions are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think there should be more remarks on how the partial results are accumulated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay that makes sense. That was also what I had in mind. I think what tripped me up a bit is the fact that we are using the accumulate register even though we are not accumulating. Maybe we can mention explicitly that ACC is overwritten and we do not accumulate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"writes the final result to the ACC WSR" I think this includes these points?
1c3cca1 to
6505711
Compare
|
@andreaskurth this PR will break the OTBN regressions as it introduces instruction definitions which at the moment neither the RTL nor the simulator knows. And that's kind of okay as we need to split up the work into multiple PRs. However, I think we should increase the version number and set back the design and verification stages in |
6505711 to
134ff78
Compare
rswarbrick
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me: thanks!
Do we have an update on this? I agree that we should move OTBN to D1/V1. I'd suggest doing this in this PR as #29232 follows afterwards. |
vogelpi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The referenced commit_id is not belonging to a commit on master, but a commit in this PR. Once the PR is merged, the ID will change. This doesn't work.
7240de4 to
134ff78
Compare
|
@vogelpi Thanks for clarifying the process of the version increase. I have now removed the version change commit and will create a new PR once this one is merged to master. |
Okay, sounds good. Thank you! |
This introduces vectorized (SIMD) big number instructions operating on the 256-bit WDRs. These instructions interpret WDRs as vectors of unsigned elements. The width of the elements is for most instructions 32 bits except for a few instructions which support also larger widths. Signed-off-by: Pascal Etterli <[email protected]>
…lanations Adds a simple example how to use the bn.pack and bn.unpack instructions. Signed-off-by: Pascal Etterli <[email protected]>
This illustrates the functionlaty of the bn.trn1 and bn.trn2 instructions. Signed-off-by: Pascal Etterli <[email protected]>
134ff78 to
22f2027
Compare
This PR defines the new SIMD (vectorized) bignum instructions for OTBN operating on the WDRs. The new instructions operate on 8 32-bit elements per WDR:
bn.addv(m): Vectorized addition with optional pseudo-modulo reduction.bn.subv(m): Vectorized subtraction with optional pseudo-modulo reductionbn.mulv(l): Vectorized multiplication with optional 'by lane' mode.bn.mulvm(l): Vectorized Montgomery multiplication (without the conditional subtraction step of Montgomery). With optional 'by lane' mode.bn.trn1/bn.trn2: Instructions to interleave vector elements. This is especially useful for NTT and INTT to shuffle the vector elements when the stride between elements is smaller than what two WDRs provide. These instructions also operate on 64-bit and 128-bit vector elements.bn.shv: Vectorized logic shift instructionbn.pack/bn.unpk: Instructions to pack / unpack vectors from/to 32-bit elements to/from 24-bit element vectors. This allows to save on DMEM space.The instructions are described in more detail in
hw/ip/otbn/data/bignum-insns.yml(first commit). Forbn.pack/bn.unpkandbn.trnsee also the programmer's manual. Alternatively one can build the documentation locally and view the generated ISA manual (useutil/site/build-docs.sh serve).When this is merged, various OTBN DV tests will fail because the Random Instruction Generator (RIG) parses the
bignum-insn.ymlfile and will generate programs including the new instructions. This will result in failing tests because illegal instructions are encountered as these are not yet added to the simulator and RTL.This PR will have minor conflicts with #29025 (only some documentation files). So we should agree on which to merge first (probably #29025 first).