diff --git a/hw/ip/otbn/README.md b/hw/ip/otbn/README.md
index 99f12eeba3408..3d85460ebed9b 100644
--- a/hw/ip/otbn/README.md
+++ b/hw/ip/otbn/README.md
@@ -707,7 +707,7 @@ All read-write (RW) WSRs are set to 0 when OTBN starts an operation (when 1 is w
- The modulus used by the {{#otbn-insn-ref BN.ADDM}} and {{#otbn-insn-ref BN.SUBM}} instructions.
+ The modulus used by the {{#otbn-insn-ref BN.ADDM}} and {{#otbn-insn-ref BN.SUBM}} instructions as well as their vectorized variants.
This WSR is also visible as CSRs `MOD0` through to `MOD7`.
@@ -741,7 +741,7 @@ All read-write (RW) WSRs are set to 0 when OTBN starts an operation (when 1 is w
- The accumulator register used by the {{#otbn-insn-ref BN.MULQACC}} instruction.
+ The accumulator register used by the {{#otbn-insn-ref BN.MULQACC}} instruction and the vectorized multiplication instructions like {{#otbn-insn-ref BN.MULV}}.
diff --git a/hw/ip/otbn/data/bignum-insns.yml b/hw/ip/otbn/data/bignum-insns.yml
index 8ed94ffa6183d..e9c54ddfbdd45 100644
--- a/hw/ip/otbn/data/bignum-insns.yml
+++ b/hw/ip/otbn/data/bignum-insns.yml
@@ -957,3 +957,507 @@
lsu:
type: wsr-store
target: [wsr]
+
+- mnemonic: bn.addv
+ synopsis: Add vector elementwise
+ operands: &bn-addv-operands
+ - name: elen
+ # We use an enum here for consistency with other vector instructions which support multiple element lengths.
+ type: enum(.8s)
+ doc: |
+ Select the bit-width of the vector elements.
+ - `.8s`: WDRs are interpreted as 8 32-bit elements
+ - name: wrd
+ doc: Name of the destination WDR
+ - name: wrs1
+ doc: Name of the first source WDR
+ - name: wrs2
+ doc: Name of the second source WDR
+ syntax: &bn-addv-syntax |
+ , ,
+ glued-ops: true
+ doc: |
+ Add two WDR registers interpreted as vectors.
+
+ The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of size given by `elen`.
+ The vectors are summed elementwise and each sum is truncated to the element length.
+ The final vector is stored in `wrd`.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd]
+ from: [wrs1, wrs2]
+ encoding:
+ scheme: bnva
+ mapping:
+ sub: b0
+ mod: b0
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b000
+ wrd: wrd
+
+- mnemonic: bn.addvm
+ synopsis: Pseudo-Modulo add vector elementwise
+ operands: *bn-addv-operands
+ syntax: *bn-addv-syntax
+ glued-ops: true
+ doc: |
+ Add two WDR registers interpreted as vectors and reduce the results using the MOD WSR.
+
+ The modulus by which the addition results are reduced comes from the bottom word of the MOD WSR at the corresponding element length.
+ For example, for `.8s` the modulus is taken from `MOD[31:0]`.
+ Let `q` be this modulus for the explanation below.
+
+ The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of the same size.
+ The vectors are summed elementwise.
+ If an individual result is equal to or larger than `q` then `q` is subtracted from it.
+ The final results are truncated to the element length and the whole vector is stored in `wrd`.
+
+ This operation correctly implements addition modulo `q`, provided that the intermediate results are less than `2 * q`.
+ The intermediate results are small enough if both inputs are less than `q`.
+ Note that this instruction can also be used to implement the conditional subtraction step of the Montgomery multiplication (see `bn.mulvm`).
+ In this case, one input should be zero and the other input may lie within `[0, 2*q[`, as the instruction subtracts `q` whenever the result is equal to or greater than `q`.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd]
+ from: [wrs1, wrs2, mod]
+ encoding:
+ scheme: bnva
+ mapping:
+ sub: b0
+ mod: b1
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b000
+ wrd: wrd
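The pseudo-modulo addition described for `bn.addvm` can be sketched in Python as follows. This is only an illustrative model under the assumption of the `.8s` element length, not the OTBN simulator implementation; the helper names `to_vec`, `from_vec` and `addvm_8s` are hypothetical.

```python
MASK32 = (1 << 32) - 1


def to_vec(elems):
    """Join eight 32-bit elements (element 0 in the lowest bits) into a 256-bit int."""
    assert len(elems) == 8
    value = 0
    for i, e in enumerate(elems):
        assert 0 <= e <= MASK32
        value |= e << (32 * i)
    return value


def from_vec(value):
    """Split a 256-bit int into eight 32-bit elements (element 0 first)."""
    return [(value >> (32 * i)) & MASK32 for i in range(8)]


def addvm_8s(wrs1, wrs2, mod):
    """Hypothetical model of bn.addvm.8s: add, then subtract q once if sum >= q."""
    q = mod & MASK32  # modulus comes from MOD[31:0]
    out = []
    for x, y in zip(from_vec(wrs1), from_vec(wrs2)):
        s = x + y
        if s >= q:
            s -= q
        out.append(s & MASK32)  # truncate to the element length
    return to_vec(out)
```

Calling `addvm_8s` with a zero vector as one operand models the conditional subtraction use case: every element in `[0, 2*q[` is brought into `[0, q[`.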
+
+- mnemonic: bn.subv
+ synopsis: Subtract vector elementwise
+ operands: *bn-addv-operands
+ syntax: *bn-addv-syntax
+ glued-ops: true
+ doc: |
+ Subtract `wrs2` from `wrs1` interpreted as vectors.
+
+ The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of size given by `elen`.
+ The vectors are subtracted elementwise and each difference is truncated to the element length.
+ The final vector is stored in `wrd`.
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd]
+ from: [wrs1, wrs2]
+ encoding:
+ scheme: bnva
+ mapping:
+ sub: b1
+ mod: b0
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b000
+ wrd: wrd
+
+- mnemonic: bn.subvm
+ synopsis: Pseudo-Modulo subtract vector elementwise
+ operands: *bn-addv-operands
+ syntax: *bn-addv-syntax
+ glued-ops: true
+ doc: |
+ Subtract `wrs2` from `wrs1` interpreted as vectors and reduce the result using the MOD WSR.
+
+ The modulus by which the subtraction results are reduced comes from the bottom word of the MOD WSR at the corresponding element length.
+ For example, for `.8s` the modulus is taken from `MOD[31:0]`.
+ Write `q` for this modulus.
+
+ The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of the same size.
+ The vectors are subtracted elementwise.
+ The intermediate results are treated as a signed number and if an individual result is negative, `q` is added to it.
+ The final results are truncated to the element length and the whole vector is stored in `wrd`.
+
+ This operation correctly implements subtraction modulo `q`, provided that the intermediate results are at least `-q` and at most `q - 1`.
+ This is guaranteed if both inputs are less than `q`.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd]
+ from: [wrs1, wrs2, mod]
+ encoding:
+ scheme: bnva
+ mapping:
+ sub: b1
+ mod: b1
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b000
+ wrd: wrd
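The pseudo-modulo subtraction described for `bn.subvm` can be modelled in the same spirit. The sketch below assumes the `.8s` element length; `split8`, `join8` and `subvm_8s` are hypothetical names and do not come from the OTBN simulator.

```python
MASK32 = (1 << 32) - 1


def split8(value):
    """Split a 256-bit int into eight 32-bit elements (element 0 in the lowest bits)."""
    return [(value >> (32 * i)) & MASK32 for i in range(8)]


def join8(elems):
    """Join eight 32-bit elements back into a 256-bit int."""
    value = 0
    for i, e in enumerate(elems):
        value |= (e & MASK32) << (32 * i)
    return value


def subvm_8s(wrs1, wrs2, mod):
    """Hypothetical model of bn.subvm.8s: subtract, then add q once if negative."""
    q = mod & MASK32  # modulus comes from MOD[31:0]
    out = []
    for x, y in zip(split8(wrs1), split8(wrs2)):
        d = x - y  # Python ints are signed, so a negative difference stays negative
        if d < 0:
            d += q
        out.append(d & MASK32)  # truncate to the element length
    return join8(out)
```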
+
+- mnemonic: bn.mulv
+ synopsis: Elementwise vector multiplication
+ operands: &bn-mulv-operands
+ - name: elen
+ # We use an enum here for consistency with other vector instructions which support multiple element lengths.
+ # We should perhaps extend the framework so that encodings can be specified and some options can be marked as reserved.
+ type: enum(.8s)
+ doc: |
+ Select the bit-width of the vector elements.
+ - `.8s`: WDRs are interpreted as 8 32-bit elements
+ - name: wrd
+ doc: Name of the destination WDR
+ - name: wrs1
+ doc: Name of the first source WDR
+ - name: wrs2
+ doc: Name of the second source WDR
+ syntax: &bn-mulv-syntax |
+ , ,
+ glued-ops: true
+ doc: |
+ Multiply two WDR registers interpreted as vectors.
+
+ The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of size given by `elen`.
+ The vectors are multiplied elementwise and each product is truncated to the element length.
+ The final vector is stored in `wrd`.
+
+ This instruction stalls OTBN for 4 cycles and writes the final result to the ACC WSR as well as `wrd`.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd, acc]
+ from: [wrs1, wrs2]
+ encoding:
+ scheme: bnvm
+ mapping:
+ lane: bxxx
+ use_lane: b0
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b011
+ wrd: wrd
+
+- mnemonic: bn.mulvl
+ synopsis: Elementwise vector multiplication with fixed lane
+ operands: &bn-mulvl-operands
+ - name: elen
+ # We use an enum here for consistency with other vector instructions which support multiple element lengths.
+ # We should perhaps extend the framework so that encodings can be specified and some options can be marked as reserved.
+ type: enum(.8s)
+ doc: |
+ Select the bit-width of the vector elements.
+ - `.8s`: WDRs are interpreted as 8 32-bit elements
+ - name: wrd
+ doc: Destination WDR
+ - name: wrs1
+ doc: First source WDR
+ - name: wrs2
+ doc: Second source WDR
+ - name: lane
+ doc: |
+ Select which element of `wrs2` will be multiplied with all elements of `wrs1`.
+ type: uimm3
+ syntax: &bn-mulvl-syntax |
+ , , ,
+ glued-ops: true
+ doc: |
+ Multiply each vector element of `wrs1` with element `lane` of `wrs2`.
+
+ The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of size given by `elen`.
+ The elements are indexed with `lane` starting from the lowest bits of the WDR.
+ For example, for `.8s` lane 0 specifies the bits `[31:0]` of `wrs2`.
+ Each product is truncated to the element length and the final vector is stored in `wrd`.
+
+ This instruction stalls OTBN for 4 cycles and writes the final result to the ACC WSR as well as `wrd`.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd, acc]
+ from: [wrs1, wrs2]
+ encoding:
+ scheme: bnvm
+ mapping:
+ lane: lane
+ use_lane: b1
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b011
+ wrd: wrd
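The lane selection of `bn.mulvl` can be illustrated with a small Python model. It assumes the `.8s` element length and models only the value written to `wrd`; the ACC write and the stall cycles are not modelled, and the names `elems_of`, `vec_of` and `mulvl_8s` are hypothetical.

```python
MASK32 = (1 << 32) - 1


def elems_of(value):
    """Split a 256-bit int into eight 32-bit elements (element 0 in the lowest bits)."""
    return [(value >> (32 * i)) & MASK32 for i in range(8)]


def vec_of(elems):
    """Join eight 32-bit elements back into a 256-bit int."""
    value = 0
    for i, e in enumerate(elems):
        value |= (e & MASK32) << (32 * i)
    return value


def mulvl_8s(wrs1, wrs2, lane):
    """Hypothetical model: multiply every element of wrs1 by element `lane` of wrs2."""
    assert 0 <= lane <= 7
    factor = elems_of(wrs2)[lane]  # lane 0 is bits [31:0] of wrs2
    return vec_of([(x * factor) & MASK32 for x in elems_of(wrs1)])
```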
+
+- mnemonic: bn.mulvm
+ synopsis: Elementwise vector Montgomery multiplication
+ operands: *bn-mulv-operands
+ syntax: *bn-mulv-syntax
+ glued-ops: true
+ doc: |
+ Performs a Montgomery multiplication without the final conditional subtraction step of the Montgomery algorithm.
+ The values in `wrs1` and `wrs2` are interpreted as vectors with elements of size given by `elen` and are considered unsigned.
+ Each result is truncated to the element length and the final vector is stored in `wrd`.
+ This instruction outputs the result in `[0, 2q[` as the final conditional subtraction step is not performed.
+ For a result in `[0, q[`, a conditional subtraction must be performed after this instruction, e.g., with `BN.ADDVM` and a zero vector.
+
+ This instruction stalls OTBN for 12 cycles and writes the final result to the ACC WSR as well as `wrd`.
+
+ **Montgomery explanation**\\
+ Inputs:
+ - `a, b`: Operands in `[0, q[`
+ - `d`: Bit-width of operands
+ - `q`: Modulus in `]0, 2^d]`
+ - `mu`: Montgomery constant, precomputed, `mu = (-q)^(-1) mod 2^d`
+
+ The operands must be pre-transformed into Montgomery space with `a = a_orig * 2^d mod q`.
+ This must be ensured by the programmer.
+
+ The Montgomery multiplication is defined as:
+ ```
+ r = a*b * 2^(-d) mod q
+ ```
+ where `r` is in `[0, q[`.
+
+ This can be computed with the following steps where `[]_d` are the lower `d` bits and `[]^d` are the higher `d` bits:
+ ```
+ c = a*b
+ r = [c + [[c]_d * mu]_d * q]^d
+ if r >= q then // not implemented in hardware
+ return r - q
+ return r
+ ```
+
+ This instruction computes these steps except the final conditional subtraction (`r >= q`) to optimize area and timing.
+ The result is thus in `[0, 2q[` and must be reduced to `[0, q[` by the programmer.
+ This can be performed in software by using a pseudo modulo addition (`BN.ADDVM`) with a zero vector.
+
+ Note that when chaining multiplications, the conditional subtraction can be postponed until after the last multiplication, provided that the initial inputs are in `[0, 2q[` and `q < (2^d)/4` holds.
+
+ **Pre-requisites**\\
+ This instruction requires the modulus `q` and its corresponding Montgomery constant `mu` to be placed in the MOD WSR at the following locations:\\
+ For `.8s`: `q @ MOD[31:0]`, `mu @ MOD[63:32]`
+
+ **Micro-architectural remarks**\\
+ The ACC WSR is used to temporarily store the partial result before the complete result is written into the destination WDR.
+ It is reset at the start of the instruction but not reset at the end of the instruction.
+ Similarly, the intermediate multiplication results are stored in two hidden registers.
+ These registers are cleared to all-zero after each Montgomery computation (i.e., every 3 cycles).
+
+ If required for cryptographic reasons, all three registers can be overwritten using a `BN.WSRW` instruction to the ACC WSR.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd, acc]
+ from: [wrs1, wrs2, mod]
+ encoding:
+ scheme: bnvm
+ mapping:
+ lane: bxxx
+ use_lane: b0
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b100
+ wrd: wrd
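The Montgomery steps listed in the `bn.mulvm` documentation can be checked on a single element with the Python sketch below. It is a model under stated assumptions, not the hardware or simulator implementation: `d = 32` is assumed, `q` must be odd so that `mu` exists, and the function names are hypothetical. The modular inverse via `pow(x, -1, m)` needs Python 3.8 or newer.

```python
def montgomery_constant(q, d=32):
    """Compute mu = (-q)^(-1) mod 2^d (requires an odd q)."""
    r = 1 << d
    return pow((-q) % r, -1, r)


def montmul_no_final_sub(a, b, q, mu, d=32):
    """Per-element model of the documented steps, without the final subtraction."""
    mask = (1 << d) - 1
    c = a * b
    m = ((c & mask) * mu) & mask       # [[c]_d * mu]_d
    return ((c + m * q) >> d) & mask   # upper bits, truncated to the element length


def cond_sub(r, q):
    """The conditional subtraction that software performs, e.g. via BN.ADDVM."""
    return r - q if r >= q else r
```

For operands already in Montgomery space (`a = a_orig * 2^d mod q`), `cond_sub(montmul_no_final_sub(a, b, q, mu), q)` yields the product, still in Montgomery space, reduced to `[0, q[`.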
+
+- mnemonic: bn.mulvml
+ synopsis: Elementwise vector Montgomery multiplication with fixed lane
+ operands: *bn-mulvl-operands
+ syntax: *bn-mulvl-syntax
+ glued-ops: true
+ doc: |
+ Perform a Montgomery multiplication without the final conditional subtraction on each element of `wrs1` with element `lane` of `wrs2`.
+ The elements are indexed with `lane` starting from the lowest bits of the WDR.
+ For example, for `.8s` lane 0 specifies the bits `[31:0]` of `wrs2`.
+
+ See `BN.MULVM` for explanations about the Montgomery algorithm and the implementation details, especially regarding security.
+ For a correct result, a conditional subtraction must be performed after this instruction, e.g., with `BN.ADDVM`.
+
+ This instruction requires the modulus `q` and its corresponding Montgomery constant `mu` to be placed in the MOD WSR at the following location:\\
+ For `.8s`: `q @ MOD[31:0]`, `mu @ MOD[63:32]`
+
+ This instruction stalls OTBN for 12 cycles and writes the final result to the ACC WSR as well as `wrd`.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd, acc]
+ from: [wrs1, wrs2, mod]
+ encoding:
+ scheme: bnvm
+ mapping:
+ lane: lane
+ use_lane: b1
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b100
+ wrd: wrd
+
+- mnemonic: bn.trn1
+ synopsis: Interleave vectors in even fashion
+ operands: &bn-trn-operands
+ - name: elen
+ type: enum(.8s,.4d,.2q)
+ doc: |
+ Select the bit-width of the vector elements.
+ - `.8s`: WDRs are interpreted as 8 32-bit elements
+ - `.4d`: WDRs are interpreted as 4 64-bit elements
+ - `.2q`: WDRs are interpreted as 2 128-bit elements
+ - name: wrd
+ doc: Destination WDR
+ - name: wrs1
+ doc: First source WDR
+ - name: wrs2
+ syntax: &bn-trn-syntax
+ , ,
+ glued-ops: true
+ doc: |
+ Even-numbered vector elements from `wrs1` are placed into even-numbered elements of `wrd`.
+ Even-numbered vector elements from `wrs2` are placed into odd-numbered elements of `wrd`.
+
+ For `.8s`, this can be described as `wrd[i] = wrs1[i]` and `wrd[i+1] = wrs2[i]` for `i in [0, 2, .., (256//32)-2]` which results in `wrd = {..., wrs2[2], wrs1[2], wrs2[0], wrs1[0]}`.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd]
+ from: [wrs1, wrs2]
+ encoding:
+ scheme: bnvtrn
+ mapping:
+ odd: b0
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b101
+ wrd: wrd
+
+- mnemonic: bn.trn2
+ synopsis: Interleave vectors in odd fashion
+ operands: *bn-trn-operands
+ syntax: *bn-trn-syntax
+ glued-ops: true
+ doc: |
+ Odd-numbered vector elements from `wrs1` are placed into even-numbered elements of `wrd`.
+ Odd-numbered vector elements from `wrs2` are placed into odd-numbered elements of `wrd`.
+
+ For `.8s`, this can be described as `wrd[i] = wrs1[i+1]` and `wrd[i+1] = wrs2[i+1]` for `i in [0, 2, .., (256//32)-2]` which results in `wrd = {..., wrs2[3], wrs1[3], wrs2[1], wrs1[1]}`.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd]
+ from: [wrs1, wrs2]
+ encoding:
+ scheme: bnvtrn
+ mapping:
+ odd: b1
+ elen: elen
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b101
+ wrd: wrd
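The element shuffling performed by `bn.trn1` and `bn.trn2` can be sketched in Python as follows. The model follows the index description in the instruction documentation; the names `join`, `trn`, `trn1` and `trn2` are hypothetical, and the element width in bits is passed explicitly instead of an `elen` suffix.

```python
def join(elems, elem_bits):
    """Join elements (element 0 in the lowest bits) into a 256-bit int."""
    assert len(elems) * elem_bits == 256
    value = 0
    for i, e in enumerate(elems):
        value |= (e & ((1 << elem_bits) - 1)) << (i * elem_bits)
    return value


def trn(wrs1, wrs2, elem_bits, odd):
    """Interleave the even (odd=False) or odd (odd=True) elements of two vectors."""
    assert elem_bits in (32, 64, 128)
    mask = (1 << elem_bits) - 1
    wrd = 0
    for i in range(0, 256 // elem_bits, 2):
        src = i + 1 if odd else i
        wrd |= ((wrs1 >> (src * elem_bits)) & mask) << (i * elem_bits)
        wrd |= ((wrs2 >> (src * elem_bits)) & mask) << ((i + 1) * elem_bits)
    return wrd


def trn1(wrs1, wrs2, elem_bits):
    return trn(wrs1, wrs2, elem_bits, odd=False)


def trn2(wrs1, wrs2, elem_bits):
    return trn(wrs1, wrs2, elem_bits, odd=True)
```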
+
+- mnemonic: bn.shv
+ synopsis: Bitwise logical shift of vector elements
+ operands:
+ - name: elen
+ type: enum(.8s)
+ doc: |
+ Select the bit-width of the vector elements.
+ - `.8s`: WDRs are interpreted as 8 32-bit elements
+ - name: wrd
+ doc: Name of the destination WDR
+ - name: wrs
+ doc: Name of the source WDR
+ - name: shift_type
+ abbrev: st
+ type: enum(<<, >>)
+ doc: |
+ The direction of the shift applied to elements of `wrs`.
+ - name: shift_bits
+ abbrev: sb
+ type: uimm5
+ doc: Number of bits by which to shift elements in `wrs`.
+ syntax: ,
+ glued-ops: true
+ doc: |
+ Logically shift each element of vector `wrs` by `shift_bits` in `shift_type` direction.
+ The elements are considered as unsigned and their size is given by `elen`.
+
+ Flags are not used or saved.
+ iflow:
+ - to: [wrd]
+ from: [wrs]
+ encoding:
+ scheme: bnvsh
+ mapping:
+ shift_type: shift_type
+ shift_bits: shift_bits
+ elen: elen
+ wrs: wrs
+ funct3: b111
+ wrd: wrd
+
+- mnemonic: bn.pack
+ synopsis: Pack 32-bit vectors into a 24-bit dense representation
+ operands:
+ - name: wrd
+ doc: Name of the destination WDR
+ - name: wrs1
+ doc: Name of the first source WDR
+ - name: wrs2
+ doc: Name of the second source WDR
+ - name: shift_bits
+ abbrev: sb
+ type: uimm2<<6
+ doc: |
+ The number of bits to shift the concatenated dense vectors.
+ syntax: |
+ , , ,
+ doc: |
+ Compresses the content of the WDRs referenced by `wrs1` and `wrs2` by extracting the lower 24 bits from each of their eight 32-bit elements to form 192-bit dense representations (8 * 24-bit).
+ The dense representations are concatenated (`wrs1` forms the upper part) and 64 zero bits are appended to the left and right to form the 512-bit vector `{64'd0, dense1, dense2, 64'd0}`.
+ This combination is then shifted to the right by `shift_bits` bits and the lowest 256 bits are stored to `wrd`.
+ iflow:
+ - to: [wrd]
+ from: [wrs1, wrs2]
+ encoding:
+ scheme: bnpk
+ mapping:
+ is_pack: b1
+ shift: shift_bits
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b110
+ wrd: wrd
+
+- mnemonic: bn.unpk
+ synopsis: Expand 24-bit packed vectors into 32-bit vectors
+ operands:
+ - name: wrd
+ doc: Name of the destination WDR
+ - name: wrs1
+ doc: Name of the first source WDR
+ - name: wrs2
+ doc: Name of the second source WDR
+ - name: shift_bits
+ abbrev: sb
+ type: uimm2<<6
+ doc: |
+ The number of bits to shift the concatenated WDRs before unpacking.
+ syntax: |
+ , , ,
+ doc: |
+ Expands 8 elements from a 24-bit packed format into a 32-bit vector format.
+ Concatenates the content of WDRs referenced by `wrs1` and `wrs2` (`wrs1` forms the upper part) and shifts it right by the specified number of bits.
+ Finally, the lowest 8 24-bit elements are unpacked and zero-extended into 32-bit elements and stored in `wrd`.
+ iflow:
+ - to: [wrd]
+ from: [wrs1, wrs2]
+ encoding:
+ scheme: bnpk
+ mapping:
+ is_pack: b0
+ shift: shift_bits
+ wrs2: wrs2
+ wrs1: wrs1
+ funct3: b110
+ wrd: wrd
diff --git a/hw/ip/otbn/data/enc-schemes.yml b/hw/ip/otbn/data/enc-schemes.yml
index adb992897d47a..b88e9245d1d1e 100644
--- a/hw/ip/otbn/data/enc-schemes.yml
+++ b/hw/ip/otbn/data/enc-schemes.yml
@@ -147,6 +147,11 @@ custom3:
parents:
- rv(opcode=b11110)
+# A partial scheme for custom instructions with opcode b10110
+custom4:
+ parents:
+ - rv(opcode=b10110)
+
# A partial scheme for instructions that produce a dest WDR.
wrd:
fields:
@@ -366,3 +371,93 @@ wcsr:
fixed:
bits: 30-28
value: bxxx
+
+# Bignum vectorized instructions
+#
+# The following encodings describe the vectorized instructions. The encoding foresees possible
+# future extensions regarding the vector element width as well as new instructions.
+
+# A partial scheme for the datatype used by vectorized instructions. Specifies the
+# width/length of a vector element. Defines the element lengths: 32, 64 and 128 bits.
+bnelen:
+ fields:
+ elen: 26-25
+
+# Big number vectorized arithmetic scheme
+# Used by bn.addv, bn.addvm, bn.subv, and bn.subvm
+bnva:
+ parents:
+ - custom4
+ - wdr3
+ - funct3
+ - bnelen
+ fields:
+ sub: 30 # bit to indicate whether operation is subtraction
+ mod: 28 # bit to indicate whether operation is pseudo modulo
+ fixed:
+ # Bit 27 is foreseen to indicate whether operation should consider the carry flag
+ # Bit 29 is unused
+ # Bit 31 is foreseen to select the flag group (use partial scheme `fg` as parent)
+ bits: 31,29,27
+ value: bxxx
+
+# Used by bn.mulv(l) and bn.mulvm(l)
+bnvm:
+ parents:
+ - custom4
+ - wdr3
+ - funct3
+ - bnelen
+ fields:
+ lane: 30-28
+ use_lane: 27
+ unused_lane_bit:
+ # Bit 31 is foreseen to select lane for 16-bit vectors.
+ bits: 31
+ value: bx
+
+# Used by bn.trn1 and bn.trn2
+bnvtrn:
+ parents:
+ - custom4
+ - wdr3
+ - funct3
+ - bnelen
+ fields:
+ odd: 30
+ fixed:
+ bits: 31,29-27
+ value: bxxxx
+
+# Used by bn.shv
+bnvsh:
+ parents:
+ - custom4
+ - wrd
+ - funct3
+ - bnelen
+ fields:
+ shift_type: 30
+ # Bits 28-27,19-15 are foreseen to encode a 7-bit shift immediate to support 128-bit elements.
+ # Currently only 32-bit is supported. Thus, the shift amount is restricted to 5 bits.
+ shift_bits: 19-15
+ wrs: 24-20
+ fixed:
+ # Bit 31 is foreseen to select the flag group (use partial scheme `fg` as parent)
+ # Bit 29 is unused
+ # Bits 28-27 are foreseen for larger shift immediates
+ bits: 31,29-27
+ value: bxxxx
+
+# Used by bn.pack and bn.unpk
+bnpk:
+ parents:
+ - custom4
+ - wdr3
+ - funct3
+ fields:
+ is_pack: 30
+ shift: 28-27
+ fixed:
+ bits: 31,29,26-25
+ value: bxxxx
diff --git a/hw/ip/otbn/data/wsr.yml b/hw/ip/otbn/data/wsr.yml
index 62f8b72a43948..b5fdcdf32b72d 100644
--- a/hw/ip/otbn/data/wsr.yml
+++ b/hw/ip/otbn/data/wsr.yml
@@ -5,7 +5,7 @@
- name: mod
address: 0
doc: |
- The modulus used by the {{#otbn-insn-ref BN.ADDM}} and {{#otbn-insn-ref BN.SUBM}} instructions.
+ The modulus used by the {{#otbn-insn-ref BN.ADDM}} and {{#otbn-insn-ref BN.SUBM}} instructions as well as their vectorized variants.
This WSR is also visible as CSRs `MOD0` through to `MOD7`.
- name: rnd
@@ -32,7 +32,7 @@
- name: acc
address: 3
doc: |
- The accumulator register used by the {{#otbn-insn-ref BN.MULQACC}} instruction.
+ The accumulator register used by the {{#otbn-insn-ref BN.MULQACC}} instruction and the vectorized multiplication instructions like {{#otbn-insn-ref BN.MULV}}.
- name: key_s0_l
address: 4
diff --git a/hw/ip/otbn/doc/bn_trn_illustration.svg b/hw/ip/otbn/doc/bn_trn_illustration.svg
new file mode 100644
index 0000000000000..d50dfffdc5193
--- /dev/null
+++ b/hw/ip/otbn/doc/bn_trn_illustration.svg
@@ -0,0 +1,3 @@
+
+
+
\ No newline at end of file
diff --git a/hw/ip/otbn/doc/isa.md b/hw/ip/otbn/doc/isa.md
index c2d3944474ec0..e5cc7bffa7ce9 100644
--- a/hw/ip/otbn/doc/isa.md
+++ b/hw/ip/otbn/doc/isa.md
@@ -9,6 +9,7 @@ The base subset (described first) is similar to RISC-V's RV32I instruction set.
It also includes a hardware call stack and hardware loop instructions.
The big number subset is designed to operate on 256b WDRs.
It doesn't include any control flow instructions, and just supports load/store, logical and arithmetic operations.
+SIMD variants exist for some of the logical and arithmetic big number operations; these variants interpret the WDRs as vectors.
In the instruction documentation that follows, each instruction has a syntax example.
For example, the `SW` instruction has syntax:
@@ -100,6 +101,8 @@ def extract_quarter_word(value: int, qwsel: int) -> int:
assert 0 <= value < (1 << 256)
assert 0 <= qwsel <= 3
return (value >> (qwsel * 64)) & ((1 << 64) - 1)
+
+# TODO: Add new helpers for SIMD instructions once the simulator implementation exists.
```
# Errors
diff --git a/hw/ip/otbn/doc/pack_instruction_shifting.svg b/hw/ip/otbn/doc/pack_instruction_shifting.svg
new file mode 100644
index 0000000000000..e96a6760cdfa0
--- /dev/null
+++ b/hw/ip/otbn/doc/pack_instruction_shifting.svg
@@ -0,0 +1,4 @@
+
+
+
+
\ No newline at end of file
diff --git a/hw/ip/otbn/doc/packed_format.svg b/hw/ip/otbn/doc/packed_format.svg
new file mode 100644
index 0000000000000..2956f799a1e1b
--- /dev/null
+++ b/hw/ip/otbn/doc/packed_format.svg
@@ -0,0 +1,4 @@
+
+
+
+
\ No newline at end of file
diff --git a/hw/ip/otbn/doc/programmers_guide.md b/hw/ip/otbn/doc/programmers_guide.md
index f8eb32f1fef9c..b8b0abc6eff1c 100644
--- a/hw/ip/otbn/doc/programmers_guide.md
+++ b/hw/ip/otbn/doc/programmers_guide.md
@@ -353,6 +353,68 @@ The outlined technique can be extended to arbitrary bit widths but requires unro
Code snippets giving examples of 256x256 and 384x384 multiplies can be found in `sw/otbn/code-snippets/mul256.s` and `sw/otbn/code-snippets/mul384.s`.
+### Packing and unpacking 24-bit element vectors
+The vectorized subset of Bignum instructions enables SIMD computation on 32-bit elements.
+However, some PQC algorithms operate on smaller values.
+To optimize the memory footprint of such programs, vectors can be compressed and stored in memory in a dense 24-bit format.
+The `bn.pack` and `bn.unpk` instructions convert 32-bit vectors into a dense 24-bit representation and vice versa as described in the [ISA manual](./isa.md).
+These packed vectors can then be stored in the memory as shown below.
+
+
+
+To pack vectors one can use the following snippet:
+```
+/*
+ * Assume we have 4 vectors with 8 32-bit elements currently in WDRs w0-w3
+ * which we want to store in the packed format.
+ * The color in the image corresponds to the WDRs as follows:
+ * w0: Red vector
+ * w1: Yellow vector
+ * w2: Green vector
+ * w3: Blue vector
+ */
+
+/* Pack the vectors into temporary WDRs */
+bn.pack w10, w1, w0, 64
+bn.pack w11, w2, w1, 128
+bn.pack w12, w3, w2, 192
+
+/* Store packed vectors to memory */
+...
+```
+The inner workings of the `bn.pack` instruction are visualized in the following figure for the case of `bn.pack w11, w2, w1, `.
+The two vectors are first converted into a dense format (192 bits each), then concatenated with additional zero bits.
+Finally, the 512 bits are shifted to produce the marked 256 bits which are stored to the destination WDR.
+This allows one to construct all the required packings.
+
+
+
+The unpacking works by concatenating two 256-bit strings loaded from memory and shifting the desired bits to the lower 192 bits.
+These 192 bits are then expanded to 8x 32 bits by inserting zero bytes every 3 bytes.
+```
+/*
+ * Load packed vectors from memory into WDRs w10-w12 such that:
+ * w10 corresponds to the 1st line in the first image
+ * w11 corresponds to the 2nd line in the first image
+ * w12 corresponds to the 3rd line in the first image
+ */
+...
+
+/* Unpack vectors */
+bn.unpk w0, w11, w10, 0 /* unpack the red vector to w0 */
+bn.unpk w1, w11, w10, 192 /* unpack the yellow vector to w1 */
+bn.unpk w2, w12, w11, 128 /* unpack the green vector to w2 */
+bn.unpk w3, wXX, w12, 64 /* unpack the blue vector to w3, wXX represents that any WDR can be used */
+```
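The pack and unpack sequences above can be cross-checked with a small Python model of the two instructions. The sketch follows the textual description of `bn.pack` and `bn.unpk` in the ISA documentation; it is not the simulator implementation, and the names `dense`, `pack`, `unpk` and `make_vec` as well as the example values are hypothetical.

```python
MASK192 = (1 << 192) - 1
MASK256 = (1 << 256) - 1


def dense(w):
    """Keep the low 24 bits of each of the eight 32-bit elements (192 bits total)."""
    out = 0
    for i in range(8):
        out |= ((w >> (32 * i)) & 0xFFFFFF) << (24 * i)
    return out


def pack(wrs1, wrs2, shift):
    """Model of bn.pack: {64'd0, dense(wrs1), dense(wrs2), 64'd0} >> shift, low 256 bits."""
    assert shift in (0, 64, 128, 192)
    combined = (dense(wrs1) << 256) | (dense(wrs2) << 64)
    return (combined >> shift) & MASK256


def unpk(wrs1, wrs2, shift):
    """Model of bn.unpk: {wrs1, wrs2} >> shift, then expand the low 8 x 24 bits."""
    assert shift in (0, 64, 128, 192)
    packed = (((wrs1 << 256) | wrs2) >> shift) & MASK192
    out = 0
    for i in range(8):
        out |= ((packed >> (24 * i)) & 0xFFFFFF) << (32 * i)
    return out


def make_vec(base):
    """Build an example vector whose eight elements all fit in 24 bits."""
    return sum(((base + i) & 0xFFFFFF) << (32 * i) for i in range(8))


# Example values standing in for the red, yellow, green and blue vectors.
w0, w1, w2, w3 = (make_vec(b) for b in (0x111111, 0x222222, 0x333333, 0xFFFFF0))

# Same operand order and shift amounts as the packing snippet above.
w10 = pack(w1, w0, 64)
w11 = pack(w2, w1, 128)
w12 = pack(w3, w2, 192)
```

Applying `unpk` with the operands and shifts of the unpacking snippet recovers the four example vectors, as long as every element fits in 24 bits.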
+
+### Transposing vector elements
+To efficiently shuffle vectors, one can use the `bn.trn1` and `bn.trn2` instructions.
+These instructions reorder the vector elements as illustrated in the image below for `bn.trn1.4d` and `bn.trn2.4d`.
+- The `bn.trn1 wrd, wrs1, wrs2` instruction places even-indexed vector elements from `wrs1` into even-indexed elements of `wrd` and even-indexed vector elements from `wrs2` into odd-indexed elements of `wrd`.
+- The `bn.trn2 wrd, wrs1, wrs2` instruction places odd-indexed vector elements from `wrs1` into even-indexed elements of `wrd` and odd-indexed vector elements from `wrs2` into odd-indexed elements of `wrd`.
+
+
+
## Device Interface Functions (DIFs)
- [Device Interface Functions](../../../../sw/device/lib/dif/dif_otbn.h)
diff --git a/hw/ip/otbn/doc/theory_of_operation.md b/hw/ip/otbn/doc/theory_of_operation.md
index 96eb1378d4099..e124c068b5279 100644
--- a/hw/ip/otbn/doc/theory_of_operation.md
+++ b/hw/ip/otbn/doc/theory_of_operation.md
@@ -515,7 +515,7 @@ OTBN provides a mechanism to securely wipe all internal state, excluding the ins
The following state is wiped:
* Register files: GPRs and WDRs
-* The accumulator register (also accessible through the ACC WSR)
+* The accumulator register (also accessible through the ACC WSR) and the intermediate result registers for the Montgomery computation (hidden registers)
* Flags (accessible through the FG0, FG1, and FLAGS CSRs)
* The modulus (accessible through the MOD0 to MOD7 CSRs and the MOD WSR)