From 56335b4849d115e6f16b354c273ad09dab6e0d39 Mon Sep 17 00:00:00 2001 From: Pascal Etterli Date: Tue, 10 Feb 2026 10:38:11 +0100 Subject: [PATCH 1/3] [otbn] Define vectorized OTBN bignum instructions This introduces vectorized (SIMD) big number instructions operating on the 256-bit WDRs. These instructions interpret WDRs as vectors of unsigned elements. Most of these instructions operate on 32-bit elements; a few also support larger widths. Signed-off-by: Pascal Etterli --- hw/ip/otbn/README.md | 4 +- hw/ip/otbn/data/bignum-insns.yml | 504 ++++++++++++++++++++++++++ hw/ip/otbn/data/enc-schemes.yml | 95 +++++ hw/ip/otbn/data/wsr.yml | 4 +- hw/ip/otbn/doc/isa.md | 3 + hw/ip/otbn/doc/theory_of_operation.md | 2 +- 6 files changed, 607 insertions(+), 5 deletions(-) diff --git a/hw/ip/otbn/README.md b/hw/ip/otbn/README.md index 99f12eeba3408..3d85460ebed9b 100644 --- a/hw/ip/otbn/README.md +++ b/hw/ip/otbn/README.md @@ -707,7 +707,7 @@ All read-write (RW) WSRs are set to 0 when OTBN starts an operation (when 1 is w RW MOD - The modulus used by the {{#otbn-insn-ref BN.ADDM}} and {{#otbn-insn-ref BN.SUBM}} instructions. + The modulus used by the {{#otbn-insn-ref BN.ADDM}} and {{#otbn-insn-ref BN.SUBM}} instructions as well as their vectorized variants. This WSR is also visible as CSRs `MOD0` through to `MOD7`. @@ -741,7 +741,7 @@ All read-write (RW) WSRs are set to 0 when OTBN starts an operation (when 1 is w RW ACC - The accumulator register used by the {{#otbn-insn-ref BN.MULQACC}} instruction. + The accumulator register used by the {{#otbn-insn-ref BN.MULQACC}} instruction and the vectorized multiplication instructions like {{#otbn-insn-ref BN.MULV}}. 
diff --git a/hw/ip/otbn/data/bignum-insns.yml b/hw/ip/otbn/data/bignum-insns.yml index 8ed94ffa6183d..e9c54ddfbdd45 100644 --- a/hw/ip/otbn/data/bignum-insns.yml +++ b/hw/ip/otbn/data/bignum-insns.yml @@ -957,3 +957,507 @@ lsu: type: wsr-store target: [wsr] + +- mnemonic: bn.addv + synopsis: Add vector elementwise + operands: &bn-addv-operands + - name: elen + # We use an enum here for consistency with other vector instructions which support multiple element lengths. + type: enum(.8s) + doc: | + Select the bit-width of the vector elements. + - `.8s`: WDRs are interpreted as 8 32-bit elements + - name: wrd + doc: Name of the destination WDR + - name: wrs1 + doc: Name of the first source WDR + - name: wrs2 + doc: Name of the second source WDR + syntax: &bn-addv-syntax | + , , + glued-ops: true + doc: | + Add two WDR registers interpreted as vectors. + + The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of size given by `elen`. + The vectors are summed elementwise and each sum is truncated to the element length. + The final vector is stored in `wrd`. + + Flags are not used or saved. + iflow: + - to: [wrd] + from: [wrs1, wrs2] + encoding: + scheme: bnva + mapping: + sub: b0 + mod: b0 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b000 + wrd: wrd + +- mnemonic: bn.addvm + synopsis: Pseudo-Modulo add vector elementwise + operands: *bn-addv-operands + syntax: *bn-addv-syntax + glued-ops: true + doc: | + Add two WDR registers interpreted as vectors and reduce the results using the MOD WSR. + + The modulus by which the addition results are reduced comes from the bottom word of the MOD WSR at the corresponding element length. + For example, for `.8s` the modulus is taken from `MOD[31:0]`. + Let `q` be this modulus for the explanation below. + + The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of the same size. + The vectors are summed elementwise. 
+ If an individual result is equal to or larger than `q` then `q` is subtracted from it. + The final results are truncated to the element length and the whole vector is stored in `wrd`. + + This operation correctly implements addition modulo `q`, providing that the intermediate results are less than `2 * q`. + The intermediate results are small enough if both inputs are less than `q`. + Note, this instruction can be used to implement the conditional subtraction step for the Montgomery multiplication (see `bn.mulvm`). + For this case, one input should be zero and the second input can be within `[0, 2*q[` as the instruction will subtract `q` when the result is equal to or greater than `q`. + + Flags are not used or saved. + iflow: + - to: [wrd] + from: [wrs1, wrs2, mod] + encoding: + scheme: bnva + mapping: + sub: b0 + mod: b1 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b000 + wrd: wrd + +- mnemonic: bn.subv + synopsis: Subtract vector elementwise + operands: *bn-addv-operands + syntax: *bn-addv-syntax + glued-ops: true + doc: | + Subtract `wrs2` from `wrs1` interpreted as vectors. + + The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of size given by `elen`. + The vectors are subtracted elementwise and each difference is truncated to the element length. + The final vector is stored in `wrd`. + Flags are not used or saved. + iflow: + - to: [wrd] + from: [wrs1, wrs2] + encoding: + scheme: bnva + mapping: + sub: b1 + mod: b0 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b000 + wrd: wrd + +- mnemonic: bn.subvm + synopsis: Pseudo-Modulo subtract vector elementwise + operands: *bn-addv-operands + syntax: *bn-addv-syntax + glued-ops: true + doc: | + Subtract `wrs2` from `wrs1` interpreted as vectors and reduce the result using the MOD WSR. + + The modulus by which the subtraction results are reduced comes from the bottom word of the MOD WSR at the corresponding element length. 
+ For example, for `.8s` the modulus is taken from `MOD[31:0]`. + Write `q` for this modulus. + + The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of the same size. + The vectors are subtracted elementwise. + The intermediate results are treated as signed numbers and if an individual result is negative, `q` is added to it. + The final results are truncated to the element length and the whole vector is stored in `wrd`. + + This operation correctly implements subtraction modulo `q`, providing that the intermediate results are at least `-q` and at most `q - 1`. + This is guaranteed if both inputs are less than `q`. + + Flags are not used or saved. + iflow: + - to: [wrd] + from: [wrs1, wrs2, mod] + encoding: + scheme: bnva + mapping: + sub: b1 + mod: b1 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b000 + wrd: wrd + +- mnemonic: bn.mulv + synopsis: Elementwise vector multiplication + operands: &bn-mulv-operands + - name: elen + # We use an enum here for consistency with other vector instructions which support multiple element lengths. + # We maybe should extend the framework such that encodings can be specified and some options can be marked as reserved. + type: enum(.8s) + doc: | + Select the bit-width of the vector elements. + - `.8s`: WDRs are interpreted as 8 32-bit elements + - name: wrd + doc: Name of the destination WDR + - name: wrs1 + doc: Name of the first source WDR + - name: wrs2 + doc: Name of the second source WDR + syntax: &bn-mulv-syntax | + , , + glued-ops: true + doc: | + Multiply two WDR registers interpreted as vectors. + + The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of size given by `elen`. + The vectors are multiplied elementwise and each product is truncated to the element length. + The final vector is stored in `wrd`. + + This instruction stalls OTBN for 4 cycles and writes the final result to the ACC WSR as well as `wrd`. + + Flags are not used or saved. 
+ iflow: + - to: [wrd, acc] + from: [wrs1, wrs2] + encoding: + scheme: bnvm + mapping: + lane: bxxx + use_lane: b0 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b011 + wrd: wrd + +- mnemonic: bn.mulvl + synopsis: Elementwise vector multiplication with fixed lane + operands: &bn-mulvl-operands + - name: elen + # We use an enum here for consistency with other vector instructions which support multiple element lengths. + # We maybe should extend the framework such that encodings can be specified and some options can be marked as reserved. + type: enum(.8s) + doc: | + Select the bit-width of the vector elements. + - `.8s`: WDRs are interpreted as 8 32-bit elements + - name: wrd + doc: Destination WDR + - name: wrs1 + doc: First source WDR + - name: wrs2 + doc: Second source WDR + - name: lane + doc: | + Select which element of `wrs2` will be multiplied with all elements of `wrs1`. + type: uimm3 + syntax: &bn-mulvl-syntax | + , , , + glued-ops: true + doc: | + Multiply each vector element of `wrs1` with element `lane` of `wrs2`. + + The WDRs `wrs1` and `wrs2` are interpreted as vectors with unsigned elements of size given by `elen`. + The elements are indexed with `lane` starting from the lowest bits of the WDR. + For example, for `.8s` lane 0 specifies the bits `[31:0]` of `wrs2`. + Each product is truncated to the element length and the final vector is stored in `wrd`. + + This instruction stalls OTBN for 4 cycles and writes the final result to the ACC WSR as well as `wrd`. + + Flags are not used or saved. + iflow: + - to: [wrd, acc] + from: [wrs1, wrs2] + encoding: + scheme: bnvm + mapping: + lane: lane + use_lane: b1 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b011 + wrd: wrd + +- mnemonic: bn.mulvm + synopsis: Elementwise vector Montgomery multiplication + operands: *bn-mulv-operands + syntax: *bn-mulv-syntax + glued-ops: true + doc: | + Performs a Montgomery multiplication without the final conditional subtraction step of the Montgomery algorithm. 
+ The values in `wrs1` and `wrs2` are interpreted as vectors with elements of size given by `elen` and are considered unsigned. + Each result is truncated to the element length and the final vector is stored in `wrd`. + This instruction outputs the result in `[0, 2q[` as the final conditional subtraction step is not performed. + For a result in `[0, q[`, a conditional subtraction must be performed after this instruction, e.g., with `BN.ADDVM` and a zero vector. + + This instruction stalls OTBN for 12 cycles and writes the final result to the ACC WSR as well as `wrd`. + + **Montgomery explanation**\\ + Inputs: + - `a, b`: Operands in `[0, q[` + - `d`: Bit-width of operands + - `q`: Modulus in `]0, 2^d]` + - `mu`: Montgomery constant, precomputed, `mu = (-q)^(-1) mod 2^d` + + The operands must be pre-transformed into Montgomery space with `a = a_orig * 2^d mod q`. + This must be ensured by the programmer. + + The Montgomery multiplication is defined as: + ``` + r = a*b * 2^(-d) mod q + ``` + where `r` is in `[0, q[`. + + This can be computed with the following steps where `[]_d` are the lower `d` bits and `[]^d` are the higher `d` bits: + ``` + c = a*b + r = [c + [[c]_d * mu]_d * q]^d + if r >= q then // not implemented in hardware + return r - q + return r + ``` + + This instruction computes these steps except the final conditional subtraction (`r >= q`) to optimize area and timing. + The result is thus in `[0, 2q[` and must be reduced to `[0, q[` by the programmer. + This can be performed in software by using a pseudo modulo addition (`BN.ADDVM`) with a zero vector. + + Note that when chaining multiplications, the conditional subtraction can be postponed until after the last multiplication in case the initial inputs are in `[0, 2q[` and `q < (2^d)/4` holds. 
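Review note: the Montgomery steps described above can be cross-checked with a small software model. The following is a minimal Python sketch for the `.8s` case only, not the simulator implementation. All function names (`split`, `join`, `mont_constant`, `mulvm`, `addvm`) are hypothetical, and the modulus `8380417` is just an assumed example of an odd modulus below `2^32 / 4`. It needs Python 3.8 or newer for `pow` with a negative exponent.

```python
# Illustrative model only; names are hypothetical and not part of the OTBN simulator.
D = 32                  # element width in bits for `.8s`
LANES = 256 // D        # 8 elements per WDR
MASK = (1 << D) - 1


def split(wdr):
    """Split a 256-bit integer into 8 unsigned lanes (lane 0 = bits [31:0])."""
    return [(wdr >> (D * i)) & MASK for i in range(LANES)]


def join(lanes):
    """Truncate each lane to 32 bits and concatenate them into a 256-bit integer."""
    return sum((x & MASK) << (D * i) for i, x in enumerate(lanes))


def mont_constant(q):
    """mu = (-q)^(-1) mod 2^d. Only defined for odd q."""
    return pow((-q) % (1 << D), -1, 1 << D)


def mulvm(wrs1, wrs2, q, mu):
    """Elementwise Montgomery product without the final conditional subtraction."""
    out = []
    for a, b in zip(split(wrs1), split(wrs2)):
        c = a * b
        m = ((c & MASK) * mu) & MASK      # [[c]_d * mu]_d
        out.append((c + m * q) >> D)      # [.]^d, lands in [0, 2q) for a, b < q
    return join(out)


def addvm(wrs1, wrs2, q):
    """Elementwise pseudo-modulo addition: subtract q once if the sum is >= q."""
    out = []
    for a, b in zip(split(wrs1), split(wrs2)):
        s = a + b
        out.append(s - q if s >= q else s)
    return join(out)


# Usage: multiply two vectors in Montgomery form, then reduce to [0, q)
# by adding a zero vector with addvm.
q = 8380417                               # assumed example modulus
mu = mont_constant(q)
R = 1 << D
a_mont = join([(x * R) % q for x in range(1, 9)])
b_mont = join([(x * R) % q for x in range(11, 19)])
reduced = addvm(mulvm(a_mont, b_mont, q, mu), 0, q)
```

Each lane of `reduced` is the Montgomery form of the product of the corresponding original operands, which matches the definition `r = a*b * 2^(-d) mod q` applied to pre-transformed inputs.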
+ + **Pre-requisites**\\ + This instruction requires the modulus `q` and its corresponding Montgomery constant `mu` to be placed in the MOD WSR at the following locations:\\ + For `.8s`: `q @ MOD[31:0]`, `mu @ MOD[63:32]` + + **Micro-architectural remarks**\\ + The ACC WSR is used to temporarily store the partial result before the complete result is written into the destination WDR. + It is reset at the start of the instruction but not reset at the end of the instruction. + Similarly, the intermediate multiplication results are stored in two hidden registers. + These registers are cleared to all-zero after each Montgomery computation (i.e., every 3 cycles). + + If required for cryptographic reasons, all three registers can be overwritten using a `BN.WSRW` instruction to the ACC WSR. + + Flags are not used or saved. + iflow: + - to: [wrd, acc] + from: [wrs1, wrs2, mod] + encoding: + scheme: bnvm + mapping: + lane: bxxx + use_lane: b0 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b100 + wrd: wrd + +- mnemonic: bn.mulvml + synopsis: Elementwise vector Montgomery multiplication with fixed lane + operands: *bn-mulvl-operands + syntax: *bn-mulvl-syntax + glued-ops: true + doc: | + Perform a Montgomery multiplication without the final conditional subtraction on each element of `wrs1` with element `lane` of `wrs2`. + The elements are indexed with `lane` starting from the lowest bits of the WDR. + For example, for `.8s` lane 0 specifies the bits `[31:0]` of `wrs2`. + + See `BN.MULVM` for explanations about the Montgomery algorithm and the implementation details, especially regarding security. + For a correct result, a conditional subtraction must be performed after this instruction, e.g., with `BN.ADDVM`. 
+ + This instruction requires the modulus `q` and its corresponding Montgomery constant `mu` to be placed in the MOD WSR at the following location:\\ + For `.8s`: `q @ MOD[31:0]`, `mu @ MOD[63:32]` + + This instruction stalls OTBN for 12 cycles and writes the final result to the ACC WSR as well as `wrd`. + + Flags are not used or saved. + iflow: + - to: [wrd, acc] + from: [wrs1, wrs2, mod] + encoding: + scheme: bnvm + mapping: + lane: lane + use_lane: b1 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b100 + wrd: wrd + +- mnemonic: bn.trn1 + synopsis: Interleave vectors in even fashion + operands: &bn-trn-operands + - name: elen + type: enum(.8s,.4d,.2q) + doc: | + Select the bit-width of the vector elements. + - `.8s`: WDRs are interpreted as 8 32-bit elements + - `.4d`: WDRs are interpreted as 4 64-bit elements + - `.2q`: WDRs are interpreted as 2 128-bit elements + - name: wrd + doc: Destination WDR + - name: wrs1 + doc: First source WDR + - name: wrs2 + doc: Second source WDR + syntax: &bn-trn-syntax + , , + glued-ops: true + doc: | + Even-numbered vector elements from `wrs1` are placed into even-numbered elements of `wrd`. + Even-numbered vector elements from `wrs2` are placed into odd-numbered elements of `wrd`. + + For `.8s`, this can be described as `wrd[i] = wrs1[i]` and `wrd[i+1] = wrs2[i]` for `i in [0, 2, .., (256//32)-2]` which results in `wrd = {..., wrs2[2], wrs1[2], wrs2[0], wrs1[0]}`. + + Flags are not used or saved. + iflow: + - to: [wrd] + from: [wrs1, wrs2] + encoding: + scheme: bnvtrn + mapping: + odd: b0 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b101 + wrd: wrd + +- mnemonic: bn.trn2 + synopsis: Interleave vectors in odd fashion + operands: *bn-trn-operands + syntax: *bn-trn-syntax + glued-ops: true + doc: | + Odd-numbered vector elements from `wrs1` are placed into even-numbered elements of `wrd`. + Odd-numbered vector elements from `wrs2` are placed into odd-numbered elements of `wrd`. 
+ + For `.8s`, this can be described as `wrd[i] = wrs1[i+1]` and `wrd[i+1] = wrs2[i+1]` for `i in [0, 2, .., (256//32)-2]` which results in `wrd = {..., wrs2[3], wrs1[3], wrs2[1], wrs1[1]}`. + + Flags are not used or saved. + iflow: + - to: [wrd] + from: [wrs1, wrs2] + encoding: + scheme: bnvtrn + mapping: + odd: b1 + elen: elen + wrs2: wrs2 + wrs1: wrs1 + funct3: b101 + wrd: wrd + +- mnemonic: bn.shv + synopsis: Bitwise logical shift of vector elements + operands: + - name: elen + type: enum(.8s) + doc: | + Select the bit-width of the vector elements. + - `.8s`: WDRs are interpreted as 8 32-bit elements + - name: wrd + doc: Name of the destination WDR + - name: wrs + doc: Name of the source WDR + - name: shift_type + abbrev: st + type: enum(<<, >>) + doc: | + The direction of the shift applied to elements of `wrs`. + - name: shift_bits + abbrev: sb + type: uimm5 + doc: Number of bits by which to shift elements in `wrs`. + syntax: , + glued-ops: true + doc: | + Logically shift each element of vector `wrs` by `shift_bits` in `shift_type` direction. + The elements are considered as unsigned and their size is given by `elen`. + + Flags are not used or saved. + iflow: + - to: [wrd] + from: [wrs] + encoding: + scheme: bnvsh + mapping: + shift_type: shift_type + shift_bits: shift_bits + elen: elen + wrs: wrs + funct3: b111 + wrd: wrd + +- mnemonic: bn.pack + synopsis: Pack 32-bit vectors into a 24-bit dense representation + operands: + - name: wrd + doc: Name of the destination WDR + - name: wrs1 + doc: Name of the first source WDR + - name: wrs2 + doc: Name of the second source WDR + - name: shift_bits + abbrev: sb + type: uimm2<<6 + doc: | + The number of bits to shift the concatenated dense vectors. + syntax: | + , , , + doc: | + Compresses the content of the WDRs referenced by `wrs1` and `wrs2` by extracting the lower 24 bits from each of their eight 32-bit elements to form 192-bit dense representations (8 * 24-bit). 
+ The dense representations are concatenated (`wrs1` forms the upper part) and 64 zero bits are appended to the left and right to form the 512 bit vector `{64'd0, dense1, dense2, 64'd0}`. + This combination is then shifted to the right by `shift_bits` bits and the lowest 256 bits are stored to `wrd`. + iflow: + - to: [wrd] + from: [wrs1, wrs2] + encoding: + scheme: bnpk + mapping: + is_pack: b1 + shift: shift_bits + wrs2: wrs2 + wrs1: wrs1 + funct3: b110 + wrd: wrd + +- mnemonic: bn.unpk + synopsis: Expand 24-bit packed vectors into 32-bit vectors + operands: + - name: wrd + doc: Name of the destination WDR + - name: wrs1 + doc: Name of the first source WDR + - name: wrs2 + doc: Name of the second source WDR + - name: shift_bits + abbrev: sb + type: uimm2<<6 + doc: | + The number of bits to shift the concatenated WDRs before unpacking. + syntax: | + , , , + doc: | + Expands 8 elements from a 24-bit packed format into a 32-bit vector format. + Concatenates the content of WDRs referenced by `wrs1` and `wrs2` (`wrs1` forms the upper part) and shifts it right by the specified number of bits. + Finally, the lowest 8 24-bit elements are unpacked and zero extended into 32-bit elements and stored in `wrd`. + iflow: + - to: [wrd] + from: [wrs1, wrs2] + encoding: + scheme: bnpk + mapping: + is_pack: b0 + shift: shift_bits + wrs2: wrs2 + wrs1: wrs1 + funct3: b110 + wrd: wrd diff --git a/hw/ip/otbn/data/enc-schemes.yml b/hw/ip/otbn/data/enc-schemes.yml index adb992897d47a..b88e9245d1d1e 100644 --- a/hw/ip/otbn/data/enc-schemes.yml +++ b/hw/ip/otbn/data/enc-schemes.yml @@ -147,6 +147,11 @@ custom3: parents: - rv(opcode=b11110) +# A partial scheme for custom instructions with opcode b10110 +custom4: + parents: + - rv(opcode=b10110) + # A partial scheme for instructions that produce a dest WDR. wrd: fields: @@ -366,3 +371,93 @@ wcsr: fixed: bits: 30-28 value: bxxx + +# Bignum vectorized instructions +# +# The following encodings describe the vectorized instructions. 
The encoding foresees possible +# future extensions regarding the vector element width as well as new instructions. + +# A partial scheme for the datatype used by vectorized instructions. Specifies the +# width/length of a vector element. Defines the element lengths: 32, 64 and 128 bits. +bnelen: + fields: + elen: 26-25 + +# Big number vectorized arithmetic scheme +# Used by bn.addv, bn.addvm, bn.subv, and bn.subvm +bnva: + parents: + - custom4 + - wdr3 + - funct3 + - bnelen + fields: + sub: 30 # bit to indicate whether operation is subtraction + mod: 28 # bit to indicate whether operation is pseudo modulo + fixed: + # Bit 27 is foreseen to indicate whether operation should consider the carry flag + # Bit 29 is unused + # Bit 31 is foreseen to select the flag group (use partial scheme `fg` as parent) + bits: 31,29,27 + value: bxxx + +# Used by bn.mulv(l) and bn.mulvm(l) +bnvm: + parents: + - custom4 + - wdr3 + - funct3 + - bnelen + fields: + lane: 30-28 + use_lane: 27 + unused_lane_bit: + # Bit 31 is foreseen to select lane for 16 bit vectors. + bits: 31 + value: bx + +# Used by bn.trn1 and bn.trn2 +bnvtrn: + parents: + - custom4 + - wdr3 + - funct3 + - bnelen + fields: + odd: 30 + fixed: + bits: 31,29-27 + value: bxxxx + +# Used by bn.shv +bnvsh: + parents: + - custom4 + - wrd + - funct3 + - bnelen + fields: + shift_type: 30 + # Bits 28-27,19-15 are foreseen to encode a 7 bit shift immediate to support 128-bit elements. + # Currently only 32-bit is supported. Thus, the shift amount is restricted to 5 bits. 
+ shift_bits: 19-15 + wrs: 24-20 + fixed: + # Bit 31 is foreseen to select the flag group (use partial scheme `fg` as parent) + # Bit 29 is unused + # Bits 28-27 are foreseen for larger shift immediates + bits: 31,29-27 + value: bxxxx + +# Used by bn.pack and bn.unpk +bnpk: + parents: + - custom4 + - wdr3 + - funct3 + fields: + is_pack: 30 + shift: 28-27 + fixed: + bits: 31,29,26-25 + value: bxxxx diff --git a/hw/ip/otbn/data/wsr.yml b/hw/ip/otbn/data/wsr.yml index 62f8b72a43948..b5fdcdf32b72d 100644 --- a/hw/ip/otbn/data/wsr.yml +++ b/hw/ip/otbn/data/wsr.yml @@ -5,7 +5,7 @@ - name: mod address: 0 doc: | - The modulus used by the {{#otbn-insn-ref BN.ADDM}} and {{#otbn-insn-ref BN.SUBM}} instructions. + The modulus used by the {{#otbn-insn-ref BN.ADDM}} and {{#otbn-insn-ref BN.SUBM}} instructions as well as their vectorized variants. This WSR is also visible as CSRs `MOD0` through to `MOD7`. - name: rnd @@ -32,7 +32,7 @@ - name: acc address: 3 doc: | - The accumulator register used by the {{#otbn-insn-ref BN.MULQACC}} instruction. + The accumulator register used by the {{#otbn-insn-ref BN.MULQACC}} instruction and the vectorized multiplication instructions like {{#otbn-insn-ref BN.MULV}}. - name: key_s0_l address: 4 diff --git a/hw/ip/otbn/doc/isa.md b/hw/ip/otbn/doc/isa.md index c2d3944474ec0..e5cc7bffa7ce9 100644 --- a/hw/ip/otbn/doc/isa.md +++ b/hw/ip/otbn/doc/isa.md @@ -9,6 +9,7 @@ The base subset (described first) is similar to RISC-V's RV32I instruction set. It also includes a hardware call stack and hardware loop instructions. The big number subset is designed to operate on 256b WDRs. It doesn't include any control flow instructions, and just supports load/store, logical and arithmetic operations. +SIMD variants exist for some of the logical and arithmetic big number operations; these interpret the WDRs as vectors. In the instruction documentation that follows, each instruction has a syntax example. 
For example, the `SW` instruction has syntax: @@ -100,6 +101,8 @@ def extract_quarter_word(value: int, qwsel: int) -> int: assert 0 <= value < (1 << 256) assert 0 <= qwsel <= 3 return (value >> (qwsel * 64)) & ((1 << 64) - 1) + +# TODO: Add new helpers for SIMD instructions once the simulator implementation exists. ``` # Errors diff --git a/hw/ip/otbn/doc/theory_of_operation.md b/hw/ip/otbn/doc/theory_of_operation.md index 96eb1378d4099..e124c068b5279 100644 --- a/hw/ip/otbn/doc/theory_of_operation.md +++ b/hw/ip/otbn/doc/theory_of_operation.md @@ -515,7 +515,7 @@ OTBN provides a mechanism to securely wipe all internal state, excluding the ins The following state is wiped: * Register files: GPRs and WDRs -* The accumulator register (also accessible through the ACC WSR) +* The accumulator register (also accessible through the ACC WSR) and the intermediate result registers for the Montgomery computation (hidden registers) * Flags (accessible through the FG0, FG1, and FLAGS CSRs) * The modulus (accessible through the MOD0 to MOD7 CSRs and the MOD WSR) From d6e2cf113084ce25b76dd22e9eff0657fec25eb9 Mon Sep 17 00:00:00 2001 From: Pascal Etterli Date: Tue, 10 Feb 2026 10:38:11 +0100 Subject: [PATCH 2/3] [otbn,doc] Extend the programmer's guide with bn.pack and bn.unpk explanations Adds a simple example of how to use the bn.pack and bn.unpk instructions. Signed-off-by: Pascal Etterli --- hw/ip/otbn/doc/pack_instruction_shifting.svg | 4 ++ hw/ip/otbn/doc/packed_format.svg | 4 ++ hw/ip/otbn/doc/programmers_guide.md | 55 ++++++++++++++++++++ 3 files changed, 63 insertions(+) create mode 100644 hw/ip/otbn/doc/pack_instruction_shifting.svg create mode 100644 hw/ip/otbn/doc/packed_format.svg diff --git a/hw/ip/otbn/doc/pack_instruction_shifting.svg b/hw/ip/otbn/doc/pack_instruction_shifting.svg new file mode 100644 index 0000000000000..e96a6760cdfa0 --- /dev/null +++ b/hw/ip/otbn/doc/pack_instruction_shifting.svg @@ -0,0 +1,4 @@ + + + +
[SVG text labels of pack_instruction_shifting.svg: 24-bit elements over byte indices 0 to 63, with four 256-bit windows marking the bits stored to the WDR when shifting by 0, 64, 128 and 192]
\ No newline at end of file diff --git a/hw/ip/otbn/doc/packed_format.svg b/hw/ip/otbn/doc/packed_format.svg new file mode 100644 index 0000000000000..2956f799a1e1b --- /dev/null +++ b/hw/ip/otbn/doc/packed_format.svg @@ -0,0 +1,4 @@ + + + +
[SVG text labels of packed_format.svg: packed 24-bit elements over byte indices 0 to 31]
\ No newline at end of file diff --git a/hw/ip/otbn/doc/programmers_guide.md b/hw/ip/otbn/doc/programmers_guide.md index f8eb32f1fef9c..d5621639e7522 100644 --- a/hw/ip/otbn/doc/programmers_guide.md +++ b/hw/ip/otbn/doc/programmers_guide.md @@ -353,6 +353,61 @@ The outlined technique can be extended to arbitrary bit widths but requires unro Code snippets giving examples of 256x256 and 384x384 multiplies can be found in `sw/otbn/code-snippets/mul256.s` and `sw/otbn/code-snippets/mul384.s`. +### Packing and unpacking 24-bit element vectors +The vectorized subset of Bignum instructions enables SIMD computation on 32-bit elements. +However, some PQC algorithms operate on smaller values. +To optimize the memory footprint of such programs, vectors can be compressed and stored in memory in a dense 24-bit format. +The `bn.pack` and `bn.unpk` instructions convert 32-bit vectors into a dense 24-bit representation and vice-versa as described in the [ISA manual](./isa.md). +These packed vectors can then be stored in the memory as shown below. + + + +To pack vectors one can use the following snippet: +``` +/* + * Assume we have 4 vectors with 8 32-bit elements currently in WDRs w0-w3 + * which we want to store in the packed format. + * The color in the image corresponds to the WDRs as follows: + * w0: Red vector + * w1: Yellow vector + * w2: Green vector + * w3: Blue vector + */ + +/* Pack the vectors into temporary WDRs */ +bn.pack w10, w1, w0, 64 +bn.pack w11, w2, w1, 128 +bn.pack w12, w3, w2, 192 + +/* Store packed vectors to memory */ +... +``` +The inner workings of the `bn.pack` instruction are visualized in the following figure for the case of `bn.pack w11, w2, w1, `. +The two vectors are first converted into a dense format (192 bits each), then concatenated with additional zero bits. +Finally, the 512 bits are shifted to produce the marked 256 bits which are stored to the destination WDR. +This allows one to construct all the required packings. 
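Review note: the shift amounts used in the guide can be sanity-checked against the `bn.pack` and `bn.unpk` descriptions in `bignum-insns.yml` from this series. The Python sketch below is a minimal model written from those descriptions only. The helper names (`dense`, `expand`, `bn_pack`, `bn_unpk`) are hypothetical and the test vectors are arbitrary values below `2^24`; it is not the simulator implementation.

```python
# Illustrative model only; names are hypothetical and not part of the OTBN simulator.
W256 = (1 << 256) - 1
W192 = (1 << 192) - 1


def dense(wdr):
    """Keep the low 24 bits of each of the eight 32-bit elements (192 bits total)."""
    return sum(((wdr >> (32 * i)) & 0xFFFFFF) << (24 * i) for i in range(8))


def expand(d192):
    """Zero-extend eight 24-bit fields into eight 32-bit elements."""
    return sum(((d192 >> (24 * i)) & 0xFFFFFF) << (32 * i) for i in range(8))


def bn_pack(wrs1, wrs2, shift):
    """Model of bn.pack: {64'd0, dense(wrs1), dense(wrs2), 64'd0} >> shift, low 256 bits."""
    assert shift in (0, 64, 128, 192)
    v512 = ((dense(wrs1) << 192) | dense(wrs2)) << 64
    return (v512 >> shift) & W256


def bn_unpk(wrs1, wrs2, shift):
    """Model of bn.unpk: {wrs1, wrs2} >> shift, then expand the low 192 bits."""
    assert shift in (0, 64, 128, 192)
    v512 = (wrs1 << 256) | wrs2
    return expand((v512 >> shift) & W192)


# Usage: four vectors w0..w3 with 24-bit payloads, packed as in the guide.
w = [sum(((v << 20) | lane) << (32 * lane) for lane in range(8)) for v in range(4)]
w10 = bn_pack(w[1], w[0], 64)
w11 = bn_pack(w[2], w[1], 128)
w12 = bn_pack(w[3], w[2], 192)

# Unpacking with the shift amounts from the guide recovers the original vectors.
unpacked = [
    bn_unpk(w11, w10, 0),
    bn_unpk(w11, w10, 192),
    bn_unpk(w12, w11, 128),
    bn_unpk(0, w12, 64),    # the upper source does not contribute here
]
```

In this model the three packed words together hold exactly `4 * 192 = 768` bits, which is why three 256-bit stores are enough for four vectors.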
+ + + +To unpack vectors one can use the following approach. +The unpacking works by concatenating two 256-bit strings loaded from memory and shifting the desired bits to the lower 192 bits. +These 192 bits are then expanded to 8x 32 bits by inserting zero bytes every 3 bytes. +``` +/* + * Load packed vectors from memory into WDRs w10-w12 such that: + * w10 corresponds to the 1st line in the first image + * w11 corresponds to the 2nd line in the first image + * w12 corresponds to the 3rd line in the first image + */ +... + +/* Unpack vectors */ +bn.unpk w0, w11, w10, 0 /* unpack the red vector to w0 */ +bn.unpk w1, w11, w10, 192 /* unpack the yellow vector to w1 */ +bn.unpk w2, w12, w11, 128 /* unpack the green vector to w2 */ +bn.unpk w3, wXX, w12, 64 /* unpack the blue vector to w3, wXX represents that any WDR can be used */ +``` + ## Device Interface Functions (DIFs) - [Device Interface Functions](../../../../sw/device/lib/dif/dif_otbn.h) From 22f20279517cc0298cee66068c63767101ac1b1c Mon Sep 17 00:00:00 2001 From: Pascal Etterli Date: Tue, 10 Feb 2026 10:38:11 +0100 Subject: [PATCH 3/3] [otbn,doc] Extend the programmer's guide with bn.trn explanations This illustrates the functionality of the bn.trn1 and bn.trn2 instructions. Signed-off-by: Pascal Etterli --- hw/ip/otbn/doc/bn_trn_illustration.svg | 3 +++ hw/ip/otbn/doc/programmers_guide.md | 9 ++++++++- 2 files changed, 11 insertions(+), 1 deletion(-) create mode 100644 hw/ip/otbn/doc/bn_trn_illustration.svg diff --git a/hw/ip/otbn/doc/bn_trn_illustration.svg b/hw/ip/otbn/doc/bn_trn_illustration.svg new file mode 100644 index 0000000000000..d50dfffdc5193 --- /dev/null +++ b/hw/ip/otbn/doc/bn_trn_illustration.svg @@ -0,0 +1,3 @@ + + +
[SVG text labels of bn_trn_illustration.svg: 64-bit elements 0 to 3 of WRS1 and WRS2, with WRD holding elements 0, 0, 2, 2 for `bn.trn1.4d wrd, wrs1, wrs2` and elements 1, 1, 3, 3 for `bn.trn2.4d wrd, wrs1, wrs2`]
\ No newline at end of file diff --git a/hw/ip/otbn/doc/programmers_guide.md b/hw/ip/otbn/doc/programmers_guide.md index d5621639e7522..b8b0abc6eff1c 100644 --- a/hw/ip/otbn/doc/programmers_guide.md +++ b/hw/ip/otbn/doc/programmers_guide.md @@ -389,7 +389,6 @@ This allows one to construct all the required packings. -To unpack vectors one can use the following approach. The unpacking works by concatenating two 256-bit strings loaded from memory and shifting the desired bits to the lower 192 bits. These 192 bits are then expanded to 8x 32 bits by inserting zero bytes every 3 bytes. ``` @@ -408,6 +407,14 @@ bn.unpk w2, w12, w11, 128 /* unpack the green vector to w2 */ bn.unpk w3, wXX, w12, 64 /* unpack the blue vector to w3, wXX represents that any WDR can be used */ ``` +### Transposing vector elements +To efficiently shuffle vectors, one can use the `bn.trn1` and `bn.trn2` instructions. +These instructions reorder the vector elements as illustrated in the image below for `bn.trn1.4d` and `bn.trn2.4d`. +- The `bn.trn1 wrd, wrs1, wrs2` instruction places even-indexed vector elements from `wrs1` into even-indexed elements of `wrd` and even-indexed vector elements from `wrs2` into odd-indexed elements of `wrd`. +- The `bn.trn2 wrd, wrs1, wrs2` instruction places odd-indexed vector elements from `wrs1` into even-indexed elements of `wrd` and odd-indexed vector elements from `wrs2` into odd-indexed elements of `wrd`. + + + ## Device Interface Functions (DIFs) - [Device Interface Functions](../../../../sw/device/lib/dif/dif_otbn.h)
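
Review note: as a cross-check of the `bn.trn1` and `bn.trn2` wording in the last hunk, here is a minimal Python sketch of the element shuffle for any of the three element lengths. The names `split`, `join` and `bn_trn` are hypothetical, and the `odd` argument mirrors the `odd` bit of the `bnvtrn` encoding (0 for `bn.trn1`, 1 for `bn.trn2`); it is a model written from the description, not the simulator implementation.

```python
# Illustrative model only; names are hypothetical and not part of the OTBN simulator.
def split(wdr, elen):
    """Split a 256-bit integer into unsigned elements of `elen` bits (element 0 = lowest bits)."""
    mask = (1 << elen) - 1
    return [(wdr >> (elen * i)) & mask for i in range(256 // elen)]


def join(elems, elen):
    """Concatenate elements of `elen` bits into one 256-bit integer."""
    return sum(e << (elen * i) for i, e in enumerate(elems))


def bn_trn(wrs1, wrs2, elen, odd):
    """odd=0 models bn.trn1, odd=1 models bn.trn2; elen is 32, 64 or 128."""
    a, b = split(wrs1, elen), split(wrs2, elen)
    out = []
    for i in range(0, 256 // elen, 2):
        out += [a[i + odd], b[i + odd]]   # wrd[i] from wrs1, wrd[i+1] from wrs2
    return join(out, elen)


# Usage: with `.2q`, trn1 and trn2 applied to the two rows of a 2x2 block
# of 128-bit values yield its two columns, i.e. the transposed block.
row0 = join([0x00, 0x01], 128)
row1 = join([0x10, 0x11], 128)
col0 = bn_trn(row0, row1, 128, 0)
col1 = bn_trn(row0, row1, 128, 1)
```

For `.4d` this reproduces the element pattern shown in the illustration: `bn.trn1` yields elements 0, 0, 2, 2 and `bn.trn2` yields elements 1, 1, 3, 3, alternating between `wrs1` and `wrs2`.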