|
| 1 | +# RealTek AmebaPro2 (RTL8735B) HUK Port |
| 2 | + |
| 3 | +Binds wolfCrypt keys to the RTL8735B silicon Hardware Unique Key (HUK) through |
| 4 | +the AmebaPro2 HAL crypto engine, via the wolfCrypt crypto-callback (CryptoCb) |
| 5 | +framework. A 256-bit "seed" is run through the HAL HKDF key-ladder against the |
| 6 | +HUK to land a device-bound working key in a secure key-storage slot; AES |
| 7 | +(GCM/ECB/CBC/CTR) then runs from that slot and the working key never enters |
| 8 | +software. It is a pure crypto-callback device and adds no wolfSSL core API or |
| 9 | +struct fields: AES reads its seed from the standard `aes->devKey`, and ECDSA |
| 10 | +reads a `wc_AmebaPro2_EccKey` (the HUK-wrapped scalar + seed) the caller attaches |
| 11 | +via the standard `ecc_key->devCtx`. This is the same device pattern the STM32 |
| 12 | +DHUK port (`wc_Stm32_DhukRegister`) uses. |
| 13 | + |
| 14 | +## Hardware |
| 15 | + |
| 16 | +RTL8735B / AmebaPro2 security blocks used by this port (from the |
| 17 | +`Ameba-AIoT/nuwa_hal_realtek` SDK, `rtl8735b` branch, headers under |
| 18 | +`ameba/amebapro2/source/fwlib/rtl8735b/include/`): |
| 19 | + |
| 20 | +- HUK in OTP: `SB_OTP_HIGH_VAL_HUK1` (0x21), `HUK2` (0x22), `HUK_RMA` (0x2F). |
| 21 | +- HKDF key-ladder in secure RAM: `hal_hkdf_hmac_sha256_secure_init`, |
| 22 | + `hal_hkdf_extract_secure_all`, `hal_hkdf_expand_secure_all` -- derive the HUK |
| 23 | + into a secure key-storage slot without exposing the key to software. |
| 24 | +- AES secure-key ops that reference the derived slot by number: |
| 25 | + `hal_crypto_aes_ecb_sk_init`, `hal_crypto_aes_gcm_sk_init` (key never leaves |
| 26 | + hardware). |
| 27 | +- The HUK-bound ECDSA sign path reuses the AES secure-key engine above to unwrap |
| 28 | + the wrapped scalar, then signs in software. The HW ECDSA engine (`hal_ecdsa.h`) |
| 29 | + and OTP-resident ECDSA keys (`hal_otp_ecdsa_key_*`) are follow-ons, not yet |
| 30 | + used. |
| 31 | +- TRNG (`hal_trng.h`); the `ameba-zephyr-pro2-platform` repo provides a Zephyr |
| 32 | + entropy driver (`entropy_amebapro2.c`, DT `realtek,amebapro2-trng`) that feeds |
| 33 | + wolfCrypt's `wc_GenerateSeed` via `sys_rand_get`. |
| 34 | + |
| 35 | +## Enabling |
| 36 | + |
| 37 | +```c |
| 38 | +#define WOLFSSL_REALTEK_HUK /* enable the AmebaPro2 HUK device */ |
| 39 | +#define WOLF_CRYPTO_CB /* required -- HUK routes through crypto callbacks */ |
| 40 | +``` |
| 41 | +
|
| 42 | +Set these in `user_settings.h`. The application/board CMake must add |
| 43 | +the AmebaPro2 HAL include directory (e.g. |
| 44 | +`.../fwlib/rtl8735b/include/`) to the wolfSSL library include path so this port |
| 45 | +can include `hal_crypto.h` and `hal_hkdf.h`. |
| 46 | +
|
| 47 | +Configurable (override in `user_settings.h` before including wolfSSL): |
| 48 | +
|
| 49 | +| Macro | Default | Meaning | |
| 50 | +|--------------------------------|---------|--------------------------------------| |
| 51 | +| `WC_HUK_DEVID` | 809 | CryptoCb device id (STM32 DHUK is 808) | |
| 52 | +| `WC_AMEBAPRO2_HUK_SK_IDX` | 1 | Secure-key slot holding the HUK (HUK1) | |
| 53 | +| `WC_AMEBAPRO2_HKDF_PRK_IDX` | 3 | Intermediate HKDF PRK slot | |
| 54 | +| `WC_AMEBAPRO2_DERIVED_WB_IDX` | 4 | Derived working-key slot (AES uses it) | |
| 55 | +| `WC_AMEBAPRO2_HKDF_CRYPTO_SEL` | 0 | `crypto_sel` for the secure HKDF init | |
| 56 | +| `WC_AMEBAPRO2_MAX_WRAPPED` | 96 | Max wrapped-scalar blob the ECDSA sign path unwraps | |
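
As a minimal sketch, a `user_settings.h` fragment that enables the port and restates the defaults from the table could look like the following. The values are the documented defaults; the compile-time guard at the end is an illustrative assumption (that the HUK, PRK and derived-key slots should be distinct), not something the port is known to enforce.

```c
/* user_settings.h (fragment) -- sketch only; values are the documented defaults */
#define WOLFSSL_REALTEK_HUK              /* enable the AmebaPro2 HUK device */
#define WOLF_CRYPTO_CB                   /* required: HUK routes through CryptoCb */

#define WC_HUK_DEVID                 809 /* CryptoCb device id */
#define WC_AMEBAPRO2_HUK_SK_IDX      1   /* secure-key slot holding the HUK */
#define WC_AMEBAPRO2_HKDF_PRK_IDX    3   /* intermediate HKDF PRK slot */
#define WC_AMEBAPRO2_DERIVED_WB_IDX  4   /* derived working-key slot */
#define WC_AMEBAPRO2_HKDF_CRYPTO_SEL 0   /* crypto_sel for the secure HKDF init */
#define WC_AMEBAPRO2_MAX_WRAPPED     96  /* max wrapped-scalar blob size */

/* Assumption for illustration: the three slots should not collide. */
#if (WC_AMEBAPRO2_HUK_SK_IDX == WC_AMEBAPRO2_HKDF_PRK_IDX) || \
    (WC_AMEBAPRO2_HUK_SK_IDX == WC_AMEBAPRO2_DERIVED_WB_IDX) || \
    (WC_AMEBAPRO2_HKDF_PRK_IDX == WC_AMEBAPRO2_DERIVED_WB_IDX)
#error "HUK, PRK and derived-key slot indices must be distinct"
#endif
```
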
| 57 | +
|
| 58 | +## API |
| 59 | +
|
| 60 | +```c |
| 61 | +#include <wolfssl/wolfcrypt/port/realtek/amebapro2.h> |
| 62 | +
|
| 63 | +/* One-time: register the AmebaPro2 HUK crypto-callback device. */ |
| 64 | +wc_AmebaPro2_HukRegister(WC_HUK_DEVID); |
| 65 | +
|
| 66 | +/* AES / GCM: enable via devId at init, then pass the 256-bit seed as the key. |
| 67 | + * The seed is HKDF input that diversifies the HUK -- it is NOT the AES key. */ |
| 68 | +Aes aes; |
| 69 | +byte seed[32]; /* per-purpose derivation seed (need not be secret) */ |
| 70 | +wc_AesInit(&aes, NULL, WC_HUK_DEVID); |
| 71 | +wc_AesGcmSetKey(&aes, seed, 32); |
| 72 | +wc_AesGcmEncrypt(&aes, ct, pt, ptSz, iv, 12, tag, tagSz, aad, aadSz); /* full GCM */ |
| 73 | +wc_AesFree(&aes); |
| 74 | +
|
| 75 | +/* AES-ECB / AES-CBC follow the same pattern: wc_AesSetKey, then the
| 76 | + * wc_AesEcb... / wc_AesCbc... calls, with devId = WC_HUK_DEVID. */
| 77 | +
|
| 78 | +wc_AmebaPro2_HukUnRegister(WC_HUK_DEVID); |
| 79 | +``` |
| 80 | + |
| 81 | +The seed maps to a device-bound working key as: |
| 82 | +HUK (slot `WC_AMEBAPRO2_HUK_SK_IDX`) -> `hal_hkdf_extract_secure_all` -> PRK slot |
| 83 | +-> `hal_hkdf_expand_secure_all` -> working key in `WC_AMEBAPRO2_DERIVED_WB_IDX` |
| 84 | +-> `hal_crypto_aes_gcm_sk_init` / `hal_crypto_aes_ecb_sk_init`. The derive and |
| 85 | +the AES op run under one crypto-mutex hold; the working key never enters |
| 86 | +software. Identical seed -> identical working key (deterministic, so GMAC |
| 87 | +verifies and AES round-trips); a wrong seed yields a different key (GCM decrypt |
| 88 | +returns `AES_GCM_AUTH_E`). |
| 89 | + |
| 90 | +HUK-bound ECDSA sign (Stage 3, wrapped-scalar): point the key's crypto-callback |
| 91 | +context at a `wc_AmebaPro2_EccKey` (the scalar AES-wrapped under a HUK-derived |
| 92 | +key, plus its 32-byte seed) -- no dedicated wolfSSL import API: |
| 93 | + |
| 94 | +```c |
| 95 | +#include <wolfssl/wolfcrypt/port/realtek/amebapro2.h> |
| 96 | +wc_AmebaPro2_EccKey hk = { seed, 32, wrapped, wrappedLen, plainLen }; |
| 97 | +ecc_key key; |
| 98 | +wc_ecc_init_ex(&key, NULL, WC_HUK_DEVID); |
| 99 | +wc_ecc_set_curve(&key, plainLen, ECC_SECP256R1); |
| 100 | +key.devCtx = &hk; /* borrowed; must outlive the key */ |
| 101 | +wc_ecc_sign_hash(hash, hashSz, sig, &sigSz, rng, &key); |
| 102 | +``` |
| 103 | +
|
| 104 | +At sign time the port derives the slot key from the seed, ECB-unwraps the scalar |
| 105 | +into a short-lived buffer, signs, and scrubs it. The wrapped blob is device-bound |
| 106 | +(it only unwraps on the silicon whose HUK produced the slot key). The scalar is |
| 107 | +briefly in software during the sign; an OTP-resident model (`hal_ecdsa_select_prk`, |
| 108 | +scalar never in software) and routing the sign itself through the HW ECDSA engine |
| 109 | +(`hal_ecdsa`) are follow-ons. |
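
The unwrap, sign, scrub sequence described above can be sketched in plain C. This is an illustration of the pattern only, not the port's code: `demo_unwrap`, `demo_sign`, `sign_with_wrapped` and `SCALAR_BUF_SZ` are hypothetical names, the XOR "unwrap" and checksum "sign" are stand-ins with no cryptographic value, and `scrub` is a generic volatile-store zeroizer (wolfCrypt ships a helper of this kind, `ForceZero`; whether the port uses it is not stated here).

```c
#include <assert.h>
#include <stddef.h>

#define SCALAR_BUF_SZ 96 /* mirrors the WC_AMEBAPRO2_MAX_WRAPPED default */

/* Zero a buffer through a volatile pointer so the stores are not optimized away. */
void scrub(void *buf, size_t len)
{
    volatile unsigned char *p = (volatile unsigned char *)buf;
    assert(buf != NULL || len == 0);
    while (len-- > 0)
        *p++ = 0;
}

/* Hypothetical stand-in for the slot-key ECB unwrap; XOR is NOT real crypto. */
int demo_unwrap(unsigned char *plain, const unsigned char *wrapped, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        plain[i] = (unsigned char)(wrapped[i] ^ 0x5A);
    return 0;
}

/* Hypothetical stand-in for the software sign: it only checksums the scalar. */
int demo_sign(const unsigned char *scalar, size_t len, unsigned int *sigOut)
{
    size_t i;
    unsigned int acc = 0;
    for (i = 0; i < len; i++)
        acc = acc * 31u + scalar[i];
    *sigOut = acc;
    return 0;
}

/* Unwrap into a short-lived stack buffer, use it, then scrub on every path. */
int sign_with_wrapped(const unsigned char *wrapped, size_t wrappedLen,
                      size_t plainLen, unsigned int *sigOut)
{
    unsigned char scalar[SCALAR_BUF_SZ];
    int ret;

    if (wrappedLen > sizeof(scalar) || plainLen > wrappedLen)
        return -1; /* reject blobs that would overflow the buffer */

    ret = demo_unwrap(scalar, wrapped, wrappedLen);
    if (ret == 0)
        ret = demo_sign(scalar, plainLen, sigOut);

    scrub(scalar, sizeof(scalar)); /* runs whether or not the sign succeeded */
    return ret;
}
```

The point of the shape is that the scrub is unconditional: there is a single exit after the plaintext scalar exists, so an error in the sign step cannot leave it on the stack.
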
| 110 | +
|
| 111 | +## Notes / limitations |
| 112 | +
|
| 113 | +- The HAL GCM path assumes a 96-bit (12-byte) IV (standard J0). A non-12-byte
| 114 | + IV is rejected with an error rather than falling back to software, because a
| 115 | + software fallback would key AES with the seed itself, not the device-bound key.
| 116 | +- AES-CBC and AES-CTR chain in software over single-block |
| 117 | + `hal_crypto_aes_ecb_sk_*` calls because the HAL exposes no CBC/CTR secure-key |
| 118 | + variant; the key still stays in hardware. CTR maintains the wolfCrypt counter |
| 119 | + state (`aes->reg`/`tmp`/`left`) so partial blocks continue across calls. |
| 120 | +- The HAL crypto engine DMAs its buffers on 32-byte (cache-line) boundaries and |
| 121 | + rejects an unaligned GCM iv/aad. The port stages key/iv/aad/tag on aligned |
| 122 | + temporaries and bounces unaligned in/out through aligned buffers, so callers |
| 123 | + need not align. |
| 124 | +- Each operation derives the working key from the `Aes` object's own `devKey`
| 125 | + seed under the crypto mutex (no shared port global), so concurrent `Aes`
| 126 | + objects are safe.
| 127 | +- `--enable-amebapro2` builds a host compile-test only: it swaps the HAL headers |
| 128 | + for `amebapro2_shim.h` (sentinel stubs, no real crypto) to exercise the |
| 129 | + crypto-callback dispatch and build wiring without the vendor SDK. All |
| 130 | + functional validation requires RTL8735B hardware. |
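
The CTR note above (software chaining over single-block ECB calls, with counter state carried across calls) can be sketched as follows. This is a minimal illustration under stated assumptions, not the port's implementation: `CtrState`, `ctr_init`, `ctr_crypt` and `toy_ecb_block` are hypothetical names, the fields only mirror the roles of `aes->reg`/`tmp`/`left`, and `toy_ecb_block` is a stand-in for one secure-key ECB block operation (it is not a real cipher, and in the port the key would stay in a hardware slot rather than in the struct).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLK 16

typedef struct {
    unsigned char key[BLK]; /* stand-in key; the real key never enters software */
    unsigned char reg[BLK]; /* current counter block */
    unsigned char tmp[BLK]; /* most recent keystream block */
    size_t left;            /* unused keystream bytes remaining in tmp */
} CtrState;

/* Stand-in for one single-block ECB encrypt. NOT a real cipher. */
void toy_ecb_block(const unsigned char *key, const unsigned char *in,
                   unsigned char *out)
{
    size_t i;
    for (i = 0; i < BLK; i++)
        out[i] = (unsigned char)((in[i] ^ key[i]) + (31 * i + 7));
}

/* Big-endian increment of the whole counter block. */
void ctr_increment(unsigned char *ctr)
{
    int i;
    for (i = BLK - 1; i >= 0; i--)
        if (++ctr[i] != 0)
            break; /* no carry into the next byte */
}

void ctr_init(CtrState *s, const unsigned char *key, const unsigned char *iv)
{
    assert(s != NULL);
    memcpy(s->key, key, BLK);
    memcpy(s->reg, iv, BLK);
    memset(s->tmp, 0, BLK);
    s->left = 0;
}

/* Encrypts or decrypts len bytes; leftover keystream is kept for the next call. */
void ctr_crypt(CtrState *s, unsigned char *out, const unsigned char *in,
               size_t len)
{
    while (len > 0) {
        if (s->left == 0) {
            toy_ecb_block(s->key, s->reg, s->tmp); /* one ECB block = keystream */
            ctr_increment(s->reg);
            s->left = BLK;
        }
        *out++ = (unsigned char)(*in++ ^ s->tmp[BLK - s->left]);
        s->left--;
        len--;
    }
}
```

Because `left` survives between calls, splitting a message at arbitrary byte boundaries gives the same output as a single call, which is the property the note describes for partial blocks.
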
| 131 | +
|
| 132 | +## Status |
| 133 | +
|
| 134 | +Validated on RTL8735B silicon (both the RealTek FreeRTOS SDK app and a Zephyr |
| 135 | +image): registration; AES-GCM (encrypt / deterministic tag / decrypt-verify / |
| 136 | +round-trip / wrong-seed -> `AES_GCM_AUTH_E` / unaligned buffers / non-12-byte-IV |
| 137 | +reject); AES-ECB; AES-CBC (incl. in-place, multi-call); AES-CTR; and HUK-bound |
| 138 | +ECDSA (P-256) -- all pass. |
| 139 | +
|
| 140 | +- Stage 0 (skeleton, build wiring, host compile-test): done. |
| 141 | +- Stage 1 (HUK key-ladder + full AES-GCM): done, validated on hardware. |
| 142 | +- Stage 2 (AES-ECB / AES-CBC / AES-CTR): done, validated on hardware. |
| 143 | +- Stage 3 (HUK-bound ECDSA sign, wrapped-scalar): done, validated on RTL8735B |
| 144 | + (P-256 sign verifies against the original public key; tampered hash fails). |
| 145 | + OTP-resident keys and HW-ECDSA-engine signing are follow-ons. |
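
The unaligned-buffer case in the validation list relies on staging data through 32-byte-aligned temporaries before handing it to the engine. A rough sketch of that bounce pattern follows; `fake_engine_xor` is a hypothetical stand-in for a HAL call that refuses unaligned buffers, `BOUNCE_SZ` is an arbitrary chunk size chosen for the example, and the sketch covers only the copy-through-aligned-buffer step (C11 `_Alignas` is assumed to be available).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_ALIGN 32u /* cache-line alignment the engine is said to require */
#define BOUNCE_SZ 64u /* arbitrary chunk size for this sketch */

int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p % DMA_ALIGN) == 0;
}

/* Hypothetical engine call: fails on unaligned buffers, otherwise XORs data. */
int fake_engine_xor(unsigned char *out, const unsigned char *in, size_t len)
{
    size_t i;
    if (!is_dma_aligned(in) || !is_dma_aligned(out))
        return -1;
    for (i = 0; i < len; i++)
        out[i] = (unsigned char)(in[i] ^ 0xA5);
    return 0;
}

/* Accepts any caller alignment by bouncing through aligned stack buffers. */
int engine_op_any_alignment(unsigned char *out, const unsigned char *in,
                            size_t len)
{
    _Alignas(32) unsigned char bin[BOUNCE_SZ];
    _Alignas(32) unsigned char bout[BOUNCE_SZ];
    int ret = 0;

    assert(out != NULL && in != NULL);

    if (is_dma_aligned(in) && is_dma_aligned(out))
        return fake_engine_xor(out, in, len); /* fast path: no copy needed */

    while (len > 0) {
        size_t n = (len < BOUNCE_SZ) ? len : BOUNCE_SZ;
        memcpy(bin, in, n);                /* stage input on an aligned buffer */
        ret = fake_engine_xor(bout, bin, n);
        if (ret != 0)
            break;
        memcpy(out, bout, n);              /* copy result back to the caller */
        in += n;
        out += n;
        len -= n;
    }
    return ret;
}
```
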
| 146 | +
|
| 147 | +## Benchmarks (software crypto baseline) |
| 148 | +
|
| 149 | +`wolfcrypt_test` (full self-test, all PASS) and `wolfcrypt_benchmark` were run on |
| 150 | +the RTL8735B EVB to validate the core library and toolchain on this target. The |
| 151 | +figures below are **pure software wolfCrypt** -- they are NOT the HUK device |
| 152 | +(which routes AES through the silicon engine for HUK-derived keys); they serve as |
| 153 | +a reference baseline and to size the benefit of hardware offload. |
| 154 | +
|
| 155 | +- Target: RTL8735B "KM4" Arm Cortex-M33 (ARMv8-M Mainline, TrustZone + DSP) at |
| 156 | + 500 MHz (`CPU_CLK`); DDR at 533 MHz. |
| 157 | +- Toolchain / build: RealTek ASDK 10.3.0 (GCC 10.3.0), SDK default `-Os`, |
| 158 | + FreeRTOS, `WOLFCRYPT_ONLY`, `SINGLE_THREADED`, big-integer math via the generic |
| 159 | + `WOLFSSL_SP_MATH_ALL` (portable C, no Cortex-M assembly), `BENCH_EMBEDDED`. |
| 160 | +- Build options live with the SDK example (not in the wolfSSL tree): |
| 161 | + `component/example/wolfcrypt_test/{user_settings.h, wolfcrypt_test.cmake, |
| 162 | + main.c}` of the AmebaPro2 FreeRTOS SDK. The RNG is seeded from the SDK |
| 163 | + `rtw_get_random_bytes`; `current_time()` uses `hal_read_systime_us()`. |
| 164 | +
|
| 165 | +Symmetric / hash (higher is better): |
| 166 | +
|
| 167 | +| Algorithm | Throughput | |
| 168 | +|---------------------|------------| |
| 169 | +| AES-128-CBC enc/dec | 9.55 / 9.67 MiB/s | |
| 170 | +| AES-256-CBC enc/dec | 7.25 / 7.02 MiB/s | |
| 171 | +| AES-128-GCM enc/dec | 5.35 / 5.33 MiB/s | |
| 172 | +| AES-256-GCM enc/dec | 4.53 / 4.52 MiB/s | |
| 173 | +| AES-128-CTR | 9.75 MiB/s | |
| 174 | +| AES-128-ECB enc/dec | 10.42 / 10.56 MiB/s | |
| 175 | +| AES-CCM enc/dec | 4.73 / 4.65 MiB/s | |
| 176 | +| GMAC (4-bit table) | 13.43 MiB/s | |
| 177 | +| AES-128-CMAC | 8.84 MiB/s | |
| 178 | +| ChaCha20 | 24.79 MiB/s | |
| 179 | +| ChaCha20-Poly1305 | 15.83 MiB/s | |
| 180 | +| Poly1305 | 64.77 MiB/s | |
| 181 | +| SHA-1 | 29.19 MiB/s | |
| 182 | +| SHA-256 | 10.94 MiB/s | |
| 183 | +| SHA-512 | 7.29 MiB/s | |
| 184 | +| SHA3-256 | 6.61 MiB/s | |
| 185 | +| HMAC-SHA256 | 10.85 MiB/s | |
| 186 | +
|
| 187 | +Public key (higher is better): |
| 188 | +
|
| 189 | +| Operation | Rate | |
| 190 | +|-----------------------|------| |
| 191 | +| RSA-2048 public | 214.7 ops/s | |
| 192 | +| RSA-2048 private | 6.14 ops/s | |
| 193 | +| RSA-2048 key gen | 0.40 ops/s | |
| 194 | +| DH-2048 key gen/agree | 17.67 / 15.23 ops/s | |
| 195 | +| ECDSA P-256 sign/verify | 40.03 / 29.81 ops/s | |
| 196 | +| ECDHE P-256 agree | 40.69 ops/s | |
| 197 | +| Curve25519 key gen/agree | 414.8 / 419.4 ops/s | |
| 198 | +| Ed25519 sign/verify | 788.3 / 397.0 ops/s | |
| 199 | +
|
| 200 | +The tables above are the portable-C baseline. The assembly backends below raise |
| 201 | +these substantially. Curve25519/Ed25519 already use the dedicated |
| 202 | +`curve25519.c`/`ed25519.c` fast code. |
| 203 | +
|
| 204 | +## Optimizations (measured on RTL8735B @ 500 MHz, -Os) |
| 205 | +
|
| 206 | +Two wolfCrypt assembly backends apply to this Cortex-M33 and were validated on |
| 207 | +hardware (both keep `wolfcrypt_test` all-PASS). Neither needs wolfSSL source |
| 208 | +changes -- they are build-config selections plus adding the relevant asm files. |
| 209 | +
|
| 210 | +### 1. Public key -- `sp_cortexm.c` (Thumb-2/DSP single-precision) |
| 211 | +
|
| 212 | +Enable with `WOLFSSL_SP_ARM_CORTEX_M_ASM` + `WOLFSSL_HAVE_SP_RSA` + |
| 213 | +`WOLFSSL_HAVE_SP_ECC` + `WOLFSSL_HAVE_SP_DH`, and add `wolfcrypt/src/sp_cortexm.c` |
| 214 | +to the build (alongside the generic `sp_int.c` for sizes without an asm path). |
| 215 | +
|
| 216 | +| Operation | Generic C (ops/s) | sp_cortexm | Speedup |
| 217 | +|------------------------|-----------|------------|---------| |
| 218 | +| ECC P-256 key gen | 40.7 | 541.2 ops/s | 13.3x | |
| 219 | +| ECDSA P-256 sign | 40.0 | 427.6 ops/s | 10.7x | |
| 220 | +| ECDSA P-256 verify | 29.8 | 292.7 ops/s | 9.8x | |
| 221 | +| ECDHE P-256 agree | 40.7 | 318.1 ops/s | 7.8x | |
| 222 | +| RSA-2048 public | 214.7 | 618.4 ops/s | 2.9x | |
| 223 | +| RSA-2048 private | 6.14 | 19.0 ops/s | 3.1x | |
| 224 | +| DH-2048 agree | 15.2 | 38.3 ops/s | 2.5x | |
| 225 | +
|
| 226 | +### 2. Symmetric -- Thumb-2 asm (`port/arm/thumb2-*-asm.S`) |
| 227 | +
|
| 228 | +Enable with `WOLFSSL_ARMASM` + `WOLFSSL_ARMASM_THUMB2` + |
| 229 | +`WOLFSSL_ARMASM_NO_HW_CRYPTO` + `WOLFSSL_ARMASM_NO_NEON` + `WOLFSSL_ARM_ARCH=7`, |
| 230 | +and add `thumb2-aes-asm.S`, `thumb2-sha256-asm.S`, `thumb2-sha512-asm.S`, |
| 231 | +`thumb2-sha3-asm.S`, `thumb2-chacha-asm.S`, `thumb2-poly1305-asm.S`. |
| 232 | +`WOLFSSL_ARMASM` is a global switch, so provide the `.S` for every covered |
| 233 | +module. (Curve25519/Ed25519 also have Thumb-2 asm but their `ge_operations.c` |
| 234 | +integration assumes 64-bit and was left on the C path here.) |
| 235 | +
|
| 236 | +| Algorithm | Generic C (MiB/s) | Thumb-2 asm | Speedup |
| 237 | +|---------------------|-----------|-------------|---------| |
| 238 | +| AES-128-CBC enc | 9.55 | 20.85 MiB/s | 2.2x | |
| 239 | +| AES-128-ECB enc | 10.42 | 20.82 MiB/s | 2.0x | |
| 240 | +| AES-128-CTR | 9.75 | 20.47 MiB/s | 2.1x | |
| 241 | +| AES-128-GCM enc | 5.35 | 10.30 MiB/s | 1.9x | |
| 242 | +| GMAC | 13.43 | 20.81 MiB/s | 1.5x | |
| 243 | +| AES-128-CMAC | 8.84 | 14.67 MiB/s | 1.7x | |
| 244 | +| ChaCha20 | 24.79 | 46.44 MiB/s | 1.9x | |
| 245 | +| ChaCha20-Poly1305 | 15.83 | 25.38 MiB/s | 1.6x | |
| 246 | +| SHA-256 | 10.94 | 17.83 MiB/s | 1.6x | |
| 247 | +| SHA3-256 | 6.61 | 8.64 MiB/s | 1.3x | |
| 248 | +| HMAC-SHA256 | 10.85 | 17.66 MiB/s | 1.6x | |
| 249 | +
|
| 250 | +### Note on hardware offload |
| 251 | +
|
| 252 | +For AES, hashing and ECDSA the RTL8735B has a dedicated crypto engine (the HAL |
| 253 | +`hal_crypto_*` / `hal_ecdsa` blocks this HUK port already uses for HUK-derived |
| 254 | +keys). A general (any-key) HW crypto-callback port over that engine would beat |
| 255 | +the Thumb-2 software figures above and is the recommended production path for |
| 256 | +symmetric throughput; the Thumb-2 asm is the portable software fallback. The |
| 257 | +`sp_cortexm.c` PK speedup is worth taking regardless, since it needs no silicon |
| 258 | +support. |