fix AVX-512BW instruction support for VPBROADCASTW and VPMOVZXBW by dscho · Pull Request #238 · Terraspace/UASM

dscho · 2026-05-04T17:30:56Z

Three separate fixes for AVX-512BW instruction encoding:

vpbroadcastw zmm,r32: codegen.c unconditionally rejects GPR sources for VPBROADCAST at line 3399. CodeGenV2 handles VPBROADCASTD but not VPBROADCASTW, which falls through to legacy codegen that has a blanket rejection. Fix: guard with !evex. Also fixed the operand class in instruct.h from R16 to R32 per Intel SDM (EVEX.512.66.0F38.W0 7B /r takes r32/r64).
vpmovzxbw zmm,ymm/m256: two issues. (a) OpCls(ZMM, YMM_M256, NONE) was commented out in opndcls.h since the initial 2.16 commit (c71541f, 2016-11-06), but codegen at line 1268 already handles it. (b) codegen.c line 1268 gates ZMM on decoflags (opmask decoration required), but the instruction is valid without decoration. This was a regression from commit 5d9acd2 (2019-01-09, starting 2.48 codegen v2) that re-added the decoflags guard after it had been correctly removed in 0a19c01 (2018-12-10). Fix: change decoflags to evex.

Includes a new regression test (avx512bw.asm) that verifies both instructions assemble correctly.

These fixes are needed to assemble onnxruntime's MLAS AVX-512 Core kernel files (ConvSymKernelAvx512Core.asm, QgemmU8X8KernelAvx512Core.asm, QgemvU8S8KernelAvx512Core.asm).

Several AVX-512BW instructions that use ZMM registers were rejected by the assembler even with `option evex:1` enabled. This surfaced while building Microsoft's ONNX Runtime MLAS (Machine Learning Algebra Subprograms) kernels, specifically QgemmU8X8KernelAvx512Core.asm, ConvSymKernelAvx512Core.asm, and QgemvU8S8KernelAvx512Core.asm, which use `vpbroadcastw zmm, r32` and `vpmovzxbw zmm, ymmword ptr [mem]` extensively in their quantized integer GEMM inner loops. Three separate issues were involved, each in a different layer of the assembler pipeline: The first issue is in the parser's operand class table. The operand class entries `OpCls(ZMM, YMM_M256, NONE)` and `OpCls(ZMM, XMM_M128, NONE)` were scaffolded but commented out in H/opndcls.h since the initial 2.16 commit (c71541f, 2016-11-06). The comment attributes them to habran and labels them "AVX2". Without these entries, the parser cannot match an instruction with a ZMM destination and a YMM-sized or XMM-sized source operand, so `vpmovzxbw zmm, ymmword ptr [mem]` is rejected before the codegen ever sees it. Enabling these entries is now safe because the codegen already handles the ZMM forms. The relevant validation at codegen.c line 1268 checks for `r1type == OP_ZMM` with `r2type == OP_YMM || mem_type == MT_YMMWORD` and clears the W bit in the EVEX prefix accordingly. This codegen support has been present since the initial 2.16 commit. Alongside the opndcls.h change, a corresponding instruction table entry was added to H/instruct.h for the ZMM form of PMOVZXBW (opcode 0x30, prefix 66.0F38, tuple type BSIZE|HVM), matching the Intel SDM definition of EVEX.512.66.0F38.WIG 30 /r. The opcode, prefix byte, and flags are identical to the existing XMM and YMM forms; only the EVEX vector length differs, which the encoder sets automatically from the destination register width. The second issue is in codegen.c, where a validation added in UASM 2.56 unconditionally rejects GPR (general-purpose register) sources for all VPBROADCAST variants. The check's own comment reads "legacy CODEGEN only handles AVX2 (not 512)", and indeed the legacy encoder cannot produce the EVEX-encoded GPR broadcast form. However, when `option evex:1` is active, the EVEX encoder further down the same legacy code path handles the GPR form correctly. The check was therefore narrowed to only fire when the global `evex` flag is off, i.e. when the legacy AVX2 path is genuinely the only option. In addition, the VPBROADCASTW instruction table entry in H/instruct.h specified R16 as the source operand class for the GPR form. The Intel SDM defines this form (EVEX.512.66.0F38.W0 7B /r) as taking r32/r64, not r16. The low 16 bits of the 32-bit register are used, but the encoding references a 32-bit register. The operand class was corrected from R16 to R32 to match the specification and MASM behavior. The third issue is also in codegen.c, where the ZMM path for VPMOVZXBW (and the related VPMOVSX/VPMOVZX family) was gated behind `decoflags` being nonzero. The `decoflags` variable tracks EVEX decorations such as opmask registers (`{k1}`). Requiring a decoration for ZMM PMOVZX is incorrect: the instruction is valid without an opmask. This was confirmed empirically: `vpmovzxbw zmm0{k1}, [rax]` assembled successfully while `vpmovzxbw zmm0, [rax]` was rejected. The `decoflags` guard appears to be an accidental regression. Commit 0a19c01 ("codegen fixes", 2018-12-10) had removed it, making the ZMM path work unconditionally. But the subsequent commit 5d9acd2 ("starting 2.48 codegen v2", 2019-01-09) re-added it as part of a larger refactor, restoring the broken behavior. The fix replaces `decoflags` with `evex`, which is the correct guard: it checks whether the user has opted into EVEX encoding, not whether this particular instruction carries a decoration. Assisted-by: Claude Opus 4.6 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>

dscho mentioned this pull request May 4, 2026

mingw-w64-onnxruntime: new package (v1.24.4) msys2/MINGW-packages#29328

Merged

dscho mentioned this pull request May 14, 2026

uasm: update to 2.58, add AVX-512BW instruction support msys2/MINGW-packages#29484

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix AVX-512BW instruction support for VPBROADCASTW and VPMOVZXBW#238

fix AVX-512BW instruction support for VPBROADCASTW and VPMOVZXBW#238
dscho wants to merge 1 commit into
Terraspace:v2.58from
dscho:fix-avx512bw-vpbroadcastw-vpmovzxbw

dscho commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dscho commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant