Skip to content

fix AVX-512BW instruction support for VPBROADCASTW and VPMOVZXBW#238

Open
dscho wants to merge 1 commit into
Terraspace:v2.58from
dscho:fix-avx512bw-vpbroadcastw-vpmovzxbw
Open

fix AVX-512BW instruction support for VPBROADCASTW and VPMOVZXBW#238
dscho wants to merge 1 commit into
Terraspace:v2.58from
dscho:fix-avx512bw-vpbroadcastw-vpmovzxbw

Conversation

@dscho

@dscho dscho commented May 4, 2026

Copy link
Copy Markdown

Three separate fixes for AVX-512BW instruction encoding:

  1. vpbroadcastw zmm,r32: codegen.c unconditionally rejects GPR sources for VPBROADCAST at line 3399. CodeGenV2 handles VPBROADCASTD but not VPBROADCASTW, which falls through to legacy codegen that has a blanket rejection. Fix: guard with !evex. Also fixed the operand class in instruct.h from R16 to R32 per Intel SDM (EVEX.512.66.0F38.W0 7B /r takes r32/r64).

  2. vpmovzxbw zmm,ymm/m256: two issues. (a) OpCls(ZMM, YMM_M256, NONE) was commented out in opndcls.h since the initial 2.16 commit (c71541f, 2016-11-06), but codegen at line 1268 already handles it. (b) codegen.c line 1268 gates ZMM on decoflags (opmask decoration required), but the instruction is valid without decoration. This was a regression from commit 5d9acd2 (2019-01-09, starting 2.48 codegen v2) that re-added the decoflags guard after it had been correctly removed in 0a19c01 (2018-12-10). Fix: change decoflags to evex.

Includes a new regression test (avx512bw.asm) that verifies both instructions assemble correctly.

These fixes are needed to assemble onnxruntime's MLAS AVX-512 Core kernel files (ConvSymKernelAvx512Core.asm, QgemmU8X8KernelAvx512Core.asm, QgemvU8S8KernelAvx512Core.asm).

Several AVX-512BW instructions that use ZMM registers were rejected by
the assembler even with `option evex:1` enabled. This surfaced while
building Microsoft's ONNX Runtime MLAS (Machine Learning Algebra
Subprograms) kernels, specifically QgemmU8X8KernelAvx512Core.asm,
ConvSymKernelAvx512Core.asm, and QgemvU8S8KernelAvx512Core.asm,
which use `vpbroadcastw zmm, r32` and `vpmovzxbw zmm, ymmword ptr
[mem]` extensively in their quantized integer GEMM inner loops.

Three separate issues were involved, each in a different layer of the
assembler pipeline:

The first issue is in the parser's operand class table. The operand
class entries `OpCls(ZMM, YMM_M256, NONE)` and `OpCls(ZMM, XMM_M128,
NONE)` were scaffolded but commented out in H/opndcls.h since the
initial 2.16 commit (c71541f, 2016-11-06). The comment attributes them
to habran and labels them "AVX2". Without these entries, the parser
cannot match an instruction with a ZMM destination and a YMM-sized
or XMM-sized source operand, so `vpmovzxbw zmm, ymmword ptr [mem]`
is rejected before the codegen ever sees it.

Enabling these entries is now safe because the codegen already handles
the ZMM forms. The relevant validation at codegen.c line 1268 checks
for `r1type == OP_ZMM` with `r2type == OP_YMM || mem_type ==
MT_YMMWORD` and clears the W bit in the EVEX prefix accordingly. This
codegen support has been present since the initial 2.16 commit.

Alongside the opndcls.h change, a corresponding instruction table
entry was added to H/instruct.h for the ZMM form of PMOVZXBW
(opcode 0x30, prefix 66.0F38, tuple type BSIZE|HVM), matching the
Intel SDM definition of EVEX.512.66.0F38.WIG 30 /r. The opcode,
prefix byte, and flags are identical to the existing XMM and YMM
forms; only the EVEX vector length differs, which the encoder sets
automatically from the destination register width.

The second issue is in codegen.c, where a validation added in UASM
2.56 unconditionally rejects GPR (general-purpose register) sources
for all VPBROADCAST variants. The check's own comment reads "legacy
CODEGEN only handles AVX2 (not 512)", and indeed the legacy encoder
cannot produce the EVEX-encoded GPR broadcast form. However, when
`option evex:1` is active, the EVEX encoder further down the same
legacy code path handles the GPR form correctly. The check was
therefore narrowed to only fire when the global `evex` flag is off,
i.e. when the legacy AVX2 path is genuinely the only option.

In addition, the VPBROADCASTW instruction table entry in
H/instruct.h specified R16 as the source operand class for the GPR
form. The Intel SDM defines this form (EVEX.512.66.0F38.W0 7B /r) as
taking r32/r64, not r16. The low 16 bits of the 32-bit register are
used, but the encoding references a 32-bit register. The operand
class was corrected from R16 to R32 to match the specification and
MASM behavior.

The third issue is also in codegen.c, where the ZMM path for
VPMOVZXBW (and the related VPMOVSX/VPMOVZX family) was gated behind
`decoflags` being nonzero. The `decoflags` variable tracks EVEX
decorations such as opmask registers (`{k1}`). Requiring a decoration
for ZMM PMOVZX is incorrect: the instruction is valid without an
opmask. This was confirmed empirically: `vpmovzxbw zmm0{k1}, [rax]`
assembled successfully while `vpmovzxbw zmm0, [rax]` was rejected.

The `decoflags` guard appears to be an accidental regression. Commit
0a19c01 ("codegen fixes", 2018-12-10) had removed it, making the ZMM
path work unconditionally. But the subsequent commit 5d9acd2
("starting 2.48 codegen v2", 2019-01-09) re-added it as part of a
larger refactor, restoring the broken behavior. The fix replaces
`decoflags` with `evex`, which is the correct guard: it checks
whether the user has opted into EVEX encoding, not whether this
particular instruction carries a decoration.

Assisted-by: Claude Opus 4.6
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant