fix AVX-512BW instruction support for VPBROADCASTW and VPMOVZXBW#238
Open
dscho wants to merge 1 commit into
Open
fix AVX-512BW instruction support for VPBROADCASTW and VPMOVZXBW#238dscho wants to merge 1 commit into
dscho wants to merge 1 commit into
Conversation
Several AVX-512BW instructions that use ZMM registers were rejected by the assembler even with `option evex:1` enabled. This surfaced while building Microsoft's ONNX Runtime MLAS (Machine Learning Algebra Subprograms) kernels, specifically QgemmU8X8KernelAvx512Core.asm, ConvSymKernelAvx512Core.asm, and QgemvU8S8KernelAvx512Core.asm, which use `vpbroadcastw zmm, r32` and `vpmovzxbw zmm, ymmword ptr [mem]` extensively in their quantized integer GEMM inner loops. Three separate issues were involved, each in a different layer of the assembler pipeline: The first issue is in the parser's operand class table. The operand class entries `OpCls(ZMM, YMM_M256, NONE)` and `OpCls(ZMM, XMM_M128, NONE)` were scaffolded but commented out in H/opndcls.h since the initial 2.16 commit (c71541f, 2016-11-06). The comment attributes them to habran and labels them "AVX2". Without these entries, the parser cannot match an instruction with a ZMM destination and a YMM-sized or XMM-sized source operand, so `vpmovzxbw zmm, ymmword ptr [mem]` is rejected before the codegen ever sees it. Enabling these entries is now safe because the codegen already handles the ZMM forms. The relevant validation at codegen.c line 1268 checks for `r1type == OP_ZMM` with `r2type == OP_YMM || mem_type == MT_YMMWORD` and clears the W bit in the EVEX prefix accordingly. This codegen support has been present since the initial 2.16 commit. Alongside the opndcls.h change, a corresponding instruction table entry was added to H/instruct.h for the ZMM form of PMOVZXBW (opcode 0x30, prefix 66.0F38, tuple type BSIZE|HVM), matching the Intel SDM definition of EVEX.512.66.0F38.WIG 30 /r. The opcode, prefix byte, and flags are identical to the existing XMM and YMM forms; only the EVEX vector length differs, which the encoder sets automatically from the destination register width. The second issue is in codegen.c, where a validation added in UASM 2.56 unconditionally rejects GPR (general-purpose register) sources for all VPBROADCAST variants. The check's own comment reads "legacy CODEGEN only handles AVX2 (not 512)", and indeed the legacy encoder cannot produce the EVEX-encoded GPR broadcast form. However, when `option evex:1` is active, the EVEX encoder further down the same legacy code path handles the GPR form correctly. The check was therefore narrowed to only fire when the global `evex` flag is off, i.e. when the legacy AVX2 path is genuinely the only option. In addition, the VPBROADCASTW instruction table entry in H/instruct.h specified R16 as the source operand class for the GPR form. The Intel SDM defines this form (EVEX.512.66.0F38.W0 7B /r) as taking r32/r64, not r16. The low 16 bits of the 32-bit register are used, but the encoding references a 32-bit register. The operand class was corrected from R16 to R32 to match the specification and MASM behavior. The third issue is also in codegen.c, where the ZMM path for VPMOVZXBW (and the related VPMOVSX/VPMOVZX family) was gated behind `decoflags` being nonzero. The `decoflags` variable tracks EVEX decorations such as opmask registers (`{k1}`). Requiring a decoration for ZMM PMOVZX is incorrect: the instruction is valid without an opmask. This was confirmed empirically: `vpmovzxbw zmm0{k1}, [rax]` assembled successfully while `vpmovzxbw zmm0, [rax]` was rejected. The `decoflags` guard appears to be an accidental regression. Commit 0a19c01 ("codegen fixes", 2018-12-10) had removed it, making the ZMM path work unconditionally. But the subsequent commit 5d9acd2 ("starting 2.48 codegen v2", 2019-01-09) re-added it as part of a larger refactor, restoring the broken behavior. The fix replaces `decoflags` with `evex`, which is the correct guard: it checks whether the user has opted into EVEX encoding, not whether this particular instruction carries a decoration. Assisted-by: Claude Opus 4.6 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Three separate fixes for AVX-512BW instruction encoding:
vpbroadcastw zmm,r32: codegen.c unconditionally rejects GPR sources for VPBROADCAST at line 3399. CodeGenV2 handles VPBROADCASTD but not VPBROADCASTW, which falls through to legacy codegen that has a blanket rejection. Fix: guard with !evex. Also fixed the operand class in instruct.h from R16 to R32 per Intel SDM (EVEX.512.66.0F38.W0 7B /r takes r32/r64).
vpmovzxbw zmm,ymm/m256: two issues. (a) OpCls(ZMM, YMM_M256, NONE) was commented out in opndcls.h since the initial 2.16 commit (c71541f, 2016-11-06), but codegen at line 1268 already handles it. (b) codegen.c line 1268 gates ZMM on decoflags (opmask decoration required), but the instruction is valid without decoration. This was a regression from commit 5d9acd2 (2019-01-09, starting 2.48 codegen v2) that re-added the decoflags guard after it had been correctly removed in 0a19c01 (2018-12-10). Fix: change decoflags to evex.
Includes a new regression test (avx512bw.asm) that verifies both instructions assemble correctly.
These fixes are needed to assemble onnxruntime's MLAS AVX-512 Core kernel files (ConvSymKernelAvx512Core.asm, QgemmU8X8KernelAvx512Core.asm, QgemvU8S8KernelAvx512Core.asm).