x86lint examines x86-64 machine code to find suboptimal encodings and sequences.
For example, add eax, 1 can encode with either an 8- or 32-bit immediate:
83C0 01
81C0 01000000
Using the former can result in smaller and faster code. x86lint can help compiler writers generate better code and documents the complexity of x86.
- missing LOCK prefix on CMPXCHG and XADD
- oversized ADD/SUB 128
05 80000000instead of83E8 80(ADD EAX, 128 -> SUB EAX, -128)2D 80000000instead of83C0 80(SUB EAX, 128 -> ADD EAX, -128)
- oversized ADD/SUB one
83C0 01instead ofFFC0(ADD EAX, 1 -> INC EAX, when CF is unused)836B10 01instead ofFF4B10(SUB DWORD [RBX+0x10], 1 -> DEC DWORD [RBX+0x10])
- oversized branch displacement
E9 00000000instead ofEB 03(JMP rel32 that fits in rel8)
- oversized displacement
8B 83 10000000instead of8B 43 10(MOV EAX, [RBX+0x10]; disp32 that fits in disp8)
- oversized EVEX encoding
62F1FD286FCAinstead ofC5FD6FCA(VMOVDQA64 YMM1, YMM2 -> VMOVDQA; without an opmask, broadcast, rounding, 512-bit length, or xmm16-31, the 4-byte EVEX prefix wastes 1-2 bytes over VEX)
- oversized immediates
81C0 01000000instead of83C0 01(ADD EAX, 1)68 01000000instead of6A 01(PUSH 1)
- oversized MOV encoding
C7C0 01000000instead ofB8 01000000(MOV EAX, 1)48 C7C0 01000000instead ofB8 01000000(MOV RAX, 1; the 32-bit form zero-extends)
- oversized TEST immediate
A9 01000000instead ofA8 01(TEST EAX, 1 -> TEST AL, 1; TEST has no sign-extended imm8 form, and a mask within the low seven bits sets identical flags at byte width)
- oversized XCHG encoding
87C8instead of91(XCHG EAX, ECX; the 90+r accumulator form is one byte)
- redundant ADD/SUB zero
83C0 00(ADD EAX, 0) -- use TEST or remove
- redundant MOV reg, reg
4889C0(MOV RAX, RAX)
- redundant shift/rotate by zero
C1E0 00(SHL EAX, 0) -- no-op, flags also unchanged
- suboptimal AND immediate
83E0 FF(AND EAX, 0xFF) -- use MOVZBL
- suboptimal CMP zero
83F8 00instead of85C0(CMP EAX, 0 -> TEST EAX, EAX)
- suboptimal IMUL constant
6BC0 03(IMUL EAX, EAX, 3) -- use LEA or SHL
- suboptimal LEA
488D03(LEA RAX, [RBX]) -- use MOV RAX, RBX (more ports, no AGU)
suboptimal NOP sequence, see #9- multiple
90instead of a single66 90, etc.
- multiple
- suboptimal OR/AND reg, reg
09C0(OR EAX, EAX) -- use TEST EAX, EAX (same flags, no register write)
- suboptimal SSE MOV opcode
660F6FCAinstead of0F28CA(MOVDQA XMM1, XMM2 -> MOVAPS; the legacy 66/F3-prefixed copies movapd/movdqa/movupd/movdqu waste a byte over movaps/movups)
- suboptimal SUB reg, reg
29C0(SUB EAX, EAX) -- use XOR for dependency-breaking
suboptimal zero register, see #7- MOV EAX, 0 instead of XOR EAX, EAX
- unneeded explicit immediate
C1D0 01instead ofD1D0(RCL EAX, 1)
- unneeded explicit register
81C0 00010000instead of05 00010000(ADD EAX, 0x100)
- unneeded LOCK prefix on XCHG
- unneeded MOVSX
0FBFC0instead of98(MOVSX EAX, AX -> CWDE)66 0FBEC0instead of66 98(MOVSX AX, AL -> CBW)
- unneeded MOVSXD
48 63 C0instead of48 98(MOVSXD RAX, EAX -> CDQE)
- unneeded REX prefix
- XOR RAX, RAX
4831C0instead of XOR EAX, EAX31C0 40C9instead ofC9(LEAVE)48 0FB6C3instead of0FB6C3(MOVZX RAX, BL -> MOVZX EAX, BL; the r32 form zero-extends to 64)
- XOR RAX, RAX
- unneeded SIB byte
C64465 04 05instead ofC645 04 05(MOV byte [RBP+4], 5)
- unneeded zero displacement
017E 00instead of013E(ADD [RSI], EDI)
First install the Intel x86 encoder decoder:
git clone https://github.com/intelxed/xed.git xed
git clone https://github.com/intelxed/mbuild.git mbuild
cd xed
./mfile.py install --install-dir=kits/xed-install
Next build x86lint:
git clone https://github.com/gaul/x86lint.git x86lint
cd x86lint
XED_PATH=/path/to/xed make all
x86lint is intended to be part of compiler test suites, which should
#include "x86lint.h" and link libx86lint.a. Pass the just-emitted machine
code to check_instructions; its return value is the number of opportunities
found, which a test can assert is zero:
#include "x86lint.h"
// inst/len: the x86-64 bytes to check (e.g. a function the compiler just
// emitted). Returns the opportunity count (0 == clean). An undecodable byte
// is skipped and the scan resyncs rather than failing, since executable input
// can interleave data with code.
int lint(const uint8_t *inst, size_t len)
{
xed_tables_init();
return check_instructions(inst, len, /*verbose=*/false, /*summary=*/NULL);
}The optional summary accumulates a by-type tally across one or more runs
(x86lint_summary_create / _print / _destroy; pass NULL to skip it),
and verbose controls whether each opportunity is printed as it is found.
x86lint_summary_skipped reports how many undecodable bytes were skipped, so
incomplete coverage of a data-laden section is not mistaken for a clean scan.
x86lint can also read arbitrary 64-bit ELF executables directly. By default it prints only a summary -- the opportunities grouped by type and sorted by prevalence -- followed by a total and the number of instructions scanned:
$ ./x86lint /bin/ls
Optimization opportunities by type:
169 oversized ADD/SUB one
105 oversized immediate
8 oversized branch displacement
282 optimization opportunities in 22705 instructionsPass -v to also print each opportunity -- its one-line disassembly plus the
offending encoding -- ahead of the summary:
$ ./x86lint -v /bin/ls
oversized immediate at offset: 0x14: push 0x0
68 00 00 00 00
...The process exits non-zero when any opportunity is found, so x86lint can gate a compiler test suite.
- Agner Fog optimization guide
- Intel instruction set reference
- Intel optimization manual
- Intel x86 encoder decoder - library to parse instructions
- armlint - AArch64 equivalent of armlint
Copyright (C) 2018 Andrew Gaul
Licensed under the Apache License, Version 2.0