[asm] Enable x86_64 asm for windows builds #4246

Merged
merged 2 commits into from
Jan 19, 2025

pps83
Contributor

@pps83 pps83 commented Jan 18, 2025

No description provided.

@Cyan4973
Contributor

minor question:
any performance number, that would illustrate the benefit of this patch ?

This would be useful for future reference, when people look at this PR.

@pps83
Contributor Author

pps83 commented Jan 19, 2025

any performance number, that would illustrate the benefit of this patch ?

I consistently get 2-3% better times with the MS compiler when I enable asm. I also build the asm with yasm (but building with mingw64 also works).

Even though it's unrelated to the change, curiously, I get roughly 10-15% better decoding times with any other compiler compared to the MS compiler (with or without the asm). I tried clang-cl, mingw64 (ver 14.2.0), and the Intel icx compiler (2024, 2025). In my experience, the MS compiler is usually pretty good with default optimization settings. I tried all kinds of things and it's always 10-15% slower at decoding. No idea why; perhaps it hits some bad case somewhere.

@Cyan4973
Contributor

It is also our experience that Visual Studio compiler gets lower performance compared to gcc or clang.
We are unsure why, it's likely an accumulation of several minor details,
though fixing them would require some dedicated attention,
and unfortunately, MS Visual is not one of our targets so far.
We just ensure correctness for it, but we don't have time allocated to optimize specifically for MS Visual.

@Cyan4973
Contributor

I can confirm a fairly decent decompression speed gain on Windows,
measured at level 1 (which is the most favorable scenario) on a Lunar Lake cpu
and compiling zstd under mingw64, using both gcc and clang.
The gains when using gcc are larger, partially because it starts from a lower speed.
I would expect the gains to also be present with MS Visual, and likely larger.

| compiler | dataset | dev | PR | improvement |
| --- | --- | --- | --- | --- |
| gcc 14.2.0 | silesia.tar | 1826 MB/s | 1962 MB/s | +7.4% |
| gcc 14.2.0 | enwik7 | 1656 MB/s | 1800 MB/s | +8.7% |
| clang 19.1.6 | silesia.tar | 1956 MB/s | 1994 MB/s | +1.9% |
| clang 19.1.6 | enwik7 | 1807 MB/s | 1867 MB/s | +3.3% |
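
For future readers who want to re-derive the improvement column, here is a minimal sketch that recomputes it from the dev and PR throughput figures in the table above. The dictionary layout and the function name `improvement_percent` are illustrative only and are not part of zstd or its benchmark tooling.

```python
# Decompression throughput in MB/s, copied from the table above: (dev, PR).
results = {
    ("gcc 14.2.0", "silesia.tar"): (1826, 1962),
    ("gcc 14.2.0", "enwik7"): (1656, 1800),
    ("clang 19.1.6", "silesia.tar"): (1956, 1994),
    ("clang 19.1.6", "enwik7"): (1807, 1867),
}


def improvement_percent(dev, pr):
    """Relative throughput gain of the PR build over the dev build, in percent."""
    return (pr / dev - 1.0) * 100.0


for (compiler, dataset), (dev, pr) in results.items():
    gain = improvement_percent(dev, pr)
    print(f"{compiler:13s} {dataset:12s} {gain:+.1f}%")
```

The gain is computed on throughput (MB/s), so a positive value means the PR build decompresses faster than dev.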

@Cyan4973 Cyan4973 merged commit 167b004 into facebook:dev Jan 19, 2025
94 checks passed
@pps83
Contributor Author

pps83 commented Jan 20, 2025

It is also our experience that Visual Studio compiler gets lower performance compared to gcc or clang. We are unsure why, it's likely an accumulation of several minor details, though fixing them would require some dedicated attention, and unfortunately,

In my project I use multiple perf-critical libs, and none of them runs slower with the MS compiler than with gcc/clang. On the contrary, I've seen it perform better in some cases. Being 15-20% slower at zstd decoding stands out.

@pps83 pps83 deleted the dev-asmx64-win branch January 20, 2025 00:48
@Cyan4973
Contributor

I suspect it depends on which target a library is primarily developed for.
If it's primarily meant to be used with MS Visual, then I would expect the developers to use MS Visual regularly, benchmark and optimize for this compiler regularly, and essentially iron out little details, better exploit compiler specifics over time, etc. As a consequence, the code would run faster with MS Visual than with any other, less tested compiler.
And of course, this effect can also happen in the other direction.

@pps83
Contributor Author

pps83 commented Jan 20, 2025

Can you recommend which functions or blocks of code (or logical blocks) I could instrument to see where the delay comes from?
These are the runtimes I get with CL, CLANG-CL, and ICX (Intel 2025 compiler).
None of these use asm. As you can see, I test levels 1, 5, 10, 15, and 22 (the smallest decoding time per run is the last column).

image

These are the readings from the Intel VTune profiler (I test with hardware perf counters):

CL:
image

CLANG-CL:
image

ICX:
image

@pps83
Contributor Author

pps83 commented Jan 21, 2025

One other interesting observation. If I run a longer test (all compression levels, using a few different samples of data to compress) and then sum all encoding and decoding times, I get these results:

CL:       zstd etime:6857685.50us, dtime:59257.70us
ICX:      zstd etime:6892638.80us, dtime:50652.00us
CLANG-CL: zstd etime:7386940.70us, dtime:51042.60us
MINGW64:  zstd etime:7180041.60us, dtime:51300.20us

What a surprise. All compilers except the MS compiler are within about 1.3% of each other when decoding, while the MS compiler is 16.2% slower than their average. When encoding, the MS compiler produces the best results: clang-cl takes about 7.7% longer.
That's what I usually saw with other libs: cl, icx, and gcc are usually best, while clang is slower. It feels like something fishy is going on with zstd decoding when built with the MS compiler.
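
As a sanity check, the relative gaps can be recomputed from the four summed timings listed above. This is a minimal sketch; the variable names and the helper `percent_slower` are illustrative and do not come from zstd or from the benchmark harness used in this thread.

```python
# Summed timings in microseconds, copied from the results above.
etime_us = {"CL": 6857685.50, "ICX": 6892638.80,
            "CLANG-CL": 7386940.70, "MINGW64": 7180041.60}
dtime_us = {"CL": 59257.70, "ICX": 50652.00,
            "CLANG-CL": 51042.60, "MINGW64": 51300.20}


def percent_slower(value, baseline):
    """How much longer `value` takes than `baseline`, in percent."""
    return (value / baseline - 1.0) * 100.0


# Decoding: CL compared against the mean of the other three compilers.
others = [t for name, t in dtime_us.items() if name != "CL"]
mean_others = sum(others) / len(others)
cl_decode_gap = percent_slower(dtime_us["CL"], mean_others)

# Decoding: spread between the slowest and fastest non-CL builds.
others_spread = percent_slower(max(others), min(others))

# Encoding: CLANG-CL relative to CL, which has the lowest encode time here.
clang_encode_gap = percent_slower(etime_us["CLANG-CL"], etime_us["CL"])

print(f"CL decode vs mean of others: +{cl_decode_gap:.1f}%")
print(f"decode spread among others:  +{others_spread:.1f}%")
print(f"CLANG-CL encode vs CL:       +{clang_encode_gap:.1f}%")
```

The decode gap is measured against the mean of the three non-CL builds; measuring against a single compiler instead shifts the figure by roughly a percentage point either way.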

@pps83
Contributor Author

pps83 commented Jan 22, 2025

I tried to take a look. It looks like some sort of optimizer issue where the MS compiler ends up using the stack more often than clang, specifically inside the ZSTD_decodeSequence and ZSTD_execSequence functions. ZSTD_decodeSequence has the higher impact. Looking at the generated asm, I don't see much except that clang, due to register pressure, ends up using an xmm register once, while the MS compiler seems to use only general-purpose registers and touches the stack more often. I think some minor code refactoring to reduce register pressure should fix this.

This is asm from clang:

ZSTD_decodeSequence:                    # @ZSTD_decodeSequence
.seh_proc ZSTD_decodeSequence
# %bb.0:
	push	r15
	.seh_pushreg r15
	push	r14
	.seh_pushreg r14
	push	r13
	.seh_pushreg r13
	push	r12
	.seh_pushreg r12
	push	rsi
	.seh_pushreg rsi
	push	rdi
	.seh_pushreg rdi
	push	rbp
	.seh_pushreg rbp
	push	rbx
	.seh_pushreg rbx
	sub	rsp, 32
	.seh_stackalloc 32
	.seh_endprologue
	mov	rax, rcx
	mov	r10, qword ptr [rdx + 40]
	mov	r11, qword ptr [rdx + 48]
	mov	rsi, qword ptr [rdx + 80]
	mov	r13, qword ptr [rdx + 72]
	mov	r8, qword ptr [rdx + 64]
	mov	rbp, qword ptr [rdx + 56]
	mov	r15d, dword ptr [rsi + 8*r13 + 4]
	mov	qword ptr [rcx + 8], r15
	mov	ebx, dword ptr [r11 + 8*r10 + 4]
	mov	qword ptr [rcx], rbx
	mov	ecx, dword ptr [r8 + 8*rbp + 4]
	movzx	r14d, byte ptr [r11 + 8*r10 + 2]
	movzx	r12d, byte ptr [rsi + 8*r13 + 2]
	movzx	edi, word ptr [r11 + 8*r10]
	mov	qword ptr [rsp], rdi            # 8-byte Spill
	movzx	edi, byte ptr [r11 + 8*r10 + 3]
	movzx	r10d, word ptr [rsi + 8*r13]
	mov	qword ptr [rsp + 8], r10        # 8-byte Spill
	movzx	r11d, byte ptr [rsi + 8*r13 + 3]
	movzx	r13d, byte ptr [r8 + 8*rbp + 2]
	movzx	r10d, word ptr [r8 + 8*rbp]
	mov	qword ptr [rsp + 24], r10       # 8-byte Spill
	movzx	r8d, byte ptr [r8 + 8*rbp + 3]
	mov	qword ptr [rsp + 16], r8        # 8-byte Spill
	lea	ebp, [r12 + r14]
	cmp	r13d, 2
	jb	.LBB0_2
# %bb.1:
	mov	esi, dword ptr [rdx + 8]
	shlx	r10, qword ptr [rdx], rsi
	mov	r8d, r13d
	neg	r8b
	shrx	r10, r10, r8
	add	esi, r13d
	mov	dword ptr [rdx + 8], esi
	add	r10, rcx
	vmovups	xmm0, xmmword ptr [rdx + 88]
	vmovups	xmmword ptr [rdx + 96], xmm0
	jmp	.LBB0_10
.LBB0_2:
	test	r13d, r13d
	je	.LBB0_3
# %bb.4:
	cmp	ebx, 1
	adc	ecx, 0
	mov	r8d, dword ptr [rdx + 8]
	shlx	rsi, qword ptr [rdx], r8
	shr	rsi, 63
	inc	r8d
	mov	dword ptr [rdx + 8], r8d
	add	rsi, rcx
	cmp	rsi, 3
	jne	.LBB0_6
# %bb.5:
	mov	r10, qword ptr [rdx + 88]
	dec	r10
	cmp	r10, 1
	sbb	r10, 0
	jmp	.LBB0_7
.LBB0_3:
	xor	ecx, ecx
	xor	r8d, r8d
	test	ebx, ebx
	setne	cl
	sete	r8b
	mov	r10, qword ptr [rdx + 8*r8 + 88]
	mov	rcx, qword ptr [rdx + 8*rcx + 88]
	jmp	.LBB0_9
.LBB0_6:
	mov	r10, qword ptr [rdx + 8*rsi + 88]
	cmp	r10, 1
	sbb	r10, 0
	cmp	rsi, 1
	je	.LBB0_8
.LBB0_7:
	mov	rcx, qword ptr [rdx + 96]
	mov	qword ptr [rdx + 104], rcx
.LBB0_8:
	mov	rcx, qword ptr [rdx + 88]
.LBB0_9:
	mov	qword ptr [rdx + 96], rcx
.LBB0_10:
	mov	qword ptr [rdx + 88], r10
	add	bpl, r13b
	mov	qword ptr [rax + 16], r10
	test	r12d, r12d
	je	.LBB0_12
# %bb.11:
	mov	ecx, dword ptr [rdx + 8]
	shlx	r8, qword ptr [rdx], rcx
	mov	r10d, r12d
	neg	r10b
	shrx	r8, r8, r10
	add	ecx, r12d
	mov	dword ptr [rdx + 8], ecx
	add	r8, r15
	mov	qword ptr [rax + 8], r8
.LBB0_12:
	cmp	bpl, 31
	jb	.LBB0_20
# %bb.13:
	mov	ecx, dword ptr [rdx + 8]
	cmp	rcx, 65
	jb	.LBB0_15
# %bb.14:
	lea	rcx, [rip + BIT_reloadDStream.zeroFilled]
	mov	qword ptr [rdx + 16], rcx
	jmp	.LBB0_20
.LBB0_15:
	mov	r10, qword ptr [rdx + 16]
	cmp	r10, qword ptr [rdx + 32]
	jae	.LBB0_16
# %bb.17:
	mov	rsi, qword ptr [rdx + 24]
	cmp	r10, rsi
	je	.LBB0_20
# %bb.18:
	mov	r8d, ecx
	shr	r8d, 3
	mov	r15, r10
	sub	r15, r8
	mov	r12d, r10d
	sub	r12d, esi
	cmp	r15, rsi
	cmovae	r12d, r8d
	sub	r10, r12
	mov	qword ptr [rdx + 16], r10
	shl	r12d, 3
	sub	ecx, r12d
	jmp	.LBB0_19
.LBB0_16:
	mov	r8d, ecx
	shr	r8d, 3
	sub	r10, r8
	mov	qword ptr [rdx + 16], r10
	and	ecx, 7
.LBB0_19:
	mov	dword ptr [rdx + 8], ecx
	mov	rcx, qword ptr [r10]
	mov	qword ptr [rdx], rcx
.LBB0_20:
	test	r14d, r14d
	je	.LBB0_22
# %bb.21:
	mov	ecx, dword ptr [rdx + 8]
	shlx	r8, qword ptr [rdx], rcx
	mov	r10d, r14d
	neg	r10b
	shrx	r8, r8, r10
	add	ecx, r14d
	mov	dword ptr [rdx + 8], ecx
	add	r8, rbx
	mov	qword ptr [rax], r8
.LBB0_22:
	test	r9d, r9d
	jne	.LBB0_30
# %bb.23:
	mov	r9d, edi
	add	r9d, dword ptr [rdx + 8]
	mov	rcx, qword ptr [rdx]
	mov	r8d, r9d
	neg	r8b
	shrx	r8, rcx, r8
	bzhi	r8, r8, rdi
	add	r8, qword ptr [rsp]             # 8-byte Folded Reload
	mov	qword ptr [rdx + 40], r8
	add	r9d, r11d
	mov	r8d, r9d
	neg	r8b
	shrx	r8, rcx, r8
	bzhi	r8, r8, r11
	add	r8, qword ptr [rsp + 8]         # 8-byte Folded Reload
	mov	qword ptr [rdx + 72], r8
	mov	r10, qword ptr [rsp + 16]       # 8-byte Reload
	add	r9d, r10d
	mov	r8d, r9d
	neg	r8b
	shrx	rcx, rcx, r8
	bzhi	rcx, rcx, r10
	add	rcx, qword ptr [rsp + 24]       # 8-byte Folded Reload
	mov	dword ptr [rdx + 8], r9d
	mov	qword ptr [rdx + 56], rcx
	cmp	r9d, 65
	jb	.LBB0_25
# %bb.24:
	lea	rcx, [rip + BIT_reloadDStream.zeroFilled]
	mov	qword ptr [rdx + 16], rcx
	jmp	.LBB0_30
.LBB0_25:
	mov	rcx, qword ptr [rdx + 16]
	cmp	rcx, qword ptr [rdx + 32]
	jae	.LBB0_26
# %bb.27:
	mov	r8, qword ptr [rdx + 24]
	cmp	rcx, r8
	je	.LBB0_30
# %bb.28:
	mov	r10d, r9d
	shr	r10d, 3
	mov	r11, rcx
	sub	r11, r10
	mov	esi, ecx
	sub	esi, r8d
	cmp	r11, r8
	cmovae	esi, r10d
	sub	rcx, rsi
	mov	qword ptr [rdx + 16], rcx
	shl	esi, 3
	sub	r9d, esi
	jmp	.LBB0_29
.LBB0_26:
	mov	r8d, r9d
	shr	r8d, 3
	sub	rcx, r8
	mov	qword ptr [rdx + 16], rcx
	and	r9d, 7
.LBB0_29:
	mov	dword ptr [rdx + 8], r9d
	mov	rcx, qword ptr [rcx]
	mov	qword ptr [rdx], rcx
.LBB0_30:
	add	rsp, 32
	pop	rbx
	pop	rbp
	pop	rdi
	pop	rsi
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	ret

and this is asm from ms compiler:

ZSTD_decodeSequence PROC
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 111
$LN178:
	mov	QWORD PTR [rsp+24], rbx
	push	rbp
	push	rsi
	push	rdi
	push	r12
	push	r13
	push	r14
	push	r15
	sub	rsp, 16
; Line 130
	mov	r8, QWORD PTR [rdx+40]
	mov	rdi, rdx
	mov	rax, QWORD PTR [rdx+48]
	mov	rsi, rcx
; Line 137
	movzx	r12d, BYTE PTR [rax+r8*8+2]
	lea	rbx, QWORD PTR [rax+r8*8]
	mov	r8, QWORD PTR [rdx+72]
	mov	r13d, r9d
	mov	rax, QWORD PTR [rdx+80]
; Line 138
	movzx	r14d, BYTE PTR [rax+r8*8+2]
	lea	r11, QWORD PTR [rax+r8*8]
	mov	r8, QWORD PTR [rdx+56]
	mov	rax, QWORD PTR [rdx+64]
; Line 142
	movzx	edx, WORD PTR [rbx]
	lea	r10, QWORD PTR [rax+r8*8]
	mov	eax, DWORD PTR [r11+4]
; Line 160
	lea	r8, QWORD PTR [rdi+88]
	mov	r15d, DWORD PTR [r10+4]
	mov	WORD PTR llNext$1$[rsp], dx
	movzx	edx, WORD PTR [r11]
	movzx	r11d, BYTE PTR [r11+3]
	mov	QWORD PTR [rcx+8], rax
	movzx	eax, BYTE PTR [r10+2]
	mov	ecx, DWORD PTR [rbx+4]
	movzx	ebx, BYTE PTR [rbx+3]
	mov	WORD PTR mlNext$1$[rsp], dx
	movzx	edx, WORD PTR [r10]
	lea	ebp, DWORD PTR [r14+rax]
	add	bpl, r12b
	mov	WORD PTR ofNext$1$[rsp], dx
	movzx	edx, BYTE PTR [r10+3]
	mov	QWORD PTR [rsi], rcx
	mov	DWORD PTR mlnbBits$1$[rsp], r11d
	mov	DWORD PTR ofnbBits$1$[rsp], edx
	cmp	al, 1
	jbe	SHORT $LN5@ZSTD_decod
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 350
	mov	ecx, DWORD PTR [rdi+8]
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 174
	mov	edx, eax
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 355
	add	eax, ecx
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 174
	neg	edx
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 350
	shlx	rcx, QWORD PTR [rdi], rcx
; Line 355
	mov	DWORD PTR [rdi+8], eax
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 177
	mov	rax, QWORD PTR [rdi+96]
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 350
	shrx	rdx, rcx, rdx
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 179
	mov	rcx, QWORD PTR [r8]
	add	rdx, r15
	mov	QWORD PTR [rdi+104], rax
; Line 180
	jmp	$LN11@ZSTD_decod
$LN5@ZSTD_decod:
; Line 181
	xor	r11d, r11d
	test	ecx, ecx
	mov	edx, r11d
	sete	dl
; Line 182
	test	al, al
	jne	SHORT $LN10@ZSTD_decod
; Line 183
	mov	eax, edx
; Line 184
	test	ecx, ecx
	mov	r10d, 96				; 00000060H
	mov	rdx, QWORD PTR [rdi+rax*8+88]
	mov	eax, 88					; 00000058H
	cmovne	eax, r10d
; Line 185
	mov	rcx, QWORD PTR [rax+rdi]
; Line 186
	jmp	SHORT $LN152@ZSTD_decod
$LN10@ZSTD_decod:
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 350
	mov	ecx, DWORD PTR [rdi+8]
	shlx	r10, QWORD PTR [rdi], rcx
	shr	r10, 63					; 0000003fH
; Line 355
	lea	eax, DWORD PTR [rcx+1]
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 187
	lea	ecx, DWORD PTR [rdx+r15]
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 355
	mov	DWORD PTR [rdi+8], eax
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 187
	add	r10, rcx
; Line 188
	mov	rcx, QWORD PTR [r8]
	cmp	r10, 3
	jne	SHORT $LN21@ZSTD_decod
	lea	rdx, QWORD PTR [rcx-1]
	test	rdx, rdx
	sete	r11b
	sub	rdx, r11
	jmp	SHORT $LN157@ZSTD_decod
$LN21@ZSTD_decod:
	mov	rdx, QWORD PTR [rdi+r10*8+88]
; Line 189
	test	rdx, rdx
	sete	r11b
	sub	rdx, r11
; Line 190
	cmp	r10, 1
	je	SHORT $LN152@ZSTD_decod
$LN157@ZSTD_decod:
	mov	rax, QWORD PTR [rdi+96]
	mov	QWORD PTR [rdi+104], rax
$LN152@ZSTD_decod:
; Line 194
	mov	r11d, DWORD PTR mlnbBits$1$[rsp]
$LN11@ZSTD_decod:
	mov	QWORD PTR [rdi+96], rcx
	mov	QWORD PTR [r8], rdx
	mov	QWORD PTR [rsi+16], rdx
; Line 197
	test	r14b, r14b
	je	SHORT $LN155@ZSTD_decod
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 350
	mov	ecx, DWORD PTR [rdi+8]
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 198
	mov	edx, r14d
	neg	edx
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 355
	lea	eax, DWORD PTR [rcx+r14]
; Line 350
	shlx	rcx, QWORD PTR [rdi], rcx
	shrx	rcx, rcx, rdx
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 198
	add	QWORD PTR [rsi+8], rcx
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 355
	mov	DWORD PTR [rdi+8], eax
$LN155@ZSTD_decod:
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 202
	lea	r14, OFFSET FLAT:?zeroFilled@?2??BIT_reloadDStream@@9@9
	cmp	bpl, 31
	jb	SHORT $LN104@ZSTD_decod
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 415
	mov	r8d, DWORD PTR [rdi+8]
	cmp	r8d, 64					; 00000040H
	jbe	SHORT $LN105@ZSTD_decod
; Line 417
	mov	QWORD PTR [rdi+16], r14
; Line 419
	jmp	SHORT $LN104@ZSTD_decod
$LN105@ZSTD_decod:
; Line 424
	mov	rdx, QWORD PTR [rdi+16]
	cmp	rdx, QWORD PTR [rdi+32]
	jb	SHORT $LN106@ZSTD_decod
; Line 387
	mov	rax, r8
	shr	rax, 3
	sub	rdx, rax
; Line 389
	and	r8d, 7
; Line 425
	jmp	SHORT $LN175@ZSTD_decod
$LN106@ZSTD_decod:
; Line 427
	mov	r10, QWORD PTR [rdi+24]
	cmp	rdx, r10
	je	SHORT $LN104@ZSTD_decod
; Line 433
	mov	r9d, r8d
; Line 435
	mov	rcx, rdx
	shr	r9d, 3
	mov	eax, r9d
	sub	rcx, rax
	cmp	rcx, r10
	jae	SHORT $LN109@ZSTD_decod
; Line 436
	mov	r9d, edx
	sub	r9d, r10d
$LN109@ZSTD_decod:
; Line 439
	mov	eax, r9d
	sub	rdx, rax
; Line 440
	lea	eax, DWORD PTR [r9*8]
	sub	r8d, eax
$LN175@ZSTD_decod:
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 207
	mov	QWORD PTR [rdi+16], rdx
	mov	DWORD PTR [rdi+8], r8d
	mov	rax, QWORD PTR [rdx]
	mov	QWORD PTR [rdi], rax
$LN104@ZSTD_decod:
	test	r12b, r12b
	je	SHORT $LN4@ZSTD_decod
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 350
	mov	ecx, DWORD PTR [rdi+8]
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 208
	mov	edx, r12d
	neg	edx
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 355
	lea	eax, DWORD PTR [rcx+r12]
; Line 350
	shlx	rcx, QWORD PTR [rdi], rcx
	shrx	rcx, rcx, rdx
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 208
	add	QWORD PTR [rsi], rcx
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 355
	mov	DWORD PTR [rdi+8], eax
$LN4@ZSTD_decod:
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 216
	test	r13d, r13d
	jne	$LN158@ZSTD_decod
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 318
	mov	r9, QWORD PTR [rdi]
; Line 336
	mov	r8d, DWORD PTR [rdi+8]
	add	r8d, ebx
; Line 318
	mov	eax, r8d
; Line 336
	add	r8d, r11d
; Line 318
	neg	eax
	shrx	rcx, r9, rax
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 88
	movzx	eax, WORD PTR llNext$1$[rsp]
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 318
	bzhi	rdx, rcx, rbx
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 88
	add	rdx, rax
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 318
	mov	eax, r8d
	neg	eax
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 88
	mov	QWORD PTR [rdi+40], rdx
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 318
	shrx	rcx, r9, rax
	mov	eax, r11d
	bzhi	rdx, rcx, rax
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 88
	movzx	eax, WORD PTR mlNext$1$[rsp]
	add	rdx, rax
	mov	QWORD PTR [rdi+72], rdx
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 336
	mov	edx, DWORD PTR ofnbBits$1$[rsp]
	lea	r10d, DWORD PTR [r8+rdx]
; Line 318
	mov	eax, r10d
; Line 355
	mov	DWORD PTR [rdi+8], r10d
; Line 318
	neg	eax
	shrx	rcx, r9, rax
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 88
	movzx	eax, WORD PTR ofNext$1$[rsp]
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 318
	bzhi	rdx, rcx, rdx
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 88
	add	rdx, rax
	mov	QWORD PTR [rdi+56], rdx
; File D:\work-pps\backtest-engine\ext\zstd\lib\common\bitstream.h
; Line 415
	cmp	r10d, 64				; 00000040H
	jbe	SHORT $LN75@ZSTD_decod
; Line 417
	mov	QWORD PTR [rdi+16], r14
; Line 419
	jmp	SHORT $LN158@ZSTD_decod
$LN75@ZSTD_decod:
; Line 424
	mov	rdx, QWORD PTR [rdi+16]
	cmp	rdx, QWORD PTR [rdi+32]
	jb	SHORT $LN76@ZSTD_decod
; Line 387
	mov	eax, r10d
	shr	rax, 3
	sub	rdx, rax
; Line 389
	and	r10d, 7
; Line 425
	jmp	SHORT $LN176@ZSTD_decod
$LN76@ZSTD_decod:
; Line 427
	mov	r9, QWORD PTR [rdi+24]
	cmp	rdx, r9
	je	SHORT $LN158@ZSTD_decod
; Line 433
	mov	r8d, r10d
; Line 435
	mov	rcx, rdx
	shr	r8d, 3
	mov	eax, r8d
	sub	rcx, rax
	cmp	rcx, r9
	jae	SHORT $LN79@ZSTD_decod
; Line 436
	mov	r8d, edx
	sub	r8d, r9d
$LN79@ZSTD_decod:
; Line 439
	mov	eax, r8d
	sub	rdx, rax
; Line 440
	lea	eax, DWORD PTR [r8*8]
	sub	r10d, eax
$LN176@ZSTD_decod:
; File D:\work-pps\backtest-engine\ext\zstd\lib\decompress\ZSTD_decompressSequences_body_decodeSequence.c
; Line 227
	mov	QWORD PTR [rdi+16], rdx
	mov	DWORD PTR [rdi+8], r10d
	mov	rax, QWORD PTR [rdx]
	mov	QWORD PTR [rdi], rax
$LN158@ZSTD_decod:
	mov	rbx, QWORD PTR [rsp+96]
	mov	rax, rsi
	add	rsp, 16
	pop	r15
	pop	r14
	pop	r13
	pop	r12
	pop	rdi
	pop	rsi
	pop	rbp
	ret	0
ZSTD_decodeSequence ENDP
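
One rough way to compare how often each listing touches the stack is to count the instructions whose memory operand is addressed through rsp. The sketch below does this with a regular expression. The function name `count_stack_accesses` and the two short samples (a few lines copied from the listings above) are illustrative only. The count ignores push/pop and the prologue/epilogue adjustments of rsp, so it is only an approximation of spill traffic, not a profile.

```python
import re

# Matches a memory operand addressed through rsp, in both Intel-syntax
# flavours seen above: "[rsp + 8]" (clang) and "llNext$1$[rsp]" (MSVC).
STACK_OPERAND = re.compile(r"\[\s*rsp\b[^\]]*\]", re.IGNORECASE)


def count_stack_accesses(asm_text):
    """Count lines whose instruction reads or writes memory via rsp."""
    count = 0
    for line in asm_text.splitlines():
        # Drop trailing comments: clang uses '#', MSVC uses ';'.
        code = re.split(r"[#;]", line, maxsplit=1)[0]
        if STACK_OPERAND.search(code):
            count += 1
    return count


# A few lines copied from the clang listing above.
clang_sample = r"""
    mov   qword ptr [rsp], rdi            # 8-byte Spill
    movzx edi, byte ptr [r11 + 8*r10 + 3]
    add   r8, qword ptr [rsp + 8]         # 8-byte Folded Reload
    sub   rsp, 32
"""

# A few lines copied from the MSVC listing above.
msvc_sample = r"""
    mov   WORD PTR llNext$1$[rsp], dx
    movzx edx, WORD PTR [r11]
    movzx eax, WORD PTR llNext$1$[rsp]
    mov   DWORD PTR mlnbBits$1$[rsp], r11d
    push  rbp
"""

print("clang sample:", count_stack_accesses(clang_sample))
print("msvc sample: ", count_stack_accesses(msvc_sample))
```

To compare the full functions, the complete listings would be passed in instead of these samples. A count like this says nothing about which accesses sit on the hot path, so it should be read together with profiler data.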

@Cyan4973
Contributor

Indeed,
register pressure is known to be a complex issue when compiling decoding functions.
And that's where compiler optimizations can make a big difference.
clang used to be a lot worse than gcc on this topic, but has improved since, using the zstd decoding function as one of its benchmarks. I suspect MS Visual hasn't made the same optimization effort.
In which case, the assembly file would help msvc more, by taking the translation into assembly out of the compiler's hands.
However, it only affects the decoding of literals, so the sequence decoder still has to be compiled.
