BufferUtils: Optimize upload_untouched_skip_restart with AVX-512 paths #16932

Whatcookie · 2025-03-26T22:08:05Z

Uses vpcompress to vectorize this otherwise unvectorizable loop.
- The u16 path needs AVX-512-ICL because vpcompressw isn't included in Skylake-X level AVX-512.
- The u32 path is untested, as I couldn't find any games that hit it.

We use vpcompress register to register rather than directly to memory, since there's a bug with vpcompress to memory on Zen 4 that makes it exceedingly slow. In the future, we could detect this and emit the optimal instructions in the JIT instead, but the code is already so fast that it might not be worth the effort.

Overall, the code is nearly 10x faster than the scalar version on my Zen 4 machine.

Before:

After:

Megamouse

I think a lot of variables can be made const

rpcs3/Emu/RSX/Common/BufferUtils.cpp

AniLeo · 2025-03-27T00:57:18Z

Tried NieR Replicant (Mailbox and Fountain), Minecraft (Menu and Tutorial), and Diva F 2nd (Menu). No performance difference on my side in any of these cases with a 9800X3D + 6800 XT.

kd-11

All these paths have turned this module, which was meant to be a simple utils wrapper, into an unmaintainable mess. The function names are also getting messy as Intel adds more and more weird levels to the ISA.

Let's do this instead:

Move the different feature levels to separate files, leaving the generic implementation here.
We expose a dispatch table for each feature level and pick the one to use at startup using a static lambda initializer. Some things to watch out for: Arm64 is actually SSE4.2 compatible (including SSSE3) when using sse2neon. The generic path is only required for validation and as a reference for future architectures.
When Intel releases avx9000 or whatever, we just create a file for that feature set and don't keep adding to this file.

It's a lot more work but that's just how it is when you need to maintain a project.

Whatcookie · 2025-03-29T00:10:59Z

> All these paths have turned this module, which was meant to be a simple utils wrapper, into an unmaintainable mess. The function names are also getting messy as Intel adds more and more weird levels to the ISA.
>
> Let's do this instead:
>
> Move the different feature levels to separate files, leaving the generic implementation here.
>
> We expose a dispatch table for each feature level and pick the one to use at startup using a static lambda initializer. Some things to watch out for: Arm64 is actually SSE4.2 compatible (including SSSE3) when using sse2neon. The generic path is only required for validation and as a reference for future architectures.
>
> When Intel releases avx9000 or whatever, we just create a file for that feature set and don't keep adding to this file.
>
> It's a lot more work but that's just how it is when you need to maintain a project.

I think I need to recover the old SSE4.1 paths, since neko removed them in favor of emitting x86 instructions directly in the JIT, which won't work on ARM.

kd-11 · 2025-03-29T00:46:50Z

The jit asm backend emits (or tries to) different instructions based on the hardware. It is supposed to be platform agnostic.

Whatcookie · 2025-03-29T00:51:13Z

> The jit asm backend emits (or tries to) different instructions based on the hardware. It is supposed to be platform agnostic.

They're guarded by x86_64 ifdefs in this file, aren't they?

Megamouse reviewed Mar 26, 2025

Megamouse added the CPU and Optimization labels Mar 26, 2025

BufferUtils: Optimize upload_untouched_skip_restart with AVX-512 paths

430c3ed

- u16 path needs AVX-512-ICL because vpcompressw isn't included in skylake-x level AVX-512
- the u32 path is untested as I couldn't find any games that hit it

Whatcookie force-pushed the RSX branch from a26e34b to 430c3ed Compare March 26, 2025 22:22

elad335 and others added 5 commits March 27, 2025 08:51

Merge branch 'master' into RSX

d8b96b9

Merge branch 'master' into RSX

7870ac1

Merge branch 'master' into RSX

65cc596

Merge branch 'master' into RSX

3a0f650

Merge branch 'master' into RSX

97b8979

kd-11 reviewed Mar 28, 2025

elad335 added RSX and removed CPU labels Mar 30, 2025

Merge branch 'master' into RSX

1e1bbf7
