Skip to content

BufferUtils: Optimize upload_untouched_skip_restart with AVX-512 paths #16932

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

Whatcookie
Copy link
Member

Uses vpcompress to vectorize this otherwise unvectorizeable loop
the u16 path needs AVX-512-ICL because vpcompressw isn't included in skylake-x level AVX-512
the u32 path is untested as I couldn't find any games that hit it

We use vcompress register to register, rather than directly to memory since there's a bug with vcompress to memory on zen4, which makes it exceedingly slow. In the future, we could detect this and emit the optimal instructions in the jit instead. But the code is already so fast that it might not be worth the effort.

The code is overall nearly 10x faster than the scalar version on my zen4 machine.

Before:
image
After:
image

Before:
image
After:
image

Copy link
Contributor

@Megamouse Megamouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a lot of variables can be made const

@Megamouse Megamouse added CPU Optimization Optimizes existing code labels Mar 26, 2025
- u16 path needs AVX-512-ICL because vpcompressw isn't included in skylake-x level AVX-512
- the u32 path is untested as I couldn't find any games that hit it
@AniLeo
Copy link
Member

AniLeo commented Mar 27, 2025

Tried NieR Replicant Mailbox and Fountain, Minecraft Menu and Tutorial, Diva F 2nd Menu, no performance difference on my side on any of these cases with 9800X3D + 6800XT

Copy link
Contributor

@kd-11 kd-11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these paths have made this module which was meant to be a simple utils wrapper into an unmaintainable mess. The function names are also getting messy as intel adds more and more weird levels to the ISA.

Let's do this instead:

  1. Move the different feature levels to separate files leaving the generic implementation here.
  2. We expose a dispatch table for each feature level and pick the one to use at the start using a static lambda initializer. Some things to watch out for - Arm64 is actually sse4.2 compatible (including ssse3) when using sse2neon. Generic path is only required for validation as well as future architectures as a reference.
  3. When intel releases avx9000 or whatever we just create a file for that featureset and don't keep adding to this file.

It's a lot more work but that's just how it is when you need to maintain a project.

@Whatcookie
Copy link
Member Author

All these paths have made this module which was meant to be a simple utils wrapper into an unmaintainable mess. The function names are also getting messy as intel adds more and more weird levels to the ISA.

Let's do this instead:

  1. Move the different feature levels to separate files leaving the generic implementation here.
  2. We expose a dispatch table for each feature level and pick the one to use at the start using a static lambda initializer. Some things to watch out for - Arm64 is actually sse4.2 compatible (including ssse3) when using sse2neon. Generic path is only required for validation as well as future architectures as a reference.
  3. When intel releases avx9000 or whatever we just create a file for that featureset and don't keep adding to this file.

It's a lot more work but that's just how it is when you need to maintain a project.

All these paths have made this module which was meant to be a simple utils wrapper into an unmaintainable mess. The function names are also getting messy as intel adds more and more weird levels to the ISA.

Let's do this instead:

  1. Move the different feature levels to separate files leaving the generic implementation here.
  2. We expose a dispatch table for each feature level and pick the one to use at the start using a static lambda initializer. Some things to watch out for - Arm64 is actually sse4.2 compatible (including ssse3) when using sse2neon. Generic path is only required for validation as well as future architectures as a reference.
  3. When intel releases avx9000 or whatever we just create a file for that featureset and don't keep adding to this file.

It's a lot more work but that's just how it is when you need to maintain a project.

I think I need to recover the old sse4.1 paths since neko removed them in favor of emitting x86 instructions directly in the jit, which won't work on arm

@kd-11
Copy link
Contributor

kd-11 commented Mar 29, 2025

The jit asm backend emits (or tries to) different instructions based on the hardware. It is supposed to be platform agnostic.

@Whatcookie
Copy link
Member Author

The jit asm backend emits (or tries to) different instructions based on the hardware. It is supposed to be platform agnostic.

They're guarded by x86_64 ifdefs in this file, aren't they?

@elad335 elad335 added RSX and removed CPU labels Mar 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Optimization Optimizes existing code RSX
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants