-
-
Notifications
You must be signed in to change notification settings - Fork 2k
BufferUtils: Optimize upload_untouched_skip_restart with AVX-512 paths #16932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a lot of variables can be made const
- u16 path needs AVX-512-ICL because vpcompressw isn't included in skylake-x level AVX-512 - the u32 path is untested as I couldn't find any games that hit it
Tried NieR Replicant Mailbox and Fountain, Minecraft Menu and Tutorial, Diva F 2nd Menu, no performance difference on my side on any of these cases with 9800X3D + 6800XT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these paths have made this module which was meant to be a simple utils wrapper into an unmaintainable mess. The function names are also getting messy as intel adds more and more weird levels to the ISA.
Let's do this instead:
- Move the different feature levels to separate files leaving the generic implementation here.
- We expose a dispatch table for each feature level and pick the one to use at the start using a static lambda initializer. Some things to watch out for - Arm64 is actually sse4.2 compatible (including ssse3) when using sse2neon. Generic path is only required for validation as well as future architectures as a reference.
- When intel releases avx9000 or whatever we just create a file for that featureset and don't keep adding to this file.
It's a lot more work but that's just how it is when you need to maintain a project.
I think I need to recover the old sse4.1 paths since neko removed them in favor of emitting x86 instructions directly in the jit, which won't work on arm |
The jit asm backend emits (or tries to) different instructions based on the hardware. It is supposed to be platform agnostic. |
They're guarded by x86_64 ifdefs in this file, aren't they? |
Uses vpcompress to vectorize this otherwise unvectorizeable loop
the u16 path needs AVX-512-ICL because vpcompressw isn't included in skylake-x level AVX-512
the u32 path is untested as I couldn't find any games that hit it
We use vcompress register to register, rather than directly to memory since there's a bug with vcompress to memory on zen4, which makes it exceedingly slow. In the future, we could detect this and emit the optimal instructions in the jit instead. But the code is already so fast that it might not be worth the effort.
The code is overall nearly 10x faster than the scalar version on my zen4 machine.
Before:


After:
Before:


After: