Setting Instruction Set Floor #1795

Hoikas · 2025-09-17T19:41:59Z

Hoikas
Sep 17, 2025
Maintainer

Currently, we have quite a bit of code and restrictions to support SIMD instructions on x86 flavored architectures. For example, code that uses SIMD instructions explicitly has to be in its own file with some specific CMake trickery done for that file. The only code that currently uses SIMD instructions, however, is the hsMatrix44 multiplication operator and the DX9 pipeline's skinning functionality. The goal with these SIMD paths is to optimize the code, but in order to do this and still support older processors, we gate the code behind a runtime check with basically a virtual function call. Further, our runtime dispatch code only supports x86 instruction sets. It doesn't know anything about, for example, NEON.

My concern is that there is a lot of code complexity here to support a lot of processors in a very few cases. I would like to propose that we simplify our approach for SIMD instructions by setting a "minimum supported instruction set" for x86 and amd64. This would allow us to delete a lot of our runtime detection hacks, no longer require separate files for SIMD instructions, and potentially get more optimizations for free from the compiler. If we narrow ourselves to supporting a certain minimum, we could offload some of our math to DirectXMath, which will automatically use SIMD intrinsics for whatever /arch we pass to cl.exe, or we could have bespoke optimizations in line, guarded by the preprocessor. Further, the MSVC STL has been doing a lot of work lately to vectorize algorithms that we could potentially benefit from by turning up our minimum requirement.

I really see this as a drop-down in CMake with the options presented below as the default.

Note that the suggestions below are for Windows. I'll need some feedback for what Intel macs should be doing, for sure. I suspect we can be a little more stringent there, but I'm not sure what benefit we'd get since pretty much none of our code explicitly uses SIMD instructions on macOS.

x86 (32-bit): SSE2

Since a 32-bit client is what Cyan has always distributed for MOULa, I think looking at Cyan's system requirements is a good guideline. They are:

Intel CPU: Pentium IV (Prescott) or better OR AMD CPU: Athlon 64 or better processor

The CPUs that Cyan mention support SSE3. SSE3 is the instruction set that our skinning support in the DX9 pipeline requires. OTOH, despite this requirement being fairly well documented, there were some players who had trouble when we began to unconditionally require SSE3 on the Gehn Shard in 2012, causing us to implement the current runtime detection code. It seems that some CPUs released after Cyan's recommended CPUs may not support SSE3. This makes me somewhat wary of setting a floor of SSE3.

If we go this route, we'd need to remove our SSE3 code, replace it with DirectXMath, or refactor our uses of horizontal adds to shuffles.

As of at least Visual Studio 2015, if no /arch compiler flag is passed, cl.exe assumes the target architecture is SSE2. That means that setting a floor of SSE2 would result in effectively no change from our current behavior. That would probably be for the best considering the 32-bit client is something like "old faithful" - we need to be careful about breaking it.

AMD64 (64-bit): AVX

Currently, 64-bit clients exist and work, but distributing them is a bit of an unsolved problem. I realize some shards might be distributing them already, and some people might be using them via GetUru, but these are more enthusiast level fans, so we get some leeway to make a decision, IMO. Further, if the 64-bit client instruction set is too high, a Windows user can just use the 32-bit client. That would basically make (by default) the 64-bit client our high performance client.

All AMD64 CPUs support SSE2. But, since I'm proposing SSE2 as a minimum for the 32-bit client, we already have an SSE2 binary, suggesting we can use a more recent instruction set here. AVX512 is poorly supported on consumer Intel CPUs, so that is not an option, which leaving us with AVX or AVX2. AVX2 was added to Intel CPUs in 2013 with Haswell and AMD CPUs with Excavator in 2015.. AVX became available with Sandy Bridge in 2011 and Bulldozer in 2011.

I feel slightly more comfortable with only requiring AVX here. I don't know that AVX2 adds any new instructions on top of AVX that we would find ground-breaking. Of course, if we go this route, we'd want to port our code to DirectXMath or write AVX implementations of the code we have in SSE3.

dgelessus · 2025-09-17T20:51:18Z

dgelessus
Sep 17, 2025

Intel CPU: Pentium IV (Prescott) or better OR AMD CPU: Athlon 64 or better processor

The CPUs that Cyan mention support SSE3.

Strictly speaking, not all revisions of those processors have SSE3, only the ones starting from around 2004/2005 (as I understand it). There's also the Pentium M from around that time, which doesn't have SSE3 either and (according to Wikipedia) continued to be sold until 2009. This might explain why some Gehn players in 2012 were still using processors without SSE3.

That said, nowadays there should be much fewer people left who don't have SSE3. It's not completely implausible though - even Windows 10 only requires SSE2 (for the 32-bit version).

SSE2 on the other hand should be a safe requirement. AFAIK, later Windows 7 updates started requiring SSE2, and Windows 8 has always required it, so there should be basically no machines left that can run the current H'uru client, but don't have SSE2. From what I've seen, SSE2 is also the established de facto minimum requirement for modern x86 software. For example, Firefox has required SSE2 since 2016.

Further, if the 64-bit client instruction set is too high, a Windows user can just use the 32-bit client. That would basically make (by default) the 64-bit client our high performance client.

That seems like a good strategy.

Requiring only AVX gets a vote from me, because I have one or two laptops from the early 2010s that have AVX, but not AVX2. Though perhaps it should be decided based on the practical performance gains - if for some reason AVX2 gets us noticeably better FPS, we can require that for the x86_64 client and let older machines use the x86 client, as discussed.

Regarding macOS: as I understand it, the compiler will automatically choose the appropriate instruction sets based on the declared minimum macOS version. We target macOS 10.14, which still supported "Mid 2012" Macs with Sandy Bridge Intel Core processors, and apparently even 2010 Mac Pros, which have even older Xeons. So the default required instruction set would be at most AVX, possibly only SSE4.2.

Interestingly, macOS universal binaries also support an architecture variant called "x86_64h", which stands for "x86_64 Haswell". Apparently, you can make a universal binary containing both x86_64 and x86_64h "architectures", to provide better optimized code for Haswell and newer processors while still supporting older processors. Of course, like any universal build, this will make the binary noticeably larger.

1 reply

dpogue Sep 17, 2025
Maintainer

On macOS we link to the Accelerate.framework to do most of our matrix optimizations, which is a system framework that should be built to take advantage of whatever it can. I'm sorta hoping we just don't have to deal with this there.

dpogue · 2025-09-17T20:55:26Z

dpogue
Sep 17, 2025
Maintainer

We don't currently have any optimized instruction sets for ARM machines, but NEON seems to be a requirement for both 32-bit and 64-bit Windows on ARM, and that's for sure supported on all Apple Silicon versions, so it seems like a reasonable default to use when we do get around to implementing it (runtime detection for NEON support is a nightmare so we should avoid that).

0 replies

Deledrius · 2025-09-19T03:35:46Z

Deledrius
Sep 19, 2025
Maintainer

Reducing the maintenance and making it easier for future optimizations sounds like a good plan. It would be interesting to see benchmark results on this, but I'm not expecting a lot of gains from this specifically given the current code. I also like retaining the 32-bit client as the lowest-common-denominator client.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Setting Instruction Set Floor #1795

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Setting Instruction Set Floor #1795

Uh oh!

Hoikas Sep 17, 2025 Maintainer

x86 (32-bit): SSE2

AMD64 (64-bit): AVX

Replies: 3 comments · 1 reply

Uh oh!

dgelessus Sep 17, 2025

Uh oh!

dpogue Sep 17, 2025 Maintainer

Uh oh!

dpogue Sep 17, 2025 Maintainer

Uh oh!

Deledrius Sep 19, 2025 Maintainer

Hoikas
Sep 17, 2025
Maintainer

Replies: 3 comments 1 reply

dgelessus
Sep 17, 2025

dpogue Sep 17, 2025
Maintainer

dpogue
Sep 17, 2025
Maintainer

Deledrius
Sep 19, 2025
Maintainer