This is a bit of a follow-up to another performance thread I posted at #561.
I've been looking a little deeper into the compiler's ability to unroll loops and autovectorize Pallene generated C code.
For context, the era of SIMD (single instruction, multiple data) parallelism is well upon us, and it is one of the few ways to harvest outsized gains on CPUs that are no longer ramping up clock speeds. This trend only seems to be continuing, so we should be constantly on the lookout for ways to take advantage of the hardware when there are opportunities (especially when the C compiler/optimizer is willing to do most of the hard work for us if we just do the "right" things).
TLDR: I believe the checks generated inside Pallene for-loops prevent the compiler/optimizer from generating much better code, sacrificing non-trivial performance gains we could otherwise get for free. I'm hoping we can find a way to change this so that the compiler can apply its much better optimizations.
So I experimented with a very simple for-loop that iterates through an array of doubles and adds up all the values in the array.
I wrote two reference programs in C, and a program in Pallene.
- The first reference program is the most basic, ideal C program that iterates through an array of doubles.
- The second reference program is the same as above, except that instead of an array of doubles, it is an array of structs:
struct ContainerElement
{
    double number;
    int type;
};
This is my simplified interpretation of what the internal Lua array looks like. (The real Lua uses a union as the first field, so this is simplified to reduce the union down to a double.)
So I believe this case should represent the theoretical max of what we hope the emitted Pallene code can achieve.
- The third is a Pallene program which I experiment with by either changing the compiler optimization flags, or by directly hand-editing the generated C code to try to get better performance.
In all the results, the "loop time" is the important part. Startup time doesn't count.
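The attached files aren't reproduced here, but the hot loops in the two reference programs presumably look roughly like this (function names are illustrative, not the exact attached code):

```c
#include <stddef.h>

struct ContainerElement
{
    double number; /* stand-in for Lua's TValue union, reduced to a double */
    int type;      /* stand-in for the type tag */
};

/* Base program: plain array of doubles -- trivially vectorizable. */
static double sum_base(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Container program: array of structs. The stride is wider, but the
 * access pattern is still regular, so the autovectorizer can still
 * handle it -- which matches the near-identical loop times below. */
static double sum_container(const struct ContainerElement *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[i].number;
    return total;
}
```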
- Base C program:
$ gcc -Wall -O3 -ftree-vectorize -funroll-loops base.c
$ time ./a.out
startup time (doesn't count) 1.937357
result: 576460751296790656.000000
c loop time 0.824896
real 0m3.035s
user 0m1.311s
sys 0m1.723s
- Container C program:
$ gcc -O3 -ftree-vectorize -funroll-loops container.c
$ time ./a.out
startup time (doesn't count) 3.736625
result: 576460751296790656.000000
c loop time 0.843595
real 0m5.075s
user 0m1.905s
sys 0m3.170s
- Pallene:
3a. pallenec, no extra optimizations
$ pallenec basic_pln.pln
$ time lua basic_lua.lua
startup time (doesn't count) 17.048125
result: 5.7646075237053e+17
pln loop time 1.068943
real 0m19.432s
user 0m15.983s
sys 0m3.443s
3b. pallenec with CFLAGS:
$ export CFLAGS='-O3 -ftree-vectorize -funroll-loops'
$ pallenec basic_pln.pln
$ time lua basic_lua.lua
startup time (doesn't count) 17.031442
result: 5.7646075237053e+17
pln loop time 1.063149
real 0m19.425s
user 0m15.915s
sys 0m3.506s
3c. hand modified --emit-c file removing both checks inside for-loop:
// pallene_renormalize_array(L, x1, x4, PALLENE_SOURCE_FILE, 9);
/*
if (l_unlikely(!ttisfloat(slot))) {
pallene_runtime_tag_check_error(L,
"basic_pln.pln", 9, LUA_VNUMFLT, rawtt(slot),
"array element");
}
*/
$ pallenec --emit-c basic_pln.pln
$ cp basic_pln.c basic_pln_handmodify.c
$ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.113667
result: 5.7646075237053e+17
pln loop time 0.84382
real 0m19.298s
user 0m15.777s
sys 0m3.520s
3d. hand modified --emit-c, removing only first check:
// pallene_renormalize_array(L, x1, x4, PALLENE_SOURCE_FILE, 9);
$ pallenec --emit-c basic_pln.pln
$ cp basic_pln.c basic_pln_handmodify.c
$ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.080507
result: 5.7646075237053e+17
pln loop time 0.928175
real 0m19.333s
user 0m15.920s
sys 0m3.412s
3e. hand modified --emit-c, removing only second check:
/*
if (l_unlikely(!ttisfloat(slot))) {
pallene_runtime_tag_check_error(L,
"basic_pln.pln", 9, LUA_VNUMFLT, rawtt(slot),
"array element");
}
*/
$ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.226762
result: 5.7646075237053e+17
pln loop time 1.066411
real 0m19.626s
user 0m16.137s
sys 0m3.488s
To sum up: after I remove both checks inside the for-loop in the Pallene-generated code, we get performance that matches the pure C counterpart. With the checks left in, the optimizer is not able to unroll and vectorize.
So we are looking at a ~23% slowdown caused by the two checks inside the loop.
Additional note: I didn't benchmark with more aggressive processor optimization flags such as -mavx2 or -mavx512f, but I did look at the generated assembly, and you can definitely see an impactful change in how it uses the SIMD registers. I think gcc on my machine defaults to something equivalent to -msse2, which makes sense because every x86-64 CPU supports SSE2.
Some possible ideas to improve this:
-
Add a compiler optimization flag (e.g. "--unchecked-loops") that removes these checks. This would be opt-in, for users who know what they are doing.
-
Is there a way to generate these checks that doesn't cause the compiler to give up on these optimizations?
a. Assuming there isn't a different way to express the exact same behavior without making the optimizer give up, one idea is to just keep track of a "found_an_error" boolean through the loop. After the loop ends, we check the flag and finally trigger the error outside the loop. The error comes a little late, but maybe this can be part of an optional compiler flag.
b. Additionally, with this deferred-error idea, maybe we only apply it to certain types that are less likely to cause dangerous errors when deferred until the end of the loop. For example, floats and integers might be able to defer the error, while for other types we always keep the checks. The idea is that the program will eventually trigger an error anyway, so hopefully garbage numbers won't be able to do real damage, especially behind an opt-in compiler flag.
-
Is there a way we can hoist these checks outside the loop?
a. Is there something clever we can do before or after the loop to check?
b. Maybe we can iterate through the loop twice: the first pass does just the checks, and the second pass runs the code without the checks. We'll need to benchmark, but the hope/presumption is that the performance gains won from the optimizer will vastly outweigh the cost of looping through the array twice.
-
This will probably require cooperation with Lua, so down the line, if Lua decides Pallene will be an official feature moving forward, maybe an internal flag can be set somewhere that tracks whether an array is homogeneous or not. If the array is known to be homogeneous, then checks can be skipped. If Lua adds or changes a value, which results in a heterogeneous array, it can set a flag that Pallene can read. This can allow Pallene to choose between two different code paths, with-checks and without-checks. Presumably, once Lua determines the array is no longer homogeneous, it can't unset the flag unless the array is emptied (or down to 1 element).
-
Have the Pallene compiler/optimizer unroll loops itself. Rather than relying completely on the underlying C compiler, Pallene could attempt to unroll some loops itself and generate output that makes it easier for the C autovectorizer to find operations it can combine into SIMD registers.
Most modern CPUs (both desktop and mobile) support at least 128-bit wide SIMD instructions, so assuming Pallene/Lua is using 64-bit ints and floats, I think an unroll factor of 2 might be sufficient. Some flexibility to unroll further (e.g. 4, 8, or 16), controlled by compiler flag options, would be nice to have for more powerful architectures, for Lua compiled with 32-bit number sizes, or just for performance experimentation. Even if the C compiler isn't able to fully vectorize, sometimes the loop unrolling by itself gives a big enough performance boost to be worth it. And since Lua is also used on embedded hardware that may lack SIMD and benefit more from small code size, compiler flags to turn unrolling on/off are desirable.
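The deferred-error idea (2a above) could look something like this sketch, using simplified stand-ins for Lua's tagged values (the `Slot`, `TAG_FLOAT`, and `sum_deferred` names are mine, not Pallene's):

```c
#include <stddef.h>
#include <stdbool.h>

/* Simplified stand-ins for Lua's tagged values -- illustrative only. */
enum { TAG_FLOAT, TAG_OTHER };
struct Slot { double number; int tag; };

/* Deferred-check sum: instead of branching to an error handler inside
 * the loop, accumulate a boolean across iterations and report the error
 * once, after the loop. The flag update is straight-line code with no
 * early exit, which is friendlier to unrolling and autovectorization. */
static bool sum_deferred(const struct Slot *a, size_t n, double *out)
{
    double total = 0.0;
    bool found_an_error = false;
    for (size_t i = 0; i < n; i++) {
        found_an_error |= (a[i].tag != TAG_FLOAT);
        total += a[i].number;
    }
    if (found_an_error)
        return false; /* real generated code would call
                         pallene_runtime_tag_check_error here */
    *out = total;
    return true;
}
```

The trade-off this illustrates is the one described above: a bad tag no longer aborts at the exact offending iteration, so the error (and any garbage partial sum) only surfaces after the loop finishes.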
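And a minimal sketch of what a manual 2x unroll (the last idea above) might emit, again with illustrative names:

```c
#include <stddef.h>

/* Manual 2x unroll: two doubles fit one 128-bit SIMD register, and two
 * independent accumulators also break the floating-point add dependency
 * chain between iterations. A scalar remainder loop handles odd n. */
static double sum_unrolled2(const double *a, size_t n)
{
    double t0 = 0.0, t1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        t0 += a[i];
        t1 += a[i + 1];
    }
    for (; i < n; i++) /* remainder (at most one iteration here) */
        t0 += a[i];
    return t0 + t1;
}
```

One caveat worth noting when benchmarking: the two accumulators change the order of the floating-point additions, so results can differ in the last bits from the straight-line loop; that reordering is exactly what strict C semantics normally forbid the compiler from doing on its own without flags like -ffast-math.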
I'm attaching my example files, plus the assembly output of the Pallene code, with and without the checks removed.
You can use the -S flag to generate the assembly output.
https://godbolt.org is also great, and I recommend pasting in the C programs there. (The Pallene program is harder to use with Compiler Explorer because of the extra headers needed for lua.h.)