This is a bit of a follow-up to another performance thread I posted at #561.
I've been looking a little deeper into the compiler's ability to unroll loops and autovectorize Pallene generated C code.
For context, the era of SIMD (single instruction, multiple data) parallelism is well upon us, and it is one of the few ways to harvest outsized gains on CPUs that are no longer ramping up clock speeds. This trend only seems to be continuing, so we should be constantly on the lookout for ways to take advantage of the hardware when there are opportunities (especially when the C compiler/optimizer is willing to do most of the hard work for us if we just do the "right" things).
TLDR: I believe the checks generated inside Pallene for-loops prevent the compiler/optimizer from generating much better code, sacrificing non-trivial performance gains we could otherwise get for free. I'm hoping we can find a way to change this so that the compiler can apply its much better optimizations.
So I experimented with a very simple for-loop that iterates through an array of doubles and adds up all the values in the array.
I wrote two reference programs in C, and a program in Pallene.
- The first reference program is the most basic, ideal C program that iterates through an array of doubles.
- The second reference program is the same as above, except that instead of an array of doubles, it is an array of structs:
struct ContainerElement
{
    double number;
    int type;
};
This is my simplified interpretation of what the internal Lua array looks like. (The real Lua uses a union as the first field, so this is simplified to reduce the union down to a double.)
So I believe this case should represent the theoretical max of what we hope the emitted Pallene code can achieve.
- The third is a Pallene program which I experiment with by either changing the compiler optimization flags, or by directly hand-editing the generated C code to try to get better performance.
In all the results, the "loop time" is the important part. Startup time doesn't count.
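The attached files aren't reproduced here, but the hot loops in the two reference programs presumably look roughly like this (function names are illustrative, not the exact attached code):

```c
#include <stddef.h>

struct ContainerElement
{
    double number; /* stand-in for Lua's TValue union, reduced to a double */
    int type;      /* stand-in for the type tag */
};

/* Base program: plain array of doubles -- trivially vectorizable. */
static double sum_base(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Container program: array of structs. The stride is wider, but the
 * access pattern is still regular, so the autovectorizer can still
 * handle it -- which matches the near-identical loop times below. */
static double sum_container(const struct ContainerElement *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[i].number;
    return total;
}
```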
- Base C program:
$ gcc -Wall -O3 -ftree-vectorize -funroll-loops base.c
$ time ./a.out
startup time (doesn't count) 1.937357
result: 576460751296790656.000000
c loop time 0.824896
real 0m3.035s
user 0m1.311s
sys 0m1.723s
- Container C program:
$ gcc -O3 -ftree-vectorize -funroll-loops container.c
$ time ./a.out
startup time (doesn't count) 3.736625
result: 576460751296790656.000000
c loop time 0.843595
real 0m5.075s
user 0m1.905s
sys 0m3.170s
- Pallene:
3a. pallenec, no extra optimizations
$ pallenec basic_pln.pln
$ time lua basic_lua.lua
startup time (doesn't count) 17.048125
result: 5.7646075237053e+17
pln loop time 1.068943
real 0m19.432s
user 0m15.983s
sys 0m3.443s
3b. pallenec with CFLAGS:
$ export CFLAGS='-O3 -ftree-vectorize -funroll-loops'
$ pallenec basic_pln.pln
$ time lua basic_lua.lua
startup time (doesn't count) 17.031442
result: 5.7646075237053e+17
pln loop time 1.063149
real 0m19.425s
user 0m15.915s
sys 0m3.506s
3c. hand modified --emit-c file removing both checks inside for-loop:
// pallene_renormalize_array(L, x1, x4, PALLENE_SOURCE_FILE, 9);
/*
if (l_unlikely(!ttisfloat(slot))) {
pallene_runtime_tag_check_error(L,
"basic_pln.pln", 9, LUA_VNUMFLT, rawtt(slot),
"array element");
}
*/
$ pallenec --emit-c basic_pln.pln
$ cp basic_pln.c basic_pln_handmodify.c
$ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.113667
result: 5.7646075237053e+17
pln loop time 0.84382
real 0m19.298s
user 0m15.777s
sys 0m3.520s
3d. hand modified --emit-c, removing only first check:
// pallene_renormalize_array(L, x1, x4, PALLENE_SOURCE_FILE, 9);
$ pallenec --emit-c basic_pln.pln
$ cp basic_pln.c basic_pln_handmodify.c
$ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.080507
result: 5.7646075237053e+17
pln loop time 0.928175
real 0m19.333s
user 0m15.920s
sys 0m3.412s
3e. hand modified --emit-c, removing only second check:
/*
if (l_unlikely(!ttisfloat(slot))) {
pallene_runtime_tag_check_error(L,
"basic_pln.pln", 9, LUA_VNUMFLT, rawtt(slot),
"array element");
}
*/
$ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.226762
result: 5.7646075237053e+17
pln loop time 1.066411
real 0m19.626s
user 0m16.137s
sys 0m3.488s
To sum up: after I remove both checks inside the for-loop in the Pallene-generated code, we get performance that matches the pure C counterpart. With the checks left in, the optimizer is not able to unroll and vectorize.
So we are looking at a ~23% slowdown caused by the two checks inside the loop.
Additional note: I didn't benchmark with more aggressive processor optimization flags such as -mavx2 or -mavx512f, but I did look at the generated assembly, and you can definitely see an impactful change in how it uses the SIMD registers. I think gcc on my machine defaults to something equivalent to -msse2, which makes sense because every x86-64 CPU supports SSE2.
Some possible ideas to improve this:
-
Add a compiler optimization flag (e.g. "--unchecked-loops") that removes these checks. This would be opt-in, for users who know what they are doing.
-
Is there a way to generate these checks that doesn't cause the compiler to give up on these optimizations?
a. Assuming there isn't a different way to express the exact same behavior without making the optimizer give up, one idea is to just keep track of a "found_an_error" boolean through the loop. After the loop ends, we check the flag and finally trigger the error outside the loop. The error comes a little late, but maybe this can be part of an optional compiler flag.
b. Additionally, with this deferred-error idea, maybe we only apply it to certain types that are less likely to cause dangerous errors when deferred until the end of the loop. For example, floats and integers might be able to defer the error, while for other types we always keep the checks. The idea is that the program will eventually trigger an error anyway, so hopefully garbage numbers won't be able to do real damage, especially behind an opt-in compiler flag.
-
Is there a way we can hoist these checks outside the loop?
a. Is there something clever we can do before or after the loop to check?
b. Maybe we can iterate through the loop twice: the first pass does just the checks, and the second pass runs the code without the checks. We'll need to benchmark, but the hope/presumption is that the performance gains won from the optimizer will vastly outweigh the cost of looping through the array twice.
-
This will probably require cooperation with Lua, so down the line, if Lua decides Pallene will be an official feature moving forward, maybe an internal flag can be set somewhere that tracks whether an array is homogeneous or not. If the array is known to be homogeneous, then checks can be skipped. If Lua adds or changes a value, which results in a heterogeneous array, it can set a flag that Pallene can read. This can allow Pallene to choose between two different code paths, with-checks and without-checks. Presumably, once Lua determines the array is no longer homogeneous, it can't unset the flag unless the array is emptied (or down to 1 element).
-
Have the Pallene compiler/optimizer unroll loops itself. Rather than relying completely on the underlying C compiler, Pallene could attempt to unroll some loops itself and generate output that makes it easier for the C autovectorizer to find operations it can combine into SIMD registers.
Most modern CPUs (both desktop and mobile) support at least 128-bit wide SIMD instructions, so assuming Pallene/Lua is using 64-bit ints and floats, I think an unroll factor of 2 might be sufficient. Some flexibility to unroll further (e.g. 4, 8, or 16), controlled by compiler flag options, would be nice to have for more powerful architectures, for Lua compiled with 32-bit number sizes, or just for performance experimentation. Even if the C compiler isn't able to fully vectorize, sometimes the loop unrolling by itself gives a big enough performance boost to be worth it. And since Lua is also used on embedded hardware that may lack SIMD and benefit more from small code size, compiler flags to turn unrolling on/off are desirable.
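The deferred-error idea (2a above) could look something like this sketch, using simplified stand-ins for Lua's tagged values (the `Slot`, `TAG_FLOAT`, and `sum_deferred` names are mine, not Pallene's):

```c
#include <stddef.h>
#include <stdbool.h>

/* Simplified stand-ins for Lua's tagged values -- illustrative only. */
enum { TAG_FLOAT, TAG_OTHER };
struct Slot { double number; int tag; };

/* Deferred-check sum: instead of branching to an error handler inside
 * the loop, accumulate a boolean across iterations and report the error
 * once, after the loop. The flag update is straight-line code with no
 * early exit, which is friendlier to unrolling and autovectorization. */
static bool sum_deferred(const struct Slot *a, size_t n, double *out)
{
    double total = 0.0;
    bool found_an_error = false;
    for (size_t i = 0; i < n; i++) {
        found_an_error |= (a[i].tag != TAG_FLOAT);
        total += a[i].number;
    }
    if (found_an_error)
        return false; /* real generated code would call
                         pallene_runtime_tag_check_error here */
    *out = total;
    return true;
}
```

The trade-off this illustrates is the one described above: a bad tag no longer aborts at the exact offending iteration, so the error (and any garbage partial sum) only surfaces after the loop finishes.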
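And a minimal sketch of what a manual 2x unroll (the last idea above) might emit, again with illustrative names:

```c
#include <stddef.h>

/* Manual 2x unroll: two doubles fit one 128-bit SIMD register, and two
 * independent accumulators also break the floating-point add dependency
 * chain between iterations. A scalar remainder loop handles odd n. */
static double sum_unrolled2(const double *a, size_t n)
{
    double t0 = 0.0, t1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        t0 += a[i];
        t1 += a[i + 1];
    }
    for (; i < n; i++) /* remainder (at most one iteration here) */
        t0 += a[i];
    return t0 + t1;
}
```

One caveat worth noting when benchmarking: the two accumulators change the order of the floating-point additions, so results can differ in the last bits from the straight-line loop; that reordering is exactly what strict C semantics normally forbid the compiler from doing on its own without flags like -ffast-math.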
I'm attaching my example files, plus the assembly output of the Pallene code, with and without the checks removed.
You can use the -S flag to generate the assembly output.
https://godbolt.org is also great, and I recommend pasting in the C programs there. (The Pallene program is harder to use with Compiler Explorer because of the extra headers needed for lua.h.)