Problem:
I found that after upgrading my local aws-lc build on my development machine, regression tests started to randomly time out on haproxy. Looking closer with perf record, I found a lot of "jent*" functions clogging the CPU, the heaviest one being "rol64", making me think about what is normally a single x86 instruction.
I measured the time it takes to start haproxy 100 times when built with aws-lc 1.39 and got this, indicating roughly 400 microseconds of user CPU per startup:
$ time for i in {1..100}; do ./haproxy-awslc-1.39 -v >/dev/null ; done
real 0m0.425s
user 0m0.039s
sys 0m0.074s
With 1.71 it was almost multiplied by 100, at 36 milliseconds per process:
$ time for i in {1..100}; do ./haproxy-awslc-1.71 -v >/dev/null ; done
real 0m4.063s
user 0m3.586s
sys 0m0.154s
The difference is clearly visible in perf report:
Overhead Command Shared Object Symbol
23.07% haproxy-awslc-1 haproxy-awslc-1.71 [.] rol64
14.84% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_keccakp_chi
12.51% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_keccakp_theta
8.57% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_keccakp_rho
8.44% haproxy-awslc-1 haproxy-awslc-1.71 [.] xoshiro128starstar
6.35% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_keccakp_pi
5.69% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_memaccess
2.16% haproxy-awslc-1 haproxy-awslc-1.71 [.] uint32rotl
2.08% haproxy-awslc-1 haproxy-awslc-1.71 [.] ptr_to_le32
1.53% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_sha3_update
1.31% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_keccakp_1600
1.14% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_sha3_init
1.05% haproxy-awslc-1 libc-2.33.so [.] __memset_avx2_unaligned_erms
0.79% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_keccakp_iota
0.67% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_get_nstime
0.60% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_sha3_fill_state
0.54% haproxy-awslc-1 haproxy-awslc-1.71 [.] le32_to_ptr
0.53% haproxy-awslc-1 haproxy-awslc-1.71 [.] jent_hash_time
0.47% haproxy-awslc-1 haproxy-awslc-1.71 [.] ptr_to_le64
There's a massive problem here. Grepping for "rol64" drove me to the jitterentropy third party lib, in sha3 code. All of these are supposed to be single-instructions on a CPU, but being forced at -O0, they result in an immense amount of garbage (huge and slow functions) that take 100 to 1000 times the equivalent number of cycles to execute, to the point of slowing down operations by that much.
Solution:
I don't know at this stage. I'm still investigating. I've found that the lib contains a test for optimizations and asks to run without:
third_party/jitterentropy/jitterentropy-library/src/jitterentropy-base.c: #error "The CPU Jitter random number generator must not be compiled with optimizations. See documentation. Use the compiler switch -O0 for compiling jitterentropy.c."
I just do not understand why it is done this way, the doc say "see documentation" but nothing explains this in the doc (and there's even a link that returns a 404). Something that relies on -O0 is seriously broken given that it will solely depend on how dumb the compiler itself tries to be and we have no guarantees on this.
I have not found yet what the call path for these functions is, nor if we can avoid it under certain conditions. What if I pass -Dinline="__inline __attribute__((always_inline))" to at least save the basic asm-level operations from being turned into functions ?
And even if this thing was really needed for obscure reasons, can't we make it run for way less than 40 ms on modern systems ? 40ms is an eternity at a computer scale, some operating systems boot in less than that. Not even thinking about smaller machines like armv7 where it may take ages.
Requirements / Acceptance Criteria:
At least get back to the normal CPU usage of previous versions (not yet bisected, and I don't think it will be useful as someone probably had a good reason at some point to think it would be nice to rely on this) or have a way to easily opt out for this CPU-hungry mechanism which already impacts stability (I haven't found how for now).
Thanks!
Problem:
I found that after upgrading my local aws-lc build on my development machine, regression tests started to randomly time out on haproxy. Looking closer with perf record, I found a lot of "jent*" functions clogging the CPU, the heaviest one being "rol64", making me think about what is normally a single x86 instruction.
I measured the time it takes to start haproxy 100 times when built with aws-lc 1.39 and got this, indicating roughly 400 microseconds of user CPU per startup:
With 1.71 it was almost multiplied by 100, at 36 milliseconds per process:
The difference is clearly visible in perf report:
There's a massive problem here. Grepping for "rol64" drove me to the jitterentropy third party lib, in sha3 code. All of these are supposed to be single-instructions on a CPU, but being forced at -O0, they result in an immense amount of garbage (huge and slow functions) that take 100 to 1000 times the equivalent number of cycles to execute, to the point of slowing down operations by that much.
Solution:
I don't know at this stage. I'm still investigating. I've found that the lib contains a test for optimizations and asks to run without:
I just do not understand why it is done this way, the doc say "see documentation" but nothing explains this in the doc (and there's even a link that returns a 404). Something that relies on -O0 is seriously broken given that it will solely depend on how dumb the compiler itself tries to be and we have no guarantees on this.
I have not found yet what the call path for these functions is, nor if we can avoid it under certain conditions. What if I pass
-Dinline="__inline __attribute__((always_inline))"to at least save the basic asm-level operations from being turned into functions ?And even if this thing was really needed for obscure reasons, can't we make it run for way less than 40 ms on modern systems ? 40ms is an eternity at a computer scale, some operating systems boot in less than that. Not even thinking about smaller machines like armv7 where it may take ages.
Requirements / Acceptance Criteria:
At least get back to the normal CPU usage of previous versions (not yet bisected, and I don't think it will be useful as someone probably had a good reason at some point to think it would be nice to rely on this) or have a way to easily opt out for this CPU-hungry mechanism which already impacts stability (I haven't found how for now).
Thanks!