Skip to content

Serious startup slowdown caused by unoptimized jitterentropy code #3180

@wtarreau

Description

@wtarreau

Problem:

I found that after upgrading my local aws-lc build on my development machine, regression tests started to randomly time out on haproxy. Looking closer with perf record, I found a lot of "jent*" functions clogging the CPU, the heaviest one being "rol64", making me think about what is normally a single x86 instruction.

I measured the time it takes to start haproxy 100 times when built with aws-lc 1.39 and got this, indicating roughly 400 microseconds of user CPU per startup:

$ time for i in {1..100}; do ./haproxy-awslc-1.39 -v >/dev/null ; done

real    0m0.425s
user    0m0.039s
sys     0m0.074s

With 1.71 it was almost multiplied by 100, at 36 milliseconds per process:

$ time for i in {1..100}; do ./haproxy-awslc-1.71 -v >/dev/null ; done

real    0m4.063s
user    0m3.586s
sys     0m0.154s

The difference is clearly visible in perf report:

Overhead  Command          Shared Object       Symbol                                                              
  23.07%  haproxy-awslc-1  haproxy-awslc-1.71  [.] rol64
  14.84%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_keccakp_chi
  12.51%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_keccakp_theta
   8.57%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_keccakp_rho
   8.44%  haproxy-awslc-1  haproxy-awslc-1.71  [.] xoshiro128starstar
   6.35%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_keccakp_pi
   5.69%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_memaccess
   2.16%  haproxy-awslc-1  haproxy-awslc-1.71  [.] uint32rotl
   2.08%  haproxy-awslc-1  haproxy-awslc-1.71  [.] ptr_to_le32
   1.53%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_sha3_update
   1.31%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_keccakp_1600
   1.14%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_sha3_init
   1.05%  haproxy-awslc-1  libc-2.33.so        [.] __memset_avx2_unaligned_erms
   0.79%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_keccakp_iota
   0.67%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_get_nstime
   0.60%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_sha3_fill_state
   0.54%  haproxy-awslc-1  haproxy-awslc-1.71  [.] le32_to_ptr
   0.53%  haproxy-awslc-1  haproxy-awslc-1.71  [.] jent_hash_time
   0.47%  haproxy-awslc-1  haproxy-awslc-1.71  [.] ptr_to_le64

There's a massive problem here. Grepping for "rol64" drove me to the jitterentropy third party lib, in sha3 code. All of these are supposed to be single-instructions on a CPU, but being forced at -O0, they result in an immense amount of garbage (huge and slow functions) that take 100 to 1000 times the equivalent number of cycles to execute, to the point of slowing down operations by that much.

Solution:

I don't know at this stage. I'm still investigating. I've found that the lib contains a test for optimizations and asks to run without:

third_party/jitterentropy/jitterentropy-library/src/jitterentropy-base.c: #error "The CPU Jitter random number generator must not be compiled with optimizations. See documentation. Use the compiler switch -O0 for compiling jitterentropy.c."

I just do not understand why it is done this way, the doc say "see documentation" but nothing explains this in the doc (and there's even a link that returns a 404). Something that relies on -O0 is seriously broken given that it will solely depend on how dumb the compiler itself tries to be and we have no guarantees on this.

I have not found yet what the call path for these functions is, nor if we can avoid it under certain conditions. What if I pass -Dinline="__inline __attribute__((always_inline))" to at least save the basic asm-level operations from being turned into functions ?

And even if this thing was really needed for obscure reasons, can't we make it run for way less than 40 ms on modern systems ? 40ms is an eternity at a computer scale, some operating systems boot in less than that. Not even thinking about smaller machines like armv7 where it may take ages.

Requirements / Acceptance Criteria:

At least get back to the normal CPU usage of previous versions (not yet bisected, and I don't think it will be useful as someone probably had a good reason at some point to think it would be nice to rely on this) or have a way to easily opt out for this CPU-hungry mechanism which already impacts stability (I haven't found how for now).

Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions