Description
The library constructor ucs_init is problematic because it adds a fixed startup delay to every executable that transitively links to libucs.
I suggest initializing lazily on first use instead, so that the cost is not paid in the library constructor.
This is observable with version 1.19.0 (Ubuntu 25.10 questing), and to my understanding the problem gets even worse with #11112.
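To illustrate what I mean by lazy initialization, here is a minimal sketch using pthread_once. All names in it (lib_do_expensive_init, lib_ensure_init, lib_api_call, lib_init_count) are hypothetical and not part of UCX, and it assumes that the work currently done in the constructor can safely be deferred, which I have not verified.

```c
#include <pthread.h>

/* Hypothetical names for illustration only; none of these exist in UCX. */
static pthread_once_t lib_init_once = PTHREAD_ONCE_INIT;
static int lib_init_count = 0; /* counts how often the expensive init ran */

/* The work that currently runs in the library constructor would move here
 * (assumption: it is safe to run it later than load time). */
static void lib_do_expensive_init(void)
{
    ++lib_init_count;
}

/* Called at the top of every public entry point; runs the init exactly once,
 * even when several threads race to make the first call. */
static void lib_ensure_init(void)
{
    pthread_once(&lib_init_once, lib_do_expensive_init);
}

/* Example public entry point: pays the init cost only on first use. */
int lib_api_call(int x)
{
    lib_ensure_init();
    return x + 1;
}
```

With this pattern, an executable that links the library but never calls into it never runs the expensive part. Older toolchains may need -pthread when linking.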
Motivation:
I have a test executable that transitively depends on libucs.so via OpenMPI without using any related functionality.
We run our 1k (and growing) tests in isolation using ctest, which calls the executable at least once per test because some tests are run using mpirun.
I measure an overhead of 42 ms per invocation, so with 1k tests we spend at least 42 seconds in ucs_init for each of our 10+ test configurations.
MRE
I measure the invocation overhead by linking the library to a do-nothing main as follows:
$ echo 'int main() {}' > nothing.c
$ gcc nothing.c -c -o nothing.o
$ gcc -Wl,--no-as-needed -lucs nothing.o -o nothing
$ ldd nothing
linux-vdso.so.1 (0x00007cc9738de000)
libucs.so.0 => /lib/x86_64-linux-gnu/libucs.so.0 (0x00007cc973831000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007cc973400000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007cc973724000)
libucm.so.0 => /lib/x86_64-linux-gnu/libucm.so.0 (0x00007cc973705000)
/lib64/ld-linux-x86-64.so.2 (0x00007cc9738e0000)
$ hyperfine -w 1 ./nothing
Benchmark 1: ./nothing
Time (mean ± σ): 42.2 ms ± 0.2 ms [User: 20.4 ms, System: 21.8 ms]
Range (min … max): 41.9 ms … 42.9 ms 69 runs
Sanity check with perf record -Fmax -g ./nothing:
The perf records suggest that the time is spent in:
Lines 86 to 89 in 9117b1b
deadline = ucm_get_time() + ucm_global_opts.bistro_grace_duration;
while (ucm_get_time() < deadline) {
    sched_yield();
}
However, the grace duration behind this deadline is 5 ms in 1.19, which does not match the measurement: sched_yield() covers 79% of the samples in the perf trace, i.e. roughly 33 ms of the measured 42 ms.
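To check whether a single pass through such a loop can overshoot its deadline by that much, the loop can be timed in isolation. This is a standalone sketch, not the UCX code: now_seconds and yield_for are hypothetical names, and clock_gettime(CLOCK_MONOTONIC) is only a stand-in for ucm_get_time(), whose implementation may differ.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <sched.h>
#include <stddef.h>
#include <time.h>

/* Monotonic time in seconds; a stand-in for ucm_get_time(). */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Yield the CPU until `grace` seconds have passed, mirroring the shape of the
 * loop quoted above. Returns the time actually spent in seconds and, if
 * `yields` is non-NULL, stores the number of sched_yield() calls made. */
double yield_for(double grace, unsigned long *yields)
{
    double start    = now_seconds();
    double deadline = start + grace;
    unsigned long n = 0;

    while (now_seconds() < deadline) {
        sched_yield();
        ++n;
    }
    if (yields != NULL) {
        *yields = n;
    }
    return now_seconds() - start;
}
```

Calling yield_for(0.005, &n) from a small main and printing the result shows how long one 5 ms grace period really takes on the machine. If it comes out close to 5 ms, the extra time must come from somewhere else, for example from the loop being entered more than once per startup, which is a guess on my part and not something I have confirmed in the source.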