--train-cover performance on Windows experiences a critical bottleneck at Constructing partial suffix array #4185
Description
Describe the bug
For whatever reason, across multiple compilers (msvc, clang, gcc) and C runtime libraries (ucrt, cygwin, msvcrt), the qsort step of --train-cover seems to take forever. Profiling an msvc-built zstd shows that the qsort function body cannot be inlined, so every comparison is a forced function call from qsort into zstd's comparator and then on to a non-inlined memcmp. That said, I could be misinterpreting the data and need to look at the actual asm to be sure. It might instead just be pessimal pivot selection leading to worst-case performance.
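To make the call pattern concrete, here is a simplified sketch of sorting d-byte substrings with qsort. It is not zstd's actual source: the names g_base, g_d, compareDmers and qsortDmers are hypothetical, and the tiebreaker is left out. The point is only that qsort reaches the comparator through a function pointer, and the comparator in turn calls memcmp, so each comparison costs at least two calls unless the toolchain can see through both.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// qsort has no user-data parameter, so the comparator has to reach the
// sample buffer through file-scope state (hypothetical names).
static const unsigned char* g_base = nullptr;
static std::size_t g_d = 0;

// Called by qsort through a function pointer for every comparison.
static int compareDmers(const void* lp, const void* rp) {
    const std::uint32_t lhs = *static_cast<const std::uint32_t*>(lp);
    const std::uint32_t rhs = *static_cast<const std::uint32_t*>(rp);
    // Second call per comparison: compare d bytes at the two offsets.
    return std::memcmp(g_base + lhs, g_base + rhs, g_d);
}

// Sorts `offsets` in place by the d bytes found at each offset in `samples`.
// Every offset must satisfy offset + d <= samples.size().
void qsortDmers(const std::vector<unsigned char>& samples, std::size_t d,
                std::vector<std::uint32_t>& offsets) {
    assert(d > 0);
    g_base = samples.data();
    g_d = d;
    std::qsort(offsets.data(), offsets.size(), sizeof(std::uint32_t),
               compareDmers);
}
```

With no tiebreaker, the relative order of equal substrings is whatever the C runtime's qsort happens to produce.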
Training on a 100MB dataset takes almost 45 minutes even with full parallelism. A naive reimplementation in C++ that calls std::stable_sort(std::execution::par_unseq, ...) completes the sort step in seconds.
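For reference, a minimal sketch of the kind of reimplementation described here (not the code from the gist; sortDmers is a hypothetical helper). It builds the list of start offsets for every d-byte substring and sorts them by content. The sketch uses the sequential overload so it builds anywhere. The parallel variant would pass std::execution::par_unseq as the first argument, which needs <execution> and, on some toolchains, a parallel backend such as TBB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the start offsets of all d-byte substrings
// of `samples`, ordered by their first d bytes.
std::vector<std::uint32_t> sortDmers(const std::vector<unsigned char>& samples,
                                     std::size_t d) {
    assert(d > 0);
    if (samples.size() < d) return {};

    // One entry per position that still has d bytes available.
    std::vector<std::uint32_t> suffix(samples.size() - d + 1);
    std::iota(suffix.begin(), suffix.end(), 0u);

    const unsigned char* base = samples.data();
    // The lambda is visible to the compiler at the call site, so it can be
    // inlined into the sort. stable_sort keeps equal substrings in ascending
    // offset order because that is how `suffix` starts out.
    std::stable_sort(suffix.begin(), suffix.end(),
                     [base, d](std::uint32_t lhs, std::uint32_t rhs) {
                         return std::memcmp(base + lhs, base + rhs, d) < 0;
                     });
    return suffix;
}
```

For the input "banana" with d=2, the substrings are ba, an, na, an, na at offsets 0 through 4, so the sorted offsets come out as 1, 3, 0, 2, 4.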
To Reproduce
Steps to reproduce the behavior:
- Get a 100+ MB training dataset.
- Run zstd.exe --train-cover=steps=512,d=10 -T0 -r data --maxdict=100KB -o data.dict --memory-limit 100MB -v
- Observe functional hang at Constructing partial suffix array.
- Compare against the same training set on Linux.
Expected behavior
- --train-cover can be used on larger datasets without issue on all operating systems.
Screenshots and charts
qsort implementation performance (training did not finish; this is a 3 minute 21 second snapshot of the Constructing partial suffix array step):
C++ implementation with par_unseq (completes in scant seconds with less than half the total CPU time; RtlUserThreadStart is the rollup of the threads created to do the parallel sort):
C++ single-threaded (~30 seconds wall time):
Desktop (please complete the following information):
- OS: Windows
- Version: 11
- Compiler: msvc VS2022 (also reproduces with the zstd prebuilts in msys2 environments, built with clang and gcc against the cygwin, msvcrt, and ucrt C runtimes: https://www.msys2.org/docs/environments/)
- Flags: "Release" visual studio configuration
- Other relevant hardware specs: 32/64 threadripper
- Build system: visual studio
Additional context
I don't think I botched the C++ implementation, but I may have broken something with all the void* casts happening. The biggest difference is I had to change the 'tiebreaker' because the pointer value is not passed to the comparator. https://gist.github.com/akrieger/c023dad6ffe5eac7d44a3a34a3dc7721
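To illustrate the tiebreaker change, here is a hedged sketch of one way to do it (the gist may differ; DmerLess and sortWithIndexTiebreak are hypothetical names). A C++ comparator receives the element values rather than their addresses, so a tie on the d compared bytes is broken on the offset values themselves. That makes the ordering a strict total order, so the result is the same whichever sort algorithm is used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical comparator: orders offsets by the d bytes they point at,
// and breaks ties on the offset value instead of the element address.
struct DmerLess {
    const unsigned char* base;
    std::size_t d;

    bool operator()(std::uint32_t lhs, std::uint32_t rhs) const {
        const int c = std::memcmp(base + lhs, base + rhs, d);
        if (c != 0) return c < 0;
        return lhs < rhs;  // tie: lower offset sorts first
    }
};

// Returns a sorted copy of `offsets`. Every offset must leave at least
// d bytes available in `samples`.
std::vector<std::uint32_t> sortWithIndexTiebreak(
        const std::vector<unsigned char>& samples, std::size_t d,
        std::vector<std::uint32_t> offsets) {
    for (std::uint32_t off : offsets) {
        assert(off + d <= samples.size());
    }
    // No two distinct offsets compare equal, so plain std::sort is
    // deterministic here and stable_sort is not required.
    std::sort(offsets.begin(), offsets.end(), DmerLess{samples.data(), d});
    return offsets;
}
```

With the input "ababab", d=2 and offsets 4, 2, 0, 1, 3, the three "ab" substrings sort ahead of the two "ba" substrings and each group is ordered by offset: 0, 2, 4, 1, 3.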