Skip to content

--train-cover performance on Windows experiences a critical bottleneck at Constructing partial suffix array #4185

Open
@akrieger

Description

Describe the bug
For whatever reason, across multiple different compilers (msvc, clang, gcc) and C runtime libraries (ucrt, cygwin, msvcrt), the qsort step of --train-cover seems to take forever. Profiling of an msvc-built zstd shows that the qsort function body is not inline-able. This results in excessive numbers of forced function calls from qsort to zstd and then back to non-inlined memcmp. That said, I could be misinterpreting the data and need to look at the actual asm to be sure. Maybe just pessimal pivot selection and worst case performance?

Training on a dataset of 100MB takes almost 45 minutes even with full parallelism. A naive reimplementation in C++ which calls std::stable_sort(std::execution::par_unseq, ...) completes the sort step in seconds.

To Reproduce
Steps to reproduce the behavior:

  1. Get a 100+mb training dataset.
  2. Run zstd.exe --train-cover=steps=512,d=10 -T0 -r data --maxdict=100KB -o data.dict --memory-limit 100MB -v
  3. Observe functional hang at Constructing partial suffix array.
  4. Compare against same training set on linux.

Expected behavior

  • --train-cover can be used on larger datasets without issue on all operating systems.

Screenshots and charts
qsort implementation performance (training did not finish, this is a 3 minute 21 second snapshot of the Constructing partial suffix array step:
image

c++ implementation with par_unseq (completes in scant seconds with less than half the total cpu time. RtlUserThreadStart is the rollup of the threads created to do the parallel sort):
image

c++ singlethreaded (~30 seconds wall time):
image

Desktop (please complete the following information):

  • OS: Windows
  • Version: 11
  • Compiler: msvc VS2022 (also reproduces with zstd prebuilts in msys2 environments built with clang, gcc against cygwin, msvc, and ucrt c runtimes https://www.msys2.org/docs/environments/)
  • Flags: "Release" visual studio configuration
  • Other relevant hardware specs: 32/64 threadripper
  • Build system: visual studio

Additional context
I don't think I botched the C++ implementation, but I may have broken something with all the void* casts happening. The biggest difference is I had to change the 'tiebreaker' because the pointer value is not passed to the comparator. https://gist.github.com/akrieger/c023dad6ffe5eac7d44a3a34a3dc7721

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions