Description
Not sure whether this is an issue in OpenBLAS itself or in its users (numpy, pytorch).

I'm seeing slow Python imports of pytorch; literally `import torch` takes multiple seconds on my system.
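To see where the import time actually goes, CPython's `-X importtime` flag gives a per-module breakdown on stderr. A minimal sketch (using `json` as a stand-in target so the snippet doesn't assume torch is installed; substitute `import torch` for the real measurement):

```python
import subprocess
import sys

# Run a child interpreter with -X importtime; CPython writes one
# "import time: <self us> | <cumulative us> | <module>" line per
# imported module to stderr.
result = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import json"],
    capture_output=True,
    text=True,
)
for line in result.stderr.splitlines():
    if line.startswith("import time:"):
        print(line)
```

If the time is spent inside a shared library's constructor rather than in Python-level imports, the slow module's cumulative time will dominate without any obviously slow Python submodules, which is a hint to reach for `perf` next.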
When I record the Python interpreter with Linux `perf record`, `perf report` shows most cycles are spent in `blas_thread_server` via BOTH `liblapack.so.3` and `libcblas.so.3`, i.e.:
```
Overhead  Command  Shared Object   Symbol
  40.31%  python   liblapack.so.3  [.] blas_thread_server
  36.85%  python   libcblas.so.3   [.] blas_thread_server
```
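Before digging further, it may help to confirm that two distinct OpenBLAS-derived libraries really are mapped into the process, since each one would start its own `blas_thread_server` pool. A Linux-only sketch reading `/proc/self/maps` (the filename keywords are assumptions about how the libraries are named):

```python
def mapped_blas_libs():
    """Return the BLAS/LAPACK shared-object paths mapped into this process."""
    libs = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            fields = line.split()
            # A mapping line only has a pathname field when it is file-backed.
            path = fields[-1] if len(fields) > 5 else ""
            if any(key in path for key in ("blas", "lapack", "openblas")):
                libs.add(path)
    return sorted(libs)

# Run this after `import torch` in the affected interpreter; two separate
# OpenBLAS builds in the list would match the duplicated perf symbol.
print(mapped_blas_libs())
```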
If I annotate either symbol, both spend most of their time near reading the time-stamp counter:
```
 0.31 │3c:┌─→mov   (%r15),%rax
      │   │  cmp   $0x1,%rax
      │   │↓ ja    b0
      │   │  nop
      │   │  nop
      │   │  nop
      │   │  nop
      │   │  nop
      │   │  nop
 5.29 │   │  nop
      │   │  nop
      │   │  rdtsc
91.82 │   │  sub   %ecx,%eax
      │   │  cmp   %eax,thread_timeout
 2.59 │   └──jae   3c
```
I'm guessing that corresponds to the code around here.
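One way to test the hypothesis that the idle thread pool's spin-wait (the `rdtsc`/`thread_timeout` loop above) is the cost: cap the pool to a single thread before the BLAS libraries load and re-profile. A sketch using the commonly documented OpenBLAS/OpenMP environment knobs; these must be set before the first `import numpy` / `import torch`:

```python
import os

# With a single-thread pool there are no idle worker threads to spin on
# rdtsc, so blas_thread_server should vanish from the perf profile if
# the spin-wait is really what is burning cycles.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"  # covers OpenMP-built BLAS variants

# Only now pull in the heavy libraries (commented out here, since this
# snippet does not assume torch is installed):
# import numpy
# import torch
print(os.environ["OPENBLAS_NUM_THREADS"])
```

If the import is still slow with the pool capped, the thread pool is probably a red herring and the time is going somewhere else in library initialization.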
It looks like someone else hit this in numpy/numpy#24639, too, but... https://xkcd.com/979/.
How do I even go about debugging this further? Is it an issue in pytorch? numpy? OpenBLAS? PEBKAC?

Importing numpy alone doesn't seem problematic, though I suspect it's part of the dependency chain here. Perhaps it's related to how pytorch (mis)uses numpy?