perf report shows most cycles spent in blas_thread_server #5171

Open
@nickdesaulniers

Description

I'm not sure whether this is an issue in OpenBLAS itself or in users of OpenBLAS (numpy, pytorch).

I'm seeing slow Python imports of pytorch; literally `import torch` takes multiple seconds on my system.
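To put a number on it, a minimal timing sketch (using numpy as a stand-in for the torch import, since the same technique applies; CPython's `-X importtime` flag gives a finer per-module breakdown):

```python
import time

t0 = time.perf_counter()
import numpy  # stand-in for `import torch`; time whichever import is slow
elapsed = time.perf_counter() - t0
print(f"import took {elapsed:.3f}s")
```

Running `python -X importtime -c 'import torch'` and sorting its stderr output shows which submodule in the import chain actually burns the time.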

When I record the Python interpreter with Linux `perf record`, `perf report` shows most cycles are spent in `blas_thread_server` in BOTH liblapack.so.3 and libcblas.so.3:

Overhead  Command  Shared Object             Symbol
  40.31%  python   liblapack.so.3            [.] blas_thread_server
  36.85%  python   libcblas.so.3             [.] blas_thread_server

If I annotate either symbol, the hot spot in both is right after reading the time stamp counter:

  0.31 │3c:┌─→mov   (%r15),%rax
       │   │  cmp   $0x1,%rax
       │   │↓ ja    b0
       │   │  nop
       │   │  nop
       │   │  nop
       │   │  nop
       │   │  nop
       │   │  nop
  5.29 │   │  nop
       │   │  nop
       │   │  rdtsc
 91.82 │   │  sub   %ecx,%eax
       │   │  cmp   %eax,thread_timeout
  2.59 │   └──jae   3c

I'm guessing that corresponds to the code around here.

numpy/numpy#24639 seems like someone else hit this, too, but...https://xkcd.com/979/.

How do I even go about debugging this further? Is it an issue in pytorch? numpy? OpenBLAS? PEBKAC?
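One way to narrow it down is to check which BLAS/LAPACK shared objects actually end up mapped into the interpreter. A Linux-only sketch (it reads /proc/self/maps; the helper name is mine, not an existing API):

```python
import numpy as np  # replace with `import torch` to test the pytorch path

def loaded_blas_libs():
    """Return paths of mapped shared objects that look like BLAS/LAPACK."""
    with open("/proc/self/maps") as maps:
        paths = {line.split()[-1] for line in maps if "/" in line}
    return sorted(p for p in paths
                  if any(k in p.lower() for k in ("blas", "lapack")))

print(loaded_blas_libs())
```

If both liblapack.so.3 and libcblas.so.3 show up here, two separate copies of the OpenBLAS thread pool may be spinning, which would be consistent with the two `blas_thread_server` entries in the profile above.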

Importing numpy alone doesn't seem problematic, though I suspect it's part of the dependency chain here. Perhaps it's related to how pytorch is (mis)using numpy?
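For what it's worth, a hedged workaround sketch: OpenBLAS honors OPENBLAS_NUM_THREADS (and, in builds I've looked at, OPENBLAS_THREAD_TIMEOUT, which sets `thread_timeout` to `1 << N`), but these are read when the library loads, so they must be set before the first import that drags OpenBLAS in:

```python
import os

# Must happen before numpy/torch load OpenBLAS; with a single thread
# there is no worker pool, hence no blas_thread_server spin loop.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
# os.environ["OPENBLAS_THREAD_TIMEOUT"] = "4"  # shorter spin, if the build supports it

import numpy as np  # only after the env vars are set

print(np.dot([[1.0, 2.0]], [[3.0], [4.0]]))  # -> [[11.]]
```

That's a mitigation rather than a fix, but if it makes the import fast again it would at least confirm the spin loop is what's being measured.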
