[Issue]: svd, cholesky, eigh up to 40x slower on MI250 than A100 #278

@PhilipVinc

Problem Description

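Linear algebra solvers such as svd, cholesky and eigh are roughly 2x to 40x slower on the AMD MI250X than on the NVIDIA A100 across all floating-point data types. In the float32 timings below, eigh is about 2-3x slower, cholesky about 4-5x slower, and svd about 22-42x slower.

I am comparing an MI250X against an A100.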

AMD MI250X (dtype=<class 'jax.numpy.float32'>):

N     chol [ms]  svd [ms]   eigh [ms]  cg [ms]
----  ---------  ---------  ---------  ---------
256       1.029    245.825      7.309      6.401
512       2.121   1241.506     13.595      6.554
1024      4.187   3324.264     28.191      7.130
1536      6.426   7432.767     45.737      8.120
2048      8.986  12868.906     71.927      9.296

NVIDIA A100 (64GB) (dtype=<class 'jax.numpy.float32'>):

N     chol [ms]  svd [ms]   eigh [ms]  cg [ms]
----  ---------  ---------  ---------  ---------
256       0.266     10.987      2.544      3.848
512       0.446     30.270      5.977      5.235
1024      0.863     79.555     13.146      5.457
1536      1.243    252.512     22.850      5.460
2048      1.693    435.350     33.981      5.478
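
To make the gap easier to read, the two tables above can be turned into per-operation slowdown ratios (MI250X time divided by A100 time). The following is a small sketch that only re-uses the numbers already listed; the names `OPS`, `MI250X`, `A100` and `slowdown` are made up for illustration and are not part of the attached script.

```python
# Timings in ms copied from the float32 tables above, keyed by matrix size N.
OPS = ("chol", "svd", "eigh", "cg")
MI250X = {
    256: (1.029, 245.825, 7.309, 6.401),
    512: (2.121, 1241.506, 13.595, 6.554),
    1024: (4.187, 3324.264, 28.191, 7.130),
    1536: (6.426, 7432.767, 45.737, 8.120),
    2048: (8.986, 12868.906, 71.927, 9.296),
}
A100 = {
    256: (0.266, 10.987, 2.544, 3.848),
    512: (0.446, 30.270, 5.977, 5.235),
    1024: (0.863, 79.555, 13.146, 5.457),
    1536: (1.243, 252.512, 22.850, 5.460),
    2048: (1.693, 435.350, 33.981, 5.478),
}


def slowdown(amd, nvidia):
    """Map each N to a dict of per-operation ratios amd_time / nvidia_time."""
    return {
        n: {op: a / b for op, a, b in zip(OPS, amd[n], nvidia[n])}
        for n in amd
    }


ratios = slowdown(MI250X, A100)

# Print one row per matrix size, one column per operation.
print("N     " + "  ".join(f"{op:>6s}" for op in OPS))
for n, row in ratios.items():
    print(f"{n:<6d}" + "  ".join(f"{row[op]:5.1f}x" for op in OPS))
```

With these numbers the svd ratio peaks at a little under 42x for N=1024, while eigh stays just above 2x for every size except N=256.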

Operating System

Red Hat Enterprise Linux 8.10 (Ootpa)

CPU

AMD EPYC 7A53 64-Core Processor

GPU

AMD Instinct MI250X (amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-)

ROCm Version

ROCm 7.1.1

ROCm Component

hipSOLVER

Steps to Reproduce

Run the following (ChatGPT-generated) benchmark solver script:
benchmark_solver.py
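
The attached script is not reproduced here. For readers who want a rough idea of how such timings are usually taken with JAX, a minimal sketch might look like the one below. It is an assumption about the general approach, not the contents of `benchmark_solver.py`: the helper names `time_ms` and `run`, the matrix sizes, and the warmup and repeat counts are all hypothetical, and the cg column is left out.

```python
import time

try:
    import jax
    import jax.numpy as jnp
except ImportError:  # lets the timing helper be imported without JAX
    jax = None


def time_ms(fn, *args, sync=lambda out: out, warmup=1, repeats=5):
    """Return the best wall-clock time of fn(*args) in milliseconds.

    `sync` must block until the result is ready; JAX dispatches work
    asynchronously, so timing without it would only measure the launch.
    """
    for _ in range(warmup):  # the first call includes JIT compilation
        sync(fn(*args))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        sync(fn(*args))
        best = min(best, (time.perf_counter() - start) * 1e3)
    return best


def run(sizes=(256, 512), dtype=None):
    """Time cholesky, svd and eigh on random N x N matrices (hypothetical sizes)."""
    dtype = jnp.float32 if dtype is None else dtype
    chol = jax.jit(jnp.linalg.cholesky)
    svd = jax.jit(jnp.linalg.svd)
    eigh = jax.jit(jnp.linalg.eigh)
    rows = []
    for n in sizes:
        a = jax.random.normal(jax.random.PRNGKey(0), (n, n), dtype=dtype)
        # Symmetric positive definite input for cholesky and eigh.
        spd = a @ a.T + n * jnp.eye(n, dtype=dtype)
        rows.append((
            n,
            time_ms(chol, spd, sync=jax.block_until_ready),
            time_ms(svd, a, sync=jax.block_until_ready),
            time_ms(eigh, spd, sync=jax.block_until_ready),
        ))
    return rows
```

On a machine with JAX installed, `run()` returns one `(N, chol_ms, svd_ms, eigh_ms)` tuple per size. Taking the best of several repeats after a warmup call keeps compilation time out of the measurement.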

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

*******
Agent 5
*******
  Name:                    gfx90a
  Uuid:                    GPU-a87b9af407bafdbc
  Marketing Name:          AMD Instinct MI250X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    8
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29704(0x7408)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   53504
  Internal Node ID:        8
  Compute Unit:            110
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    TRUE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        2147483647(0x7fffffff)
    y                        65535(0xffff)
    z                        65535(0xffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
*** Done ***

Labels

status: triage (indicates an issue has been assigned for investigation)