[Issue]: svd, cholesky, eigh up to 40x slower on MI250X than A100

### Problem Description

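Linear algebra solvers such as svd, cholesky, and eigh are roughly 2-40x slower on the MI250X than on the NVIDIA A100 across all floating-point data types. In the float32 results below, eigh is about 2-3x slower, cholesky about 4-5x, and svd about 22-42x.

I am comparing an MI250X against an A100.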

```bash
AMD MI250X (dtype=<class 'jax.numpy.float32'>):

N     chol [ms]  svd [ms]   eigh [ms]  cg [ms]
----  ---------  ---------  ---------  ---------
256       1.029    245.825      7.309      6.401
512       2.121   1241.506     13.595      6.554
1024      4.187   3324.264     28.191      7.130
1536      6.426   7432.767     45.737      8.120
2048      8.986  12868.906     71.927      9.296
```

```bash
NVIDIA A100 (64GB) (dtype=<class 'jax.numpy.float32'>):

N     chol [ms]  svd [ms]   eigh [ms]  cg [ms]
----  ---------  ---------  ---------  ---------
256       0.266     10.987      2.544      3.848
512       0.446     30.270      5.977      5.235
1024      0.863     79.555     13.146      5.457
1536      1.243    252.512     22.850      5.460
2048      1.693    435.350     33.981      5.478
```
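The per-solver slowdown can be read directly off the two tables. The snippet below is a small sketch that divides the MI250X timings by the A100 timings; the numbers are copied from the tables above, while the names `SIZES`, `MI250X`, `A100`, and `slowdown` are only illustrative and do not come from the benchmark script.

```python
# Timings in milliseconds, copied from the two float32 tables above.
SIZES = [256, 512, 1024, 1536, 2048]
MI250X = {
    "chol": [1.029, 2.121, 4.187, 6.426, 8.986],
    "svd": [245.825, 1241.506, 3324.264, 7432.767, 12868.906],
    "eigh": [7.309, 13.595, 28.191, 45.737, 71.927],
    "cg": [6.401, 6.554, 7.130, 8.120, 9.296],
}
A100 = {
    "chol": [0.266, 0.446, 0.863, 1.243, 1.693],
    "svd": [10.987, 30.270, 79.555, 252.512, 435.350],
    "eigh": [2.544, 5.977, 13.146, 22.850, 33.981],
    "cg": [3.848, 5.235, 5.457, 5.460, 5.478],
}


def slowdown(amd, nvidia):
    """Return the MI250X / A100 time ratio per solver, one value per matrix size."""
    return {name: [a / b for a, b in zip(amd[name], nvidia[name])] for name in amd}


ratios = slowdown(MI250X, A100)
print("N    ", "  ".join(f"{n:7d}" for n in SIZES))
for name, values in ratios.items():
    print(f"{name:5s}", "  ".join(f"{v:6.1f}x" for v in values))
```

With these numbers, svd peaks at roughly 42x at N=1024, cholesky sits at about 4-5x, eigh at about 2-3x, and cg stays below 2x.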

### Operating System

Red Hat Enterprise Linux 8.10 (Ootpa)

### CPU

AMD EPYC 7A53 64-Core Processor

### GPU

AMD Instinct MI250X (amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-)

### ROCm Version

ROCm 7.1.1

### ROCm Component

hipSOLVER

### Steps to Reproduce

Run the following ChatGPT-generated benchmark script:
[benchmark_solver.py](https://github.com/user-attachments/files/24826846/benchmark_solver.py)
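The attached script is not reproduced here. As a rough idea of what such a measurement involves, the following is a minimal sketch, assuming the solvers are timed through jitted `jax.numpy.linalg` calls on a random symmetric positive-definite matrix. The helper names `make_spd`, `time_ms`, and `benchmark` are hypothetical and may not match the attached script; cg is left out for brevity.

```python
import time

import jax
import jax.numpy as jnp


def make_spd(key, n, dtype=jnp.float32):
    """Random symmetric positive-definite matrix, a valid input for cholesky and eigh."""
    a = jax.random.normal(key, (n, n), dtype=dtype)
    return a @ a.T + n * jnp.eye(n, dtype=dtype)


def time_ms(fn, x, repeats=5):
    """Median wall-clock time of fn(x) in milliseconds, excluding compilation."""
    jax.block_until_ready(fn(x))  # warm-up call triggers JIT compilation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Dispatch is asynchronous, so wait for the result before stopping the clock.
        jax.block_until_ready(fn(x))
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]


# Assumption: full decompositions with default arguments are being measured.
SOLVERS = {
    "chol": jax.jit(lambda a: jnp.linalg.cholesky(a)),
    "svd": jax.jit(lambda a: jnp.linalg.svd(a)),
    "eigh": jax.jit(lambda a: jnp.linalg.eigh(a)),
}


def benchmark(sizes=(256, 512), repeats=5):
    """Return {N: {solver name: median time in ms}} for each matrix size N."""
    rows = {}
    for n in sizes:
        a = make_spd(jax.random.PRNGKey(n), n)
        rows[n] = {name: time_ms(fn, a, repeats) for name, fn in SOLVERS.items()}
    return rows


if __name__ == "__main__":
    print(jax.devices())
    for n, row in benchmark().items():
        print(n, {name: round(ms, 3) for name, ms in row.items()})
```

The warm-up call and `jax.block_until_ready` matter here: without them, the numbers would include compilation time or would only measure how long it takes to enqueue the work.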


### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

```bash
*******
Agent 5
*******
  Name:                    gfx90a
  Uuid:                    GPU-a87b9af407bafdbc
  Marketing Name:          AMD Instinct MI250X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    8
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29704(0x7408)
  ASIC Revision:           1(0x1)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1700
  BDFID:                   53504
  Internal Node ID:        8
  Compute Unit:            110
  SIMDs per CU:            4
  Shader Engines:          8
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    TRUE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        2147483647(0x7fffffff)
    y                        65535(0xffff)
    z                        65535(0xffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 92
  SDMA engine uCode::      9
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    67092480(0x3ffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
*** Done ***
```

