Open
Labels
status: triage (indicates an issue has been assigned for investigation)
Description
Problem Description
Linear algebra solvers such as svd, cholesky, and eigh are roughly 2-40x slower than their NVIDIA counterparts across all floating-point data types (float32 timings shown below).
I am comparing an MI250X against an A100.
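To put numbers on the gap, the per-size slowdown can be computed directly from the float32 timings reported in the tables below. This is a small sketch in plain Python; the dictionary and function names (`MI250X_MS`, `A100_MS`, `slowdown`) are illustrative and not part of the original benchmark. With these numbers, cholesky comes out at roughly 4-5x, eigh at roughly 2-3x, cg below 2x, and svd between roughly 22x and 42x, peaking at N=512 and N=1024.

```python
# Float32 timings in milliseconds, copied from the two tables in this report.
SIZES = [256, 512, 1024, 1536, 2048]

MI250X_MS = {
    "chol": [1.029, 2.121, 4.187, 6.426, 8.986],
    "svd": [245.825, 1241.506, 3324.264, 7432.767, 12868.906],
    "eigh": [7.309, 13.595, 28.191, 45.737, 71.927],
    "cg": [6.401, 6.554, 7.130, 8.120, 9.296],
}

A100_MS = {
    "chol": [0.266, 0.446, 0.863, 1.243, 1.693],
    "svd": [10.987, 30.270, 79.555, 252.512, 435.350],
    "eigh": [2.544, 5.977, 13.146, 22.850, 33.981],
    "cg": [3.848, 5.235, 5.457, 5.460, 5.478],
}


def slowdown(solver):
    """Per-size ratio of MI250X time to A100 time for one solver."""
    return [amd / nv for amd, nv in zip(MI250X_MS[solver], A100_MS[solver])]


# One row per solver, one column per matrix size in SIZES.
print("solver " + " ".join(f"{n:>7d}" for n in SIZES))
for solver in MI250X_MS:
    print(f"{solver:<6s} " + " ".join(f"{r:6.1f}x" for r in slowdown(solver)))
```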
AMD MI250X (dtype=<class 'jax.numpy.float32'>):
   N  chol [ms]   svd [ms]  eigh [ms]    cg [ms]
----  ---------  ---------  ---------  ---------
 256      1.029    245.825      7.309      6.401
 512      2.121   1241.506     13.595      6.554
1024      4.187   3324.264     28.191      7.130
1536      6.426   7432.767     45.737      8.120
2048      8.986  12868.906     71.927      9.296

NVIDIA A100 (64Gb) (dtype=<class 'jax.numpy.float32'>):
   N  chol [ms]   svd [ms]  eigh [ms]    cg [ms]
----  ---------  ---------  ---------  ---------
 256      0.266     10.987      2.544      3.848
 512      0.446     30.270      5.977      5.235
1024      0.863     79.555     13.146      5.457
1536      1.243    252.512     22.850      5.460
2048      1.693    435.350     33.981      5.478

Operating System
Red Hat Enterprise Linux 8.10 (Ootpa)
CPU
AMD EPYC 7A53 64-Core Processor
GPU
AMD Instinct MI250X (amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-)
ROCm Version
ROCm 7.1.1
ROCm Component
hipSOLVER
Steps to Reproduce
Run the attached (ChatGPT-generated) benchmark solver script:
benchmark_solver.py
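The contents of benchmark_solver.py are not reproduced in this text. For readers without the attachment, here is a minimal sketch of a comparable JAX benchmark. It is an assumption about the measurement setup, not the attached script: the helper names (`random_spd`, `time_ms`, `benchmark`), the random SPD test matrix, the median-of-repeats timing, and the CG iteration cap of 100 are all hypothetical choices made for illustration.

```python
import time

import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg


def random_spd(key, n, dtype=jnp.float32):
    """Return a well-conditioned symmetric positive-definite n x n matrix."""
    a = jax.random.normal(key, (n, n), dtype=dtype)
    return a @ a.T / n + jnp.eye(n, dtype=dtype)


def time_ms(fn, *args, repeats=5):
    """Median wall-clock time of fn(*args) in ms, excluding JIT compilation."""
    jax.block_until_ready(fn(*args))  # warm-up call triggers compilation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # JAX dispatches asynchronously, so wait for the result before
        # stopping the clock.
        jax.block_until_ready(fn(*args))
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    return samples[len(samples) // 2]


def benchmark(sizes=(256, 512, 1024, 1536, 2048), dtype=jnp.float32, repeats=5):
    """Time cholesky, svd, eigh and cg for each size; returns one dict per size.

    Note: float64 additionally requires enabling the `jax_enable_x64` option.
    """
    chol_fn = jax.jit(jnp.linalg.cholesky)
    svd_fn = jax.jit(jnp.linalg.svd)
    eigh_fn = jax.jit(jnp.linalg.eigh)
    # Assumed CG setup: matrix-free operator with a fixed iteration cap.
    cg_fn = jax.jit(lambda m, b: cg(lambda x: m @ x, b, maxiter=100)[0])

    rows = []
    for n in sizes:
        key_m, key_b = jax.random.split(jax.random.PRNGKey(n))
        m = random_spd(key_m, n, dtype)
        b = jax.random.normal(key_b, (n,), dtype=dtype)
        rows.append({
            "N": n,
            "chol": time_ms(chol_fn, m, repeats=repeats),
            "svd": time_ms(svd_fn, m, repeats=repeats),
            "eigh": time_ms(eigh_fn, m, repeats=repeats),
            "cg": time_ms(cg_fn, m, b, repeats=repeats),
        })
    return rows
```

Calling `benchmark()` and printing each row's values gives a table in the same shape as the ones above. `jax.devices()` shows which backend the timings were taken on.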
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
*******
Agent 5
*******
Name: gfx90a
Uuid: GPU-a87b9af407bafdbc
Marketing Name: AMD Instinct MI250X
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 8
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29704(0x7408)
ASIC Revision: 1(0x1)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 1700
BDFID: 53504
Internal Node ID: 8
Compute Unit: 110
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: TRUE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 92
SDMA engine uCode:: 9
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
*** Done ***