Task03 Амир Батыров ITMO#1047

Open
c5xheavy wants to merge 1 commit into GPGPUCourse:task03 from c5xheavy:task03

c5xheavy commented Feb 26, 2026

Local output

$ ./main_matrix_transpose 1
Found 3 GPUs in 0.0466199 sec (OpenCL: 0.0275665 sec, Vulkan: 0.0190034 sec)
Available devices:
  Device #0: API: Vulkan. iGPU. Intel(R) Arc(tm) Graphics (MTL). Free memory: 5406/7750 Mb.
  Device #1: API: OpenCL. CPU. Intel(R) Core(TM) Ultra 5 125H. Intel(R) Corporation. Total memory: 15501 Mb.
  Device #2: API: Vulkan. CPU. llvmpipe (LLVM 20.1.2, 256 bits). Free memory: 15501/15501 Mb.
Using device #1: API: OpenCL. CPU. Intel(R) Core(TM) Ultra 5 125H. Intel(R) Corporation. Total memory: 15501 Mb.
Using OpenCL API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
Kernels compilation done in 0.0794389 seconds
algorithm times (in seconds) - 10 values (min=0.268004 10%=0.281873 median=0.293419 90%=0.408015 max=0.408015)
median effective algorithm bandwidth: 3.40809 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
Kernels compilation done in 0.031347 seconds
algorithm times (in seconds) - 10 values (min=0.0461301 10%=0.0461855 median=0.0465298 90%=0.0790783 max=0.0790783)
median effective algorithm bandwidth: 21.4916 GB/s
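
The reported transpose bandwidth can be reproduced from the median time. The sketch below assumes the harness counts one read plus one write of the 512 MB matrix, treats "GB" as 2^30 bytes, and uses 4-byte elements; the harness code is not shown in this PR, so these are inferences from the printed numbers, and the function name is hypothetical.

```python
GIB = 1024 ** 3

def effective_bandwidth_gib_s(matrix_bytes, seconds):
    # Assumption: the matrix is read once and written once per run.
    return 2 * matrix_bytes / seconds / GIB

# rows=H=8192 x cols=W=16384, 4 bytes per element (assumed) -> 512 MiB
matrix_bytes = 8192 * 16384 * 4

# Median times from the local run above.
naive = effective_bandwidth_gib_s(matrix_bytes, 0.293419)
local_memory = effective_bandwidth_gib_s(matrix_bytes, 0.0465298)

print(f"naive:        {naive:.4f} GB/s")
print(f"local memory: {local_memory:.4f} GB/s")
```

Under these assumptions the computed values agree with the logged 3.40809 GB/s and 21.4916 GB/s.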


$ ./main_matrix_multiply 1
Found 3 GPUs in 0.0602803 sec (OpenCL: 0.0295763 sec, Vulkan: 0.0306527 sec)
Available devices:
  Device #0: API: Vulkan. iGPU. Intel(R) Arc(tm) Graphics (MTL). Free memory: 5456/7750 Mb.
  Device #1: API: OpenCL. CPU. Intel(R) Core(TM) Ultra 5 125H. Intel(R) Corporation. Total memory: 15501 Mb.
  Device #2: API: Vulkan. CPU. llvmpipe (LLVM 20.1.2, 256 bits). Free memory: 15501/15501 Mb.
Using device #1: API: OpenCL. CPU. Intel(R) Core(TM) Ultra 5 125H. Intel(R) Corporation. Total memory: 15501 Mb.
Using OpenCL API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=8.46722 10%=8.46722 median=8.46722 90%=8.46722 max=8.46722)
algorithm GFlops: 2.02799 GFlops
algorithm effective memory bandwidth: 0.00645873 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
Kernels compilation done in 0.0982374 seconds
algorithm times (in seconds) - 10 values (min=0.795038 10%=0.798513 median=0.845341 90%=1.18228 max=1.18228)
algorithm GFlops: 20.3131 GFlops
algorithm effective memory bandwidth: 0.0646928 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
Kernels compilation done in 0.0551597 seconds
algorithm times (in seconds) - 10 values (min=0.591612 10%=0.627633 median=0.661366 90%=0.754793 max=0.754793)
algorithm GFlops: 25.9637 GFlops
algorithm effective memory bandwidth: 0.0826887 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
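
The GFlops and bandwidth figures for the multiplication can also be reproduced from the median times. The sketch below assumes each output element costs K multiplications and K-1 additions, that the elements are 4-byte floats, and that "GB" means 2^30 bytes. These are inferred from the printed values rather than taken from the harness source, and the helper names are hypothetical. The bandwidth numbers match only when C is counted as 2048 x 4096 x 4 bytes = 32 MB, although the log prints "C - 16 MB".

```python
GIB = 1024 ** 3

H, W, K = 2048, 4096, 1024
BYTES_PER_VALUE = 4  # assumption: float32 elements

def gflops(seconds):
    # Assumption: K multiplications and K-1 additions per element of C.
    flops = H * W * (2 * K - 1)
    return flops / seconds / 1e9

def bandwidth_gib_s(seconds):
    # Assumption: A and B are read once, C is written once.
    total_bytes = (H * K + K * W + H * W) * BYTES_PER_VALUE
    return total_bytes / seconds / GIB

# Median times from the local run above.
local_medians = {
    "CPU with OpenMP": 8.46722,
    "01 naive": 0.845341,
    "02 using local memory": 0.661366,
}

results = {name: (gflops(t), bandwidth_gib_s(t)) for name, t in local_medians.items()}
for name, (gf, bw) in results.items():
    print(f"{name}: {gf:.4f} GFlops, {bw:.6f} GB/s")
```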

GitHub CI output

Run ./main_matrix_transpose 0
Found 2 GPUs in 0.0531353 sec (CUDA: 8.5881e-05 sec, OpenCL: 0.0247909 sec, Vulkan: 0.0282084 sec)
Available devices:
  Device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
  Device #1: API: Vulkan. CPU. llvmpipe (LLVM 20.1.2, 256 bits). Free memory: 15990/15990 Mb.
Using device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
Using OpenCL API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
Kernels compilation done in 0.124981 seconds
algorithm times (in seconds) - 10 values (min=0.501865 10%=0.531583 median=0.574656 90%=0.678069 max=0.678069)
median effective algorithm bandwidth: 1.74017 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
Kernels compilation done in 0.0448901 seconds
algorithm times (in seconds) - 10 values (min=0.169914 10%=0.170557 median=0.170789 90%=0.216634 max=0.216634)
median effective algorithm bandwidth: 5.85519 GB/s

Run ./main_matrix_multiply 0
Found 2 GPUs in 0.0521385 sec (CUDA: 0.00010718 sec, OpenCL: 0.02417 sec, Vulkan: 0.0278133 sec)
Available devices:
  Device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
  Device #1: API: Vulkan. CPU. llvmpipe (LLVM 20.1.2, 256 bits). Free memory: 15990/15990 Mb.
Using device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
Using OpenCL API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=17.4311 10%=17.4311 median=17.4311 90%=17.4311 max=17.4311)
algorithm GFlops: 0.985106 GFlops
algorithm effective memory bandwidth: 0.00313735 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
Kernels compilation done in 0.128409 seconds
algorithm times (in seconds) - 10 values (min=1.61385 10%=1.67209 median=1.67886 90%=1.78617 max=1.78617)
algorithm GFlops: 10.2281 GFlops
algorithm effective memory bandwidth: 0.0325742 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
Kernels compilation done in 0.0707914 seconds
algorithm times (in seconds) - 10 values (min=2.79739 10%=2.79834 median=2.80095 90%=2.87392 max=2.87392)
algorithm GFlops: 6.13059 GFlops
algorithm effective memory bandwidth: 0.0195246 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05

@GPUcourseBOT (Collaborator)

Test results for PR #1047

Test logs
=== STATUS: Programs completed successfully: main_matrix_transpose, main_matrix_multiply ===
=== main_matrix_transpose stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 8.72174 sec (CUDA: 0.11048 sec, OpenCL: 0.8089 sec, Vulkan: 7.80229 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
Matrix size: rows=H=8192 x cols=W=16384 (512 MB)
______________________________________________________
Evaluating algorithm #1/2: 01 naive transpose (non-coalesced)
Kernels compilation done in 3.46203 seconds
algorithm times (in seconds) - 10 values (min=0.028699 10%=0.0297832 median=0.0298146 90%=3.4951 max=3.4951)
median effective algorithm bandwidth: 33.5407 GB/s
______________________________________________________
Evaluating algorithm #2/2: 02 transpose via local memory (coalesced)
Kernels compilation done in 0.0991758 seconds
algorithm times (in seconds) - 10 values (min=0.00847714 10%=0.00847844 median=0.00849441 90%=0.107754 max=0.107754)
median effective algorithm bandwidth: 117.724 GB/s
=== main_matrix_multiply stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.320729 sec (CUDA: 0.125026 sec, OpenCL: 0.0377321 sec, Vulkan: 0.157902 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
C = A x B, matrices size: C (rows=H=2048 x cols=W=4096) = A (rows=H=2048 x cols=K=1024) x B (rows=K=1024 x cols=W=4096)
matrices data size: A - 8 MB, B - 16 MB, C - 16 MB
______________________________________________________
Evaluating algorithm #1/3: CPU with OpenMP
algorithm times (in seconds) - 1 values (min=11.9457 10%=11.9457 median=11.9457 90%=11.9457 max=11.9457)
algorithm GFlops: 1.43746 GFlops
algorithm effective memory bandwidth: 0.00457799 GB/s
______________________________________________________
Evaluating algorithm #2/3: 01 naive
Kernels compilation done in 0.107665 seconds
algorithm times (in seconds) - 10 values (min=0.061319 10%=0.0617409 median=0.0631913 90%=0.172276 max=0.172276)
algorithm GFlops: 271.738 GFlops
algorithm effective memory bandwidth: 0.865428 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
______________________________________________________
Evaluating algorithm #3/3: 02 using local memory
Kernels compilation done in 0.111219 seconds
algorithm times (in seconds) - 10 values (min=0.0288905 10%=0.0315843 median=0.031748 90%=0.136915 max=0.136915)
algorithm GFlops: 540.869 GFlops
algorithm effective memory bandwidth: 1.72255 GB/s
relative differences with CPU: 8388608 values (min=0 10%=0 median=2.21073e-07 90%=1.12363e-06 max=2.77294)
median relative difference with CPU: 2.21073e-07
99% percentile relative difference with CPU: 1.09303e-05
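
To compare the three environments, the speedup of the local-memory kernel over the naive one can be computed from the median times in the logs. This is a small sketch; the short keys (`local`, `ci`, `t4`) are labels chosen here for the local run, the GitHub CI run, and the Tesla T4 run.

```python
# (naive median, local-memory median) in seconds, copied from the logs.
medians = {
    "local": {"transpose": (0.293419, 0.0465298), "multiply": (0.845341, 0.661366)},
    "ci":    {"transpose": (0.574656, 0.170789),  "multiply": (1.67886, 2.80095)},
    "t4":    {"transpose": (0.0298146, 0.00849441), "multiply": (0.0631913, 0.031748)},
}

# Speedup > 1 means the local-memory kernel is faster than the naive one.
speedups = {
    env: {task: naive / local_mem for task, (naive, local_mem) in tasks.items()}
    for env, tasks in medians.items()
}

for env, tasks in speedups.items():
    for task, s in tasks.items():
        print(f"{env:5s} {task:9s} x{s:.2f}")
```

By these numbers the local-memory multiply is slower than the naive one on the CI CPU device (speedup below 1), while it is roughly twice as fast on the Tesla T4.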

View full logs

c5xheavy changed the title from "Task03 done" to "Task03 Амир Батыров ITMO" on Feb 26, 2026