Task02 Максим Синицын HSE #1046

Open
c0ldzy17 wants to merge 1 commit into GPGPUCourse:task02 from c0ldzy17:task02

c0ldzy17 commented Feb 25, 2026

Local output

$ ./main_mandelbrot
Found 1 GPUs in 0.0867177 sec (OpenCL: 0.0719731 sec, Vulkan: 0.014714 sec)
Available devices:
  Device #0: API: OpenCL+Vulkan. GPU. Apple M3 Pro. Free memory: 27648/27648 Mb.
Using device #0: API: OpenCL+Vulkan. GPU. Apple M3 Pro. Free memory: 27648/27648 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=3.02797 10%=3.02797 median=3.02797 90%=3.02797 max=3.02797)
Mandelbrot effective algorithm GFlops: 3.30254 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x1 threads
algorithm times (in seconds) - 10 values (min=2.88476 10%=2.9021 median=2.99328 90%=3.01572 max=3.01572)
Mandelbrot effective algorithm GFlops: 3.34081 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 0.046684 seconds
algorithm times (in seconds) - 10 values (min=0.00277083 10%=0.00283254 median=0.00326779 90%=0.104893 max=0.104893)
Mandelbrot effective algorithm GFlops: 3060.17 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0%
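
The GFlops figures above look like a fixed amount of work divided by the median time: 3.02797 s × 3.30254 GFlops comes out at roughly 10 GFlop per run. Below is a minimal sketch of that calculation. `TOTAL_GFLOP` is an assumption inferred from the log, not a value taken from the benchmark source, and `effective_gflops` is a hypothetical helper name.

```python
# Assumption: about 10 GFlop of work per Mandelbrot run, inferred from the log
# (median time * reported GFlops), not read from the benchmark source.
TOTAL_GFLOP = 10.0


def effective_gflops(median_seconds: float, total_gflop: float = TOTAL_GFLOP) -> float:
    """Effective throughput: total work in GFlop divided by the median run time."""
    return total_gflop / median_seconds


cpu = effective_gflops(3.02797)     # CPU median from the log above
gpu = effective_gflops(0.00326779)  # GPU median from the log above
print(f"CPU {cpu:.2f} GFlops, GPU {gpu:.2f} GFlops, speedup x{gpu / cpu:.0f}")
```

Because the median is used, the slow first GPU run (max=0.104893 s in the log) does not affect the reported figure.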

$ ./main_sum
Found 1 GPUs in 0.0781408 sec (OpenCL: 0.0697241 sec, Vulkan: 0.00838258 sec)
Available devices:
  Device #0: API: OpenCL+Vulkan. GPU. Apple M3 Pro. Free memory: 27648/27648 Mb.
Using device #0: API: OpenCL+Vulkan. GPU. Apple M3 Pro. Free memory: 27648/27648 Mb.
Using OpenCL API...

PCI-E median bandwidth, gb/s26.7932

______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.074479 10%=0.0744963 median=0.0753583 90%=0.079414 max=0.079414)
sum median effective algorithm bandwidth: 4.94344 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.0750597 10%=0.0751141 median=0.0753517 90%=0.0761223 max=0.0761223)
sum median effective algorithm bandwidth: 4.94387 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.00644254 seconds
algorithm times (in seconds) - 10 values (min=0.00513563 10%=0.00516717 median=0.00523929 90%=0.0340748 max=0.0340748)
sum median effective algorithm bandwidth: 71.1029 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.00472746 seconds
algorithm times (in seconds) - 10 values (min=0.003268 10%=0.00328471 median=0.00332446 90%=0.0192198 max=0.0192198)
sum median effective algorithm bandwidth: 112.057 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.00745 seconds
algorithm times (in seconds) - 10 values (min=0.00969388 10%=0.00970875 median=0.00979133 90%=0.0354206 max=0.0354206)
sum median effective algorithm bandwidth: 38.0468 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.00688663 seconds
algorithm times (in seconds) - 10 values (min=0.0100121 10%=0.010054 median=0.0101882 90%=0.0344988 max=0.0344988)
sum median effective algorithm bandwidth: 36.5647 GB/s
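
The bandwidth figures above are consistent with reading 100,000,000 4-byte values once and dividing by the median time, with "GB/s" meaning 2^30 bytes per second. Both the input size and the unit are inferred from the numbers in the log, not from the benchmark source. A minimal sketch under those assumptions (`effective_bandwidth` is a hypothetical helper name):

```python
# Assumptions (inferred from the log, not from the benchmark source):
# the input is 100,000,000 4-byte values and "GB/s" means 2**30 bytes per second.
N_VALUES = 100_000_000
BYTES_PER_VALUE = 4


def effective_bandwidth(median_seconds: float) -> float:
    """Bytes read once, divided by the median time, in units of 2**30 bytes/s."""
    total_bytes = N_VALUES * BYTES_PER_VALUE
    return total_bytes / median_seconds / 2**30


# Median times copied from the local ./main_sum log above.
medians = {
    "CPU": 0.0753583,
    "01 atomicAdd from each workItem": 0.00523929,
    "02 atomicAdd but each workItem loads K values": 0.00332446,
    "03 local memory and atomicAdd from master thread": 0.00979133,
    "04 local reduction": 0.0101882,
}
bandwidths = {name: effective_bandwidth(t) for name, t in medians.items()}
for name, bw in bandwidths.items():
    print(f"{name}: {bw:.2f} GB/s")
```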

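P.S. It seems that atomics are well optimized on the MacBook, so atomicAdd performs even better than algorithm 04 (local reduction): the atomicAdd variants reach 71.1 and 112.1 GB/s, versus 36.6 GB/s for local reduction.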
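
To make the comparison concrete, here is a CPU-side emulation of two of the strategies named in the log. It is only a sketch: the PR's actual OpenCL kernels are not shown in this conversation, and `GROUP_SIZE`, `K` and both function names are hypothetical choices made for illustration.

```python
# CPU-side emulation only; the PR's actual OpenCL kernels are not shown here.
# GROUP_SIZE (a power of two) and K are hypothetical illustration parameters.
GROUP_SIZE = 8
K = 4
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound


def sum_atomic_k(values, k=K):
    """Each 'work item' sums k values privately, then performs one 'atomic add'."""
    total = 0
    for start in range(0, len(values), k):
        partial = sum(values[start:start + k]) & MASK
        total = (total + partial) & MASK  # stands in for an atomic add on a global counter
    return total


def sum_local_reduction(values, group_size=GROUP_SIZE):
    """Each 'work group' tree-reduces its slice in 'local memory', then adds once."""
    total = 0
    for start in range(0, len(values), group_size):
        local = list(values[start:start + group_size])
        local += [0] * (group_size - len(local))  # pad the last, partial group
        stride = group_size // 2
        while stride > 0:  # on a GPU each pass would be separated by a barrier
            for i in range(stride):
                local[i] = (local[i] + local[i + stride]) & MASK
            stride //= 2
        total = (total + local[0]) & MASK  # one add per group instead of per item
    return total


data = list(range(1, 101))
print(sum_atomic_k(data), sum_local_reduction(data))
```

The tree reduction trades global atomic operations for log2(GROUP_SIZE) barrier-separated passes per group. Whether that trade pays off depends on how cheap the device's atomics are, which may be what the local M3 Pro numbers reflect.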

GitHub CI output

10s
Run ./main_mandelbrot 0
Found 2 GPUs in 0.0542028 sec (CUDA: 8.1131e-05 sec, OpenCL: 0.024507 sec, Vulkan: 0.0295653 sec)
Available devices:
  Device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
  Device #1: API: Vulkan. CPU. llvmpipe (LLVM 20.1.2, 256 bits). Free memory: 15990/15990 Mb.
Using device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=2.00634 10%=2.00634 median=2.00634 90%=2.00634 max=2.00634)
Mandelbrot effective algorithm GFlops: 4.98421 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x4 threads
algorithm times (in seconds) - 10 values (min=0.602773 10%=0.603107 median=0.607703 90%=0.610405 max=0.610405)
Mandelbrot effective algorithm GFlops: 16.4554 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 0.143955 seconds
algorithm times (in seconds) - 10 values (min=0.151749 10%=0.151773 median=0.151998 90%=0.298492 max=0.298492)
Mandelbrot effective algorithm GFlops: 65.7902 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0.942446%
30s
Run ./main_sum 0
Found 2 GPUs in 0.0540634 sec (CUDA: 0.00011851 sec, OpenCL: 0.0249612 sec, Vulkan: 0.0289137 sec)
Available devices:
  Device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
  Device #1: API: Vulkan. CPU. llvmpipe (LLVM 20.1.2, 256 bits). Free memory: 15990/15990 Mb.
Using device #0: API: OpenCL. CPU. AMD EPYC 7763 64-Core Processor                . Intel(R) Corporation. Total memory: 15990 Mb.
Using OpenCL API...

PCI-E median bandwidth, gb/s16.5902

______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.0328222 10%=0.0328458 median=0.0328911 90%=0.0367345 max=0.0367345)
sum median effective algorithm bandwidth: 11.3261 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.0208002 10%=0.0208316 median=0.0209392 90%=0.0214348 max=0.0214348)
sum median effective algorithm bandwidth: 17.791 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.116622 seconds
algorithm times (in seconds) - 10 values (min=1.46003 10%=1.46178 median=1.46396 90%=1.61607 max=1.61607)
sum median effective algorithm bandwidth: 0.254466 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.0315337 seconds
algorithm times (in seconds) - 10 values (min=0.734464 10%=0.734623 median=0.735698 90%=0.762192 max=0.762192)
sum median effective algorithm bandwidth: 0.506362 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.0508751 seconds
algorithm times (in seconds) - 10 values (min=0.0573873 10%=0.0574206 median=0.0576103 90%=0.109071 max=0.109071)
sum median effective algorithm bandwidth: 6.46636 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.0619403 seconds
algorithm times (in seconds) - 10 values (min=0.478458 10%=0.478631 median=0.479299 90%=0.543107 max=0.543107)
sum median effective algorithm bandwidth: 0.777237 GB/s

@GPUcourseBOT (Collaborator)

Test results for PR #1046

Test logs
=== STATUS: Programs executed successfully: main_mandelbrot, main_sum ===
=== main_mandelbrot stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 8.49291 sec (CUDA: 0.109427 sec, OpenCL: 0.710506 sec, Vulkan: 7.67292 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=3.32708 10%=3.32708 median=3.32708 90%=3.32708 max=3.32708)
Mandelbrot effective algorithm GFlops: 3.00564 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x4 threads
algorithm times (in seconds) - 10 values (min=1.03053 10%=1.03241 median=1.03983 90%=1.04567 max=1.04567)
Mandelbrot effective algorithm GFlops: 9.61692 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 2.64049 seconds
algorithm times (in seconds) - 10 values (min=0.00427276 10%=0.00427696 median=0.00428057 90%=2.64484 max=2.64484)
Mandelbrot effective algorithm GFlops: 2336.14 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0.942446%
=== main_sum stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.335302 sec (CUDA: 0.124155 sec, OpenCL: 0.0378121 sec, Vulkan: 0.173276 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
PCI-E median bandwidth, gb/s9.10427
______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.0337624 10%=0.0339747 median=0.0347294 90%=0.0349411 max=0.0349411)
sum median effective algorithm bandwidth: 10.7266 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.0159719 10%=0.0160234 median=0.0161723 90%=0.0166088 max=0.0166088)
sum median effective algorithm bandwidth: 23.035 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.0958129 seconds
algorithm times (in seconds) - 10 values (min=0.00275287 10%=0.00275343 median=0.00275491 90%=0.0986844 max=0.0986844)
sum median effective algorithm bandwidth: 135.223 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.0561924 seconds
algorithm times (in seconds) - 10 values (min=0.00146443 10%=0.00146472 median=0.00146585 90%=0.0577716 max=0.0577716)
sum median effective algorithm bandwidth: 254.138 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.325482 seconds
algorithm times (in seconds) - 10 values (min=0.00880772 10%=0.00919009 median=0.00919206 90%=0.334179 max=0.334179)
sum median effective algorithm bandwidth: 40.5273 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.0863545 seconds
algorithm times (in seconds) - 10 values (min=0.008671 10%=0.00867141 median=0.00867457 90%=0.0951242 max=0.0951242)
sum median effective algorithm bandwidth: 42.9449 GB/s

@GPUcourseBOT (Collaborator)

Test results for PR #1046

Test logs
=== STATUS: Programs executed successfully: main_mandelbrot, main_sum ===
=== main_mandelbrot stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.307631 sec (CUDA: 0.117623 sec, OpenCL: 0.0376242 sec, Vulkan: 0.152323 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=3.3262 10%=3.3262 median=3.3262 90%=3.3262 max=3.3262)
Mandelbrot effective algorithm GFlops: 3.00643 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x4 threads
algorithm times (in seconds) - 10 values (min=1.03202 10%=1.03632 median=1.03806 90%=1.0433 max=1.0433)
Mandelbrot effective algorithm GFlops: 9.63333 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 0.0598296 seconds
algorithm times (in seconds) - 10 values (min=0.00427458 10%=0.00427503 median=0.00428001 90%=0.0641682 max=0.0641682)
Mandelbrot effective algorithm GFlops: 2336.44 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0.942446%
=== main_sum stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.303042 sec (CUDA: 0.125126 sec, OpenCL: 0.0383245 sec, Vulkan: 0.139534 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
PCI-E median bandwidth, gb/s8.08702
______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.0366085 10%=0.0366685 median=0.0368502 90%=0.038256 max=0.038256)
sum median effective algorithm bandwidth: 10.1093 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.0159507 10%=0.0160822 median=0.0164099 90%=0.0169195 max=0.0169195)
sum median effective algorithm bandwidth: 22.7014 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.0494233 seconds
algorithm times (in seconds) - 10 values (min=0.00275441 10%=0.00275442 median=0.00275591 90%=0.0522888 max=0.0522888)
sum median effective algorithm bandwidth: 135.175 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.0420174 seconds
algorithm times (in seconds) - 10 values (min=0.00146484 10%=0.00146498 median=0.00146661 90%=0.0435887 max=0.0435887)
sum median effective algorithm bandwidth: 254.006 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.0811999 seconds
algorithm times (in seconds) - 10 values (min=0.00789586 10%=0.00789755 median=0.0082293 90%=0.0892047 max=0.0892047)
sum median effective algorithm bandwidth: 45.2686 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.0493637 seconds
algorithm times (in seconds) - 10 values (min=0.00807723 10%=0.00807878 median=0.00808285 90%=0.0575303 max=0.0575303)
sum median effective algorithm bandwidth: 46.0888 GB/s

@GPUcourseBOT (Collaborator)

Test results for PR #1046

Test logs
=== STATUS: Programs executed successfully: main_mandelbrot, main_sum ===
=== main_mandelbrot stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.313181 sec (CUDA: 0.120893 sec, OpenCL: 0.0372148 sec, Vulkan: 0.155016 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=3.33717 10%=3.33717 median=3.33717 90%=3.33717 max=3.33717)
Mandelbrot effective algorithm GFlops: 2.99655 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x4 threads
algorithm times (in seconds) - 10 values (min=1.03331 10%=1.03481 median=1.03838 90%=1.05374 max=1.05374)
Mandelbrot effective algorithm GFlops: 9.63036 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 0.058324 seconds
algorithm times (in seconds) - 10 values (min=0.00427603 10%=0.00427617 median=0.00427915 90%=0.062661 max=0.062661)
Mandelbrot effective algorithm GFlops: 2336.91 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0.942446%
=== main_sum stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.317838 sec (CUDA: 0.124509 sec, OpenCL: 0.0381234 sec, Vulkan: 0.155145 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
PCI-E median bandwidth, gb/s8.1416
______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.0369011 10%=0.036915 median=0.0370287 90%=0.0381435 max=0.0381435)
sum median effective algorithm bandwidth: 10.0606 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.01611 10%=0.0165244 median=0.0166567 90%=0.0173817 max=0.0173817)
sum median effective algorithm bandwidth: 22.3651 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.051149 seconds
algorithm times (in seconds) - 10 values (min=0.00275406 10%=0.00275426 median=0.00275525 90%=0.0540118 max=0.0540118)
sum median effective algorithm bandwidth: 135.207 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.04374 seconds
algorithm times (in seconds) - 10 values (min=0.00146493 10%=0.0014651 median=0.00146799 90%=0.04532 max=0.04532)
sum median effective algorithm bandwidth: 253.769 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.0802507 seconds
algorithm times (in seconds) - 10 values (min=0.00773843 10%=0.00773869 median=0.0077421 90%=0.0880916 max=0.0880916)
sum median effective algorithm bandwidth: 48.1173 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.0596682 seconds
algorithm times (in seconds) - 10 values (min=0.0075666 10%=0.00756916 median=0.00790055 90%=0.0673285 max=0.0673285)
sum median effective algorithm bandwidth: 47.1523 GB/s

@GPUcourseBOT (Collaborator)

Test results for PR #1046

Test logs
=== STATUS: Programs executed successfully: main_mandelbrot, main_sum ===
=== main_mandelbrot stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.31047 sec (CUDA: 0.120406 sec, OpenCL: 0.0375735 sec, Vulkan: 0.152432 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=3.32178 10%=3.32178 median=3.32178 90%=3.32178 max=3.32178)
Mandelbrot effective algorithm GFlops: 3.01044 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x4 threads
algorithm times (in seconds) - 10 values (min=1.02463 10%=1.02675 median=1.03189 90%=1.03494 max=1.03494)
Mandelbrot effective algorithm GFlops: 9.69094 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 0.0589184 seconds
algorithm times (in seconds) - 10 values (min=0.00427635 10%=0.00427861 median=0.00428206 90%=0.0632552 max=0.0632552)
Mandelbrot effective algorithm GFlops: 2335.33 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0.942446%
=== main_sum stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.289337 sec (CUDA: 0.124282 sec, OpenCL: 0.0400632 sec, Vulkan: 0.124931 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
PCI-E median bandwidth, gb/s8.39736
______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.0362301 10%=0.0365077 median=0.0370137 90%=0.0376394 max=0.0376394)
sum median effective algorithm bandwidth: 10.0646 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.0157405 10%=0.0161578 median=0.0166301 90%=0.0170072 max=0.0170072)
sum median effective algorithm bandwidth: 22.4009 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.0540067 seconds
algorithm times (in seconds) - 10 values (min=0.00275269 10%=0.00275309 median=0.00275517 90%=0.0568703 max=0.0568703)
sum median effective algorithm bandwidth: 135.211 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.0452495 seconds
algorithm times (in seconds) - 10 values (min=0.00146507 10%=0.00146518 median=0.00146651 90%=0.0468228 max=0.0468228)
sum median effective algorithm bandwidth: 254.024 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.0814909 seconds
algorithm times (in seconds) - 10 values (min=0.00789574 10%=0.00789604 median=0.00789708 90%=0.089494 max=0.089494)
sum median effective algorithm bandwidth: 47.173 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.0469491 seconds
algorithm times (in seconds) - 10 values (min=0.00740907 10%=0.00741016 median=0.0074135 90%=0.0544471 max=0.0544471)
sum median effective algorithm bandwidth: 50.2501 GB/s
