Collaborator
✅ Test results for PR #1046
Test logs
=== STATUS: Programs executed successfully: main_mandelbrot, main_sum ===
=== main_mandelbrot stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 8.49291 sec (CUDA: 0.109427 sec, OpenCL: 0.710506 sec, Vulkan: 7.67292 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=3.32708 10%=3.32708 median=3.32708 90%=3.32708 max=3.32708)
Mandelbrot effective algorithm GFlops: 3.00564 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x4 threads
algorithm times (in seconds) - 10 values (min=1.03053 10%=1.03241 median=1.03983 90%=1.04567 max=1.04567)
Mandelbrot effective algorithm GFlops: 9.61692 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 2.64049 seconds
algorithm times (in seconds) - 10 values (min=0.00427276 10%=0.00427696 median=0.00428057 90%=2.64484 max=2.64484)
Mandelbrot effective algorithm GFlops: 2336.14 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0.942446%
=== main_sum stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.335302 sec (CUDA: 0.124155 sec, OpenCL: 0.0378121 sec, Vulkan: 0.173276 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
PCI-E median bandwidth, gb/s9.10427
______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.0337624 10%=0.0339747 median=0.0347294 90%=0.0349411 max=0.0349411)
sum median effective algorithm bandwidth: 10.7266 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.0159719 10%=0.0160234 median=0.0161723 90%=0.0166088 max=0.0166088)
sum median effective algorithm bandwidth: 23.035 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.0958129 seconds
algorithm times (in seconds) - 10 values (min=0.00275287 10%=0.00275343 median=0.00275491 90%=0.0986844 max=0.0986844)
sum median effective algorithm bandwidth: 135.223 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.0561924 seconds
algorithm times (in seconds) - 10 values (min=0.00146443 10%=0.00146472 median=0.00146585 90%=0.0577716 max=0.0577716)
sum median effective algorithm bandwidth: 254.138 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.325482 seconds
algorithm times (in seconds) - 10 values (min=0.00880772 10%=0.00919009 median=0.00919206 90%=0.334179 max=0.334179)
sum median effective algorithm bandwidth: 40.5273 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.0863545 seconds
algorithm times (in seconds) - 10 values (min=0.008671 10%=0.00867141 median=0.00867457 90%=0.0951242 max=0.0951242)
sum median effective algorithm bandwidth: 42.9449 GB/s
Collaborator
✅ Test results for PR #1046
Test logs
=== STATUS: Programs executed successfully: main_mandelbrot, main_sum ===
=== main_mandelbrot stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.307631 sec (CUDA: 0.117623 sec, OpenCL: 0.0376242 sec, Vulkan: 0.152323 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=3.3262 10%=3.3262 median=3.3262 90%=3.3262 max=3.3262)
Mandelbrot effective algorithm GFlops: 3.00643 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x4 threads
algorithm times (in seconds) - 10 values (min=1.03202 10%=1.03632 median=1.03806 90%=1.0433 max=1.0433)
Mandelbrot effective algorithm GFlops: 9.63333 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 0.0598296 seconds
algorithm times (in seconds) - 10 values (min=0.00427458 10%=0.00427503 median=0.00428001 90%=0.0641682 max=0.0641682)
Mandelbrot effective algorithm GFlops: 2336.44 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0.942446%
=== main_sum stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.303042 sec (CUDA: 0.125126 sec, OpenCL: 0.0383245 sec, Vulkan: 0.139534 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
PCI-E median bandwidth, gb/s8.08702
______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.0366085 10%=0.0366685 median=0.0368502 90%=0.038256 max=0.038256)
sum median effective algorithm bandwidth: 10.1093 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.0159507 10%=0.0160822 median=0.0164099 90%=0.0169195 max=0.0169195)
sum median effective algorithm bandwidth: 22.7014 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.0494233 seconds
algorithm times (in seconds) - 10 values (min=0.00275441 10%=0.00275442 median=0.00275591 90%=0.0522888 max=0.0522888)
sum median effective algorithm bandwidth: 135.175 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.0420174 seconds
algorithm times (in seconds) - 10 values (min=0.00146484 10%=0.00146498 median=0.00146661 90%=0.0435887 max=0.0435887)
sum median effective algorithm bandwidth: 254.006 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.0811999 seconds
algorithm times (in seconds) - 10 values (min=0.00789586 10%=0.00789755 median=0.0082293 90%=0.0892047 max=0.0892047)
sum median effective algorithm bandwidth: 45.2686 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.0493637 seconds
algorithm times (in seconds) - 10 values (min=0.00807723 10%=0.00807878 median=0.00808285 90%=0.0575303 max=0.0575303)
sum median effective algorithm bandwidth: 46.0888 GB/s
Collaborator
✅ Test results for PR #1046
Test logs
=== STATUS: Programs executed successfully: main_mandelbrot, main_sum ===
=== main_mandelbrot stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.313181 sec (CUDA: 0.120893 sec, OpenCL: 0.0372148 sec, Vulkan: 0.155016 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=3.33717 10%=3.33717 median=3.33717 90%=3.33717 max=3.33717)
Mandelbrot effective algorithm GFlops: 2.99655 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x4 threads
algorithm times (in seconds) - 10 values (min=1.03331 10%=1.03481 median=1.03838 90%=1.05374 max=1.05374)
Mandelbrot effective algorithm GFlops: 9.63036 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 0.058324 seconds
algorithm times (in seconds) - 10 values (min=0.00427603 10%=0.00427617 median=0.00427915 90%=0.062661 max=0.062661)
Mandelbrot effective algorithm GFlops: 2336.91 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0.942446%
=== main_sum stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.317838 sec (CUDA: 0.124509 sec, OpenCL: 0.0381234 sec, Vulkan: 0.155145 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
PCI-E median bandwidth, gb/s8.1416
______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.0369011 10%=0.036915 median=0.0370287 90%=0.0381435 max=0.0381435)
sum median effective algorithm bandwidth: 10.0606 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.01611 10%=0.0165244 median=0.0166567 90%=0.0173817 max=0.0173817)
sum median effective algorithm bandwidth: 22.3651 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.051149 seconds
algorithm times (in seconds) - 10 values (min=0.00275406 10%=0.00275426 median=0.00275525 90%=0.0540118 max=0.0540118)
sum median effective algorithm bandwidth: 135.207 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.04374 seconds
algorithm times (in seconds) - 10 values (min=0.00146493 10%=0.0014651 median=0.00146799 90%=0.04532 max=0.04532)
sum median effective algorithm bandwidth: 253.769 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.0802507 seconds
algorithm times (in seconds) - 10 values (min=0.00773843 10%=0.00773869 median=0.0077421 90%=0.0880916 max=0.0880916)
sum median effective algorithm bandwidth: 48.1173 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.0596682 seconds
algorithm times (in seconds) - 10 values (min=0.0075666 10%=0.00756916 median=0.00790055 90%=0.0673285 max=0.0673285)
sum median effective algorithm bandwidth: 47.1523 GB/s
Collaborator
✅ Test results for PR #1046
Test logs
=== STATUS: Programs executed successfully: main_mandelbrot, main_sum ===
=== main_mandelbrot stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.31047 sec (CUDA: 0.120406 sec, OpenCL: 0.0375735 sec, Vulkan: 0.152432 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
______________________________________________________
Evaluating algorithm #1/3: CPU
algorithm times (in seconds) - 1 values (min=3.32178 10%=3.32178 median=3.32178 90%=3.32178 max=3.32178)
Mandelbrot effective algorithm GFlops: 3.01044 GFlops
saving image to 'mandelbrot CPU.bmp'...
CPU vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #2/3: CPU with OpenMP
OpenMP threads: x4 threads
algorithm times (in seconds) - 10 values (min=1.02463 10%=1.02675 median=1.03189 90%=1.03494 max=1.03494)
Mandelbrot effective algorithm GFlops: 9.69094 GFlops
saving image to 'mandelbrot CPU with OpenMP.bmp'...
CPU with OpenMP vs CPU average results difference: 0%
______________________________________________________
Evaluating algorithm #3/3: GPU
Kernels compilation done in 0.0589184 seconds
algorithm times (in seconds) - 10 values (min=0.00427635 10%=0.00427861 median=0.00428206 90%=0.0632552 max=0.0632552)
Mandelbrot effective algorithm GFlops: 2335.33 GFlops
saving image to 'mandelbrot GPU.bmp'...
GPU vs CPU average results difference: 0.942446%
=== main_sum stdout (exit code: -11 (segfault after execution)) ===
Found 1 GPUs in 0.289337 sec (CUDA: 0.124282 sec, OpenCL: 0.0400632 sec, Vulkan: 0.124931 sec)
Available devices:
Device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using device #0: API: CUDA+OpenCL+Vulkan. GPU. Tesla T4 (CUDA 12020). Free memory: 14822/14930 Mb.
Using OpenCL API...
PCI-E median bandwidth, gb/s8.39736
______________________________________________________
Evaluating algorithm #1/6: CPU
algorithm times (in seconds) - 10 values (min=0.0362301 10%=0.0365077 median=0.0370137 90%=0.0376394 max=0.0376394)
sum median effective algorithm bandwidth: 10.0646 GB/s
______________________________________________________
Evaluating algorithm #2/6: CPU with OpenMP
algorithm times (in seconds) - 10 values (min=0.0157405 10%=0.0161578 median=0.0166301 90%=0.0170072 max=0.0170072)
sum median effective algorithm bandwidth: 22.4009 GB/s
______________________________________________________
Evaluating algorithm #3/6: 01 atomicAdd from each workItem
Kernels compilation done in 0.0540067 seconds
algorithm times (in seconds) - 10 values (min=0.00275269 10%=0.00275309 median=0.00275517 90%=0.0568703 max=0.0568703)
sum median effective algorithm bandwidth: 135.211 GB/s
______________________________________________________
Evaluating algorithm #4/6: 02 atomicAdd but each workItem loads K values
Kernels compilation done in 0.0452495 seconds
algorithm times (in seconds) - 10 values (min=0.00146507 10%=0.00146518 median=0.00146651 90%=0.0468228 max=0.0468228)
sum median effective algorithm bandwidth: 254.024 GB/s
______________________________________________________
Evaluating algorithm #5/6: 03 local memory and atomicAdd from master thread
Kernels compilation done in 0.0814909 seconds
algorithm times (in seconds) - 10 values (min=0.00789574 10%=0.00789604 median=0.00789708 90%=0.089494 max=0.089494)
sum median effective algorithm bandwidth: 47.173 GB/s
______________________________________________________
Evaluating algorithm #6/6: 04 local reduction
Kernels compilation done in 0.0469491 seconds
algorithm times (in seconds) - 10 values (min=0.00740907 10%=0.00741016 median=0.0074135 90%=0.0544471 max=0.0544471)
sum median effective algorithm bandwidth: 50.2501 GB/s
Local output
*P.S. It seems that atomics are well optimized on the MacBook, so AtomicAdd performs even better than algorithm 4.
GitHub CI output
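For anyone cross-checking the CI numbers, here is a minimal sketch of how the "sum median effective algorithm bandwidth" figures relate to the median times. It rests on an assumption inferred from the logs, not confirmed by the source code: the input is 100,000,000 values of 4 bytes each, and "GB" means 2^30 bytes. The function name `effective_bandwidth_gbps` is hypothetical. Under that assumption, the medians from the first CI run give values that agree with the logged bandwidths to within rounding.

```python
# Assumption (inferred from the CI logs, not from the benchmark source):
# the summed array holds 100,000,000 4-byte values, and "GB" is 2**30 bytes.
N_VALUES = 100_000_000
BYTES_PER_VALUE = 4
BYTES_PER_GB = 2**30


def effective_bandwidth_gbps(median_seconds: float) -> float:
    """Bytes read once from the input, divided by the median run time."""
    return N_VALUES * BYTES_PER_VALUE / BYTES_PER_GB / median_seconds


# Median times (seconds) copied from the first CI run of main_sum.
medians = {
    "CPU": 0.0347294,
    "01 atomicAdd from each workItem": 0.00275491,
    "02 atomicAdd but each workItem loads K values": 0.00146585,
    "04 local reduction": 0.00867457,
}

for name, seconds in medians.items():
    print(f"{name}: {effective_bandwidth_gbps(seconds):.3f} GB/s")
```

Because compilation happens during the first timed iteration, the 90% and max columns include the kernel build time; the median is the figure the bandwidth is derived from.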