Commit d2ce4a2

Author: pevnak
finished with kernel abstractions
1 parent c2ab77e commit d2ce4a2

1 file changed

+9
-7
lines changed

docs/src/lectures/lecture_11/lecture.md

Lines changed: 9 additions & 7 deletions
@@ -498,6 +498,8 @@ For the sake of completness, we benchmark the speed of the kernel for comparison
 @benchmark Metal.@sync reduce_singlethread(backend, 64)(+, cx, cb, ndrange=(1,))
 ```
 
+:::
+
 We can use **atomic** operations to mark that the reduction operation has to be performed exclusively. This have the advantage that we can do some operation while fetching the data, but it is still a very bad idea.
 
 ::: tabs
@@ -893,13 +895,13 @@ Let's now compare different versions and tabulate the results
 
 | kernel version | min time |
 |:-----------------------------------------------------|:-----------:|
-| single thread | 56.399 ms |
-| multiple threads with atomic reduction | 1.772 ms |
-| parallel reduction | 33.381 μs |
-| parallel reduction with local mem | 34.261 μs |
-| parallel reduction with warps | 26.890 μs |
-| default sum on GPU | 31.960 μs |
-| default sum on CPU | 82.391 μs |
+| single thread | 71.780 ms |
+| multiple threads with atomic reduction | 2.197 ms |
+| parallel reduction | 29.300 μs |
+| parallel reduction with local mem | 26.764 μs |
+| parallel reduction with warps | 25.063 μs |
+| default sum on GPU | 47.090 μs |
+| default sum on CPU | 165.697 μs |
 
 
 What we have missed to optimize:
