Commit 5d45a7a

author
pevnak
committed
finished with kernel abstractions
1 parent d2ce4a2 commit 5d45a7a

1 file changed

+3
-0
lines changed

docs/src/lectures/lecture_11/lecture.md

Lines changed: 3 additions & 0 deletions
@@ -791,6 +791,9 @@ sum(x)
 end
 end
 ```
+
+:::
+
The performance improvement is negligible, but that is because we have a relatively new GPU with plenty of global memory bandwidth. On older or lower-end GPUs, using shared memory would be more valuable. At least we are not modifying the original array.
If we inspect the above kernel in the profiler, we can read that it uses 32 registers per thread. But if the SM has 16384 registers, a block of size 1024 would need 32 × 1024 = 32768 registers, twice as many as are available, so it will have to share registers, which might lead to poor utilization. A block of size 512 needs exactly 32 × 512 = 16384 registers. Changing the blocksize to 512 improves the throughput a bit, as can be seen below.
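
To make the register arithmetic concrete, here is a minimal sketch in plain Julia. It does not launch or benchmark anything. The helper names `registers_per_block` and `blocks_fitting` are hypothetical and are not part of CUDA.jl or KernelAbstractions.jl. The model is deliberately simplified and assumes that registers are the only limiting resource. Real hardware allocates registers in granules and also limits threads, blocks, and shared memory per SM.

```julia
# Hypothetical helper (not a CUDA.jl function): registers one block would need
registers_per_block(regs_per_thread, blocksize) = regs_per_thread * blocksize

# Hypothetical helper: how many whole blocks fit into the SM's register file
blocks_fitting(regs_per_sm, regs_per_thread, blocksize) =
    regs_per_sm ÷ registers_per_block(regs_per_thread, blocksize)

regs_per_sm = 16384      # register file size assumed in the text
regs_per_thread = 32     # value read from the profiler

for blocksize in (1024, 512, 256)
    needed = registers_per_block(regs_per_thread, blocksize)
    fit = blocks_fitting(regs_per_sm, regs_per_thread, blocksize)
    println("blocksize = $blocksize: needs $needed registers, $fit block(s) fit")
end
```

Under these assumptions, a block of 1024 threads does not fit into the register file at all, a block of 512 threads fills it exactly, and two blocks of 256 threads fit side by side.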
