A CUDA implementation of the Canny edge detector in C/C++.
You can build the project with CMake; I have included a `CMakeLists.txt` for this.
I have also included a main file that runs the code.
These are the parameters to pass on the command line:

`./main argv[1] argv[2] argv[3] argv[4] argv[5] argv[6]`

where:

- `argv[1]`: input image path
- `argv[2]`: Sobel kernel size
- `argv[3]`: low threshold for the hysteresis step
- `argv[4]`: high threshold for the hysteresis step
- `argv[5]`: gradient norm: `0` uses the L2 norm, `1` uses an approximation based on absolute values
- `argv[6]`: mode: `0` runs the OpenCV Canny on the CPU, `1` runs the custom GPU version (my implementation), `2` runs both
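
To make the parameter order concrete, here is a minimal sketch of how the six arguments could be read and validated. The `CannyOptions` struct and the `parseOptions` function are hypothetical names used only for illustration, and the threshold type (`double`) is an assumption; the actual main file may handle its arguments differently.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical container for the six command-line parameters.
struct CannyOptions {
    std::string imagePath;   // argv[1]
    int sobelKernelSize;     // argv[2]
    double lowThreshold;     // argv[3] (type is an assumption)
    double highThreshold;    // argv[4] (type is an assumption)
    bool useL2Norm;          // argv[5]: 0 -> L2 norm, 1 -> abs approximation
    int mode;                // argv[6]: 0 CPU, 1 custom GPU, 2 both
};

// Reads the arguments in the order listed above and rejects
// values outside the documented ranges.
CannyOptions parseOptions(int argc, const char* const* argv) {
    if (argc != 7) {
        throw std::invalid_argument("expected 6 arguments after the program name");
    }
    CannyOptions opts;
    opts.imagePath = argv[1];
    opts.sobelKernelSize = std::stoi(argv[2]);
    opts.lowThreshold = std::stod(argv[3]);
    opts.highThreshold = std::stod(argv[4]);
    const int normFlag = std::stoi(argv[5]);
    opts.mode = std::stoi(argv[6]);
    if (normFlag != 0 && normFlag != 1) {
        throw std::invalid_argument("argv[5] must be 0 or 1");
    }
    if (opts.mode < 0 || opts.mode > 2) {
        throw std::invalid_argument("argv[6] must be 0, 1 or 2");
    }
    opts.useL2Norm = (normFlag == 0);  // 0 means the L2 norm is active
    return opts;
}
```

Calling `parseOptions(argc, argv)` at the top of `main` would give the rest of the program one validated object instead of raw strings.
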
Execution times are also measured while the algorithm runs and are reported in ms.
Example outputs of my GPU Canny version:
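
As an illustration of timing in milliseconds, the sketch below wraps any callable with a host-side clock. The helper name `elapsedMs` is hypothetical, and the source does not say which timer the project actually uses, so treat this as one possible approach.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the wall-clock time it took, in milliseconds.
template <typename F>
double elapsedMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

When timing GPU code this way, keep in mind that kernel launches are asynchronous: the callable would need to synchronize (for example with `cudaDeviceSynchronize()`) before returning, or the measurement could use CUDA events instead.
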
| Original | Canny GPU Output |
|---|---|
| ![]() | ![]() |
| Original | Canny GPU Output |
|---|---|
| ![]() | ![]() |
| Original | Canny GPU Output |
|---|---|
| ![]() | ![]() |
N.B.: the results vary with the threshold values chosen for the hysteresis step.
I tried several kernel launch configurations; a thread block size of 16x16 gave the best results.
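
To show why the thresholds matter, here is a minimal sketch of the two per-pixel steps they interact with: computing the gradient magnitude (with either the L2 norm or the absolute-value approximation selected on the command line) and classifying the result against the low and high thresholds. The names `gradientMagnitude`, `classify` and `EdgeClass` are hypothetical, the approximation is assumed to be `|gx| + |gy|`, and the use of `>=` at the boundaries is also an assumption; the project's kernels may differ.

```cpp
#include <cmath>

// Result of the double-threshold test for one pixel.
enum class EdgeClass { None, Weak, Strong };

// Gradient magnitude from the horizontal (gx) and vertical (gy) Sobel responses.
// useL2Norm == true  -> sqrt(gx^2 + gy^2)
// useL2Norm == false -> |gx| + |gy| (assumed form of the abs approximation)
float gradientMagnitude(float gx, float gy, bool useL2Norm) {
    return useL2Norm ? std::sqrt(gx * gx + gy * gy)
                     : std::fabs(gx) + std::fabs(gy);
}

// Strong pixels are edges outright; weak pixels survive hysteresis only
// if they are connected to a strong pixel; the rest are discarded.
EdgeClass classify(float magnitude, float low, float high) {
    if (magnitude >= high) return EdgeClass::Strong;
    if (magnitude >= low) return EdgeClass::Weak;
    return EdgeClass::None;
}
```

For example, with `gx = 3` and `gy = 4` the L2 norm gives 5 while the approximation gives 7, so the same threshold pair can classify a pixel differently depending on the chosen norm.
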
| Kernel Configuration |
|---|
| ![]() |
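
As a worked example of what a 16x16 thread block implies for the launch configuration, the sketch below computes how many blocks are needed to cover an image with one thread per pixel. It is plain host-side C++; `Dim2`, `ceilDiv` and `gridFor` are hypothetical names, and the one-thread-per-pixel mapping is an assumption about the kernels.

```cpp
// Minimal stand-in for a 2D launch dimension (CUDA's dim3 plays this role).
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer division rounded up, so partial tiles still get a block.
unsigned ceilDiv(unsigned n, unsigned d) {
    return (n + d - 1) / d;
}

// Number of blocks needed to cover a width x height image,
// assuming one thread per pixel and a 16x16 block by default.
Dim2 gridFor(unsigned width, unsigned height, Dim2 block = {16, 16}) {
    return {ceilDiv(width, block.x), ceilDiv(height, block.y)};
}
```

For a 1280x720 image this gives an 80x45 grid. For 1920x1080 it gives 120x68, where the last row of blocks extends past the image, so each kernel would need a bounds check such as `if (x < width && y < height)` before touching a pixel.
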
This pie chart shows the execution times of the device kernels and the memcpy data transfers for a 720p image.
| Kernel execution time |
|---|
| ![]() |
This is a comparison between the OpenCV CPU version and my parallel GPU version.
| CPU vs. GPU |
|---|
| ![]() |
As the graph shows, the two versions perform similarly on low-resolution images. As the image resolution increases, the parallel version performs significantly better.








