- Use **C++17**. Prefer `constexpr`, `auto`, and `enum class`.
- Use range-based for loops on the host side.
- Use smart pointers (`std::unique_ptr`, `std::shared_ptr`) on the host side.
- Dynamic memory allocation on the device side is strongly discouraged.
- Never pass smart pointers as parameters to `__global__` functions.
- **Memory Management**:
  - Use RAII on the host side.
  - Avoid all dynamic memory allocation on the device side.
  - Understand that reference-counting smart pointers cannot be kept consistent between
    host and device, and that kernels run asynchronously from host code.
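A minimal sketch of these memory rules (the kernel, buffer type, and function names below are illustrative, not project conventions; error checks are omitted for brevity): device memory is owned on the host through a `std::unique_ptr` with a custom deleter, and only the raw pointer ever crosses into the kernel.

```cuda
#include <cuda_runtime.h>
#include <memory>

// Hypothetical kernel used for illustration only.
__global__ void scale_kernel(float* __restrict__ data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// RAII deleter so device memory is freed automatically on the host side.
struct CudaFree {
    void operator()(float* p) const { cudaFree(p); }
};
using device_buffer = std::unique_ptr<float, CudaFree>;

device_buffer make_device_buffer(int n) {
    float* raw = nullptr;
    cudaMalloc(&raw, n * sizeof(float));
    return device_buffer(raw);
}

void scale_on_device(device_buffer& buf, int n, float factor) {
    // Pass the raw pointer, never the smart pointer itself: the kernel runs
    // asynchronously and the device cannot update host-side reference counts.
    scale_kernel<<<(n + 255) / 256, 256>>>(buf.get(), n, factor);
}
```

The host keeps exclusive ownership; the device only ever sees a borrowed raw pointer whose lifetime the host guarantees past kernel completion.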
- **Error Handling**:
  - Use exceptions in host C++ code.
  - In CUDA, check and propagate error codes using helper utilities/macros. Never ignore errors.
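One common shape for such a helper macro (the macro name and the error-to-exception policy here are assumptions for illustration, not an established project convention): wrap every runtime call and convert failures into a host-side exception.

```cuda
#include <cuda_runtime.h>
#include <sstream>
#include <stdexcept>

// Check a CUDA runtime call and propagate failures as a host exception.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            std::ostringstream oss_;                                   \
            oss_ << __FILE__ << ":" << __LINE__ << " CUDA error: "     \
                 << cudaGetErrorString(err_);                          \
            throw std::runtime_error(oss_.str());                      \
        }                                                              \
    } while (0)

// Usage: every API call is wrapped, so no error is silently ignored.
void copy_to_device(float* dst, const float* src, int n) {
    CUDA_CHECK(cudaMemcpy(dst, src, n * sizeof(float),
                          cudaMemcpyHostToDevice));
}
```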
- **Namespaces**: Group related functions/classes logically. Avoid polluting the global namespace.
- **Headers**:
  - Keep headers minimal; forward declare instead of including heavy dependencies.
    However, small helper functions declared `static inline __device__` and used several times
    should be included from a header instead of copying the code.
  - Guard each header with `#pragma once`. Use `#ifndef`/`#endif` guards only in special
    circumstances.
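A header following these rules might look like the sketch below (the file name and helper function are illustrative assumptions): `#pragma once` as the guard, and a small `static inline __device__` helper that callers include rather than copy.

```cuda
// math_helpers.cuh (illustrative name)
#pragma once

// Small, frequently used device helper: include this header from each .cu
// file instead of duplicating the function body. `static inline` gives each
// translation unit its own copy, so there are no link-time symbol clashes.
static inline __device__ float clamp01(float x) {
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}
```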
- **Style**:
  - `snake_case` for variables and functions.
  - `CamelCase` for class and struct names.

## CUDA Guidelines

- Separate **kernels** (`__global__` functions) from host orchestration code, but keep
  them in the same module as the host code that starts them.
- Name kernels descriptively, e.g. `compute_gradient_kernel`.
- Document assumptions about:
  - Thread/block layout
  - Shared memory usage
  - Synchronization requirements
- Use `__restrict__` and `constexpr` where appropriate for performance and clarity.
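Combining the naming, documentation, `__restrict__`, and `constexpr` guidelines, a kernel header comment might look like this sketch (the kernel itself is hypothetical):

```cuda
// compute_gradient_kernel
//
// Assumptions:
//   - Thread/block layout: 1D grid, 1D blocks; one thread per output element.
//   - Shared memory: none used.
//   - Synchronization: none required; every thread writes a distinct element.
__global__ void compute_gradient_kernel(const float* __restrict__ input,
                                        float* __restrict__ gradient,
                                        int n) {
    constexpr float kStep = 1.0f;  // grid spacing, known at compile time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        // Central difference; __restrict__ promises the compiler the two
        // pointers do not alias, enabling better load/store scheduling.
        gradient[i] = (input[i + 1] - input[i - 1]) / (2.0f * kStep);
    }
}
```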
- Avoid writing kernels that spill into local memory; keep variables in registers and shared
  memory as much as possible. To achieve this, prefer small, focused kernels over complex ones.
- To structure larger kernels, use `__device__` functions declared
  `static inline __device__`. Ensure that callers and device functions are compiled together.
- Avoid dynamic parallelism.
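To illustrate with a hypothetical reduction kernel (all names below are assumptions): the per-element and per-block work lives in `static inline __device__` helpers defined in the same translation unit as the kernel, so caller and callees are compiled together, and intermediates stay in registers and shared memory rather than spilling to local memory.

```cuda
#include <cuda_runtime.h>

// Helpers compiled in the same .cu file as the kernel that calls them.
static inline __device__ float squared_error(float a, float b) {
    float d = a - b;  // lives in a register
    return d * d;
}

// Tree reduction over one block; assumes blockDim.x is a power of two.
static inline __device__ float block_reduce_sum(float v, float* shared) {
    shared[threadIdx.x] = v;
    __syncthreads();
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) shared[threadIdx.x] += shared[threadIdx.x + s];
        __syncthreads();
    }
    return shared[0];
}

// One focused kernel, structured from small device functions.
__global__ void sum_squared_error_kernel(const float* __restrict__ a,
                                         const float* __restrict__ b,
                                         float* __restrict__ out, int n) {
    extern __shared__ float shared[];  // blockDim.x floats
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? squared_error(a[i], b[i]) : 0.0f;
    float block_sum = block_reduce_sum(v, shared);
    if (threadIdx.x == 0) atomicAdd(out, block_sum);
}
```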
- Always validate CUDA API calls.

---