Commit fee4b21

Always provide occupancy API methods with CUfunction for jitify2::KernelData
Jitify2's `jitify2::KernelData::function` returns a `jitify2::CudaFunction`, which is either a `CUkernel` or a `CUfunction` depending on whether `JITIFY_USE_CONTEXT_INDEPENDENT_LOADING` is being used. If `JITIFY_USE_CONTEXT_INDEPENDENT_LOADING` is not defined in advance, a `CUkernel` is returned for CUDA >= 12; otherwise a `CUfunction` is returned.

In our case, given a `jitify2::KernelData` for an agent function or condition, we want to pass the `CUfunction` to `cuOccupancyMaxPotentialBlockSize` to get the block size for the kernel launch. Jitify2 includes a method for this, but it configures the kernel for a grid-stride loop using the full device, whereas we want to launch the minimum grid size possible that includes at least one thread per agent (not using a grid-stride loop).

In the upgrade from Jitify 1 to Jitify 2, we did not update our use of this to reflect the change in type. We missed this because passing a `CUkernel` to `cuOccupancyMaxPotentialBlockSize` works as expected on recent CUDA drivers (i.e. R575 and R580). However, on systems with older CUDA drivers (R550), such as Google Colab and TUoS Stanage, uncaught CUDA errors within `cuOccupancyMaxPotentialBlockSize` were leading to division-by-zero floating point exceptions. This is because `cuOccupancyMaxPotentialBlockSize` expects a `CUfunction`, but we were providing a `CUkernel` cast to a `CUfunction`. The documentation says that `cuLibraryGetKernel()` should be used to query the kernel handle:

> the API can also be used with context-less kernel CUkernel by querying the handle using cuLibraryGetKernel() and then passing it to the API by casting to CUfunction. Here, the context to use for calculations will be the current context.

This commit adds an unnamed-namespace function which, given a `jitify2::KernelData`, returns a `CUfunction` regardless of the value of `JITIFY_USE_CONTEXT_INDEPENDENT_LOADING`. It uses `if constexpr` in a C++20 templated lambda (to avoid both sides of a non-templated `if constexpr` needing to compile at the same time).
1 parent a941005 commit fee4b21

1 file changed

Lines changed: 29 additions & 2 deletions

File tree

src/flamegpu/simulation/CUDASimulation.cu

@@ -57,6 +57,33 @@ namespace {
         return std::unique_ptr<detail::Timer>(new detail::SteadyClockTimer());
     }
 }
+
+/**
+ * Given a jitify2::KernelData, gets the CUfunction for the current CUDA context, regardless of whether context-independent loading is being used
+ *
+ * Jitify 2 may (or may not) be using context-independent loading, leading to `jitify2::KernelData::function` returning a `CUkernel` or a `CUfunction`.
+ * The Driver API method `cuOccupancyMaxPotentialBlockSize` expects a CUfunction, but is documented as: "the API can also be used with context-less kernel CUkernel by querying the handle using cuLibraryGetKernel() and then passing it to the API by casting to CUfunction. Here, the context to use for calculations will be the current context."
+ * For older drivers, such as R550, not calling `cuLibraryGetKernel` first results in a runtime error. Newer drivers, such as R580, appear to implicitly make this call and proceed normally.
+ *
+ * This method always returns the CUfunction for the jitify2::KernelData that is valid in the current context.
+ *
+ * @param instance the jitify2::KernelData instance
+ * @return the CUfunction for the jitify2::KernelData in the current CUDA context
+ */
+CUfunction cuFunctionFromJitify2KernelData(const jitify2::KernelData& instance) {
+    // Use a templated lambda so that both sides of the if constexpr do not need to compile (but must be syntactically valid)
+    auto handler = []<typename T>(T cu_kernel_or_func) -> CUfunction {
+        if constexpr (std::is_same_v<T, CUkernel>) {
+            CUfunction cu_func = NULL;
+            gpuErrchkDriverAPI(cuKernelGetFunction(&cu_func, cu_kernel_or_func));
+            return cu_func;
+        } else {
+            // If not a CUkernel, this must be a CUfunction
+            return static_cast<CUfunction>(cu_kernel_or_func);
+        }
+    };
+    return handler(instance.function());
+}
 }  // anonymous namespace

 CUDASimulation::CUDASimulation(const ModelDescription& _model, int argc, const char** argv, bool _isSWIG)

@@ -755,7 +782,7 @@ void CUDASimulation::stepLayer(const std::shared_ptr<LayerData>& layer, const un
     // get instantiation
     const jitify2::KernelData& instance = cuda_agent.getRTCInstantiation(func_condition_identifier);
     // calculate the grid block size for main agent function
-    CUfunction cu_func = (CUfunction)instance.function();
+    CUfunction cu_func = cuFunctionFromJitify2KernelData(instance);
     gpuErrchkDriverAPI(cuOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, cu_func, 0, 0, state_list_size));
     //! Round up according to CUDAAgent state list size
     gridSize = (state_list_size + blockSize - 1) / blockSize;

@@ -983,7 +1010,7 @@ void CUDASimulation::stepLayer(const std::shared_ptr<LayerData>& layer, const un
     // get instantiation
     const jitify2::KernelData& instance = cuda_agent.getRTCInstantiation(func_name);
     // calculate the grid block size for main agent function
-    CUfunction cu_func = (CUfunction)instance.function();
+    CUfunction cu_func = cuFunctionFromJitify2KernelData(instance);
     gpuErrchkDriverAPI(cuOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, cu_func, 0, 0, state_list_size));
     //! Round up according to CUDAAgent state list size
     gridSize = (state_list_size + blockSize - 1) / blockSize;
