We currently have poor GPU thread blocking for at-points operators with all CEED_EVAL_NONE inputs and outputs.
The issue is in backends/cuda-gen/ceed-cuda-gen-operator-build.cpp:CeedOperatorBuildKernel_Cuda_gen:
// Lines 1228-1234
if (Q_1d == 0) {
  if (is_at_points) Q_1d = max_num_points;
  else CeedCallBackend(CeedOperatorGetNumQuadraturePoints(op, &Q_1d));
}
if (Q == 0) Q = Q_1d;
data->Q = Q;
data->Q_1d = Q_1d;
Rather than blocking by cell or by slices of cells, we use a 1D blocking strategy with Q_1d = max_num_points. When max_num_points is large, the resulting block can exceed the maximum threads per block or the shared memory limit, causing a launch-time failure.
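To make the failure mode concrete, here is a minimal sketch of the kind of check that could guard the block size. It is not libCEED code: `BlockLimits`, `BlockFits`, and `ChoosePointsPerBlock` are hypothetical names, and the assumption that shared memory use scales linearly with the thread count (`bytes_per_thread`) is a simplification. In a real backend the limits would presumably come from `cudaDeviceProp::maxThreadsPerBlock` and `cudaDeviceProp::sharedMemPerBlock`.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical device limits; in practice these could be filled in from
// cudaDeviceProp::maxThreadsPerBlock and cudaDeviceProp::sharedMemPerBlock.
struct BlockLimits {
  int         max_threads_per_block;
  std::size_t max_shared_bytes;
};

// Returns true if a 1D block of num_threads threads fits within the limits,
// assuming (as a simplification) bytes_per_thread of shared memory per thread.
inline bool BlockFits(int num_threads, std::size_t bytes_per_thread, const BlockLimits &limits) {
  if (num_threads <= 0 || num_threads > limits.max_threads_per_block) return false;
  return static_cast<std::size_t>(num_threads) * bytes_per_thread <= limits.max_shared_bytes;
}

// Picks the largest number of points per block that still fits, so that
// max_num_points could be processed in chunks instead of one oversized block.
inline int ChoosePointsPerBlock(int max_num_points, std::size_t bytes_per_thread, const BlockLimits &limits) {
  const std::size_t by_threads = static_cast<std::size_t>(std::max(limits.max_threads_per_block, 1));
  const std::size_t by_shared  = bytes_per_thread ? limits.max_shared_bytes / bytes_per_thread : by_threads;
  const std::size_t wanted     = static_cast<std::size_t>(std::max(max_num_points, 1));
  const std::size_t chosen     = std::min({wanted, by_threads, by_shared});
  return static_cast<int>(std::max<std::size_t>(chosen, 1));
}
```

With limits of 1024 threads and 48 KiB of shared memory per block, `BlockFits(5000, 8, limits)` is false, which corresponds to the launch failure described above, while `ChoosePointsPerBlock(5000, 8, limits)` would cap the block at 1024 points. Whether chunking by points or blocking by cell is the better fix is a separate design question.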