Description
Garbage-collection overhead is high because many temporary arrays are allocated in two places: the kernel launch phase (size arrays used to configure GPU resources) and the kernel execution phase (temporary arrays used to avoid control-flow divergence).
Consider creating a cache that stores all of these temporary arrays together before the rhs! iteration runs, so they are reused rather than reallocated.
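A minimal sketch of what such a cache could look like, in plain Julia on CPU arrays so it runs without a GPU. Everything here is an assumption made for illustration: the `TmpCache` type, its field names, `create_tmp_cache`, and the toy body of `rhs!` are hypothetical and do not reflect the package's actual data structures or kernels. The only point is the pattern: allocate every scratch buffer once, then have `rhs!` write into those buffers in place.

```julia
# Hypothetical cache holding the temporary arrays that rhs! would otherwise
# allocate on the fly. Field names and shapes are illustrative assumptions.
struct TmpCache{A<:AbstractArray{Float64,3}}
    flux_tmp::A               # scratch buffer for intermediate values
    mask_tmp::A               # scratch buffer used instead of branching
    kernel_size::Vector{Int}  # reusable size array for launch configuration
end

# Allocate all buffers once, before the time-stepping loop starts.
function create_tmp_cache(nvars::Int, nnodes::Int, nelements::Int)
    return TmpCache(zeros(Float64, nvars, nnodes, nelements),
                    zeros(Float64, nvars, nnodes, nelements),
                    zeros(Int, 3))
end

# Toy stand-in for the real rhs!: it only touches preallocated buffers.
function rhs!(du, u, cache::TmpCache)
    cache.kernel_size .= size(u)                 # reuse the size array in place
    @. cache.mask_tmp = ifelse(u > 0, 1.0, 0.0)  # mask replaces an if/else branch
    @. cache.flux_tmp = cache.mask_tmp * u
    @. du = -cache.flux_tmp
    return nothing
end

u = rand(3, 4, 10) .- 0.5
du = similar(u)
cache = create_tmp_cache(size(u)...)

rhs!(du, u, cache)                     # first call includes compilation
bytes = @allocated rhs!(du, u, cache)  # bytes allocated by a later call
println("bytes allocated per rhs! call: ", bytes)
```

With GPU arrays the same idea would apply, with the buffers created on the device once and passed to each kernel launch. Measuring with `@allocated` (or a profiler) before and after the change is one way to confirm that the per-iteration allocations are actually gone.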