Currently we set the cache configuration to 48 KB L1 and 16 KB shared memory (Fermi). However, this isn't optimal for all kernels, and the autotuner can actually switch the default cache configuration if it requests more than 16 KB of shared memory per SM.
The solution is to expand the TuneParam class to include a member variable of type enum cudaFuncCache, which will be tuned per kernel. This shouldn't be too much work, so I'm adding it to the 0.4.1 milestone.
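A rough sketch of what this could look like, not the actual implementation. The `TuneParam` layout, the field names `block_x`, `shared_bytes` and `cache_config`, and the helpers `candidateCacheConfigs` and `applyCacheConfig` are all hypothetical; only `cudaFuncCache`, its enumerators, and `cudaFuncSetCacheConfig` come from the CUDA runtime. The stand-in enum is there only so the sketch compiles without the CUDA toolkit, and the sizes assume the Fermi 16 KB / 48 KB split with a per-block shared memory request.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef __CUDACC__
#include <cuda_runtime.h>  // real cudaFuncCache and cudaFuncSetCacheConfig
#else
// Stand-in for builds without the CUDA toolkit (assumption: mirrors the runtime enum).
enum cudaFuncCache {
  cudaFuncCachePreferNone = 0,
  cudaFuncCachePreferShared = 1,
  cudaFuncCachePreferL1 = 2,
  cudaFuncCachePreferEqual = 3
};
#endif

// Hypothetical, simplified TuneParam; the real class carries more launch state.
struct TuneParam {
  unsigned int block_x;        // threads per block
  std::size_t shared_bytes;    // shared memory requested per block
  cudaFuncCache cache_config;  // new member: cache preference tuned per kernel
};

// Fermi shared memory sizes for the two cache splits.
const std::size_t kSmallShared = 16 * 1024;  // 48 KB L1 / 16 KB shared
const std::size_t kLargeShared = 48 * 1024;  // 16 KB L1 / 48 KB shared

// Cache configurations worth timing for a given per-block shared memory request.
std::vector<cudaFuncCache> candidateCacheConfigs(std::size_t shared_bytes) {
  assert(shared_bytes <= kLargeShared);
  std::vector<cudaFuncCache> configs;
  // The large-L1 split only fits requests of at most 16 KB.
  if (shared_bytes <= kSmallShared) configs.push_back(cudaFuncCachePreferL1);
  // The large-shared split fits every valid request.
  configs.push_back(cudaFuncCachePreferShared);
  return configs;
}

#ifdef __CUDACC__
// Apply the tuned preference to one kernel before launching it.
inline cudaError_t applyCacheConfig(const void* kernel, const TuneParam& param) {
  return cudaFuncSetCacheConfig(kernel, param.cache_config);
}
#endif
```

With something like this, the tuner would time each candidate returned for a kernel's shared memory request and store the winner in `cache_config`, so that kernels needing more than 16 KB get the large-shared split explicitly rather than as a side effect.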