Hi Jed,
What do you mean by interleaving threads? My understanding is that we make the threads run sequentially at a fine-grained level, which reduces cache access conflicts. Here 'N' is the number of threads and 'b' is the block size. We run the threads in the order 0, 1, ..., N-1, or in another mapped order j(i). Please correct me if I am wrong! Also, what do we use to measure the performance of the program: compute time, cache hit rate, or something else?