In some early versions, the permutation was prepared on the CPU side. I recently moved it to the GPU side, but throughput dropped by 30-50%. The cause is not clear yet and still needs to be investigated.
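One way to start narrowing this down is to time the permutation step in isolation on each side. The sketch below is only a generic timing harness, not the project's code: `time_step`, `cpu_permutation`, and the `sync` parameter are hypothetical names introduced for illustration. The point it shows is that GPU work is usually launched asynchronously, so the clock should only be read after a blocking synchronization, otherwise the GPU variant can look nearly free while its real cost shows up elsewhere.

```python
import random
import time
from statistics import median


def time_step(step, sync=None, warmup=3, repeats=20):
    """Return the median wall-clock seconds for one call of `step`.

    `sync` is an optional callable (an assumption of this sketch) that
    blocks until all queued device work has finished. Without it,
    asynchronous GPU launches would be measured as almost instantaneous.
    """
    # Warm-up runs absorb one-time costs such as allocation or compilation.
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait for the device before reading the clock
        samples.append(time.perf_counter() - start)
    return median(samples)


def cpu_permutation(n=10_000):
    # Stand-in for the CPU-side preparation: a random permutation of range(n).
    perm = list(range(n))
    random.shuffle(perm)
    return perm


if __name__ == "__main__":
    print(f"CPU permutation: {time_step(cpu_permutation) * 1e3:.3f} ms")
```

For the GPU variant, the same helper could be called with the GPU-side preparation as `step` and the framework's blocking call as `sync` (for example `torch.cuda.synchronize`, assuming the project uses PyTorch). Comparing the two medians would show whether the permutation itself is slower on the GPU or whether the drop comes from somewhere else, such as extra synchronization or transfers.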