We should run a dead code elimination (DCE) pass after register allocation and shuffle truncation. We currently generate quite a few unnecessary registers, and a DCE pass at that point would remove them.
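
As a rough illustration of what such a pass could look like, here is a minimal sketch of DCE over a straight-line, register-based instruction list. The `Instr` type, the `side_effect` flag, and the opcode names are all hypothetical and not taken from our IR; the real pass would also need to handle control flow.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: one optional destination register, N sources."""
    op: str
    dst: Optional[str]
    srcs: Tuple[str, ...] = ()
    side_effect: bool = False  # e.g. stores; these are never removed


def eliminate_dead_code(instrs: List[Instr], live_out: Set[str]) -> List[Instr]:
    """Drop instructions whose result is never read (straight-line code only)."""
    live = set(live_out)
    kept: List[Instr] = []
    # Walk backwards so we know which registers are read later on.
    for ins in reversed(instrs):
        needed = ins.side_effect or (ins.dst is not None and ins.dst in live)
        if not needed:
            continue  # result is unused and there is no side effect: dead
        if ins.dst is not None:
            live.discard(ins.dst)  # this definition satisfies the later reads
        live.update(ins.srcs)      # its operands become live above this point
        kept.append(ins)
    kept.reverse()
    return kept


def registers_used(instrs: List[Instr]) -> Set[str]:
    """Collect every register that is written or read."""
    regs: Set[str] = set()
    for ins in instrs:
        if ins.dst is not None:
            regs.add(ins.dst)
        regs.update(ins.srcs)
    return regs


program = [
    Instr("load", "r0"),
    Instr("load", "r1"),
    Instr("shuffle", "r2", ("r0",)),        # r2 is never read
    Instr("add", "r3", ("r0", "r1")),
    Instr("mov", "r4", ("r3",)),            # r4 is never read
    Instr("store", None, ("r3",), side_effect=True),
]

optimized = eliminate_dead_code(program, live_out=set())
```

In this toy input, the `shuffle` and `mov` instructions are dropped, which frees `r2` and `r4`. Walking the list backwards means a single pass also removes chains of dead instructions, since an operand only becomes live once a kept instruction reads it.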