Dear authors, I've read your work and find it really interesting. I wonder what would happen if the subnetwork were taken to be the first L layers, since, as you note, the outputs of different layers are highly similar. Would this kind of guidance still improve the model's performance?