Is there an ablation that constructs the subnetwork from the first L layers?

Dear authors, I've read your work and find it really interesting. I wonder what happens if the subnetwork is taken to be the first L layers, since, as you note, the outputs of different layers are highly similar. Would this kind of guidance still be effective at improving the model's performance?