What is the change
Add an extra full unrolling pass to improve performance on cores with branch predictors. It helps produce simplified loops, which can then be processed by SROA (Scalar Replacement of Aggregates), allowing further simplification.
This change is already present in the codebase in the form of a performance patch file.
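As a rough illustration of why full unrolling can unlock SROA, consider the sketch below. It is not taken from the patch or from the internal benchmarks; the function names (`sum_rolled`, `sum_unrolled`, `check_equivalence`) and the loop itself are hypothetical. While the loop is intact, the local array is indexed by a variable, so it generally has to be treated as an aggregate in memory. Once every iteration is spelled out, each access uses a constant index and the array can be split into independent scalars. The second function is a hand-written approximation of that result, not actual compiler output.

```cpp
#include <cassert>

// Hypothetical example: a small loop with a constant trip count writing to a
// local array. The variable index 'i' keeps 'acc' as an in-memory aggregate
// for as long as the loop is not unrolled.
int sum_rolled(const int *in) {
    int acc[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; ++i)
        acc[i] = in[i] * 2;
    return acc[0] + acc[1] + acc[2] + acc[3];
}

// Hand-written approximation of the same function after full unrolling
// followed by SROA: all indices are constant, so the array has been replaced
// by four scalars and there is no loop branch left.
int sum_unrolled(const int *in) {
    int a0 = in[0] * 2;
    int a1 = in[1] * 2;
    int a2 = in[2] * 2;
    int a3 = in[3] * 2;
    return a0 + a1 + a2 + a3;
}

// Sanity check that both forms compute the same value.
void check_equivalence() {
    const int in[4] = {1, 2, 3, 4};
    assert(sum_rolled(in) == sum_unrolled(in));
}
```

Whether a given compiler pipeline already reaches the second form without the extra pass depends on pass ordering and on how the loop appears at that point, which is the gap this change targets.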
Why this change cannot be done upstream
The change is in common code, and there is no easy way to demonstrate its general usefulness. We know from internal benchmarking that it is beneficial for some Arm-specific cases.