Currently, both autoschedulers appear to examine Target.natural_vector_size() for the single target passed to them, and generate vectorizations accordingly.
This is fine when targeting a single microarchitecture, but experience has shown that you can get good results across a range of microarchitectures (e.g. x86 SSE4.1/AVX/AVX2) by vectorizing relative to the natural vector size of each specific subtarget in a multitarget build.
Could the generated autoschedule instead express its vectorization factors in terms of the value of get_target().natural_vector_size<>() at C++ compile time, rather than as fixed constants? If so, we could probably emit more flexible schedules.
(Yes, this might well constrain other aspects of the schedule, e.g. tiling, so this may be impractical or impossible.)
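
To make the request concrete, here is a minimal sketch of the difference between a baked-in factor and one expressed relative to the subtarget. It does not use Halide itself: `MockTarget`, `hardcoded_vectorize_factor`, `relative_vectorize_factor`, and `check_sketch` are hypothetical names invented for illustration, and the byte widths are assumptions standing in for what the real target would report. The comments show roughly what the corresponding schedule lines might look like; the exact text an autoscheduler emits is not claimed here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for Halide::Target. It models only the vector
// register width; the real class is not used here.
struct MockTarget {
    std::string name;
    int vector_bytes;  // assumed: 16 for SSE4.1 (128-bit), 32 for AVX2 (256-bit)

    // Mirrors the shape of natural_vector_size<T>(): lanes of T per vector.
    template<typename T>
    int natural_vector_size() const {
        return vector_bytes / static_cast<int>(sizeof(T));
    }
};

// A factor baked in as a literal, as if chosen once for a single AVX2 target
// with float data. The schedule line would read roughly:
//   f.vectorize(x, 8);
inline int hardcoded_vectorize_factor() {
    return 8;
}

// The same decision expressed relative to whichever subtarget is being built.
// The schedule line would read roughly:
//   f.vectorize(x, get_target().natural_vector_size<float>());
template<typename T>
int relative_vectorize_factor(const MockTarget &t) {
    return t.natural_vector_size<T>();
}

// Small self-check of the arithmetic above.
inline void check_sketch() {
    const MockTarget sse41{"x86-64-sse41", 16};
    const MockTarget avx2{"x86-64-avx2", 32};
    assert(relative_vectorize_factor<float>(sse41) == 4);  // 16 bytes / 4-byte float
    assert(relative_vectorize_factor<float>(avx2) == 8);   // 32 bytes / 4-byte float
    assert(hardcoded_vectorize_factor() == 8);             // same on every subtarget
}
```

With the literal, every subtarget in a multitarget build gets the width that was right for only one of them; with the relative form, each subtarget resolves its own width when its pipeline is built. As noted above, anything derived from the factor (tile sizes, split factors) would presumably need to be expressed the same way, which is where this may become impractical.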