Describe the bug
std::max({value1, value2, value3}) (at least with double values) is much slower than std::max(value1, std::max(value2, value3)). With the initialiser list, the call goes to the explicitly-vectorised implementation described at https://learn.microsoft.com/en-us/cpp/standard-library/vectorized-stl-algorithms?view=msvc-170. That means function call overhead, possibly some dispatching to the variant of the function with the best vector instructions for the ISA extensions present on the CPU, and presumably an eventual fallback to a non-vectorised implementation for the 'last three' elements, which here are the only elements.
As the size is fixed at compile time, picking the variant based on the size, and avoiding the vectorised one for inputs too small to benefit from it, should help. At the moment, _USE_STD_VECTOR_ALGORITHMS is the only control users have, and that kills the optimisation in places where it's actually helpful, too. For the three-value example given, it's not a big loss of readability or conciseness to avoid the initialiser list, but the threshold at which the optimisation actually pays off is higher than three.
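As a user-side stopgap, something along these lines keeps call sites concise without going through the initialiser-list overload. It is only a sketch: max_of is a hypothetical helper name (nothing in the STL), and it assumes all arguments share one type and that C++14 or later is in use.

```cpp
#include <algorithm>

// Hypothetical helper, not part of the STL: the maximum of a fixed number of
// same-typed arguments, built from nested two-argument std::max calls.
template <class T>
constexpr T max_of(T a) {
    return a;
}

template <class T, class... Rest>
constexpr T max_of(T a, T b, Rest... rest) {
    // Reduce the first two arguments, then recurse on whatever is left.
    return max_of(std::max(a, b), rest...);
}

// The argument count is known at compile time, so the whole call can be
// evaluated in a constant expression.
static_assert(max_of(1.0, 3.0, 2.0) == 3.0, "max_of should pick the largest value");
```

max_of(value1, value2, value3) then expands to the same nested std::max calls as the hand-written form. Ideally the library would make this kind of size-based choice itself.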
Command-line test case
I'm unconvinced this will make things any clearer, but I can throw together a microbenchmark that demonstrates this if you really need one.
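A rough sketch of what such a microbenchmark might look like is below. It is not a measured result: the names (BenchResult, run_bench, make_input, bench_initializer_list, bench_nested) are made up for illustration, the input is pseudo-random filler, and a serious comparison would need an optimised build, warm-up runs and repeated timings.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: the checksum keeps the loop from being optimised
// away and lets the two variants be checked against each other.
struct BenchResult {
    double checksum;
    double nanoseconds;
};

// Deterministic filler data: small non-negative integers stored as doubles.
inline std::vector<double> make_input(std::size_t n) {
    std::vector<double> v(n);
    std::uint32_t state = 12345u;
    for (auto& x : v) {
        state = state * 1664525u + 1013904223u;  // simple LCG step
        x = static_cast<double>(state >> 22);
    }
    return v;
}

// Times f over every window of three consecutive elements.
template <class F>
BenchResult run_bench(const std::vector<double>& data, F f) {
    const auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (std::size_t i = 0; i + 2 < data.size(); ++i) {
        sum += f(data[i], data[i + 1], data[i + 2]);
    }
    const auto stop = std::chrono::steady_clock::now();
    return {sum, std::chrono::duration<double, std::nano>(stop - start).count()};
}

// Variant under test: the initialiser-list overload.
inline BenchResult bench_initializer_list(const std::vector<double>& data) {
    return run_bench(data, [](double a, double b, double c) { return std::max({a, b, c}); });
}

// Baseline: nested two-argument calls.
inline BenchResult bench_nested(const std::vector<double>& data) {
    return run_bench(data, [](double a, double b, double c) { return std::max(a, std::max(b, c)); });
}
```

Calling both functions on the same make_input(...) vector and comparing the nanoseconds fields gives the comparison. The checksum fields should match, since both variants sum the same maxima in the same order.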
Expected behavior
The manually-vectorised implementations of algorithms are only used when they have a reasonable chance of not making things slower.
STL version
Microsoft (R) C/C++ Optimizing Compiler Version 19.44.35228 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
This isn't the latest, but I looked at the relevant header, and there's still nothing to address this.