Currently, the broadcasting logic runs on every vcompute call. Depending on how xtensor is structured, much of that logic may be duplicated hundreds of times in the compiled binary unnecessarily.
We could either broadcast the operands before doing the operation, or try to refactor xtensor so that the broadcasting functions can be reused across operations. Either approach may reduce the binary size by a good margin.
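To make the first option concrete, here is a rough sketch of what "broadcast before doing the operation" could look like. It does not use xtensor's API: `shape_t`, `broadcast_shape`, `broadcast_to` and `apply_same_shape` are hypothetical names made up for illustration, and it assumes NumPy-style broadcasting rules on row-major `double` buffers. The idea is that the shape logic lives in plain non-template functions that are compiled once, and only a tiny same-shape kernel is templated per operation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

using shape_t = std::vector<std::size_t>;  // hypothetical alias, not xtensor's

// Hypothetical helper: NumPy-style broadcast of two shapes.
// Non-template, so it is compiled once however many operations call it.
shape_t broadcast_shape(const shape_t& a, const shape_t& b)
{
    const std::size_t n = std::max(a.size(), b.size());
    shape_t out(n, 1);
    for (std::size_t i = 0; i < n; ++i)
    {
        // Align from the trailing dimension; missing leading dims count as 1.
        const std::size_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
        const std::size_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
        if (da != db && da != 1 && db != 1)
            throw std::invalid_argument("incompatible shapes");
        out[n - 1 - i] = (da == 1) ? db : da;
    }
    return out;
}

// Hypothetical helper: materializes row-major `data` of shape `from`
// into a buffer of shape `to`. Also non-template.
std::vector<double> broadcast_to(const std::vector<double>& data,
                                 const shape_t& from, const shape_t& to)
{
    assert(from.size() <= to.size());
    std::size_t total = 1;
    for (std::size_t d : to) total *= d;

    // Source strides aligned to `to`; broadcast dimensions get stride 0.
    const std::size_t offset = to.size() - from.size();
    std::vector<std::size_t> strides(to.size(), 0);
    std::size_t step = 1;
    for (std::size_t i = from.size(); i-- > 0;)
    {
        strides[offset + i] = (from[i] == 1) ? 0 : step;
        step *= from[i];
    }

    std::vector<double> out(total);
    for (std::size_t flat = 0; flat < total; ++flat)
    {
        // Decompose the flat index into coordinates, innermost dim first.
        std::size_t rem = flat, src = 0;
        for (std::size_t d = to.size(); d-- > 0;)
        {
            src += (rem % to[d]) * strides[d];
            rem /= to[d];
        }
        out[flat] = data[src];
    }
    return out;
}

// The only per-operation template: a flat loop with no shape logic in it.
template <class F>
std::vector<double> apply_same_shape(const std::vector<double>& x,
                                     const std::vector<double>& y, F f)
{
    assert(x.size() == y.size());
    std::vector<double> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(), f);
    return out;
}
```

With this layout, adding a `{2, 3}` array to a `{3}` array would call `broadcast_shape` and `broadcast_to` once up front and then run `apply_same_shape` with the addition functor. The trade-off is that materializing the broadcast operand costs extra memory and a copy, so whether this is a net win compared with refactoring xtensor's lazy broadcasting would need to be measured.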