Better benchmarking/performance monitoring

Our benchmarking suite is incredibly unreliable right now - there's too much variation between runs for the results to be useful.

I'm not sure what the right option here is, but we should try to have something stable enough that results can be compared from one run to the next.
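
One possible direction, purely as a sketch and not a description of what the suite does today: repeat each benchmark several times after a few warm-up runs, then report the median and the median absolute deviation (MAD) instead of a single timing, since both are less sensitive to outlier runs than a mean. The names `measure`, `summarize`, and `is_noisy`, and the defaults for `warmup`, `repeats`, and `tolerance`, are all hypothetical and would need tuning for our workloads.

```python
import statistics
import time


def summarize(samples):
    """Return the median, MAD, and minimum of a list of timings (seconds)."""
    median = statistics.median(samples)
    # Median absolute deviation: a spread estimate that resists outliers.
    mad = statistics.median([abs(s - median) for s in samples])
    return {"median": median, "mad": mad, "min": min(samples)}


def measure(fn, warmup=3, repeats=15):
    """Time fn() repeatedly and return robust summary statistics."""
    for _ in range(warmup):
        fn()  # discarded: lets caches and lazy initialization settle
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return summarize(samples)


def is_noisy(summary, tolerance=0.05):
    """Flag a result whose spread exceeds `tolerance` of its median."""
    if summary["median"] == 0:
        return False
    return summary["mad"] / summary["median"] > tolerance
```

A result flagged by `is_noisy` could be rerun or excluded from comparisons, which at least makes the variation visible instead of silently skewing the numbers.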