Description
Today in EventPipe it is not easy to distinguish between a CPU-bound thread and a blocked thread. We only distinguish between threads in managed code (which CAN'T be blocked) and threads that are not in managed code (which are often blocked but don't have to be).
This works OK, but it is not really what we want, because you can't answer simple questions like: Is this thread/process CPU bound? How much CPU capacity is my process consuming? These are quite useful to know.
Fundamentally, the reason we don't have this information is that there has been no EASY, EFFICIENT way of getting it in a CROSS-PLATFORM way, since we would need to fetch it once per msec PER THREAD, which is a lot.
My proposal here is to solve this in conjunction with Issue #11301.
- We add a new 64-bit field, CpuOnThreadNSec, which gives the number of nanoseconds of CPU time consumed on that thread (the base is unspecified; the value is only useful for computing deltas). The field can be 0, which means that we did not fetch the CPU time when the sample was taken.
For any threads that are actually sampled using the algorithm in #11301:
- For every such thread, if the last sample was in managed code, we will not bother setting the CpuOnThreadNSec field if it was last emitted on this thread less than 1 second ago.
Thus we get an accurate CpuOnThreadNSec every second (if we get events on the thread at all). However, we only pay for it once every second rather than once every msec, which is a huge improvement. (Indeed, we may wish to make the threshold once every 100 msec, or make it configurable.)
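To make the throttling rule concrete, here is a minimal sketch in Python (not runtime code). The names `ThreadSampleState`, `cpu_field_for_sample`, and `read_thread_cpu_nsec` are hypothetical and exist only for illustration. The sketch assumes the sampler keeps one small piece of per-thread state and that the interval is a parameter.

```python
from dataclasses import dataclass
from typing import Callable, Optional

NSEC_PER_SEC = 1_000_000_000
# Threshold from the proposal; 100 msec or a configurable value are the suggested alternatives.
EMIT_INTERVAL_NSEC = NSEC_PER_SEC


@dataclass
class ThreadSampleState:
    """Hypothetical per-thread bookkeeping kept by the sampler."""
    last_emit_wall_nsec: Optional[int] = None  # wall-clock time of the last emitted reading


def cpu_field_for_sample(
    state: ThreadSampleState,
    now_wall_nsec: int,
    read_thread_cpu_nsec: Callable[[], int],
    interval_nsec: int = EMIT_INTERVAL_NSEC,
) -> int:
    """Return the value to store in CpuOnThreadNSec for this sample (0 = not fetched)."""
    last = state.last_emit_wall_nsec
    if last is not None and now_wall_nsec - last < interval_nsec:
        return 0  # emitted recently, so skip the expensive per-thread query
    state.last_emit_wall_nsec = now_wall_nsec
    return read_thread_cpu_nsec()
```

With a 1 msec sample rate and the 1 second interval, the expensive query runs on roughly one sample in a thousand per thread; every other sample carries 0.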
Now, if we do the optimization in #11301 and a thread never returns from native code, then you won't get any data on it (even if that native code is consuming CPU). Thus even threads that have not returned should log their CpuOnThreadNSec at a low rate (about once a second or slower). I actually recommend a binary backoff algorithm (1 sec, then 2, then 4, ...) for as long as the thread does not return from native code. This allows the overhead of threads that just block and do almost nothing to approach 0 as time goes on.
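The backoff schedule can be sketched as follows. This is an illustration only: `native_poll_delays` and `reads_within` are hypothetical names, and the optional `cap_sec` upper bound is an assumption that is not part of the proposal. When the thread does return from native code, the schedule would simply be restarted.

```python
from itertools import islice


def native_poll_delays(initial_sec=1, cap_sec=None):
    """Yield the delays between CpuOnThreadNSec reads for a thread that stays in native code."""
    delay = initial_sec
    while True:
        yield delay
        delay *= 2  # binary backoff: 1, 2, 4, ...
        if cap_sec is not None:
            delay = min(delay, cap_sec)  # assumed optional cap, not in the proposal


def reads_within(total_sec, initial_sec=1):
    """Count how many reads happen within total_sec if the thread never returns."""
    count, elapsed = 0, 0
    for delay in native_poll_delays(initial_sec):
        elapsed += delay
        if elapsed > total_sec:
            return count
        count += 1


first_five = list(islice(native_poll_delays(), 5))  # [1, 2, 4, 8, 16]
reads_in_one_hour = reads_within(3600)              # 11
```

A thread that stays blocked in native code for an hour is read 11 times under this schedule, versus 3600 times at a fixed rate of once per second.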
Things can be tweaked here, but the basics are:
- We have a new OPTIONAL field on the CPU Sample event that tells you the CPU time spent on the thread.
- We collect this, but at a lower rate than the samples themselves.
- For threads outside the runtime we collect it at a lower rate still (because they are likely to be blocked).
But now the viewer can get a very good idea of what % of the time a thread spends CPU bound, and if it assumes all time in managed code is CPU time, it can work out how much of the native time is blocked.
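On the viewer side, the computation might look like the sketch below. The function names and the sample layout (a list of `(timestamp_nsec, cpu_on_thread_nsec)` pairs for one thread) are assumptions made for illustration, not an existing viewer API. Samples whose field is 0 are skipped, and only deltas between non-zero readings are used.

```python
def cpu_fractions(samples):
    """Given time-ordered (timestamp_nsec, cpu_on_thread_nsec) samples for one thread,
    return (start_ts, end_ts, cpu_fraction) for each pair of consecutive non-zero readings."""
    intervals = []
    prev = None
    for ts, cpu in samples:
        if cpu == 0:
            continue  # CpuOnThreadNSec was not fetched for this sample
        if prev is not None and ts > prev[0]:
            wall = ts - prev[0]
            intervals.append((prev[0], ts, (cpu - prev[1]) / wall))
        prev = (ts, cpu)
    return intervals


def split_native_time(wall_nsec, cpu_nsec, managed_nsec):
    """Assuming all managed time is CPU time, split the rest of an interval
    into (native CPU time, blocked time)."""
    native_cpu = max(cpu_nsec - managed_nsec, 0)
    blocked = max(wall_nsec - cpu_nsec, 0)
    return native_cpu, blocked


samples = [
    (0, 5_000_000_000),
    (500_000_000, 0),                # field not fetched on this sample
    (1_000_000_000, 5_250_000_000),  # 0.25 s of CPU in 1 s of wall time
    (2_000_000_000, 6_250_000_000),  # 1 s of CPU in 1 s of wall time
]
fractions = [f for _, _, f in cpu_fractions(samples)]  # [0.25, 1.0]
```

For the first interval above, if the samples showed 0.1 s in managed code, `split_native_time(1_000_000_000, 250_000_000, 100_000_000)` attributes 0.15 s to native CPU and 0.75 s to blocked time.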