Distinguish between Blocked and CPU time in EventPipe Traces.  

Today in EventPipe, it is not easy to distinguish between a CPU-bound thread and a blocked thread. We only distinguish between threads in managed code (which CAN'T be blocked) and threads that are not in managed code (which often are blocked but don't have to be).

This works OK, but it is not really what we want, because you can't answer simple questions like: Is this thread/process CPU bound? How much CPU capacity is my process consuming? The answers to these are quite useful.

Fundamentally, the reason we don't have this information is that there has been no EASY, EFFICIENT, CROSS-PLATFORM way of getting it: we would need to fetch it once per msec PER THREAD, which is a lot.

My proposal here is to solve this in conjunction with issue dotnet/runtime#11301.
* We add a new 64-bit field, CpuOnThreadNSec, which gives the number of nanoseconds of CPU time consumed on that thread. The basis (zero point) is not specified, so the value is only useful for computing deltas. The value can be 0, which means that we did not fetch the CPU time when the sample was taken.
* For any thread that is actually sampled using the algorithm in dotnet/runtime#11301: if the last sample was in managed code, we will not bother setting the CpuOnThreadNSec field if it was last emitted on this thread less than 1 second ago.

Thus we get an accurate CpuOnThreadNSec every second (if we get events on the thread at all). However, we only pay for this once every second rather than once every msec, which is a huge improvement. Indeed, we may wish to make the threshold once every 100 msec, or make it configurable.
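To make the throttling rule concrete, here is a rough sketch in Python (for illustration only; the real logic would live in the runtime's sampler). The names `ThreadSampleState`, `should_emit_cpu_time`, `cpu_field_for_sample`, and the `read_thread_cpu_nsec` callback are all hypothetical, and the 1 second threshold is just the default suggested above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

NSEC_PER_SEC = 1_000_000_000


@dataclass
class ThreadSampleState:
    # Wall-clock timestamp (nsec) of the last sample on this thread that
    # carried a CpuOnThreadNSec value; None if none has been emitted yet.
    last_cpu_emit_nsec: Optional[int] = None


def should_emit_cpu_time(state: ThreadSampleState, now_nsec: int,
                         threshold_nsec: int = NSEC_PER_SEC) -> bool:
    """True if the threshold has elapsed since the last emission."""
    if state.last_cpu_emit_nsec is None:
        return True
    return now_nsec - state.last_cpu_emit_nsec >= threshold_nsec


def cpu_field_for_sample(state: ThreadSampleState, now_nsec: int,
                         read_thread_cpu_nsec: Callable[[], int],
                         threshold_nsec: int = NSEC_PER_SEC) -> int:
    """Value to put in CpuOnThreadNSec for this sample; 0 means 'not fetched'."""
    if should_emit_cpu_time(state, now_nsec, threshold_nsec):
        state.last_cpu_emit_nsec = now_nsec
        # The (assumed expensive) per-thread CPU time query happens only here.
        return read_thread_cpu_nsec()
    return 0


# Simulate 3 seconds of 1 msec samples and count the expensive reads.
reads = 0

def fake_read() -> int:
    global reads
    reads += 1
    return 1_000 * reads  # arbitrary nonzero, increasing value


state = ThreadSampleState()
fields = [cpu_field_for_sample(state, ms * 1_000_000, fake_read)
          for ms in range(3000)]
```

In this simulation the CPU time is read 3 times across 3000 samples; every other sample carries 0. Making `threshold_nsec` a parameter is one way to get the configurable threshold mentioned above.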

Now, if we do the optimization in dotnet/runtime#11301, a thread that never returns from native code won't produce any data (even if that native code is consuming CPU). Thus even threads that have not returned should log their CpuOnThreadNSec at a low rate (about once a second or slower). I actually recommend a binary backoff algorithm (1 sec, then 2, then 4, ...) for as long as the thread does not return from native code. This allows the overhead of threads that just block and do almost nothing to approach 0 as time goes on.
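The backoff schedule could look roughly like this. Again, this is a Python sketch with made-up names (`native_log_offsets`, `initial_sec`), and it assumes the interval resets to its initial value whenever the thread returns from native code, which is not spelled out above.

```python
from typing import List


def native_log_offsets(stuck_sec: int, initial_sec: int = 1) -> List[int]:
    """Offsets (seconds since the thread entered native code) at which
    CpuOnThreadNSec would be logged while the thread stays in native code.

    The wait between logs doubles each time: 1 sec, then 2, then 4, ...
    """
    offsets = []
    interval = initial_sec
    next_log = interval
    while next_log <= stuck_sec:
        offsets.append(next_log)
        interval *= 2          # binary backoff
        next_log += interval
    return offsets


one_minute = native_log_offsets(60)
one_hour = native_log_offsets(3600)
```

For a thread stuck in native code for a minute this logs 5 times (at 1, 3, 7, 15, and 31 seconds), and for an hour only 11 times, so the cost per blocked thread grows logarithmically with time.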

Things can be tweaked here, but the basics are
1) We have a new OPTIONAL field on the CPU Sample event that tells you the CPU time spent on the thread.
2) We collect this, but at a lower rate.
3) For threads outside the runtime we collect it at an even lower rate (because they are likely to be blocked).

But now the viewer can get a very good idea of what % of the time a thread spends CPU bound, and if it assumes all managed code is CPU, it can attribute how much of the native time is blocked.
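On the viewer side, the computation could be as simple as the following Python sketch. The function names and the `(timestamp_nsec, cpu_on_thread_nsec)` tuple layout are assumptions for illustration, not an existing trace format, and `managed_nsec` is assumed to be estimated elsewhere (for example from the number of managed samples times the sampling interval).

```python
from typing import List, Optional, Tuple


def cpu_fraction(samples: List[Tuple[int, int]]) -> Optional[float]:
    """Fraction of wall-clock time one thread spent on CPU.

    samples: time-ordered (timestamp_nsec, cpu_on_thread_nsec) pairs.
    A cpu_on_thread_nsec of 0 means the field was not fetched, so those
    samples are skipped. Only deltas are used, since the basis is unknown.
    """
    readings = [(ts, cpu) for ts, cpu in samples if cpu != 0]
    if len(readings) < 2:
        return None  # not enough data to form a delta
    (t0, c0), (t1, c1) = readings[0], readings[-1]
    if t1 <= t0:
        return None
    return (c1 - c0) / (t1 - t0)


def split_native_time(wall_nsec: int, cpu_nsec: int,
                      managed_nsec: int) -> Tuple[int, int]:
    """Split native time into (native_cpu_nsec, native_blocked_nsec),
    assuming all managed time is CPU time."""
    native_wall = wall_nsec - managed_nsec
    native_cpu = max(0, cpu_nsec - managed_nsec)
    native_blocked = max(0, native_wall - native_cpu)
    return native_cpu, native_blocked


samples = [
    (0, 5_000_000_000),              # first reading (arbitrary basis)
    (500_000_000, 0),                # field not fetched on this sample
    (1_000_000_000, 5_250_000_000),
    (2_000_000_000, 5_500_000_000),
]
fraction = cpu_fraction(samples)
native_cpu, native_blocked = split_native_time(
    wall_nsec=2_000_000_000, cpu_nsec=500_000_000, managed_nsec=300_000_000)
```

Here the thread used 0.5 sec of CPU over 2 sec of wall-clock time, so `fraction` is 0.25. With 0.3 sec attributed to managed code, the remaining 1.7 sec of native time splits into 0.2 sec on CPU and 1.5 sec blocked.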

@noahfalk @brianrob @jorive
