[IPC Protocol] Add specification to configure a user_events eventpipe session #5454

Merged

mdh1418
Member

@mdh1418 mdh1418 commented Apr 21, 2025

In .NET 10, the EventPipe infrastructure will be leveraged to support user_events.

This PR documents the protocol for enabling a user_events-based EventPipe Session through the Diagnostics IPC protocol, where a new EventPipe Command ID CollectTracing5 will accept necessary tracepoint configuration.

Because a user_events EventPipe session is not streaming based, the payload is expected to first encode a uint output_format that denotes the session format (streaming vs. user_events). After that, only the session configuration options relevant to that format are encoded. These are outlined at the top of the EventPipe Commands section, in the Streaming Session, User_events Session, and Session Providers sections.

For user_events EventPipe sessions, an additional tracepoint_config is to be encoded, to map Event IDs to tracepoints.
This protocol expects the Client to have access to the user_events_data file in order to configure a user_events EventPipe session, and expects the user_events_data file descriptor to be sent via SCM_RIGHTS in the continuation stream.
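
For readers unfamiliar with SCM_RIGHTS, here is a minimal Python sketch of passing an open file descriptor over a Unix domain socket. It is only an illustration of the mechanism, not the runtime's or the client's implementation: the helper names `send_fd`, `recv_fd`, and `demo` are hypothetical, a temporary file stands in for user_events_data, and a local socket pair stands in for the IPC connection.

```python
import array
import os
import socket
import tempfile


def send_fd(sock: socket.socket, fd: int, payload: bytes = b"\x00") -> None:
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    fds = array.array("i", [fd])
    # At least one byte of ordinary data must accompany the ancillary message.
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent with SCM_RIGHTS."""
    fds = array.array("i")
    _data, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            fds.frombytes(cdata[: len(cdata) - (len(cdata) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def demo() -> bytes:
    """Round-trip a descriptor over a socket pair and write through the copy."""
    client, runtime = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as stand_in:  # hypothetical stand-in file
        send_fd(client, stand_in.fileno())
        received = recv_fd(runtime)
        try:
            os.write(received, b"hello")  # the receiver writes via its own fd
        finally:
            os.close(received)
        stand_in.seek(0)
        result = stand_in.read()
    client.close()
    runtime.close()
    return result
```

The receiving side gets its own descriptor number that refers to the same open file, which is why the write through `received` is visible when the sender reads the file back.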

Additionally, as user_events does not support callbacks, a new event_filter config is expected to be encoded in CollectTracing5. It acts as an allow/deny list of Event IDs that is applied after the keyword/level filter.
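
A small Python sketch of how such a filter could be evaluated is shown here. The `EventFilter` shape (an `allow` flag plus a set of IDs) and the function name are assumptions made for illustration; the PR description only states that the filter is an allow/deny list of Event IDs applied after the keyword/level filter.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EventFilter:
    # Assumed shape: `allow` selects allow-list vs. deny-list semantics.
    allow: bool
    event_ids: frozenset = field(default_factory=frozenset)


def event_enabled(event_id: int, passed_keyword_level: bool, flt: EventFilter) -> bool:
    """Apply the Event ID filter only to events that already passed keyword/level."""
    if not passed_keyword_level:
        return False
    listed = event_id in flt.event_ids
    # Allow list: only listed IDs pass. Deny list: only unlisted IDs pass.
    return listed if flt.allow else not listed
```

With an allow list of `{1, 2}`, event 3 is dropped even if its keywords and level match; with a deny list of `{1, 2}`, event 3 is the one that gets through.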

@mdh1418 mdh1418 requested a review from a team as a code owner April 21, 2025 23:53
@mdh1418 mdh1418 force-pushed the add_ipc_message_protocol_for_user_events_ep_session branch from 722afa6 to 4094fe9 Compare April 21, 2025 23:54
mdh1418 added 2 commits April 25, 2025 12:19
Update CommandSets Command IDs
Move EventPipe StopTracing to beginning
Fix sample payload serialization
Clarify Header Size and NetTrace format version
Clarify that filter_data can be 0 length to avoid confusion with
optional meaning that encoding can be skipped
@mdh1418 mdh1418 force-pushed the add_ipc_message_protocol_for_user_events_ep_session branch from c30b773 to 8b5bf46 Compare April 25, 2025 16:26
@mdh1418 mdh1418 force-pushed the add_ipc_message_protocol_for_user_events_ep_session branch from 1ba9499 to 87902f0 Compare May 1, 2025 00:56
@mdh1418 mdh1418 requested a review from brianrob May 6, 2025 14:46
@mdh1418 mdh1418 requested a review from agocke May 6, 2025 20:51
Member

@brianrob brianrob left a comment

@mdh1418 thanks for updating this doc and for your patience with me reviewing it. A couple of questions, but otherwise, looks good.

@steveisok steveisok requested a review from lateralusX May 21, 2025 23:47

The `version` is the version of the tracepoint format, which in this case is [version 1](#tracepoint-format-v1).

The `event_id` is the ID of the event, defined by the EventSource/native manifest.
Member

@lateralusX lateralusX May 23, 2025

Events are also versioned, so just having the event id won't be enough to know how to parse the payload. In EventPipe and the nettrace format, each serialized event has an associated metadata id that is unique to the session stream and appears before the event in the stream. The metadata record carries additional data about the event, such as the provider name and other event metadata, including the version. How will this work for user events? I see that we have a metadata field, and looking at the description below, it sounds like we will include metadata for the first event of that type per session, similar to the nettrace metadata event. But since each event instance doesn't carry a unique metadata id, how will we look up metadata for potentially different versions of an event?

Member

The metadata should be sent once per event, per EventPipe session. (process id, tracepoint) should be sufficient to resolve a session and (tracepoint, event id) is sufficient to resolve an event. The combination of process id + event id + tracepoint name should then have precision equivalent to the metadata id used in the nettrace format. If we ever support multiple EventSources with the same name, same event id, but different parameter signatures in the same process, it would cause problems and we'd have to address it. No plans to do that though.

Member

OK, so that means the tracepoint would need to be different for different providers: event IDs are only unique within a specific provider, so mixing events from different providers into the same tracepoint will cause event ID collisions. How will we handle different versions of the same event?

Member

> How will we handle different versions of the same event?

In the same process we don't support different versions of the same event. In separate processes you have the PID to distinguish and each process will emit its own metadata.

Member

> OK, so that means the tracepoint would need to be different for different providers: event IDs are only unique within a specific provider, so mixing events from different providers into the same tracepoint will cause event ID collisions.

yep! #5454 (comment)

Member

> > How will we handle different versions of the same event?
>
> In the same process we don't support different versions of the same event. In separate processes you have the PID to distinguish and each process will emit its own metadata.

OK, events over EventPipe can technically handle this, since each unique event instance is identified by provider+event_id+event_version. Do we enforce this in any way, or is it just a convention that we should only support one version of the same event in the same process? Making sure that a specific event fired by the runtime uses the same version is probably controllable, but since we generate all versions in clretwallmain.h, it is still possible to use both V1 and V2 of the same event in different parts of the runtime. For EventSource it's probably more complicated to enforce, since we have less control over usage.

Member Author

Are there events where multiple of its versions are actually in use at the same time? I've only just started looking through.

There might be multiple ways to discern the EventPipe Event's version. One would be through the EventPipe Event Metadata, but I'm trying to validate that the metadata will always encode the EventPipeEvent's version. One oddity (at least at a glance it looks like an inconsistency) is https://github.com/dotnet/runtime/blob/9703a2ed912485bdb888ac3e3a023564ec9e3528/src/native/eventpipe/ep-event-source.c#L132-L154, where the metadata encodes version 1 but the actual event being added has version 0

Another way to discern the EventPipe Event's version works if and only if different versions of the same event have unique sizes. Since the size is encoded in the __rel_loc, it can be used to reverse engineer which version of the event is being seen. I don't know whether we have that invariant, i.e. that newer versions of the same event are strictly larger than previous versions. I'd have to investigate.
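
To make the size-based idea concrete, here is a hedged Python sketch. It assumes the Linux tracing convention that a `__rel_loc` value is 32 bits with the offset in the low 16 bits and the length in the high 16 bits. The helper names and the per-version sizes are hypothetical, and the approach only works if every version of an event has a distinct, fixed payload size, which is exactly the invariant the comment above says is unverified.

```python
def decode_rel_loc(value: int) -> tuple[int, int]:
    """Split a 32-bit __rel_loc value into (offset, length)."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF


def version_by_size(sizes_by_version: dict[int, int]) -> dict[int, int]:
    """Invert {version: payload_size}; fail if two versions share a size."""
    inverted: dict[int, int] = {}
    for version, size in sizes_by_version.items():
        if size in inverted:
            raise ValueError(
                f"versions {inverted[size]} and {version} share size {size}"
            )
        inverted[size] = version
    return inverted
```

A parser could call `decode_rel_loc` on the payload field, then look the length up in the inverted table. Events with variable-length fields such as strings would not fit this scheme, since their size does not identify a version.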

Member

> Do we enforce this in any way, or is it just a convention that we should only support one version of the same event in the same process?

EventSource enforces it and it is documented:

"Each method has a body that calls WriteEvent passing it an ID (a numeric value that represents the event) and the arguments of the event method. The ID needs to be unique within the EventSource. The ID is explicitly assigned using the System.Diagnostics.Tracing.EventAttribute"

If you try to define an EventSource with multiple events whose IDs match (in order to give them different versions) then internally EventSource will have an error like this:

Exception thrown: 'System.ArgumentException' in System.Private.CoreLib.dll
Exception thrown: 'System.ArgumentException' in System.Private.CoreLib.dll
EventSource Error: ERROR: Exception in Command Processing for EventSource MyEventSource: Event Foo has ID 1 which is already in use.

That exception gets caught internally to avoid having instrumentation break user code, but it should prevent the event from working.

Likewise if you try to define multiple EventSources with the same provider ID in order to define their events differently, that also hits a different error check.

Member

OK, good, that should take care of only emitting one version of an event in managed code. I don't believe we enforce this rule for native runtime events so then it will become a convention to not emit multiple versions of the same event through runtime code. Maybe we could implement some validation code to enforce this.

So based on the discussion above, for parsers to correctly parse events in a given user_events trace, the following needs to hold:

  • All events from a given provider must go into that provider's own tracepoint to make sure event ids are unique. Providers can't be mixed in a single tracepoint, since there would be event id collisions.
  • The first instance of an event inside a tracepoint needs to carry metadata for the event id.
  • Only one version of a given event can appear inside the same tracepoint.
  • If multiple processes write into the same tracepoint, parsers must parse events based on pid.
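
The rules above suggest that a parser can key its metadata lookup on (pid, tracepoint name, event id). A minimal Python sketch of such a cache is shown here; the class and method names are hypothetical, and treating metadata as an opaque `bytes` blob is an assumption made to keep the example short.

```python
from typing import Optional


class MetadataCache:
    """Resolve event metadata by (pid, tracepoint name, event id)."""

    def __init__(self) -> None:
        self._by_key: dict[tuple[int, str, int], bytes] = {}

    def observe(self, pid: int, tracepoint: str, event_id: int,
                metadata: Optional[bytes]) -> bytes:
        """Record metadata if the event carries it, then return what is known."""
        key = (pid, tracepoint, event_id)
        if metadata:
            # An instance that carries metadata (the first one in a session)
            # defines how later instances with the same key are parsed.
            self._by_key[key] = metadata
        if key not in self._by_key:
            raise KeyError(f"no metadata seen yet for {key}")
        return self._by_key[key]
```

Because the pid is part of the key, two processes writing the same event id into the same tracepoint resolve to their own metadata, which matches the last bullet.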

Member

> There might be multiple ways to discern the EventPipe Event's version. One would be through the EventPipe Event Metadata, but I'm trying to validate that the metadata will always encode the EventPipeEvent's version. One oddity (at least at a glance it looks like an inconsistency) is https://github.com/dotnet/runtime/blob/9703a2ed912485bdb888ac3e3a023564ec9e3528/src/native/eventpipe/ep-event-source.c#L132-L154, where the metadata encodes version 1 but the actual event being added has version 0

@mdh1418 that represents the process info event emitted once per session. We should probably correct that inconsistency. Looking at the PerfView implementation, it doesn't appear to do any structured parsing of this event (https://github.com/brianrob/perfview/blob/a1c8b66ac9358dc4a7bbae41825e917c130f7ca3/src/TraceEvent/TraceLog.cs#L2091). Since we already introduced version 1 in the metadata, we should move the event definition up to version 1 as well.

#### User_events Session Payload:
* `uint output_format`: 1
* `ulong rundownKeyword`: Indicates the keyword for the rundown provider
* `array<user_events_provider_config> providers`: The providers to turn on for the session
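
As a rough illustration of the fields listed above, the following Python sketch serializes the head of such a payload. It assumes little-endian encoding and that an `array<T>` starts with a `uint` element count; the function name is hypothetical, and the layout of each `user_events_provider_config` element is deliberately left out because it is not part of this excerpt.

```python
import struct


def encode_user_events_payload_head(rundown_keyword: int, provider_count: int) -> bytes:
    """Encode output_format, rundownKeyword, and the providers array length."""
    user_events_output_format = 1  # value given in the payload description
    # "<" = little-endian with no padding, I = uint32, Q = uint64.
    return struct.pack(
        "<IQI", user_events_output_format, rundown_keyword, provider_count
    )
```

The encoded provider entries would follow these 16 bytes, one after another, in whatever layout the specification defines for them.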
Member

@lateralusX lateralusX May 28, 2025

I see that we drop the requestStackWalk bool for user events, and I guess it's intentional. Reading up on other discussions around stack traces, I'm under the impression that we will rely on OS stack walking capabilities for user events, correct?

If that is the case, then I assume we either won't trigger the code path in ep_buffer_manager_write_event that checks whether the session has stack walks enabled (based on requestStackWalk) and whether the event requests stack walks (part of the event configuration), or we will make sure ep_session_get_enable_stackwalk (session) is always false for user_events sessions.

There is one exception to this: the sample profiler provider. That provider does stack walks as part of sampling and passes the stack directly to ep_write_sample_profile_event. Will we support the sample profiler for user events, or will we use some other mechanism provided by the OS to sample threads?

Member Author

For a user_events EventPipe session, I think the aim is to not use any ep_buffer_manager* logic. Right now in the runtime PR draft, stackwalk_requested is forcibly set to false if it's a user_events session.

> Will we support the sample profiler for user events, or will we use some other mechanism provided by OS to sample threads?

That was one of the things we were discussing offline, where I need to thoroughly investigate what is needed to make stackwalking possible for user_events, but I haven't gotten to that yet.

Member

OK, one thing to keep an eye on is what requirements the OS unwinder puts on frames. AFAIK the CoreCLR JIT doesn't provide native platform-specific unwind information (DWARF) that gets registered with the OS unwinder; this is only done when using NativeAOT (as part of the binary). The CoreCLR JIT uses Windows unwind op codes on all platforms and relies on an internal unwinder to correctly unwind stacks that include managed frames. If this observation still holds, then the OS unwinder will only be able to stack walk managed frames if they are compiled with frame pointers, something the JIT optimizer can omit unless instructed to disable that optimization.

mdh1418 added 2 commits June 4, 2025 12:16
Parallel renaming to Runtime counterpart PR.
The serialization format and output format were deemed confusing
to have side by side, so renamed the `output format` to more clearly
represent its usage
Member Author

mdh1418 commented Jun 4, 2025

@lateralusX @noahfalk @beaubelgrave @brianrob @AaronRobinsonMSFT
If I'm not mistaken, all blocking concerns have been addressed, and right now the unresolved comments are discussions for understanding/clarity that can be addressed in a separate PR?

The runtime counterpart is nearly mergeable, and I haven't discovered any additional critical modifications to the spec for the initial iteration. I'm planning to merge this and the runtime PR soon (end of this week or early next week) and then more incremental follow-up PRs will be made to address additional nice-to-haves + clarity.

@mdh1418 mdh1418 requested review from lateralusX and noahfalk June 18, 2025 15:13
@noahfalk noahfalk merged commit ff0733e into dotnet:main Jun 18, 2025
2 checks passed
@mdh1418 mdh1418 deleted the add_ipc_message_protocol_for_user_events_ep_session branch June 18, 2025 21:54
6 participants