
[feature request] Add PartialActivityProcessor to provide additional visibility to very long-running processes #2704


Open
wants to merge 3 commits into main

Conversation

mladjan-gadzic

Fixes

Design discussion issue

Changes

PartialActivityProcessor is added. It provides more visibility into long-running processes: log records are emitted when a span starts, periodically during its lifetime, and when it ends. It should be registered before the exporter processor, as sketched below.
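A minimal registration sketch, assuming the standard TracerProvider builder API; the PartialActivityProcessor constructor shown with no arguments is hypothetical, since its exact signature is defined by this PR rather than the released SDK:

```csharp
using OpenTelemetry;
using OpenTelemetry.Trace;

// Hypothetical wiring: register the processor before the exporter processor so
// that partial log records are emitted while spans are still in flight.
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("MyCompany.LongRunningJobs")
    .AddProcessor(new PartialActivityProcessor()) // constructor arguments omitted; see discussion below
    .AddOtlpExporter()
    .Build();
```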

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • Appropriate CHANGELOG.md files updated for non-trivial changes
  • Changes in public API reviewed (if applicable)

@mladjan-gadzic mladjan-gadzic requested a review from a team as a code owner April 17, 2025 13:03

linux-foundation-easycla bot commented Apr 17, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@github-actions github-actions bot requested a review from MikeGoldsmith April 17, 2025 13:03
@github-actions github-actions bot added the comp:extensions Things related to OpenTelemetry.Extensions label Apr 17, 2025
try
{
    WaitHandle.WaitAny(triggers, this.scheduledDelayMilliseconds);
    this.Heartbeat();
Member

What is the purpose of sending heartbeat events?
If there was a "start" event and there is no "end" event yet, it means that the span is still in progress.

Author

Sending heartbeat events indicates that a span is still in progress. This improves observability by showing that an operation is active even when it is long-running. The heartbeat interval can be configured through the constructor.

Member

I do not think it is necessary. An observability backend can indicate that a span is still in progress based on "started" and "ended" events. Such a feature would also decrease the performance of the instrumented application. I imagine it could be an opt-in event, but I do not think this should be the default behavior.

Author

We had some internal discussions about this and our conclusion was more or less the same. This is a first iteration, mainly to get community feedback.

Member

+1 to what @pellared said. A simple processor emitting an event (log) at start/end would be a very good start.


OK. I can see your point.

My main concern is how to cross-reference the heartbeat events with the partial span events. The obvious thing to use would be service.instance.id, I suppose.

When I started typing this response I was worried that it would need to be truly per-process unique to catch problems like a k8s pod getting OOMKilled and restarting. I have seen this not hold with the instance id injected by the OTEL k8s operator, for example: in that case it is unique to the pod but not to a single run of the container (because the operator injects env vars into the pod).

Reading the conventions, though, I'm now thinking maybe that's just a problem with relying on the k8s operator. I guess if you've got enough control to add a PartialActivityProcessor then you can just call AddService() and let it generate a random instance id each time.
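For illustration, a sketch of letting the resource builder generate a fresh service.instance.id for each process run; the service name is made up, and autoGenerateServiceInstanceId defaults to true in the standard resource extensions:

```csharp
using OpenTelemetry;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .ConfigureResource(resource => resource.AddService(
        serviceName: "batch-worker",          // made-up name for illustration
        // A random service.instance.id is generated per process run, unlike an
        // id injected once per pod by the k8s operator.
        autoGenerateServiceInstanceId: true))
    .AddOtlpExporter()
    .Build();
```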

Contributor

Sorry if I missed some other part of this discussion, but do we need a generic heartbeat? Most spans are super-short and won't benefit from it.

But if I know that I have a long-running job, I can instrument that long-running job and add domain-specific events (progress updates).

E.g.

timestamp = X, event_name = starting_job, attributes: {"job.name": "foo", "job.id": 42, "total_items": 100500, ...}
timestamp = X + 1, event_name = progress_update, attributes: {"job.id": 42, "done_items": 10, ...}
....

Wouldn't it be much more useful than a generic span heartbeat?
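A sketch of what such domain-specific events could look like with the plain Activity API; the event and attribute names are illustrative, not an agreed semantic convention:

```csharp
using System.Diagnostics;

static void RunJob(ActivitySource source)
{
    using var activity = source.StartActivity("job");

    // Emitted once at the start of the long-running job.
    activity?.AddEvent(new ActivityEvent("starting_job", tags: new ActivityTagsCollection
    {
        ["job.name"] = "foo",
        ["job.id"] = 42,
        ["total_items"] = 100500,
    }));

    // ... process a chunk of work ...

    // Emitted periodically as work progresses.
    activity?.AddEvent(new ActivityEvent("progress_update", tags: new ActivityTagsCollection
    {
        ["job.id"] = 42,
        ["done_items"] = 10,
    }));
}
```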


This isn't really for reporting progress; it's to enable downstream processing that can recover from long-running (multi-minute/hour/day) spans never completing (due to problems outside the application) by rehydrating the unfinished span from this log-event data and sending it with an error state. This allows proper visualisation/processing of span hierarchies that contain failures of this type.

Our main use case is for long-running, complex, multi-level batch processes (think ML training, but we've got a very heterogeneous set of workloads) where individual steps suffer from random failures (viewed at scale). I've seen another use case for the same functionality for mobile apps, though, where the app can be suspended by the OS at any time.

The span heartbeats allow the processing to mark a span as failed (and send it) if it hasn't seen a heartbeat when it expects one. If we had a process/service heartbeat we could probably use that to mark all uncompleted spans we'd seen from that process as errors once heartbeats stopped arriving. I've not been involved in the implementation of our PoC, though, so I'll defer to @mladjan-gadzic.

Member

I think there are other ways to achieve the same thing, and this is a separate use case. I created open-telemetry/semantic-conventions#2159.


Note that the other advantage of per-span heartbeats is that we can pick up changes to the span attributes during its lifetime. It's a bit of an edge-case, but it's a fundamental difference in what's possible.

[buffer, 0, sdkLimitOptions, null!, new Batch<Activity>(data)]);
var writePosition = result as int? ?? 0; // Use a default value if null

var logRecord = (LogRecord)logRecordConstructor.Invoke(null);
Member

I think we should also set the Severity and EventId.

Author

What about INFO for Severity?

About EventId, it gets a little complicated. The docs say it should only be set when used through ILogger, which is not the case here. From where/what would its value be derived?

Member

> What about INFO for Severity?

Works for me.

> From where/what would its value be derived?

This should be something which defines the name of the event. Maybe simply span.state.

Example: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/gen-ai-events.md

this.shutdownTrigger = new ManualResetEvent(false);

// Access OpenTelemetry internals as soon as possible to fail fast rather than waiting for the first heartbeat
AccessOpenTelemetryInternals(
Author

@pellared while we're at it, do you know of any good practice in OpenTelemetry by which this could be rewritten without any hacks or ugly workarounds?

Member

Sorry, I am not a maintainer/approver here 😉
CC @open-telemetry/dotnet-approvers

Member

I don't think it is a good idea to rely on any internals. If your intent is to send a signal at Activity start/end, accept an ILoggerFactory/ILogger and then use it to emit the log, just like end users would.
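A minimal sketch of that ILogger-based shape, assuming BaseProcessor<Activity> from the SDK; the class name, category name, and message templates are illustrative, not the PR's implementation:

```csharp
using System.Diagnostics;
using Microsoft.Extensions.Logging;
using OpenTelemetry;

internal sealed class StartEndLoggingProcessor : BaseProcessor<Activity>
{
    private readonly ILogger logger;

    public StartEndLoggingProcessor(ILoggerFactory loggerFactory) =>
        this.logger = loggerFactory.CreateLogger("PartialActivityProcessor");

    // Emit a log record when a span starts...
    public override void OnStart(Activity activity) =>
        this.logger.LogInformation(
            "Span started: {TraceId} {SpanId} {Name}",
            activity.TraceId, activity.SpanId, activity.DisplayName);

    // ...and another when it ends, just like an end user would.
    public override void OnEnd(Activity activity) =>
        this.logger.LogInformation(
            "Span ended: {TraceId} {SpanId} {Name} {Status}",
            activity.TraceId, activity.SpanId, activity.DisplayName, activity.Status);
}
```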

Author

The idea of accessing internals comes from serializing the span into a proto, base64-encoded representation in the body of the log record, so that the partial collector (there should soon be a PR to collector-contrib) can reuse the span data. This is a broader approach.


What we really need is just a way to serialise all the span content into a log message so that, if the span never completes normally (e.g. due to a process crash), we can rehydrate it in the component that has been recording these partial span events (including all the original attributes, etc.), mark it as finished in an error state, and send it back into our OTEL pipeline so it's not lost.

Using the serialiser from OpenTelemetry.Exporter.OpenTelemetryProtocol seems the obvious way to do that without missing anything, but that code is not part of the public API.

FWIW I agree that accessing this via reflection is far from ideal (versioning issues worry me). Does anyone know why this isn't public in the upstream exporter?

Member
@pellared commented May 6, 2025

Serializing to OTLP would couple the processor to a certain exporter implementation. We should simply encode the span as a complex log attribute named span.state. The OTLP exporter would then be responsible for serializing it accordingly.

PS. I have no idea how the OTel .NET SDK handles complex log attributes. Referring to @open-telemetry/dotnet-approvers so that they can help out if necessary.

@pellared
Member

PTAL @lmolkova

Contributor

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 30, 2025
@pellared
Member

I think that processor can be used as a prototype for:

I am not sure what the stability policy in this project is and whether it accepts "Development/Experimental" features. Deferring to @open-telemetry/dotnet-maintainers.

Still, I think it would be good to clean up this PR so that it is in a "mergeable" state and that it proposes some semantic conventions.

@mladjan-gadzic
Author

> I think that processor can be used as a prototype for:
>
> I am not sure what the stability policy in this project is and whether it accepts "Development/Experimental" features. Deferring to @open-telemetry/dotnet-maintainers.
>
> Still, I think it would be good to clean up this PR so that it is in a "mergeable" state and that it proposes some semantic conventions.

Thanks for the feedback @pellared! I'll be off until Monday, but I'll clean it up then if not before.

@github-actions github-actions bot removed the Stale label May 1, 2025
@mladjan-gadzic
Author


@pellared I cleaned it up as much as I could.

Labels
comp:extensions Things related to OpenTelemetry.Extensions

6 participants