
Take Final Stack Trace Sample When Trace Profiling Stops#2277

Merged
laurit merged 15 commits into signalfx:main from tduncan:span-end-thread-sampling
Apr 25, 2025

Conversation

Contributor

@tduncan tduncan commented Apr 16, 2025

This PR adds a final stack trace sample taken when stack trace sampling is stopped for a given trace. Upon stopping sampling for a given trace ID, a final stack trace sample will be taken on demand, with a duration matching more closely the amount of time elapsed between when the previous sample was taken and when the service entry span ends.

In addition, the mechanism by which the stack trace duration is calculated has been updated to compute the value on every iteration rather than assume the configured sampling duration. This has the effect that at sampling start and stop a value likely much smaller than the configured sampling period will be calculated, and normal executions of the sampler will report computed values rather than an assumed value of the sampling period.

Note: This PR and #2276 are addressing the same issue.

@tduncan tduncan requested review from a team as code owners April 16, 2025 21:22

@lo-jason lo-jason left a comment


Looks good. I had a comment on Slack: https://splunk.slack.com/archives/C07MYK71B55/p1744845504377729?thread_ts=1744835907.680119&cid=C07MYK71B55

The callgraph executor ending early is a bug we should look to address before release as well.

Comment on lines +57 to +56
traceId -> {
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(
new StackTraceGatherer(
samplingPeriod, spanContext.getTraceId(), Thread.currentThread()),
SCHEDULER_INITIAL_DELAY,
samplingPeriod.toMillis(),
TimeUnit.MILLISECONDS);
return scheduler;
});
traceId -> new ThreadSampler(traceId, Thread.currentThread(), samplingPeriod));
Contributor

Why is this change in here?

Contributor Author

@tduncan tduncan Apr 18, 2025

The creation and scheduling of StackTraceGatherer instances has moved into the constructor of the new ThreadSampler, which now assumes responsibility for managing the thread profiling mechanism, including stopping the scheduled sampling and taking the final on-demand sample when trace profiling is stopped.
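A minimal sketch of the shape being described, assuming a sampler that owns both the scheduler and the gatherer so that stop() can cancel the periodic sampling and take one final on-demand sample (all names here are illustrative, not the PR's actual code):

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a ThreadSampler-like class: it schedules the
// periodic gatherer itself, and stop() takes one final sample on demand.
class ThreadSamplerSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final Runnable gatherer;

  ThreadSamplerSketch(Runnable gatherer, Duration samplingPeriod) {
    this.gatherer = gatherer;
    scheduler.scheduleAtFixedRate(
        gatherer, 0, samplingPeriod.toMillis(), TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdown();
    gatherer.run(); // final sample taken when trace profiling stops
  }
}
```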

}
}

private Duration samplingPeriod(long fromNanos, long toNanos) {
Contributor

I don't think this method provides much -- I'd just inline it.

.collect(Collectors.toList());
}

static class StagedStackTraces extends AbstractList<StackTrace> {
Contributor

This feels bad to me. It's also just a strange inner class that can claim to be "emptied" even though it still has content in the list, and can have content in the list while still being "emptied". I don't follow. Even if you do still somehow want to keep this, it needs to be private.

Contributor Author

@tduncan tduncan Apr 17, 2025

It can't be private as it is used in ScheduledExecutorStackTraceSamplerTest.

Testing the inclusion of an unscheduled thread snapshot taken when trace sampling is stopped created contradictory requirements for visibility into the StagingArea. Right now the expectation is that the StagingArea will be emptied when sampling stops, and previously this was confirmed by checking that the List here was indeed an empty list. Now we have a new requirement that a final thread snapshot is taken when sampling is stopped, but because the StagingArea is also emptied there wasn't a reliable way to confirm an additional StackTrace instance was staged. This class is the compromise I came up with; it records the staged StackTraces (which is what most tests care about) as well as whether the empty method was called for a specific trace ID.

Also part of the context, and not at all possible to know from this PR, is a change that I have nearly ready (it's partially blocked by #2262) to completely remove this part of the design. Instead of staging and emptying by trace ID, the StagingArea will keep all StackTraces in a single bucket and export all of them at the same time periodically, rendering this class and the test that relies on it unnecessary.
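The recording test double described in this comment could be hand-rolled along these lines: keep everything that was staged even after empty() is called, and remember which trace IDs were emptied (a sketch with hypothetical names, using String stand-ins for the real StackTrace type):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical recording staging area for tests: it retains staged items so
// assertions still work after empty() has been called, and it records which
// trace IDs were emptied.
class RecordingStagingArea {
  private final Map<String, List<String>> staged = new HashMap<>();
  private final Set<String> emptied = new HashSet<>();

  void stage(String traceId, String stackTrace) {
    staged.computeIfAbsent(traceId, k -> new ArrayList<>()).add(stackTrace);
  }

  void empty(String traceId) {
    emptied.add(traceId); // record the call but keep the data for assertions
  }

  List<String> stagedFor(String traceId) {
    return staged.getOrDefault(traceId, List.of());
  }

  boolean wasEmptied(String traceId) {
    return emptied.contains(traceId);
  }
}
```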

Contributor

We tend to put a comment or use @VisibleForTesting so that the reader knows why things are visible.

Contributor Author

Sure, I'm familiar with that convention for production code, though this class exists in the test directory and is only available to tests (specifically those in the com.splunk.opentelemetry.profiler.snapshot package).


@tduncan another option is to use a mock StagingArea. That is what I did in #2276.

… that has never been used to stage a StackTrace.
class StackTraceGatherer implements Runnable {
private final Duration samplingPeriod;
private class ThreadSampler {
private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
Collaborator

Contributor Author

Addressed in new PR I just opened: #2283


void stop() {
scheduler.shutdown();
gatherer.run();
Collaborator

I don't get why you'd want to take a sample on end. That sample doesn't have any application code running; what use does it have?

Contributor Author

@tduncan tduncan Apr 21, 2025

It helps report more accurate estimated timing information. Prior to this change, all trace profiles would have timing data that was some multiple of 20ms, because every sample was submitted with a 20ms duration, and the total would almost always be more than 20ms longer than the reported span duration (#2277 also improves the default accuracy but will still be a multiple of 10ms).

Computing each sample period and taking a final sample greatly reduces the difference between the reported span duration and the reported profiling duration. They are still not the same, however; in fact, with this change the profiling duration consistently undercounts by a few milliseconds compared to the entry span duration.

Collaborator

Are you sure that using the timing data this way makes sense at all for the snapshot profiler? A stack sample proves that a method was executing at a given time; it does not reveal anything about how long the method was executing. I get the feeling that you are deriving arbitrary meaning here for the period. As far as I remember, profilers usually show percentages, like "this method was observed in 52% of samples".
I guess the SCHEDULER_INITIAL_DELAY is set to 0 for similar reasons.


This is the same strategy used by classic Java profilers. Unless methods are being instrumented directly, method timing is extrapolated from the samples received.

The assumption today is that, for a particular call stack, that call stack spent $INTERVAL_TIME running. Say the interval is 10ms:

  • A span that runs 5ms will show the method executing for 10ms (1 sample at start, which claims it took 10ms)
  • A span that runs 11ms will show the method executing for 20ms (2 samples, 10ms each)

Customers easily see this inaccuracy and wonder why method execution time is not accurate.

For our purposes, because we are not taking a continuous sample, we cannot assume the first sample took 10ms, because we know we start sampling when the method starts. Assuming the method had run for 5-10ms before we took a sample would be incorrect.

This extrapolation does lead to somewhat arbitrary inaccuracies. For example, given these stack traces:

Time: Stack
T0: java.lang.Thread.run() (synchronous)
T1: java.lang.Thread.run(), foo.bar()
T2: java.lang.Thread.run(), foo.bar()

Was the duration T1-T0 spent in run() or bar()? By necessity we have to extrapolate and guess; it's really a coin flip either way. What we can say for sure is that bar() ran for at least 10ms, with another 10ms likely spent in either run() or bar(), or even a method call that ran too fast to be sampled.
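The per-sample period being discussed can be sketched as computing the elapsed time since the previous sample instead of assuming the fixed configured interval (names hypothetical, not the PR's actual code):

```java
import java.time.Duration;

// Hypothetical sketch: each sample's duration is the elapsed time since the
// previous sample, so the first and last samples of a trace can be shorter
// than the configured sampling period.
class SamplePeriodTracker {
  private long previousTimestampNanos;

  SamplePeriodTracker(long startNanos) {
    this.previousTimestampNanos = startNanos;
  }

  // Returns the elapsed time since the last sample and advances the marker.
  Duration recordSample(long nowNanos) {
    Duration period = Duration.ofNanos(nowNanos - previousTimestampNanos);
    previousTimestampNanos = nowNanos;
    return period;
  }
}
```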

Collaborator

This is the same strategy used for classic java profilers. Unless methods are being instrumented directly then method timing is extrapolated the time based on the samples that it receives.

The difference is that a classic profiler usually has lots of samples. Here, even if you somehow combined samples from multiple requests to the same endpoint, it still wouldn't help, because all the samples are biased: they are taken at 10ms intervals.

This extrapolation does lead to somewhat arbitrary inaccuracies

If you already know that this is not going to work correctly, then perhaps you should consider alternatives. Whatever we choose to implement now is what we are going to be stuck with for the foreseeable future. If you believe that taking an extra sample at the start and end of the trace is the best option, then you will need to get this into the GDI spec along with whatever magic you need to compute the period.
IMO it would be nice to see you move from having a separate scheduler per profiled trace to something more standard, like a single scheduler that drives the sampling for all profiled traces.


"Arbitrary inaccuracy" is too harsh; the main point is that the strategy is consistent. We always say the time since the last sample was spent in the method that we see. We are looking at smaller time periods here, but it is still up to the user to understand that the time is approximate.

Instant now = Instant.now();
ThreadInfo threadInfo = threadMXBean.getThreadInfo(thread.getId(), Integer.MAX_VALUE);
SpanContext spanContext = retrieveActiveSpan(thread);
Duration samplingPeriod = samplingPeriod(timestampNanos, currentSampleTimestamp);
Collaborator

I have heard that the period isn't really used for snapshot profiling events; are you sure it is worth it to compute exact values? Secondly, since this code is called from the scheduled executor and also manually on stop, you have a bit of a race possibility here. I wouldn't be surprised if, with a debugger, you could create a situation where the computed period is negative.
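One defensive option for the negative-period race raised here would be to clamp the computed period at zero. A sketch, not the PR's actual code:

```java
import java.time.Duration;

class PeriodClamp {
  // Clamps the computed sampling period so a racy concurrent stop() call can
  // never produce a negative duration.
  static Duration samplingPeriod(long fromNanos, long toNanos) {
    return Duration.ofNanos(Math.max(0, toNanos - fromNanos));
  }
}
```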

Contributor Author

Previously the sampling period was ignored (see changes in AsyncStackTraceExporter). Now the sampling period included in the profiling log message will be the one computed by the StackTraceSampler. This data is eventually stored in the profiling backend and used to calculate the estimated total time each stack frame was running.


+1 @laurit, this timestampNanos needs to be thread-safe.

Collaborator

IMO the timestampNanos could just be the tip of the iceberg here. ThreadSampler#shutdown calls:

scheduler.shutdown();
gatherer.run();

ExecutorService.shutdown javadoc says

Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted. Invocation has no additional effect if already shut down.
This method does not wait for previously submitted tasks to complete execution. Use awaitTermination to do that.

I didn't try this out, but hypothetically, assuming the executor was already executing gatherer.run when shutdown was called, gatherer.run may run in parallel from the executor and from the explicit gatherer.run() call. Similarly, stagingArea.empty(spanContext.getTraceId()) called from ScheduledExecutorStackTraceSampler#stop could run in parallel with gatherer.run from the executor with some unlucky timing. I'd guess that when gatherer.run runs in parallel the previous-timestamp computations could be off; also, AccumulatingStagingArea internally uses an ArrayList that could probably be corrupted by concurrent mutations.
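One way to avoid the overlap described here would be to wait for any in-flight scheduled run to finish before taking the final sample, so the gatherer never runs concurrently with itself. A sketch under that assumption (names illustrative, not the PR's code):

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SafeShutdown {
  // Shut down the scheduler and block until any in-flight sampling run has
  // completed before taking the final on-demand sample.
  static void stop(ScheduledExecutorService scheduler, Runnable gatherer)
      throws InterruptedException {
    scheduler.shutdown();
    // shutdown() does not wait for running tasks; awaitTermination does.
    scheduler.awaitTermination(1, TimeUnit.SECONDS);
    gatherer.run(); // final sample, now guaranteed not to overlap
  }
}
```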


+1, AccumulatingStagingArea needs to be thread-safe. Internal use of ArrayList won't work.

Contributor Author

@tduncan tduncan Apr 23, 2025

@laurit, @lo-jason Thread-safety of AccumulatingStagingArea addressed in #2288.

private class StackTraceGatherer implements Runnable {
private final String traceId;
private final Thread thread;
private long timestampNanos;

timestampNanos is accessed from multiple threads, so it should be made volatile.
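A minimal sketch of the suggested fix, with the stack-capture step elided and the names hypothetical: marking the shared timestamp volatile so both the scheduler thread and the thread calling stop() see the latest value.

```java
// Hypothetical gatherer shape with a volatile shared timestamp, so writes
// from one thread are visible to the next run regardless of which thread
// executes it.
class GathererTimestampSketch implements Runnable {
  private volatile long timestampNanos = System.nanoTime();
  volatile long lastPeriodNanos; // exposed for illustration

  @Override
  public void run() {
    long now = System.nanoTime();
    lastPeriodNanos = now - timestampNanos; // period for this sample
    // ... capture and stage the stack trace here ...
    timestampNanos = now; // visible to whichever thread runs next
  }
}
```

Note that volatile only fixes visibility of this one field; the staging collection would still need its own synchronization.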

@laurit laurit merged commit 569d98b into signalfx:main Apr 25, 2025
26 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 25, 2025
@tduncan tduncan deleted the span-end-thread-sampling branch April 25, 2025 16:36