Take Final Stack Trace Sample When Trace Profiling Stops#2277
laurit merged 15 commits into signalfx:main
Conversation
… assuming an assumed duration based on the configured sampling period.
…plify the coordination of taking a final sample when trace sampling stops.
lo-jason left a comment
Looks good. I had a comment on Slack: https://splunk.slack.com/archives/C07MYK71B55/p1744845504377729?thread_ts=1744835907.680119&cid=C07MYK71B55
The callgraph executor ending early is a bug we should look to address before release as well.
-        traceId -> {
-          ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
-          scheduler.scheduleAtFixedRate(
-              new StackTraceGatherer(
-                  samplingPeriod, spanContext.getTraceId(), Thread.currentThread()),
-              SCHEDULER_INITIAL_DELAY,
-              samplingPeriod.toMillis(),
-              TimeUnit.MILLISECONDS);
-          return scheduler;
-        });
+        traceId -> new ThreadSampler(traceId, Thread.currentThread(), samplingPeriod));
Why is this change in here?
The creation and scheduling of StackTraceGatherer instances has moved into the construction of the new ThreadSampler, which now assumes responsibility for managing the thread profiling mechanism, including stopping the scheduled sampling and taking the final on-demand sample when trace profiling is stopped.
      }
    }

    private Duration samplingPeriod(long fromNanos, long toNanos) {
I don't think this method provides much -- I'd just inline it.
profiler/src/test/java/com/splunk/opentelemetry/profiler/snapshot/InMemoryStagingArea.java
          .collect(Collectors.toList());
    }

    static class StagedStackTraces extends AbstractList<StackTrace> {
This feels bad to me. It's also just a strange inner class that can claim to be "emptied" even though it still has content in the list, and can have content in the list yet still report being "emptied". I don't follow. Even if you do still somehow want to keep this, it needs to be private.
It can't be private as it is used in ScheduledExecutorStackTraceSamplerTest.
Testing that an unscheduled thread snapshot is taken when trace sampling is stopped created contradictory requirements for visibility into the StagingArea. Right now the expectation is that the StagingArea will be emptied when sampling stops, and previously this was confirmed by checking that the List here was indeed empty. Now we have a new requirement that a final thread snapshot is taken when sampling is stopped, but because the StagingArea is also emptied there wasn't a reliable way to confirm that an additional StackTrace instance was staged. This class is the compromise I came up with; it records the staged StackTraces (which is what most tests care about) as well as whether the empty method was called for a specific trace ID.
Also part of the context, and not at all possible to know from this PR, is a change I have nearly ready (it's partially blocked by #2262) to completely remove this part of the design. Instead of staging and emptying by trace ID, the StagingArea will keep all StackTraces in a single bucket and export all of them at the same time periodically, rendering this class and the test that relies on it unnecessary.
We tend to put a comment or use @VisibleForTesting so that the reader knows why things are visible.
Sure, I'm familiar with that convention for production code, though this class exists in the test directory and is only available to tests (specifically those in the com.splunk.opentelemetry.profiler.snapshot package).
… that has never been used to stage a StackTrace.
- class StackTraceGatherer implements Runnable {
-   private final Duration samplingPeriod;
+ private class ThreadSampler {
+   private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
imo it would be better to use daemon threads like in https://github.com/signalfx/splunk-otel-java/blob/main/profiler/src/main/java/com/splunk/opentelemetry/profiler/util/HelpfulExecutors.java
    void stop() {
      scheduler.shutdown();
      gatherer.run();
I don't get why you'd want to take a sample on end. That sample doesn't have any application code running; what use does it have?
It helps report more accurate estimated timing information. Prior to this change all trace profiles had timing data that was some multiple of 20, because every sample was submitted with a 20ms duration, and the total would almost always be more than 20ms longer than the reported span duration (#2277 also improves the default accuracy but will still be a multiple of 10ms).
Computing each sample period and taking a final sample greatly reduces the difference between the reported span duration and the reported profiling duration. They are still not the same, however; in fact, with this change the profiling duration consistently undercounts by a few milliseconds compared to the entry span duration.
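The per-sample period computation described here can be sketched as follows. This is an illustration, not the PR's actual code: the method name mirrors the `samplingPeriod(long fromNanos, long toNanos)` helper in the diff, but the clamp to zero is an assumption added to guard against clock anomalies.

```java
import java.time.Duration;

// Sketch: attribute to each sample the time actually elapsed since the
// previous sample, instead of assuming the configured sampling period.
class SamplingPeriods {
  // Clamping to zero is an illustrative safeguard, not from the PR.
  static Duration samplingPeriod(long fromNanos, long toNanos) {
    return Duration.ofNanos(Math.max(0, toNanos - fromNanos));
  }

  public static void main(String[] args) {
    // A final on-demand sample taken 7.5ms after the previous one reports
    // ~7ms instead of a full 20ms period.
    System.out.println(samplingPeriod(0, 7_500_000).toMillis()); // prints 7
  }
}
```

With this shape, the final sample taken on stop naturally carries a duration smaller than the configured period, which is what narrows the gap with the entry span duration.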
Are you sure that using the timing data this way makes sense at all for the snapshot profiler? A stack sample proves that a method was executing at a given time. It does not reveal anything about how long the method was executing. I get the feeling that you are deriving arbitrary meaning here for the period. As far as I remember, profilers usually show percentages, e.g. this method was observed in 52% of samples.
I guess the SCHEDULER_INITIAL_DELAY is set to 0 for similar reasons.
This is the same strategy used by classic Java profilers. Unless methods are instrumented directly, method timing is extrapolated from the samples received.
The assumption today is that a particular callstack spent $INTERVAL_TIME running. Say the interval is 10ms.
- Span that runs 5ms will show method was executing for 10ms (1 sample on start, which says it took 10ms)
- Span that runs for 11ms will show method was executing for 20ms (2 samples, 10ms each)
Customers easily see this inaccuracy and wonder why method execution time is not accurate.
For our purposes because we are not taking a continuous sample we cannot assume the first sample took 10ms because we know we are starting this sample when the method starts. Making an assumption that the method has run for 5-10ms before we took a sample would be incorrect.
This extrapolation does lead to somewhat arbitrary inaccuracies. For example, given these stack traces:
Time: Stack
T0: java.lang.Thread.run() (synchronous)
T1: java.lang.Thread.run(), foo.bar()
T2: java.lang.Thread.run(), foo.bar()
Was the duration T1-T0 spent in run() or bar()? By necessity we have to extrapolate and guess. It is really a coin flip and we could go either way. What we can say for sure is that bar() ran for at least 10ms, and some combination of 10ms was likely spent in either run() or bar(), or even in a method call that ran too fast to be tracked.
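The attribution described above can be modeled with a toy sketch (not project code; the "coin flip" is resolved here by charging each interval to the leaf frame observed at the sample that closes it, and all names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the extrapolation: each inter-sample interval is charged to
// the leaf frame observed at the sample that ends the interval.
class SelfTimeEstimator {
  static Map<String, Long> estimate(long[] sampleTimesMs, String[] leafFrames) {
    Map<String, Long> estimatedMs = new LinkedHashMap<>();
    for (int i = 1; i < sampleTimesMs.length; i++) {
      long interval = sampleTimesMs[i] - sampleTimesMs[i - 1];
      estimatedMs.merge(leafFrames[i], interval, Long::sum);
    }
    return estimatedMs;
  }

  public static void main(String[] args) {
    // T0: run(), T1: run() -> foo.bar(), T2: run() -> foo.bar()
    long[] times = {0, 10, 20};
    String[] leaves = {"Thread.run", "foo.bar", "foo.bar"};
    // Both intervals end on a sample whose leaf is foo.bar, so all 20ms are
    // charged to it -- the coin flip from the discussion above.
    System.out.println(estimate(times, leaves)); // prints {foo.bar=20}
  }
}
```

Under this policy the T1-T0 interval lands on bar() simply because bar() was on top at T1; the opposite policy (charging the interval to the frame seen at its start) would land it on run() instead.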
> This is the same strategy used by classic Java profilers. Unless methods are instrumented directly, method timing is extrapolated from the samples received.
The difference is that a classic profiler usually has lots of samples. Here, even if you somehow combined samples from multiple requests to the same endpoint it still wouldn't help, because all the samples are biased: they are taken at 10ms intervals.
> This extrapolation does lead to somewhat arbitrary inaccuracies
If you already know that this is not going to work correctly then perhaps you should consider alternatives. Whatever we choose to implement now is what we are going to be stuck with for the foreseeable future. If you believe that taking an extra sample at the start and end of the trace is the best option then you will need to get this into the gdi spec, along with whatever magic you need to do to compute the period.
Imo it would be nice to see you moving from having a separate scheduler per profiled trace into something more standard like having a single scheduler that drives the sampling for all profiled traces.
"Arbitrary inaccuracy" is too harsh; the main point is that the strategy is consistent. We always attribute the time since the last sample to the method that we see. We are looking at smaller time periods here, but it is still up to the user to understand that the reported time is approximate.
      Instant now = Instant.now();
      ThreadInfo threadInfo = threadMXBean.getThreadInfo(thread.getId(), Integer.MAX_VALUE);
      SpanContext spanContext = retrieveActiveSpan(thread);
      Duration samplingPeriod = samplingPeriod(timestampNanos, currentSampleTimestamp);
I have heard that the period isn't really used for snapshot profiling events; are you sure it is worth computing exact values? Secondly, since this code is called from the scheduled executor and also manually on stop, you have a bit of a race possibility here. I wouldn't be surprised if, with a debugger, you could create a situation where the computed period is negative.
Previously the sampling period was ignored (see changes in AsyncStackTraceExporter). Now the sampling period included in the profiling log message will be the one computed by the StackTraceSampler. This data is eventually stored in the profiling backend and used to calculate the estimated total time each stack frame was running.
+1 @laurit, this timestampNanos needs to be thread-safe
Imo the timestampNanos could just be the tip of the iceberg here. ThreadSampler#shutdown calls
scheduler.shutdown();
gatherer.run();
ExecutorService.shutdown javadoc says
Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted. Invocation has no additional effect if already shut down.
This method does not wait for previously submitted tasks to complete execution. Use awaitTermination to do that.
I didn't try this out, but hypothetically, assuming the executor was already executing gatherer.run when shutdown was called, gatherer.run may run in parallel from the executor and from the explicit gatherer.run() call. Similarly, stagingArea.empty(spanContext.getTraceId()) called from ScheduledExecutorStackTraceSampler#stop could run in parallel with gatherer.run from the executor with some unlucky timing. I'd guess that when gatherer.run runs in parallel the previous-timestamp computations could be off; also, AccumulatingStagingArea internally uses an ArrayList that could probably be corrupted by concurrent mutations.
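One way to close this race is to drain the scheduler before taking the final on-demand sample. The sketch below is illustrative, not the PR's code: the names (scheduler, gatherer) mirror the diff, while the one-second timeout and the shutdownNow fallback are assumed choices.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: wait for any in-flight scheduled run to finish before taking the
// final sample, so the gatherer never runs concurrently with itself.
class SafeStop {
  static void stop(ScheduledExecutorService scheduler, Runnable gatherer)
      throws InterruptedException {
    scheduler.shutdown();
    // Per the ExecutorService javadoc quoted above, shutdown() does not wait
    // for tasks to complete; awaitTermination() does. The timeout is an
    // illustrative choice.
    if (!scheduler.awaitTermination(1, TimeUnit.SECONDS)) {
      scheduler.shutdownNow();
    }
    gatherer.run(); // final sample, taken after all scheduled runs finished
  }

  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger samples = new AtomicInteger();
    Runnable gatherer = samples::incrementAndGet;
    scheduler.scheduleAtFixedRate(gatherer, 0, 10, TimeUnit.MILLISECONDS);
    Thread.sleep(25);
    stop(scheduler, gatherer);
    System.out.println(samples.get() >= 2); // scheduled runs plus the final sample
  }
}
```

After stop() returns, no scheduled run is still in flight, so emptying the staging area afterwards cannot overlap with a concurrent gatherer.run either.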
+1, AccumulatingStagingArea needs to be thread-safe. Internal use of ArrayList won't work.
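A thread-safe staging area keyed by trace ID could look like the sketch below. The stage/empty method names follow the discussion, but the implementation is an illustration, not the project's AccumulatingStagingArea, and String stands in for the StackTrace type to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: a concurrent map of per-trace synchronized lists, so stage() from
// the sampler thread and empty() from the stopping thread cannot corrupt
// the underlying list.
class ThreadSafeStagingArea {
  private final ConcurrentMap<String, List<String>> staged = new ConcurrentHashMap<>();

  void stage(String traceId, String stackTrace) {
    staged
        .computeIfAbsent(traceId, id -> Collections.synchronizedList(new ArrayList<>()))
        .add(stackTrace);
  }

  // Atomically detach the per-trace list: a concurrent stage() will start a
  // fresh list rather than mutate the one being drained.
  List<String> empty(String traceId) {
    List<String> traces = staged.remove(traceId);
    return traces == null ? List.of() : traces;
  }

  public static void main(String[] args) {
    ThreadSafeStagingArea area = new ThreadSafeStagingArea();
    area.stage("trace-1", "stack-a");
    area.stage("trace-1", "stack-b");
    System.out.println(area.empty("trace-1").size()); // prints 2
    System.out.println(area.empty("trace-1").size()); // prints 0
  }
}
```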
profiler/src/test/java/com/splunk/opentelemetry/profiler/snapshot/InMemoryStagingArea.java
  private class StackTraceGatherer implements Runnable {
    private final String traceId;
    private final Thread thread;
    private long timestampNanos;
timestampNanos is accessed from multiple threads, so it should be made volatile
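The suggested fix can be sketched minimally as follows. The field and class names follow the diff above, but the body is illustrative; note that volatile gives cross-thread visibility of the latest write, not atomicity of the read-modify-write, which is why the earlier suggestion to serialize stop() against scheduled runs still matters.

```java
// Sketch: timestampNanos is written by the scheduler thread on each run and
// read by the stopping thread when the final sample is taken, so it is
// declared volatile for visibility. The sampling work itself is elided.
class StackTraceGathererSketch implements Runnable {
  volatile long timestampNanos = System.nanoTime();

  @Override
  public void run() {
    long now = System.nanoTime();
    long periodNanos = now - timestampNanos; // time since the previous sample
    timestampNanos = now;
    // ... take the stack trace and stage it with periodNanos ...
  }

  public static void main(String[] args) {
    StackTraceGathererSketch gatherer = new StackTraceGathererSketch();
    long before = gatherer.timestampNanos;
    gatherer.run();
    System.out.println(gatherer.timestampNanos >= before); // prints true
  }
}
```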
This PR adds a final stack trace sample taken when stack trace sampling is stopped for a given trace. Upon stopping sampling for a given trace ID, a final stack trace sample will be taken on demand, with a duration matching more closely the amount of time elapsed between when the previous sample was taken and when the service entry span ends.
In addition, the mechanism by which the stack trace duration is calculated has been updated to compute the value on every iteration rather than assume the configured sampling duration. This has the effect that at sampling start and stop a value likely much smaller than the configured sampling period will be calculated, and normal executions of the sampler will report computed values rather than an assumed value of the sampling period.
Note: This PR and #2276 are addressing the same issue.