Take Final Stack Trace Sample When Trace Profiling Stops#2277
laurit merged 15 commits into signalfx:main
Conversation
… assuming an assumed duration based on the configured sampling period.
…plify the coordination of taking a final sample when trace sampling stops.
lo-jason left a comment
Looks good. I had a comment on Slack: https://splunk.slack.com/archives/C07MYK71B55/p1744845504377729?thread_ts=1744835907.680119&cid=C07MYK71B55
The callgraph executor ending early is a bug we should look to address before release as well.
-        traceId -> {
-          ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
-          scheduler.scheduleAtFixedRate(
-              new StackTraceGatherer(
-                  samplingPeriod, spanContext.getTraceId(), Thread.currentThread()),
-              SCHEDULER_INITIAL_DELAY,
-              samplingPeriod.toMillis(),
-              TimeUnit.MILLISECONDS);
-          return scheduler;
-        });
+        traceId -> new ThreadSampler(traceId, Thread.currentThread(), samplingPeriod));
Why is this change in here?
The creation and scheduling of StackTraceGatherer instances has moved into the construction of the new ThreadSampler, which now assumes responsibility for managing the thread profiling mechanism, including stopping the scheduled sampling and taking the final on-demand sample when trace profiling is stopped.
      }
    }

    private Duration samplingPeriod(long fromNanos, long toNanos) {
I don't think this method provides much -- I'd just inline it.
profiler/src/test/java/com/splunk/opentelemetry/profiler/snapshot/InMemoryStagingArea.java
          .collect(Collectors.toList());
    }

    static class StagedStackTraces extends AbstractList<StackTrace> {
This feels bad to me. It's also just a strange inner class that can claim to be "emptied" even though it still has content in the list, and can have content in the list yet still report being "emptied". I don't follow. Even if you do still somehow want to keep this, it needs to be private.
It can't be private as it is used in ScheduledExecutorStackTraceSamplerTest.
Testing that an unscheduled thread snapshot is taken when trace sampling is stopped created contradictory requirements for visibility into the StagingArea. Right now the expectation is that the StagingArea will be emptied when sampling stops, and previously this was confirmed by checking that the List here was indeed empty. Now we have a new requirement that a final thread snapshot is taken when sampling is stopped, but because the StagingArea is also emptied there wasn't a reliable way to confirm that an additional StackTrace instance was staged. This class is the compromise I came up with; it records the staged StackTraces (which is what most tests care about) as well as whether the empty method was called for a specific trace ID.
Also part of the context, and not at all possible to know from this PR, is a change I have nearly ready (it's partially blocked by #2262) to completely remove this part of the design. Instead of staging and emptying by trace ID, the StagingArea will keep all StackTraces in a single bucket and export all of them at the same time periodically, rendering this class and the test that relies on it unnecessary.
We tend to put a comment or use @VisibleForTesting so that the reader knows why things are visible.
Sure, I'm familiar with that convention for production code, though this class exists in the test directory and is only available to tests (specifically those in the com.splunk.opentelemetry.profiler.snapshot package).
… that has never been used to stage a StackTrace.
- class StackTraceGatherer implements Runnable {
-   private final Duration samplingPeriod;
+ private class ThreadSampler {
+   private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
imo it would be better to use daemon threads like in https://github.com/signalfx/splunk-otel-java/blob/main/profiler/src/main/java/com/splunk/opentelemetry/profiler/util/HelpfulExecutors.java
    void stop() {
      scheduler.shutdown();
      gatherer.run();
I don't get why you'd want to take a sample on end. That sample doesn't have any application code running; what use does it have?
It helps report more accurate estimated timing information. Prior to this change all trace profiles had timing data that was some multiple of 20, because every sample was submitted with a 20ms duration, and the total would almost always be more than 20ms longer than the reported span duration (#2277 also improves the default accuracy but will still be a multiple of 10ms).
Computing each sample period and taking a final sample greatly reduces the difference between the reported span duration and the reported profiling duration. They are still not the same, however; in fact, with this change the profiling duration consistently undercounts by a few milliseconds compared to the entry span duration.
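The per-sample period computation described here can be sketched as follows. This is an illustration, not the PR's actual code: the method name mirrors the `samplingPeriod(long fromNanos, long toNanos)` helper in the diff, but the clamp to zero is an assumption added to guard against clock anomalies.

```java
import java.time.Duration;

// Sketch: attribute to each sample the time actually elapsed since the
// previous sample, instead of assuming the configured sampling period.
class SamplingPeriods {
  // Clamping to zero is an illustrative safeguard, not from the PR.
  static Duration samplingPeriod(long fromNanos, long toNanos) {
    return Duration.ofNanos(Math.max(0, toNanos - fromNanos));
  }

  public static void main(String[] args) {
    // A final on-demand sample taken 7.5ms after the previous one reports
    // ~7ms instead of a full 20ms period.
    System.out.println(samplingPeriod(0, 7_500_000).toMillis()); // prints 7
  }
}
```

With this shape, the final sample taken on stop naturally carries a duration smaller than the configured period, which is what narrows the gap with the entry span duration.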
Are you sure that using the timing data this way makes sense at all for the snapshot profiler? A stack sample proves that a method was executing at a given time. It does not reveal anything about how long the method was executing. I get the feeling that you are deriving arbitrary meaning here for the period. As far as I remember, profilers usually show percentages, e.g. this method was observed in 52% of samples.
I guess the SCHEDULER_INITIAL_DELAY is set to 0 for similar reasons.
This is the same strategy used by classic Java profilers. Unless methods are instrumented directly, method timing is extrapolated from the samples received.
The assumption today is that a particular callstack spent $INTERVAL_TIME running. Say the interval is 10ms.
- Span that runs 5ms will show method was executing for 10ms (1 sample on start, which says it took 10ms)
- Span that runs for 11ms will show method was executing for 20ms (2 samples, 10ms each)
Customers easily see this inaccuracy and wonder why method execution time is not accurate.
For our purposes because we are not taking a continuous sample we cannot assume the first sample took 10ms because we know we are starting this sample when the method starts. Making an assumption that the method has run for 5-10ms before we took a sample would be incorrect.
This extrapolation does lead to somewhat arbitrary inaccuracies. For example, given these stack traces:
Time: Stack
T0: java.lang.Thread.run() (synchronous)
T1: java.lang.Thread.run(), foo.bar()
T2: java.lang.Thread.run(), foo.bar()
Was the duration T1-T0 spent in run() or bar()? By necessity we have to extrapolate and guess. It is really a coin flip and we could go either way. What we can say for sure is that bar() ran for at least 10ms, and some combination of 10ms was likely spent in either run() or bar(), or even in a method call that ran too fast to be tracked.
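The attribution described above can be modeled with a toy sketch (not project code; the "coin flip" is resolved here by charging each interval to the leaf frame observed at the sample that closes it, and all names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the extrapolation: each inter-sample interval is charged to
// the leaf frame observed at the sample that ends the interval.
class SelfTimeEstimator {
  static Map<String, Long> estimate(long[] sampleTimesMs, String[] leafFrames) {
    Map<String, Long> estimatedMs = new LinkedHashMap<>();
    for (int i = 1; i < sampleTimesMs.length; i++) {
      long interval = sampleTimesMs[i] - sampleTimesMs[i - 1];
      estimatedMs.merge(leafFrames[i], interval, Long::sum);
    }
    return estimatedMs;
  }

  public static void main(String[] args) {
    // T0: run(), T1: run() -> foo.bar(), T2: run() -> foo.bar()
    long[] times = {0, 10, 20};
    String[] leaves = {"Thread.run", "foo.bar", "foo.bar"};
    // Both intervals end on a sample whose leaf is foo.bar, so all 20ms are
    // charged to it -- the coin flip from the discussion above.
    System.out.println(estimate(times, leaves)); // prints {foo.bar=20}
  }
}
```

Under this policy the T1-T0 interval lands on bar() simply because bar() was on top at T1; the opposite policy (charging the interval to the frame seen at its start) would land it on run() instead.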
> This is the same strategy used by classic Java profilers. Unless methods are instrumented directly, method timing is extrapolated from the samples received.
The difference is that a classic profiler usually has lots of samples. Here, even if you somehow combined samples from multiple requests to the same endpoint it still wouldn't help, because all the samples are biased: they are taken at 10ms intervals.
> This extrapolation does lead to somewhat arbitrary inaccuracies
If you already know that this is not going to work correctly then perhaps you should consider alternatives. Whatever we choose to implement now is what we are going to be stuck with for the foreseeable future. If you believe that taking an extra sample at the start and end of the trace is the best option then you will need to get this into the gdi spec, along with whatever magic you need to do to compute the period.
Imo it would be nice to see you moving from having a separate scheduler per profiled trace into something more standard like having a single scheduler that drives the sampling for all profiled traces.
"Arbitrary inaccuracy" is too harsh; the main point is that the strategy is consistent. We always attribute the time since the last sample to the method that we see. We are looking at smaller time periods here, but it is still up to the user to understand that the reported time is approximate.
      Instant now = Instant.now();
      ThreadInfo threadInfo = threadMXBean.getThreadInfo(thread.getId(), Integer.MAX_VALUE);
      SpanContext spanContext = retrieveActiveSpan(thread);
      Duration samplingPeriod = samplingPeriod(timestampNanos, currentSampleTimestamp);
I have heard that the period isn't really used for snapshot profiling events; are you sure it is worth computing exact values? Secondly, since this code is called from the scheduled executor and also manually on stop, you have a bit of a race possibility here. I wouldn't be surprised if, with a debugger, you could create a situation where the computed period is negative.
Previously the sampling period was ignored (see changes in AsyncStackTraceExporter). Now the sampling period included in the profiling log message will be the one computed by the StackTraceSampler. This data is eventually stored in the profiling backend and used to calculate the estimated total time each stack frame was running.
+1 @laurit, this timestampNanos needs to be thread-safe
Imo the timestampNanos could just be the tip of the iceberg here. ThreadSampler#shutdown calls
scheduler.shutdown();
gatherer.run();
ExecutorService.shutdown javadoc says
Initiates an orderly shutdown in which previously submitted tasks are executed, but no new tasks will be accepted. Invocation has no additional effect if already shut down.
This method does not wait for previously submitted tasks to complete execution. Use awaitTermination to do that.
I didn't try this out, but hypothetically, assuming the executor was already executing gatherer.run when shutdown was called, gatherer.run may run in parallel from the executor and from the explicit gatherer.run() call. Similarly, stagingArea.empty(spanContext.getTraceId()) called from ScheduledExecutorStackTraceSampler#stop could run in parallel with gatherer.run from the executor with some unlucky timing. I'd guess that when gatherer.run runs in parallel the previous-timestamp computations could be off; also, AccumulatingStagingArea internally uses an ArrayList that could probably be corrupted by concurrent mutations.
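One way to close this race is to drain the scheduler before taking the final on-demand sample. The sketch below is illustrative, not the PR's code: the names (scheduler, gatherer) mirror the diff, while the one-second timeout and the shutdownNow fallback are assumed choices.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: wait for any in-flight scheduled run to finish before taking the
// final sample, so the gatherer never runs concurrently with itself.
class SafeStop {
  static void stop(ScheduledExecutorService scheduler, Runnable gatherer)
      throws InterruptedException {
    scheduler.shutdown();
    // Per the ExecutorService javadoc quoted above, shutdown() does not wait
    // for tasks to complete; awaitTermination() does. The timeout is an
    // illustrative choice.
    if (!scheduler.awaitTermination(1, TimeUnit.SECONDS)) {
      scheduler.shutdownNow();
    }
    gatherer.run(); // final sample, taken after all scheduled runs finished
  }

  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger samples = new AtomicInteger();
    Runnable gatherer = samples::incrementAndGet;
    scheduler.scheduleAtFixedRate(gatherer, 0, 10, TimeUnit.MILLISECONDS);
    Thread.sleep(25);
    stop(scheduler, gatherer);
    System.out.println(samples.get() >= 2); // scheduled runs plus the final sample
  }
}
```

After stop() returns, no scheduled run is still in flight, so emptying the staging area afterwards cannot overlap with a concurrent gatherer.run either.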
+1, AccumulatingStagingArea needs to be thread-safe. Internal use of ArrayList won't work.
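A thread-safe staging area keyed by trace ID could look like the sketch below. The stage/empty method names follow the discussion, but the implementation is an illustration, not the project's AccumulatingStagingArea, and String stands in for the StackTrace type to keep it self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: a concurrent map of per-trace synchronized lists, so stage() from
// the sampler thread and empty() from the stopping thread cannot corrupt
// the underlying list.
class ThreadSafeStagingArea {
  private final ConcurrentMap<String, List<String>> staged = new ConcurrentHashMap<>();

  void stage(String traceId, String stackTrace) {
    staged
        .computeIfAbsent(traceId, id -> Collections.synchronizedList(new ArrayList<>()))
        .add(stackTrace);
  }

  // Atomically detach the per-trace list: a concurrent stage() will start a
  // fresh list rather than mutate the one being drained.
  List<String> empty(String traceId) {
    List<String> traces = staged.remove(traceId);
    return traces == null ? List.of() : traces;
  }

  public static void main(String[] args) {
    ThreadSafeStagingArea area = new ThreadSafeStagingArea();
    area.stage("trace-1", "stack-a");
    area.stage("trace-1", "stack-b");
    System.out.println(area.empty("trace-1").size()); // prints 2
    System.out.println(area.empty("trace-1").size()); // prints 0
  }
}
```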
profiler/src/test/java/com/splunk/opentelemetry/profiler/snapshot/InMemoryStagingArea.java
  private class StackTraceGatherer implements Runnable {
    private final String traceId;
    private final Thread thread;
    private long timestampNanos;
timestampNanos is accessed from multiple threads, so it should be made volatile
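The suggested fix can be sketched minimally as follows. The field and class names follow the diff above, but the body is illustrative; note that volatile gives cross-thread visibility of the latest write, not atomicity of the read-modify-write, which is why the earlier suggestion to serialize stop() against scheduled runs still matters.

```java
// Sketch: timestampNanos is written by the scheduler thread on each run and
// read by the stopping thread when the final sample is taken, so it is
// declared volatile for visibility. The sampling work itself is elided.
class StackTraceGathererSketch implements Runnable {
  volatile long timestampNanos = System.nanoTime();

  @Override
  public void run() {
    long now = System.nanoTime();
    long periodNanos = now - timestampNanos; // time since the previous sample
    timestampNanos = now;
    // ... take the stack trace and stage it with periodNanos ...
  }

  public static void main(String[] args) {
    StackTraceGathererSketch gatherer = new StackTraceGathererSketch();
    long before = gatherer.timestampNanos;
    gatherer.run();
    System.out.println(gatherer.timestampNanos >= before); // prints true
  }
}
```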
This PR adds a final stack trace sample taken when stack trace sampling is stopped for a given trace. Upon stopping sampling for a given trace ID, a final stack trace sample will be taken on demand, with a duration matching more closely the amount of time elapsed between when the previous sample was taken and when the service entry span ends.
In addition, the mechanism by which the stack trace duration is calculated has been updated to compute the value on every iteration rather than assume the configured sampling duration. This has the effect that at sampling start and stop a value likely much smaller than the configured sampling period will be calculated, and normal executions of the sampler will report computed values rather than an assumed value of the sampling period.
Note: This PR and #2276 are addressing the same issue.