-
Notifications
You must be signed in to change notification settings - Fork 45
Take Final Stack Trace Sample When Trace Profiling Stops #2277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 9 commits
3510ac2
42ac0ee
2eb9b5d
951ca3f
3923999
946bedd
43de019
2a1c2df
ed53097
37bd482
799047c
c94173c
cf65e02
56e9558
a592ae0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -36,8 +36,7 @@ class ScheduledExecutorStackTraceSampler implements StackTraceSampler { | |
| Logger.getLogger(ScheduledExecutorStackTraceSampler.class.getName()); | ||
| private static final int SCHEDULER_INITIAL_DELAY = 0; | ||
|
|
||
| private final ConcurrentMap<String, ScheduledExecutorService> samplers = | ||
| new ConcurrentHashMap<>(); | ||
| private final ConcurrentMap<String, ThreadSampler> samplers = new ConcurrentHashMap<>(); | ||
| private final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean(); | ||
| private final StagingArea stagingArea; | ||
| private final Supplier<SpanTracker> spanTracker; | ||
|
|
@@ -54,52 +53,68 @@ class ScheduledExecutorStackTraceSampler implements StackTraceSampler { | |
| public void start(SpanContext spanContext) { | ||
| samplers.computeIfAbsent( | ||
| spanContext.getTraceId(), | ||
| traceId -> { | ||
| ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(); | ||
| scheduler.scheduleAtFixedRate( | ||
| new StackTraceGatherer( | ||
| samplingPeriod, spanContext.getTraceId(), Thread.currentThread()), | ||
| SCHEDULER_INITIAL_DELAY, | ||
| samplingPeriod.toMillis(), | ||
| TimeUnit.MILLISECONDS); | ||
| return scheduler; | ||
| }); | ||
| traceId -> new ThreadSampler(traceId, Thread.currentThread(), samplingPeriod)); | ||
| } | ||
|
|
||
| @Override | ||
| public void stop(SpanContext spanContext) { | ||
| ScheduledExecutorService scheduler = samplers.remove(spanContext.getTraceId()); | ||
| if (scheduler != null) { | ||
| scheduler.shutdown(); | ||
| String traceId = spanContext.getTraceId(); | ||
| ThreadSampler sampler = samplers.remove(traceId); | ||
| if (sampler != null) { | ||
| sampler.stop(); | ||
| } | ||
|
|
||
| stagingArea.empty(spanContext.getTraceId()); | ||
| } | ||
|
|
||
| class StackTraceGatherer implements Runnable { | ||
| private final Duration samplingPeriod; | ||
| private class ThreadSampler { | ||
| private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(); | ||
|
||
| private final StackTraceGatherer gatherer; | ||
|
|
||
| private ThreadSampler(String traceId, Thread thread, Duration period) { | ||
| gatherer = new StackTraceGatherer(traceId, thread, System.nanoTime()); | ||
| scheduler.scheduleAtFixedRate( | ||
| gatherer, SCHEDULER_INITIAL_DELAY, period.toMillis(), TimeUnit.MILLISECONDS); | ||
| } | ||
|
|
||
| void stop() { | ||
| scheduler.shutdown(); | ||
| gatherer.run(); | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't get it why you'd want to get a sample on end. That sample doesn't have any application code running, what use does it have?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It helps report more accurate estimated timing information. Prior to this change all the trace profiles would have timing data that was some multiple of 20 because every sample was submitted with a 20ms duration, and would almost always be more than 20ms longer than the reported span duration (#2277 also improves the default accuracy but will still be a multiple of 10ms). Computing each sample period and taking a final sample greatly reduces the difference between the report span duration and the reported profiling duration. They are still not the same, however. In fact with this change the profiling duration is consistently undercounting the duration by a few milliseconds when compared to the duration of the entry span duration.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are you sure that using the timing data this way makes sense at all for snapshot profiler? A stack sample proves that a method was executed at given time. It does not reveal anything about how long the method was executing. I get the feeling that you are deriving arbitrary meaning here for the period. As far as I remember profilers usually show percentages, like this method was observed in 52% of samples. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the same strategy used for classic java profilers. Unless methods are being instrumented directly then method timing is extrapolated the time based on the samples that it receives. The assumption today is that for a particular callstack. That callstack spent $INTERVAL_TIME running. Say the interval is 10ms.
For our purposes because we are not taking a continuous sample we cannot assume the first sample took 10ms because we know we are starting this sample when the method starts. Making an assumption that the method has run for 5-10ms before we took a sample would be incorrect. This extrapolation does lead to somewhat arbitrary inaccuracies for example these stack traces Was the duration T1-T0 spent in run() or bar()? By necessity we have to extrapolate and guess. It is really a coin flip that we can go either way. What we can say for sure is that bar() ran for at least 10ms. And some combination of 10ms spent likely in either run() or bar() or even a method call that ran too fast to be tracked.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The difference is that a classic profiler has usually lots of samples. Here even if you somehow combined samples from multiple requests to the same endpoint it still wouldn't help because all the samples are biased because they are taken at 10ms intervals.
If you already know that this is not going to work correctly then perhaps you should considered alternatives. Whatever we choose to implement now is what we are going to be stuck with for the foreseeable future. If you believe that taking an extra sample at the start and end of the trace is the best option then you will need to get this into the gdi spec along with whatever magic you need to do to compute the period. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Arbitrary inaccuracy is too harsh, the main point is that the strategy is consistent. We always say for the time since last sample it is spent in the method that we see. We are looking at smaller time periods here, but still it is for the user to understand approximate time taken. |
||
| } | ||
| } | ||
|
|
||
| private class StackTraceGatherer implements Runnable { | ||
| private final String traceId; | ||
| private final Thread thread; | ||
| private long timestampNanos; | ||
|
||
|
|
||
| StackTraceGatherer(Duration samplingPeriod, String traceId, Thread thread) { | ||
| this.samplingPeriod = samplingPeriod; | ||
| StackTraceGatherer(String traceId, Thread thread, long timestampNanos) { | ||
| this.traceId = traceId; | ||
| this.thread = thread; | ||
| this.timestampNanos = timestampNanos; | ||
| } | ||
|
|
||
| @Override | ||
| public void run() { | ||
| long currentSampleTimestamp = System.nanoTime(); | ||
| try { | ||
| Instant now = Instant.now(); | ||
| ThreadInfo threadInfo = threadMXBean.getThreadInfo(thread.getId(), Integer.MAX_VALUE); | ||
| SpanContext spanContext = retrieveActiveSpan(thread); | ||
| Duration samplingPeriod = samplingPeriod(timestampNanos, currentSampleTimestamp); | ||
|
||
| String spanId = retrieveActiveSpan(thread).getSpanId(); | ||
| StackTrace stackTrace = | ||
| StackTrace.from(now, samplingPeriod, threadInfo, traceId, spanContext.getSpanId()); | ||
| StackTrace.from(Instant.now(), samplingPeriod, threadInfo, traceId, spanId); | ||
| stagingArea.stage(traceId, stackTrace); | ||
| } catch (Exception e) { | ||
| logger.log(Level.SEVERE, e, samplerErrorMessage(traceId, thread.getId())); | ||
| } finally { | ||
| timestampNanos = currentSampleTimestamp; | ||
| } | ||
| } | ||
|
|
||
| private Duration samplingPeriod(long fromNanos, long toNanos) { | ||
|
||
| return Duration.ofNanos(toNanos - fromNanos); | ||
| } | ||
|
|
||
| private SpanContext retrieveActiveSpan(Thread thread) { | ||
| return spanTracker.get().getActiveSpan(thread).orElse(SpanContext.getInvalid()); | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change in here?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The creation and scheduling of
StackTraceGathererinstances has moved into the construction of the newThreadSamplerwhich now assumes responsibility of managing thread profiling mechanism, including stopping the scheduled sampling and taking the final on demand sample when trace profiling is stopped.