Periodically Export Stacktraces From StagingArea #2298
Conversation
…odicallyExportingStagingArea.
…a and export StackTraces when closed.
    private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_CAPACITY =
        "splunk.snapshot.profiler.staging.capacity";
    private static final int DEFAULT_SNAPSHOT_PROFILER_STAGING_CAPACITY = 2000;
I don't have a reason for the chosen defaults. Happy to change them.
I think 2000 is reasonable for start.
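For illustration, a minimal sketch of how a capacity setting like this might be resolved with a fallback default. This assumes a plain `Map<String, String>` properties lookup; the class and method names here are hypothetical, not the actual configuration API used by the agent.

```java
import java.util.Map;

class StagingCapacityConfig {
  // Key and default mirror the values shown in the diff above.
  static final String CONFIG_KEY = "splunk.snapshot.profiler.staging.capacity";
  static final int DEFAULT_CAPACITY = 2000;

  // Returns the configured capacity, falling back to the default when the
  // key is absent or unparsable.
  static int resolveCapacity(Map<String, String> properties) {
    String value = properties.get(CONFIG_KEY);
    if (value == null) {
      return DEFAULT_CAPACITY;
    }
    try {
      return Integer.parseInt(value.trim());
    } catch (NumberFormatException e) {
      return DEFAULT_CAPACITY;
    }
  }
}
```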
profiler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/StackTrace.java
profiler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/StagingArea.java
...c/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicallyExportingStagingArea.java
...c/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicallyExportingStagingArea.java
    try {
      scheduler.shutdown();
      if (!scheduler.awaitTermination(1, TimeUnit.SECONDS)) {
        scheduler.shutdownNow();
      }
    } catch (InterruptedException e) {
      scheduler.shutdownNow();
      Thread.currentThread().interrupt();
    }
IMO calling shutdownNow does not make sense here. If you really care whether it manages to export, you should use a more generous timeout than 1s.
We can't wait too long; the OpenTelemetry SDK has, I believe, a 10-second timeout for the entire shutdown process.
I think that limit doesn't really apply here. The SDK waits 10s for the CompletableResultCode to produce a result, but it does not limit how long producing that CompletableResultCode can take. Unfortunately, the shutdown handling is still incomplete, as we don't wait for the actual export to complete.
Updated the shutdown process to wait for thread exit. Now the question is what happens if the thread itself has stalled during shutdown?
> Updated the shutdown process to wait for thread exit. Now the question is what happens if the thread itself has stalled during shutdown?
Before your last changes it was worker.join(TimeUnit.SECONDS.toMillis(5)); now it is just worker.join(). I wouldn't worry about this too much; it shouldn't really happen unless there is a bug (like now, where the join never completes because shutdown is not called) or abnormal load (e.g. GC runs in a loop).
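The shutdown pattern being discussed can be sketched as a standalone helper. This is a generic illustration of the orderly-shutdown-then-force idiom from the diff, with the timeout made a parameter as suggested; the class and method names are illustrative, not the PR's actual code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

class GracefulShutdown {
  // Stops the scheduler, waiting up to the supplied timeout for in-flight
  // work (e.g. a final export) to finish before forcing termination.
  // Returns true if the executor terminated cleanly within the timeout.
  static boolean shutdown(ExecutorService scheduler, long timeoutSeconds) {
    scheduler.shutdown(); // stop accepting new tasks; let queued work drain
    try {
      if (!scheduler.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
        scheduler.shutdownNow(); // give up: interrupt anything still running
        return false;
      }
      return true;
    } catch (InterruptedException e) {
      scheduler.shutdownNow();
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }
}
```

A more generous timeout only helps if the caller (here, the SDK shutdown path) actually waits on the result, which is the gap the thread discusses.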
…eeded from causing multiple capacity-related exports.
Simplify exporting snapshot stacks
    private static final Duration DEFAULT_SNAPSHOT_PROFILER_SAMPLING_INTERVAL =
        Duration.ofMillis(10);

    private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_EMPTY_INTERVAL =
        "splunk.snapshot.profiler.staging.empty.interval";
Perhaps use splunk.snapshot.profiler.export.interval? The fact that the implementation stages the stacks somehow is irrelevant to users. The SDK uses otel.bsp.schedule.delay for a similar purpose.
    private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_EMPTY_INTERVAL =
        "splunk.snapshot.profiler.staging.empty.interval";
    private static final Duration DEFAULT_SNAPSHOT_PROFILER_STAGING_EMPTY_INTERVAL =
        Duration.ofSeconds(5);
Perhaps we should go with a larger interval, like 30s, to allow more stacks to accumulate. Are there any downsides to this?
A longer delay makes it more likely that a trace is fully ingested and available in the UI before the call graph, which is poor UX. Is 30 seconds often enough for profiling data to win the race against trace ingestion?
Poor UX because we indicate next to a trace ID that a call graph is available when in fact the call graph hasn't been (fully) ingested yet. This edge case is unavoidable in some circumstances, but we want to avoid it as much as possible.
This is an advantage of the "export on trace end" approach. Assuming there isn't a breakdown in the profiling ingestion process, the call graph is nearly always persisted before the trace.
> This is an advantage of the "export on trace end" approach. Assuming there isn't a breakdown during the profiling ingestion process the call graph is nearly always persisted before a trace.
This would only be true if profiling ingestion operates faster than trace ingestion. There could also be a collector between the app and the ingest that further batches spans. I don't think we need to worry about this too much; there is inevitably a delay before the data appears in APM. We can leave this at 5s for now.
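The periodic trigger under discussion can be sketched with a standard ScheduledExecutorService. This is a generic illustration of wiring a configured interval (e.g. the 5s default) to a recurring export task; the class name and the emptyStagingArea parameter are hypothetical, not the PR's actual wiring.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PeriodicExportScheduler {
  // Runs emptyStagingArea once per interval on a single background thread.
  static ScheduledExecutorService start(Duration interval, Runnable emptyStagingArea) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // First run after one full interval, then repeating at the same rate.
    scheduler.scheduleAtFixedRate(
        emptyStagingArea, interval.toMillis(), interval.toMillis(), TimeUnit.MILLISECONDS);
    return scheduler;
  }
}
```

A single-threaded scheduler keeps the timer-driven and capacity-driven empties from racing each other only if the capacity path also runs on (or synchronizes with) that thread, which is what the "multiple capacity related exports" fix above is about.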
        Duration.ofSeconds(5);

    private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_CAPACITY =
        "splunk.snapshot.profiler.staging.capacity";
The SDK uses otel.bsp.max.export.batch.size for a similar purpose.
This PR swaps out the AccumulatingStagingArea for a StagingArea implementation that automatically empties itself (exporting its stack traces) when 1) a certain amount of time has passed or 2) it has reached its capacity. Both the time interval between exports and the capacity are configurable.
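The capacity-triggered half of that behavior can be sketched as follows. This is an illustrative simplification, not the actual PeriodicallyExportingStagingArea: the class name, the generic element type, and the Consumer-based exporter are assumptions, and the real class additionally empties itself on a timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BoundedStagingArea<T> {
  private final int capacity;
  private final Consumer<List<T>> exporter;
  private final List<T> staged = new ArrayList<>();

  BoundedStagingArea(int capacity, Consumer<List<T>> exporter) {
    this.capacity = capacity;
    this.exporter = exporter;
  }

  // Stages one item; exports the whole batch when capacity is reached.
  synchronized void stage(T item) {
    staged.add(item);
    if (staged.size() >= capacity) {
      empty(); // capacity trigger
    }
  }

  // In the real implementation this is also invoked periodically by a timer.
  synchronized void empty() {
    if (staged.isEmpty()) {
      return; // nothing to export
    }
    exporter.accept(new ArrayList<>(staged)); // hand off a copy of the batch
    staged.clear();
  }
}
```

Synchronizing both stage and empty on the instance is one simple way to keep a timer-driven empty and a capacity-driven empty from exporting the same batch twice.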