
Periodically Export Stacktraces From StagingArea#2298

Merged
laurit merged 54 commits into signalfx:main from tduncan:batch-export-snapshot-profiling-stacktraces
May 7, 2025

Conversation


@tduncan (Contributor) commented Apr 28, 2025

This PR swaps out the AccumulatingStagingArea for a StagingArea implementation that automatically exports and empties itself when 1) a certain amount of time has passed or 2) it has reached its capacity. Both the time interval between exports and the capacity are configurable.
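The mechanism described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the class name, the Consumer-based exporter, and the trigger logic are all assumptions made for the example.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch of a staging area that empties itself on a schedule or at capacity.
class PeriodicStagingArea<T> {
  private final List<T> staged = new ArrayList<>();
  private final int capacity;
  private final Consumer<List<T>> exporter;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  PeriodicStagingArea(int capacity, Duration emptyInterval, Consumer<List<T>> exporter) {
    this.capacity = capacity;
    this.exporter = exporter;
    // Time-based trigger: export whatever has accumulated every interval.
    scheduler.scheduleAtFixedRate(
        this::empty, emptyInterval.toMillis(), emptyInterval.toMillis(), TimeUnit.MILLISECONDS);
  }

  synchronized void stage(T item) {
    staged.add(item);
    // Capacity-based trigger: export immediately once full.
    if (staged.size() >= capacity) {
      empty();
    }
  }

  synchronized void empty() {
    if (staged.isEmpty()) {
      return;
    }
    exporter.accept(new ArrayList<>(staged));
    staged.clear();
  }

  void shutdown() {
    scheduler.shutdown();
    empty(); // drain anything still staged
  }
}
```

Both triggers funnel into the same `empty()` call, so the exporter only ever sees non-empty batches.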

@tduncan tduncan requested review from a team as code owners April 28, 2025 17:58

private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_CAPACITY =
"splunk.snapshot.profiler.staging.capacity";
private static final int DEFAULT_SNAPSHOT_PROFILER_STAGING_CAPACITY = 2000;
Contributor Author:

I don't have a reason for the chosen defaults. Happy to change them.

Collaborator:

I think 2000 is reasonable for a start.

Comment on lines +78 to +86
try {
scheduler.shutdown();
if (!scheduler.awaitTermination(1, TimeUnit.SECONDS)) {
scheduler.shutdownNow();
}
} catch (InterruptedException e) {
scheduler.shutdownNow();
Thread.currentThread().interrupt();
}
Collaborator:

IMO calling shutdownNow does not make sense. If you really care whether it manages to export, I think you should use a more generous timeout than 1s.

Contributor Author:

We can't wait too long; I believe the OpenTelemetry SDK has a 10-second timeout for the entire shutdown process.

Collaborator:

I think that limit doesn't really apply here. The SDK waits 10s for the CompletableResultCode to produce a result, but it does not limit how long producing that CompletableResultCode may take. Unfortunately the shutdown handling is still incomplete, as we don't wait for the actual export to complete.

Contributor Author:

Updated the shutdown process to wait for thread exit. Now the question is what happens if the thread itself has stalled during shutdown?

Collaborator:

> Updated the shutdown process to wait for thread exit. Now the question is what happens if the thread itself has stalled during shutdown?

Before your last changes it was worker.join(TimeUnit.SECONDS.toMillis(5)); now it is just worker.join(). I wouldn't worry about this too much; it shouldn't really happen unless there is a bug (like now, where the join never completes because shutdown is not called) or abnormal load (e.g. GC running in a loop).
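The bounded-join pattern discussed here can be sketched as below. This is an illustrative example, not the PR's code: the Worker class, the `running` flag, and the thread name are assumptions.

```java
// Sketch: signal the worker loop to stop, then wait a bounded time for it
// to exit, rather than calling join() with no timeout.
class Worker {
  private volatile boolean running = true;
  private final Thread thread = new Thread(this::run, "snapshot-profiler-worker");

  void start() {
    thread.start();
  }

  private void run() {
    while (running) {
      // In the real profiler this loop would drain and export staged stack traces.
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  /** Returns true if the worker thread exited within the timeout. */
  boolean shutdown(long timeoutMillis) throws InterruptedException {
    running = false;            // ask the loop to stop
    thread.join(timeoutMillis); // bounded wait, never an unbounded join()
    return !thread.isAlive();
  }
}
```

The boolean return lets the caller decide what to do if the worker stalls, instead of hanging the whole SDK shutdown.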

private static final Duration DEFAULT_SNAPSHOT_PROFILER_SAMPLING_INTERVAL = Duration.ofMillis(10);

private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_EMPTY_INTERVAL =
"splunk.snapshot.profiler.staging.empty.interval";
Collaborator:

Perhaps use splunk.snapshot.profiler.export.interval? The fact that the implementation stages the stacks somehow is irrelevant to users. The SDK uses otel.bsp.schedule.delay for a similar purpose.

private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_EMPTY_INTERVAL =
"splunk.snapshot.profiler.staging.empty.interval";
private static final Duration DEFAULT_SNAPSHOT_PROFILER_STAGING_EMPTY_INTERVAL =
Duration.ofSeconds(5);
Collaborator:

Perhaps we should go with a larger interval, like 30s, to allow more stacks to accumulate. Are there any downsides to this?

Contributor Author (@tduncan, May 5, 2025):

A longer delay makes it more likely that a trace is fully ingested and available in the UI before the call graph, which is poor UX. Is exporting every 30 seconds frequent enough for profiling data to win the race against trace ingestion?

Contributor Author:

Poor UX because we indicate next to a trace ID that a call graph is available when in fact the call graph hasn't been (fully) ingested yet. This edge case is unavoidable in some circumstances but we want to avoid it as much as possible.

This is an advantage of the "export on trace end" approach. Assuming there isn't a breakdown during the profiling ingestion process the call graph is nearly always persisted before a trace.

Collaborator:

> This is an advantage of the "export on trace end" approach. Assuming there isn't a breakdown during the profiling ingestion process the call graph is nearly always persisted before a trace.

This would only be true if the profiling ingestion operates faster than the trace ingestion. There could also be a collector between the app and ingest that further batches spans. I don't think we need to worry about this too much; there is inevitably a delay before the data appears in APM. We can leave this at 5s for now.

Duration.ofSeconds(5);

private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_CAPACITY =
"splunk.snapshot.profiler.staging.capacity";
Collaborator (@laurit, May 5, 2025):

The SDK uses otel.bsp.max.export.batch.size for a similar purpose.

@laurit laurit merged commit ead27f8 into signalfx:main May 7, 2025
26 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators May 7, 2025
@tduncan tduncan deleted the batch-export-snapshot-profiling-stacktraces branch May 7, 2025 16:48