
Export Snapshot Profiling Stack Traces As OpenTelemetry Log Messages#2237

Merged
laurit merged 32 commits into signalfx:main from tduncan:port-profile-log-export
Mar 25, 2025

Conversation

@tduncan (Contributor) commented Mar 13, 2025:

This PR plugs in the log exporting functionality of the snapshot profiler which completes the end-to-end extension data flow. The full extension sequence is now:

  1. Trace is selected for snapshotting
  2. Start of root span of trace is detected
  3. Profiling for trace thread starts
  4. Collected stack traces are staged in the AccumulatingStagingArea
  5. End of root span of trace is detected
  6. Profiling of trace thread stops
  7. Stack traces for trace are emptied from StagingArea to StackTraceExporter
  8. Stack traces are converted into the pprof data format
  9. Otel LogRecordData is created using pprof as log body
  10. Otel LogRecordData is emitted to the configured Otel LogRecordProcessor
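The handoff in steps 4 through 10 can be sketched roughly as follows. The class and method names here are simplified stand-ins for the PR's AccumulatingStagingArea and StackTraceExporter, not the actual signatures:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal stand-ins for the PR's types; real signatures may differ.
class PipelineSketch {
  record StackTrace(String traceId, String frames) {}

  interface StackTraceExporter {
    // Steps 8-10: convert to pprof and emit as an Otel log record.
    void export(List<StackTrace> stackTraces);
  }

  // Steps 4 and 7: stack traces accumulate per trace id until the root span ends.
  static class StagingArea {
    private final Map<String, List<StackTrace>> staged = new ConcurrentHashMap<>();
    private final StackTraceExporter exporter;

    StagingArea(StackTraceExporter exporter) {
      this.exporter = exporter;
    }

    void stage(StackTrace stackTrace) {
      staged.computeIfAbsent(stackTrace.traceId(), id -> new ArrayList<>()).add(stackTrace);
    }

    // Called when the end of the trace's root span is detected (step 5).
    void empty(String traceId) {
      List<StackTrace> batch = staged.remove(traceId);
      if (batch != null) {
        exporter.export(batch);
      }
    }
  }
}
```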

The configuration of the Otel Logger required adding an additional AgentListener to obtain the Resource, which in turn necessitated a 2-step initialization and an additional abstraction layer within AccumulatingStagingArea.

The need to use the AgentListener callback also required further simulation of the agent startup process. This was done by updating OpenTelemetrySdkExtension to support adding AgentListeners and calling into them before a test runs. Doing so required OpenTelemetrySdkExtension to extend AutoConfiguredOpenTelemetrySdk and move into the shadow package so the abstract methods could be implemented, along with renaming the builder function to configure so as to not conflict with the builder function already defined.

@tduncan tduncan requested review from a team as code owners March 13, 2025 18:21
annotationProcessor("com.google.auto.service:auto-service")
compileOnly("com.google.auto.service:auto-service")

testImplementation(project(":custom"))
tduncan (Contributor Author) commented Mar 13, 2025:

I needed access to the AutoConfigureUtil class defined in this subproject. It is used in StackTraceExporterActivator to access the ConfigProperties and Resource fields of the AutoConfiguredOpenTelemetrySdk when configuring the Otel Logger.

The compileOnly access already defined did not allow me to test that the configuration was taking place.


public static boolean isSnapshotProfilingEnabled(ConfigProperties properties) {
  return properties.getBoolean(CONFIG_KEY_ENABLE_SNAPSHOT_PROFILER, false);
}
tduncan (Contributor Author):

A second use was added in StackTraceExporterActivator, so I thought it made sense to roll this into the Configuration class.

tduncan (Contributor Author):

Adapted from code in JfrActivator. I intentionally didn't modify the existing profiling code to minimize the scope of this PR.

tduncan (Contributor Author):

Very similar to the existing Pprof class, mostly QoL/DX changes. I decided to port the code as is, duplication and all, due to timeline constraints. Once the extension functionality is in place, my intention is to figure out how to unify the two slightly different strategies used by the continuous profiler and the snapshot profiler.

tduncan (Contributor Author):

Similar to PprofCpuEventExporter though slightly simpler due to guarantees that thread and trace information will be available.

tduncan (Contributor Author):

Sets up the StackTraceExporter instance so that collected stack traces can be exported as Otel logs. If this step doesn't happen, traces may be profiled but the data will be dropped rather than exported.

tduncan (Contributor Author):

This class acts as a middleman between AccumulatingStagingArea and the StackTraceExporter. It's necessary because the StagingArea is instantiated during the SDK configuration phase (see updates in SnapshotProfilingSdkCustomizer) but the Otel Logger can't be configured until later in the agent startup process (see the newly added StackTraceExporterActivator).

To insulate the AccumulatingStagingArea from this 2-step initialization requirement it uses this Supplier to obtain a StackTraceExporter, always receiving a no-op implementation if the initialization process has yet to complete.
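A minimal sketch of that two-step pattern, with hypothetical names: the staging area always dereferences the supplier, so it transparently picks up the real exporter once activation completes and silently drops data before that.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class TwoStepInit {
  interface StackTraceExporter {
    void export(List<String> stackTraces);

    // Drops all data; used until the activator installs the real exporter.
    StackTraceExporter NOOP = stackTraces -> {};
  }

  // Starts as no-op; swapped once the Resource is available late in startup.
  static final AtomicReference<StackTraceExporter> EXPORTER =
      new AtomicReference<>(StackTraceExporter.NOOP);

  // The staging area is constructed early with this supplier and never needs
  // to know whether activation has happened yet.
  static final Supplier<StackTraceExporter> SUPPLIER = EXPORTER::get;

  // Called by the activator once the Otel Logger can be configured.
  static void activate(StackTraceExporter real) {
    EXPORTER.set(real);
  }
}
```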

tduncan (Contributor Author) commented Mar 13, 2025:

Component test of the extension's ability to export logs. Selects a trace for snapshotting, starts and stops a root span, and verifies that a log record lands in the configured Otel LogRecordExporter. Utilizes the additional simulation of an agent startup in OpenTelemetrySdkExtension.

tduncan (Contributor Author):

The primary change in this class is the ability to attach AgentListener instances and simulate that portion of the agent startup process. Doing this requires an AutoConfiguredOpenTelemetrySdk instance, which is difficult to obtain on a per-test basis. The easiest approach was to subclass AutoConfiguredOpenTelemetrySdk, which unfortunately required moving the entire JUnit extension into this shadow package so two of the abstract methods could be implemented.
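The listener-replay idea can be illustrated without the JUnit and SDK machinery; names here are illustrative stand-ins, and the real extension works against AutoConfiguredOpenTelemetrySdk rather than a plain Object:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the extension collects listeners up front and replays the
// agent-startup callback before each test runs.
class SdkExtensionSketch {
  interface AgentListener {
    void afterAgent(Object autoConfiguredSdk);
  }

  private final List<AgentListener> listeners = new ArrayList<>();

  // Mirrors the builder-style registration the PR adds to the extension.
  SdkExtensionSketch with(AgentListener listener) {
    listeners.add(listener);
    return this;
  }

  // What the JUnit beforeEach callback would do: simulate agent startup
  // by invoking every registered listener with the configured SDK.
  void beforeEach(Object sdk) {
    for (AgentListener listener : listeners) {
      listener.afterAgent(sdk);
    }
  }
}
```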

tduncan (Contributor Author) commented Mar 13, 2025:

There is obviously a lot of overlap between code added in this PR and existing code for the continuous profiler. My initial aim was to unify the two approaches by adapting the new snapshot profiling code to use the existing constructs, however as I started down that path it became clear there were enough nuances between the implementations that there wasn't a straightforward way to reconcile the differences. Given that I decided the safest thing to do is port the snapshot profiling log exporting code more-or-less as is and attempt the pprof translation and log data export unification effort later. Also looming is a fairly tight timeline for making this functionality available for demos.

* which is itself adapted from Google's Bazel build system <a
* href=https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/profiler/memory/AllocationTracker.java>here</a>.
*/
class Pprof {
Collaborator:

To me this copy paste does not make sense. Have a look at https://github.com/signalfx/splunk-otel-java/pull/1306/files#diff-ca57673d0f9c0ab8a63ac33abeb7aa277d52fce7a74774eb6e2ace458c01c88e Would it be complicated to use a similar approach? If it helps, I guess we could consider merging that PR (needs approval from @breedx-splk) or you could copy from that PR. Note that the gdi spec dictates the values for missing stack trace items (line numbers, method, class, and file names).

tduncan (Contributor Author):

I tried to explain here and here why I went this way.

The longer version is that I was tinkering for a few hours on how to reconcile the differences between the StackTrace defined in the snapshot package with the StackTrace defined in StackTraceParser used in PprofCpuEventExporter. The two data classes are very similar with the primary differences being the osThreadId field and how the stack frames are represented (StackTraceLine vs the built-in StackTraceElement). The approach I was exploring was to define StackTrace and StackFrame interfaces with the relevant accessor methods.

The next obstacle was around the building and serialization of the profiling data. PprofCpuEventExporter maintains a mutable Pprof instance that it resets after being flushed, whereas the design of the snapshot profiling extension does not maintain any mutable state during the log export process. The location of the serialization code also differs: it is found in the Pprof class itself vs in the AsyncLogExporter for snapshots.

There's also a minor DX change in the snapshot profiling Pprof version, where label creation and adding labels to Samples are separated into two distinct operations and pushed up a level into the PprofTranslator, as well as the addition of a few convenience methods that accept a StackTraceElement. Ex:

// Always On
...
if (stackTrace.getThreadId() != 0) {
  pprof.addLabel(sample, THREAD_ID, stackTrace.getThreadId());
  pprof.addLabel(sample, THREAD_NAME, stackTrace.getThreadName());
}
...
for (StackTraceParser.StackTraceLine stl : stackTrace.getStackTraceLines()) {
  sample.addLocationId(
      pprof.getLocationId(
          stl.getLocation(), stl.getClassName(), stl.getMethod(), stl.getLineNumber()));
  pprof.incFrameCount();
}

// Snapshot Profiling
...
sample.addLabel(pprof.newLabel(THREAD_ID, stackTrace.getThreadId()));
sample.addLabel(pprof.newLabel(THREAD_NAME, stackTrace.getThreadName()));
...
for (StackTraceElement stackFrame : stackTrace.getStackFrames()) {
  sample.addLocationId(pprof.getLocationId(stackFrame));
  pprof.incFrameCount();
}

Finally, I also thought it was a better idea to tackle the unification of the two stack trace data models and their translation into the pprof data format separately so as to not unintentionally break the existing Always On profiling extension while the snapshot profiling extension is being ported.

tduncan (Contributor Author):

Also see here for another place where I intentionally did not modify the existing profiling code to minimize the scope of this PR.

Collaborator:

why not something like tduncan#1

tduncan (Contributor Author):

@laurit, thanks for doing this! I've merged that PR into the branch, added tests, and also added new configuration properties to allow control over both the stack depth and sampling period of the snapshot profiler.

@@ -93,10 +97,12 @@ public void run() {
Instant now = Instant.now();
try {
ThreadInfo threadInfo = threadMXBean.getThreadInfo(threadId, MAX_ENTRY_DEPTH);
Collaborator:

In the non-snapshot profiler there is a configuration option for this

public static int getStackDepth(ConfigProperties config) {
One thing you need to handle is that for stack traces that are truncated you need to set an extra attribute. For this to be possible you may need to set a slightly larger limit here and manually truncate later.
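One way to implement that suggestion, sample one frame deeper than the configured depth and truncate manually, might look like this. The names are illustrative, not the PR's API:

```java
import java.util.Arrays;

class TruncationSketch {
  record Sampled(StackTraceElement[] frames, boolean truncated) {}

  // Caller asks the ThreadMXBean for maxDepth + 1 frames. If we received more
  // than maxDepth, the real stack was deeper, so we keep maxDepth frames and
  // flag the sample so the truncated attribute can be set on export.
  static Sampled capture(StackTraceElement[] raw, int maxDepth) {
    boolean truncated = raw.length > maxDepth;
    StackTraceElement[] kept = truncated ? Arrays.copyOf(raw, maxDepth) : raw;
    return new Sampled(kept, truncated);
  }
}
```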

import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

class AccumulatingStagingArea implements StagingArea {
Collaborator:

Here you collect stack traces per trace id and export them when the trace is done. To me the staging area abstraction doesn't seem ideal. When there is a long request you end up accumulating many stack traces; when the request is short you export a small batch with only one or two traces. When there are multiple concurrent short requests you'll export many small batches. With the non-snapshot profiler we already get the input data in batches and the interval between the batches is large, so we don't need to handle this.

Did you consider alternative solutions, for example what is used in https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/BatchSpanProcessorBuilder.java? Place data in a queue; if the queue has batch-size items (there it is 512) export them, otherwise wait for a bit (there 5s) and export whatever is in the queue. If the queue grows too large, drop data.

tduncan (Contributor Author):

This sounds like a very good idea. Would you be against doing this as a followup (this PR is a partial blocker to adding span correlation)?

I don't think a time-based approach is necessarily right, however. What I'm envisioning is a BufferingStagingArea with some maximum number of stack traces before it automatically empties itself, in addition to the forced empty upon detecting the entry span ending. I don't know what that number should be; maybe 20?

This would control how many stack traces are bundled into a single log record. The frequency of log record exporting is separately controlled by the LogRecordProcessor configured in OtelLoggerFactory. Right now it wires up a SimpleLogRecordProcessor, which I believe immediately exports the log record. We could use the BatchLogRecordProcessor instead, though I've had issues getting this to work in the past.
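A size-triggered variant of that idea could look roughly like this. BufferingStagingArea is hypothetical and the threshold is the placeholder number under discussion, not a settled value:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Flushes automatically whenever the buffer reaches maxStaged, plus on an
// explicit flush() when the entry span ends. Coarse synchronization for brevity.
class BufferingStagingArea {
  private final int maxStaged;
  private final Consumer<List<String>> exporter;
  private List<String> buffer = new ArrayList<>();

  BufferingStagingArea(int maxStaged, Consumer<List<String>> exporter) {
    this.maxStaged = maxStaged;
    this.exporter = exporter;
  }

  synchronized void stage(String stackTrace) {
    buffer.add(stackTrace);
    if (buffer.size() >= maxStaged) {
      flush();
    }
  }

  synchronized void flush() {
    if (!buffer.isEmpty()) {
      List<String> batch = buffer;
      buffer = new ArrayList<>();
      exporter.accept(batch);
    }
  }
}
```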

Collaborator:

Would you be against doing this as a followup

sure

I don't think a time-based approach is necessarily right, however. What I'm envisioning is a BufferingStagingArea with some maximum number of stack traces before it automatically empties itself, in addition to the forced empty upon detecting the entry span ending. I don't know what that number should be; maybe 20?

Flushing when there have been no new stack traces for some time isn't that hard. When you write stacks from all ongoing traces into the same pprof, flushing on ending the entry span doesn't make sense. 20 is imo too low; 2000 or even larger would be better. I think that determining the batch size requires some experimentation.

This would control how many stack traces are bundled into a single log record. The frequency of log record exporting is separately controlled by the LogRecordProcessor configured in OtelLoggerFactory. Right now it wires up a SimpleLogRecordProcessor, which I believe immediately exports the log record. We could use the BatchLogRecordProcessor instead, though I've had issues getting this to work in the past.

Using BatchLogRecordProcessor here isn't ideal because pprof works best when you put lots of stacks into the same batch. When the same string is used multiple times then the shared string pool will help in reducing the transfer size. We need to manage creating these batches ourself to not export too often so we really don't gain much from using the BatchLogRecordProcessor.
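The string-pool effect mentioned here comes from pprof interning every distinct string once in a table and referencing it by index thereafter; a toy illustration of the mechanism:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy version of pprof's string table: each distinct string is stored once
// and every later use is just a small integer index, which is why large
// batches that repeat class and method names shrink the transfer size.
class StringTable {
  private final Map<String, Integer> index = new HashMap<>();
  private final List<String> strings = new ArrayList<>();

  StringTable() {
    intern(""); // pprof reserves index 0 for the empty string
  }

  int intern(String s) {
    return index.computeIfAbsent(s, k -> {
      strings.add(k);
      return strings.size() - 1;
    });
  }

  int size() {
    return strings.size();
  }
}
```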

@breedx-splk (Contributor):

Building an implementation of OpenTelemetry for testing is wild. Without having read this code yet, my instinct is that it's a mistake or at least a misstep.

private static final java.util.logging.Logger logger =
    java.util.logging.Logger.getLogger(AsyncStackTraceExporter.class.getName());

private final ExecutorService executor = Executors.newSingleThreadScheduledExecutor();
Collaborator:

I think you meant to use newSingleThreadExecutor here.

tduncan (Contributor Author):

Thanks. Updated!

}
cpuEventExporter.flush();
} catch (Exception e) {
logger.log(Level.SEVERE, "an exception was thrown", e);
Collaborator:

Perhaps change to something like An exception was thrown while exporting profiling snapshots.

tduncan (Contributor Author):

Updated!

stackTrace.getStackFrames(),
stackTrace.getTimestamp(),
stackTrace.getTraceId(),
null);
Collaborator:

I'm not sure how the profiling UI will handle the missing span id here. Perhaps it would be better to use the local root span id here?

tduncan (Contributor Author):

The next PR adding span correlation will fill this detail in. To your concern, a different set of UI screens is being created for snapshot profiling data, and the two sources of call stacks will not overlap at any point in the UI.


try {
sampler.start(spanContext);
await().atMost(HALF_SECOND).until(() -> !staging.allStackTraces().isEmpty());
Collaborator:

It is best not to use short wait times like atMost(HALF_SECOND), as this can result in sporadic test failures.

tduncan (Contributor Author):

I removed all of the explicit timeouts from these tests so the 10 second default is used.

@laurit laurit merged commit 61d2d25 into signalfx:main Mar 25, 2025
26 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 25, 2025
@tduncan tduncan deleted the port-profile-log-export branch April 8, 2025 16:05