Collect Callstacks For Threads Processing Traces by tduncan · Pull Request #2207 · signalfx/splunk-otel-java

tduncan · 2025-02-25T18:00:59Z

This PR adds the trace profiling ability to the snapshot profiling extension. Profiling is activated with the span processor when an entry span is encountered that has been selected for snapshotting, and profiling is stopped when that same entry span is ended.

At the moment this PR does not export the collected callstacks as logs. That ability will closely follow the merging of this PR.

tduncan · 2025-02-25T21:36:34Z

profiler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/NoopStagingArea.java

Temporary solution. One of the very next PRs will include the stack trace data export and this will be replaced.

tduncan · 2025-02-25T21:38:48Z

...main/java/com/splunk/opentelemetry/profiler/snapshot/ScheduledExecutorStackTraceSampler.java

+
+  @Override
+  public void startSampling(String traceId, long threadId) {
+    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();


The current implementation uses one profiling thread per trace selected for snapshotting. Should possibly consider a single profiling thread. Could be a followup change.
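As a rough illustration of the single-profiling-thread idea raised above, the sketch below schedules one periodic task per trace on a single shared scheduler thread. It is not the PR's implementation: the class name `SharedThreadSampler`, the `Consumer<ThreadInfo>` sink, and the method names are assumptions made for this example. Only the standard `java.util.concurrent` and `java.lang.management` APIs are relied on.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: every profiled trace shares one daemon scheduler thread.
class SharedThreadSampler {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(
          runnable -> {
            Thread thread = new Thread(runnable, "shared-snapshot-sampler");
            thread.setDaemon(true);
            return thread;
          });
  private final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
  private final Map<String, ScheduledFuture<?>> tasks = new ConcurrentHashMap<>();
  private final Duration period;
  private final Consumer<ThreadInfo> sink; // stands in for a staging area (assumption)

  SharedThreadSampler(Duration period, Consumer<ThreadInfo> sink) {
    this.period = period;
    this.sink = sink;
  }

  // Schedules a periodic sampling task for the trace unless one already exists.
  void startSampling(String traceId, long threadId) {
    tasks.computeIfAbsent(
        traceId,
        id ->
            scheduler.scheduleAtFixedRate(
                () -> sample(threadId), 0, period.toMillis(), TimeUnit.MILLISECONDS));
  }

  // Cancels the trace's task; the shared scheduler thread keeps running.
  void stopSampling(String traceId) {
    ScheduledFuture<?> task = tasks.remove(traceId);
    if (task != null) {
      task.cancel(false);
    }
  }

  void shutdown() {
    scheduler.shutdownNow();
  }

  private void sample(long threadId) {
    try {
      ThreadInfo info = threadMXBean.getThreadInfo(threadId, Integer.MAX_VALUE);
      if (info != null) { // null means the thread is no longer alive
        sink.accept(info);
      }
    } catch (RuntimeException e) {
      // Swallowed so one failed sample does not cancel the periodic task.
    }
  }
}
```

The trade-off is that a single scheduler thread serializes all sampling, so a slow sample for one trace delays the others. That may or may not be acceptable at the chosen sampling period.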

tduncan · 2025-02-25T22:01:59Z

...iler/src/test/java/com/splunk/opentelemetry/profiler/snapshot/SnapshotSpanAttributeTest.java

    }
  }
-
-  static class TogglableTraceRegistry extends TraceRegistry {


Extracted to a standalone class so it can be used in other tests.

tduncan · 2025-02-25T23:01:15Z

profiler/src/test/java/com/splunk/opentelemetry/profiler/snapshot/Profiling.java

This class uses a combination of the Object Mother and Builder patterns to make it easier for tests to create instances of the profiling SDK customizer, plugging in different implementations of the customizer's dependencies as needed. It also insulates tests from needing to respond directly to every dependency change made in the customizer.

The basic use will look like Profiling.customizer().build() and there are examples in the PR where tests plug in different TraceRegistry and StackTraceSampler implementations.
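A minimal sketch of how such an Object Mother plus Builder helper might be shaped is shown here. The `TraceRegistry` and `StackTraceSampler` interfaces, the `SnapshotCustomizer` class, and the `withRegistry`/`withSampler` method names are simplified stand-ins invented for this example; the real types in the PR may look different.

```java
// Hypothetical stand-ins for the customizer's dependencies (assumptions, not the PR's types).
interface TraceRegistry {
  boolean isRegistered(String traceId);
}

interface StackTraceSampler {
  void start(String traceId);

  void stop(String traceId);
}

// Hypothetical object under construction; it simply holds its dependencies.
class SnapshotCustomizer {
  final TraceRegistry registry;
  final StackTraceSampler sampler;

  SnapshotCustomizer(TraceRegistry registry, StackTraceSampler sampler) {
    this.registry = registry;
    this.sampler = sampler;
  }
}

// Object Mother entry point: tests call Profiling.customizer() and override only what they need.
final class Profiling {
  private Profiling() {}

  static Builder customizer() {
    return new Builder();
  }

  static final class Builder {
    // Sensible defaults keep tests compiling when new dependencies are added later.
    private TraceRegistry registry = traceId -> true;
    private StackTraceSampler sampler =
        new StackTraceSampler() {
          @Override
          public void start(String traceId) {}

          @Override
          public void stop(String traceId) {}
        };

    Builder withRegistry(TraceRegistry registry) {
      this.registry = registry;
      return this;
    }

    Builder withSampler(StackTraceSampler sampler) {
      this.sampler = sampler;
      return this;
    }

    SnapshotCustomizer build() {
      return new SnapshotCustomizer(registry, sampler);
    }
  }
}
```

With this shape, a test that only cares about the registry could write `Profiling.customizer().withRegistry(id -> false).build()` and would not need to change when another dependency is added to the builder.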

laurit · 2025-02-26T14:10:02Z

...src/main/java/com/splunk/opentelemetry/profiler/snapshot/SnapshotProfilingSpanProcessor.java

  public void onEnd(ReadableSpan span) {
    if (isEntry(span)) {
      registry.unregister(span.getSpanContext());
+      sampler.stopSampling(Thread.currentThread().getId());


Unfortunately, the assumption that a server span is ended by the same thread that started it doesn't hold. For example, async servlet requests could end after the thread that started the span and called the servlet's service method has exited from the servlet code.

Yes, lack of support for traces spanning multiple threads is a known limitation. We've intentionally not attempted to solve that problem just yet.

You should add a comment stating this limitation and, if possible, elaborate on how it is mitigated and why it does not have an adverse effect. Currently it looks like a resource leak where an element is never removed from the map and sampling is never stopped for the affected thread.

I was unaware that instrumentation for async frameworks started and ended spans on different threads. This will indeed lead to a resource leak and the continual sampling of the original thread. I'll add a test for this scenario and update the implementation to consider trace ID rather than thread ID so the profiling can be stopped appropriately.

It still won't support traces that delegate work to background processes or have multiple concurrent requests within the same trace.
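To make the keying change concrete, here is a small sketch, assuming a hypothetical `SamplingRegistry` class, of why looking sampling state up by trace ID lets a different thread stop it. It models only the bookkeeping, not the actual sampling.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical bookkeeping only: maps a trace ID to the thread being sampled for it.
class SamplingRegistry {
  private final Map<String, Long> sampledThreads = new ConcurrentHashMap<>();

  // Called on the thread that starts the entry span; that thread becomes the sampled one.
  void start(String traceId) {
    sampledThreads.put(traceId, Thread.currentThread().getId());
  }

  // May be called from any thread, because the lookup key is the trace ID rather than
  // the caller's thread ID. Returns the sampled thread's ID, or null if nothing was registered.
  Long stop(String traceId) {
    return sampledThreads.remove(traceId);
  }

  boolean isSampling(String traceId) {
    return sampledThreads.containsKey(traceId);
  }
}
```

Had the map been keyed by `Thread.currentThread().getId()` at stop time, a span ended on another thread would look up the wrong key and the original entry would never be removed.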

laurit · 2025-02-26T15:13:25Z

...src/main/java/com/splunk/opentelemetry/profiler/snapshot/SnapshotProfilingSpanProcessor.java

@@ -61,6 +64,7 @@ public boolean isStartRequired() {
  public void onEnd(ReadableSpan span) {
    if (isEntry(span)) {
      registry.unregister(span.getSpanContext());


Idk if this is an issue, but the same service could have multiple concurrent requests from the same trace. For example, imagine a REST service that gets called by some other service multiple times.

Yes, that scenario wouldn't be fully supported. The current implementation assumes everything is happening in the same thread.

Could you add a comment describing that limitation?

laurit · 2025-02-27T11:16:55Z

...main/java/com/splunk/opentelemetry/profiler/snapshot/ScheduledExecutorStackTraceSampler.java

+        StackTrace stackTrace = StackTrace.from(now, threadInfo);
+        stagingArea.stage(threadId, stackTrace);
+      } catch (Exception e) {
+        LOGGER.severe(e::getMessage);


Considering this is a user visible error it might be best to change the log message to a sentence describing the nature of failure and log the full stack trace along with the exception message. If this ever happens there is a chance that someone will need to figure out what went wrong.
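One way to follow this suggestion with `java.util.logging` is the `Logger.log(Level, Throwable, Supplier<String>)` overload, which records both a descriptive message and the exception itself so handlers can print the stack trace. The class name, method name, and message wording below are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper showing the logging call only; the sampling work is passed in.
class SamplingErrorLogging {
  private static final Logger logger = Logger.getLogger(SamplingErrorLogging.class.getName());

  static void sampleSafely(long threadId, Runnable sample) {
    try {
      sample.run();
    } catch (Exception e) {
      // Passing the Throwable attaches it to the LogRecord, so the full stack trace is available.
      logger.log(
          Level.SEVERE, e, () -> "Unexpected error taking callstack sample for thread " + threadId);
    }
  }
}
```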

laurit · 2025-02-27T11:17:18Z

...main/java/com/splunk/opentelemetry/profiler/snapshot/ScheduledExecutorStackTraceSampler.java

+import java.util.logging.Logger;
+
+class ScheduledExecutorStackTraceSampler implements StackTraceSampler {
+  private static final Logger LOGGER =


The convention in this project is to use a lowercase logger.

laurit · 2025-02-28T12:56:21Z

...main/java/com/splunk/opentelemetry/profiler/snapshot/ScheduledExecutorStackTraceSampler.java

+  @Override
+  public void start(SpanContext spanContext) {
+    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
+    samplers.put(spanContext.getTraceId(), scheduler);


I think it would be a good idea to do one of the following: not add a new scheduler when the given trace is already in the map (multiple concurrent requests for the same trace), stop the scheduler that was already in the map, or allow multiple schedulers for a given trace. Otherwise you'll leak the ScheduledExecutorService that was already in the map.

Good catch!
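The first of the options listed above (do not add a second scheduler when the trace is already present) could be sketched with `ConcurrentHashMap.computeIfAbsent`, which only creates the scheduler when no mapping exists for the trace ID. The class and method names here are hypothetical, and the PR's actual fix may differ.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: at most one scheduler per trace ID, so none can be overwritten and leaked.
class SingleSchedulerPerTrace {
  private final Map<String, ScheduledExecutorService> samplers = new ConcurrentHashMap<>();

  // Returns true if sampling was started, false if the trace was already being sampled.
  boolean start(String traceId, Runnable sample, Duration period) {
    boolean[] created = {false};
    samplers.computeIfAbsent(
        traceId,
        id -> {
          ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
          scheduler.scheduleAtFixedRate(sample, 0, period.toMillis(), TimeUnit.MILLISECONDS);
          created[0] = true;
          return scheduler;
        });
    return created[0];
  }

  // Removes and shuts down the trace's scheduler, if any.
  void stop(String traceId) {
    ScheduledExecutorService scheduler = samplers.remove(traceId);
    if (scheduler != null) {
      scheduler.shutdownNow();
    }
  }

  int activeSchedulers() {
    return samplers.size();
  }
}
```

A plain `put` would replace the existing entry and drop the only reference to a still-running executor. With `computeIfAbsent`, the second start for the same trace is a no-op.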

laurit · 2025-02-28T13:02:37Z

...src/main/java/com/splunk/opentelemetry/profiler/snapshot/SnapshotProfilingSpanProcessor.java

+ * <b>Implementation Note</b><br>
+ * Only single-threaded traces are currently supported. Behavior of the extension in a multithreaded
+ * trace environment is unspecified and explicitly not supported.


I think you should rephrase the comment in simpler terms. Perhaps something along the lines of "Profiling multiple concurrent traces with the same trace id is not supported." To me the meaning of single-threaded and multithreaded is not obvious here.

I'll think about how to make it more clear.

There are two primary use cases that are not supported: 1) a service is called multiple times concurrently within the same trace and 2) a service delegates some of its work to a background process within the same trace. Neither is explicitly supported by this initial implementation.

tduncan added 9 commits February 21, 2025 14:12

Add the StackTraceSampler interface.

e24264f

Add a ScheduledExecutor implementation of the StackTraceSampler interface.

b2827a0

Start and stop profiling when entry spans start and end respectively.

3e6620a

Add a noop StagingArea implementation and test builder for the SDK customizer to insulate individual tests from additional fields being added to the customizer.

574c9c1

Use a concurrent map in ScheduledExecutorStackTraceSampler.

5b3e086

Apply spotless code formatting.

5368c14

Convert SAMPLING_PERIOD constant to Duration.

0e33b89

Remove debug logging from ScheduledExecutorStackTraceSampler.

0d8b811

InMemoryStagingArea doesn't need to be a JUnit extension.

53c177c

tduncan requested review from a team as code owners February 25, 2025 18:01

tduncan commented Feb 25, 2025

Remove unused JUnit extension method from ObservableStackTraceSampler.

1921270

tduncan commented Feb 25, 2025

laurit reviewed Feb 26, 2025

laurit reviewed Feb 27, 2025

tduncan added 6 commits February 27, 2025 08:34

Rename LOGGER to lowercase.

02b6536

Log explanatory message and stacktrace if something goes wrong collecting callstacks from profilied threads.

552f316

Add explanatory comment to SnapshotProfilingSpanProcessor.

a8dee6d

Expand the StackTraceSampler interface to accept a SpanContext to eventually track traces, not only thread IDs.

aff6d6b

Associate thread profilers with trace IDs to account for different threads managing a spans lifecycle (e.g. async web frameworks) and avoid a resource leak and runaway profiling thread.

3f996f5

Add test verifying that ending a span from a different thread will stop trace profiling.

f0d7207

laurit reviewed Feb 28, 2025

tduncan added 2 commits February 28, 2025 09:55

Only allow a single thread per trace to be profiled at a time.

a589fe3

Merge branch 'main' into port-trace-profiling

79b1b47

tduncan added 2 commits February 28, 2025 16:10

Rename the 'Profilng' class to 'Snapshotting'.

b87e8d8

Attempt to make the explanation about the multithreaded trace limitation more clear.

c2bcddb

laurit approved these changes Mar 4, 2025

breedx-splk merged commit ef0396b into signalfx:main Mar 5, 2025
26 checks passed

github-actions bot locked and limited conversation to collaborators Mar 5, 2025

tduncan deleted the port-trace-profiling branch April 8, 2025 16:05

3 participants