Skip to content

Collect Callstacks For Threads Processing Traces#2207

Merged
breedx-splk merged 20 commits intosignalfx:mainfrom
tduncan:port-trace-profiling
Mar 5, 2025
Merged

Collect Callstacks For Threads Processing Traces#2207
breedx-splk merged 20 commits intosignalfx:mainfrom
tduncan:port-trace-profiling

Conversation

@tduncan
Copy link
Copy Markdown
Contributor

@tduncan tduncan commented Feb 25, 2025

This PR adds the trace profiling ability to the snapshot profiling extension. Profiling is activated with the span processor when an entry span is encountered that has been selected for snapshotting, and profiling is stopped when that same entry span is ended.

At the moment this PR does not export the collected callstacks as logs. That ability will closely follow the merging of this PR.

@tduncan tduncan requested review from a team as code owners February 25, 2025 18:01
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary solution. One of the very next PRs will include the stack trace data export and this will be replaced.


@Override
public void startSampling(String traceId, long threadId) {
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation uses one profiling thread per trace selected for snapshotting. Should possibly consider a single profiling thread. Could be a followup change.

}
}

static class TogglableTraceRegistry extends TraceRegistry {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted to a standalone class so it can be used in other tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is using a combination of the Object Mother and the Builder patterns to make it easier for tests to create instances of the profiling SDK customizer, plugging in different implementations of of the customizer's dependencies as needed. It also insulates tests from needed to respond directly to every dependency change made in the customizer.

The basic use will look like Profiling.customizer().build() and there are examples in the PR where tests plug in different TraceRegistry and StackTraceSampler implementations.

public void onEnd(ReadableSpan span) {
if (isEntry(span)) {
registry.unregister(span.getSpanContext());
sampler.stopSampling(Thread.currentThread().getId());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the assumption that server span is ended by the same thread that started it isn't true. For example async servlet requests could end after the thread that started the span and called the servlet's service method has exited from the servlet code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, lack of support for traces spanning multiple threads is a known limitation. We've intentionally not attempted to solve that problem just yet.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add a comment stating this limitation and if possible elaborate how this is mitigated and why it does not have an adverse effect. Currently it looks like a resource leak where element is never moved from a map and sampling is never stopped for the affected thread.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unaware that instrumentation for async frameworks started and ended spans on different threads. This will indeed lead to a resource leak and the continual sampling of the original thread. I'll add a test for this scenario and update the implementation to consider trace ID rather than thread ID so the profiling can be stopped appropriately.

It still won't support traces that delegate work to background processes or have multiple concurrent requests within the same trace.

@@ -61,6 +64,7 @@ public boolean isStartRequired() {
public void onEnd(ReadableSpan span) {
if (isEntry(span)) {
registry.unregister(span.getSpanContext());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idk if this is an issue but the same service could have multiple requests concurrent requests from the same trace. For example imagine a reset service that gets called by some other service multiple times.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that scenario wouldn't be fully supported. The current implementation assumes everything is happening in the same thread.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment describing that limitation.

StackTrace stackTrace = StackTrace.from(now, threadInfo);
stagingArea.stage(threadId, stackTrace);
} catch (Exception e) {
LOGGER.severe(e::getMessage);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering this is a user visible error it might be best to change the log message to a sentence describing the nature of failure and log the full stack trace along with the exception message. If this ever happens there is a chance that someone will need to figure out what went wrong.

import java.util.logging.Logger;

class ScheduledExecutorStackTraceSampler implements StackTraceSampler {
private static final Logger LOGGER =
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the convention in this project is to use lowercase logger

@Override
public void start(SpanContext spanContext) {
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
samplers.put(spanContext.getTraceId(), scheduler);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a good idea to either not add a new scheduler when given trace is already in the map (multiple concurrent request for the same trace) or stop the scheduler that was already in the map or allow multiple schedulers for given trace. Otherwise you'll leak the ScheduledExecutorService that was already in the map.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Comment on lines +31 to +33
* <b>Implementation Note</b><br>
* Only single-threaded traces are currently supported. Behavior of the extension in a multithreaded
* trace environment is unspecified and explicitly not supported.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should rephrase the comment in simpler terms. Perhaps something along the lines of Profiling multiple concurrent traces with the same trace id is not supported. To me the meaning of single-threaded and multithreaded is not obvious here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll think about how to make it more clear.

There are two primary use cases that are not supported: 1) a service is called multiple times concurrently within the same trace and 2) a service delegates some of its work to a background process within the same trace. Neither is explicitly supported by this initial implementation.

@breedx-splk breedx-splk merged commit ef0396b into signalfx:main Mar 5, 2025
26 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 5, 2025
@tduncan tduncan deleted the port-trace-profiling branch April 8, 2025 16:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants