
Periodically Export Stacktraces From StagingArea#2298

Merged
laurit merged 54 commits into signalfx:main from tduncan:batch-export-snapshot-profiling-stacktraces
May 7, 2025

Conversation


@tduncan (Contributor) commented Apr 28, 2025

This PR swaps out the AccumulatingStagingArea for a StagingArea implementation that automatically exports and empties itself when 1) a certain amount of time has passed or 2) it has reached its capacity. Both the time interval between exports and the capacity are configurable.
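The mechanism described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the class name, the Consumer-based exporter, and the trigger logic are all assumptions made for the example.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch of a staging area that empties itself on a schedule or at capacity.
class PeriodicStagingArea<T> {
  private final List<T> staged = new ArrayList<>();
  private final int capacity;
  private final Consumer<List<T>> exporter;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  PeriodicStagingArea(int capacity, Duration emptyInterval, Consumer<List<T>> exporter) {
    this.capacity = capacity;
    this.exporter = exporter;
    // Time-based trigger: export whatever has accumulated every interval.
    scheduler.scheduleAtFixedRate(
        this::empty, emptyInterval.toMillis(), emptyInterval.toMillis(), TimeUnit.MILLISECONDS);
  }

  synchronized void stage(T item) {
    staged.add(item);
    // Capacity-based trigger: export immediately once full.
    if (staged.size() >= capacity) {
      empty();
    }
  }

  synchronized void empty() {
    if (staged.isEmpty()) {
      return;
    }
    exporter.accept(new ArrayList<>(staged));
    staged.clear();
  }

  void shutdown() {
    scheduler.shutdown();
    empty(); // drain anything still staged
  }
}
```

Both triggers funnel into the same `empty()` call, so the exporter only ever sees non-empty batches.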

@tduncan tduncan requested review from a team as code owners April 28, 2025 17:58

private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_CAPACITY =
"splunk.snapshot.profiler.staging.capacity";
private static final int DEFAULT_SNAPSHOT_PROFILER_STAGING_CAPACITY = 2000;
Contributor Author:

I don't have a reason for the chosen defaults. Happy to change them.

Collaborator:

I think 2000 is reasonable for a start.

Comment on lines +78 to +86
try {
scheduler.shutdown();
if (!scheduler.awaitTermination(1, TimeUnit.SECONDS)) {
scheduler.shutdownNow();
}
} catch (InterruptedException e) {
scheduler.shutdownNow();
Thread.currentThread().interrupt();
}
Collaborator:

IMO calling shutdownNow does not make sense. If you really care whether it manages to export, I think you should use a more generous timeout than 1s.

Contributor Author:

We can't wait too long; I believe the OpenTelemetry SDK has a 10-second timeout for the entire shutdown process.

Collaborator:

I think that limit doesn't really apply here. The SDK waits 10s for the CompletableResultCode to produce a result, but it does not limit how long producing that CompletableResultCode may take. Unfortunately the shutdown handling is still incomplete, as we don't wait for the actual export to complete.

Contributor Author:

Updated the shutdown process to wait for thread exit. Now the question is what happens if the thread itself has stalled during shutdown?

Collaborator:

> Updated the shutdown process to wait for thread exit. Now the question is what happens if the thread itself has stalled during shutdown?

Before your last changes it was worker.join(TimeUnit.SECONDS.toMillis(5)); now it is just worker.join(). I wouldn't worry about this too much; it shouldn't really happen unless there is a bug (like now, where the join never completes because shutdown is not called) or abnormal load (e.g. GC running in a loop).
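The bounded-join pattern discussed here can be sketched as below. This is an illustrative example, not the PR's code: the Worker class, the `running` flag, and the thread name are assumptions.

```java
// Sketch: signal the worker loop to stop, then wait a bounded time for it
// to exit, rather than calling join() with no timeout.
class Worker {
  private volatile boolean running = true;
  private final Thread thread = new Thread(this::run, "snapshot-profiler-worker");

  void start() {
    thread.start();
  }

  private void run() {
    while (running) {
      // In the real profiler this loop would drain and export staged stack traces.
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  /** Returns true if the worker thread exited within the timeout. */
  boolean shutdown(long timeoutMillis) throws InterruptedException {
    running = false;            // ask the loop to stop
    thread.join(timeoutMillis); // bounded wait, never an unbounded join()
    return !thread.isAlive();
  }
}
```

The boolean return lets the caller decide what to do if the worker stalls, instead of hanging the whole SDK shutdown.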

private static final Duration DEFAULT_SNAPSHOT_PROFILER_SAMPLING_INTERVAL = Duration.ofMillis(10);

private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_EMPTY_INTERVAL =
"splunk.snapshot.profiler.staging.empty.interval";
Collaborator:

Perhaps use splunk.snapshot.profiler.export.interval? The fact that the implementation stages the stacks somehow is irrelevant to users. The SDK uses otel.bsp.schedule.delay for a similar purpose.

private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_EMPTY_INTERVAL =
"splunk.snapshot.profiler.staging.empty.interval";
private static final Duration DEFAULT_SNAPSHOT_PROFILER_STAGING_EMPTY_INTERVAL =
Duration.ofSeconds(5);
Collaborator:

Perhaps we should go with a larger interval, like 30s, to allow more stacks to accumulate. Are there any downsides to this?

Contributor Author (@tduncan, May 5, 2025):

A longer delay makes it more likely that a trace is fully ingested and available in the UI before the call graph, which is poor UX. Is exporting every 30 seconds frequent enough for profiling data to win the race against trace ingestion?

Contributor Author:

Poor UX because we indicate next to a trace ID that a call graph is available when in fact the call graph hasn't been (fully) ingested yet. This edge case is unavoidable in some circumstances but we want to avoid it as much as possible.

This is an advantage of the "export on trace end" approach. Assuming there isn't a breakdown during the profiling ingestion process the call graph is nearly always persisted before a trace.

Collaborator:

> This is an advantage of the "export on trace end" approach. Assuming there isn't a breakdown during the profiling ingestion process the call graph is nearly always persisted before a trace.

This would only be true if the profiling ingestion operates faster than the trace ingestion. There could also be a collector between the app and ingest that further batches spans. I don't think we need to worry about this too much; there is inevitably a delay before the data appears in APM. We can leave this at 5s for now.

Duration.ofSeconds(5);

private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STAGING_CAPACITY =
"splunk.snapshot.profiler.staging.capacity";
Collaborator (@laurit, May 5, 2025):

The SDK uses otel.bsp.max.export.batch.size for a similar purpose.

@laurit laurit merged commit ead27f8 into signalfx:main May 7, 2025
26 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators May 7, 2025
@tduncan tduncan deleted the batch-export-snapshot-profiling-stacktraces branch May 7, 2025 16:48