Merged
32 commits
72d1404
Add an accumulating staging area.
tduncan Mar 6, 2025
28772e7
Add the AsyncStackTraceExporter.
tduncan Mar 7, 2025
2bec2fa
Merge branch 'main' into port-profile-log-export
tduncan Mar 11, 2025
2eaf7e6
Add StackTraceExporter implementation that emits stack traces transla…
tduncan Mar 12, 2025
b78aa4b
Add the StackTraceExporterProvider.
tduncan Mar 12, 2025
cff68ec
Update AccumulatingStagingArea to accept a Supplier of StackTraceExpo…
tduncan Mar 12, 2025
4ec661d
Add an AgentListener to configure the Otel logger for the snapsho…
tduncan Mar 12, 2025
62cbcfb
Add function to Configuration for the snapshot profiling feature flag.
tduncan Mar 12, 2025
be77245
Apply spotless code formatting.
tduncan Mar 12, 2025
a9a98ad
Replace noop staging area with a real implementation.
tduncan Mar 12, 2025
7f70840
Update all usages of OpenTelemetrySdkExtension to the version in the …
tduncan Mar 12, 2025
1190956
Apply spotless code formatting.
tduncan Mar 12, 2025
02ed77a
Add test verifying that trace snapshot profiling log message exportin…
tduncan Mar 13, 2025
30551e4
Include the overall stack frame count in the snapshot profiling log m…
tduncan Mar 13, 2025
aab9b3d
Properly reset the StackTraceExporterProvider after SnapshotProfiling…
tduncan Mar 13, 2025
98d7943
Remove unused method from PprofTranslator.
tduncan Mar 13, 2025
7234dcf
Merging main into branch 'port-profile-log-export'.
tduncan Mar 17, 2025
fad3539
Apply spotless code formatting.
tduncan Mar 17, 2025
b50f54c
Remove the trace ID from the snapshot profiling log message context.
tduncan Mar 19, 2025
072faa0
Assume a consistent sampling duration rather than compute it.
tduncan Mar 19, 2025
826cfae
Apply spotless code formatting.
tduncan Mar 19, 2025
e0db0d6
Merge branch 'main' into port-profile-log-export
tduncan Mar 19, 2025
9b10d29
Share pprof code
laurit Mar 19, 2025
b5bc51d
Merge pull request #1 from laurit/share-pprof
tduncan Mar 21, 2025
cc4c27d
Remove commented out tests in AsyncStackTraceExporterTest.
tduncan Mar 21, 2025
475b5d9
Add configuration option for controlling the max depth of stack trace…
tduncan Mar 21, 2025
a5bfcbc
Add tests for new stack trace exporting and configuration option for …
tduncan Mar 21, 2025
bd562fd
Add configuration option for controlling the snapshot profiling sampl…
tduncan Mar 21, 2025
c89be8a
Use a single threaded Executor in AsyncStackTraceExporter, not a Sche…
tduncan Mar 24, 2025
f8ef40e
Improve exception error message in AsyncStackTraceExporter.
tduncan Mar 24, 2025
b6f6055
Remove configured wait timeout for tests in ScheduledExecutorStackTra…
tduncan Mar 24, 2025
9358d52
Use ConfigProperties method 'getDuration' when looking for the snapsh…
tduncan Mar 24, 2025
Conversations
1 change: 1 addition & 0 deletions profiler/build.gradle.kts
@@ -37,6 +37,7 @@ dependencies {
annotationProcessor("com.google.auto.service:auto-service")
compileOnly("com.google.auto.service:auto-service")

testImplementation(project(":custom"))
Contributor Author @tduncan tduncan Mar 13, 2025
I needed access to the AutoConfigureUtil class defined in this subproject. Used in StackTraceExporterActivator to access the ConfigProperties and Resource fields in the AutoConfiguredOpenTelemetrySdk when configuring the Otel Logger.

The compileOnly access already defined did not allow me to test that the configuration was taking place.

testImplementation("io.opentelemetry.javaagent:opentelemetry-javaagent-extension-api")
testImplementation("io.grpc:grpc-netty")
testImplementation("io.opentelemetry.instrumentation:opentelemetry-instrumentation-api")
@@ -74,6 +74,12 @@ public class Configuration implements AutoConfigurationCustomizerProvider {
public static final String CONFIG_KEY_SNAPSHOT_SELECTION_RATE = "splunk.snapshot.selection.rate";
private static final double DEFAULT_SNAPSHOT_SELECTION_RATE = 0.01;
private static final double MAX_SNAPSHOT_SELECTION_RATE = 0.10;
private static final String CONFIG_KEY_SNAPSHOT_PROFILER_STACK_DEPTH =
"splunk.snapshot.profiler.max.stack.depth";
private static final int DEFAULT_SNAPSHOT_PROFILER_STACK_DEPTH = 1024;
private static final String CONFIG_KEY_SNAPSHOT_PROFILER_SAMPLING_INTERVAL =
"splunk.snapshot.profiler.sampling.interval";
public static final Duration DEFAULT_SNAPSHOT_PROFILER_SAMPLING_INTERVAL = Duration.ofMillis(20);

@Override
public void customize(AutoConfigurationCustomizer autoConfiguration) {
@@ -181,6 +187,10 @@ private static int getJavaVersion() {
return Integer.parseInt(javaSpecVersion);
}

public static boolean isSnapshotProfilingEnabled(ConfigProperties properties) {
return properties.getBoolean(CONFIG_KEY_ENABLE_SNAPSHOT_PROFILER, false);
}
Contributor Author
A 2nd use was added in StackTraceExporterActivator so I thought it made sense to roll this into the Configuration class.


public static double getSnapshotSelectionRate(ConfigProperties properties) {
String selectionRatePropertyValue =
properties.getString(
@@ -207,4 +217,15 @@ public static double getSnapshotSelectionRate(ConfigProperties properties) {
return DEFAULT_SNAPSHOT_SELECTION_RATE;
}
}

public static int getSnapshotProfilerStackDepth(ConfigProperties properties) {
return properties.getInt(
CONFIG_KEY_SNAPSHOT_PROFILER_STACK_DEPTH, DEFAULT_SNAPSHOT_PROFILER_STACK_DEPTH);
}

public static Duration getSnapshotProfilerSamplingInterval(ConfigProperties properties) {
return properties.getDuration(
CONFIG_KEY_SNAPSHOT_PROFILER_SAMPLING_INTERVAL,
DEFAULT_SNAPSHOT_PROFILER_SAMPLING_INTERVAL);
}
}
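The two new accessors follow the same pattern as the existing ones: read a key, fall back to a hardcoded default when it is absent. A minimal sketch of that defaulting behavior, using a plain map-backed stub rather than the real `ConfigProperties` (class and method names below are hypothetical, for illustration only):

```java
import java.time.Duration;
import java.util.Map;

// Map-backed stand-in (not the real ConfigProperties) showing how the new
// accessors fall back to their documented defaults when a key is unset.
public class ConfigDefaultsSketch {
    static final String STACK_DEPTH_KEY = "splunk.snapshot.profiler.max.stack.depth";
    static final String INTERVAL_KEY = "splunk.snapshot.profiler.sampling.interval";

    static int stackDepth(Map<String, String> props) {
        String value = props.get(STACK_DEPTH_KEY);
        return value == null ? 1024 : Integer.parseInt(value);
    }

    static Duration samplingInterval(Map<String, String> props) {
        String value = props.get(INTERVAL_KEY);
        // The real code uses ConfigProperties.getDuration, which also accepts
        // unit suffixes such as "20ms"; this sketch only parses plain millis.
        return value == null ? Duration.ofMillis(20) : Duration.ofMillis(Long.parseLong(value));
    }

    public static void main(String[] args) {
        assert stackDepth(Map.of()) == 1024;
        assert stackDepth(Map.of(STACK_DEPTH_KEY, "256")) == 256;
        assert samplingInterval(Map.of()).toMillis() == 20;
    }
}
```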
@@ -29,7 +29,6 @@
import com.splunk.opentelemetry.profiler.allocation.exporter.AllocationEventExporter;
import com.splunk.opentelemetry.profiler.allocation.exporter.PprofAllocationEventExporter;
import com.splunk.opentelemetry.profiler.context.SpanContextualizer;
import com.splunk.opentelemetry.profiler.events.EventPeriods;
import com.splunk.opentelemetry.profiler.exporter.CpuEventExporter;
import com.splunk.opentelemetry.profiler.exporter.PprofCpuEventExporter;
import com.splunk.opentelemetry.profiler.util.HelpfulExecutors;
@@ -130,13 +129,12 @@ private void activateJfrAndRunForever(ConfigProperties config, Resource resource

EventReader eventReader = new EventReader();
SpanContextualizer spanContextualizer = new SpanContextualizer(eventReader);
EventPeriods periods = new EventPeriods(jfrSettings::get);
LogRecordExporter logsExporter = LogExporterBuilder.fromConfig(config);

CpuEventExporter cpuEventExporter =
PprofCpuEventExporter.builder()
.otelLogger(buildOtelLogger(SimpleLogRecordProcessor.create(logsExporter), resource))
.eventPeriods(periods)
.period(Configuration.getCallStackInterval(config))
.stackDepth(stackDepth)
.build();

@@ -0,0 +1,60 @@
/*
* Copyright Splunk Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.splunk.opentelemetry.profiler;

import com.google.common.annotations.VisibleForTesting;
import io.opentelemetry.api.logs.Logger;
import io.opentelemetry.sdk.autoconfigure.spi.ConfigProperties;
import io.opentelemetry.sdk.logs.LogRecordProcessor;
import io.opentelemetry.sdk.logs.SdkLoggerProvider;
import io.opentelemetry.sdk.logs.export.LogRecordExporter;
import io.opentelemetry.sdk.logs.export.SimpleLogRecordProcessor;
import io.opentelemetry.sdk.resources.Resource;
import java.util.function.Function;

public class OtelLoggerFactory {
private final Function<ConfigProperties, LogRecordExporter> logRecordExporter;

public OtelLoggerFactory() {
this(LogExporterBuilder::fromConfig);
}

@VisibleForTesting
public OtelLoggerFactory(Function<ConfigProperties, LogRecordExporter> logRecordExporter) {
this.logRecordExporter = logRecordExporter;
}

public Logger build(ConfigProperties properties, Resource resource) {
LogRecordExporter exporter = createLogRecordExporter(properties);
LogRecordProcessor processor = SimpleLogRecordProcessor.create(exporter);
return buildOtelLogger(processor, resource);
}

private LogRecordExporter createLogRecordExporter(ConfigProperties properties) {
return logRecordExporter.apply(properties);
}

private Logger buildOtelLogger(LogRecordProcessor logProcessor, Resource resource) {
return SdkLoggerProvider.builder()
.addLogRecordProcessor(logProcessor)
.setResource(resource)
.build()
.loggerBuilder(ProfilingSemanticAttributes.OTEL_INSTRUMENTATION_NAME)
.setInstrumentationVersion(ProfilingSemanticAttributes.OTEL_INSTRUMENTATION_VERSION)
.build();
}
}

This file was deleted.

@@ -17,10 +17,20 @@
package com.splunk.opentelemetry.profiler.exporter;

import com.splunk.opentelemetry.profiler.context.StackToSpanLinkage;
import java.time.Instant;

public interface CpuEventExporter {

void export(StackToSpanLinkage stackToSpanLinkage);

default void export(
long threadId,
String threadName,
Thread.State threadState,
StackTraceElement[] stackTrace,
Instant eventTime,
String traceId,
String spanId) {}

default void flush() {}
}
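The new seven-argument `export` is declared as a `default` no-op so that implementations written against the old interface keep compiling unchanged. The interface-evolution pattern in miniature (names here are hypothetical, not from the profiler):

```java
// Why the new overload is a default no-op: classes that implemented the
// interface before the overload existed still compile, and calls to the new
// method on them silently do nothing until they choose to override it.
interface Exporter {
    void export(String event);                           // original method

    default void export(String event, long threadId) {}  // later addition, no-op
}

class LegacyExporter implements Exporter {               // predates the overload
    int count = 0;

    @Override
    public void export(String event) {
        count++;
    }
}

public class DefaultMethodSketch {
    public static void main(String[] args) {
        LegacyExporter exporter = new LegacyExporter();
        exporter.export("cpu-sample");       // dispatches to the override
        exporter.export("cpu-sample", 42L);  // falls back to the default no-op
        assert exporter.count == 1;
    }
}
```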
@@ -30,26 +30,27 @@
import com.splunk.opentelemetry.profiler.InstrumentationSource;
import com.splunk.opentelemetry.profiler.ProfilingDataType;
import com.splunk.opentelemetry.profiler.context.StackToSpanLinkage;
import com.splunk.opentelemetry.profiler.events.EventPeriods;
import com.splunk.opentelemetry.profiler.exporter.StackTraceParser.StackTrace;
import com.splunk.opentelemetry.profiler.pprof.Pprof;
import io.opentelemetry.api.logs.Logger;
import io.opentelemetry.api.trace.SpanContext;
import io.opentelemetry.api.trace.SpanId;
import io.opentelemetry.api.trace.TraceId;
import java.time.Duration;
import java.time.Instant;

public class PprofCpuEventExporter implements CpuEventExporter {
private final EventPeriods eventPeriods;
private final Duration period;
private final int stackDepth;
private final PprofLogDataExporter pprofLogDataExporter;
private Pprof pprof = createPprof();

private PprofCpuEventExporter(Builder builder) {
this.eventPeriods = builder.eventPeriods;
this.period = builder.period;
this.stackDepth = builder.stackDepth;
this.pprofLogDataExporter =
new PprofLogDataExporter(
builder.otelLogger, ProfilingDataType.CPU, InstrumentationSource.CONTINUOUS);
builder.otelLogger, ProfilingDataType.CPU, builder.instrumentationSource);
}

@Override
@@ -80,10 +81,7 @@ public void export(StackToSpanLinkage stackToSpanLinkage) {

String eventName = stackToSpanLinkage.getSourceEventName();
pprof.addLabel(sample, SOURCE_EVENT_NAME, eventName);
Duration eventPeriod = eventPeriods.getDuration(eventName);
if (!EventPeriods.UNKNOWN.equals(eventPeriod)) {
pprof.addLabel(sample, SOURCE_EVENT_PERIOD, eventPeriod.toMillis());
}
pprof.addLabel(sample, SOURCE_EVENT_PERIOD, period.toMillis());
Instant time = stackToSpanLinkage.getTime();
pprof.addLabel(sample, SOURCE_EVENT_TIME, time.toEpochMilli());

@@ -96,6 +94,52 @@ public void export(StackToSpanLinkage stackToSpanLinkage) {
pprof.getProfileBuilder().addSample(sample);
}

@Override
public void export(
long threadId,
String threadName,
Thread.State threadState,
StackTraceElement[] stackTrace,
Instant eventTime,
String traceId,
String spanId) {
Sample.Builder sample = Sample.newBuilder();

pprof.addLabel(sample, THREAD_ID, threadId);
pprof.addLabel(sample, THREAD_NAME, threadName);
pprof.addLabel(sample, THREAD_STATE, threadState.name());

if (stackTrace.length > stackDepth) {
pprof.addLabel(sample, THREAD_STACK_TRUNCATED, true);
}

for (int i = 0; i < Math.min(stackDepth, stackTrace.length); i++) {
StackTraceElement ste = stackTrace[i];

String fileName = ste.getFileName();
if (fileName == null) {
fileName = "unknown";
}
String className = ste.getClassName();
String methodName = ste.getMethodName();
int lineNumber = Math.max(ste.getLineNumber(), 0);
sample.addLocationId(pprof.getLocationId(fileName, className, methodName, lineNumber));
pprof.incFrameCount();
}

pprof.addLabel(sample, SOURCE_EVENT_PERIOD, period.toMillis());
pprof.addLabel(sample, SOURCE_EVENT_TIME, eventTime.toEpochMilli());

if (TraceId.isValid(traceId)) {
pprof.addLabel(sample, TRACE_ID, traceId);
}
if (SpanId.isValid(spanId)) {
pprof.addLabel(sample, SPAN_ID, spanId);
}

pprof.getProfileBuilder().addSample(sample);
}

private static Pprof createPprof() {
return new Pprof();
}
@@ -123,8 +167,9 @@ public static Builder builder() {

public static class Builder {
private Logger otelLogger;
private EventPeriods eventPeriods;
private Duration period;
private int stackDepth;
private InstrumentationSource instrumentationSource = InstrumentationSource.CONTINUOUS;

public PprofCpuEventExporter build() {
return new PprofCpuEventExporter(this);
@@ -135,14 +180,19 @@ public Builder otelLogger(Logger otelLogger) {
return this;
}

public Builder eventPeriods(EventPeriods eventPeriods) {
this.eventPeriods = eventPeriods;
public Builder period(Duration period) {
this.period = period;
return this;
}

public Builder stackDepth(int stackDepth) {
this.stackDepth = stackDepth;
return this;
}

public Builder instrumentationSource(InstrumentationSource instrumentationSource) {
this.instrumentationSource = instrumentationSource;
return this;
}
}
}
@@ -0,0 +1,53 @@
/*
* Copyright Splunk Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.splunk.opentelemetry.profiler.snapshot;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

class AccumulatingStagingArea implements StagingArea {
Collaborator
Here you collect stack traces per trace id and when the trace is done you export them. To me the staging area abstraction doesn't seem ideal. When there is a long request you end up accumulating many stack traces, when the request is short you export a small batch with only one or two traces. When there are multiple concurrent short requests you'll export many small batches. With the non-snapshot profiler we already get the input data in batches and the interval between the batches is large so we don't need to handle this. Did you consider alternative solutions like for example what is used in https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/export/BatchSpanProcessorBuilder.java Place data in the queue, if queue has batch size items (there it is 512) export them, otherwise wait for a bit (there 5s) and export whatever is in the queue. If queue grows too large drop data.

Contributor Author

This sounds like a very good idea. Would you be against doing this as a followup (this PR is a partial blocker to adding span correlation)?

I don't think a time-based approach is necessarily the right approach however. What I'm envisioning is a BufferingStagingArea with some number of max stack traces before it automatically empties itself in addition to the forced empty upon detection of the entry span ending. I don't know what that number should be; maybe 20?

This would control how many stack traces are bundled into a single log record. The frequency of the log record exporting is separately controlled by the LogRecordProcessor configured in OtelLoggerFactory. Right now it wires up a SimpleLogRecordProcessor, which I believe immediately exports the log record. We could use the BatchLogRecordProcessor instead, though I've had issues getting this to work in the past.

Collaborator

Would you be against doing this as a followup

sure

I don't think a time-based approach is necessarily the right approach however. What I'm envisioning is a BufferingStagingArea with some number of max stack traces before it automatically empties itself in addition to the forced empty upon detection of the entry span ending. I don't know what that number should be; maybe 20?

Flushing when there are no new stack traces in some time isn't that hard. When you write stacks from all ongoing traces to the same pprof then flushing on ending the entry span doesn't make sense. 20 is imo too low, 2000 or even larger would be better. I think that determining the batch size requires some experimentation.

This would control how many stack traces are bundled into a single log record. The frequency of the log record exporting is separately controlled by the LogRecordProcessor configured in OtelLoggerFactory. Right now it wires up a SimpleLogRecordProcessor which I believe immediately exports the log record. We could use In the BatchLogRecordProcessor instead, though I've had issues getting this to work in the past.

Using BatchLogRecordProcessor here isn't ideal because pprof works best when you put lots of stacks into the same batch. When the same string is used multiple times then the shared string pool will help in reducing the transfer size. We need to manage creating these batches ourself to not export too often so we really don't gain much from using the BatchLogRecordProcessor.

private final ConcurrentMap<String, List<StackTrace>> stackTraces = new ConcurrentHashMap<>();
private final Supplier<StackTraceExporter> exporter;

AccumulatingStagingArea(Supplier<StackTraceExporter> exporter) {
this.exporter = exporter;
}

@Override
public void stage(String traceId, StackTrace stackTrace) {
stackTraces.compute(
traceId,
(id, stackTraces) -> {
if (stackTraces == null) {
stackTraces = new ArrayList<>();
}
stackTraces.add(stackTrace);
return stackTraces;
});
}

@Override
public void empty(String traceId) {
List<StackTrace> stackTraces = this.stackTraces.remove(traceId);
if (stackTraces != null) {
exporter.get().export(stackTraces);
}
}
}
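The size-bounded buffering floated in the review thread above (a followup, not part of this PR) could look roughly like this: one shared buffer for all traces that flushes as a batch once it reaches a maximum size. The class name `BufferingStagingArea`, the `String` stand-in for `StackTrace`, and the threshold are all assumptions for illustration; as noted in the thread, the right batch size would need experimentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the proposed size-bounded staging area: stack traces
// from all ongoing traces accumulate in one buffer, which is exported as a
// single batch when maxSize is reached (or when flush() is called explicitly).
class BufferingStagingArea {
    private final List<String> buffer = new ArrayList<>();
    private final int maxSize;
    private final Consumer<List<String>> exporter;

    BufferingStagingArea(int maxSize, Consumer<List<String>> exporter) {
        this.maxSize = maxSize;
        this.exporter = exporter;
    }

    synchronized void stage(String stackTrace) {
        buffer.add(stackTrace);
        if (buffer.size() >= maxSize) {
            flush();
        }
    }

    synchronized void flush() {
        if (!buffer.isEmpty()) {
            exporter.accept(new ArrayList<>(buffer)); // export a copy of the batch
            buffer.clear();
        }
    }
}

public class BufferingStagingAreaSketch {
    public static void main(String[] args) {
        List<List<String>> batches = new ArrayList<>();
        BufferingStagingArea area = new BufferingStagingArea(3, batches::add);
        area.stage("frameA");
        area.stage("frameB");
        assert batches.isEmpty();        // below the threshold, nothing exported yet
        area.stage("frameC");
        assert batches.size() == 1;      // threshold reached, one batch exported
        assert batches.get(0).size() == 3;
    }
}
```

Compared to the per-trace `AccumulatingStagingArea` above, this keeps batch sizes uniform under many short concurrent requests, at the cost of mixing stack traces from different traces into one pprof payload.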