-
Notifications
You must be signed in to change notification settings - Fork 749
[GOBBLIN-2209] Emit GaaS Executor Otel Metrics #4118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GOBBLIN-2209] Emit GaaS Executor Otel Metrics #4118
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for emitting OpenTelemetry metrics in Gobblin, including the integration of exponential histogram metrics and long counter metrics via newly introduced classes and activity types.
- Added new activity and helper classes to emit and manage OpenTelemetry metrics (e.g., EmitOTelMetrics, EmitOTelMetricsImpl).
- Updated workflows, workers, and job launchers to incorporate metric emission, and added unit tests for the new metrics functionality.
Reviewed Changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| CommitStepWorkflowImpl.java | Added metric emission using EmitOTelMetrics with new attribute assignments. |
| WorkFulfillmentWorker.java | Included EmitOTelMetricsImpl in the activity implementation array. |
| ExecuteGobblinJobLauncher.java | Integrated metric emission before and after workflow execution, including latency histograms. |
| EmitOTelMetricsImpl.java & EmitOTelMetrics.java | New activity interface and implementation for OpenTelemetry metric emission. |
| ActivityType.java & GobblinTemporalConfigurationKeys.java | Updated to define new activity type EMIT_OTEL_METRICS and configuration keys. |
| Various files under gobblin-metrics | Added support for OpenTelemetry metrics instrumentation, including counters, histograms, and tests. |
| ConfigurationKeys.java (gobblin-api) | Updated to include OpenTelemetry-related configuration keys. |
Comments suppressed due to low confidence (1)
gobblin-temporal/src/main/java/org/apache/gobblin/temporal/ddm/workflow/impl/CommitStepWorkflowImpl.java:51
- Consider replacing the literal 'currState' with the constant GaaSOpenTelemetryMetricsConstants.DimensionKeys.CURR_STATE for consistency with the rest of the codebase.
attributes.put("currState", "processWUStart");
...mporal/src/main/java/org/apache/gobblin/temporal/ddm/launcher/ExecuteGobblinJobLauncher.java
Outdated
Show resolved
Hide resolved
| @Getter | ||
| @AllArgsConstructor | ||
| public enum GaaSOpenTelemetryMetrics { | ||
| GAAS_JOB_STATUS("gaas_job_status", "Gaas job status counter", "1", OpenTelemetryMetricType.LONG_COUNTER), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 is metric unit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly s in the below one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 is default unit here , although we should use curly annotation as mentioned in wiki but have kept default for simplicity
s represents second to measure latency metrics
https://opentelemetry.io/docs/specs/semconv/general/metrics/
...-metrics/src/main/java/org/apache/gobblin/metrics/opentelemetry/OpenTelemetryMetricType.java
Show resolved
Hide resolved
...blin-metrics/src/main/java/org/apache/gobblin/metrics/opentelemetry/OpenTelemetryHelper.java
Show resolved
Hide resolved
...metrics/src/main/java/org/apache/gobblin/metrics/opentelemetry/GaaSOpenTelemetryMetrics.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/gobblin/metrics/opentelemetry/GaaSOpenTelemetryMetricsConstants.java
Outdated
Show resolved
Hide resolved
| String openTelemetryClassName = state.getProp(ConfigurationKeys.METRICS_REPORTING_OPENTELEMETRY_CLASSNAME, | ||
| ConfigurationKeys.DEFAULT_METRICS_REPORTING_OPENTELEMETRY_CLASSNAME); | ||
| Class<?> metricsClass = Class.forName(openTelemetryClassName); | ||
| Method getInstanceMethod = metricsClass.getMethod("getInstance", State.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we use MethodUtils helper class instead of invoking methods ourself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do but this is the standard pattern being followed across codebase
Line 92 in ddd9468
| Class<?> datasetFinderClass = Class.forName(className); |
| name, | ||
| attrs, | ||
| this.meter.histogramBuilder(name) | ||
| .setDescription(metric.getMetricDescription()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description and unit can be used directly
...ics/src/main/java/org/apache/gobblin/metrics/opentelemetry/OpenTelemetryInstrumentation.java
Outdated
Show resolved
Hide resolved
| log.info("FINISHED - ExecuteGobblinWorkflow.execute = {}", execGobblinStats); | ||
| attributes.put(CURR_STATE, JOB_COMPLETE); | ||
| emitOTelMetrics.emitLongCounterMetric(GaaSOpenTelemetryMetrics.GAAS_JOB_STATUS, 1L, attributes, finalProps); | ||
| attributes.remove(CURR_STATE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please elaborate on why are we doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing specific, just avoiding creating a new Map object
...oral/src/main/java/org/apache/gobblin/temporal/ddm/workflow/impl/CommitStepWorkflowImpl.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/gobblin/temporal/ddm/workflow/impl/ProcessWorkUnitsWorkflowImpl.java
Show resolved
Hide resolved
...metrics/src/main/java/org/apache/gobblin/metrics/opentelemetry/GaaSOpenTelemetryMetrics.java
Outdated
Show resolved
Hide resolved
| public static final String STATE = "state"; | ||
| public static final String CURR_STATE = "currState"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between state and currState?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
State -- GenWU, ProcessWU, CommitStep
CurrState -- GenWUStart, GenWUComplete, ...
...metrics/src/main/java/org/apache/gobblin/metrics/opentelemetry/GaaSOpenTelemetryMetrics.java
Outdated
Show resolved
Hide resolved
...blin-metrics/src/main/java/org/apache/gobblin/metrics/opentelemetry/OpenTelemetryHelper.java
Outdated
Show resolved
Hide resolved
...ics/src/main/java/org/apache/gobblin/metrics/opentelemetry/OpenTelemetryDoubleHistogram.java
Outdated
Show resolved
Hide resolved
...blin-metrics/src/main/java/org/apache/gobblin/metrics/opentelemetry/OpenTelemetryHelper.java
Show resolved
Hide resolved
| public OpenTelemetryMetric getOrCreate(GaaSOpenTelemetryMetrics metric) { | ||
| return this.metrics.computeIfAbsent(metric.getMetricName(), name -> createMetric(metric)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the metrics map uses only metricName as key. If two metrics share the same name but differ in type (eg, LONG_COUNTER vs DOUBLE_HISTOGRAM), this can silently cause incorrect caching behavior. It would be better to use a composite key like metricName + "_" + metricType to avoid collisions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two attributes, two metrics, or two events MUST NOT share the same name. Different entities (attribute and metric, metric and event) MAY share the same name.
https://opentelemetry.io/docs/specs/semconv/general/naming/#name-reuse-prohibition
...metrics/src/main/java/org/apache/gobblin/metrics/opentelemetry/OpenTelemetryLongCounter.java
Outdated
Show resolved
Hide resolved
|
|
||
| @Getter | ||
| @AllArgsConstructor | ||
| public enum GaaSOpenTelemetryMetrics { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can move the metric creation logic into the GaaSOpenTelemetryMetrics enum itself instead of centralizing it in OpenTelemetryInstrumentation.createMetric.
Currently, the enum acts as a metadata holder, while the actual instantiation logic(via switch-case) is external and hence scattered. This becomes harder to extend if we introduce new metric types(eg gauges, timers, etc). A more extensible approach would be to let each enum constant hold a factory method and expose a createMetric() method
d96262a to
832a86c
Compare
# Conflicts: # gobblin-temporal/src/main/java/org/apache/gobblin/temporal/ddm/workflow/impl/CommitStepWorkflowImpl.java
� Conflicts: � gobblin-temporal/src/main/java/org/apache/gobblin/temporal/ddm/workflow/impl/CommitStepWorkflowImpl.java
6d21c96 to
3148265
Compare
| * Metric Unit: 1 represents each increment will add one data point to the counter. | ||
| * */ | ||
| GOBBLIN_JOB_STATE("gobblin.job.state", "Gobblin job state counter", "1", OpenTelemetryMetricType.LONG_COUNTER), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how are we using metricUnit? 1 is string here.. the metric is incremented directly on emission emitOTelMetrics.emitLongCounterMetric(GobblinOpenTelemetryMetrics.GOBBLIN_JOB_STATE, 1L, attributes, finalProps);
| public static final String CHILD_WORKFLOW_ID_BASE = "NestingExecWorkUnits"; | ||
| public static final String COMMIT_STEP_WORKFLOW_ID_BASE = "CommitStepWorkflow"; | ||
|
|
||
| private EmitOTelMetrics emitOTelMetricsActivityStub; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the workflow implementations is now directly dependent on OpenTelemetry which creates tight coupling with Otel, this makes it harder to switch metrics system in future. We can introduce a metrics abstraction and use Otel metrics as one implementation.
public interface JobMetricsRecorder {
...
}
public class OpenTelemetryJobMetricsRecorder implements JobMetricsRecorder {
private final OpenTelemetryInstrumentation instrumentation;
...
}
public class ProcessWorkUnitsWorkflowImpl implements ProcessWorkUnitsWorkflow {
private JobMetricsRecorder metricsRecorder;
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation is tight coupling wrt name only , we can look at the name change end to end later on if required as OpenTelemetry name is being used across orchestrator as well to emit final status metric.
In suggested interface & its impl we are just adding extra hop and the object of that impl can't be directly used inside temporal workflow as OpenTelemetryJobMetricsRecorder -> OpenTelemetryInstrumentation -> meter and meter object is not serializable so again we will need a temporal activity stub to connect to this JobMetricRecorder and that is what current impl is doing as well (if we ignore the name of class) , if we want to use other metric system in future we can just change EmitOTelMetricsImpl itself regardless of name.
I will raise a followup PR to address the name change of Activity interface and impl
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
This pull request introduces new support for OpenTelemetry-based metrics in Gobblin, with a focus on extensibility and configurability for metric reporting. It adds configuration options, implements new metric types and helpers, and introduces standardized metric names and dimensions for Gobblin jobs.
The most important changes are:
OpenTelemetry Metrics Configuration:
ConfigurationKeysfor OpenTelemetry metrics, including class name, enablement flag, histogram settings, dimensions, and group name. Also fixed the interval millis key and provided default values. [1] [2]OpenTelemetry Metrics Implementation:
OpenTelemetryMetricsto use a thread-safe singleton pattern, allow configuration of histogram aggregation (max buckets and scale), and register a histogram view with the meter provider. [1] [2] [3] [4]New Metric Types and Utilities:
OpenTelemetryDoubleHistogram, a wrapper for OpenTelemetry's double histogram metric, supporting recording with base and additional attributes.OpenTelemetryHelper, a utility class for handling OpenTelemetry attributes, including merging and converting maps to attributes.Standardized Gobblin Metric Names and Dimensions:
GobblinOpenTelemetryMetrics, an enum defining standard metric names and types for Gobblin job state and latency, with methods to create metrics.GobblinOpenTelemetryMetricsConstants, defining standard dimension keys and values for Gobblin job metrics (e.g., state transitions, job phases).JIRA
Description
Here are some details about my PR, including screenshots (if applicable):
Added change to
OpenTelemetryMetricsto emit ExponentialHistogramMetricsCreated
package org.apache.gobblin.metrics.opentelemetry;which contains different classes required for OTelMetrics emissionTests
My PR adds the following unit tests OR does not need testing for this extremely good reason:
OpenTelemetryHelperTest
OpenTelemetryInstrumentationTest
OpenTelemetryMetricTest
Commits