Skip to content

Conversation

@Blazer-007
Copy link
Member

@Blazer-007 Blazer-007 commented Jun 11, 2025

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

This pull request introduces new support for OpenTelemetry-based metrics in Gobblin, with a focus on extensibility and configurability for metric reporting. It adds configuration options, implements new metric types and helpers, and introduces standardized metric names and dimensions for Gobblin jobs.

The most important changes are:

OpenTelemetry Metrics Configuration:

  • Added new configuration keys in ConfigurationKeys for OpenTelemetry metrics, including class name, enablement flag, histogram settings, dimensions, and group name. Also fixed the interval millis key and provided default values. [1] [2]

OpenTelemetry Metrics Implementation:

  • Updated OpenTelemetryMetrics to use a thread-safe singleton pattern, allow configuration of histogram aggregation (max buckets and scale), and register a histogram view with the meter provider. [1] [2] [3] [4]
  • Added imports for new OpenTelemetry SDK classes to support histogram aggregation and instrument selection.

New Metric Types and Utilities:

  • Introduced OpenTelemetryDoubleHistogram, a wrapper for OpenTelemetry's double histogram metric, supporting recording with base and additional attributes.
  • Added OpenTelemetryHelper, a utility class for handling OpenTelemetry attributes, including merging and converting maps to attributes.

Standardized Gobblin Metric Names and Dimensions:

  • Added GobblinOpenTelemetryMetrics, an enum defining standard metric names and types for Gobblin job state and latency, with methods to create metrics.
  • Added GobblinOpenTelemetryMetricsConstants, defining standard dimension keys and values for Gobblin job metrics (e.g., state transitions, job phases).

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

  • Added change to OpenTelemetryMetrics to emit ExponentialHistogramMetrics

  • Created package org.apache.gobblin.metrics.opentelemetry; which contains different classes required for OTelMetrics emission

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

  • OpenTelemetryHelperTest

  • OpenTelemetryInstrumentationTest

  • OpenTelemetryMetricTest

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@Blazer-007 Blazer-007 requested a review from Copilot June 11, 2025 14:45
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for emitting OpenTelemetry metrics in Gobblin, including the integration of exponential histogram metrics and long counter metrics via newly introduced classes and activity types.

  • Added new activity and helper classes to emit and manage OpenTelemetry metrics (e.g., EmitOTelMetrics, EmitOTelMetricsImpl).
  • Updated workflows, workers, and job launchers to incorporate metric emission, and added unit tests for the new metrics functionality.

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
CommitStepWorkflowImpl.java Added metric emission using EmitOTelMetrics with new attribute assignments.
WorkFulfillmentWorker.java Included EmitOTelMetricsImpl in the activity implementation array.
ExecuteGobblinJobLauncher.java Integrated metric emission before and after workflow execution, including latency histograms.
EmitOTelMetricsImpl.java & EmitOTelMetrics.java New activity interface and implementation for OpenTelemetry metric emission.
ActivityType.java & GobblinTemporalConfigurationKeys.java Updated to define new activity type EMIT_OTEL_METRICS and configuration keys.
Various files under gobblin-metrics Added support for OpenTelemetry metrics instrumentation, including counters, histograms, and tests.
ConfigurationKeys.java (gobblin-api) Updated to include OpenTelemetry-related configuration keys.
Comments suppressed due to low confidence (1)

gobblin-temporal/src/main/java/org/apache/gobblin/temporal/ddm/workflow/impl/CommitStepWorkflowImpl.java:51

  • Consider replacing the literal 'currState' with the constant GaaSOpenTelemetryMetricsConstants.DimensionKeys.CURR_STATE for consistency with the rest of the codebase.
attributes.put("currState", "processWUStart");

@Getter
@AllArgsConstructor
public enum GaaSOpenTelemetryMetrics {
GAAS_JOB_STATUS("gaas_job_status", "Gaas job status counter", "1", OpenTelemetryMetricType.LONG_COUNTER),
Copy link
Contributor

@abhishekmjain abhishekmjain Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 is metric unit here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly s in the below one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 is default unit here , although we should use curly annotation as mentioned in wiki but have kept default for simplicity
s represents second to measure latency metrics
https://opentelemetry.io/docs/specs/semconv/general/metrics/

String openTelemetryClassName = state.getProp(ConfigurationKeys.METRICS_REPORTING_OPENTELEMETRY_CLASSNAME,
ConfigurationKeys.DEFAULT_METRICS_REPORTING_OPENTELEMETRY_CLASSNAME);
Class<?> metricsClass = Class.forName(openTelemetryClassName);
Method getInstanceMethod = metricsClass.getMethod("getInstance", State.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use MethodUtils helper class instead of invoking methods ourself?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name,
attrs,
this.meter.histogramBuilder(name)
.setDescription(metric.getMetricDescription())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

description and unit can be used directly

log.info("FINISHED - ExecuteGobblinWorkflow.execute = {}", execGobblinStats);
attributes.put(CURR_STATE, JOB_COMPLETE);
emitOTelMetrics.emitLongCounterMetric(GaaSOpenTelemetryMetrics.GAAS_JOB_STATUS, 1L, attributes, finalProps);
attributes.remove(CURR_STATE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please elaborate on why are we doing this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing specific, just avoiding creating a new Map object

Comment on lines 23 to 24
public static final String STATE = "state";
public static final String CURR_STATE = "currState";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between state and currState?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

State -- GenWU, ProcessWU, CommitStep
CurrState -- GenWUStart, GenWUComplete, ...

Comment on lines 108 to 110
public OpenTelemetryMetric getOrCreate(GaaSOpenTelemetryMetrics metric) {
return this.metrics.computeIfAbsent(metric.getMetricName(), name -> createMetric(metric));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the metrics map uses only metricName as key. If two metrics share the same name but differ in type (eg, LONG_COUNTER vs DOUBLE_HISTOGRAM), this can silently cause incorrect caching behavior. It would be better to use a composite key like metricName + "_" + metricType to avoid collisions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two attributes, two metrics, or two events MUST NOT share the same name. Different entities (attribute and metric, metric and event) MAY share the same name.
https://opentelemetry.io/docs/specs/semconv/general/naming/#name-reuse-prohibition


@Getter
@AllArgsConstructor
public enum GaaSOpenTelemetryMetrics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move the metric creation logic into the GaaSOpenTelemetryMetrics enum itself instead of centralizing it in OpenTelemetryInstrumentation.createMetric.

Currently, the enum acts as a metadata holder, while the actual instantiation logic(via switch-case) is external and hence scattered. This becomes harder to extend if we introduce new metric types(eg gauges, timers, etc). A more extensible approach would be to let each enum constant hold a factory method and expose a createMetric() method

@Blazer-007 Blazer-007 force-pushed the virai_add_mdm_status_metric branch from d96262a to 832a86c Compare July 28, 2025 06:27
# Conflicts:
#	gobblin-temporal/src/main/java/org/apache/gobblin/temporal/ddm/workflow/impl/CommitStepWorkflowImpl.java
� Conflicts:
�	gobblin-temporal/src/main/java/org/apache/gobblin/temporal/ddm/workflow/impl/CommitStepWorkflowImpl.java
@Blazer-007 Blazer-007 force-pushed the virai_add_mdm_status_metric branch from 6d21c96 to 3148265 Compare December 11, 2025 05:10
Comment on lines +31 to +33
* Metric Unit: 1 represents each increment will add one data point to the counter.
* */
GOBBLIN_JOB_STATE("gobblin.job.state", "Gobblin job state counter", "1", OpenTelemetryMetricType.LONG_COUNTER),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how are we using metricUnit? 1 is string here.. the metric is incremented directly on emission emitOTelMetrics.emitLongCounterMetric(GobblinOpenTelemetryMetrics.GOBBLIN_JOB_STATE, 1L, attributes, finalProps);

public static final String CHILD_WORKFLOW_ID_BASE = "NestingExecWorkUnits";
public static final String COMMIT_STEP_WORKFLOW_ID_BASE = "CommitStepWorkflow";

private EmitOTelMetrics emitOTelMetricsActivityStub;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the workflow implementations is now directly dependent on OpenTelemetry which creates tight coupling with Otel, this makes it harder to switch metrics system in future. We can introduce a metrics abstraction and use Otel metrics as one implementation.

public interface JobMetricsRecorder {
...
}

public class OpenTelemetryJobMetricsRecorder implements JobMetricsRecorder {
    private final OpenTelemetryInstrumentation instrumentation;
    ...
}

public class ProcessWorkUnitsWorkflowImpl implements ProcessWorkUnitsWorkflow {
    private JobMetricsRecorder metricsRecorder;
    ...
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation is tight coupling wrt name only , we can look at the name change end to end later on if required as OpenTelemetry name is being used across orchestrator as well to emit final status metric.
In suggested interface & its impl we are just adding extra hop and the object of that impl can't be directly used inside temporal workflow as OpenTelemetryJobMetricsRecorder -> OpenTelemetryInstrumentation -> meter and meter object is not serializable so again we will need a temporal activity stub to connect to this JobMetricRecorder and that is what current impl is doing as well (if we ignore the name of class) , if we want to use other metric system in future we can just change EmitOTelMetricsImpl itself regardless of name.

I will raise a followup PR to address the name change of Activity interface and impl

@Blazer-007 Blazer-007 merged commit 663929c into apache:master Dec 17, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants