[FLINK-39541] Improve operator metrics documentation and bundle addit…#1102

Open
Dennis-Mircea wants to merge 1 commit into apache:main from Dennis-Mircea:FLINK-39541

@Dennis-Mircea
Contributor

…ional metric reporters

What is the purpose of the change

This pull request rewrites the Metrics and Logging documentation page of the Flink Kubernetes Operator to make the operator's metric surface discoverable and unambiguous, extends the operator image with two additional metric reporter plugins (Dropwizard, OpenTelemetry), and clarifies in code that every Flink metrics.* key is honoured by the operator under the kubernetes.operator.metrics.* prefix.
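
As a rough illustration of that prefix rule, the sketch below selects keys that start with `kubernetes.operator.metrics.` and strips the leading `kubernetes.operator.` so that plain `metrics.*` keys remain. The class name, the helper `toFlinkMetricConfig`, and all sample keys and values are hypothetical; this is not the operator's actual `OperatorMetricUtils#createMetricConfig` code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a simplified stand-in, not the operator's real implementation. */
class OperatorMetricPrefixSketch {

    // Prefix named in the PR description; removing it leaves a plain "metrics.*" key.
    static final String OPERATOR_PREFIX = "kubernetes.operator.";

    /** Hypothetical helper: keeps only operator-prefixed metric keys and strips the prefix. */
    static Map<String, String> toFlinkMetricConfig(Map<String, String> operatorConfig) {
        String selector = OPERATOR_PREFIX + "metrics.";
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : operatorConfig.entrySet()) {
            if (entry.getKey().startsWith(selector)) {
                // Drop "kubernetes.operator." but keep the "metrics." part of the key.
                result.put(entry.getKey().substring(OPERATOR_PREFIX.length()), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Sample entries are placeholders chosen for this sketch.
        conf.put("kubernetes.operator.metrics.reporter.prom.port", "9999");
        conf.put("kubernetes.operator.unrelated.option", "ignored");
        System.out.println(toFlinkMetricConfig(conf));
    }
}
```

Keys that do not carry the operator metrics prefix are left out of the result, which mirrors the idea that only metric options are forwarded to the metric configuration.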

Brief change log

  • Documentation rewrite (docs/content/docs/operations/metrics-logging.md + Chinese translation):
    • Expanded the Scope section with a new How Metric Identifiers Are Built subsection explaining the difference between scope components and logical scope, and how non-labeling (SLF4J/JMX/Graphite) vs. labeling (Prometheus/Datadog/InfluxDB) reporters assemble metric identifiers. Added concrete Prometheus and SLF4J/JMX examples for System / Namespace / Resource scopes.
    • Added a new Operator Custom Resource Metrics table following the Flink metrics reference styling, grouped by Scope / Resource type / Metrics / Description / Type, covering FlinkDeployment, FlinkSessionJob, FlinkBlueGreenDeployment and FlinkStateSnapshot across System / Namespace / Resource scopes (including autoscaler counters, version / resource-usage gauges, blue-green Failures counter, snapshot state gauges).
    • Reordered and introduced per-topic subsections with high-level explanatory paragraphs: FlinkDeployment Version and Resource Usage, FlinkDeployment / FlinkSessionJob Lifecycle metrics, FlinkBlueGreenDeployment Lifecycle metrics, FlinkDeployment / FlinkSessionJob JobStatus Tracking, FlinkBlueGreenDeployment JobStatus Tracking, FlinkStateSnapshot State Tracking, and Scaling metrics.
    • Added the new Scaling metrics subsection with a high-level paragraph and an alphabetically sorted <ScalingMetric> table (previously not documented).
    • Converted the Kubernetes Client Metrics and Kubernetes client metrics by Http Response Code tables to the same <table class="table table-bordered"> styling used by the other operator metric tables, with the metric names sorted alphabetically.
    • JOSDK Metrics: linked to the upstream JOSDK metrics documentation and clarified that those metrics are subject to the same scope/reporter rules.
    • Metric Reporters: updated the bundled-reporters list (adds Dropwizard and OpenTelemetry), added an Operator-scoped Metric Configuration subsection explaining how the kubernetes.operator.metrics.* prefix is stripped to metrics.* at startup, and added a Configuring Reporters on a FlinkDeployment example clarifying that spec.flinkConfiguration uses the plain metrics.reporter.* prefix.
  • Image / packaging (flink-kubernetes-operator/pom.xml): added flink-metrics-dropwizard and flink-metrics-otel to the maven-dependency-plugin artifactItems so both plugins end up under /opt/flink/plugins/ in the operator image.
  • Code clarification (KubernetesOperatorMetricOptions.java): expanded the class-level javadoc to state that only operator-specific toggles and k8soperator.* scope formats are declared here, and that Flink metrics.* keys are honoured when prefixed with kubernetes.operator. (stripped and forwarded by OperatorMetricUtils#createMetricConfig). Reporter options are intentionally not redeclared as typed ConfigOptions.
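
The difference between non-labeling and labeling reporters mentioned in the Scope bullet above can be pictured with a small sketch. It is a simplification under stated assumptions: the scope value, logical scope, metric name, and variables are all made up for the example, and real reporters additionally filter characters and may add their own prefixes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Simplified illustration of the two identifier styles; not reporter source code. */
class MetricIdentifierSketch {

    /** Non-labeling style: the resolved scope values are baked into one delimited name. */
    static String flatIdentifier(String resolvedScope, String metricName) {
        return resolvedScope + "." + metricName;
    }

    /** Labeling style: logical scope plus metric name form the name, variables become labels. */
    static String labeledIdentifier(
            String logicalScope, String metricName, Map<String, String> variables) {
        StringJoiner labels = new StringJoiner(",", "{", "}");
        variables.forEach((key, value) -> labels.add(key + "=\"" + value + "\""));
        String name = (logicalScope + "_" + metricName).replace('.', '_');
        return name + labels;
    }

    public static void main(String[] args) {
        // All values below are hypothetical placeholders.
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("host", "op-host");
        variables.put("namespace", "default");

        System.out.println(flatIdentifier("op-host.default", "Count"));
        System.out.println(labeledIdentifier("example.namespace", "Count", variables));
    }
}
```

In the flat form the host and namespace are part of the name itself, whereas in the labeled form the name stays stable and the same values travel as labels, which is why the two reporter families need separate examples in the documentation.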

Verifying this change

This change is a documentation / packaging / javadoc change without any new runtime logic.

  • Documentation: built the docs site locally and visually reviewed the rewritten page (scope examples, tables, hint blocks, navigation).
  • Image: built the operator image and verified that /opt/flink/plugins/flink-metrics-dropwizard and /opt/flink/plugins/flink-metrics-otel are present.
  • Regression: configured an operator with kubernetes.operator.metrics.reporter.prom.factory.class=... and confirmed metrics are exposed on the configured port (no behaviour change expected).
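
The image check above could be automated along the following lines. This is only a sketch: the class and method names are hypothetical, and it assumes the program runs inside the operator image (or is pointed at an extracted copy of its plugins directory through the first command-line argument).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that reports which expected plugin directories are absent. */
class PluginPresenceSketch {

    static List<String> missingPlugins(Path pluginsRoot, List<String> pluginDirs) {
        List<String> missing = new ArrayList<>();
        for (String dir : pluginDirs) {
            // A plugin counts as present only if its directory exists under the root.
            if (!Files.isDirectory(pluginsRoot.resolve(dir))) {
                missing.add(dir);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Default root taken from the PR description; override it via the first argument.
        Path root = Path.of(args.length > 0 ? args[0] : "/opt/flink/plugins");
        List<String> missing =
                missingPlugins(root, List.of("flink-metrics-dropwizard", "flink-metrics-otel"));
        System.out.println(
                missing.isEmpty()
                        ? "all expected plugin directories present"
                        : "missing: " + missing);
    }
}
```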

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable
