[FLINK-36932][metrics] Added resource-level metrics for different states/statuses #926

Open
wants to merge 1 commit into
base: main

pgrefviau

What is the purpose of the change

This PR adds new metrics that track the current value of different states/statuses at the resource level. Metrics already exist for some of these statuses/states, but they represent namespace-wide or system-wide counts, as opposed to per-resource gauges that indicate whether a given deployment/session job is in a particular state.

In other cases, statuses/states that were not yet tracked through a dedicated metric (e.g., job status) now have a resource-level gauge and a namespace-level counter.

Brief change log

Summary of the changes for each state/status:

  • JobManagerDeploymentStatus: state gauge added at resource-level (FlinkDeployment only)
  • JobStatus: status gauge added at resource-level (FlinkDeployment only), status counter at namespace-level
  • ResourceLifecycleState: state gauge added at resource-level
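The per-resource gauge pattern summarized above can be sketched roughly as follows. This is a minimal illustration using only the Java standard library, not the code in this PR: `SketchState`, `PerStateGaugesSketch`, and the listed enum values are hypothetical stand-ins, and a plain `Supplier<Integer>` stands in for a metric gauge. Each enum value gets its own gauge that reads 1 when the resource is currently in that state and 0 otherwise.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-in for an operator state enum; the values are illustrative only.
enum SketchState { CREATED, DEPLOYED, STABLE, FAILED }

// Holds one 0/1 gauge per enum value for a single resource.
final class PerStateGaugesSketch {
    // Latest observed state of the resource (assumed to start as CREATED).
    private final AtomicReference<SketchState> current =
            new AtomicReference<>(SketchState.CREATED);

    // A Supplier<Integer> stands in for a gauge registered with a metric group.
    private final Map<SketchState, Supplier<Integer>> gauges =
            new EnumMap<>(SketchState.class);

    PerStateGaugesSketch() {
        for (SketchState state : SketchState.values()) {
            // Each gauge reports 1 only while the resource is in its state.
            gauges.put(state, () -> current.get() == state ? 1 : 0);
        }
    }

    // Called whenever a new state is observed for the resource.
    void update(SketchState next) {
        current.set(next);
    }

    // Reads the gauge for one state.
    int read(SketchState state) {
        return gauges.get(state).get();
    }

    // Number of gauges created for this one resource.
    int gaugeCount() {
        return gauges.size();
    }
}
```

In this sketch the number of gauges per resource equals the number of enum values, so the total grows with both the number of resources and the size of each enum.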

Verifying this change

This change added tests and can be verified as follows:

  • Updated test cases for resource lifecycle metrics and Flink deployment metrics to account for new resource-level metrics
    • Also added utility methods to test classes to reduce duplicated test logic
  • Changes were deployed and tested using our own fork/instance of the operator

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? no

N.B. While these changes might not represent a full-on "feature", I'm planning to update the documentation that generates this page. However, I've held off doing this as part of this initial commit in order to settle the naming and implementation. Once this is done, I can update the documentation accordingly.

@gyfora
Contributor

gyfora commented Feb 5, 2025

I am wondering if this is a good use of the metric system in Flink. We are introducing an explosion of new metrics and we are using a large number of gauges to represent ENUM values for a single resource.

Maybe we should have a single resource.status gauge that returns the value of the enum instead? Not sure if that would work with string/enum values but I don't think the current approach is good either.
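For illustration, a single-gauge alternative might look roughly like the sketch below. It is an assumption-laden outline, not a proposal from the PR itself: `SketchStatus`, `SingleStatusGaugeSketch`, and the numeric codes are hypothetical, and a `Supplier<Integer>` again stands in for a gauge. The enum is mapped to an explicit numeric code rather than to its ordinal, so that reordering the enum would not silently change reported values. Whether string or enum gauge values would be handled by a given metrics reporter is left open here.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical status enum with explicit, stable numeric codes (illustrative values).
enum SketchStatus {
    CREATED(0), DEPLOYED(1), STABLE(2), FAILED(3);

    private final int code;

    SketchStatus(int code) {
        this.code = code;
    }

    int code() {
        return code;
    }
}

// Holds one gauge per resource that reports the numeric code of the current status.
final class SingleStatusGaugeSketch {
    private final AtomicReference<SketchStatus> current =
            new AtomicReference<>(SketchStatus.CREATED);

    // Stand-in for a single registered gauge, e.g. one named "resource.status".
    final Supplier<Integer> statusGauge = () -> current.get().code();

    // Called whenever a new status is observed for the resource.
    void update(SketchStatus next) {
        current.set(next);
    }
}
```

The trade-off in this sketch is that consumers of the metric need the code-to-status mapping in order to interpret the number, whereas the per-state 0/1 gauges are self-describing.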


2 participants