[FLINK-36172][metrics][rest] Optimize transient metric cleanup #26204

Open
wants to merge 1 commit into
base: master
@fdc91 commented on Feb 24, 2025

What is the purpose of the change

This PR addresses a performance regression in MetricStore that affects clients fetching metrics, such as the autoscaler and the web UI. The /metrics endpoint can become unresponsive because transient metrics for completed subtasks are removed synchronously during metric retrieval. The slowdown is most pronounced when the JobManager (JM) has multiple jobs or subtasks in a terminal state, and it prevents latency-sensitive clients such as the autoscaler from fetching metrics in time. Flamegraph analysis identified the root cause as the inefficient cleanup routine introduced with FLINK-31650, which runs synchronously on the retrieval path.

Brief change log

  • Optimized the metrics cleanup process in MetricStore by caching the names of transient metrics when first stored
  • Improved metric removal efficiency by executing the cleanup routine only once
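
The two changes above can be illustrated with a minimal sketch. It is not the code in this PR and does not use Flink's actual MetricStore API: the class `SubtaskMetricsSketch`, its methods, and the `isTransient` flag are hypothetical names chosen for illustration, and the sketch assumes no further metric updates arrive for a subtask once it has reached a terminal state. The idea is to record transient metric names when they are first stored, so that cleanup removes exactly those keys once instead of rescanning all metrics on every retrieval.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified per-subtask store; NOT Flink's MetricStore. */
class SubtaskMetricsSketch {
    private final Map<String, String> metrics = new HashMap<>();
    // Names of transient metrics, cached at insertion time (assumed design).
    private final Set<String> transientNames = new HashSet<>();
    private boolean cleanedUp = false;

    /** Stores a metric; the caller decides whether it is transient (assumption). */
    void put(String name, String value, boolean isTransient) {
        metrics.put(name, value);
        if (isTransient) {
            transientNames.add(name);
        }
    }

    /** Removes the cached transient metrics; repeated calls are no-ops. */
    void markTerminal() {
        if (cleanedUp) {
            return;
        }
        // Cost depends on the number of transient metrics, not on all metrics.
        metrics.keySet().removeAll(transientNames);
        transientNames.clear();
        cleanedUp = true;
    }

    /** Read path: returns a copy and performs no cleanup work. */
    Map<String, String> snapshot() {
        return new HashMap<>(metrics);
    }
}
```

In this sketch, a caller would invoke `markTerminal()` when a subtask finishes, and `snapshot()` then serves reads without any cleanup on the retrieval path. Where the real implementation triggers cleanup and how it classifies a metric as transient are not stated in this description.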

Verifying this change

This change relies on the unit tests added in #23988.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@fdc91 changed the title from "[FLINK-36172][runtime] Optimize transient metric cleanup" to "[FLINK-36172][metrics][rest] Optimize transient metric cleanup" on Feb 24, 2025
@flinkbot
Collaborator

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build
