Skip to content

Conversation

@jjyao
Copy link
Collaborator

@jjyao jjyao commented Nov 24, 2025

Description

This gives us observability when metrics failed to be exported to the dashboard agent.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@jjyao jjyao requested a review from a team as a code owner November 24, 2025 06:51
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Nov 24, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a wrapper for OtlpGrpcMetricExporter to add logging for export failures, which enhances observability. The implementation is straightforward and correct. Additionally, there are changes in MetricsAgentClientImpl to optimize the handling of a std::function callback by using move semantics. I've found a small issue with this optimization where mutable is missing on lambdas, preventing the intended move from happening, and I've provided a suggestion to correct it.

Comment on lines +47 to +80
HealthCheck(rpc::HealthCheckRequest(),
[this,
init_exporter_fn = std::move(init_exporter_fn),
retry_count,
max_retry,
retry_interval_ms](auto &status, auto &&reply) {
if (status.ok()) {
if (exporter_initialized_) {
return;
}
init_exporter_fn(status);
exporter_initialized_ = true;
RAY_LOG(INFO) << "Exporter initialized.";
return;
}
if (retry_count >= max_retry) {
init_exporter_fn(Status::RpcError(
"Running out of retries to initialize the metrics agent.", 14));
return;
}
io_service_.post(
[this,
init_exporter_fn = std::move(init_exporter_fn),
retry_count,
max_retry,
retry_interval_ms]() {
WaitForServerReadyWithRetry(std::move(init_exporter_fn),
retry_count + 1,
max_retry,
retry_interval_ms);
},
"MetricsAgentClient.WaitForServerReadyWithRetry",
retry_interval_ms * 1000);
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To ensure that std::move(init_exporter_fn) actually performs a move and not a copy, the lambdas need to be marked as mutable. By default, a lambda's operator() is const, which prevents moving from its members (init-captures become members). std::move on a const object results in a copy, defeating the purpose of this optimization. Adding mutable removes the const qualifier from the operator(), allowing the move to happen as intended.

  HealthCheck(rpc::HealthCheckRequest(),
              [this,
               init_exporter_fn = std::move(init_exporter_fn),
               retry_count,
               max_retry,
               retry_interval_ms](auto &status, auto &&reply) mutable {
                if (status.ok()) {
                  if (exporter_initialized_) {
                    return;
                  }
                  init_exporter_fn(status);
                  exporter_initialized_ = true;
                  RAY_LOG(INFO) << "Exporter initialized.";
                  return;
                }
                if (retry_count >= max_retry) {
                  init_exporter_fn(Status::RpcError(
                      "Running out of retries to initialize the metrics agent.", 14));
                  return;
                }
                io_service_.post(
                    [this,
                     init_exporter_fn = std::move(init_exporter_fn),
                     retry_count,
                     max_retry,
                     retry_interval_ms]() mutable {
                      WaitForServerReadyWithRetry(std::move(init_exporter_fn),
                                                  retry_count + 1,
                                                  max_retry,
                                                  retry_interval_ms);
                    },
                    "MetricsAgentClient.WaitForServerReadyWithRetry",
                    retry_interval_ms * 1000);
              });

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Nov 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

2 participants