Skip to content

[SLI Metrics] kuberay_job_execution_duration_seconds#3488

Merged
kevin85421 merged 16 commits intoray-project:masterfrom
troychiu:kuberay-job-execution-duration-seconds
Apr 30, 2025
Merged

[SLI Metrics] kuberay_job_execution_duration_seconds#3488
kevin85421 merged 16 commits intoray-project:masterfrom
troychiu:kuberay-job-execution-duration-seconds

Conversation

@troychiu
Copy link
Collaborator

@troychiu troychiu commented Apr 25, 2025

Why are these changes needed?

  • refactor rayJobMetricsCollector, which allows
    • easier unit testing
  • Implement metric kuberay_job_execution_duration_seconds proposed in 1.4.0.
Metric Name Type Description Labels
kuberay_job_execution_duration_seconds Gauge Duration from when the RayJob CR’s JobDeploymentStatus transitions from Initializing to either the Retrying state or a terminal state, such as Complete or Failed. The Retrying state indicates that the CR previously failed and that spec.backoffLimit is enabled. name
namespace
job_deployment_status (Complete/Failed/Retrying)
retry_count

End-to-end test

Create ray jobs and check actual metrics. Below is an example result

# HELP kuberay_job_execution_duration_seconds Duration from when the RayJob CR’s JobDeploymentStatus transitions from Initializing to either the Retrying state or a terminal state, such as Complete or Failed. The Retrying state indicates that the CR previously failed and that spec.backoffLimit is enabled.
# TYPE kuberay_job_execution_duration_seconds gauge
kuberay_job_execution_duration_seconds{job_deployment_status="Complete",name="rayjob-fail-sample",namespace="test",retry_count="1"} 35.219411124
kuberay_job_execution_duration_seconds{job_deployment_status="Retrying",name="rayjob-fail-sample",namespace="test",retry_count="0"} 36.091843603

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

troychiu and others added 9 commits April 20, 2025 22:03
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
@troychiu troychiu changed the title Kuberay job execution duration seconds [SLI Metrics] kuberay_job_execution_duration_seconds Apr 25, 2025
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
rayJobExecutionDurationSeconds: prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "kuberay_job_execution_duration_seconds",
Help: "Duration from RayJob CR initialization to reaching a terminal state.",
Copy link
Member

@win5923 win5923 Apr 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the description should also include retrying state

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Updated. Thank you

win5923

This comment was marked as duplicate.

Copy link
Member

@win5923 win5923 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
@troychiu
Copy link
Collaborator Author

@kevin85421 PTAL. Thank you.

Name: "kuberay_job_execution_duration_seconds",
Help: "Duration from RayJob CR initialization to reaching a terminal state or retrying state, where retrying state indicates the CR was previously failed and backoff is enabled.",
},
[]string{"name", "namespace", "result", "retry_count"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a better name for "result"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to job_deployment_result. Wdyt?

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left nit comments. Others LGTM


rayJobOptions := ray.RayJobReconcilerOptions{
RayJobMetricsCollector: rayJobMetricsCollector,
RayJobMetricsObserver: rayJobMetricsManager,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RayJobMetricsObserver: rayJobMetricsManager,
RayJobMetricsManager: rayJobMetricsManager,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this intentionally since ray job controller only call methods in RayJobMetricsObserver. Do you think using RayJobMetricsManager would be better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d prefer to use RayJobMetricsManager: rayJobMetricsManager. The inconsistency between RayJobMetricsObserver and rayJobMetricsManager looks odd and may confuse future readers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Fixed. Thank you

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only keep emitRayJobExecutionDuration to accept observer so that it can be unit tested

Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
@kevin85421 kevin85421 merged commit 62302d8 into ray-project:master Apr 30, 2025
23 checks passed
laurafitzgerald pushed a commit to laurafitzgerald/kuberay that referenced this pull request May 7, 2025
laurafitzgerald pushed a commit to laurafitzgerald/kuberay that referenced this pull request May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants