[SLI Metrics] kuberay_job_execution_duration_seconds #3488
kevin85421 merged 16 commits into ray-project:master
Signed-off-by: Troy Chiu <y.troychiu@gmail.com>
    rayJobExecutionDurationSeconds: prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "kuberay_job_execution_duration_seconds",
            Help: "Duration from RayJob CR initialization to reaching a terminal state.",
I think the description should also include the retrying state.
Sure! Updated. Thank you.
@kevin85421 PTAL. Thank you.
            Name: "kuberay_job_execution_duration_seconds",
            Help: "Duration from RayJob CR initialization to reaching a terminal state or retrying state, where retrying state indicates the CR was previously failed and backoff is enabled.",
        },
        []string{"name", "namespace", "result", "retry_count"},
Is there a better name for "result"?
Change to job_deployment_result. WDYT?
kevin85421 left a comment:
Left nit comments. Others LGTM.
ray-operator/main.go (outdated)
    rayJobOptions := ray.RayJobReconcilerOptions{
        RayJobMetricsCollector: rayJobMetricsCollector,
        RayJobMetricsObserver: rayJobMetricsManager,
Suggested change:
-       RayJobMetricsObserver: rayJobMetricsManager,
+       RayJobMetricsManager: rayJobMetricsManager,
I did this intentionally, since the RayJob controller only calls methods in RayJobMetricsObserver. Do you think using RayJobMetricsManager would be better?
I’d prefer to use RayJobMetricsManager: rayJobMetricsManager. The inconsistency between RayJobMetricsObserver and rayJobMetricsManager looks odd and may confuse future readers.
That makes sense. Fixed. Thank you.
I only kept emitRayJobExecutionDuration accepting the observer, so that it can be unit tested.
Why are these changes needed?
Metric: kuberay_job_execution_duration_seconds
Labels: name, namespace, job_deployment_status (Complete/Failed/Retrying), retry_count
End-to-end test
Create RayJobs and check the actual metrics. Below is an example result.
Related issue number
Checks