
gha_job_execution_duration_seconds records ~2,025 years for jobs canceled before runner assignment #4444

@talsuk5

Description

Summary

gha_job_execution_duration_seconds records a nonsensical duration of roughly 63,911,499,556 seconds (about 2,025 years) for any job that is canceled before it is assigned to a runner. This pollutes Grafana dashboards, skews histogram buckets, and makes any per-(job_name, repository) average derived from _sum / _count unusable unless job_result="canceled" is explicitly filtered out.

Evidence

Raw metric from our Prometheus for job_result="canceled":

gha_job_execution_duration_seconds_sum{
  job_name="Pre-commit (prek)",
  repository="gitops",
  job_result="canceled",
  ...
} = 63911499556

The same (job_name, repository) combination with job_result="succeeded" has realistic values (sum=626 over 4 observations, ~156s avg). Only the canceled path is affected.

63,911,499,556 seconds ≈ 2,025 years. It equals the job's finish time in Unix seconds (~1,775,904,000, i.e. a 2026 timestamp) minus the Unix representation of the zero value of Go's time.Time{} (-62,135,596,800, which is year 1 AD).

Root cause

In cmd/ghalistener/metrics/metrics.go:

func (e *exporter) RecordJobCompleted(msg *scaleset.JobCompleted) {
    l := e.completedJobLabels(msg)
    e.incCounter(MetricCompletedJobsTotal, l)

    executionDuration := msg.FinishTime.Unix() - msg.RunnerAssignTime.Unix()
    e.observeHistogram(MetricJobExecutionDurationSeconds, l, float64(executionDuration))
}

When a job is canceled before ever being assigned to a runner, msg.RunnerAssignTime is the zero value time.Time{}. (time.Time{}).Unix() returns -62135596800 (year 1 AD in Unix seconds). The subtraction msg.FinishTime.Unix() - (-62135596800) then yields ~63.9 billion, which gets recorded into the _seconds histogram.

No check exists for msg.RunnerAssignTime.IsZero() before the subtraction.

Impact

  • Histogram _sum is effectively poisoned by every pre-assignment cancellation, because a single such observation dwarfs thousands of real runs.
  • Any dashboard that computes sum(rate(x_sum)) / sum(rate(x_count)) per repo/job becomes useless without an explicit job_result!="canceled" filter.
  • histogram_quantile over _bucket is similarly affected: the ~63.9-billion-second observations all land in the +Inf bucket, skewing the computed quantiles.

The official ARC Grafana dashboard (docs/gha-runner-scale-set-controller/samples/grafana-dashboard/ARC-Autoscaling-Runner-Set-Monitoring.json) uses this metric for its "Job Execution" heatmap; the heatmap is silently broken for anyone whose workflow ever cancels a pending job.

Suggested fix

Skip the histogram observation entirely when RunnerAssignTime is zero (the job was canceled before a runner was assigned, so there is no meaningful execution duration to record):

func (e *exporter) RecordJobCompleted(msg *scaleset.JobCompleted) {
    l := e.completedJobLabels(msg)
    e.incCounter(MetricCompletedJobsTotal, l)

    if msg.RunnerAssignTime.IsZero() {
        // Job was canceled before runner assignment; no execution phase to measure.
        return
    }
    executionDuration := msg.FinishTime.Unix() - msg.RunnerAssignTime.Unix()
    e.observeHistogram(MetricJobExecutionDurationSeconds, l, float64(executionDuration))
}

The gha_completed_jobs_total counter is still incremented before the early return, so cancellation counts are preserved — only the meaningless duration observation is dropped.

Alternative: record 0 as the duration for pre-assignment cancellations. This keeps the observation in the histogram, but each zero inflates _count without adding to _sum, dragging the cancel-path average toward zero. Skipping the observation is cleaner.

Note: RecordJobStarted at line 491 also subtracts two timestamps without a zero-value check, but that path only runs on the JobStarted message, which by definition has both ScaleSetAssignTime and RunnerAssignTime populated, so it is not affected in practice.

Environment

  • ARC controller chart: gha-runner-scale-set-controller 0.13.0
  • Runner scale set chart: gha-runner-scale-set 0.13.0
  • Observed on commit 79e7b17b563d5492e6c8537e3d6fd52eb6002c3a of this repo; code path unchanged on main as of today.

Local workaround

For anyone else hitting this before the upstream fix lands, the dashboard-side workaround is to add job_result!="canceled" to every duration query:

sum by (job_name, repository) (
  gha_job_execution_duration_seconds_sum{job_result!="canceled"}
)
/
sum by (job_name, repository) (
  gha_job_execution_duration_seconds_count{job_result!="canceled"}
)

Labels: bug, gha-runner-scale-set