Analysis Run Returns Successful Phase When All Measurement Phases Are Either Errored Or Failed #4479

@viv-ng

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

I have a Datadog metric in an analysis run whose phase is 'Successful' even though every one of its measurements has phase 'Error' or 'Failed'.

Here's the Datadog AnalysisTemplate for dd-analysis2 (the metric that incorrectly ends up 'Successful'):

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: dd-analysis2
  ...
spec:
  args:
    ...
  metrics:
  - name: dd-analysis2
    interval: 30s
    count: 5
    successCondition: "result < 0.02"
    failureLimit: 2
    provider:
      datadog:
        interval: 5m
        query: sum:kubernetes.cpu.limits{service:{{ args.service }},workload:{{ args.workload }},env:{{ args.environment }},rollout_revision:{{ args.rollout_revision }}} by {container_name} * 1000

(This is just a test analysis template, and the query is not that useful. I just use a query that will return some data, since my test service receives no traffic.)

Here's the result from argo-rollouts for dd-analysis2, which has the 'Successful' phase:

            {
              "name": "dd-analysis2",
              "phase": "Successful",
              "measurements": [
                {
                  "phase": "Error",
                  "message": "invalid operation: < (mismatched types <nil> and float64)",
                  "startedAt": "2025-10-01T04:19:40Z",
                  "finishedAt": "2025-10-01T04:19:40Z"
                },
                {
                  "phase": "Error",
                  "message": "invalid operation: < (mismatched types <nil> and float64)",
                  "startedAt": "2025-10-01T04:19:50Z",
                  "finishedAt": "2025-10-01T04:19:50Z"
                },
                {
                  "phase": "Error",
                  "message": "invalid operation: < (mismatched types <nil> and float64)",
                  "startedAt": "2025-10-01T04:20:00Z",
                  "finishedAt": "2025-10-01T04:20:00Z"
                },
                {
                  "phase": "Failed",
                  "startedAt": "2025-10-01T04:20:10Z",
                  "finishedAt": "2025-10-01T04:20:10Z",
                  "value": "1000"
                }
              ],
              "count": 1,
              "failed": 1,
              "error": 3
            }
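
To make the mismatch easy to check, here is a minimal Python sketch that tallies the measurement phases in the result above and flags a metric that is 'Successful' without a single successful measurement. The helper names (`measurement_phases`, `is_suspicious`) are made up for illustration and are not part of argo-rollouts; the data is copied from the result above with timestamps omitted.

```python
import json
from collections import Counter

# Metric result copied from the dd-analysis2 output above (timestamps omitted).
METRIC_RESULT = json.loads("""
{
  "name": "dd-analysis2",
  "phase": "Successful",
  "measurements": [
    {"phase": "Error", "message": "invalid operation: < (mismatched types <nil> and float64)"},
    {"phase": "Error", "message": "invalid operation: < (mismatched types <nil> and float64)"},
    {"phase": "Error", "message": "invalid operation: < (mismatched types <nil> and float64)"},
    {"phase": "Failed", "value": "1000"}
  ],
  "count": 1,
  "failed": 1,
  "error": 3
}
""")


def measurement_phases(metric_result):
    """Count how many measurements ended in each phase (hypothetical helper)."""
    return Counter(m["phase"] for m in metric_result.get("measurements", []))


def is_suspicious(metric_result):
    """True when the metric is Successful but no measurement was Successful."""
    phases = measurement_phases(metric_result)
    return metric_result.get("phase") == "Successful" and phases.get("Successful", 0) == 0


if __name__ == "__main__":
    print(dict(measurement_phases(METRIC_RESULT)))  # {'Error': 3, 'Failed': 1}
    print("suspicious:", is_suspicious(METRIC_RESULT))  # suspicious: True
```

The same check could be run over every entry in an AnalysisRun's metric results to spot other metrics in this state.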

Here's another Datadog AnalysisTemplate, dd-analysis3, that resulted in the correct 'Error' phase:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: dd-analysis3
  ...
spec:
  args:
    ...
  metrics:
  - name: dd-analysis3
    interval: 30s
    count: 5
    successCondition: "result < 0.05"
    failureLimit: 2
    provider:
      datadog:
        interval: 5m
        query: sum:(
           sum:trace.fastapi.request.hits{service:{{args.service}},http.status_code:500,env:{{args.environment}}}.as_count() /
           sum:trace.http.request.hits{service:{{args.service }},env:{{args.environment }}}.as_count()
         )

Here's the argo-rollouts result for dd-analysis3, which correctly shows the 'Error' phase (the errors are due to the query returning no data):

            {
              "name": "dd-analysis3",
              "phase": "Error",
              "measurements": [
                {
                  "phase": "Error",
                  "message": "invalid operation: < (mismatched types <nil> and float64)",
                  "startedAt": "2025-10-01T04:19:40Z",
                  "finishedAt": "2025-10-01T04:19:40Z"
                },
                {
                  "phase": "Error",
                  "message": "invalid operation: < (mismatched types <nil> and float64)",
                  "startedAt": "2025-10-01T04:19:50Z",
                  "finishedAt": "2025-10-01T04:19:50Z"
                },
                {
                  "phase": "Error",
                  "message": "invalid operation: < (mismatched types <nil> and float64)",
                  "startedAt": "2025-10-01T04:20:00Z",
                  "finishedAt": "2025-10-01T04:20:00Z"
                },
                {
                  "phase": "Error",
                  "message": "invalid operation: < (mismatched types <nil> and float64)",
                  "startedAt": "2025-10-01T04:20:10Z",
                  "finishedAt": "2025-10-01T04:20:10Z"
                },
                {
                  "phase": "Error",
                  "message": "invalid operation: < (mismatched types <nil> and float64)",
                  "startedAt": "2025-10-01T04:20:20Z",
                  "finishedAt": "2025-10-01T04:20:20Z"
                }
              ],
              "message": "invalid operation: < (mismatched types <nil> and float64)",
              "error": 5,
              "consecutiveError": 5
            }

The Rollout for the service (the other analyses, promql-analysis and dd-analysis, succeeded as expected):

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: pd-ws-test-workload
  ...
spec:
  strategy:
    canary:
      maxUnavailable: 10%
      maxSurge: 10%
      steps:
        - setWeight: 30
        - analysis:
            args:
            - name: rollout_revision
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['rollout.argoproj.io/revision']
            - name: workload
              value: pd-ws-test-workload
            templates:
            - templateName: promql-analysis
            - templateName: dd-analysis
            - templateName: dd-analysis2
            - templateName: dd-analysis3
        - setWeight: 70
        - pause: {}
        - setWeight: 100
...
  analysis:
    successfulRunHistoryLimit: 10
    unsuccessfulRunHistoryLimit: 10
  revisionHistoryLimit: 20
  minReadySeconds: 0

Argo-Rollouts version: 1.6.6

To Reproduce

Expected behavior
I expect dd-analysis2 to have the 'Failed' phase instead of 'Successful', since all of its measurements are either 'Error' or 'Failed'.
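
One possible reading of the counters, offered as a guess rather than a confirmed root cause: neither limit was exceeded for dd-analysis2 (failed: 1 against failureLimit: 2, and no consecutiveError reported after the 'Failed' measurement), and the log line "Metric Assessment Result - Successful: Run Terminated" suggests the metric was resolved when the run terminated. The Python sketch below is a simplified model of such a threshold-based assessment, not the controller's actual code. The `assess` function, the `MetricCounters` type, and the consecutive error limit of 4 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetricCounters:
    count: int = 0              # non-error measurements, as the results above suggest
    failed: int = 0             # measurements that failed the successCondition
    consecutive_error: int = 0  # errors in a row since the last non-error measurement


def assess(counters, failure_limit, count_target, run_terminating,
           consecutive_error_limit=4):
    """Simplified, hypothetical model of a threshold-based metric assessment."""
    if counters.failed > failure_limit:
        return "Failed"
    if counters.consecutive_error > consecutive_error_limit:
        return "Error"
    if counters.count >= count_target:
        return "Successful"
    # Neither limit was exceeded and the target count was not reached.
    # In this model, a terminating run resolves the metric as Successful.
    return "Successful" if run_terminating else "Running"


if __name__ == "__main__":
    # dd-analysis2: count=1, failed=1, no consecutiveError in the result.
    dd2 = MetricCounters(count=1, failed=1, consecutive_error=0)
    print(assess(dd2, failure_limit=2, count_target=5, run_terminating=True))   # Successful

    # dd-analysis3: five errors in a row, nothing else.
    dd3 = MetricCounters(count=0, failed=0, consecutive_error=5)
    print(assess(dd3, failure_limit=2, count_target=5, run_terminating=False))  # Error
```

Under this model, both observed outcomes are reproduced. That would mean the 'Successful' phase comes from how a metric is resolved at termination, not from any measurement actually passing.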

Screenshots

Version
1.6.6

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>
time="2025-10-01T04:20:20Z" level=info msg="Event(v1.ObjectReference{Kind:\"AnalysisRun\", Namespace:\"pd-ws-test-dev\", Name:\"pd-ws-test-workload-6b447b6ff5-36-1\", UID:\"e6e35c8d-e0dc-46f5-87b0-df26092c3c59\", APIVersion:\"argoproj.io/v1alpha1\", ResourceVersion:\"2375908069\", FieldPath:\"\"}): type: 'Normal' reason: 'MetricSuccessful' Metric 'dd-analysis2' Completed. Result: Successful"
time="2025-10-01T04:20:20Z" level=info msg="Metric 'dd-analysis2' Completed. Result: Successful" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 event_reason=MetricSuccessful namespace=pd-ws-test-dev
time="2025-10-01T04:20:20Z" level=info msg="Metric 'dd-analysis2' transitioned from Running -> Successful" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:20:20Z" level=info msg="Metric Assessment Result - Successful: Run Terminated" metric=dd-analysis2
time="2025-10-01T04:20:20Z" level=info msg="Skipping measurement: run is terminating" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:20:10Z" level=info msg="Measurement Completed. Result: Failed" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:20:10Z" level=warning msg="Datadog will soon deprecate their API v1. Please consider switching to v2 soon." analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:20:10Z" level=info msg="Running overdue measurement" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:20:00Z" level=warning msg="Measurement had error: invalid operation: < (mismatched types <nil> and float64)" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:20:00Z" level=info msg="Measurement Completed. Result: Error" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:20:00Z" level=warning msg="Datadog will soon deprecate their API v1. Please consider switching to v2 soon." analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:20:00Z" level=info msg="Running overdue measurement" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:50Z" level=warning msg="Measurement had error: invalid operation: < (mismatched types <nil> and float64)" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:50Z" level=info msg="Measurement Completed. Result: Error" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:50Z" level=warning msg="Datadog will soon deprecate their API v1. Please consider switching to v2 soon." analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:50Z" level=info msg="Running overdue measurement" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:40Z" level=warning msg="Measurement had error: invalid operation: < (mismatched types <nil> and float64)" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:40Z" level=info msg="Measurement Completed. Result: Error" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:40Z" level=warning msg="Datadog will soon deprecate their API v1. Please consider switching to v2 soon." analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:40Z" level=info msg="Running initial measurement" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
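
The log lines above are listed newest first. As a reading aid, here is a small Python sketch that parses a few of them (copied verbatim from above) and prints the measurement results for one metric in chronological order. It assumes the lines are simple `key=value` pairs with double-quoted values, which holds for the sample; `parse_logfmt` and `measurement_timeline` are hypothetical helper names.

```python
import shlex

# A subset of the controller log lines above, copied verbatim (newest first).
LOG_LINES = '''
time="2025-10-01T04:20:20Z" level=info msg="Metric Assessment Result - Successful: Run Terminated" metric=dd-analysis2
time="2025-10-01T04:20:10Z" level=info msg="Measurement Completed. Result: Failed" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:20:00Z" level=info msg="Measurement Completed. Result: Error" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:50Z" level=info msg="Measurement Completed. Result: Error" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
time="2025-10-01T04:19:40Z" level=info msg="Measurement Completed. Result: Error" analysisrun=pd-ws-test-workload-6b447b6ff5-36-1 metric=dd-analysis2 namespace=pd-ws-test-dev
'''.strip().splitlines()

PREFIX = "Measurement Completed. Result: "


def parse_logfmt(line):
    """Split one key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def measurement_timeline(lines, metric):
    """Return (time, result) pairs for one metric, oldest first."""
    events = []
    for line in lines:
        fields = parse_logfmt(line)
        msg = fields.get("msg", "")
        if fields.get("metric") == metric and msg.startswith(PREFIX):
            events.append((fields["time"], msg[len(PREFIX):]))
    # ISO 8601 UTC timestamps sort correctly as plain strings.
    return sorted(events)


if __name__ == "__main__":
    for when, result in measurement_timeline(LOG_LINES, "dd-analysis2"):
        print(when, result)
```

For this sample the timeline is three 'Error' results followed by one 'Failed' result, matching the measurements in the dd-analysis2 result.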

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

Labels: analysis (Related to Analysis CRD), bug (Something isn't working)
