@ronaldngounou ronaldngounou commented Jan 20, 2026

Context

This prometheus-operator manifest is about 10 months old. This PR upgrades prometheus-operator to 0.88.0, the latest version at the time this PR was raised.

Summary

  1. To bump prometheus-operator, I reconciled every file in the manifests folder against the latest upstream manifest using a 3-way merge.
  2. I created a new presubmit in the same folder as xref so that the presubmit runs at 100 nodes. As a consequence, the 100-node scale tests are now mandatory presubmits before running the 5000-node scale tests. This also lets the kops job capture logs from prometheus-operator. Done in PR "Add presubmit pull-perf-tests-ec2-master-scale-performance-100" (test-infra#36280).
  3. Updated Prometheus to v3.9.1, the latest version at this time according to the release history.

Impact

This PR fixes the APIResponsivenessPrometheus regression at the source in order to:

  • address any Kubernetes release blocking by vendors.
  • unblock the Kubernetes 1.36 release for the 1.36 Release Team.

Logs

At the moment, the job (https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2/2013507147345694720) is failing, as observed in the build logs below and in the scalability test failures.

From the logs:

{ Failure :0
[measurement call APIResponsivenessPrometheus - APIResponsivenessPrometheus error: top latency metric: there should be no high-latency requests, but: [got: &{Resource:events Subresource: Verb:DELETE Scope:namespace Latency:perc50: 59.925s, perc90: 1m0s, perc99: 1m0s Count:66 SlowCount:66}; expected perc99 <= 30s]]
:0}
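
For readers unfamiliar with the measurement, the failure above comes down to comparing the reported 99th-percentile latency against a 30s threshold. Below is a minimal Python sketch of that comparison, assuming Go-style duration strings as they appear in the log. `parse_duration` and `violates_threshold` are hypothetical helper names for illustration, not ClusterLoader2 functions.

```python
import re

# Assumption: durations use Go-style strings like those in the log ("59.925s", "1m0s").
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed first so it is not mistaken for minutes followed by garbage.
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_duration(text: str) -> float:
    """Convert a Go-style duration string such as '1m0s' to seconds."""
    parts = _PART.findall(text)
    # Reject input that the pattern only partially covers.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def violates_threshold(perc99: str, threshold: str = "30s") -> bool:
    """Return True when the 99th percentile exceeds the allowed threshold."""
    return parse_duration(perc99) > parse_duration(threshold)


if __name__ == "__main__":
    # Values taken from the failure message above (DELETE events, namespace scope).
    print(violates_threshold("1m0s"))
```

With the logged value of perc99 = 1m0s, the check reports a violation, which matches the "expected perc99 <= 30s" message.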

According to https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2/2013507147345694720/build-log.txt, the ec2-master-scale test is failing due to:

Failure: no endpoints available for service "prometheus-k8s" (ServiceUnavailable)
W0120 07:27:54.118590   22340 util.go:72] error while calling prometheus api: the server is currently unable to handle the request (get services http:prometheus-k8s:9090), response: k8s v1 Status
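
The "no endpoints available" error suggests that the prometheus-k8s Service had no ready backends when the request was proxied. Below is a rough Python sketch of how one might check this from an Endpoints object, assuming it has already been fetched and decoded into a dict (for example from the JSON output of `kubectl get endpoints prometheus-k8s -o json`). `ready_address_count` is a hypothetical helper and the sample objects are illustrative, not taken from the failing cluster.

```python
def ready_address_count(endpoints: dict) -> int:
    """Count ready addresses across all subsets of a core/v1 Endpoints object."""
    subsets = endpoints.get("subsets") or []
    # Only "addresses" are ready; "notReadyAddresses" are deliberately ignored.
    return sum(len(subset.get("addresses") or []) for subset in subsets)


if __name__ == "__main__":
    # Illustrative sample: the only pod behind the Service is not ready.
    not_ready = {
        "kind": "Endpoints",
        "metadata": {"name": "prometheus-k8s"},
        "subsets": [{"notReadyAddresses": [{"ip": "10.0.0.5"}]}],
    }
    if ready_address_count(not_ready) == 0:
        print("prometheus-k8s has no ready endpoints")
```

A count of zero is consistent with the ServiceUnavailable response in the log, although it does not by itself explain why the Prometheus pod was not ready.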

References

https://kubernetes.slack.com/archives/C09QZTRH7/p1767977126452879

Related PRs:

cc: @upodroid @dims @mengqiy

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 20, 2026
@ronaldngounou ronaldngounou force-pushed the bump-prometheus-operator-to088 branch from 3c19ec2 to 6a310e0 Compare January 21, 2026 00:49
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 21, 2026
@ronaldngounou ronaldngounou force-pushed the bump-prometheus-operator-to088 branch 3 times, most recently from bf0702e to 6871432 Compare January 21, 2026 00:55
@ronaldngounou
Member Author

/retest

@ronaldngounou
Member Author

/retest

@upodroid
Member

/test pull-perf-tests-ec2-master-scale-performance-100

@ronaldngounou ronaldngounou force-pushed the bump-prometheus-operator-to088 branch from f2798eb to 3a979ac Compare January 22, 2026 08:08
@ronaldngounou
Member Author

/test pull-perf-tests-ec2-master-scale-performance-100

@ronaldngounou ronaldngounou force-pushed the bump-prometheus-operator-to088 branch from 3a979ac to 68c64be Compare January 22, 2026 09:27
To help address the ec2-master-scale-performance job failing in https://testgrid.k8s.io/amazon-ec2-release#ec2-master-scale-performance,
this PR upgrades the prometheus-operator version to 0.88.0.
As a result, it will help address the job (https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kops-aws-scale-amazonvpc-using-cl2/2013507147345694720)
failing due to the Prometheus "no endpoints available for service" (ServiceUnavailable) error observed in the build logs.

Signed-off-by: Ronald Ngounou <rngounou@amazon.com>
@ronaldngounou ronaldngounou force-pushed the bump-prometheus-operator-to088 branch from 68c64be to 51068fa Compare January 22, 2026 09:33
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ronaldngounou
Once this PR has been reviewed and has the lgtm label, please assign marseel for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ronaldngounou
Member Author

/test pull-perf-tests-ec2-master-scale-performance-100

@ronaldngounou ronaldngounou changed the title Bump prometheus-operator version to 0.88.0 to resolve ec2-master-scale-performance failures Resolve ec2-master-scale-performance regressions by updating prometheus-operator version to 0.88.0 Jan 22, 2026
@ronaldngounou
Member Author

/test pull-perf-tests-ec2-master-scale-performance-100

@ronaldngounou
Member Author

/test pull-perf-tests-gce-master-scale-performance-100

@ronaldngounou
Member Author

/test pull-perf-tests-gce-master-scale-performance-100

@ronaldngounou
Member Author

/test pull-perf-tests-ec2-master-scale-performance-100

  podMonitorSelector: {}
  priorityClassName: system-node-critical
- version: v2.40.0
+ version: v3.9.1
Member Author

@ronaldngounou ronaldngounou Jan 24, 2026

  logLevel: debug
  enableAdminAPI: true
- baseImage: gcr.io/k8s-testimages/quay.io/prometheus/prometheus
+ image: quay.io/prometheus/prometheus
@ronaldngounou ronaldngounou changed the title Resolve ec2-master-scale-performance regressions by updating prometheus-operator version to 0.88.0 Bunmp prometheus-operator version to 0.88.0 to address ec2-master-scale-performance slowness Jan 24, 2026
@ronaldngounou ronaldngounou changed the title Bunmp prometheus-operator version to 0.88.0 to address ec2-master-scale-performance slowness Bump prometheus-operator version to 0.88.0 to mitigate ec2-master-scale-performance slowness failures Jan 24, 2026
@ronaldngounou ronaldngounou changed the title Bump prometheus-operator version to 0.88.0 to mitigate ec2-master-scale-performance slowness failures Bump prometheus-operator version to 0.88.0 Jan 24, 2026
Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
3 participants