Mitigate image-pushing jobs hitting GetRequestsPerMinutePerProject quota for prow build cluster project #20652
Comments
Neat, I think I just caught this happening for k8s-testimages jobs that run in the "test-infra-trusted" cluster too
@cpanato -- You were last working on this. What are the next steps? /unassign
@justaugustus @spiffxp Sorry for the delay in replying on this. I was doing some investigation, and I will describe my findings and the possible options I can see (you all might have other options :) ).

Issue: When Cloud Build is triggered, it sometimes fails because we hit the GetRequestsPerMinutePerProject quota. Aaron said this is something we cannot increase, so I ran some tests using my account to simulate the same environment.

For example, in some releng cases, a PR might trigger several image builds after we merge it. Those images have more than one variant (in some cases, four), which means we trigger 4+ Cloud Build jobs simultaneously, and that can cause the quota errors that fail some jobs. The image-builder in this code snippet is responsible for triggering the jobs when there are variants: test-infra/images/builder/main.go Lines 309 to 321 in 30af69f

I reproduced the issue using my account by triggering ~15 jobs in parallel.

Having a service account per job might fix this issue, but I think the quota […] What are your thoughts?
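The fan-out described above (one Cloud Build trigger per variant, all fired at once) can be sketched as follows. This is a simplified illustration, not the actual image-builder code: `triggerBuild` is a hypothetical stand-in for the per-variant gcloud invocation, and the counting semaphore shows one way a concurrency cap could keep a burst of variants under a per-minute request quota.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// triggerBuild is a hypothetical stand-in for the per-variant gcloud
// invocation the real image-builder makes; here it only counts calls.
func triggerBuild(variant string, count *int64) {
	atomic.AddInt64(count, 1)
}

// runVariants fires one build per variant, but a counting semaphore
// caps how many are in flight so a burst of variants cannot exhaust
// the per-minute request quota. Returns the number of builds triggered.
func runVariants(variants []string, maxParallel int) int64 {
	sem := make(chan struct{}, maxParallel)
	var wg sync.WaitGroup
	var count int64
	for _, v := range variants {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxParallel builds are in flight
		go func(v string) {
			defer wg.Done()
			defer func() { <-sem }()
			triggerBuild(v, &count)
		}(v)
	}
	wg.Wait()
	return count
}

func main() {
	n := runVariants([]string{"amd64", "arm64", "ppc64le", "s390x"}, 2)
	fmt.Println(n)
}
```

With `maxParallel` set to 2, the four variant builds still all run, but never more than two at a time; the real builder currently has no such cap, which is why four variants can burst past the quota.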
I will respond in more detail next week. I still think we should isolate service accounts and the projects they can build. But. A more surgical fix might be to update image-builder to invoke gcloud with the --billing-project flag, with the staging project as its value. That should cause quota to get counted against the staging project instead of the project associated with the service account running image-builder
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
IMO neither of these are show-stopper tradeoffs, and I'm happy to help ensure whoever is interested has the appropriate permissions to play around on a project. In the meantime I'm opening a PR to create a single testgrid dashboard for all image pushing jobs, so we can get a better sense of when and how often we're hitting this.
FYI @chaodaiG this might be worth keeping in mind given the GCB design proposal you presented at today's SIG Testing meeting
I see this is happening all over the place in kubernetes/release, e.g. kubernetes/release#2266. New theory: We could update the builder to […]
Screenshot from https://console.cloud.google.com/apis/api/cloudbuild.googleapis.com/quotas?project=k8s-infra-prow-build-trusted&pageState=(%22duration%22:(%22groupValue%22:%22P30D%22,%22customValue%22:null)). Not entirely sure who can see this page, but members of [email protected] should be able to at least. The bottom graph shows quota violations over time. As a reminder, this is against a quota we can't raise. So I think it's both. The shared gcb-builder service account will hit this problem if it's triggering builds across too many k8s-staging projects in general. And then also the gcb-builder-releng-test service account hits this problem when triggering too many builds in parallel within its single project (maybe because of log tailing)
/milestone v1.24
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
This seems like a valid "we really ought to fix this someday" but backlog/low-priority and no movement. Punting to someday milestone.
/remove-priority important-soon
This issue has not been updated in over 1 year, and should be re-triaged. You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
What happened:
Context: kubernetes/k8s.io#1576 (comment)
As the volume of image-pushing jobs running on the prow build cluster in k8s-infra-prow-build-trusted has grown, we're starting to bump into a GCB service quota (GetRequestsPerMinutePerProject) for the project. This isn't something we can request to raise, unlike other quotas (e.g. max GCP instances per region)
What you expected to happen:
Have GCB service requests charged to the project running the GCB builds instead of a central shared project. Avoid bumping into API-related quota.
How to reproduce it (as minimally and precisely as possible):
Merge a PR to kubernetes/kubernetes that updates multiple test/images subdirectories, or otherwise induce a high volume of image-pushing jobs on k8s-infra-prow-build-trusted
Ignore whether you bump into the concurrent builds quota (also a GCB service quota)
Can visualize usage (and whether quota is hit) here if a member of [email protected]: https://console.cloud.google.com/apis/api/cloudbuild.googleapis.com/quotas?orgonly=true&project=k8s-infra-prow-build-trusted&supportedpurview=project&pageState=(%22duration%22:(%22groupValue%22:%22P30D%22,%22customValue%22:null))
Please provide links to example occurrences, if any:
Don't have link to jobs that encountered this specifically, but kubernetes/k8s.io#1576 describes the issue, and the metric explorer link above shows roughly when we've bumped into quota.
Anything else we need to know?:
Parent issue: kubernetes/release#1869
My guess is that we need to move away from using a shared service account in the build cluster's project (gcb-builder@k8s-infra-prow-build-trusted), and instead set up service accounts per staging project.
It's unclear to me whether these would all need access to something in the build cluster project.
A service-account-per-project would add a bunch of boilerplate to the service accounts loaded into the build cluster, and add another field to job configs that needs to be set manually vs. copy-pasted. We could offset this by verifying configs are correct via presubmit enforcement.
I'm open to other suggestions to automate the boilerplate away, or a solution that involves image-builder consuming less API quota.
/milestone v1.21
/priority important-soon
/wg k8s-infra
/sig testing
/area images
/sig release
/area release-eng
/assign @cpanato @justaugustus
as owners of parent issue