Mitigate image-pushing jobs hitting GetRequestsPerMinutePerProject quota for prow build cluster project #20652

Open · spiffxp opened this issue Jan 28, 2021 · 18 comments
Labels: area/images, area/release-eng, kind/bug, lifecycle/frozen, needs-triage, priority/backlog, sig/k8s-infra, sig/release, sig/testing

Comments

@spiffxp (Member) commented Jan 28, 2021

What happened:
Context: kubernetes/k8s.io#1576 (comment)

As the volume of image-pushing jobs running on the prow build cluster in k8s-infra-prow-build-trusted has grown, we're starting to bump into a GCB service quota (GetRequestsPerMinutePerProject) for the project. This isn't something we can request to raise like other quotas (e.g. max GCP instances per region).

What you expected to happen:
Have GCB service requests charged to the project running the GCB builds instead of a central shared project. Avoid bumping into API-related quota.

How to reproduce it (as minimally and precisely as possible):
Merge a PR to kubernetes/kubernetes that updates multiple test/images subdirectories, or otherwise induce a high volume of image-pushing jobs on k8s-infra-prow-build-trusted

Ignore whether you bump into the concurrent builds quota (also a GCB service quota)

Members of [email protected] can visualize usage (and whether quota is hit) here: https://console.cloud.google.com/apis/api/cloudbuild.googleapis.com/quotas?orgonly=true&project=k8s-infra-prow-build-trusted&supportedpurview=project&pageState=(%22duration%22:(%22groupValue%22:%22P30D%22,%22customValue%22:null))

Please provide links to example occurrences, if any:
I don't have a link to jobs that encountered this specifically, but kubernetes/k8s.io#1576 describes the issue, and the metrics explorer link above shows roughly when we've bumped into the quota.

Anything else we need to know?:
Parent issue: kubernetes/release#1869

My guess is that we need to move away from using a shared service account in the build cluster's project (gcb-builder@k8s-infra-prow-build-trusted), and instead set up service accounts per staging project.

It's unclear to me whether these would all need access to something in the build cluster project.

A service-account-per-project would add a bunch of boilerplate to the service accounts loaded into the build cluster, and add another field to job configs that needs to be set manually vs. copy-pasted. We could offset this by verifying configs are correct via presubmit enforcement.
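
As a hedged illustration of that presubmit-enforcement idea, here is a minimal sketch in Go. The jobConfig struct, the gcb-builder@&lt;staging-project&gt; naming convention, and the example job are all hypothetical; a real check would parse the generated prow job configs.

```go
// Sketch: fail a presubmit if an image-pushing job isn't configured with a
// per-staging-project service account. Everything here is illustrative.
package main

import (
	"fmt"
	"strings"
)

type jobConfig struct {
	Name           string
	StagingProject string // e.g. "k8s-staging-example" (hypothetical)
	ServiceAccount string // GCP service account email the job uses
}

func checkJob(j jobConfig) error {
	if !strings.HasPrefix(j.StagingProject, "k8s-staging-") {
		return fmt.Errorf("job %s: %q is not a staging project", j.Name, j.StagingProject)
	}
	// Assumed convention: one builder service account per staging project.
	want := fmt.Sprintf("gcb-builder@%s.iam.gserviceaccount.com", j.StagingProject)
	if j.ServiceAccount != want {
		return fmt.Errorf("job %s: want service account %s, got %s", j.Name, want, j.ServiceAccount)
	}
	return nil
}

func main() {
	jobs := []jobConfig{
		{
			Name:           "post-example-push-images",
			StagingProject: "k8s-staging-example",
			ServiceAccount: "gcb-builder@k8s-staging-example.iam.gserviceaccount.com",
		},
	}
	for _, j := range jobs {
		if err := checkJob(j); err != nil {
			fmt.Println("FAIL:", err)
		}
	}
}
```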

I'm open to other suggestions to automate the boilerplate away, or a solution that involves image-builder consuming less API quota.

/milestone v1.21
/priority important-soon
/wg k8s-infra
/sig testing
/area images
/sig release
/area release-eng
/assign @cpanato @justaugustus
as owners of the parent issue

@spiffxp spiffxp added the kind/bug Categorizes issue or PR as related to a bug. label Jan 28, 2021
@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. wg/k8s-infra labels Jan 28, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Jan 28, 2021
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. area/images sig/release Categorizes an issue or PR as relevant to SIG Release. area/release-eng Issues or PRs related to the Release Engineering subproject labels Jan 28, 2021
@spiffxp (Member Author) commented Feb 5, 2021

Neat, I think I just caught this happening for k8s-testimages jobs that run in the "test-infra-trusted" cluster too

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/post-test-infra-push-kettle/1357442531670888448#1:build-log.txt%3A57

@spiffxp (Member Author) commented Feb 5, 2021

Yup, it doesn't happen often, but it does happen for test-infra-trusted (screenshot instead of link since access is restricted to google.com)

[Screenshot: Screen Shot 2021-02-05 at 11 29 59 AM]

@spiffxp (Member Author) commented Feb 5, 2021

Migrating away from a central gcb-builder service account:

This is the more generic / less one-off version of steps I listed in #20703 (comment)

@justaugustus (Member):

@cpanato -- You were last working on this. What are the next steps?

/unassign
/milestone v1.22

@cpanato (Member) commented Mar 25, 2021

@justaugustus @spiffxp sorry for the delay in replying on this. I was doing some investigation, and I'll describe my findings and the possible options I can see (you all might have other options :) )

Issue: when a Cloud Build is triggered, it sometimes fails because we receive Quota exceeded for quota metric 'Build and Operation Get requests'.

Aaron said this is something we cannot increase, so I did some tests using my account to simulate the same environment.

For example, in some releng cases we have a PR that triggers some images to build after we merge it. Those images have more than one variant (in some cases four variants), which means we trigger 4+ Cloud Build builds simultaneously, and that can cause the quota issue that fails some jobs.

This code snippet in image-builder is responsible for triggering the jobs when there are variants:

```go
for k, v := range vs {
	go func(job string, vc map[string]string) {
		defer w.Done()
		log.Printf("Starting job %q...\n", job)
		if err := runSingleJob(o, job, uploaded, tag, mergeMaps(extraSubs, vc)); err != nil {
			errors = append(errors, fmt.Errorf("job %q failed: %v", job, err))
			log.Printf("Job %q failed: %v\n", job, err)
		} else {
			var imageName, _ = getImageName(o, tag, job)
			log.Printf("Successfully built image: %v \n", imageName)
			log.Printf("Job %q completed.\n", job)
		}
	}(k, v)
}
```
We could add some delay before triggering the next one, so that all the jobs aren't pushed at almost the same time.
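
A minimal, self-contained sketch of that delay idea. The variant names, the stagger interval, and the runSingleJob stub are placeholders, not image-builder's actual code or values.

```go
// Sketch: stagger the launch of parallel variant builds so they don't all
// hit the Cloud Build API at the same moment.
package main

import (
	"log"
	"sync"
	"time"
)

// runSingleJob stands in for image-builder's real build invocation.
func runSingleJob(job string) error {
	log.Printf("building %q", job)
	time.Sleep(2 * time.Second) // placeholder for `gcloud builds submit`
	return nil
}

func main() {
	variants := []string{"amd64", "arm64", "ppc64le", "s390x"} // hypothetical variants
	stagger := 10 * time.Second                                // arbitrary; tune to stay under the per-minute quota

	var w sync.WaitGroup
	w.Add(len(variants))
	for i, job := range variants {
		go func(job string, delay time.Duration) {
			defer w.Done()
			time.Sleep(delay) // spread the API calls out over time
			if err := runSingleJob(job); err != nil {
				log.Printf("job %q failed: %v", job, err)
			}
		}(job, time.Duration(i)*stagger)
	}
	w.Wait()
}
```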

I reproduced the issue using my account by triggering ~15 jobs in parallel (logs trimmed for better visualization):

```
DEBUG: Retrying request to url https://cloudbuild.googleapis.com/v1/projects/cpanato-capg-test/locations/global/builds/4365-a9c1-?alt=json after exception HttpError accessing <https://cloudbuild.googleapis.com/v1/projects/cpanato-capg-test/locations/global/builds/4365-a9c1-?alt=json>: response: "error": {
    "code": 429,
    "message": "Quota exceeded for quota metric 'Build and Operation Get requests' and limit 'Build and Operation Get requests per minute' of service 'cloudbuild.googleapis.com' for consumer 'project_number:'.",
    "status": "RESOURCE_EXHAUSTED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "RATE_LIMIT_EXCEEDED",
        "domain": "googleapis.com",
        "metadata": {
          "consumer": "projects/985606222016",
          "quota_limit": "GetRequestsPerMinutePerProject",
          "quota_metric": "cloudbuild.googleapis.com/get_requests",
          "service": "cloudbuild.googleapis.com"
        }
      }
    ]
  }
}
```

Having a service account per job might fix this issue, but I think the Build and Operation Get requests quota is per project, not per service account.
So maybe delaying the start when jobs have variants would work in this case.

What are your thoughts?
I can change the image-builder code to add this delay, or maybe even check the response and, if it matches the quota error, wait and retry.
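
A minimal sketch of the wait-and-retry idea, matching on the error text from the log above. submitBuild is a stub that always returns the quota error, and the attempt count and backoff values are arbitrary.

```go
// Sketch: retry a build submission when the error looks like Cloud Build's
// rate-limit / quota error, with exponential backoff between attempts.
package main

import (
	"errors"
	"fmt"
	"log"
	"strings"
	"time"
)

// submitBuild stands in for the real build submission; here it always fails
// with the quota error so the retry path is exercised.
func submitBuild(job string) error {
	return errors.New("Quota exceeded for quota metric 'Build and Operation Get requests'")
}

func retryableBuild(job string, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = submitBuild(job); err == nil {
			return nil
		}
		// Only retry on the quota error; anything else fails immediately.
		msg := err.Error()
		if !strings.Contains(msg, "Quota exceeded") && !strings.Contains(msg, "RESOURCE_EXHAUSTED") {
			return err
		}
		wait := backoff * time.Duration(1<<i) // exponential backoff: 1x, 2x, 4x, ...
		log.Printf("job %q hit quota, retrying in %s", job, wait)
		time.Sleep(wait)
	}
	return fmt.Errorf("job %q: %w", job, err)
}

func main() {
	if err := retryableBuild("example-variant", 3, 2*time.Second); err != nil {
		log.Printf("giving up: %v", err)
	}
}
```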

@spiffxp (Member Author) commented Mar 27, 2021

I will respond in more detail next week. I still think we should isolate service accounts and the projects they can build. But.

A more surgical fix might be to update image-builder to invoke gcloud with the --billing-project flag, with the staging project as its value. That should cause quota to be counted against the staging project instead of the project associated with the service account running image-builder.
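
A hedged sketch of what that could look like where image-builder shells out to gcloud; the project name and build config path are placeholders, and the real invocation in image-builder may differ.

```go
// Sketch: pass --billing-project so API quota (e.g. GetRequestsPerMinutePerProject)
// is charged to the staging project rather than the project that owns the
// service account running image-builder.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	stagingProject := "k8s-staging-example" // hypothetical staging project

	args := []string{
		"builds", "submit",
		"--project", stagingProject,
		"--billing-project", stagingProject, // charge quota to the staging project
		"--config", "cloudbuild.yaml",
		".",
	}
	cmd := exec.Command("gcloud", args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("gcloud builds submit failed: %v", err)
	}
}
```

For --billing-project to take effect, the calling identity generally needs serviceusage.use on that project, which is the permission concern mentioned later in this thread.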

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2021
@spiffxp (Member Author) commented Jul 15, 2021

/remove-lifecycle stale
This is still an issue, and I think either approach I've recommended above is valid:

  • a per-project service account still seems to me like the more correct approach from a security perspective, and the code in k8s.io is nearly there to provision all of these (see comments referencing ensure_staging_gcb_builder_service_account in https://github.com/kubernetes/k8s.io/blob/main/infra/gcp/ensure-staging-storage.sh); it might result in more boilerplate though
  • modifying image-builder to use the equivalent of gcloud --billing-project for whichever project it's pushing to might be a quicker fix, assuming everything runs in and pushes to its same respective GCP project; it might require granting serviceusage.use on a lot of projects to a singular service account, which feels less great from a security perspective

IMO neither of these is a show-stopper tradeoff, and I'm happy to help ensure whoever is interested has the appropriate permissions to play around on a project.

In the meantime I'm opening a PR to create a single testgrid dashboard for all image pushing jobs, so we can get a better sense of when and how often we're hitting this.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2021
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.22, v1.23 Jul 27, 2021
@spiffxp (Member Author) commented Sep 21, 2021

FYI @chaodaiG this might be worth keeping in mind given the GCB design proposal you presented at today's SIG Testing meeting

@spiffxp (Member Author) commented Sep 24, 2021

I see this is happening all over the place in kubernetes/release, e.g. kubernetes/release#2266

New theory: gcloud builds submit streams logs by default. I suspect that behind the scenes it's doing periodic polling at a high enough rate that N gets/s per gcloud builds submit × M PRs × O cloud builds > 750/s.

We could update the builder to use gcloud builds submit --async, periodically poll at a low rate of our choosing, and then gcloud builds log once a build is done (sketched below). This starts to look an awful lot like a mini GCB controller, which is what was proposed at SIG Testing this week. Unfortunately that's going to take a few months to land.
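
A rough, self-contained sketch of that async-submit-then-poll flow, assuming gcloud's builds submit/describe/log subcommands; the project, polling interval, and format flags are placeholders and error handling is abbreviated.

```go
// Sketch: submit the build with --async so gcloud doesn't stream (and poll
// for) logs itself, then poll the build status at a low rate we control,
// and fetch the full log once after the build finishes.
package main

import (
	"log"
	"os"
	"os/exec"
	"strings"
	"time"
)

func gcloud(args ...string) (string, error) {
	out, err := exec.Command("gcloud", args...).Output()
	return strings.TrimSpace(string(out)), err
}

func main() {
	project := "k8s-staging-example" // hypothetical

	// Start the build without log streaming; print only the build ID.
	id, err := gcloud("builds", "submit", "--async",
		"--project", project, "--format=value(id)", ".")
	if err != nil {
		log.Fatalf("submit failed: %v", err)
	}

	// Poll at a rate of our choosing instead of gcloud's built-in polling.
	for {
		time.Sleep(30 * time.Second)
		status, err := gcloud("builds", "describe", id,
			"--project", project, "--format=value(status)")
		if err != nil {
			log.Fatalf("describe failed: %v", err)
		}
		if status != "QUEUED" && status != "WORKING" {
			log.Printf("build %s finished with status %s", id, status)
			break
		}
	}

	// Fetch the build log once, after completion.
	cmd := exec.Command("gcloud", "builds", "log", id, "--project", project)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	_ = cmd.Run()
}
```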

@spiffxp (Member Author) commented Sep 29, 2021

Screenshot from https://console.cloud.google.com/apis/api/cloudbuild.googleapis.com/quotas?project=k8s-infra-prow-build-trusted&pageState=(%22duration%22:(%22groupValue%22:%22P30D%22,%22customValue%22:null)). Not entirely sure who can see this page, but members of [email protected] should be able to at least.

[Screenshot: Screen Shot 2021-09-28 at 5 59 37 PM]

The bottom graph shows quota violations over time. As a reminder this is against a quota we can't raise.

So I think it's both. The shared gcb-builder service account will hit this problem if it's triggering builds across too many k8s-staging projects in general. And then also the gcb-builder-releng-test service account hits this problem when triggering too many builds in parallel within its single project (maybe because of log tailing)

@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. and removed wg/k8s-infra labels Sep 29, 2021
@spiffxp (Member Author) commented Nov 24, 2021

/milestone v1.24

@k8s-ci-robot k8s-ci-robot modified the milestones: v1.23, v1.24 Nov 24, 2021
@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2022
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 24, 2022
@BenTheElder BenTheElder modified the milestones: v1.24, someday Apr 19, 2022
@BenTheElder BenTheElder added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Apr 19, 2022
@BenTheElder (Member):

This seems like a valid "we really ought to fix this someday", but it's backlog/low-priority with no movement. Punting to the someday milestone.

@spiffxp (Member Author) commented Oct 10, 2022

/remove-priority important-soon
/priority backlog

@k8s-ci-robot k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 10, 2022
@k8s-triage-robot:

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024