
Increase timeout of addon-resizer build job #34380

Open
wants to merge 1 commit into master

Conversation

raywainman
Contributor

The current addon-resizer build job is taking just over 2 hours to complete. The build is slow because it produces 5 different images, and it runs much more slowly in the Cloud Build environment than locally.

The default timeout is 2 hours.

Looking to override this to 3 hours to give the job enough time to finish.

See https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/post-autoscaler-push-addon-resizer-images/1892289674458697728 for a recent failure.

This work is part of the onboarding of this repo into automatic builds, see kubernetes/autoscaler#7615.
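
For context, the change itself is just an override of the Prow decoration timeout on the existing postsubmit. Roughly, it has the following shape (other fields of the real job definition are omitted here):

postsubmits:
  kubernetes/autoscaler:
    - name: post-autoscaler-push-addon-resizer-images
      decorate: true
      decoration_config:
        timeout: 3h  # default is 2h; the build currently needs a bit more headroom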

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA) on Feb 20, 2025
@k8s-ci-robot added the area/config (Issues or PRs related to code in /config), size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files), area/jobs, and sig/testing (Categorizes an issue or PR as relevant to SIG Testing) labels on Feb 20, 2025
@raywainman
Contributor Author

/assign @BenTheElder

@BenTheElder
Member

Do we know if the actual build is taking that long, as opposed to the prowjob hanging fetching logs?

Some of these are running in shared projects, and the prowjob sometimes times out just on streaming the build logs due to exceeding quota, rather than because the actual build in Cloud Build is taking that long.

What timeout does the cloudbuild have?

@raywainman
Contributor Author

Yeah as far as I can tell, the build is indeed taking that long.

The latest build was triggered on 03-05 14:25 EDT.

I can see the various architecture images built in our GCR repo:
2:30 PM
3:18 PM
4:02 PM
4:37 PM
5:14 PM
Manifest: 5:15 PM

And then the Prow workflow fails with this error message:

{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2025-03-05T21:25:59Z"}
{"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15m0s grace period","severity":"error","time":"2025-03-05T21:40:59Z"}

We have the Cloud Build timeout set to 6 hours.
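
(That's the top-level timeout field in cloudbuild.yaml; simplified, it looks something like this, with the actual build steps below it:

timeout: 21600s  # 6 hours; Cloud Build's own default is only 10 minutes

so the 2 hour limit being hit is the Prow job's, not Cloud Build's.)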

@BenTheElder - do you see anything weird here? To me it looks like the build itself is eating up most of the 2 hour time limit, but I'd love to lean on your experience and see if you can spot anything :)

@BenTheElder
Member

BenTheElder commented Mar 10, 2025

We have the Cloud Build timeout set to 6 hours.

wow! that's a lot

I would recommend splitting the build into separate jobs per image if that makes sense at all (like here: https://github.com/kubernetes-sigs/kind/tree/main/images). Those images are distinct, so each one has a postsubmit that only runs when that image is affected, with its own cloudbuild file:

run_if_changed: '(^images/base)|(^images/Makefile)|(^.go-version)'
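
In job config terms that ends up as one postsubmit per image, roughly like this (the job name and spec here are illustrative, not the exact kind config):

postsubmits:
  kubernetes-sigs/kind:
    - name: post-kind-push-base-image  # hypothetical name, for illustration only
      run_if_changed: '(^images/base)|(^images/Makefile)|(^.go-version)'
      decorate: true
      # pod spec that runs this image's own cloudbuild.yaml goes here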

If you can't write an accurate "if these files are changed, build this one" rule (which can be hard to scope correctly) and need to build all of them in one job every time the whole repo changes, then I'd bump up the machine size in the cloudbuild.

https://cloud.google.com/build/docs/api/reference/rest/v1/projects.builds#machinetype
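
In the cloudbuild file that's just a build option, something like (machine type picked arbitrarily here):

options:
  machineType: E2_HIGHCPU_32  # bigger worker than the default; trades cost for build speed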

But if possible, please check whether you can avoid unnecessary builds first: if you can use multiple jobs that only run when relevant files change, you may only need to build the individual images that actually changed.

EDIT: this seems to be done to the extent possible already?

@BenTheElder
Member

The other thing you can do is avoid compiling under emulation in favor of cross-compiling. Since you are building a static Go binary, you can take advantage of:

https://www.docker.com/blog/faster-multi-platform-builds-dockerfile-cross-compilation-guide/

Example:
https://github.com/kubernetes-sigs/kind/blob/022bedd4943892014ad10757707e32263aee3ecc/images/local-path-provisioner/Dockerfile#L16

Basically you can tell Docker "run the build step on the host platform, then copy the result into an image for the target platform".
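
A minimal sketch of that pattern for a static Go binary (base images, paths, and the binary name are placeholders; the kind Dockerfile linked above is a real example):

# Build stage runs natively on the host platform; BuildKit fills in these args.
FROM --platform=$BUILDPLATFORM golang:1.23 AS builder
ARG TARGETOS
ARG TARGETARCH
WORKDIR /src
COPY . .
# Cross-compile for the target platform instead of compiling under emulation.
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/addon-resizer .

# Final image is for the target platform; the prebuilt binary is just copied in.
FROM gcr.io/distroless/static:nonroot
COPY --from=builder /out/addon-resizer /addon-resizer
ENTRYPOINT ["/addon-resizer"]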

I think that would be a better move, but we can bump the timeout in the meantime
/lgtm
/approve
/hold

@k8s-ci-robot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command) and lgtm ("Looks good to me", indicates that a PR is ready to be merged) labels on Mar 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, raywainman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Mar 10, 2025
@BenTheElder
Member

BenTheElder commented Mar 10, 2025

The other thing you can do is avoid compiling under emulation versus cross compiling, since you are building static go binary you can take advantage of:

I would prioritize this. Compiling under emulation is unnecessarily slow; running the build steps as cross-compilation on the host platform, instead of emulating the target platform via docker and binfmt_misc + qemu-userspace, has big performance wins.

It's also usually a pretty small change. You need to set GOARCH and the --from platforms.

@raywainman
Contributor Author

It's also usually a pretty small change. You need to set GOARCH and the --from platforms.

Thanks Ben! Let me try it and will report back.
