worker: run repo maintenance during idle time (Bug 2037216) #1135

Open
cgsheeh wants to merge 19 commits into mozilla-conduit:main from cgsheeh:idle-strip

Member

@cgsheeh cgsheeh commented May 5, 2026

Lando's workers typically run repo-cleaning commands at
the beginning of each job. In hg workers,
we run `hg strip` at the start of each job to remove
previously-created stale commits, even though those commits
do not interfere with job completion. In Git, we take
the opposite approach and simply ignore the temporary
work branch: it is never cleaned up.

Add a new "repo maintenance" step that runs while the
worker is being throttled because no jobs remain in
the queue. To avoid running excessive maintenance, the
time of the last maintenance run for each repo is
recorded in the worker, and maintenance is skipped if
it was completed within the threshold.
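
As a rough illustration of the "skip if maintained within the threshold" idea, here is a minimal sketch. The names `MaintenanceTracker` and `maintain_if_due` are hypothetical and not taken from this PR, and it assumes the worker keeps the per-repo timestamps in memory.

```python
import time


class MaintenanceTracker:
    """Remember when each repo was last maintained (hypothetical helper)."""

    def __init__(self, threshold_seconds: float):
        self.threshold_seconds = threshold_seconds
        # Maps a repo name to the timestamp of its last maintenance run.
        self.last_run: dict[str, float] = {}

    def is_due(self, repo_name: str, now: float) -> bool:
        """True if the repo was never maintained or the threshold has elapsed."""
        last = self.last_run.get(repo_name)
        return last is None or now - last >= self.threshold_seconds

    def record(self, repo_name: str, now: float) -> None:
        """Store `now` as the time of the latest maintenance run."""
        self.last_run[repo_name] = now


def maintain_if_due(tracker, repo_name, maintain, now=None) -> bool:
    """Run `maintain()` only if the repo is due; return whether it ran."""
    now = time.monotonic() if now is None else now
    if not tracker.is_due(repo_name, now):
        return False
    maintain()
    tracker.record(repo_name, now)
    return True
```

Because the timestamps live only in the worker process in this sketch, a restarted worker would treat every repo as due on its first idle period.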

For Mercurial, move the `hg strip` command into this maintenance
task, which should save us about 8s for each push to try.
For Git, add a cleanup of the stale working branches,
so we no longer have thousands of temp branches in our
worker repos.

After this change, every `HgSCM.clean_repo` call site
always passes `strip_non_public_commits=False`, while
every `GitSCM.clean_repo` call site always passes `True`.
Remove the kwarg and make each behaviour the default.
@cgsheeh cgsheeh requested a review from a team as a code owner May 5, 2026 19:47
github-actions Bot commented May 5, 2026

View this pull request in Lando to land it once approved.

Contributor

@zzzeid zzzeid left a comment

Few small comments (and note failing tests).

Comment thread src/lando/main/scm/git.py Outdated
Comment thread src/lando/main/scm/git.py
Comment thread src/lando/main/scm/hg.py Outdated
Comment thread src/lando/main/scm/hg.py Outdated
Comment thread src/lando/main/scm/hg.py
Comment thread src/lando/api/legacy/workers/base.py Outdated
Comment thread src/lando/api/legacy/workers/base.py Outdated
Comment thread src/lando/api/legacy/workers/base.py Outdated
Comment thread src/lando/api/legacy/workers/base.py Outdated
@cgsheeh cgsheeh requested review from shtrom and zzzeid May 6, 2026 03:10
Comment thread src/lando/api/tests/test_hg.py Outdated
Comment thread src/lando/api/tests/test_hg.py Outdated
@@ -31,7 +31,7 @@ def test_integrated_hgrepo_clean_repo(hg_clone):
repo = HgSCM(hg_clone.strpath)
Member

nit: Could we rename this variable to scm while we're at it?

Member Author

repo = HgSCM() is the convention in this file, oddly enough. I updated this instance, but we should fix the others in a follow-up. :)

Member

It pre-dates the SCM split (;

Comment thread src/lando/api/tests/test_worker.py
Comment thread src/lando/api/tests/test_worker.py
Comment thread src/lando/api/tests/test_worker.py
Contributor

@zzzeid zzzeid left a comment

Few additional comments but looks good otherwise.

Comment thread src/lando/main/scm/git.py Outdated
Comment on lines +508 to +509
# `git branch -D` refuses to delete the currently checked-out branch,
# so move off any `lando-*` branch first.
Contributor

Interesting bit here. In the future it might make more sense to ensure we are back on the default branch after a job is finished.
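
For readers following the `git branch -D` note, here is a minimal sketch of the ordering it describes: leave any `lando-*` branch before deleting the stale ones. The function name `plan_branch_cleanup` and its parameters are hypothetical and not the code in `git.py`. It only builds the git command lines, so the caller decides how to run them.

```python
def plan_branch_cleanup(branches, current_branch, default_branch, prefix="lando-"):
    """Return the git argv lists needed to delete stale work branches.

    This is a pure planning step: nothing is executed here.
    """
    stale = [name for name in branches if name.startswith(prefix)]
    if not stale:
        return []

    commands = []
    # `git branch -D` refuses to delete the checked-out branch,
    # so switch to the default branch first when we are on a stale one.
    if current_branch in stale:
        commands.append(["git", "checkout", default_branch])
    # Delete every stale branch in a single invocation.
    commands.append(["git", "branch", "-D", *stale])
    return commands
```

Each returned command could be run with something like `subprocess.run(cmd, cwd=repo_path, check=True)`. With thousands of branches, the delete would probably need to be split into batches to stay within argument-length limits, which this sketch leaves out.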

Comment thread src/lando/main/scm/git.py Outdated
Comment thread src/lando/main/scm/git.py Outdated
Comment thread src/lando/main/scm/hg.py
Comment thread src/lando/api/legacy/workers/base.py Outdated
Contributor

@zzzeid zzzeid left a comment

It occurred to me that there may be some unexpected behaviour (not mentioned here) as to what would happen (1) on first deploy (i.e., the first maintenance run) but also (2) when the landing worker is busy for a long time (which is typical) and things accumulate in such a way that the maintenance run may take longer than expected (or possibly take more resources than expected). Worth testing those scenarios before deploying.

Member Author

cgsheeh commented May 8, 2026

> It occurred to me that there may be some unexpected behaviour (not mentioned here) as to what would happen (1) on first deploy (i.e., the first maintenance run) but also (2) when the landing worker is busy for a long time (which is typical) and things accumulate in such a way that the maintenance run may take longer than expected (or possibly take more resources than expected). Worth testing those scenarios before deploying.

I'm going to throw this up on dev and test it out before merging/deploying. 👍

To handle (2), we should check that we haven't exceeded the throttle time between each repo maintenance call, so we only clean up a few repos at a time before looking for another landing job. We could also sort the repos by time since their last maintenance to decide which one to process next. That way each repo is guaranteed to have maintenance run on it eventually, and we don't accidentally spend several minutes running maintenance tasks instead of processing landing jobs.
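
The two ideas above (a time budget checked between repos, and stalest-first ordering) could be sketched roughly as follows. All names here (`run_budgeted_maintenance`, `last_run`, `budget`, `clock`) are hypothetical, and the sketch assumes the budget is the worker's throttle time in seconds.

```python
import time


def run_budgeted_maintenance(
    repos, last_run, maintain, threshold, budget, clock=time.monotonic
):
    """Maintain the stalest repos first, stopping once `budget` seconds are used.

    `last_run` maps repo name -> timestamp of the last maintenance run and is
    updated in place. Returns the names of the repos that were maintained.
    """
    start = clock()
    # Never-maintained repos sort first (treated as infinitely stale).
    ordered = sorted(repos, key=lambda name: last_run.get(name, float("-inf")))

    maintained = []
    for name in ordered:
        now = clock()
        if now - start >= budget:
            break  # Out of idle time: go back to polling for landing jobs.
        if name in last_run and now - last_run[name] < threshold:
            continue  # Maintained recently enough, so skip it.
        maintain(name)
        last_run[name] = clock()
        maintained.append(name)
    return maintained
```

The budget is only checked between repos, so a single slow maintenance call can still overrun it. That is the first-deploy case raised in the review and would still need testing on dev.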
