Try to remove items from completed builds even if supposedly already cancelled #259

jglick merged 4 commits into jenkinsci:master
Conversation
Proposed validation script: something like this (from memory, untested!)

```groovy
Queue.instance.items.grep {
    it.task.class.simpleName == 'PlaceholderTask' && it.task.stopping
}.each {
    println(it)
    println(it.task)
    println(Queue.instance.cancel(it.task))
}
```
```diff
  // We only do this if resumption is complete so that we do not load the run and resume its execution in this context in normal scenarios.
  Run<?, ?> run = runForDisplay();
- if (!stopping && run != null && !run.isLogUpdated()) {
+ if (run != null && !run.isLogUpdated()) {
```
IDK about this. It was only ever intended to run once after a Jenkins restart (and Damien says that in the current case ci.jenkins.io has not yet restarted), so I am worried that removing the guard may lead to log spam or further confusion in some cases. Maybe instead we could add a call to Timer.execute after
I think it would be harmless to test this new code, but if ci.jenkins.io really has not yet restarted, I would be interested to see if the previous code is enough to cancel the task after a restart or not.
> I think it would be harmless to test this new code, but if ci.jenkins.io really has not yet restarted, I would be interested to see if the previous code is enough to cancel the task after a restart or not.
Oh I'm sorry, our comments crossed: #259 (comment)
> I am worried that removing the guard may lead to log spam
I was also concerned about this. But @dduportal said that the script (which mimics what this PR would do) indeed canceled the bogus items successfully.
Even if it is fine in this case, it seems clearer and safer to me to trigger an explicit delayed cancellation when the original cancellation attempt fails rather than have getCauseOfBlockage potentially cancel things more than once. For example, if the Timer thread pool is somewhat congested and the task does not run for more than 5 seconds, I think a second task could be scheduled with the proposed code (not really a big deal).
Either way, at this point it seems that the issue seen on ci.jenkins.io happens while the controller is running rather than being a result of some problem with persisted state after a restart, which I did not know was possible and is distinct from what I was trying to fix when I filed #185. Ideally we would try to reproduce the new issue in a test to have a chance at fixing the underlying cause.
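The "explicit delayed cancellation" alternative discussed above can be illustrated outside Jenkins. This is a hypothetical, self-contained sketch, not the plugin's actual code: a `ScheduledExecutorService` stands in for Jenkins' shared `Timer` pool, `attemptCancel` stands in for `Queue.cancel`, and the 100 ms delay is an arbitrary illustrative value. The point is that `stop` schedules exactly one asynchronous retry when the synchronous cancellation fails, instead of letting `getCauseOfBlockage` re-cancel on every queue maintenance cycle.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class DelayedCancelSketch {
    // Stand-in for Jenkins' shared Timer pool.
    static final ScheduledExecutorService TIMER =
            Executors.newSingleThreadScheduledExecutor();

    final AtomicBoolean cancelled = new AtomicBoolean(false);
    final AtomicInteger attempts = new AtomicInteger(0);

    /** Stand-in for Queue.cancel; fails on the first call to mimic the bug. */
    boolean attemptCancel() {
        if (attempts.incrementAndGet() == 1) {
            return false; // synchronous attempt fails for unknown reasons
        }
        cancelled.set(true);
        return true;
    }

    /** Stand-in for the step's stop logic: one asynchronous retry, no more. */
    void stop() {
        if (!attemptCancel()) {
            // Schedule a single delayed retry rather than re-cancelling
            // repeatedly from getCauseOfBlockage.
            TIMER.schedule(this::attemptCancel, 100, TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        DelayedCancelSketch s = new DelayedCancelSketch();
        s.stop();
        TIMER.shutdown();
        TIMER.awaitTermination(5, TimeUnit.SECONDS);
        // prints cancelled=true attempts=2
        System.out.println("cancelled=" + s.cancelled.get() + " attempts=" + s.attempts.get());
    }
}
```

One property of this shape is that the retry is scheduled at most once per failed `stop`, which avoids the duplicate-task concern raised above for a congested timer pool.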
> the issue seen on ci.jenkins.io happens while the controller is running rather than being a result of some problem with persisted state after a restart
Yes, that was also my understanding.
> trigger an explicit delayed cancellation when the original cancellation attempt fails
My reservation is that we do not understand the cause of the original failure to cancel, so we cannot know whether that would actually work. The idea with this PR is to be a final fallback in case other things failed for unknown reasons, to make sure the queue items do not remain there indefinitely.
> try to reproduce the new issue in a test
Yes of course that would be best, but I have no clear notion of what happened. BlockedItem.leave not working? One case appeared to relate to a Dependabot PR being closed, but that does nothing to explain why Queue.cancel would fail.
Not sure what to do at this point other than see how things behave on ci.jenkins.io.
I haven't been paying attention too closely but I think I've noticed this on ci.jenkins.io when a PR build is aborted due to a new build starting, i.e.:
- A PR build starts; some branches might begin executing right away, but others might be queued waiting for an agent to come up
- A new commit is made
- A new PR build starts
- Due to `disableConcurrentBuilds(abortPrevious: true)`, the first PR build is aborted
- The zombie queue items from the first build remain
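The trigger described above corresponds to a Pipeline configured roughly like this. Only the `disableConcurrentBuilds(abortPrevious: true)` option is taken from the discussion; the agent label and stage contents are illustrative placeholders:

```groovy
pipeline {
    options {
        // When a new build of this branch/PR starts, abort the previous build,
        // including parts of it still waiting in the queue.
        disableConcurrentBuilds(abortPrevious: true)
    }
    agent { label 'linux' } // hypothetical label; queued builds wait for such an agent
    stages {
        stage('build') {
            steps {
                sh 'make' // placeholder build step
            }
        }
    }
}
```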
Right, that would be a common trigger for `cancel` returning false (as logs confirm) and the item being left in the queue.
> My reservation is that we do not understand the cause of the original failure to cancel, so we cannot know whether that would actually work. The idea with this PR is to be a final fallback in case other things failed for unknown reasons, to make sure the queue items do not remain there indefinitely.
Well, we do not really know whether this PR works yet either. As far as we know, the existing code would have fixed the issue when Damien restarted ci.jenkins.io, and we have no way to tell whether the new case is being hit. I think we should at least use a distinct log message for the case where `stopping` is already true, so that we would know for sure when it applies, and because "Refusing to build ..." does not really make sense as an explanation in this scenario anyway (more like "Desperately trying to cancel ...").
Maybe we should use both approaches, at least temporarily, with a distinct log message in each case, just to see what happens. Knowing whether an asynchronous cancellation right after the synchronous cancellation in stop works or not could be a useful clue to anyone who ends up trying to track down the root cause in the future.
> we do not really know whether this PR works yet either
Not directly, but analogous calls made from /script did work, so we have reason to believe it would have.
> As far as we know, the existing code would have fixed the issue when Damien restarted ci.jenkins.io
Yes it probably would have, but we do not want to require admins to restart the server just to clean up items we know are invalid.
> use both approaches, at least temporarily, with a distinct log message
Could try that. @dduportal do you believe this issue happens enough that you would be likely to see it again?
Just installed https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/plugins/workflow/workflow-durable-task-step/1201.vb_0a_ea_ca_96d6c/ on ci.jenkins.io, which restarted the controller.
@dduportal are you still running this? Any issues?
@jglick @dwnusbaum We're seeing the issue again since yesterday; jenkins-infra/helpdesk#3211 was opened today. We're currently running workflow-durable-task-step version

Here is the output of the first part of the console script from ci.jenkins.io:

Logs

Before running the second part of the script, is there any log or info we should retrieve first? Or should we try again with an incrementals version of this PR?

FTR, I've retrieved the support bundles from the last 3 days, but they are incomplete (e.g. /nodes/master/logs/all_2022-10-26_11.29.38.log contains only one hour of logs).
https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/259/checks?check_run_id=9163865636 not reproducible locally; likely a flake.
@lemeurherve you can try https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/plugins/workflow/workflow-durable-task-step/1213.vf72d3ea_b_4a_72/workflow-durable-task-step-1213.vf72d3ea_b_4a_72.hpi (based on the latest release, 1210) and see if the problem ever recurs (in which case a special log message should be printed). Or just run it for a few days and verify that nothing is obviously broken, so I can just release this and hope for the best.
@lemeurherve @dduportal ping
Many, many thanks @jglick! I would like to delay this until the new LTS (and the train of OpenSSL RCEs) is applied, given that the 1st of November is a bank holiday in most of the EU and the LTS is planned for the 2nd.
Sure, whenever you prefer; I just wanted to make sure this did not get forgotten.
LTS upgrade finished. Ping @lemeurherve: are you interested in following up and installing the incremental plugin on ci.jenkins.io?
Amending #226. @dduportal notes seeing such dead items on ci.jenkins.io.