[JENKINS-49707] Agent missing after controller restart to fail resumption of node step, not kill whole build #180
Conversation
```java
 * @author Kohsuke Kawaguchi
 * @deprecated Normally now done via {@link ExecutorStepDynamicContext}.
 */
@Deprecated
```
Can the factories for these pickles be deleted or replaced with an implementation that just throws an exception if it encounters an object of the specified type? Or do you want to keep them around in case some other step or wild Groovy code is relying on them?
I plan to retain them for compatibility. Weird Pipeline script is one possibility. They could also be used in other Pipeline steps; I know PushdStep binds FilePathDynamicContext, so that is not affected, but there may be others. Seems harmless enough to leave them here.
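For illustration, the alternative mentioned in the review comment (replacing a factory with an implementation that just throws) might look like the following sketch. This is not the plugin's real extension point; the class name, method signature, and the use of `Thread` as a stand-in for the deprecated pickled type are all hypothetical.

```java
import java.io.NotSerializableException;

/**
 * Hypothetical sketch: a pickle factory stub that refuses to persist its type
 * instead of being deleted outright. Names here are invented, not the plugin's API.
 */
class RejectingPickleFactory {
    /** Returns null when the object is not ours, mimicking factory chaining. */
    Object pickle(Object object) throws NotSerializableException {
        if (object instanceof Thread) { // stand-in for the deprecated pickled type
            throw new NotSerializableException(
                    "refusing to persist " + object.getClass().getName()
                    + "; resumption is now handled by ExecutorStepDynamicContext");
        }
        return null; // not our type; let other factories try
    }
}
```

Retaining the factories as-is, per the comment above, avoids breaking any other step or Groovy code that still binds these types.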
- …e-task-step-plugin into retry-JENKINS-49707
- …icContext.FilePathTranslator`
- … longer makes sense
- …n` by copying some diagnostic code from `FilePathDynamicContext`
- …aitingMessage` is unused
- …utorStepExecution`
- …e-task-step-plugin into retry-JENKINS-49707
After jenkinsci/workflow-step-api-plugin#73 this caused `SecretsMasker` to block on a `KubernetesComputer`, which in turn blocked provisioning of a `TaskListener`, causing `RestartPipelineTest.terminatedPodAfterRestart` to fail to find a message:

```
FINE o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: rediscovering that terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed and timeout has expired
FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: could not get TaskListener in CpsStepContext[9:sh]:Owner[terminated Pod After Restart/1:terminated Pod After Restart jenkinsci#1]
java.util.concurrent.TimeoutException
	at hudson.remoting.AsyncFutureImpl.get(AsyncFutureImpl.java:102)
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:115)
Caused: java.io.IOException
	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepDynamicContext$Translator.get(ExecutorStepDynamicContext.java:118)
	at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:97)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet$1Delegate.doGet(ContextVariableSet.java:98)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:85)
	at org.csanchez.jenkins.plugins.kubernetes.pipeline.SecretsMasker$Factory.get(SecretsMasker.java:73)
	at org.jenkinsci.plugins.workflow.steps.DynamicContext$Typed.get(DynamicContext.java:95)
	at org.jenkinsci.plugins.workflow.cps.ContextVariableSet.get(ContextVariableSet.java:139)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.getContextVariable(CpsThread.java:135)
	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:297)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:75)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.getListener(DefaultStepContext.java:127)
	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:87)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution._listener(DurableTaskStep.java:421)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.listener(DurableTaskStep.java:412)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.getWorkspace(DurableTaskStep.java:363)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:570)
	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:553)
	at …
FINE o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener#getLogger: creating filtered stream
FINE o.j.p.w.s.d.DurableTaskStep$Execution#_listener: terminated-pod-after-restart-1-8fb2m-j206k-8x125 has been removed for 15 sec, assuming it is not coming back
```
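The shape of the failure in that trace is a bounded wait whose `TimeoutException` is rethrown as `IOException`. A minimal self-contained illustration of that pattern (not the plugin's code; the class and method names here are invented):

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Sketch of the pattern visible in the stack trace above: wait a bounded time
 * for a value, and surface a timeout as an I/O failure so callers conclude the
 * agent is gone rather than blocking a context lookup indefinitely.
 */
class BoundedContextLookup {
    static <T> T getWithDeadline(Future<T> future, long millis) throws IOException, InterruptedException {
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException x) {
            // Same shape as the trace: TimeoutException wrapped in IOException.
            throw new IOException("agent connection not ready within deadline", x);
        } catch (ExecutionException x) {
            throw new IOException(x);
        }
    }

    public static void main(String[] args) throws Exception {
        // A future that never completes simulates an agent that never reconnects.
        Future<String> neverConnects = new CompletableFuture<>();
        try {
            getWithDeadline(neverConnects, 100);
        } catch (IOException x) {
            System.out.println("caught: " + x.getCause());
        }
    }
}
```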
…e-task-step-plugin into retry-JENKINS-49707
Breaks: JENKINS-49707.

Upstream of jenkinsci/workflow-basic-steps-plugin#195 and jenkinsci/kubernetes-plugin#1211. Subsumes #231. Subsumes #249. Relevant prior work: #104, jenkinsci/kubernetes-plugin#461, #47, #48, #115, #101, jenkinsci/workflow-basic-steps-plugin#86, #175, #176, #179.
Plan of record:

- `ExecutorStepDynamicContext` and associated machinery as a replacement for `ExecutorPickle` and other pickles (see below).
- `retry` step to be given configurable behavior.
- `kubernetes` which takes into account pod status.
- `agent` types `any` and `label` (i.e., generic) and `kubernetes`.
- `node` but outside `sh`.
- "`ExecutorPickle` should not reschedule anything until after `Queue.load`" #184 to the new code, as "Warn in `Queue.init` if items scheduled during startup will be clobbered" jenkins#5934 can be reproduced again.
- `queue.xml`: "[JENKINS-49707] Agent missing after controller restart to fail resumption of `node` step, not kill whole build" #180 (comment)
- `AgentErrorCondition` improvements #249 to `ExecutorStepDynamicContext`.

The idea is that users updating plugins would see a somewhat altered behavior for resumption inside `node` blocks by default, which I hope would go mostly unnoticed, and could adjust pipelines to look something like

```groovy
podTemplate(…) {
    retry(count: 3, conditions: [kubernetesAgent(), nonresumable()]) {
        node(POD_LABEL) {
            checkout scm
            sh 'make world'
        }
    }
}
```

or

```groovy
pipeline {
    agent {
        kubernetes {
            yaml: '…'
            retries: 3
        }
    }
    stages {
        stage('main') {
            steps {
                sh 'make world'
            }
        }
    }
}
```

The key is to avoid relying on `Pickle`s for `ExecutorStepExecution` to resume properly. (More precisely, pickles using `TryRepeatedly`; `SecretPickle` and the like are harmless in this context.) Instead we take explicit control over how the step is resumed and what happens when. The problem with `Pickle` here is that it is too magical; we would like to be able to detect after resumption that an agent is gone and then throw an exception at the Groovy level, which is not possible if the build is aborted at the CPS VM level due to an inability to rehydrate.

I had planned to offer a system property to revert to `ExecutorPickle` in case of catastrophic problems (along with a couple of sanity tests of that mode), but after reviewing the changes to `ExecutorStepExecution` this does not seem straightforward: the major simplifications and fixes to it rely on moving away from `ExecutorPickle`.
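The retry-with-conditions semantics the Pipeline examples aim for can be sketched in plain Java. This is a hypothetical helper, not the plugin's API: rerun the body only when the failure matches a configured condition (e.g. "agent was lost"), up to a fixed count; any other error fails immediately, as before.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Plain-Java sketch of retry(count: N, conditions: [...]) semantics. */
class RetrySketch {
    static <T> T retry(int count, List<Predicate<Throwable>> conditions, Callable<T> body) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= count; attempt++) {
            try {
                return body.call();
            } catch (Exception x) {
                last = x;
                // Only retry errors a condition recognizes; anything else
                // propagates immediately and fails the build as usual.
                boolean retryable = conditions.stream().anyMatch(c -> c.test(x));
                if (!retryable || attempt == count) {
                    throw x;
                }
            }
        }
        throw last; // unreachable when count >= 1
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = retry(3, List.of(t -> t instanceof IOException), () -> {
            if (++calls[0] < 3) {
                throw new IOException("agent went away"); // matches condition, so retried
            }
            return "succeeded on attempt " + calls[0];
        });
        System.out.println(result);
    }
}
```

The point of pairing this with `ExecutorStepDynamicContext` is that, with resumption failures surfaced as ordinary Groovy-level exceptions rather than CPS VM aborts, a wrapper like `retry` actually gets the chance to observe the error and rerun the block.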