draft POC used to validate a hunch (do not merge) #263
While investigating an issue where jobs on Windows agents can hang indefinitely, I found that the likely root cause of the job not being terminated is that the WindowsBatchScript and PowershellScript classes have no implementation of exitStatus (see the implementation in BourneShellScript).
My understanding is that the controller should check for an updated heartbeat, which is written by the durable-task-lib binary wrapper. If it detects that the heartbeat is no longer being updated, the job should be terminated.
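To make the heartbeat check described above concrete, here is a minimal sketch of a staleness test. It is not the plugin's code: the class name `HeartbeatCheck`, the method `isStale`, the use of a file's last-modified time as the heartbeat, and the choice to treat a missing file as stale are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of the durable-task plugin.
public class HeartbeatCheck {

    /**
     * Returns true when the heartbeat file is missing or was last touched
     * more than {@code timeout} before {@code now}.
     * Assumption: the wrapper signals liveness by updating the file's
     * last-modified time.
     */
    public static boolean isStale(Path heartbeat, Instant now, Duration timeout)
            throws IOException {
        if (!Files.exists(heartbeat)) {
            // Assumption: no heartbeat file at all counts as a dead process.
            return true;
        }
        Instant lastBeat = Files.getLastModifiedTime(heartbeat).toInstant();
        return Duration.between(lastBeat, now).compareTo(timeout) > 0;
    }
}
```

A controller's exit-status logic could call something like `isStale` on each poll and report a failure exit code once the heartbeat goes stale, so the step terminates instead of waiting forever.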
I did a quick and dirty test to see whether this theory is correct (#263), and the results appear to support it. (This PR is not meant to be a ready solution!) I noticed that although the heartbeat check seems like it should occur every 30 seconds, it only occurred 30 seconds after the agent was first brought back online. If you repeat this process twice (needed because the first time is the initial check and does not really count), you will see that the job then exits correctly. I expected this check to occur even if the agent is not back online, so there are likely issues with my test PR or with my understanding of how things work. This behavior should be confirmed as part of the bug fix.
Testing done
Submitter checklist