-
Notifications
You must be signed in to change notification settings - Fork 28
Fix hot polling bug and add graceful AGENT_TIMEOUT support #124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Fix should_run_ensemble to check completed runs instead of started count - Add AGENT_TIMEOUT environment variable support with graceful shutdown - Add timestamped logging for better debugging
johscheuer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment about the redundant import.
| def log(outputText, newline=True): | ||
| import datetime | ||
| timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") | ||
| message = f"[{timestamp}] {outputText}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason not to use the logging package? That would add the timestamp. Probably more refactoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that would be better but was thinking minimal change. Should I go for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that in another PR if we think it's worth to refactor :)
|
Address review comment (and added a few timestamps to agent-scaler.sh ouputs) |
johscheuer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
| def log(outputText, newline=True): | ||
| import datetime | ||
| timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") | ||
| message = f"[{timestamp}] {outputText}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that in another PR if we think it's worth to refactor :)
This was an interesting one. Some of the nightly jobs were failing. They were 'timing out' running joshua tests. In the nightly report, the jobs would be 'in progress' usually and then much later in the morning report as joshua test runs timed out with NO results. Looking at the joshua cluster that runs the nightlies, joshua-agent pod counts had us pegged up at near the 10k limit. Poking around it seemed odd... pods didn't seem to be doing anything.
Turns out ensembles stick around in the database for a while after they finish -- 7 days -- but for some of the ensembles, even though they were 'done', they would report that they were still 'alive' ("Can run: True "). This would happen when the start count fell below max_runs count which could happen if an agent died for whatever reason (many): when an agent dies, start is decremented in the cleanup and should_run_ensemble method in joshua_model.py would return 'True'.
The cluster had 70 odd ensembles when I went to look at it. They were left over mostly from December 30/31st. These were reporting they had work still to be done. So agents and scheduler would asking the ensemble for work... but there was none. This happened thousands of times. This behavior kept the pod count elevated.
joshua_model.py is used by two images... joshua-agent and agent-scaler. I deployed both with the fixes here to see how they do. With the fixes deployed, the problem ensembles are not longer reporting 'True' out of should_run_ensemble for these old ensembles queued last year.
While in here updated logging to include timestamp and read a timeout environment varible that was previously ignored (nit).