Allow control of hm dockers from jackhammer. #481
Conversation
```python
if use_hostmanager:
    cprint("Bringing down all agents via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'down')])
```
Can you do this for just the requested slots? I could imagine it would be disruptive in some cases to restart the det-controller instances for all slots when you are only hammering one of them.
Yes, we can change it to only bring down the controllers for the slots of interest.
Yes, I think this is a good idea, and matches the behavior that jackhammer used to have before it was updated for use with Host Manager.
As currently written, jackhammer will also bring down the monitor, suprsync, and det-crate agents.
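For illustration, a minimal sketch of a per-slot version, assuming the `ocs.ocs_client.OCSClient` import path and a hypothetical `pysmurf-controller-s{slot}` instance-id convention (the real instance-ids come from the site's OCS config):

```python
# Sketch only: bring down just the agents tied to the requested slots rather
# than ('all', 'down'). The instance-id pattern below is hypothetical; the
# actual names must match the site config.
from ocs.ocs_client import OCSClient

def down_slot_controllers(hm_instance_id, slots):
    hm = OCSClient(hm_instance_id)
    requests = [(f'pysmurf-controller-s{slot}', 'down') for slot in slots]
    hm.update(requests=requests)
```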
Sorry not to have looked into this earlier, but I think a consequence of this change is that we will never be writing logs for the ocs-managed containers to disk (which may be ok). Here's my understanding of what happens: the log dump only scrapes logs from the containers that are still running at that point.

So if we add as a first step to bring down the OCS-managed ones, they will not be added to the list of containers from which to scrape the logs. It's confusing that this mechanism would have become an issue after moving to OCS management, since before that the containers would not have been killed prior to dumping the logs either. Is it possible OCS is starting them in a different way that is somehow causing the logs command to hang?

I think my takeaway is that I don't understand why this is happening, and while this will likely fix the problem, it's not for the reason that is being advertised, and the result will be that we no longer have these logs. But if dumping the logs is unreliable and we don't want to spend the time to figure out why, maybe just eliminating that step could be the solution. Logs are forwarded to Loki in any case, right?
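For reference, a minimal sketch of the scrape-only-running-containers behaviour described above (this is an assumption about how the log dump works, not the actual jackhammer code):

```python
# Sketch of the mechanism described above: only containers still running at
# dump time show up in `docker ps`, so only those get their logs written out.
import subprocess

def dump_running_container_logs(out_dir):
    names = subprocess.run(
        ['docker', 'ps', '--format', '{{.Names}}'],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for name in names:
        result = subprocess.run(['docker', 'logs', name],
                                capture_output=True, text=True)
        with open(f'{out_dir}/{name}.log', 'w') as f:
            f.write(result.stdout + result.stderr)
```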
And I guess I was assuming dumping the logs is what was hanging, but I'm not sure if that's true. In any case, restarting the OCS dockers sounds like the solution; I think we should think about how this changes the behaviour of the logs dump. (That does seem like the most likely place it is hanging, though.)
All of the OCS container logs are being stored in Loki so we can always grab those logs. Is it so important to have them dumped out to the logs folder?
I don't think it's important to dump them to the log folder, and if others agree then should we just remove that step entirely?
That's probably fine. Mostly we want to use the logs in Loki anyway.
This is a good point; can the ...
I agree it's also confusing. If it's really the logs command hanging, can someone try to run a `docker logs` command on one of these containers by hand? OCS is also just running a standard docker command to start them.
They are forwarded to Loki.
Looking back at my issue and thinking about past instances where I've helped hammer -- I think the problem might stem from the agent logs being too verbose. It might not be actually hanging, but spending a long time writing the logs because they're extremely long. This is probably why the problem isn't seen every time we hammer. And the reason taking them down first works is because the logs are now really short. If you restart all of the OCS agents (i.e. clear their logs and have them running), then hammer, does it still hang?
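One way to check this theory without changing jackhammer, assuming the dump step is effectively running `docker logs`: time a full dump against a tail-limited one for each container.

```python
# Rough check of the "logs are just very long" theory: if a full `docker logs`
# takes much longer (or returns far more output) than a tail-limited one,
# verbosity rather than a true hang is the likely culprit.
import subprocess
import time

def time_log_dump(container, tail=None):
    cmd = ['docker', 'logs', container]
    if tail is not None:
        cmd += ['--tail', str(tail)]
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    size = len(result.stdout) + len(result.stderr)
    print(f'{container}: {size} bytes in {elapsed:.1f}s (tail={tail})')
```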
BrianJKoopman left a comment:
Thanks for taking this on (and sorry I never ended up getting to it)! A couple of comments below related to how I was imagining implementing this.
```python
if use_hostmanager:
    cprint("Bringing down all agents via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'down')])
```
I would at least move this to after the "Are you sure (y/n)?" prompt. Otherwise, hammering and then backing out with a 'n' response would leave everything on the OCS side of things offline.
Following the conversation in this issue -- I think you should move it a bit further down even, below the dump_docker_logs() line so that we get the agent logs too.
If my theory on the logs being too long is correct, then maybe we just limit the log dump to the past X lines via the -n X flag?
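A sketch of the ordering suggested here, with `dump_docker_logs` and the confirmation prompt taken from this thread and the rest assumed (not the actual jackhammer code):

```python
# Sketch of the suggested flow: confirm first, dump (tail-limited) logs next,
# and only then bring the OCS-managed agents down via HostManager.
from ocs.ocs_client import OCSClient

def hammer_prelude(hm_instance_id, use_hostmanager, dump_docker_logs):
    if input("Are you sure (y/n)? ").strip().lower() != 'y':
        return False  # backing out no longer leaves the OCS side offline
    dump_docker_logs(tail=1000)  # hypothetical tail limit, i.e. `docker logs -n 1000`
    if use_hostmanager:
        hm = OCSClient(hm_instance_id)
        hm.update(requests=[('all', 'down')])
    return True
```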
```python
use_hostmanager: bool = sys_config.get('use_hostmanager', False)
hm_instance_id: str = sys_config.get('hostmanager_instance_id')
```
From my original issue on this:
This may require adding the host manager instance-id to sys_config.yaml, or maybe just assuming there is only one host manager (which there should be now, in all cases on site).
What I was trying to suggest here is that having use_hostmanager and hm_instance_id feels redundant. If hm_instance_id is present, we'll be using the HM. And we never want to provide an hm_instance_id and set use_hostmanager = False.
Instead I'd suggest something like this:
```diff
-use_hostmanager: bool = sys_config.get('use_hostmanager', False)
-hm_instance_id: str = sys_config.get('hostmanager_instance_id')
+hm_instance_id: str = sys_config.get('hostmanager_instance_id')
+if hm_instance_id:
+    use_hostmanager: bool = True
+else:
+    if 'use_hostmanager' in sys_config:
+        print("'use_hostmanager' config is deprecated. Provide " +
+              "'hostmanager_instance_id' or remove from sys_config.yml.")
+    use_hostmanager: bool = sys_config.get('use_hostmanager', False)
```
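For completeness, a self-contained version of the suggested handling, with a stub `sys_config` dict standing in for the parsed sys_config.yaml (the `'hm-1'` value is just an example):

```python
# Stub config: real values come from sys_config.yaml. With only
# 'hostmanager_instance_id' set, use_hostmanager is implied.
sys_config = {'hostmanager_instance_id': 'hm-1'}

hm_instance_id = sys_config.get('hostmanager_instance_id')
if hm_instance_id:
    use_hostmanager = True
else:
    if 'use_hostmanager' in sys_config:
        print("'use_hostmanager' config is deprecated. Provide "
              "'hostmanager_instance_id' or remove from sys_config.yml.")
    use_hostmanager = sys_config.get('use_hostmanager', False)

print(use_hostmanager, hm_instance_id)  # -> True hm-1
```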
This is my attempt to address the issue raised in #463, where we now need to manually bring down the det-controllers and monitors via ocs-web before hammering and bring them back up after. This should allow jackhammer to do this itself, but it hasn't been tested yet.