
Conversation

@msilvafe
Contributor

This is my attempt to address the issue raised in #463, where we now need to manually bring down the det-controllers and monitors via ocs-web before hammering and bring them back up after. This change should allow jackhammer to handle that itself, but it hasn't been tested yet.

```python
if use_hostmanager:
    cprint("Bringing down all agents via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'down')])
```
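For the "back up after" part, a matching call once the hammer finishes could look roughly like this (just a sketch, not yet in the diff; it assumes 'up' is accepted as a target state by the HostManager update task):

```python
# Hypothetical follow-up (not in this PR): bring the agents back up once the
# hammer has finished. Assumes the HostManager 'update' task accepts 'up' as
# a target state, mirroring the 'down' request above.
if use_hostmanager:
    cprint("Bringing agents back up via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'up')])
```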
Contributor

can you do this for just the requested slots? I could imagine it would be disruptive in some cases to restart the det-controller instance of all slots when you are only hammering one of them

Contributor Author

Yes, we can change it to only bring down the controllers for the slots of interest.
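Something like the following, maybe (the instance-id pattern and the `slots` variable here are placeholders, not from the current diff, and would need to match the site's actual OCS instance-ids):

```python
# Hypothetical per-slot version: only bring down the det-controller instances
# for the slots being hammered, instead of 'all'. Instance-id naming below is
# a placeholder and must match the site config.
if use_hostmanager:
    hm = OCSClient(hm_instance_id)
    requests = [(f'det-controller-s{slot}', 'down') for slot in slots]
    hm.update(requests=requests)
```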

Member

Yes, I think this is a good idea, and matches the behavior that jackhammer used to have before it was updated for use with Host Manager.

As currently written, jackhammer will also bring down the monitor, suprsync, and det-crate agents.

@tristpinsm
Contributor

tristpinsm commented Nov 20, 2025

Sorry not to have looked into this earlier, but I think a consequence of this change is that we will never be writing logs for the ocs-managed containers to disk (which may be ok)

Here's my understanding of what jackhammer is doing on hammer:

  • dump logs for all running containers
  • kill containers associated with given slot (except those managed by OCS) using docker stop followed by docker rm
  • reboot boards etc and restart containers

So if we add as a first step to bring down the OCS-managed ones, they will not be added to the list of containers from which to scrape the logs.
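In rough pseudo-Python, my mental model of that sequence is something like the sketch below (container names and helper structure are illustrative, not the actual jackhammer code):

```python
import subprocess

# Illustrative sketch of the sequence described above, not the real
# jackhammer implementation.
def hammer_docker_step(running_containers, slot_containers, log_dir):
    # 1. dump logs for all running containers
    for name in running_containers:
        with open(f"{log_dir}/{name}.log", "w") as f:
            subprocess.run(["docker", "logs", name], stdout=f,
                           stderr=subprocess.STDOUT)
    # 2. kill containers associated with the given slot
    #    (except those managed by OCS)
    for name in slot_containers:
        subprocess.run(["docker", "stop", name], check=True)
        subprocess.run(["docker", "rm", name], check=True)
    # 3. rebooting boards and restarting containers happens after this
```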

It's confusing that this mechanism would have become an issue after moving to OCS management, since before that the containers would not have been killed prior to dumping the logs either. Is it possible OCS is starting them in a different way that is somehow causing the logs command to hang?

I think my takeaway is I don't understand why this is happening, and while this will likely fix the problem, it's not for the reason that is being advertised, and the result will be to no longer have these logs.

But if dumping the logs is unreliable and we don't want to spend the time to figure out why, maybe just eliminating that step could be the solution. Logs are forwarded to loki anyway right? Do those log files get archived anywhere, is anyone looking at them? Or would it be safe to just get rid of them?

@tristpinsm
Contributor

tristpinsm commented Nov 20, 2025

and I guess I was assuming dumping the logs is what was hanging, but I'm not sure if that's true. In any case, restarting the OCS dockers sounds like it is the solution, I think we should think about how this changes the behaviour of the logs dump. (that does seem like the most likely place it is hanging though)

@msilvafe
Contributor Author

All of the OCS container logs are being stored in Loki so we can always grab those logs. Is it so important to have them dumped out to the logs folder?

@tristpinsm
Contributor

I don't think it's important to dump them to the log folder, and if others agree then should we just remove that step entirely?

@msilvafe
Contributor Author

> I don't think it's important to dump them to the log folder, and if others agree then should we just remove that step entirely?

That's probably fine. Mostly we want to use the jackhammer dump --dump-rogue feature to dump the rogue and pcie registers; I think those are the only pieces of debugging information that are not stored in the loki logs.

@BrianJKoopman
Member

> I think a consequence of this change is that we will never be writing logs for the ocs-managed containers to disk (which may be ok)

This is a good point, can the "Bringing down all agents via hostmanager" part just be moved to after the log dump? (Reading on, I see that the theory is the log dump is what's hanging things up...)

> It's confusing that this mechanism would have become an issue after moving to OCS management, since before that the containers would not have been killed prior to dumping the logs either. Is it possible OCS is starting them in a different way that is somehow causing the logs command to hang?

I agree it's also confusing. If it's really the logs command hanging, can someone try to run a docker logs on one of the containers the next time they hammer and things hang?

OCS is also just running a docker compose up -d on them.
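A quick way to check would be to time the logs call directly next time it happens (the container name here is just a placeholder):

```python
import subprocess
import time

# Hypothetical diagnostic: time `docker logs` on one of the agent containers
# while a hammer appears to hang. Container name is a placeholder.
start = time.time()
subprocess.run(["docker", "logs", "ocs-det-controller-s2"],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
print(f"docker logs took {time.time() - start:.1f} s")
```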

> But if dumping the logs is unreliable and we don't want to spend the time to figure out why, maybe just eliminating that step could be the solution. Logs are forwarded to loki anyway right? Do those log files get archived anywhere, is anyone looking at them? Or would it be safe to just get rid of them?

They are forwarded to loki, but those are not archived anywhere, and in fact currently loki drops them after a month. I think we should extend this retention period, but there should still be a retention period.

@BrianJKoopman
Member

> If it's really the logs command hanging, can someone try to run a docker logs on one of the containers the next time they hammer and things hang?

Looking back at my issue and thinking about past instances where I've helped hammer -- I think the problem might stem from the agent logs being too verbose. It might not be actually hanging, but spending a long time writing the logs because they're extremely long.

This is probably why the problem isn't seen every time we hammer. And the reason taking them down first works is because the logs are now really short. If you restart all of the OCS agents (i.e. clear their logs and have them running), then hammer, does it still hang?
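One way to check the verbosity theory before hammering (container names are placeholders) would be something like:

```python
import subprocess

# Hypothetical check: count how many log lines each long-running agent
# container has accumulated. Container names are placeholders.
for name in ["ocs-det-controller-s2", "ocs-monitor", "ocs-suprsync"]:
    out = subprocess.run(["docker", "logs", name],
                         capture_output=True, text=True)
    print(name, len(out.stdout.splitlines()), "lines")
```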

Member

@BrianJKoopman left a comment

Thanks for taking this on (and sorry I never ended up getting to it)! A couple of comments below related to how I was imagining implementing this.

Comment on lines +409 to +413

```python
if use_hostmanager:
    cprint("Bringing down all agents via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'down')])
```

Member

I would at least move this to after the prompt for "Are you sure (y/n)?". Hammering and then backing out with a 'n' response would result in everything on the ocs side of things being offline otherwise.

Following the conversation in this issue -- I think you should move it a bit further down even, below the dump_docker_logs() line so that we get the agent logs too.

If my theory on the logs being too long is correct, then maybe we just limit the log dump to the past X lines via the -n X flag?
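Roughly the ordering I have in mind (just a sketch reusing names from this diff; the tail cap is only a guess at how the dump could be limited):

```python
# Rough ordering sketch, not a drop-in patch: confirm first, dump logs next,
# then bring the agents down via HostManager.
if input("Are you sure (y/n)? ").strip().lower() != 'y':
    raise SystemExit(0)

dump_docker_logs()  # could be capped, e.g. `docker logs -n 10000 <container>`

if use_hostmanager:
    cprint("Bringing down all agents via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'down')])
```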

Comment on lines 53 to +54

```python
use_hostmanager: bool = sys_config.get('use_hostmanager', False)
hm_instance_id: str = sys_config.get('hostmanager_instance_id')
```
Member

From my original issue on this:

> This may require adding the host manager instance-id to sys_config.yaml, or maybe just assuming there is only one host manager (which there should be now, in all cases on site).

What I was trying to suggest here is that having use_hostmanager and hm_instance_id feels redundant. If hm_instance_id is present, we'll be using the HM. And we never want to provide an hm_instance_id and set use_hostmanager = False.

Instead I'd suggest something like this:

Suggested change

```diff
-use_hostmanager: bool = sys_config.get('use_hostmanager', False)
-hm_instance_id: str = sys_config.get('hostmanager_instance_id')
+hm_instance_id: str = sys_config.get('hostmanager_instance_id')
+if hm_instance_id:
+    use_hostmanager: bool = True
+else:
+    if 'use_hostmanager' in sys_config:
+        print("'use_hostmanager' config is deprecated. Provide " +
+              "'hostmanager_instance_id' or remove from sys_config.yml.")
+    use_hostmanager: bool = sys_config.get('use_hostmanager', False)
```
