
Conversation

@msilvafe
Contributor

This is my attempt to address the issue raised in #463, where we now need to manually bring down the det-controllers and monitors via ocs-web before hammering and bring them back up after. This change should allow jackhammer to handle that itself, but it hasn't been tested yet.

```python
if use_hostmanager:
    cprint("Bringing down all agents via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'down')])
```
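For the "back up after" part, a matching call once the hammer finishes could look roughly like this (just a sketch, not yet in the diff; it assumes 'up' is accepted as a target state by the HostManager update task):

```python
# Hypothetical follow-up (not in this PR): bring the agents back up once the
# hammer has finished. Assumes the HostManager 'update' task accepts 'up' as
# a target state, mirroring the 'down' request above.
if use_hostmanager:
    cprint("Bringing agents back up via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'up')])
```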
Contributor

can you do this for just the requested slots? I could imagine it would be disruptive in some cases to restart the det-controller instance of all slots when you are only hammering one of them

Contributor Author

Yes, we can change it to only bring down the controllers for the slots of interest.
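Something like the following, maybe (the instance-id pattern and the `slots` variable here are placeholders, not from the current diff, and would need to match the site's actual OCS instance-ids):

```python
# Hypothetical per-slot version: only bring down the det-controller instances
# for the slots being hammered, instead of 'all'. Instance-id naming below is
# a placeholder and must match the site config.
if use_hostmanager:
    hm = OCSClient(hm_instance_id)
    requests = [(f'det-controller-s{slot}', 'down') for slot in slots]
    hm.update(requests=requests)
```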

Member

Yes, I think this is a good idea, and matches the behavior that jackhammer used to have before it was updated for use with Host Manager.

As currently written, jackhammer will also bring down the monitor, suprsync, and det-crate agents.

@tristpinsm
Contributor

tristpinsm commented Nov 20, 2025

Sorry not to have looked into this earlier, but I think a consequence of this change is that we will never be writing logs for the ocs-managed containers to disk (which may be ok)

Here's my understanding of what jackhammer is doing on hammer:

  • dump logs for all running containers
  • kill containers associated with given slot (except those managed by OCS) using docker stop followed by docker rm
  • reboot boards etc and restart containers

So if we add as a first step to bring down the OCS-managed ones, they will not be added to the list of containers from which to scrape the logs.
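In rough pseudo-Python, my mental model of that sequence is something like the sketch below (container names and helper structure are illustrative, not the actual jackhammer code):

```python
import subprocess

# Illustrative sketch of the sequence described above, not the real
# jackhammer implementation.
def hammer_docker_step(running_containers, slot_containers, log_dir):
    # 1. dump logs for all running containers
    for name in running_containers:
        with open(f"{log_dir}/{name}.log", "w") as f:
            subprocess.run(["docker", "logs", name], stdout=f,
                           stderr=subprocess.STDOUT)
    # 2. kill containers associated with the given slot
    #    (except those managed by OCS)
    for name in slot_containers:
        subprocess.run(["docker", "stop", name], check=True)
        subprocess.run(["docker", "rm", name], check=True)
    # 3. rebooting boards and restarting containers happens after this
```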

It's confusing that this mechanism would have become an issue after moving to OCS management, since before that the containers would not have been killed prior to dumping the logs either. Is it possible OCS is starting them in a different way that is somehow causing the logs command to hang?

I think my takeaway is I don't understand why this is happening, and while this will likely fix the problem, it's not for the reason that is being advertised, and the result will be to no longer have these logs.

But if dumping the logs is unreliable and we don't want to spend the time to figure out why, maybe just eliminating that step could be the solution. Logs are forwarded to loki anyway right? Do those log files get archived anywhere, is anyone looking at them? Or would it be safe to just get rid of them?

@tristpinsm
Contributor

tristpinsm commented Nov 20, 2025

and I guess I was assuming dumping the logs is what was hanging, but I'm not sure if that's true. In any case, restarting the OCS dockers sounds like it is the solution, I think we should think about how this changes the behaviour of the logs dump. (that does seem like the most likely place it is hanging though)

@msilvafe
Contributor Author

All of the OCS container logs are being stored in Loki so we can always grab those logs. Is it so important to have them dumped out to the logs folder?

@tristpinsm
Contributor

I don't think it's important to dump them to the log folder, and if others agree then should we just remove that step entirely?

@msilvafe
Contributor Author

> I don't think it's important to dump them to the log folder, and if others agree then should we just remove that step entirely?

That's probably fine. Mostly we want to use the jackhammer dump --dump-rogue feature to dump the rogue and pcie registers; I think those are the only pieces of debugging information that are not stored in the loki logs.

@BrianJKoopman
Member

> I think a consequence of this change is that we will never be writing logs for the ocs-managed containers to disk (which may be ok)

This is a good point, can the "Bringing down all agents via hostmanager" part just be moved to after the log dump? (Reading on, I see that the theory is the log dump is what's hanging things up...)

> It's confusing that this mechanism would have become an issue after moving to OCS management, since before that the containers would not have been killed prior to dumping the logs either. Is it possible OCS is starting them in a different way that is somehow causing the logs command to hang?

I agree it's also confusing. If it's really the logs command hanging, can someone try to run a docker logs on one of the containers the next time they hammer and things hang?

OCS is also just running a docker compose up -d on them.
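A quick way to check would be to time the logs call directly next time it happens (the container name here is just a placeholder):

```python
import subprocess
import time

# Hypothetical diagnostic: time `docker logs` on one of the agent containers
# while a hammer appears to hang. Container name is a placeholder.
start = time.time()
subprocess.run(["docker", "logs", "ocs-det-controller-s2"],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
print(f"docker logs took {time.time() - start:.1f} s")
```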

> But if dumping the logs is unreliable and we don't want to spend the time to figure out why, maybe just eliminating that step could be the solution. Logs are forwarded to loki anyway right? Do those log files get archived anywhere, is anyone looking at them? Or would it be safe to just get rid of them?

They are forwarded to loki, but those are not archived anywhere, and in fact currently loki drops them after a month. I think we should extend this retention period, but there should still be a retention period.

@BrianJKoopman
Member

> If it's really the logs command hanging, can someone try to run a docker logs on one of the containers the next time they hammer and things hang?

Looking back at my issue and thinking about past instances where I've helped hammer -- I think the problem might stem from the agent logs being too verbose. It might not be actually hanging, but spending a long time writing the logs because they're extremely long.

This is probably why the problem isn't seen every time we hammer. And the reason taking them down first works is because the logs are now really short. If you restart all of the OCS agents (i.e. clear their logs and have them running), then hammer, does it still hang?
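One way to check the verbosity theory before hammering (container names are placeholders) would be something like:

```python
import subprocess

# Hypothetical check: count how many log lines each long-running agent
# container has accumulated. Container names are placeholders.
for name in ["ocs-det-controller-s2", "ocs-monitor", "ocs-suprsync"]:
    out = subprocess.run(["docker", "logs", name],
                         capture_output=True, text=True)
    print(name, len(out.stdout.splitlines()), "lines")
```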

Member

@BrianJKoopman left a comment

Thanks for taking this on (and sorry I never ended up getting to it)! A couple of comments below related to how I was imagining implementing this.

Comment on lines +409 to +413

```python
if use_hostmanager:
    cprint("Bringing down all agents via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'down')])
```

Member

I would at least move this to after the prompt for "Are you sure (y/n)?". Hammering and then backing out with a 'n' response would result in everything on the ocs side of things being offline otherwise.

Following the conversation in this issue -- I think you should move it a bit further down even, below the dump_docker_logs() line so that we get the agent logs too.

If my theory on the logs being too long is correct, then maybe we just limit the log dump to the past X lines via the -n X flag?
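Roughly the ordering I have in mind (just a sketch reusing names from this diff; the tail cap is only a guess at how the dump could be limited):

```python
# Rough ordering sketch, not a drop-in patch: confirm first, dump logs next,
# then bring the agents down via HostManager.
if input("Are you sure (y/n)? ").strip().lower() != 'y':
    raise SystemExit(0)

dump_docker_logs()  # could be capped, e.g. `docker logs -n 10000 <container>`

if use_hostmanager:
    cprint("Bringing down all agents via hostmanager", style=TermColors.HEADER)
    hm = OCSClient(hm_instance_id)
    hm.update(requests=[('all', 'down')])
```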

Comment on lines 53 to +54

```python
use_hostmanager: bool = sys_config.get('use_hostmanager', False)
hm_instance_id: str = sys_config.get('hostmanager_instance_id')
```
Member

From my original issue on this:

> This may require adding the host manager instance-id to sys_config.yaml, or maybe just assuming there is only one host manager (which there should be now, in all cases on site).

What I was trying to suggest here is that having use_hostmanager and hm_instance_id feels redundant. If hm_instance_id is present, we'll be using the HM. And we never want to provide an hm_instance_id and set use_hostmanager = False.

Instead I'd suggest something like this:

Suggested change

```diff
-use_hostmanager: bool = sys_config.get('use_hostmanager', False)
-hm_instance_id: str = sys_config.get('hostmanager_instance_id')
+hm_instance_id: str = sys_config.get('hostmanager_instance_id')
+if hm_instance_id:
+    use_hostmanager: bool = True
+else:
+    if 'use_hostmanager' in sys_config:
+        print("'use_hostmanager' config is deprecated. Provide " +
+              "'hostmanager_instance_id' or remove from sys_config.yml.")
+    use_hostmanager: bool = sys_config.get('use_hostmanager', False)
```
