@titoiride titoiride commented Nov 13, 2025

This PR adds a script that periodically checks whether the code execution on Frontier is proceeding as expected.
If the code hangs, which is detected by checking the output file's timestamp, the execution is killed to save computational time.

Compared to previous attempts, this script uses two timers: a frequent check and an output-modification timeout. The check runs every check_interval seconds and verifies that the output file was last modified no more than timeout_sec seconds ago. Separating the two timers, rather than running a single check every timeout_sec seconds, prevents the script from holding up a simulation that, for instance, finishes correctly between two checks. If the simulation ends partway through a timeout_sec window, the script waits at most check_interval seconds before exiting.
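
For illustration only, the two-timer logic described above could be sketched roughly as follows. This is not the script added in this PR: the function name `watch`, its arguments `proc` and `output_file`, and the returned strings are hypothetical, and only `check_interval` and `timeout_sec` are taken from the description.

```python
import os
import subprocess
import time


def watch(proc, output_file, check_interval, timeout_sec):
    """Poll a running process and kill it if its output file goes stale.

    Hypothetical sketch: `proc` is a subprocess.Popen handle and
    `output_file` is the file the simulation is expected to keep updating.
    Returns "finished" if the process exited on its own, "killed" otherwise.
    """
    start = time.time()
    while True:
        # Short timer: wake up every check_interval seconds.
        time.sleep(check_interval)

        # If the simulation already ended, stop right away instead of
        # waiting for a full timeout_sec window to elapse.
        if proc.poll() is not None:
            return "finished"

        # Long timer: time since the output file was last modified.
        # Assumption: before the file exists, count from the start time.
        if os.path.exists(output_file):
            last_activity = os.path.getmtime(output_file)
        else:
            last_activity = start

        if time.time() - last_activity > timeout_sec:
            # No output for too long: treat the run as hung and kill it.
            proc.kill()
            proc.wait()
            return "killed"
```

In this sketch, a run that exits normally is noticed within one `check_interval`, while a hung run is killed once the output file has been stale for more than `timeout_sec` (plus at most one `check_interval` of polling delay).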

Close #5584

@titoiride titoiride requested a review from ax3l November 14, 2025 23:00
@ax3l ax3l self-assigned this Dec 2, 2025
@ax3l ax3l added the component: documentation and machine / system labels Dec 2, 2025

Development

Successfully merging this pull request may close these issues.

HPC: Monitoring Progress & Aborting
