Selector: add timeout to PlannedHome to prevent deadlock #359
Add a plannedHomeTimer counter that increments each main-loop iteration while the selector is in PlannedHome waiting for the idler to become HomingValid. If 30 000 iterations elapse without the idler becoming valid, HomeFailed() is called, transitioning to HomingFailed which surfaces an error screen to the user. This prevents an indefinite deadlock when the idler is stuck in a non-error state (e.g. Ready with homingValid==false) that WaitForModulesErrorRecovery cannot detect.
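Here is a minimal sketch of the mechanism described above, built on a heavily simplified selector/idler model. `SelectorSketch`, `IdlerStub`, `SelectorState` and `kPlannedHomeTimeoutIterations` are hypothetical names used only for illustration. They are not the firmware's actual classes. Only the PlannedHome/HomingFailed states, the `plannedHomeTimer` counter, the 30 000-iteration limit and `HomeFailed()` come from the PR text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the selector states involved in this PR.
enum class SelectorState : uint8_t { PlannedHome, Homing, HomingFailed };

// Hypothetical stand-in for the idler: only the flag the selector waits on.
struct IdlerStub {
    bool homingValid = false;
};

// Timeout from the PR description, counted in main-loop iterations.
constexpr uint16_t kPlannedHomeTimeoutIterations = 30000;

class SelectorSketch {
public:
    SelectorState State() const { return state; }
    uint16_t Timer() const { return plannedHomeTimer; }

    // Called once per main-loop iteration.
    void Step(const IdlerStub &idler) {
        if (state != SelectorState::PlannedHome) {
            return; // the counter only runs while waiting in PlannedHome
        }
        if (idler.homingValid) {
            plannedHomeTimer = 0;
            state = SelectorState::Homing; // idler is homed, selector may home now
            return;
        }
        if (++plannedHomeTimer >= kPlannedHomeTimeoutIterations) {
            HomeFailed();
        }
    }

private:
    // Stand-in for HomeFailed(): the real firmware would surface an error screen.
    void HomeFailed() {
        plannedHomeTimer = 0;
        state = SelectorState::HomingFailed;
    }

    SelectorState state = SelectorState::PlannedHome;
    uint16_t plannedHomeTimer = 0;
};

// Usage: with the idler never becoming valid, the selector gives up after
// exactly kPlannedHomeTimeoutIterations calls to Step().
inline void TimeoutExample() {
    SelectorSketch selector;
    IdlerStub idler; // homingValid stays false
    for (uint16_t i = 1; i < kPlannedHomeTimeoutIterations; ++i) {
        selector.Step(idler);
    }
    assert(selector.State() == SelectorState::PlannedHome);
    selector.Step(idler);
    assert(selector.State() == SelectorState::HomingFailed);
}
```

The limit counts loop iterations, not elapsed time. The wall-clock length of this timeout therefore depends on how long each main-loop pass takes.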
Automated Test Code Coverage Report
TOTAL: 2723 lines of code, 2021 lines executed, 74% covered.
Hi @tchejunior 👋 thanks for making a PR. For the fix to be merged, we must ideally be able to reproduce the issue in the unit tests. The firmware is one big state machine, so it should be possible, unless I am missing something. Do you have a reliable way to reproduce the issue on real hardware, with detailed step-by-step instructions? I have not encountered this issue on my MK4S. I am not convinced that adding a timeout like this solves the real underlying issue, but I am open to being wrong.
Hi!
I am sorry I did not include the exact steps to reproduce this. I do not know them right now, so I will re-test, but basically you simply have to wait a long time (5+ hours) before doing the filament change. I have already heavily modified both firmwares, so I will re-flash the current versions and do a more "controlled" test. I will do this next week; I am in the process of moving, and everything is in boxes right now. :)
Currently, I don't see a way to cause the situation assumed in the PR:
The Idler MUST finish homing in order to proceed with further steps. There is NO way around it. If homing fails, it must retry until it succeeds. No timeout can solve that. Relying on a timer only results in a non-homed Idler, which implies failing load/unload.
@tchejunior, technically: real time and timers are already implemented in the infrastructure; it looks like you failed to instruct Claude properly to fix whatever it was supposed to fix. Even time is simulated, so a unit test can step forward 5 hours if necessary. @gudnimg is correct: if there is a deadlock, it must be reproduced in the unit tests. The whole logic level is already covered, so adding a dedicated test case shouldn't be that hard (at least for Claude).
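The reviewer raises two technical points. First, a wait should be measured in real time rather than in loop iterations. Second, a unit test can jump 5 hours ahead in simulated time. The minimal sketch below illustrates both. `SimClock`, `Elapsed` and `kFiveHoursMs` are hypothetical names. The firmware's actual timer and test-harness API is not shown in this thread, so this models only the idea.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simulated millisecond clock for unit tests: time only moves
// when the test advances it, so "waiting 5 hours" runs instantly.
class SimClock {
public:
    uint32_t Millis() const { return nowMs; }
    // Unsigned addition wraps modulo 2^32, like a free-running tick counter.
    void Advance(uint32_t ms) { nowMs += ms; }

private:
    uint32_t nowMs = 0;
};

// Wraparound-safe check: the unsigned difference now - start is correct even
// if the millisecond counter overflowed in between.
inline bool Elapsed(uint32_t now, uint32_t start, uint32_t intervalMs) {
    return static_cast<uint32_t>(now - start) >= intervalMs;
}

constexpr uint32_t kFiveHoursMs = 5UL * 60UL * 60UL * 1000UL; // 18,000,000 ms

// Usage: jump 5 hours ahead in simulated time instead of waiting in real life.
inline void FiveHourJumpExample() {
    SimClock clock;
    const uint32_t start = clock.Millis();
    clock.Advance(kFiveHoursMs - 1);
    assert(!Elapsed(clock.Millis(), start, kFiveHoursMs));
    clock.Advance(1);
    assert(Elapsed(clock.Millis(), start, kFiveHoursMs));
}
```

A real regression test would then step the firmware's state machine after such a jump and check that homing still completes. That harness is not shown here.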
Disclaimer: I used Claude Code to help with this fix.
This deadlock used to happen to me when I took too long to load new filament after the previous one ran out (in the middle of the night, for example).
In that situation I usually lost the print: nothing worked, and pressing buttons on either the MMU3 unit or the display had no effect. Power cycling (turning it off and back on again) failed more often than not.
Unfortunately, I could only test this on a Core One; however, it should work with all printer models.