Skip to content

Selector: add timeout to PlannedHome to prevent deadlock#359

Open
tchejunior wants to merge 1 commit into
prusa3d:mainfrom
tchejunior:fix/selector-plannedhome-deadlock-timeout
Open

Selector: add timeout to PlannedHome to prevent deadlock#359
tchejunior wants to merge 1 commit into
prusa3d:mainfrom
tchejunior:fix/selector-plannedhome-deadlock-timeout

Conversation

@tchejunior
Copy link
Copy Markdown

Disclaimer: I used Claude code for help with this fix.

Add a plannedHomeTimer counter that increments each main-loop iteration while the selector is in PlannedHome waiting for the idler to become HomingValid. If 30 000 iterations elapse without the idler becoming valid, HomeFailed() is called, transitioning to HomingFailed which surfaces an error screen to the user.

This prevents an indefinite deadlock when the idler is stuck in a non-error state (e.g. Ready with homingValid==false) that WaitForModulesErrorRecovery cannot detect.

This used to happen to me when I took too long to load filament when it finishes (in the middle of the night, for example).

In the above example, I normally lost the print, nothing would work, pressing any buttons both in the MMU3 unit, or in the display. Power cycling (turning it off and back on again) used to fail more often than not.

Unfortunately, I could only test this on a Core One, however, it should work with all printer models.

Add a plannedHomeTimer counter that increments each main-loop
iteration while the selector is in PlannedHome waiting for the
idler to become HomingValid. If 30 000 iterations elapse without
the idler becoming valid, HomeFailed() is called, transitioning
to HomingFailed which surfaces an error screen to the user.

This prevents an indefinite deadlock when the idler is stuck in
a non-error state (e.g. Ready with homingValid==false) that
WaitForModulesErrorRecovery cannot detect.
@github-actions
Copy link
Copy Markdown

All values in bytes. Δ Delta to base

ΔFlash ΔSRAM Used Flash Used SRAM Free Flash Free SRAM
66 2 28230 1675 442 885

@github-actions
Copy link
Copy Markdown

Automated Test Code Coverage Report

View details...

File Lines Exec Cover
src/application.cpp 169 14 8%
src/application.h 3 3 100%
src/hal/circular_buffer.h 52 52 100%
src/hal/eeprom.h 1 0 0%
src/hal/gpio.h 18 7 38%
src/hal/progmem.h 6 6 100%
src/hal/tmc2130.cpp 113 9 7%
src/hal/tmc2130.h 31 27 87%
src/logic/command_base.cpp 139 118 84%
src/logic/command_base.h 4 3 75%
src/logic/cut_filament.cpp 117 80 68%
src/logic/eject_filament.cpp 77 60 77%
src/logic/feed_to_bondtech.cpp 60 57 95%
src/logic/feed_to_bondtech.h 1 1 100%
src/logic/feed_to_finda.cpp 48 45 93%
src/logic/feed_to_finda.h 1 1 100%
src/logic/home.cpp 18 12 66%
src/logic/load_filament.cpp 84 76 90%
src/logic/load_filament.h 1 0 0%
src/logic/move_selector.cpp 21 0 0%
src/logic/no_command.h 2 1 50%
src/logic/retract_from_finda.cpp 27 21 77%
src/logic/retract_from_finda.h 1 1 100%
src/logic/set_mode.cpp 5 0 0%
src/logic/set_mode.h 1 0 0%
src/logic/start_up.cpp 38 26 68%
src/logic/start_up.h 4 4 100%
src/logic/tool_change.cpp 107 85 79%
src/logic/unload_filament.cpp 81 61 75%
src/logic/unload_to_finda.cpp 50 46 92%
src/logic/unload_to_finda.h 1 1 100%
src/modules/axisunit.h 21 21 100%
src/modules/buttons.cpp 11 11 100%
src/modules/buttons.h 7 7 100%
src/modules/crc.h 13 13 100%
src/modules/debouncer.cpp 28 24 85%
src/modules/debouncer.h 7 7 100%
src/modules/finda.cpp 7 3 42%
src/modules/finda.h 2 2 100%
src/modules/fsensor.cpp 6 6 100%
src/modules/fsensor.h 3 3 100%
src/modules/globals.cpp 47 42 89%
src/modules/globals.h 34 24 70%
src/modules/idler.cpp 80 73 91%
src/modules/idler.h 12 12 100%
src/modules/leds.cpp 56 54 96%
src/modules/leds.h 16 15 93%
src/modules/math.h 6 6 100%
src/modules/motion.cpp 59 40 67%
src/modules/motion.h 66 64 96%
src/modules/movable_base.cpp 73 69 94%
src/modules/movable_base.h 16 16 100%
src/modules/permanent_storage.cpp 144 89 61%
src/modules/protocol.cpp 216 184 85%
src/modules/protocol.h 72 70 97%
src/modules/pulley.cpp 26 25 96%
src/modules/pulley.h 9 5 55%
src/modules/pulse_gen.cpp 95 89 93%
src/modules/pulse_gen.h 53 51 96%
src/modules/selector.cpp 72 64 88%
src/modules/selector.h 6 6 100%
src/modules/speed_table.h 26 24 92%
src/modules/user_input.cpp 39 39 100%
src/modules/user_input.h 12 12 100%
src/modules/voltage.cpp 4 0 0%
src/registers.cpp 86 22 25%
src/unit.h 12 12 100%
TOTAL 2723 2021 74%

TOTAL: 2723 lines of code, 2021 lines executed, 74% covered.

@gudnimg
Copy link
Copy Markdown
Collaborator

gudnimg commented Mar 30, 2026

Hi @tchejunior 👋 thanks for making a PR.

For the fix to be merged, we must ideally be able to reproduce the issue in the unit tests. The firmware is one big state machine, so it should be possible unless I am missing something.

Do you have a reliable way to reproduce the issue on real hardware? With detailed step by step instructions.

I have not encountered this issue on my MK4S.

I am not convinced adding a timeout like this solves the real underlying issue. But am open to be wrong.

@gudnimg gudnimg requested review from DRracer and gudnimg March 30, 2026 08:05
@tchejunior
Copy link
Copy Markdown
Author

tchejunior commented Mar 30, 2026 via email

@DRracer
Copy link
Copy Markdown
Collaborator

DRracer commented Mar 30, 2026

Currently, I don't see a way how to cause the assumption documented in the PR:

// Idler did not become homing-valid within ~30 000 main-loop iterations.
// This breaks a potential deadlock where the idler is stuck in a non-error
// state (e.g. Ready with homingValid==false) that WaitForModulesErrorRecovery
// cannot detect. HomeFailed() transitions selector to HomingFailed, which IS
// detected and surfaces an error screen to the user.

The Idler MUST finish homing in order to proceed with further steps. There is NO way around. If it fails, it must retry until homing succeeds. No timeout can solve that. Relying on a timer results only in having a non-homed Idler which implies failing load/unload.

@tchejunior technically: we have real time / timers implemented in the infrastructure, it looks like you failed to instruct Claude properly to fix whatever it was supposed to fix. Even playing with time is simulated, so the unit test may step +5h if necessary.

@gudnimg is correct, if there is a deadlock, it must be reproduced in the unit tests. The whole logic level is already covered, so adding a dedicated test case shouldn't be that hard (at least for Claude).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants