Polling resources get forever stuck after a network disconnect #876
Ah, wait, that is not latest
Oh, very interesting - I was going to ctrl+c before I go to sleep, but mgmt just hung and never exited. I had to pass SIGKILL to make it stop.
Thank you for reporting this. Please try with only one VM. If it hangs, press ^\ (see https://purpleidea.com/blog/2016/02/15/debugging-golang-programs/) and paste the full trace somewhere. Please make sure you're on git master too. It is possible that this is an issue with the specific resource not handling a context. When we find those, we should absolutely fix them. I don't see anything overly obvious in the hetzner code, but I also didn't write it. It could also do with a bump in the library, so if you want to go get -u the deps for it, I am happy to merge those changes. Lastly, there could in theory be a deadlock in the engine related to polling, or another bug, but if so, I think we'd find it with a trace. I actually have a pending "be extra extra careful to not deadlock" patch queued, but I think it's not even required. I should hopefully merge it anyway this month. Thank you for playing with all this!
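For anyone following along: pressing ^\ sends SIGQUIT, and the Go runtime responds by printing the stack of every goroutine before exiting. The sketch below produces a similar dump from inside a program using `runtime.Stack`. It is only an illustration; `allStacks` is a made-up helper name and nothing here is taken from the mgmt code base.

```go
package main

import (
	"fmt"
	"runtime"
)

// allStacks returns the stack traces of all goroutines as text.
// allStacks is a hypothetical helper, not part of mgmt.
// The output is truncated if it does not fit in the 1 MiB buffer.
func allStacks() string {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true means: include every goroutine
	return string(buf[:n])
}

func main() {
	block := make(chan struct{})
	go func() { <-block }() // a goroutine that is stuck on purpose

	// The dump lists main plus the stuck goroutine, each with its state.
	fmt.Print(allStacks())
}
```

In a trace like this, a goroutine that stays parked in the same call across several dumps is the usual place to start looking.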
Here is that branch if you want to test it. I haven't tested it at all yet. https://github.com/purpleidea/mgmt/tree/feat/more-context-changes
Got the same repro on With one vm only, still on
I will do that in a spare moment and re-test that everything works (regardless of the bug here). I see there's a v2, but all the methods used in the resource seem compatible at a glance.
I've made a branch with the hetzner deps bumped: https://github.com/purpleidea/mgmt/tree/feat/bump-hetzner If you're interested in trying that, lmk if you can repro there before I dig too deeply into the trace.
I would also need the full exact mcl code (you can remove a password string and replace it with "hunter2" in your paste) and the full cli of how you ran it. Testing on https://github.com/purpleidea/mgmt/tree/feat/bump-hetzner is preferred, just to avoid me tracking down a hetzner bug that's already been fixed ;)
Sidenote/another thing I've observed: A couple of times upon initial machine creation I got: I only got that once when creating a fresh machine, not when it already exists. If it exists, we go into polling correctly, so I assume we're just hitting some slowness on Hetzner's end.
Reproed the issue on the branch with bumped hetzner. My MCL: And coredump: What is interesting is once I get it into this state, I can do
In this trace: It's not clear that you pressed ^C ... Did you??
Okay, good news, bad news! After digging through the trace, here is the offending part: The bad part is here: So it looks like a bug in hetzner that's not listening to the ctx. I assume so, anyway; I didn't dig into it. I started to, but then I noticed: It turns out that when I bumped to the latest version, I didn't realize it was the latest 1.x version :/ I've now patched this in: https://github.com/purpleidea/mgmt/tree/feat/hetzner-v2 If it passes the tests I'll merge, but do please test and let me know if that fixes the issue. If not, and it doesn't shut down when you press ^C, do another trace and I'll read it again. Thanks for your patience and for your very helpful reporting.
Okay, so AFAICT, this is still a bug in hetzner. Here is the relevant part: I would consider filing this upstream and explaining that when this context is cancelled, the call doesn't exit (under whichever conditions you've created to cause it). If you want to be extremely sure it's this, then I would also add the following to the top of CheckApply in the hetzner resource:

    fmt.Printf("XXX CHECK APPLY BEING CALLED\n")
    go func() {
        select {
        case <-ctx.Done():
            fmt.Printf("XXX CONTEXT CANCELLED IN CHECKAPPLY FOR: %v\n", obj)
        }
    }()

You should see an even number of messages, with the cancel message appearing when you press ^C. If this is not the case, let me know, since then it's our bug (but I doubt it). If it is the case, file a bug upstream and CC me please. Thanks again for your work debugging, and sorry it took me so long; I just got back from .eu
When I create a poll-based resource and it hits a problem reaching the API it relies on, it stops reconciling afterwards.
I first observed this with some API errors on the provider's end, but I could soon reproduce it by turning my network adapter off and on.
I suspect a context isn't passing a timeout somewhere, but I didn't get to dig into the code to find out whether the issue is in the resource or in the polling itself.
Reproduction:
Tested on the latest main branch; I built my binary off of 17082d012f60dd2e7839476690227e83310f0ecf