[deamon] Handle cloud-init intentional shutdown#4695
[deamon] Handle cloud-init intentional shutdown#4695deepakshirkem wants to merge 7 commits intocanonical:mainfrom
Conversation
|
Hi @tobe2098 , I will required your feedback on this solution approach. |
|
Hi, it is a compilation error, unrelated to |
This Change: Detects when VM stops during ssh wait Throws IntentionalShutdown insted of timing out VM remains in stopped state as inteded by cloud-init Fixes canonical#4456
ab66f47 to
cd4916d
Compare
|
Hi @tobe2098 , |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4695 +/- ##
==========================================
- Coverage 87.64% 87.51% -0.13%
==========================================
Files 254 259 +5
Lines 14157 14155 -2
==========================================
- Hits 12407 12386 -21
- Misses 1750 1769 +19 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
tobe2098
left a comment
There was a problem hiding this comment.
The work shows promise, but I think we need to re-formulate the solution. Ideally, wait_* functions would return as if there were a success in the intentional shutdown case, since it is a success. Both wait_* functions were already re-formatted to accomodate restarts, so in the case where a restart would be detected within the try_to_ssh function due to state not being running, we could check if the state is shutdown and the intended state is shutdown as well and return TimeoutAction::Done.
What do you think about this alternative solution?
| if(vm_desc.user_data_config["power_state"]) | ||
| { | ||
| auto ps = vm_desc.user_data_config["power_state"]; | ||
| if(ps["mode"] && ps["mode"].as<std::string>() == "poweroff") | ||
| { | ||
| mpl::log(mpl::Level::error, name, "DETECTED POWEROFF IN CONFIG"); | ||
| vm_desc.expects_shutdown = true; | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Since vm_desc already has access to the data, would it make more sense to just parse the power_state value in-situ? Once you enter the function call that requires the information it could be parsed there for the value, instead of adding an additional field that does not really correspond to "VMDescription", since it is just an option of the contained cloud-init (if we were talking about a ParsedCloudInit object I would agree with the approach).
If the field were in BaseVirtualMachine it would be more convenient, since that virtual class does not have access to the VMDescription. The default could be false, and the field could be set in the constructor of the derived classes, where there is access to the VMDescription. What do you think?
There was a problem hiding this comment.
Yes i think your approach make more sense.
My thinking was that parsing in the daemon would make the implementation work across all backends automatically. I will try implementing it with the VirtualBox backed first and update here.
| if ((state == State::delayed_shutdown && present_state == State::running) || | ||
| state == State::starting) | ||
| if (state == State::starting && present_state == State::stopped) | ||
| { | ||
| mpl::log(mpl::Level::info, | ||
| name.toStdString(), | ||
| "VM stopped during startup (cloud-init poweroff)"); | ||
|
|
||
| state = present_state; | ||
| return state; | ||
| } | ||
|
|
||
| if (state == State::delayed_shutdown && present_state == State::running) | ||
| return state; |
There was a problem hiding this comment.
Is this logic really necessary? Supporting the intentional shutdown should be backend-wide, and this provides no additional functionality. Let me know what your thinking process here was, there are other logic changes here (like what if starting && not stopped?), for which an explanation would make things easier.
There was a problem hiding this comment.
You are right, this does not support any logic. I added this only for debugging purpose while testing, to check the logs. I will remove this changes in the next commit. I will make sure that this type of mistake will not happen again.
| if (state == State::starting && present_state == State::stopped) | ||
| { | ||
| mpl::log(mpl::Level::info, | ||
| name.toStdString(), | ||
| "VM stopped during startup (cloud-init poweroff)"); | ||
|
|
||
| state = present_state; | ||
| return state; |
There was a problem hiding this comment.
This updates the state variable when starting when it used to not be the case. What was your reasoning for this change?
|
Hi @tobe2098, |
tobe2098
left a comment
There was a problem hiding this comment.
We are getting closer @deepakshirkem. It is not complete because it could be the case that the shutdown is detected on waiting for cloud-init, instead of on ssh up. That would be the case if the cloud-init takes long enough in longer cloud-init configurations.
We should also add the logic to the other backends. To avoid the repeated code, the yaml parsing code could be a function as well.
| if (vm_state == State::stopped || vm_state == State::off) | ||
| { | ||
| if (expected_shutdown) { | ||
| mpl::log(mpl::Level::info, vm_name, | ||
| "VM powered off as configured in cloud-init"); | ||
| return utils::TimeoutAction::done; | ||
| } else { | ||
| mpl::log(mpl::Level::error, vm_name, | ||
| "VM stopped unexpectedly"); | ||
| throw StartException(vm_name, "VM stopped unexpectedly"); | ||
| } | ||
| } |
There was a problem hiding this comment.
Maybe the exception could be thrown here instead of after the try_action_for.
There was a problem hiding this comment.
Hi @tobe2098, I tried throwing the exception as you suggested and removed the final state check, but after that I am not able to see the intentional exception or the start exception. I am not sure why this behavior changed. I will dig deeper into it and add logging to investigate.
There was a problem hiding this comment.
Remember that if the shutdown happens during the cloud-init after the ssh_up you are not throwing in that function yet.
There was a problem hiding this comment.
Yeah, you are right. But in my updated code, I am throwing the exception from that function. After some debugging, I think there may be a race condition issue—my vm_state is not updating inside the lambda.
So, can we do the same as in the old code where I added a final check? It introduces some delay, but it gives the expected results.
I wanted your feedback—does my observation about a race condition make sense?
There was a problem hiding this comment.
It makes sense, but if you check the vm_state=current_state(); the state is supposed to be properly updated. I can do more testing on that later on.
There was a problem hiding this comment.
Hi @tobe2098, You're right! I was testing the crash scenario incorrectly. I was stopping the VM before it reached the running state, which is why the state wasn't updating as expected in the lambda. That's why I thought there was a race condition.
Created expects_shutdown_from_cloud_init() helper function. Implemented in all backend constructors and detect_aborted_start. Handles intentional shutdown in wait functions.
|
Hi @tobe2098, I implemented the changes as you suggested and added the helper function. I was confused during testing because, while doing the crash test manually, I stopped the VM before it reached the running state. That’s why it never stopped as expected, and I was not getting the expected behavior. |
Signed-off-by: Deepak Shirke <117824396+deepakshirkem@users.noreply.github.com>
|
Hi @deepakshirkem, if there are conflicts do not merge main into the branch, do |
tobe2098
left a comment
There was a problem hiding this comment.
I made some minor comments. You are doing great work @deepakshirkem!
Additionally, there is something that must be taken care of. In daemon.cpp, wait_for_ssh_up is called in all started/restarted instances, not only the launched ones. When dealing with the IntentionalShutdownException, we must treat it as a StartException whenever it is not a LaunchRequest.
|
|
||
| if ((state == State::delayed_shutdown && present_state == State::running) || | ||
| state == State::starting) | ||
| if (state == State::delayed_shutdown && present_state == State::running) |
There was a problem hiding this comment.
Why is starting not checked now? Did you find something while testing?
There was a problem hiding this comment.
Hi @tobe2098, I have addressed this change.
| auto on_timeout = [] { | ||
| throw std::runtime_error("timed out waiting for initialization to complete"); | ||
| }; | ||
|
|
There was a problem hiding this comment.
Could this newline be removed?
| catch (const SSHExecFailure& e) // transitioning away from catching generic runtime errors | ||
| { // TODO remove once we're confident this is an anomaly | ||
| return mpu::TimeoutAction::retry; | ||
| } | ||
| catch (const std::exception& e) // transitioning away from catching generic runtime errors | ||
| { // TODO remove once we're confident this is an anomaly | ||
| catch (const std::exception& e) | ||
| { |
There was a problem hiding this comment.
This comment was moved accidentally
| if (vm_state == State::stopped || vm_state == State::off) | ||
| { | ||
| if (expected_shutdown) { | ||
| mpl::log(mpl::Level::info, vm_name, | ||
| "VM powered off as configured in cloud-init"); | ||
| return utils::TimeoutAction::done; | ||
| } else { | ||
| mpl::log(mpl::Level::error, vm_name, | ||
| "VM stopped unexpectedly"); | ||
| throw StartException(vm_name, "VM stopped unexpectedly"); | ||
| } | ||
| } |
There was a problem hiding this comment.
It makes sense, but if you check the vm_state=current_state(); the state is supposed to be properly updated. I can do more testing on that later on.
- Restore missing 'starting' state check in VirtualBox current_state - Fix VirtualBox current_state to detect stopped state immediately - Remove duplicate TODO comment in wait_for_cloud_init - Remove extra newline in wait_for_cloud_init - Add if constexpr check to only treat IntentionalShutdownException as success for LaunchRequest, not StartRequest
|
Hi @tobe2098, I addressed all your review comments. Thank You ((: |
573888e to
0000453
Compare
- Remove trailing whitespace (per GIT9) - Add expected_shutdown check to detect_aborted_start - Fixes BaseVM.waitForCloudInitVMDownReconnects test
45d00d1 to
b6219ba
Compare
Signed-off-by: Deepak Shirke <117824396+deepakshirkem@users.noreply.github.com>
Description
What does this PR do?
This PR fixes the timeout issue when cloud-init uses
power_state: mode: poweroffto shutdown the VM. Insted of waiting for the full timeout period, Multipass now:power_stateis configured.Why is this change needed?
Currently, when a user configures cloud-init to shut down the VM after initialization,
multipass launchwaits for SSH and times out after 300 seconds, even though the VM has completed initialization successfully.This creates a poor user experience:
Implementation Details
power_statefield from user-provided cloud-init YAML during VM creationexpects_shutdownflag inVirtualMachineDescriptionand pass to VMwait_until_ssh_up(), Ifexpects_shutdown = true: throwIntentionalShutdownException(success) and Ifexpects_shutdown = false: throwStartException(failure)IntentionalShutdownExceptionand treats launch as successfulRelated Issue(s)
May be Closes #4456
Testing
Manual testing:
Test 1: Intentional shutdown (with power_state)
Result: Launch completes in ~30-40 seconds with success message, VM in Stopped state
Test 2: Normal VM (no power_state)
Result: Launch completes normally, VM in Running state with SSH available
Test 3: Unexpected crash (VM stopped without power_state)
Manually stop VM during launch:
Result: Launch fails with "VM stopped unexpectedly" error
Screenshots (if applicable)
N/A
Checklist
Additional Notes
N/A