Open
Description
How to categorize this issue?
/area robustness
/kind bug
/priority 2
What happened:
Currently MCM doesn't turn CrashLoopBackoff
(CLBF) machines to Failed
as soon as creationTimeout
expires , but delays have been observed which can range to 2min to any time.
There are 2 parts of the problem:
- timeout check for CLBF machine is done after making
CreateMachine()
driver call .CreateMachine()
itself could take any amount of time as it provisions VM on the cloudprovider (in Azure big delays like 30min or more have been seen before) - even if the 1) doesn't exist to contribute to the delay, the retry/re-push to the queue can also introduce the delay of
ShortRetry
(3min). This happens because the retry period is not calculated considering thetime left before timeout
but is just a constant value ofShortRetry
(3min)
What you expected to happen:
Turn to Failed
as soon as timeout expires.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
We could take inspiration from machineDeployment logic which turns Progressing
condition to False
with reason ProgressDeadlineExceeded
as soon as the deadline exceeds.
Environment:
- Kubernetes version (use
kubectl version
): - Cloud provider or hardware configuration:
- Others: