Description
Is this a BUG REPORT or FEATURE REQUEST?: Bug report
What happened: Something very similar to #225.
During a change to an instanceGroup, a RollingUpgrade is created. The RollingUpgrade correctly identifies the affected EC2 instances, attempts to drain the node, and fails with an error. Deleting the RollingUpgrade and restarting the process consistently fails in the same way.
Executing the same drain manually, e.g. kubectl drain <node> --delete-local-data --ignore-daemonsets, works as expected from a local shell.
What you expected to happen: kubectl drain evicts the pods as normal.
How to reproduce it (as minimally and precisely as possible): Not sure how to reproduce it outside of my environment, but it seems to affect certain instance groups consistently.
Anything else we need to know?: The error appears almost immediately, about one second after the drain messages; see the sketch below.
There are no PDBs in use on this cluster, and all pods have terminationGracePeriodSeconds=30.
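The "-1s" in the error below looks suspicious: kubectl's drain package (k8s.io/kubectl/pkg/drain) prints the configured drain timeout in its "global timeout reached" message, and a non-positive timeout makes the global eviction deadline expire immediately. A minimal sketch of that failure mode, assuming the controller delegates to the same drain package (the kubeconfig wiring and node name here are illustrative only):

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/kubectl/pkg/drain"
)

func main() {
	// Illustrative client setup; the controller uses its in-cluster client instead.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}

	helper := &drain.Helper{
		Ctx:                 context.TODO(),
		Client:              kubernetes.NewForConfigOrDie(config),
		IgnoreAllDaemonSets: true, // --ignore-daemonsets
		DeleteLocalData:     true, // --delete-local-data (renamed DeleteEmptyDirData in newer kubectl releases)
		GracePeriodSeconds:  30,
		// Assumption: a non-positive timeout yields an already-expired eviction
		// context, so every eviction fails at once with
		// "global timeout reached: -1s", matching the logs below.
		Timeout: -1 * time.Second,
		Out:     os.Stdout,
		ErrOut:  os.Stderr,
	}

	if err := drain.RunNodeDrain(helper, "ip-172-24-64-136.eu-west-1.compute.internal"); err != nil {
		fmt.Println("drain failed:", err)
	}
}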
Environment:
- rolling-upgrade-controller version 1.0.2
- Kubernetes version: 1.19.12
Other debugging information (if applicable):
- RollingUpgrade status:
$ kubectl describe rollingupgrade <rollingupgrade-name>
- controller logs:
$ kubectl logs <rolling-upgrade-controller pod>
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade scaling group details {"scalingGroup": "ew1p-fcp-k2-xxxxxxxx-sh-r5-2xlarge-eu-west-1a", "desiredInstances": 17, "launchConfig": "", "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade drift detected in scaling group {"driftedInstancesCount/DesiredInstancesCount": "(10/17)", "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade selecting batch for rotation {"batch size": 1, "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade found in-progress instances {"instances": ["i-039af737f93705937"]}
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade rotating batch {"instances": ["i-039af737f93705937"], "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade no InService instances in the batch {"batch": ["i-039af737f93705937"], "instances(InService)": [], "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade waiting for desired nodes {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade desired nodes are ready {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
2021-08-25T05:20:08.849Z INFO controllers.RollingUpgrade draining the node {"instance": "i-039af737f93705937", "node name": "ip-172-24-64-136.eu-west-1.compute.internal", "name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc"}
time="2021-08-25T05:20:09Z" level=debug msg="nodeEventsHandler[update] nodeObj updated, updated in sync mapnodeNameip-172-24-69-198.eu-west-1.compute.internal"
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-tkf65, kube-system/kiam-agent-kwrjp, kube-system/kube-proxy-985s5, kube-system/node-local-dns-8tdlk, logging/default-fluentbit-lnlj6, monitoring/node-exporter-fhtx5
2021-08-25T05:20:09.624Z ERROR controllers.RollingUpgrade failed to rotate the node {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc", "error": "DrainNode failed: instanceID - i-039af737f93705937, error draining node: [error when evicting pods/\"xxxxxxxx-5859f68b5d-npkq9\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-5c5f9478fd-x7kqj\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-b9658cf44-q85r4\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-7b6f9964df-9qzkq\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6bc5d48f5-jjn5l\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6f7c9d9df-jfbj5\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-778fff6dd6-rhc4k\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6599f96bd8-xf2vl\" -n \"xxxxxxxx-customer\": global timeout reached: -1s]", "errorVerbose": "DrainNode failed: instanceID - i-039af737f93705937, error draining node: [error when evicting pods/\"xxxxxxxx-5859f68b5d-npkq9\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-5c5f9478fd-x7kqj\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-b9658cf44-q85r4\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-7b6f9964df-9qzkq\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6bc5d48f5-jjn5l\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6f7c9d9df-jfbj5\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-778fff6dd6-rhc4k\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6599f96bd8-xf2vl\" -n \"xxxxxxxx-customer\": global timeout reached: -1s]\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeContext).ReplaceNodeBatch.func1\n\t/workspace/controllers/upgrade.go:272\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132
github.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeContext).ReplaceNodeBatch
/workspace/controllers/upgrade.go:307
github.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeContext).RotateNodes
/workspace/controllers/upgrade.go:108
github.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeReconciler).Reconcile
/workspace/controllers/rollingupgrade_controller.go:172
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:99
evicting pod xxxxxxxx-customer/xxxxxxxx-5859f68b5d-npkq9
evicting pod xxxxxxxx-customer/xxxxxxxx-5c5f9478fd-x7kqj
evicting pod xxxxxxxx-customer/xxxxxxxx-b9658cf44-q85r4
evicting pod xxxxxxxx-customer/xxxxxxxx-7b6f9964df-9qzkq
evicting pod xxxxxxxx-customer/xxxxxxxx-6bc5d48f5-jjn5l
evicting pod xxxxxxxx-customer/xxxxxxxx-6f7c9d9df-jfbj5
evicting pod xxxxxxxx-customer/xxxxxxxx-778fff6dd6-rhc4k
evicting pod xxxxxxxx-customer/xxxxxxxx-6599f96bd8-xf2vl
2021-08-25T05:20:09.633Z ERROR controller-runtime.manager.controller.rollingupgrade Reconciler error {"reconciler group": "upgrademgr.keikoproj.io", "reconciler kind": "RollingUpgrade", "name": "rollup-nodes-0cc26b4a29182e2bc", "namespace": "instance-manager", "error": "DrainNode failed: instanceID - i-039af737f93705937, error draining node: [error when evicting pods/\"xxxxxxxx-5859f68b5d-npkq9\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-5c5f9478fd-x7kqj\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-b9658cf44-q85r4\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-7b6f9964df-9qzkq\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6bc5d48f5-jjn5l\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6f7c9d9df-jfbj5\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-778fff6dd6-rhc4k\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6599f96bd8-xf2vl\" -n \"xxxxxxxx-customer\": global timeout reached: -1s]", "errorVerbose": "DrainNode failed: instanceID - i-039af737f93705937, error draining node: [error when evicting pods/\"xxxxxxxx-5859f68b5d-npkq9\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-5c5f9478fd-x7kqj\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-b9658cf44-q85r4\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-7b6f9964df-9qzkq\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6bc5d48f5-jjn5l\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6f7c9d9df-jfbj5\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-778fff6dd6-rhc4k\" -n \"xxxxxxxx-customer\": global timeout reached: -1s, error when evicting pods/\"xxxxxxxx-6599f96bd8-xf2vl\" -n \"xxxxxxxx-customer\": global timeout reached: -1s]\ngithub.com/keikoproj/upgrade-manager/controllers.(*RollingUpgradeContext).ReplaceNodeBatch.func1\n\t/workspace/controllers/upgrade.go:272\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:267
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:99
2021-08-25T05:20:09.633Z INFO controllers.RollingUpgrade ***Reconciling***
2021-08-25T05:20:09.633Z INFO controllers.RollingUpgrade rolling upgrade ended {"name": "instance-manager/rollup-nodes-0cc26b4a29182e2bc", "status": "error"}
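If a misconfigured drain timeout is the culprit, it should be visible on the RollingUpgrade resource itself; the field path below is assumed from the RollingUpgrade CRD (a strategy-level drainTimeout) and may differ by version:
$ kubectl get rollingupgrade rollup-nodes-0cc26b4a29182e2bc -n instance-manager -o jsonpath='{.spec.strategy.drainTimeout}'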