Can't schedule workloads after upgrade of Kubernetes from 1.12.7 to 1.12.8

June 7, 2019 535 views
Kubernetes Debian 9

Fortunately I don't have mission-critical services running on my cluster, and now I likely never will :)

I started the automatic upgrade of my DO-managed Kubernetes cluster from 1.12.7 to 1.12.8. After the upgrade of the master control plane went through, I expected the worker nodes to be upgraded as well; however, the process somehow got stuck and the nodes are not being upgraded.

So currently I can no longer schedule new workloads: all new pods go into the ContainerCreating state and get stuck there.

I tried to resize the node pool, which caused a new droplet to be spun up with the up-to-date Kubernetes version from DO (Debian do-kube-1.12.8-do.4). Using kubectl get nodes I can see the old nodes still in status Ready, while the new node (even after around 30 minutes) still shows up as NotReady and its latest event is Kubelet starting.
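In case it helps with debugging, these are the kinds of checks I have been running against the stuck node (the node name is just a placeholder for whatever kubectl get nodes reports):

# show why the node is NotReady (see the Conditions and Events sections)
kubectl describe node <new-node-name>

# check whether the system pods (cilium, csi-do-node, do-node-agent, kube-proxy) came up on it
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<new-node-name>

# recent cluster events, oldest first
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp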

In addition, all the old but still-running droplets in the node pool no longer report metrics to the DO web interface.

Any idea what I can do? If nothing works I will probably have to tear down the whole cluster and set it up from scratch again. To be honest, this is very worrying to me.

5 Answers
snormoredo June 7, 2019
Accepted Answer

Hi there,

I’m the Engineering Manager on the Kubernetes team at DO, and I wanted to follow up on Ethan’s previous posts as we’ve continued to work through the resolution of this issue.

A recent update to our auto-upgrade process introduced drift in our upgrade logic that resulted in the failed upgrades we’ve been discussing here. Our team has identified the problem and is in the process of rolling out a fix. The data and workloads in your clusters will remain intact. Affected clusters that are within their maintenance window will resume and complete the upgrade process when the fix is rolled out. If your cluster is now outside of its scheduled maintenance window, you can recycle your worker nodes in the cloud panel to trigger and resume their upgrade.
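As a general note, if you are recycling nodes by hand, it can help to cordon and drain each node first so workloads are moved off cleanly before the node is replaced (the node name below is a placeholder; these are standard kubectl commands, nothing DO-specific):

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-local-data

After the drain completes, recycle that node from the cloud panel and wait for its replacement to report Ready before moving on to the next one.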

Unfortunately, our testing pipeline did not catch this issue. As is common procedure at DO, we will be following up this incident with an internal retrospective that will help us evolve the system and testing pipeline to be more resilient going forward. We understand that you place your trust in our platform and sincerely apologize for any trouble this has caused. Our system and processes will get better as a result of this.

Best,
Steven Normore
Engineering Manager, DOKS

Can't you just recycle the nodes?

  • I just tried it out. The existing node was replaced and now I have one more node in state NotReady and one less in state Ready 😔

I’m facing the same problem: a K8s 13.3 → 13.5 upgrade. The master upgraded, but the workers have been stuck upgrading forever.

Actually, it looks like the cilium pod is stuck in the PodInitializing phase because of the node-ready init container, which is executing this:

while !( /hyperkube kubectl get node ${K8S_NODE_NAME} -o yaml | grep InternalIP ); do sleep 1; done; exit 0
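For reference, whether the node object actually has an InternalIP can be checked from outside the pod like this (the node name is a placeholder):

kubectl get node <node-name> -o jsonpath='{.status.addresses}'

In my case the loop never gets that far, because the kubectl call inside the init container can't even reach the API server (see the edits below).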

Also the csi-do-node and do-node-agent pods are not getting scheduled on the new nodes.

Edit:
The node-ready pod is reporting:
Unable to connect to the server: dial tcp <ip>:443: i/o timeout
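To rule out anything pod-specific, the same endpoint can be tested directly from one of the worker droplets over SSH (the IP being the redacted API server address from the message above; curl just needs to show whether port 443 is reachable at all):

curl -vk --connect-timeout 5 https://<ip>:443/healthz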

Edit 2:
The main reason here seems to be that kube-proxy can't connect to the master server because the IP address is unreachable. The kube-proxy logs show this (sensitive data changed):

E0607 15:27:08.534659       1 reflector.go:125] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Get https://{cluster-id}.internal.k8s.ondigitalocean.com/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup {cluster-id}.internal.k8s.ondigitalocean.com on xxx.xxx.xxx.xxx:53: no such host
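That lookup failure can be reproduced outside of kube-proxy, e.g. straight from one of the worker droplets over SSH ({cluster-id} being the redacted cluster ID from the log above, and assuming dnsutils is installed on the droplet):

nslookup {cluster-id}.internal.k8s.ondigitalocean.com
dig +short {cluster-id}.internal.k8s.ondigitalocean.com

If those fail the same way, the problem is DNS resolution of the managed control plane endpoint rather than anything inside the cluster itself.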

Hey there,

Our engineering team is investigating this on the backend; once we have a more concrete update we'll let you all know here!

Regards,
Ethan Fox
Developer Support Engineer II - DigitalOcean

  • 👋 Good to hear that you are working on it. I hope this can be recovered quickly

  • Whatever you did, my cluster is working again. The nodes are not being upgraded automatically but I can recycle them manually now.
