Question

Can't schedule workloads after upgrade of Kubernetes from 1.12.7 to 1.12.8

Fortunately I don’t have mission-critical services running on this cluster, and now I likely never will :)

I started the automatic upgrade of my DO-managed Kubernetes cluster from 1.12.7 to 1.12.8. After the upgrade of the master control plane went through, I expected the worker nodes to be upgraded as well; however, the process somehow got stuck and the nodes are not being upgraded.

So currently I can no longer schedule new workloads: all new pods go into the ContainerCreating state and get stuck there.
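
For reference, this is roughly how I’ve been inspecting the stuck pods (pod name and namespace are placeholders, substitute your own):

kubectl describe pod <pod-name> -n <namespace>    # the Events section shows why the container never starts
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp    # recent cluster-wide events, oldest first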

I tried to resize the node pool, which caused a new droplet to be spun up with the up-to-date Kubernetes version from DO (Debian do-kube-1.12.8-do.4). Using kubectl get nodes I can see the old nodes still in status Ready, while the new node (even after around 30 minutes) still shows up as NotReady, with "Kubelet starting" as its latest event.
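
To see why the new node stays NotReady, I’ve been checking its conditions and events like this (node name is a placeholder):

kubectl get nodes -o wide
kubectl describe node <node-name>    # look at the Conditions and Events sections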

In addition, all of the old but still-running droplets in the node pool no longer report metrics to the DO web interface.

Any idea what I can do? If nothing works, I will probably have to tear down the whole cluster and set it up from scratch. To be honest, this is very worrying to me.



Steven Normore
DigitalOcean Employee
June 7, 2019
Accepted Answer

Hi there,

I’m the Engineering Manager on the Kubernetes team at DO, and I wanted to follow up on Ethan’s previous posts as we’ve continued to work through the resolution of this issue.

A recent update to our auto-upgrade process introduced drift in our upgrade logic that resulted in the failed upgrades we’ve been discussing here. Our team has identified the problem and is in the process of rolling out a fix. The data and workloads in your clusters will remain intact. Affected clusters that are within their maintenance window will resume and complete the upgrade process when the fix is rolled out. If your cluster is now outside of its scheduled maintenance window, you can recycle your worker nodes in the cloud panel to trigger and resume their upgrade.
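
If you want workloads to reschedule cleanly before a node is recycled, you can drain it first; a minimal sketch, with the node name as a placeholder:

kubectl drain <node-name> --ignore-daemonsets --delete-local-data
kubectl uncordon <node-name>    # after the recycled node rejoins and reports Ready, allow scheduling again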

Unfortunately, our testing pipeline did not catch this issue. As is common procedure at DO, we will be following up on this incident with an internal retrospective that will help us evolve the system and testing pipeline to be more resilient going forward. We understand that you place your trust in our platform and sincerely apologize for any trouble this has caused. Our system and processes will get better as a result of this.

Best,
Steven Normore
Engineering Manager, DOKS

efox
DigitalOcean Employee
June 7, 2019

Hey there,

Our engineering team is investigating this on the backend; once we have a more concrete update we’ll let you all know here!

Regards,
Ethan Fox
Developer Support Engineer II - DigitalOcean

Actually, it looks like the cilium pod is stuck in the PodInitializing phase on its node-ready init container, which waits until the node object reports an InternalIP by executing this:

while !( /hyperkube kubectl get node ${K8S_NODE_NAME} -o yaml | grep InternalIP ); do sleep 1; done; exit 0

Also, the csi-do-node and do-node-agent pods are not getting scheduled on the new nodes.

Edit: the node-ready pod is reporting Unable to connect to the server: dial tcp <ip>:443: i/o timeout
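
A quick way to reproduce the timeout outside of kubectl is to hit the API endpoint directly from a node or pod; a sketch, with the endpoint placeholder matching the logs below:

curl -kv https://{cluster-id}.internal.k8s.ondigitalocean.com/healthz    # -k skips cert verification, -v shows where the connection stalls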

Edit 2: the main reason here seems to be that kube-proxy can’t connect to the master server: the lookup of the cluster’s API hostname fails. The kube-proxy logs show this (sensitive data changed):

E0607 15:27:08.534659       1 reflector.go:125] k8s.io/client-go/informers/factory.go:131: Failed to list *v1.Endpoints: Get https://{cluster-id}.internal.k8s.ondigitalocean.com/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup {cluster-id}.internal.k8s.ondigitalocean.com on xxx.xxx.xxx.xxx:53: no such host
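
Since "no such host" points at DNS rather than routing, a check along these lines shows whether the node’s resolver knows the API hostname at all (the resolver address is the xxx.xxx.xxx.xxx:53 from the log above):

nslookup {cluster-id}.internal.k8s.ondigitalocean.com
dig @xxx.xxx.xxx.xxx {cluster-id}.internal.k8s.ondigitalocean.com +short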
