Fortunately I don’t have mission-critical services running on my cluster, and now I likely never will :)
I started the automatic upgrade of my DO-managed Kubernetes cluster from 1.12.7 to 1.12.8. After the upgrade of the master control plane went through I expected the worker nodes to be upgraded as well, but the process somehow got stuck and the nodes are not being upgraded.
As a result I can currently no longer schedule new workloads: all new pods go into the `ContainerCreating` state and stay stuck there.
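For anyone debugging the same thing, this is roughly how I’ve been inspecting the stuck pods (the pod name and namespace are placeholders):

```bash
# List pods stuck in ContainerCreating (the pod phase stays Pending while the container is being created)
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Describe one of the stuck pods to see the events explaining why the container never starts
kubectl describe pod <pod-name> -n <namespace>

# Recent cluster-wide events, newest last
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp
```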
I tried to resize the node pool, which caused a new droplet to be spun up with the up-to-date Kubernetes version from DO (Debian do-kube-1.12.8-do.4). Using `kubectl get nodes` I can see that the old nodes are still in status `Ready`, while the new node (even after around 30 minutes) still shows up as `NotReady` and its latest event is `Kubelet starting`.
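For reference, this is roughly how I’m checking the new node (the node name is a placeholder):

```bash
# Compare the old and new nodes, including kubelet version and internal IPs
kubectl get nodes -o wide

# Conditions and events of the new node (this is where the "Kubelet starting" event shows up)
kubectl describe node <new-node-name>
```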
In addition, all the old but still-running droplets in the node pool no longer report metrics to the DO web interface.
Any idea what I can do? If nothing works I will probably have to tear down the whole cluster and set it up from scratch again. To be honest, this is very worrying to me.
Hi there,
I’m the Engineering Manager on the Kubernetes team at DO and I wanted to follow up on Ethan’s previous posts as we’ve continued to work through the resolution of this issue.
A recent update to our auto-upgrade process introduced drift in our upgrade logic that resulted in the failed upgrades we’ve been discussing here. Our team has identified the problem and is in the process of rolling out a fix. The data and workloads in your clusters will remain intact. Affected clusters that are within their maintenance window will resume and complete the upgrade process when the fix is rolled out. If your cluster is now outside of its scheduled maintenance window, you can recycle your worker nodes in the cloud panel to trigger and resume their upgrade.
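If you would rather prepare a node from the command line before recycling it in the cloud panel, a minimal sketch (the node name is a placeholder):

```bash
# Prevent new pods from being scheduled onto the node
kubectl cordon <node-name>

# Evict running pods so they are rescheduled elsewhere before the node is recycled
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
```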
Unfortunately, our testing pipeline did not catch this issue. As is common procedure at DO, we will be following up this incident with an internal retrospective that will help us evolve the system and testing pipeline to be more resilient going forward. We understand that you place your trust in our platform and sincerely apologize for any trouble this has caused. Our system and processes will get better as a result of this.
Best,
Steven Normore
Engineering Manager, DOKS
Hey there,
Our engineering team is investigating this on the backend; once we have a more concrete update we’ll let you all know here!
Regards,
Ethan Fox
Developer Support Engineer II - DigitalOcean
Actually it looks like the cilium pod is stuck in the PodInitializing phase for the `node-ready` init pod, which is executing this:

Also the `csi-do-node` and `do-node-agent` pods are not getting scheduled on the new nodes.

Edit: the `node-ready` pod is reporting `Unable to connect to the server: dial tcp <ip>:443: i/o timeout`
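For reference, this is roughly how I’m pulling the logs out of the stuck init container (the pod name is a placeholder):

```bash
# Find the cilium pod that landed on the new node
kubectl -n kube-system get pods -o wide | grep cilium

# Logs of the node-ready init container that is stuck in PodInitializing
kubectl -n kube-system logs <cilium-pod-name> -c node-ready

# Pod events, e.g. image pulls and init container restarts
kubectl -n kube-system describe pod <cilium-pod-name>
```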
Edit2: the main reason here seems to be that kube-proxy can’t connect to the master server because the IP address is unreachable. The logs of kube-proxy show this (sensitive data changed):
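In case it helps, a rough sketch of how I was checking connectivity from the affected nodes (assuming the standard `k8s-app=kube-proxy` label; the master IP is a placeholder):

```bash
# Tail the kube-proxy logs across the node pool
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=50

# From an affected droplet (via SSH), check whether the API endpoint is reachable at all;
# any HTTP response, even 401/403, means the network path to the master works
curl -kv https://<master-ip>:443/healthz
```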