Managed Kubernetes not working - cilium can't connect to etcd

March 7, 2019 785 views
Kubernetes

Hi guys,

I already opened a support ticket, but I still haven't received a reply after 3 days, so I wanted to try here too.

I use the managed Kubernetes service with Rancher and had it running smoothly. Then on Monday morning, the cluster suddenly stopped reporting to Rancher and the deployed websites stopped working. I checked all pods and saw that the Cilium pods are restarting constantly and most other pods are stuck in ContainerCreating.
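In case it helps anyone else, this is roughly how I looked at the pod state (the pod names are placeholders; the exact names will differ on your cluster):

    # List all pods with their restart counts across namespaces
    kubectl get pods --all-namespaces -o wide
    # Inspect one of the stuck pods; the Events section at the end
    # usually shows why it is stuck in ContainerCreating
    kubectl -n kube-system describe pod <stuck-pod-name>
    # Logs of the previous (crashed) container of a restarting Cilium pod
    kubectl -n kube-system logs <cilium-pod-name> --previous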

It seems like the Cilium pods can't reach the etcd node anymore. This is the log of one Cilium pod: https://gist.github.com/DTrierweiler/f2eecb5568fdf899695cb6f644318ffb
I even downloaded the certs from the secret and tried to connect to etcd from my local machine with curl, which worked without problems.
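For reference, the connectivity test looked roughly like this (the etcd endpoint, port, and certificate file names are placeholders for the values from my cluster's secret):

    # Health check against etcd using the client certs extracted from the secret
    curl --cacert ca.crt \
         --cert client.crt \
         --key client.key \
         https://<etcd-endpoint>:2379/health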

Could this be related to DNS problems? The two CoreDNS pods are not running either; they are also stuck in ContainerCreating.

Thanks a lot for your help.
Besides this, is it normal for support to take so much time? I have had an unusable cluster for 4 days now, which costs me $200 per month, and my websites are not running. Luckily this is still only staging and not production.

Cheers,
Daniel

2 Answers
jarland MOD March 7, 2019
Accepted Answer

Hey friend,

Per Nicholas from our support team:

We have seen a similar report of pods stuck in a ContainerCreating state and there might be a Cilium dependency issue; if you run kubectl -n kube-system edit ds cilium, what is dnsPolicy set to? If you change that to “ClusterFirst” or “Default”, does that resolve the issue?
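For reference, a non-interactive way to check and change that setting would look roughly like this (assuming the DaemonSet is called cilium in kube-system, as in the command above):

    # Show the current dnsPolicy of the Cilium DaemonSet's pod template
    kubectl -n kube-system get ds cilium \
        -o jsonpath='{.spec.template.spec.dnsPolicy}'
    # Switch it to ClusterFirst (kubectl edit works just as well);
    # with the default update strategy this rolls the Cilium pods
    kubectl -n kube-system patch ds cilium --type merge \
        -p '{"spec":{"template":{"spec":{"dnsPolicy":"ClusterFirst"}}}}'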

I also wanted to quickly address this question:

is it normal for the support to take so much time?

It varies a bit. Our intention is to provide you with everything you need to troubleshoot and repair problems from your side, without having to wait for a response from our team. On the rare occasion that you cannot resolve an issue on your side and our intervention is required, a wait like this is obviously unacceptable, and it is something we are working very hard on improving. By continually exposing customers to the right information up front, and getting better at providing a clear user experience as we go, we hope to see more customers empowered to solve problems themselves, so that we can be more available on the rare occasions when you absolutely need us.

Jarland

Hey Jarland,

Thank you so much. The dnsPolicy was set to ClusterFirstWithHostNet, and changing it to ClusterFirst did the trick. It's running again.

Do you know why it was set to ClusterFirstWithHostNet and why it stopped working from one moment to the next?
The documentation says this value should only be used when the pod runs with hostNetwork: true, which is not the case here.
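For anyone who wants to check this on their own cluster, a quick sketch (again assuming the DaemonSet is named cilium in kube-system):

    # Prints "true" only if the Cilium pods actually share the host's network;
    # ClusterFirstWithHostNet is only intended for pods with hostNetwork: true
    kubectl -n kube-system get ds cilium \
        -o jsonpath='{.spec.template.spec.hostNetwork}'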

I think you do a good job of providing a lot of information to fix and repair problems, but it all comes down to those rare occasions you mentioned (like in this case). I'm still not sure why the ticket went unanswered for 3 days. Correct me if I'm wrong, but I thought the main benefit of a managed Kubernetes service is not having to worry about exactly this kind of problem.

Anyway - thanks a lot for the reply. Maybe it’ll help someone else as well :)

Cheers
