Question

Kubernetes cluster outages

Posted April 29, 2019 1.2k views
DigitalOcean Kubernetes

Over the last day (and on several previous days in recent weeks) we experienced several outages of our Kubernetes cluster, which impacted thousands of our clients in our PRODUCTION environment.

The situation was monitored by our tech team, which noticed that containers were repeatedly restarted or terminated without our intervention. Our services became unavailable, the cluster's Droplets were removed from the LoadBalancer (we had to re-add them manually over and over), and we were completely unable to get logs from any running container (kubectl logs container_name) - the following error was returned:

Error from server: Get https://10.133.4.193:10250/containerLogs/OUR_NAMESPACE/OUR_POD_NAME-854bf7bc4f-vbxn6/gateway: net/http: TLS handshake timeout

Executing commands inside containers did not work either (kubectl exec -it container_name sh) - the command just hung.

When we tried to access services in the cluster (websites, webapps, etc.) from outside, we noticed that requests were not being forwarded from the LoadBalancer to the cluster / Droplets.

Just adding - we did not change any certificates, nor make any other configuration changes. Our apps had been running without any problems for several days.

We kindly ask you to investigate this issue, or to provide a statement on whether there were any problems with the Kubernetes service, network infrastructure, etc. Thanks.


2 answers

Greetings!

I’m sorry that I didn’t reply here earlier. I want you to know that we saw this and began discussing it, but I didn’t have anything solid to report until now. This cluster should now be healthy, after several hours of discussion between our engineers. Credit to John K and Nan Z for resolving this, I just wanted to share the news.

Should anyone else find themselves in a similar situation, please don’t hesitate to write to our support team here:
https://www.digitalocean.com/company/contact/#support

Jarland

  • Hello,

    just adding that these issues continue to happen. Tonight (CEST) we had a major outage which impacted all our applications on the cluster. One by one, all the apps became unavailable, the LoadBalancer crashed, etc. - all the symptoms I described earlier. The cluster is not healthy at all. These issues keep coming back.

    I’m really thinking of moving back to Google Cloud, even if it is more expensive. We can’t afford this kind of instability; our clients are starting to get nervous.

Same behavior on a cluster: the control plane becomes unavailable and all nodes under the Load Balancer show “Down”, but the containers keep working and processing jobs.

doctl kubernetes cluster list
ID         Name    Region    Version        Auto Upgrade    Status      Node Pools
some-id    k8s     ams3      1.16.2-do.1    false           degraded    k8s-std k8s-cpu
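In case it helps others triage the same symptoms, here is a minimal sketch of how to separate a control-plane failure from a node failure, assuming doctl is authenticated and kubectl points at the affected cluster. The --format column names and the awk line are my own assumptions, not anything DigitalOcean support prescribed:

```shell
# What DigitalOcean's side reports about the cluster (assumes doctl is authenticated)
doctl kubernetes cluster list --format Name,Status

# If the API server itself is unhealthy, this is where a TLS handshake
# timeout or a hang shows up, even while containers keep running
kubectl get nodes -o wide

# Count nodes whose STATUS column (column 2 of the default output) is
# not "Ready"; 0 means the nodes themselves look fine and the problem
# is more likely the control plane or the LoadBalancer health checks
kubectl get nodes --no-headers | awk '$2 != "Ready" {n++} END {print n+0}'
```

If the node count looks healthy but kubectl itself times out, that matches what we saw: workloads alive, control plane unreachable.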
  • Exactly the same problem here.
    id generated_name fra1 1.16.2-do.1 false degraded pool-di4ey24ew

    My applications in containers seem to be working well, but I no longer have access to the cluster. I had the same issue yesterday, but I assumed I was trying to push too many apps onto the cluster; the real reason seems to be instability of the cluster itself, from an infrastructure point of view.
