Kubernetes cluster outages
Over the last day (and on several days in the previous weeks) we experienced several outages of our Kubernetes cluster, which affected thousands of our clients in the PRODUCTION environment.
Our tech team monitored the situation and noticed that containers were repeatedly restarted or terminated without our intervention. Our services became unavailable, the cluster's droplets were removed from the LoadBalancer (we had to add them back manually over and over again), and we were completely unable to get logs from any running container (kubectl logs container_name). The following error was returned:
Error from server: Get https://10.133.4.193:10250/containerLogs/OUR_NAMESPACE/OUR_POD_NAME-854bf7bc4f-vbxn6/gateway: net/http: TLS handshake timeout
Executing commands inside containers did not work either (kubectl exec -it container_name sh); the command just hung.
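For anyone trying to narrow down the same symptom: the error above points at port 10250 on a node, which is the kubelet port that kubectl logs and kubectl exec are proxied to. Below is a minimal sketch, not an official diagnostic, that pulls the node address out of such an error line and checks whether the failure happens at the TCP connect stage or during the TLS handshake. The function names parse_kubelet_error and probe are made up for illustration, and the probe has to be run from a machine that can reach the node's private IP (an assumption about your network setup).

```python
import re
import socket
import ssl
from urllib.parse import urlparse


def parse_kubelet_error(message):
    """Extract the node endpoint and pod details from a kubectl error line.

    Hypothetical helper: it assumes the URL has the shape
    https://HOST:PORT/containerLogs/NAMESPACE/POD/CONTAINER, as in the
    error quoted above. The URL may or may not be wrapped in quotes.
    """
    match = re.search(r'Get "?(https://[^\s"]+?)"?:\s', message)
    if match is None:
        raise ValueError("no kubelet URL found in message")
    url = urlparse(match.group(1))
    parts = url.path.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError("unexpected URL path: " + url.path)
    endpoint, namespace, pod, container = parts
    return {
        "host": url.hostname,
        "port": url.port,
        "endpoint": endpoint,
        "namespace": namespace,
        "pod": pod,
        "container": container,
    }


def probe(host, port, timeout=5.0):
    """Return 'tcp_failed', 'tls_failed' or 'ok' for the given endpoint."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, unreachable, or timed out before TLS started.
        return "tcp_failed"
    # Certificate verification is disabled on purpose: this only checks
    # that a handshake completes in time, not that the peer is trusted.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with context.wrap_socket(sock, server_hostname=host):
            return "ok"
    except OSError:
        # ssl.SSLError and socket timeouts are both subclasses of OSError.
        return "tls_failed"
    finally:
        sock.close()
```

Calling parse_kubelet_error on the error line and then probe(info["host"], info["port"]) gives a rough hint: "tcp_failed" suggests the node or its network is unreachable, while "tls_failed" suggests the port accepts connections but the kubelet is too slow or overloaded to finish the handshake. Neither result is conclusive on its own.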
When we tried to access services in the cluster (websites, web apps, etc.) from outside, we noticed that requests were not forwarded from the LoadBalancer to the cluster / droplets.
To add: we did not change any certificates or any other configuration. Our apps had been running without any problems for several days.
We kindly ask you to investigate the issue, or to provide a statement on whether there were any problems with the Kubernetes service, network infrastructure, etc. Thanks.