Question

Kubernetes apiserver down, causing network issues.

Posted September 30, 2019 3.8k views
Kubernetes

We’re experiencing some issues in our cluster where our network seems unstable. We’ve noticed deploys failing occasionally because application pods could not connect to dependent services (e.g. the database). After some poking around, I found that Cilium Operator and CoreDNS show a high number of restarts. This seems to be because they could not reach the kube-apiserver (or because etcd is down; I’m not sure).
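
For context, the restart counts and recent events can be checked with something like the following (assuming kubectl access to the cluster; Cilium and CoreDNS run in kube-system):

# Restart counts for the system pods (Cilium Operator, CoreDNS, etc.)
kubectl -n kube-system get pods -o wide

# Recent events, which often show why pods were restarted
kubectl -n kube-system get events --sort-by=.lastTimestamp | tail -n 20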

Logs for Cilium Operator

level=info msg="Cilium Operator " subsys=cilium-operator
level=info msg="Starting apiserver on address :9234" subsys=cilium-operator
level=info msg="Connecting to etcd server..." config=/var/lib/etcd-config/etcd.config endpoints="[https://0a18c093-ee32-45d2-a8a6-d630a6242716.internal.k8s.ondigitalocean.com:2379]" subsys=kvstore
level=info msg="Establishing connection to apiserver" host="https://10.245.0.1:443" subsys=k8s
level=info msg="Establishing connection to apiserver" host="https://10.245.0.1:443" subsys=k8s
level=warning msg="Health check status" error="Not able to connect to any etcd endpoints" subsys=cilium-operator
level=error msg="Unable to contact k8s api-server" error="Get https://10.245.0.1:443/api/v1/componentstatuses/controller-manager: dial tcp 10.245.0.1:443: i/o timeout" ipAddr="https://10.245.0.1:443" subsys=k8s
level=fatal msg="Unable to connect to Kubernetes apiserver" error="unable to create k8s client: unable to create k8s client: Get https://10.245.0.1:443/api/v1/componentstatuses/controller-manager: dial tcp 10.245.0.1:443: i/o timeout" subsys=cilium-operator
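
The i/o timeout above is against the in-cluster apiserver ClusterIP (10.245.0.1:443). A rough way to check whether that address is reachable from inside the cluster is a throwaway pod; the pod name and image here are just examples:

# One-off pod that curls the apiserver ClusterIP taken from the logs above.
# Even a 401/403 response would prove the endpoint is reachable; a timeout would not.
kubectl run api-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -k --max-time 5 https://10.245.0.1:443/version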

Logs for CoreDNS

.:53
2019-09-30T09:06:13.965Z [INFO] CoreDNS-1.3.1
2019-09-30T09:06:13.965Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2019-09-30T09:06:13.965Z [INFO] plugin/reload: Running configuration MD5 = 2e2180a5eeb3ebf92a5100ab081a6381
E0930 09:06:49.312873       1 reflector.go:251] github.com/coredns/coredns/plugin/kubernetes/controller.go:317: Failed to watch *v1.Endpoints: Get https://10.245.0.1:443/api/v1/endpoints?resourceVersion=2667502&timeout=6m0s&timeoutSeconds=360&watch=true: dial tcp 10.245.0.1:443: connect: connection refused
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-9d6bf9876-t4jlt.unknownuser.log.ERROR.20190930-090649.1: no such file or directory
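
Interestingly, CoreDNS gets "connection refused" rather than a timeout, which looks more like the apiserver process itself being down or restarting than a network path problem. For a quick in-cluster DNS sanity check, the pattern from the upstream DNS debugging docs can be used (busybox:1.28, since nslookup in newer busybox images is unreliable):

# Throwaway pod that resolves the in-cluster apiserver service name via CoreDNS
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 -- \
  nslookup kubernetes.default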

Any ideas what might be causing these issues? I’m not sure where to look. As I understand it, DO manages the master node running etcd and the Kubernetes API server, so what might we be doing that causes those components to fail?
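
For what it’s worth, the closest I can get to checking the managed control plane from our side is querying it through the apiserver itself, e.g.:

# The same endpoint the Cilium Operator was polling in the logs above
kubectl get componentstatuses

# Raw health endpoint of the apiserver
kubectl get --raw='/healthz'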

4 answers

Hi there,

Please open a support ticket so that we can look further into the cluster on your account.

Regards,

John Kwiatkoski

Running into a similar issue at the moment. Is it worth raising another ticket or has #02979561 been investigated/resolved already?

Hitting the same issue after updating the cluster to 1.16:

level=warning msg="Health check status" error="not able to connect to any etcd endpoints" subsys=cilium-operator
{"level":"warn","ts":"2019-12-04T21:49:01.258Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://eece3570-12b4-40fa-8f9b-3a2b417d9cf9.internal.k8s.ondigitalocean.com:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadli
ne exceeded"}
level=warning msg="Health check status" error="not able to connect to any etcd endpoints" subsys=cilium-operator
level=fatal msg="Unable to start status api: http: Server closed" subsys=cilium-operator

Same thing here after a 1.16 upgrade. Did you guys get a solution?

level=info msg="Connecting to kvstore..." address= kvstore=etcd subsys=cilium-operator
level=info msg="Connecting to etcd server..." config=/var/lib/etcd-config/etcd.config endpoints="[https://b0709a0f-c4b9-40d7-a65e-182eabbb3f1a.internal.k8s.ondigitalocean.com:2379]" subsys=kvstore
level=info msg="Starting to synchronize k8s nodes to kvstore..." subsys=cilium-operator
{"level":"warn","ts":"2020-01-08T19:36:34.408Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://b0709a0f-c4b9-40d7-a65e-182eabbb3f1a.internal.k8s.ondigitalocean.com:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-01-08T19:36:49.409Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://b0709a0f-c4b9-40d7-a65e-182eabbb3f1a.internal.k8s.ondigitalocean.com:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-01-08T19:37:04.411Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://b0709a0f-c4b9-40d7-a65e-182eabbb3f1a.internal.k8s.ondigitalocean.com:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-01-08T19:37:19.412Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://b0709a0f-c4b9-40d7-a65e-182eabbb3f1a.internal.k8s.ondigitalocean.com:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
level=warning msg="Health check status" error="not able to connect to any etcd endpoints" subsys=cilium-operator
{"level":"warn","ts":"2020-01-08T19:37:34.413Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://b0709a0f-c4b9-40d7-a65e-182eabbb3f1a.internal.k8s.ondigitalocean.com:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
level=warning msg="Health check status" error="not able to connect to any etcd endpoints" subsys=cilium-operator
{"level":"warn","ts":"2020-01-08T19:37:49.414Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"passthrough:///https://b0709a0f-c4b9-40d7-a65e-182eabbb3f1a.internal.k8s.ondigitalocean.com:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
level=warning msg="Health check status" error="not able to connect to any etcd endpoints" subsys=cilium-operator
level=fatal msg="Unable to start status api: http: Server closed" subsys=cilium-operator
  • After we contacted support, they increased the resources allocated to our master node. We had to do that a couple of times, but things have been pretty stable for us since.
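
In case it recurs, a crude watchdog along these lines (just a sketch) records a timestamp whenever the apiserver health check fails, which gives concrete times to hand to support:

# Poll the apiserver health endpoint every 30s and log failures with a UTC timestamp
while true; do
  kubectl get --raw='/healthz' >/dev/null 2>&1 || echo "$(date -u +%FT%TZ) apiserver unreachable"
  sleep 30
done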
