Using a DO managed Kubernetes cluster with the helm chart stable/prometheus results in some node_exporters being unreachable.

December 7, 2018 1.5k views
Kubernetes

I have three nodes in the cluster. The Prometheus pods (which include the server, alertmanager, node_exporter, etc.) start just fine. Unfortunately, 2 of the 3 node_exporters cannot be reached. This seems like it must be an issue with flannel, but I don't know how to begin debugging it.

Prometheus itself (the dashboard) reports the error "context deadline exceeded" for the 2 node_exporter pods. When I create a single "curl" pod for curling ClusterIPs, the curl command hangs when trying to connect to these two.

So the question is: how does one verify that flannel is functioning correctly?

Reproduce:

helm install --name prometheus-service stable/prometheus
kubectl port-forward prometheus-service-server-<id> 9090
http://localhost:9090/targets (view in browser)

And see that some (perhaps all but one) of the node_exporter pods report "context deadline exceeded".
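
If you prefer checking from the command line, the same information is available from the Prometheus HTTP API while the port-forward above is running. A minimal sketch (the grep just summarizes the "health" field of each target):

# summarize target health via the API instead of the /targets page
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"' | sort | uniq -c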

6 Answers

Make sure to open tcp/9100 from 10.0.0.0/8, 172.16.0.0/20 and 192.168.0.0/16 in the DO firewall panel of your cluster.
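
For reference, the same rule can be added with doctl instead of the control panel. This is only a sketch: the firewall ID is a placeholder you have to look up first, and the exact --format column names may vary slightly between doctl versions.

# find the firewall DO created for the cluster's worker nodes
doctl compute firewall list --format ID,Name,InboundRules

# allow node_exporter traffic (tcp/9100) from the private ranges above
doctl compute firewall add-rules <firewall-id> \
  --inbound-rules "protocol:tcp,ports:9100,address:10.0.0.0/8,address:172.16.0.0/20,address:192.168.0.0/16"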

You might also have noticed that prometheus fails to get kubelet metrics. Watch this question for updates: https://www.digitalocean.com/community/questions/cannot-install-heapster-to-cluster-due-to-kubelets-not-allowing-to-get-metrics-on-port-10255

  • This is probably related to me not understanding something about how DO manages communication between nodes. Why would I want to open up that port? I was under the impression that part of the advantage of Kubernetes on DO was that the internal networking was isolated.

  • Opening port 9100 as explained by @cbenhagen solves the "context deadline exceeded" error, but there is still a "connection refused" on port 10255, as reported by the Prometheus targets page.

    Anyone found a way to solve it? @guy1234 ? @crsmithdev ?

I opened port 9100 via the droplet networking control panel, and it now seems to work. Because there is a single firewall config for the cluster, the rule is also applied to subsequent nodes that are spun up. The same sources as the rest of the k8s rules can be used.

Would be curious to see if this worked for you.

I am wondering if there is something incorrect about how my nodes are set up, or if I just don't understand how the networking is supposed to work.

1) Get a listing of the nodes:
kubectl get nodes -o wide

2) Create a pod that can curl:
kubectl run curl --image=radial/busyboxplus:curl -i --tty

3) Try curling the internal-ip that is listed from #1.
curl INTERNAL-IP

The result is that the curl pod can only curl the node it is running on. Is this how it is supposed to work? Curling any other node's internal IP results in curl hanging.
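
One thing worth checking from the same curl pod is port 9100 specifically, since that is where the node_exporter listens and what Prometheus actually scrapes (a bare curl INTERNAL-IP only hits port 80). A sketch, substituting the internal IPs from step 1:

# probe the node_exporter endpoint on each node, timing out after 5s instead of hanging
curl -m 5 http://INTERNAL-IP:9100/metrics | head

Before the firewall rule from the other answers is in place, this will likely time out for every node except the one the pod is scheduled on; afterwards it should return metrics from all three.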

I'm having this same problem. @guy1234 , I also found your GitHub issue (https://github.com/helm/charts/issues/9791), did you ever find a resolution / workaround for this?

  • Unfortunately no, I still run with the node_exporters disabled. If you find something, please let me know.
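
    For anyone doing the same, the stable/prometheus chart exposes a toggle for the node exporter. A sketch, assuming the chart's nodeExporter.enabled value (check the values.yaml of your chart version):

    # turn the node_exporter DaemonSet off for an existing release
    helm upgrade prometheus-service stable/prometheus --set nodeExporter.enabled=false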

FYI, I was able to fix this issue by opening port 9100 in the k8s firewall for the source addresses 10.0.0.0/8, 172.16.0.0/20 and 192.168.0.0/16.

I ran into the same issue. The reason is probably that the node-exporters are configured with hostNetwork: true. This is required to be able to scrape some networking metrics, but it also means that they run in the host network namespace. It seems that this traffic does not go through Cilium but directly via the private network, so we need to add a firewall rule as @manelpb already stated. Nevertheless, you can leave out 172.16.0.0/20 (I believe it's the load balancer subnet); allowing 10.0.0.0/8 and 192.168.0.0/16 as source IP ranges on port 9100 works.
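
You can confirm the hostNetwork setting on your own cluster with kubectl. A sketch, assuming the DaemonSet is named after the release from the reproduce steps (adjust to whatever kubectl get daemonset shows):

# prints "true" if the node-exporter pods share the node's network namespace
kubectl get daemonset prometheus-service-node-exporter \
  -o jsonpath='{.spec.template.spec.hostNetwork}{"\n"}'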
