I’ve been getting familiar with Kubernetes and have a 2-node pool running a 2-pod application behind a load balancer.

Lately, the health checks have been failing and the application is going down.

When I run kubectl get nodes, both nodes have a status of Ready. When I run kubectl get pods, all pods are Running with 0 restarts.

How does the health check work, and how can I debug/troubleshoot this to figure out why the nodes are becoming unhealthy so that I can resolve the issue?

2 answers

Hi there!

The LB is typically checking a specific NodePort for the service, not just whether a node is healthy. I understand this can be confusing with the current iteration of the LB UI.

The externalTrafficPolicy setting on some Kubernetes services may be set to “Local”. This means a node that is not running a pod for that service locally will reject the LB health check and show as down. In other words, to save the overhead of unnecessary network hops between nodes, only nodes hosting pods for the LoadBalancer service will report as healthy.

With externalTrafficPolicy set to “Cluster”, all nodes forward that traffic on to nodes that are hosting pods for that service. In this case, even a node not hosting a pod for that particular service will show as “Healthy”, since it simply forwards the request to a node that can serve it.
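For context, here is a minimal sketch of a Service of type LoadBalancer with this field set; the name, selector, and ports are placeholders rather than anything from your cluster:

    apiVersion: v1
    kind: Service
    metadata:
      name: myservice                  # placeholder name
    spec:
      type: LoadBalancer
      externalTrafficPolicy: Cluster   # or Local to preserve the client source IP
      selector:
        app: myapp                     # placeholder label; must match your pod labels
      ports:
        - port: 80                     # port the LB exposes
          targetPort: 8080             # placeholder container port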

To change this setting for a particular service, use the following command:

kubectl patch svc myservice -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'

An important thing to note here is that with externalTrafficPolicy set to “Cluster”, you will lose the original client IP address due to this extra network hop. So if your application checks or depends on knowing the client IP, the “Local” setting is required.
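To check which policy a service currently has, you can read the field back from the API (again assuming the service is named myservice, as in the patch above):

    kubectl get svc myservice -o jsonpath='{.spec.externalTrafficPolicy}'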

You can find more information on externalTrafficPolicy here: https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-type-nodeport

Regards,

John Kwiatkoski
Senior Developer Support Engineer - Kubernetes

  • I’m only dealing with one node, so I don’t think this is the issue. If I proxy to the pods individually, they respond with the application just fine. So both pods are responding, but the LB is indicating that they are unhealthy.

    • Per @nabsul’s comment, is the service of type: LoadBalancer an ingress controller?

      The LB is checking the health of the LB service’s pods and not others.

      If that is not the case, I would check that your healthcheck settings on the service/LB are what you are expecting.

      Can you manually try and replicate the failing healthchecks from your local machine?

      • spec:
          type: LoadBalancer
          externalTrafficPolicy: "Cluster"
        

        This is the service setting.

        Is there any way I can see what the health check is returning to try and pinpoint the issue?

        • It should be querying the NodePort of the LB service. You can find that in the service’s YAML, or by viewing the LB in the cloud panel and checking the current settings that the CCM has set. The sketch below shows one way to replicate the check by hand.
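          Assuming the service is named myservice (swap in your own name), a rough way to do that:

            # find the nodePort(s) the LB is pointed at
            kubectl get svc myservice -o jsonpath='{.spec.ports[*].nodePort}'

            # find the node addresses
            kubectl get nodes -o wide

            # then send the same kind of request the LB would, e.g. for an HTTP check
            curl -v http://<node-ip>:<node-port>/

          Note that the node IPs need to be reachable from wherever you run this (firewalls permitting), and that with “Local” the LB checks spec.healthCheckNodePort rather than the service’s nodePort.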

I had this problem a while back, and at least for me the problem was that both instances of the ingress controller I was using (nginx) were running on the same node. Because of that, one node was responding and one was not.

To see if you’re having the same problem, check how many instances of your ingress controller are running and on which nodes.
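For example, assuming the controller pods carry a label like app=nginx-ingress-pod (swap in whatever label your controller actually uses), the NODE column here shows where each replica landed:

    kubectl get pods -l app=nginx-ingress-pod -o wide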

If this is your problem, I solved it by adding this to my deployment spec:

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                    - nginx-ingress-pod
              topologyKey: "kubernetes.io/hostname"

However, I think the ideal solution would be to run the service as a DaemonSet (instead of running it as a Deployment): https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/
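As a rough sketch of what that change might look like, with the name, label, and image as placeholders rather than the real nginx-ingress manifest:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nginx-ingress              # placeholder name
    spec:
      selector:
        matchLabels:
          app: nginx-ingress-pod       # placeholder label
      template:
        metadata:
          labels:
            app: nginx-ingress-pod
        spec:
          containers:
            - name: controller
              image: your-ingress-image:tag   # placeholder image
              ports:
                - containerPort: 80

Since a DaemonSet runs one pod on each node, every node in the pool would have a local ingress pod to answer the LB health check.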

I haven’t made that change yet myself, but would like to at some point.
