I’ve woken up to find that all my services have fallen over. Investigating, I found that all my k8s nodes are in a NotReady state. Deploying isn’t working.

No notifications about this happening. No emails. Nothing from DO to say “by the way, your nodes have fallen over”.

Can someone from DO help me with this?

3 comments
  • Hey bud, I have the same issue, do you know why it did it?

  • Hi @davidAngler - no idea why it did it. DO said my nodes were OOM, but I think that’s inaccurate. A few other people’s k8s clusters had a fit, not just mine, clearly. I had to get onto Twitter before they’d give any support. Hope you got it sorted, dude.

  • @colinjohnriddell Getting the same answer from DO.

    They seem okay, but they say the node OS went into “panic” mode, meaning it ran out of resources, even though my software is quite light.

    I will try a stronger pool, move the workloads there, and play with podAntiAffinity to prefer spreading the pods across different nodes (a rough sketch of what I mean is below), because when the cluster came back online everything landed on the same node. :|

    Anyways, I’m sure it’s probably DO doing something in the background and telling us a different story.
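
    Roughly what I mean by the anti-affinity bit, for anyone curious. Just a sketch, assuming a hypothetical Deployment called web with the label app=web (swap in your own names): the preferred rule asks the scheduler to spread replicas across nodes instead of packing them onto one.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                       # hypothetical name - swap in your own Deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" is a soft rule: spread across nodes when possible,
          # but still schedule even if only one node has room
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.25         # placeholder image
        resources:
          requests:               # requests also help the scheduler avoid piling everything onto one node
            cpu: 100m
            memory: 128Mi
EOF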


3 answers

Hi, how are you defining your pods for your service(s)? It’s not clear from your original post. If you have a public repository, can you drop a link in the comments? Well, I must go and I look forward to any additional feedback.

Think different and code well,

-Conrad

No answer to this. Something happened with the DO clusters. DO blamed my nodes for being OOM. They’re fine now, and they were fine before.

Hello @colinjohnriddell ,

NotReady status on a node can be caused by multiple reasons (a quick way to tell these apart is sketched right after this list):

  • The node’s kubelet service has stopped running.
  • The container runtime (Docker) has stopped running.
  • The node VM is no longer available.
  • Resource contention on the node.
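
The quickest way to narrow down which of these applies is the node’s Conditions block (a rough sketch; <name_of_node> is a placeholder):

# A Ready condition reporting "Kubelet stopped posting node status" points at the kubelet or the node VM itself,
# while MemoryPressure / DiskPressure / PIDPressure set to True point at resource contention.
kubectl describe node <name_of_node> | grep -A 10 "Conditions:"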

It is a best practice to treat Kubernetes nodes as ephemeral. Because of this, it is common to recycle a node that has an issue and replace it with a healthy one, which fixes many node-specific problems. Generally, we see nodes go into a NotReady state due to a lack of resources.

If you want to dig into the specific incident, you can review the events around the nodes using the following commands:

kubectl get nodes                        # overall node status
kubectl describe node <name_of_node>     # conditions, capacity, and recent events for one node
kubectl get events -n kube-system        # events from system components in kube-system
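
If resource contention looks likely, the metrics-server addon (assuming it is installed in the cluster) can show live usage per node and per pod:

kubectl top nodes                        # CPU/memory usage per node
kubectl top pods -A --sort-by=memory     # heaviest pods across all namespaces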

Coming to the notification option: at present this feature is not available, but it is already on our roadmap. I don’t have a specific ETA for it. Our product team is always looking for feature requests and product feedback, so I’d ask you to add the idea (or vote for it) here and subscribe for updates: https://ideas.digitalocean.com/ideas/

We use that page to help gauge demand for new features, so adding it, or adding your vote, will help us to prioritize when we can implement this feature.

I hope this helps!

Best Regards,
Purnima Kumari
Developer Support Engineer II, DigitalOcean

  • I’ve started getting problems again on my nodes.
    Recycling seems to do nothing other than hang my nodes and set them to status NotReady,SchedulingDisabled.
    How long should it take to cycle a node?

    Since k8s is managed by DO, is it fair to say that the first 3 points you’ve made are the responsibility of the provider (you)?

    Thanks for your help and comments on the issue.

    • I understand your concern here. Nodes take ~5 minutes to recycle, but they can take up to 30 minutes while waiting for workloads to terminate gracefully. You can speed up the process by passing the --skip-drain flag via doctl. This flag tells Kubernetes not to wait for the pods to drain gracefully. This option is not yet implemented in the cloud panel.

      doctl kubernetes cluster node-pool delete-node <cluster-id> <pool-id> <node-id> --skip-drain
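
      If you need the cluster, pool, and node identifiers for that command, you can look them up with doctl as well (a quick sketch, assuming doctl is already authenticated):

      doctl kubernetes cluster list                           # shows the cluster ID
      doctl kubernetes cluster node-pool list <cluster-id>    # shows the pool ID and its nodes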
      

      This is most often caused by the nodes not having enough resources for the workload. Does this issue occur when you use larger nodes?

      Yes, that is a fair criticism, and we know this is a pain point for our customers and an area we need to focus on improving. If this issue still persists on larger nodes, please don’t hesitate to open a support ticket with us so that we can take a closer look.

      Best Regards,
      Purnima Kumari
      Developer Support Engineer II, DigitalOcean
