When a node in a DigitalOcean Kubernetes cluster is unhealthy or not ready, replacing it is a manual and cumbersome process. Until the unhealthy nodes are replaced, the cluster operates at reduced capacity because those nodes cannot run any Pods.
Cluster nodes can become unhealthy when the kubelet service dies or becomes unresponsive, which can happen for several reasons (for example, when a node runs out of resources).
This tutorial provides an automated way to recycle unhealthy nodes in a DigitalOcean Kubernetes (DOKS) cluster using Digital Mobius.
Digital Mobius is an open-source application written in Go specifically for recycling DOKS cluster nodes. It checks the cluster's nodes for an unhealthy state at a regular, configurable interval.
Digital Mobius needs a set of environment variables to be configured and available. You can see these variables in the values.yaml:
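Below is a minimal sketch of what such a values file might contain. Only DELAY_NODE_CREATION and enabledFeatures.disableDryRun are referenced later in this tutorial; the other key names (the DigitalOcean API token and cluster ID the tool needs in order to call the DigitalOcean API) are assumptions, so check the chart's own values.yaml for the authoritative names.

```bash
# Minimal sketch of a values file for the Digital Mobius chart.
# Key names other than DELAY_NODE_CREATION and enabledFeatures.disableDryRun
# are assumptions; consult the chart's bundled values.yaml for the real ones.
cat <<'EOF' > values.yaml
environmentVariables:
  LOG_LEVEL: "info"                             # assumed key
  DELAY_NODE_CREATION: "10m"                    # how long a node may stay unhealthy before recycling
  DIGITAL_OCEAN_TOKEN: "<your_do_api_token>"    # assumed key
  DIGITAL_OCEAN_CLUSTER_ID: "<your_cluster_id>" # assumed key
enabledFeatures:
  disableDryRun: false                          # keep dry-run mode on while testing
EOF
```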
Note:
Choose an appropriate value for DELAY_NODE_CREATION. A value that is too low will interfere with the time a node needs to become ready and available again after it gets recycled; in real-world situations, this can take several minutes or more. A good starting point is 10m, the value used in this tutorial.
Digital Mobius can be deployed easily using its Helm chart (also available on artifacthub.io).
Install the chart into your cluster, using maintenance as the namespace:
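The following is a sketch of the installation, assuming the chart is served from a Helm repository; replace the placeholder repository URL (and chart name, if different) with the one listed on artifacthub.io:

```bash
# Sketch only: the repository URL and chart name are placeholders/assumptions.
helm repo add digital-mobius <chart_repository_url>
helm repo update

helm install digital-mobius digital-mobius/digital-mobius \
  --namespace maintenance \
  --create-namespace \
  -f values.yaml
```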
Note:
The enabledFeatures.disableDryRun option enables or disables the tool's DRY RUN mode. Setting it to true means the dry run mode is disabled, and the cluster nodes will be recycled. Enabling the dry run mode is helpful if you want to test the tool first without making any changes to the actual cluster nodes.
The output looks similar to the following:
Verify the running Pod(s):
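For example, using the maintenance namespace from the install step:

```bash
kubectl get pods -n maintenance
```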
The output looks similar to the following:
Inspect the logs:
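For example, selecting the Pod by label; the label selector is an assumption based on common Helm chart conventions, so adjust it to match the labels on your Pod:

```bash
kubectl logs -n maintenance -l app.kubernetes.io/name=digital-mobius
```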
The output looks similar to the following:
Now that we have successfully deployed Digital Mobius, let us check out the underlying logic by which it operates.
A node is considered unhealthy if its Ready condition has a status of False or Unknown.
When it finds such a node, the application recreates it using the DigitalOcean Delete Kubernetes Node API.
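For reference, the underlying API call looks roughly like the sketch below. The IDs are placeholders, and the exact query parameters Digital Mobius passes are not covered here; see the DigitalOcean API documentation for details.

```bash
# Rough sketch of the DigitalOcean "delete Kubernetes node" API call.
curl -X DELETE \
  -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \
  "https://api.digitalocean.com/v2/kubernetes/clusters/<cluster_id>/node_pools/<node_pool_id>/nodes/<node_id>"
```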
The following diagram shows how Digital Mobius checks the worker node(s) state:
We must disconnect one or more nodes from the DOKS cluster to test the Digital Mobius setup. To do this, we will use the doks-debug tool to create some debug pods that run containers with elevated privileges. To access the running containers in the debug pods, we will use kubectl exec.
This command allows us to execute commands inside the containers and gain access to the worker nodes' system services.
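The sketch below deploys a privileged debug DaemonSet similar to doks-debug. The image name and manifest details are assumptions, so prefer the manifest published by the doks-debug project itself:

```bash
# Minimal sketch of a privileged debug DaemonSet (image name is an assumption).
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: doks-debug
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: doks-debug
  template:
    metadata:
      labels:
        name: doks-debug
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: doks-debug
          image: digitalocean/doks-debug:latest   # assumed image
          command: ["sleep", "infinity"]
          securityContext:
            privileged: true
EOF
```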
Verify the DaemonSet:
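Assuming it was deployed to the kube-system namespace, as in the sketch above:

```bash
kubectl get ds -n kube-system
```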
The output looks similar to the following (notice the doks-debug entry):
Verify the debug pods:
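Assuming the name=doks-debug label from the DaemonSet sketch above; -o wide also shows which node each Pod runs on:

```bash
kubectl get pods -n kube-system -l name=doks-debug -o wide
```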
The output looks similar to the following:
Next, use kubectl exec in one of the debug pods to get access to the worker node's system services. Then, stop the kubelet service, which results in the node disappearing from the kubectl get nodes command output.
Open a new terminal window and watch the worker nodes:
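You can use kubectl's built-in watch flag for this:

```bash
kubectl get nodes --watch
```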
Pick the first debug pod and access the shell:
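A sketch, assuming the name=doks-debug label and kube-system namespace used above:

```bash
# Grab the name of one debug Pod, then open a shell inside it.
DEBUG_POD="$(kubectl get pods -n kube-system -l name=doks-debug \
  -o jsonpath='{.items[0].metadata.name}')"
kubectl exec -it "$DEBUG_POD" -n kube-system -- bash
```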
A prompt that looks similar to the following appears:
Inspect the system service:
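From inside the debug shell, check the kubelet unit. If systemctl is not directly usable there, the privileged, hostPID setup lets you enter the host's namespaces first (whether this extra step is needed depends on your debug image, so treat it as an assumption):

```bash
# If needed, enter the host's namespaces first:
#   nsenter --target 1 --mount --uts --ipc --net --pid -- bash
systemctl status kubelet
```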
The output looks similar to the following:
Stop the kubelet:
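Still from the node's shell:

```bash
systemctl stop kubelet
```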
After you stop the kubelet service, you will be kicked out of the shell session. This means the node controller lost its connection with the affected node, where the kubelet service was stopped. You can see the NotReady state of the affected node in the other terminal window where you set up the watch:
After the time interval you specified in DELAY_NODE_CREATION expires, the node vanishes as expected:
Next, check how Digital Mobius monitors the DOKS cluster. Open a terminal window and inspect the logs first:
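For example, streaming the logs (label selector assumed, as before):

```bash
kubectl logs -n maintenance -l app.kubernetes.io/name=digital-mobius -f
```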
The output looks similar to the following (watch for the Recycling node {...} lines):
In the terminal window where you set the watch for kubectl get nodes, a new node appears after a minute, replacing the old one. The new node has a different ID and a new AGE value:
As you can see, the node was automatically recycled.
In conclusion, while automatic recovery of cluster nodes is a valuable feature, it is crucial to prioritize node health monitoring and load management to prevent frequent node failures. In addition, setting sensible resource requests and limits on your Pods helps avoid overloading nodes in the first place. By adopting these best practices, you can ensure the stability and reliability of your Kubernetes cluster and avoid costly downtime and service disruptions.