Kubernetes is a great tool for orchestrating containerized workloads on a cluster of nodes. If you have ever experienced the sudden downtime of a node, you may have come across Kubernetes' rescheduling of deployments, which kicks in after some time. In this post I want to highlight how the system recognizes such situations. This can be helpful for understanding and tuning the rescheduling mechanics, or when developing your own operators and resources.
To make this process easier to follow, the following graphic visualizes the key actions and decisions. We will deal with a system consisting of one master and one node.
In a healthy system, the kubelet running on the node continuously reports its status to the master. The reporting interval is controlled by the kubelet's CLI parameter --node-status-update-frequency, whose default is 10s.
That way, the master stays informed about the health of the cluster nodes and can schedule pods in a proper way.
When a node fails unexpectedly, the master obviously cannot be informed about the reason. Instead, while monitoring the nodes, it checks the timeout --node-monitor-grace-period. By default this timeout is set to 40 seconds in the controller manager. That means a node has 40 seconds to recover and send its status to the master before the next step is entered.
If the node recovers in time, the system stays healthy and continues with the loop.
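The check described above can be pictured as a simple staleness test. The following is a minimal sketch and not the controller manager's actual code: the function names are hypothetical, and the defaults are simply the values quoted in this post.

```python
# Illustrative model only; these names are hypothetical, not Kubernetes APIs.
NODE_STATUS_UPDATE_FREQUENCY = 10.0  # seconds, kubelet default quoted above
NODE_MONITOR_GRACE_PERIOD = 40.0     # seconds, controller manager default quoted above


def heartbeat_is_fresh(last_heartbeat: float, now: float,
                       grace_period: float = NODE_MONITOR_GRACE_PERIOD) -> bool:
    """Return True while the last status update is still within the grace period."""
    return (now - last_heartbeat) <= grace_period


def updates_per_grace_period(update_frequency: float = NODE_STATUS_UPDATE_FREQUENCY,
                             grace_period: float = NODE_MONITOR_GRACE_PERIOD) -> int:
    """Number of status update intervals that fit into one grace period."""
    return int(grace_period // update_frequency)
```

In this model, four update intervals fit into one grace period with the default values, so a single delayed report does not make the status look stale.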
If the node does not respond within the given timeout, its status is set to Unknown and a second timeout starts. This timeout, called --pod-eviction-timeout, controls when the pods on the node are ready to be evicted (together with the settings covered in the "Taints and Tolerations" section below). The default value is 5 minutes.
As soon as the node responds within this timeframe, the master sets its status back to Ready and the process continues with the usual loop described at the beginning.
But when this timeout is exceeded and the node is still not responding, the pods are finally marked for deletion.
It should be noted that these pods are not removed instantly. Instead, the node has to come online again and connect to the master in order to confirm the deletion (2-phase confirmation).
If that is not possible, for example when the node has left the cluster permanently, you have to remove these pods manually.
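Putting the two timeouts together, the steps of this section can be sketched as a function of how long the node has been silent. This is a simplified model, assuming that the eviction timeout starts exactly when the grace period ends; the function name and the label of the last phase are made up for illustration.

```python
def node_phase(silent_seconds: float,
               grace_period: float = 40.0,
               eviction_timeout: float = 300.0) -> str:
    """Classify a node by the time since its last status update (simplified model)."""
    if silent_seconds <= grace_period:
        # Still inside --node-monitor-grace-period: nothing happens yet.
        return "Ready"
    if silent_seconds <= grace_period + eviction_timeout:
        # Grace period exceeded, --pod-eviction-timeout is running.
        return "Unknown"
    # Both timeouts exceeded: pods are marked for deletion, but the deletion
    # still has to be confirmed by the node or handled manually.
    return "Unknown, pods marked for deletion"
```

With the defaults, a node that has been silent for 60 seconds is in the Unknown phase, and its pods are only marked for deletion after more than 340 seconds of silence.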
Taints and Tolerations
Even if you set the eviction timeout --pod-eviction-timeout to a lower value, you may notice that pods still need 5 minutes to be deleted. This is due to the admission controller (DefaultTolerationSeconds) that adds default tolerations to every pod, which allow it to stay on a not-ready or unreachable node for a period of time.
    tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
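As you can see in the default configuration above, the value is set to 300 seconds (5 minutes). One possible solution is to apply a custom configuration to each pod, where this value is adjusted to your needs. You can also adjust this setting globally, for example with the kube-apiserver flags --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds.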
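A per-pod override could look like the following sketch. It builds the tolerations list as a plain Python structure that mirrors the YAML above; the helper name custom_tolerations is hypothetical, and the result would still have to be placed under spec.tolerations of your pod template.

```python
from typing import Dict, List, Union

Toleration = Dict[str, Union[str, int]]


def custom_tolerations(seconds: int) -> List[Toleration]:
    """Build not-ready/unreachable tolerations with a custom tolerationSeconds."""
    keys = (
        "node.kubernetes.io/not-ready",
        "node.kubernetes.io/unreachable",
    )
    return [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in keys
    ]
```

Calling custom_tolerations(20) yields two entries, one per taint key, each with tolerationSeconds set to 20.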
For instance, when a value (tolerationSeconds) of 20 seconds is chosen, it will take 60 seconds overall for a pod to be deleted, because the --node-monitor-grace-period of 40 seconds has to elapse first.
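The arithmetic behind this example can be written down as a small sketch. It assumes, as described above, that the toleration period only starts counting once the grace period has run out; the function name is hypothetical.

```python
def seconds_until_pod_deletion(toleration_seconds: float,
                               node_monitor_grace_period: float = 40.0) -> float:
    """Total time from the last node status update until pods are marked for deletion."""
    return node_monitor_grace_period + toleration_seconds
```

For the example above, seconds_until_pod_deletion(20) gives 60 seconds, and the default toleration of 300 seconds gives 340 seconds.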
Wrapping it up
I hope you now have a rough idea of how Kubernetes recognizes and handles offline nodes. The two timeouts, as well as the default taints and tolerations configuration, are easy to trip over.
This can come in handy when you develop your own operator that has to deal with non-responding nodes. For instance, Kubernetes' deployment controller recognizes these situations automatically and reschedules the configured pods.
This is also one of the reasons why you should avoid using "naked" pods, because in that case you have to implement this helpful handling yourself.