Node-pressure Eviction

Node-pressure eviction is the process by which the kubelet proactively terminates pods to reclaim resources on nodes.

The kubelet monitors resources like memory, disk space, and filesystem inodes on your cluster's nodes. When one or more of these resources reach specific consumption levels, the kubelet can proactively fail one or more pods on the node to reclaim resources and prevent starvation.

During a node-pressure eviction, the kubelet sets the phase for the selected pods to Failed, and terminates the Pod.

Node-pressure eviction is not the same as API-initiated eviction.

The kubelet does not respect your configured PodDisruptionBudget or the pod's terminationGracePeriodSeconds. If you use soft eviction thresholds, the kubelet respects your configured eviction-max-pod-grace-period. If you use hard eviction thresholds, the kubelet uses a 0s grace period (immediate shutdown) for termination.

Self healing behavior

The kubelet attempts to reclaim node-level resources before it terminates end-user pods. For example, it removes unused container images when disk resources are starved.

If the pods are managed by a workload management object (such as StatefulSet or Deployment) that replaces failed pods, the control plane (kube-controller-manager) creates new pods in place of the evicted pods.

Self healing for static pods

If you are running a static pod on a node that is under resource pressure, the kubelet may evict that static Pod. The kubelet then tries to create a replacement, because static Pods always represent an intent to run a Pod on that node.

The kubelet takes the priority of the static pod into account when creating a replacement. If the static pod manifest specifies a low priority, and there are higher-priority Pods defined within the cluster's control plane, and the node is under resource pressure, the kubelet may not be able to make room for that static pod. The kubelet continues to attempt to run all static pods even when there is resource pressure on a node.

Eviction signals and thresholds

The kubelet uses various parameters to make eviction decisions, like the following:

  • Eviction signals
  • Eviction thresholds
  • Monitoring intervals

Eviction signals

Eviction signals are the current state of a particular resource at a specific point in time. The kubelet uses eviction signals to make eviction decisions by comparing the signals to eviction thresholds, which are the minimum amount of the resource that should be available on the node.

On Linux, the kubelet uses the following eviction signals:

Eviction Signal | Description
memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
nodefs.available | nodefs.available := node.stats.fs.available
nodefs.inodesFree | nodefs.inodesFree := node.stats.fs.inodesFree
imagefs.available | imagefs.available := node.stats.runtime.imagefs.available
imagefs.inodesFree | imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree
pid.available | pid.available := node.stats.rlimit.maxpid - node.stats.rlimit.curproc

In this table, the Description column shows how the kubelet gets the value of the signal. Each signal supports either a percentage or a literal value. The kubelet calculates the percentage value relative to the total capacity associated with the signal.

The value for memory.available is derived from the cgroupfs instead of tools like free -m. This is important because free -m does not work in a container, and if users use the node allocatable feature, out-of-resource decisions are made local to the end-user Pod part of the cgroup hierarchy as well as the root node. This script or cgroupv2 script reproduces the same set of steps that the kubelet performs to calculate memory.available. The kubelet excludes inactive_file (the number of bytes of file-backed memory on the inactive LRU list) from its calculation, as it assumes that memory is reclaimable under pressure.

The kubelet recognizes two specific filesystem identifiers:

  1. nodefs: The node's main filesystem, used for local disk volumes, emptyDir volumes not backed by memory, log storage, and more. For example, nodefs contains /var/lib/kubelet/.
  2. imagefs: An optional filesystem that container runtimes use to store container images and container writable layers.

The kubelet auto-discovers these filesystems and ignores other node-local filesystems. The kubelet does not support other configurations.

Some kubelet garbage collection features are deprecated in favor of eviction:

Existing Flag | Rationale
--maximum-dead-containers | deprecated once old logs are stored outside of container's context
--maximum-dead-containers-per-container | deprecated once old logs are stored outside of container's context
--minimum-container-ttl-duration | deprecated once old logs are stored outside of container's context

Eviction thresholds

You can specify custom eviction thresholds for the kubelet to use when it makes eviction decisions. You can configure soft and hard eviction thresholds.

Eviction thresholds have the form [eviction-signal][operator][quantity], where:

  • eviction-signal is the eviction signal to use.
  • operator is the relational operator you want, such as < (less than).
  • quantity is the eviction threshold amount, such as 1Gi. The value of quantity must match the quantity representation used by Kubernetes. You can use either literal values or percentages (%).

For example, if a node has 10GiB of total memory and you want to trigger eviction if the available memory falls below 1GiB, you can define the eviction threshold as either memory.available<10% or memory.available<1Gi (you cannot use both).
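
Expressed in a kubelet configuration file, such a threshold becomes an entry under evictionHard (or evictionSoft). The following is a minimal sketch for the 10GiB-node example above, showing the literal form with the equivalent percentage form as a comment (use one or the other, not both):

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  evictionHard:
    memory.available: "1Gi"      # literal quantity
    # memory.available: "10%"    # equivalent percentage form on a 10GiB node; do not set both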

Soft eviction thresholds

A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The kubelet does not evict pods until the grace period is exceeded. The kubelet returns an error on startup if you do not specify a grace period.

You can specify both a soft eviction threshold grace period and a maximum allowed pod termination grace period for the kubelet to use during evictions. If you specify a maximum allowed grace period and the soft eviction threshold is met, the kubelet uses the lesser of the two grace periods. If you do not specify a maximum allowed grace period, the kubelet kills evicted pods immediately without graceful termination.

You can use the following flags to configure soft eviction thresholds:

  • eviction-soft: A set of eviction thresholds like memory.available<1.5Gi that can trigger pod eviction if held over the specified grace period.
  • eviction-soft-grace-period: A set of eviction grace periods like memory.available=1m30s that define how long a soft eviction threshold must hold before triggering a Pod eviction.
  • eviction-max-pod-grace-period: The maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
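
If you configure the kubelet through a configuration file instead of flags, the same settings map to the evictionSoft, evictionSoftGracePeriod, and evictionMaxPodGracePeriod fields. A minimal sketch follows; the threshold and grace period values are illustrative, not recommendations:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  evictionSoft:
    memory.available: "1.5Gi"       # evict only if this condition holds for the grace period below
  evictionSoftGracePeriod:
    memory.available: "1m30s"       # how long the threshold must hold before eviction starts
  evictionMaxPodGracePeriod: 60     # cap, in seconds, on the termination grace period used for evicted pods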

Hard eviction thresholds

A hard eviction threshold has no grace period. When a hard eviction threshold is met, the kubelet kills pods immediately without graceful termination to reclaim the starved resource.

You can use the eviction-hard flag to configure a set of hard eviction thresholds like memory.available<1Gi.

The kubelet has the following default hard eviction thresholds:

  • memory.available<100Mi
  • nodefs.available<10%
  • imagefs.available<15%
  • nodefs.inodesFree<5% (Linux nodes)

These default values apply only if none of the hard eviction thresholds is changed. If you change the value of any one threshold, the other thresholds do not keep their default values; they are set to zero instead. To provide custom values, you should therefore specify all of the thresholds explicitly.
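
For example, to lower only the memory threshold while keeping the remaining behavior, a configuration file sketch would restate every default explicitly (the memory value here is illustrative; the other entries mirror the defaults listed above):

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  evictionHard:
    memory.available: "200Mi"   # custom value
    nodefs.available: "10%"     # restated default
    imagefs.available: "15%"    # restated default
    nodefs.inodesFree: "5%"     # restated default (Linux nodes)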

Eviction monitoring interval

The kubelet evaluates eviction thresholds based on its configured housekeeping-interval, which defaults to 10s.

Node conditions

The kubelet reports node conditions to reflect that the node is under pressure because a hard or soft eviction threshold has been met, independent of configured grace periods.

The kubelet maps eviction signals to node conditions as follows:

Node Condition | Eviction Signal | Description
MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold
DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold
PIDPressure | pid.available | Available process identifiers on the (Linux) node have fallen below an eviction threshold

The control plane also maps these node conditions to taints.

The kubelet updates the node conditions based on the configured --node-status-update-frequency, which defaults to 10s.

Node condition oscillation

In some cases, nodes oscillate above and below soft eviction thresholds without holding for the defined grace periods. This causes the reported node condition to constantly switch between true and false, leading to bad eviction decisions.

To protect against oscillation, you can use the eviction-pressure-transition-period flag, which controls how long the kubelet must wait before transitioning a node condition to a different state. The transition period has a default value of 5m.
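
In a kubelet configuration file, these settings correspond to the nodeStatusUpdateFrequency and evictionPressureTransitionPeriod fields. A sketch with the default values spelled out; lengthen the transition period if your node conditions flap:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  nodeStatusUpdateFrequency: "10s"            # how often node status (including conditions) is reported
  evictionPressureTransitionPeriod: "5m0s"    # minimum wait before a node condition may change state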

Reclaiming node level resources

The kubelet tries to reclaim node-level resources before it evicts end-user pods.

When a DiskPressure node condition is reported, the kubelet reclaims node-level resources based on the filesystems on the node.

With imagefs

If the node has a dedicated imagefs filesystem for container runtimes to use, the kubelet does the following:

  • If the nodefs filesystem meets the eviction thresholds, the kubelet garbage collects dead pods and containers.
  • If the imagefs filesystem meets the eviction thresholds, the kubelet deletes all unused images.

Without imagefs

If the node only has a nodefs filesystem that meets eviction thresholds, the kubelet frees up disk space in the following order:

  1. Garbage collect dead pods and containers
  2. Delete unused images

Pod selection for kubelet eviction

If the kubelet's attempts to reclaim node-level resources don't bring the eviction signal below the threshold, the kubelet begins to evict end-user pods.

The kubelet uses the following parameters to determine the pod eviction order:

  1. Whether the pod's resource usage exceeds requests
  2. Pod Priority
  3. The pod's resource usage relative to requests

As a result, kubelet ranks and evicts pods in the following order:

  1. BestEffort or Burstable pods where the usage exceeds requests. These pods are evicted based on their Priority and then by how much their usage level exceeds the request.
  2. Guaranteed pods and Burstable pods where the usage is less than requests are evicted last, based on their Priority.

Note: The kubelet does not use the pod's QoS class to determine the eviction order. You can use the QoS class to estimate the most likely pod eviction order when reclaiming resources like memory. QoS classification does not apply to EphemeralStorage requests, so the above scenario will not apply if the node is, for example, under DiskPressure.

Guaranteed pods are guaranteed only when requests and limits are specified for all the containers and they are equal. These pods will never be evicted because of another pod's resource consumption. If a system daemon (such as kubelet and journald) is consuming more resources than were reserved via system-reserved or kube-reserved allocations, and the node only has Guaranteed or Burstable pods using fewer resources than their requests remaining on it, then the kubelet must choose to evict one of these pods to preserve node stability and to limit the impact of resource starvation on other pods. In this case, it will choose to evict pods of lowest Priority first.

If you are running a static pod and want to avoid having it evicted under resource pressure, set the priority field for that Pod directly. Static pods do not support the priorityClassName field.
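
A sketch of that approach follows; the pod name, image, and priority value are hypothetical, and the only essential detail is setting spec.priority rather than priorityClassName:

  # hypothetical static pod manifest, placed in the kubelet's staticPodPath directory
  apiVersion: v1
  kind: Pod
  metadata:
    name: node-critical-agent
  spec:
    priority: 1000000                    # set the priority field directly
    containers:
    - name: agent
      image: example.com/agent:latest    # placeholder image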

When the kubelet evicts pods in response to inode or process ID starvation, it uses the Pods' relative priority to determine the eviction order, because inodes and PIDs have no requests.

The kubelet sorts pods differently based on whether the node has a dedicated imagefs filesystem:

With imagefs

If nodefs is triggering evictions, the kubelet sorts pods based on nodefs usage (local volumes + logs of all containers).

If imagefs is triggering evictions, the kubelet sorts pods based on the writable layer usage of all containers.

Without imagefs

If nodefs is triggering evictions, the kubelet sorts pods based on their total disk usage (local volumes + logs & writable layer of all containers).

Minimum eviction reclaim

In some cases, pod eviction only reclaims a small amount of the starved resource. This can lead to the kubelet repeatedly hitting the configured eviction thresholds and triggering multiple evictions.

You can use the --eviction-minimum-reclaim flag or a kubelet config file to configure a minimum reclaim amount for each resource. When the kubelet notices that a resource is starved, it continues to reclaim that resource until it reclaims the quantity you specify.

For example, the following configuration sets minimum reclaim amounts:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  evictionHard:
    memory.available: "500Mi"
    nodefs.available: "1Gi"
    imagefs.available: "100Gi"
  evictionMinimumReclaim:
    memory.available: "0Mi"
    nodefs.available: "500Mi"
    imagefs.available: "2Gi"

In this example, if the nodefs.available signal meets the eviction threshold, the kubelet reclaims the resource until the signal reaches the threshold of 1GiB, and then continues to reclaim the minimum amount of 500MiB, until the available nodefs storage value reaches 1.5GiB.

Similarly, the kubelet tries to reclaim the imagefs resource until the imagefs.available value reaches 102Gi, representing 102 GiB of available container image storage. If the amount of storage that the kubelet could reclaim is less than 2GiB, the kubelet doesn't reclaim anything.

The default eviction-minimum-reclaim is 0 for all resources.

Node out of memory behavior

If the node experiences an out of memory (OOM) event prior to the kubelet being able to reclaim memory, the node depends on the oom_killer to respond.

The kubelet sets an oom_score_adj value for each container based on the QoS for the pod.

Quality of Service | oom_score_adj
Guaranteed | -997
BestEffort | 1000
Burstable | min(max(2, 1000 - (1000 × memoryRequestBytes) / machineMemoryCapacityBytes), 999)

Note: The kubelet also sets an oom_score_adj value of -997 for any containers in Pods that have system-node-critical Priority.
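
As a worked illustration of the Burstable formula, with assumed numbers: a container that requests 1GiB of memory on a node with 8GiB of capacity gets oom_score_adj = min(max(2, 1000 - (1000 × 1GiB) / 8GiB), 999) = min(max(2, 875), 999) = 875, making it a likelier OOM-kill target than a container whose request covers most of the node's memory.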

If the kubelet can't reclaim memory before a node experiences OOM, the oom_killer calculates an oom_score based on the percentage of memory it's using on the node, and then adds the oom_score_adj to get an effective oom_score for each container. It then kills the container with the highest score.

This means that containers in low QoS pods that consume a large amount of memory relative to their scheduling requests are killed first.

Unlike pod eviction, if a container is OOM killed, the kubelet can restart it based on its restartPolicy.

Good practices

The following sections describe good practice for eviction configuration.

Schedulable resources and eviction policies

When you configure the kubelet with an eviction policy, you should make sure that the scheduler will not schedule pods if they will trigger eviction because they immediately induce memory pressure.

Consider the following scenario:

  • Node memory capacity: 10GiB
  • Operator wants to reserve 10% of memory capacity for system daemons (kernel, kubelet, etc.)
  • Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.

For this to work, the kubelet is launched as follows:

  --eviction-hard=memory.available<500Mi
  --system-reserved=memory=1.5Gi

In this configuration, the --system-reserved flag reserves 1.5GiB of memory for the system, which is 10% of the total memory + the eviction threshold amount.

The node can reach the eviction threshold if a pod is using more than its request, or if the system is using more than 1GiB of memory, which makes the memory.available signal fall below 500MiB and triggers the threshold.
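
The same setup expressed in a kubelet configuration file would look roughly like the following sketch, using the evictionHard and systemReserved fields that correspond to the flags above:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  evictionHard:
    memory.available: "500Mi"   # evict when less than 500MiB of memory is available
  systemReserved:
    memory: "1.5Gi"             # 10% of capacity plus the eviction threshold amount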

DaemonSets and node-pressure eviction

Pod priority is a major factor in making eviction decisions. If you do not want the kubelet to evict pods that belong to a DaemonSet, give those pods a high enough priority by specifying a suitable priorityClassName in the pod spec. You can also use a lower priority, or the default, to only allow pods from that DaemonSet to run when there are enough resources.
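
A minimal sketch of that setup follows; the class name, priority value, and DaemonSet details are illustrative only:

  apiVersion: scheduling.k8s.io/v1
  kind: PriorityClass
  metadata:
    name: daemonset-high              # hypothetical name
  value: 100000
  globalDefault: false
  description: "For DaemonSet pods that should survive node-pressure eviction."
  ---
  apiVersion: apps/v1
  kind: DaemonSet
  metadata:
    name: node-agent                  # hypothetical name
  spec:
    selector:
      matchLabels:
        app: node-agent
    template:
      metadata:
        labels:
          app: node-agent
      spec:
        priorityClassName: daemonset-high       # gives these pods high priority in eviction decisions
        containers:
        - name: agent
          image: example.com/node-agent:latest  # placeholder image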

Known issues

The following sections describe known issues related to out of resource handling.

kubelet may not observe memory pressure right away

By default, the kubelet polls cAdvisor to collect memory usage stats at a regular interval. If memory usage increases rapidly within that window, the kubelet may not observe MemoryPressure fast enough, and the OOM killer will still be invoked.

You can use the --kernel-memcg-notification flag to enable the memcg notification API on the kubelet to get notified immediately when a threshold is crossed.

If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for this issue is to use the --kube-reserved and --system-reserved flags to allocate memory for the system.
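
A configuration-file sketch of that workaround; the reservation sizes are illustrative, and kernelMemcgNotification is assumed here to be the config-file counterpart of the --kernel-memcg-notification flag mentioned above:

  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  kernelMemcgNotification: true   # assumption: config counterpart of --kernel-memcg-notification
  systemReserved:
    memory: "1Gi"                 # memory reserved for OS daemons (illustrative size)
  kubeReserved:
    memory: "500Mi"               # memory reserved for Kubernetes system daemons (illustrative size)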

active_file memory is not considered as available memory

On Linux, the kernel tracks the number of bytes of file-backed memory on the active least recently used (LRU) list as the active_file statistic. The kubelet treats active_file memory areas as not reclaimable. For workloads that make intensive use of block-backed local storage, including ephemeral local storage, kernel-level caches of file and block data mean that many recently accessed cache pages are likely to be counted as active_file. If enough of these kernel block buffers are on the active LRU list, the kubelet is liable to observe this as high resource use and taint the node as experiencing memory pressure, triggering pod eviction.

For more details, see https://github.com/kubernetes/kubernetes/issues/43916

You can work around that behavior by setting the memory limit and memory request the same for containers likely to perform intensive I/O activity. You will need to estimate or measure an optimal memory limit value for that container.
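
For example, a minimal Pod sketch with the memory request and limit set to the same value (the name, image, and sizes are placeholders):

  apiVersion: v1
  kind: Pod
  metadata:
    name: io-heavy-workload                  # hypothetical name
  spec:
    containers:
    - name: worker
      image: example.com/io-worker:latest    # placeholder image
      resources:
        requests:
          memory: "2Gi"     # request equals the limit below
        limits:
          memory: "2Gi"     # so the container's memory use, including its file cache, is capped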

What's next

  • Learn about API-initiated Eviction
  • Learn about Pod Priority and Preemption
  • Learn about PodDisruptionBudgets
  • Learn about Quality of Service (QoS)
  • Check out the Eviction API

