The default scheduler will keep your cluster running, but it will not keep it balanced or resilient unless you tell it how. Left alone, kube-scheduler places each pod on whichever feasible node scores highest, and “highest” rarely means “spread across three failure domains so a zone outage does not take down quorum.” This guide walks the scheduling cycle end to end, then layers on the four controls that actually shape placement in production: affinity, topology spread constraints, taints and tolerations, and PriorityClasses with preemption. Everything here targets Kubernetes 1.29+, where matchLabelKeys and minDomains for topology spread are enabled by default. Check the feature-state notes for your version: minDomains only graduated to stable in 1.30, and matchLabelKeys was still beta in 1.29.
1. The scheduling cycle: filtering, scoring, binding
kube-scheduler runs one pod at a time off the head of a priority queue. For each pod it executes a two-phase cycle built on the scheduler framework, a set of extension points (plugins) that the in-tree behavior itself is implemented against.
| Phase | Extension points | What happens |
|---|---|---|
| Scheduling cycle | PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit | Pick exactly one node, synchronously |
| Binding cycle | PreBind, Bind, PostBind | Persist the assignment, possibly asynchronously |
The mental model that matters:
- Filtering answers “can this pod run here at all?” Plugins like NodeAffinity, TaintToleration, NodeResourcesFit, and PodTopologySpread each return feasible/infeasible per node. A node has to pass every filter. If zero nodes survive, the pod is Unschedulable and PostFilter runs, which is where preemption lives.
- Scoring ranks the survivors. Each scoring plugin returns 0-100 per node; the framework applies plugin weights and sums them. NodeResourcesBalancedAllocation, ImageLocality, InterPodAffinity, and PodTopologySpread all contribute. The highest total wins; ties break randomly.
- Binding writes the nodeName to the pod’s spec. The kubelet on that node takes over from there.
A critical consequence: filtering is hard, scoring is soft. requiredDuringSchedulingIgnoredDuringExecution rules become filters (a pod will go Pending forever rather than violate them). preferredDuringScheduling... rules become scores (the scheduler tries, then gives up and places the pod anyway). Choosing between the two is the single most important decision in every spec below.
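The hard/soft split can be sketched in a few lines of Python. This is a toy model, not the scheduler framework: the Node class, the schedule function, and the single CPU check are hypothetical stand-ins for the real plugins. Required rules remove nodes from consideration, preferred rules only add weight to the survivors.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)
    free_cpu_millis: int = 0


def schedule(pod_cpu_millis, required, preferred, nodes, rng=random):
    """Return the chosen node name, or None when no node is feasible.

    required:  {label: value} pairs a node must carry (hard, filter phase)
    preferred: [(label, value, weight)] preferences (soft, score phase)
    """
    # Filter phase: a node must pass every hard check to stay feasible.
    feasible = [
        n for n in nodes
        if n.free_cpu_millis >= pod_cpu_millis
        and all(n.labels.get(k) == v for k, v in required.items())
    ]
    if not feasible:
        return None  # the real scheduler would now run PostFilter

    # Score phase: unmet preferences cost points but never exclude a node.
    def score(node):
        return sum(w for k, v, w in preferred if node.labels.get(k) == v)

    best = max(score(n) for n in feasible)
    # Ties are broken randomly among the top-scoring nodes.
    return rng.choice([n for n in feasible if score(n) == best]).name
```

A preference that no node satisfies still yields a placement, while a requirement that no node satisfies yields None, which corresponds to the Pending state.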
You can see the framework’s view of an unschedulable pod directly:
kubectl get pod payments-7d9c-abcde -o yaml | yq '.status.conditions'
# look for type: PodScheduled, status: "False", reason: Unschedulable
kubectl describe pod payments-7d9c-abcde | sed -n '/Events:/,$p'
2. Node affinity and nodeSelector vs matchLabelKeys
nodeSelector is the blunt instrument: a flat map of label key/value pairs that must all match. It is still fine for trivial cases.
spec:
  nodeSelector:
    kubernetes.io/arch: arm64
    node.kubernetes.io/instance-type: m7g.2xlarge
Node affinity is the expressive version. It supports operators (In, NotIn, Exists, DoesNotExist, Gt, Lt) and both hard and soft variants:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["on-demand"] # prefer on-demand, tolerate spot
Two semantics worth internalizing:
- Multiple nodeSelectorTerms are OR-ed; multiple matchExpressions within one term are AND-ed. This is the opposite of most people’s first guess and a common source of “why is nothing scheduling.”
- IgnoredDuringExecution means the rule is evaluated at schedule time only. A node label that changes after the pod is bound does not evict it. There is no stable RequiredDuringExecution variant; do not design around eviction-on-relabel.
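The OR/AND rule is easy to get backwards, so here is a minimal Python sketch of the matching logic. The helper names are hypothetical and the cpu-count label in the usage note is invented for illustration; the operator behavior follows the documented semantics as I understand them, including that an empty term matches nothing.

```python
def expr_matches(expr, labels):
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A node without the label also satisfies NotIn.
        return labels.get(key) not in values
    if op in ("Gt", "Lt"):
        try:
            have, want = int(labels[key]), int(values[0])
        except (KeyError, IndexError, ValueError):
            return False
        return have > want if op == "Gt" else have < want
    raise ValueError(f"unknown operator: {op}")


def node_matches(node_selector_terms, labels):
    """Terms are OR-ed; expressions inside a single term are AND-ed."""
    for term in node_selector_terms:
        exprs = term.get("matchExpressions", [])
        # An empty term matches no nodes.
        if exprs and all(expr_matches(e, labels) for e in exprs):
            return True
    return False
```

Splitting two expressions into two terms widens the match ("zone is 1b OR cpu-count > 8"); keeping them in one term narrows it ("zone is 1b AND cpu-count > 8").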
matchLabelKeys is the newer addition (beta and enabled by default for topology spread since 1.27; for pod affinity it reached beta later). It does not match against node labels at all, and node affinity has no such field. It tells the scheduler to derive part of a pod-label selector from the incoming pod’s own labels, keyed by label name. Its primary job is to scope a constraint per rollout using pod-template-hash, so a Deployment update does not see old and new ReplicaSets as one spreading domain. You will see it used in Section 4 where it actually matters; for node affinity, matchExpressions is the tool.
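What "derive part of the constraint from the incoming pod’s own labels" means can be shown with a small Python sketch. The function names and the hash values are hypothetical; the only behavior modeled is that each listed key is looked up on the incoming pod and merged into the selector, and that keys the pod does not carry are ignored.

```python
def effective_selector(match_labels, match_label_keys, incoming_pod_labels):
    """Merge the incoming pod's values for match_label_keys into the selector."""
    merged = dict(match_labels)
    for key in match_label_keys:
        if key in incoming_pod_labels:  # keys missing on the pod are skipped
            merged[key] = incoming_pod_labels[key]
    return merged


def selects(selector, pod_labels):
    """True when a pod carries every key/value pair in the selector."""
    return all(pod_labels.get(k) == v for k, v in selector.items())
```

With pod-template-hash in the key list, a pod from the new ReplicaSet only counts other pods from the same ReplicaSet, so old replicas stop influencing its placement.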
3. Inter-pod affinity and anti-affinity
Pod affinity constrains placement relative to other pods, evaluated within a topologyKey — a node label that defines the domain (“same node,” “same zone”). This is how you spread replicas and co-locate caches.
Anti-affinity to spread replicas across nodes (so a node failure never takes two replicas):
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: payments-api
          topologyKey: kubernetes.io/hostname
Affinity to co-locate a cache with the app it serves in the same zone (a cheap way to avoid cross-AZ latency and transfer cost). This goes on the cache’s pod spec and selects the app’s pods:
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: payments-api
            topologyKey: topology.kubernetes.io/zone
Performance warning: inter-pod affinity is O(pods x nodes) to evaluate and does not scale the way topology spread does. The upstream guidance is explicit — avoid required pod anti-affinity in clusters beyond a few hundred nodes; use topology spread constraints for spreading at scale and reserve pod affinity for genuine co-location intent. Required hostname anti-affinity also caps your replica count at the node count, which produces silent Pending pods during scale-up.
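The replica cap is simple arithmetic, sketched here in Python with a hypothetical helper. It assumes every node is otherwise feasible and models only the one-replica-per-node rule.

```python
def place_with_hostname_anti_affinity(replicas, node_names):
    """Greedy placement under required one-replica-per-node anti-affinity.

    Returns (nodes used, number of replicas left Pending).
    """
    free = list(node_names)
    used, pending = [], 0
    for _ in range(replicas):
        if free:
            used.append(free.pop(0))  # each node can host at most one replica
        else:
            pending += 1              # no empty node left: the pod stays Pending
    return used, pending
```

Scaling to five replicas on three nodes leaves two Pending until the autoscaler adds nodes, which is the wedge described above.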
4. Topology spread constraints: maxSkew, minDomains, whenUnsatisfiable
Topology spread is the purpose-built, scalable mechanism for even distribution. It evaluates the skew — the difference in matching-pod count between the most and least populated domains — and keeps it under a bound.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      minDomains: 3
      matchLabelKeys:
        - pod-template-hash
      labelSelector:
        matchLabels:
          app: payments-api
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: payments-api
Each knob, precisely:
- maxSkew is the maximum allowed (max domain count) - (min domain count). With maxSkew: 1 and three zones, replica counts like 2/2/1 are legal; 3/1/1 is not.
- whenUnsatisfiable is the hard/soft switch. DoNotSchedule makes the constraint a filter (the pod goes Pending). ScheduleAnyway makes it a score (best-effort). The pattern above is the production default: hard spread across zones, soft spread across nodes.
- minDomains (stable since 1.30, beta before that) sets the minimum number of eligible domains. Domains come from the topology labels of eligible nodes, not from where pods already run, so a zone that has nodes but no replicas yet already counts with a value of 0. What minDomains adds is a rule for the case where fewer than N domains have eligible nodes at all, for example when the autoscaler has only provisioned one zone so far: the scheduler then treats the global minimum as 0, so each existing domain can hold at most maxSkew matching pods and further replicas go Pending until capacity appears in another zone. Without it, a Deployment can land entirely in the only zone that currently has nodes. Set minDomains to your zone count for any workload that needs true zonal spread from replica one. It is only valid with whenUnsatisfiable: DoNotSchedule.
- matchLabelKeys appends the named pod labels to the selector at scheduling time. Adding pod-template-hash means a rolling update computes skew per ReplicaSet, so old and new pods are not pooled together, which otherwise lets a deploy temporarily violate spread or block on it.
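Two per-constraint fields also feed in: nodeAffinityPolicy (default Honor) and nodeTaintsPolicy (default Ignore) control whether nodes the pod could not run on anyway are excluded from skew math. With the defaults, nodes that fail the pod’s nodeSelector or node affinity are left out, while tainted nodes the pod does not tolerate are still counted. Leave them at the defaults unless you have a specific reason; a common one is setting nodeTaintsPolicy: Honor in clusters with tainted, dedicated node pools.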
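The DoNotSchedule check reduces to one inequality, sketched below in Python. The function name is hypothetical, and the model assumes the counts already reflect only eligible domains (after nodeAffinityPolicy and nodeTaintsPolicy have been applied), with empty-but-eligible domains present at 0.

```python
def spread_allows(counts, candidate, max_skew=1, min_domains=None):
    """Would one more matching pod in `candidate` satisfy a DoNotSchedule constraint?

    counts maps each eligible domain (for example a zone) to its current
    number of matching pods.
    """
    global_min = min(counts.values())
    # minDomains: with fewer eligible domains than required, treat the minimum as 0.
    if min_domains is not None and len(counts) < min_domains:
        global_min = 0
    return counts[candidate] + 1 - global_min <= max_skew
```

Starting from 2/1/1, a pod may go to a zone holding 1 (giving 2/2/1) but not to the zone holding 2 (which would give 3/1/1). With a single provisioned zone already holding one replica, the second replica is allowed without minDomains and rejected with minDomains set to 3.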
5. Taints, tolerations, and dedicated node pools
Affinity is pod-attracts-node. Taints are the inverse — node-repels-pod — and they are how you reserve hardware. A taint has a key, value, and effect:
| Effect | Behavior |
|---|---|
| NoSchedule | New pods without a matching toleration are not scheduled here |
| PreferNoSchedule | Soft version; scheduler avoids but may place |
| NoExecute | As NoSchedule, and evicts already-running pods that do not tolerate it |
Taint a GPU pool so only GPU workloads land there:
kubectl taint nodes -l node.kubernetes.io/instance-type=g5.xlarge \
nvidia.com/gpu=present:NoSchedule
Only pods that explicitly tolerate it are eligible:
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"
The pairing that trips people up: a toleration is permission, not attraction. Tolerating the GPU taint lets a pod run there, but does not stop it from being scheduled onto an ordinary node. To actually pin GPU pods to GPU nodes you need both a toleration (to get past the taint) and node affinity or a nodeSelector (to require the GPU label). For hard multi-tenant isolation, apply the same dual pattern per tenant: taint tenant=acme:NoExecute and require tenant: acme via affinity.
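A short Python sketch makes the permission-versus-attraction point concrete. The dictionaries mirror the manifest fields, the function names are hypothetical, and PreferNoSchedule is deliberately left out of the filter because it only affects scoring.

```python
def tolerates(toleration, taint):
    """True when a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists with an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def feasible_nodes(pod, nodes):
    """Names of nodes that pass both the taint filter and the nodeSelector filter."""
    result = []
    for node in nodes:
        hard = [t for t in node.get("taints", [])
                if t["effect"] in ("NoSchedule", "NoExecute")]
        taints_ok = all(
            any(tolerates(tol, t) for tol in pod.get("tolerations", []))
            for t in hard
        )
        selector_ok = all(node["labels"].get(k) == v
                          for k, v in pod.get("nodeSelector", {}).items())
        if taints_ok and selector_ok:
            result.append(node["name"])
    return result
```

A pod with only the toleration is feasible on both the GPU node and the ordinary node; adding the nodeSelector is what narrows it to the GPU node.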
NoExecute additionally supports tolerationSeconds, which is how the node-lifecycle controller’s node.kubernetes.io/not-ready and unreachable taints give pods a grace window (default 300s) before eviction:
tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30 # evict fast for latency-critical pods
6. PriorityClasses, preemption, and protecting critical workloads
When the cluster is full, who wins? Pod priority decides. A PriorityClass maps a name to an integer; higher is more important.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Control-plane adjacent and Tier-0 platform services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: best-effort-batch
value: 10000
globalDefault: false
preemptionPolicy: Never
description: "Batch jobs that should run only on genuinely spare capacity."
Wire it into a pod with priorityClassName: platform-critical.
How preemption works: when a high-priority pod fails filtering (no feasible node), the PostFilter phase looks for a node where evicting one or more lower-priority pods would make the pending pod fit. It picks the node that minimizes disruption, records it in the pending pod’s status.nominatedNodeName, then deletes the victims (respecting their graceful termination). The victims themselves are gone; if a controller such as a Deployment owns them, it creates replacements, which start Pending and schedule elsewhere if they can.
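The node-selection step can be approximated in Python as follows. This is a deliberately simplified model with hypothetical names: it tracks CPU only, evicts lowest priority first, and ranks candidate nodes by PDB violations, then highest victim priority, then victim count. The real plugin considers more criteria and reprieves victims differently.

```python
def pick_preemption(pending, nodes):
    """Choose a node and victims for a pod that failed filtering.

    pending: {"cpu": int, "priority": int}
    nodes:   {name: {"free": int, "pods": [{"name", "cpu", "priority", "pdb_blocked"}]}}
    Returns (node_name, [victim names]) or None when preemption cannot help.
    """
    candidates = []
    for name, node in nodes.items():
        lower = sorted(
            (p for p in node["pods"] if p["priority"] < pending["priority"]),
            key=lambda p: p["priority"],
        )
        freed, victims = node["free"], []
        for pod in lower:
            if freed >= pending["cpu"]:
                break  # evict only as many pods as needed
            freed += pod["cpu"]
            victims.append(pod)
        if victims and freed >= pending["cpu"]:
            candidates.append((
                sum(p.get("pdb_blocked", False) for p in victims),  # fewest PDB violations
                max(p["priority"] for p in victims),                # least important victims
                len(victims),                                       # fewest victims
                name,
                [p["name"] for p in victims],
            ))
    if not candidates:
        return None
    best = min(candidates)
    return best[3], best[4]
```

Given one node whose only victim is PDB-protected and another with unprotected batch pods, the sketch picks the batch node, and it returns None when the pending pod outranks nobody.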
Three guardrails that matter in production:
- preemptionPolicy: Never lets a pod be high-priority for queue ordering without ever evicting anyone. This is exactly right for important-but-not-urgent batch: it jumps the line for free capacity but never knocks out a serving pod.
- System-reserved classes system-cluster-critical and system-node-critical ship built in (values of about 2 billion). User-defined classes are capped at 1 billion, so you cannot outrank them; reserve those tiers for control-plane and node agents rather than assigning them to your own workloads.
- Preemption respects PodDisruptionBudgets on a best-effort basis only. The scheduler prefers victims whose eviction does not violate a PDB, but if no such set exists, it will preempt across a PDB rather than leave the higher-priority pod Pending. PDBs are not a hard shield against preemption; priority is.
kubectl get pods -A \
-o custom-columns='NS:.metadata.namespace,POD:.metadata.name,PRIO:.spec.priority,PC:.spec.priorityClassName' \
| sort -k3 -n -r | head
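Queue ordering, which is the part of priority that still applies under preemptionPolicy: Never, can be sketched with a heap. This assumes the default ordering of higher priority first and earlier arrival first among equals; the function name and the arrival sequence numbers are hypothetical.

```python
import heapq


def scheduling_order(pods):
    """pods: iterable of (name, priority, arrival_seq). Returns names in dequeue order."""
    # Negate priority so the highest value pops first; arrival_seq breaks ties.
    heap = [(-priority, seq, name) for name, priority, seq in pods]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

A non-preempting batch pod at 10000 is still dequeued ahead of a pod with priority 0; it simply waits if no node has room for it.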
7. Pod Disruption Budgets, the descheduler, and node drains
A PodDisruptionBudget bounds voluntary disruption — kubectl drain, node-pool upgrades, the descheduler, autoscaler scale-down. It does nothing for involuntary events (a node dying) and, as noted, only soft-protects against preemption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api
spec:
  minAvailable: 2 # or maxUnavailable: 1
  selector:
    matchLabels:
      app: payments-api
minAvailable and maxUnavailable are mutually exclusive, so pick one. A drain blocks (the eviction API returns 429 and kubectl retries) until evicting the next pod would no longer breach the budget:
kubectl drain ip-10-1-2-3.eu-west-1.compute.internal \
--ignore-daemonsets --delete-emptydir-data --grace-period=120
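The budget arithmetic behind that 429 can be sketched in Python. This models integer budgets only (percentages are left out), and the function names are hypothetical; the status codes are the ones the eviction API is documented to return for allowed and budget-blocked evictions.

```python
def disruptions_allowed(current_healthy, min_available=None,
                        max_unavailable=None, expected_pods=None):
    """How many more voluntary evictions the budget permits right now."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available / max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        if expected_pods is None:
            raise ValueError("max_unavailable needs expected_pods")
        desired_healthy = expected_pods - max_unavailable
    return max(0, current_healthy - desired_healthy)


def evict_status(current_healthy, **budget):
    """HTTP status this model returns for one eviction request."""
    return 200 if disruptions_allowed(current_healthy, **budget) > 0 else 429
```

With minAvailable: 2 and three healthy replicas, one eviction is allowed; once only two are healthy, the next request gets 429 until a replacement becomes Ready.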
The descheduler is the counterweight to the scheduler’s point-in-time decisions. The scheduler never moves a pod once bound, so over time you get drift — pods stranded on nodes that violate affinity after a relabel, lopsided utilization after scale events, topology skew that grew as nodes were added. The descheduler runs as a CronJob or Deployment, finds pods that would not be scheduled the same way today, and evicts them so the scheduler replaces them better. It honors PDBs and priority by default.
A profile that targets the two highest-value strategies:
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: rebalance
    pluginConfig:
      - name: RemovePodsViolatingTopologySpreadConstraint
        args:
          constraints:
            - DoNotSchedule
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 25
            memory: 25
          targetThresholds:
            cpu: 70
            memory: 70
    plugins:
      balance:
        enabled:
          - RemovePodsViolatingTopologySpreadConstraint
          - LowNodeUtilization
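LowNodeUtilization works in the opposite direction from what the name suggests: nodes below every value in thresholds are classed as underutilized, nodes above any value in targetThresholds as overutilized, and the plugin evicts pods from the overutilized nodes so the scheduler can place the replacements on the underused ones. It only acts when both groups exist. If your goal is instead to empty lightly used nodes so the autoscaler can remove them, that is the separate HighNodeUtilization plugin. Run the descheduler on a schedule (every few minutes to hourly), never tighter than your rollout cadence, and always with PDBs in place so it cannot evict past your availability floor.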
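The node classification used by LowNodeUtilization can be sketched in Python. The function name and the utilization figures are hypothetical, percentages stand for requested resources over allocatable, and exact boundary handling is an assumption: nodes below every threshold are destinations, nodes above any target threshold are eviction sources, and everything in between is left alone.

```python
def classify(nodes, thresholds, target_thresholds):
    """Split nodes into (underutilized, overutilized) name lists.

    nodes: {name: {"cpu": pct, "memory": pct}}
    """
    under, over = [], []
    for name, usage in nodes.items():
        if all(usage[r] < thresholds[r] for r in thresholds):
            under.append(name)  # below every threshold: room for replacements
        elif any(usage[r] > target_thresholds[r] for r in target_thresholds):
            over.append(name)   # above any target: pods are evicted from here
    return under, over
```

With the 25/70 profile above, a node at 10% CPU and 60% memory is neither a source nor a destination, because only one of its two resources is below the lower threshold.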
Verify
Confirm each control is doing what you intended before you trust it under load.
Spread across zones is actually achieved:
kubectl get pods -l app=payments-api -o wide --no-headers \
| awk '{print $7}' | sort | uniq -c
# join node -> zone if needed:
kubectl get nodes -L topology.kubernetes.io/zone
A pending pod’s reason, per filter, including preemption verdicts:
kubectl describe pod <pending-pod> | sed -n '/Events:/,$p'
# typical messages:
# "0/12 nodes are available: 4 node(s) didn't match pod topology spread constraints,
# 8 node(s) had untolerated taint {nvidia.com/gpu: present}."
# "0/12 nodes are available: ... preemption: 0/12 nodes are available:
# 12 No preemption victims found for incoming pod."
Taints and tolerations line up:
kubectl get nodes -o json \
| jq -r '.items[] | "\(.metadata.name)\t\(.spec.taints // [])"'
PDB headroom before a drain:
kubectl get pdb payments-api \
-o custom-columns='NAME:.metadata.name,MIN:.spec.minAvailable,ALLOWED:.status.disruptionsAllowed,CURRENT:.status.currentHealthy'
Priority and preemption events are visible cluster-wide:
kubectl get events -A --field-selector reason=Preempted \
--sort-by=.lastTimestamp | tail
Enterprise scenario
A payments platform team ran a regional EKS cluster across three AZs and a 6-replica Deployment of their authorization service behind a minAvailable: 4 PDB. They believed they were zone-resilient. During an eu-west-1a impairment, the service dropped below quorum and latency spiked, and the postmortem found five of six replicas had been running in eu-west-1a.
Root cause was two compounding gaps. First, they used only required pod anti-affinity on kubernetes.io/hostname, which guarantees one-replica-per-node but says nothing about zones; the cluster autoscaler had grown the 1a node group first during a prior scale event, and the scheduler happily filled it. Second, they had a soft zone spread (ScheduleAnyway) that the scheduler abandoned the moment scoring preferred the warm, already-provisioned 1a nodes.
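The exposure is plain arithmetic, sketched here in Python with hypothetical helper names. The location of the sixth replica is not stated in the postmortem summary, so the 5/1/0 split below is an assumption for illustration.

```python
def survivors_after_zone_loss(replicas_by_zone, lost_zone):
    """Replicas still running once every pod in lost_zone is gone."""
    return sum(n for zone, n in replicas_by_zone.items() if zone != lost_zone)


def worst_case_survivors(replicas_by_zone):
    """Survivor count if the most heavily loaded zone is the one that fails."""
    return min(survivors_after_zone_loss(replicas_by_zone, z)
               for z in replicas_by_zone)
```

A 5/1/0 layout has a worst case of one survivor, far below a floor of four, while an even 2/2/2 layout keeps four replicas through the loss of any single zone.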
The fix was to make zone spread a filter, force the domain count from replica one, and scope it per rollout so deploys would not thrash:
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      minDomains: 3
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      matchLabelKeys: [pod-template-hash]
      labelSelector:
        matchLabels:
          app: authz
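They kept the hostname anti-affinity but downgraded it from required to preferred (preferredDuringSchedulingIgnoredDuringExecution) so scale-up could never wedge on it, set the authz Deployment to a platform-critical PriorityClass with preemptionPolicy: PreemptLowerPriority so it could reclaim capacity from batch during an AZ loss, and added a descheduler RemovePodsViolatingTopologySpreadConstraint pass on a 10-minute schedule to correct any drift the autoscaler reintroduced. The next quarter’s GameDay zone-kill held quorum: 2/2/2 going in, 2/0/2 surviving, which still meets the minAvailable: 4 floor. Be aware of the trade-off on the recovery path: with DoNotSchedule, maxSkew: 1, and minDomains: 3, a third replica in 1a or 1c would exceed the allowed skew while 1b holds zero, so expect replacements for the two displaced replicas to stay Pending until 1b capacity returns rather than piling into the surviving zones.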
The lesson the team wrote down: soft constraints describe intent; only hard constraints survive a bad day. Zonal availability that you cannot lose is a DoNotSchedule plus minDomains, never a preference.