Advanced Kubernetes Scheduling: Affinity, Topology Spread Constraints, Taints, and Priority-Based Preemption

The default scheduler will keep your cluster running, but it will not keep it balanced or resilient unless you tell it how. Left alone, kube-scheduler packs pods onto whichever feasible node scores highest, and “highest” rarely means “spread across three failure domains so a zone outage does not take down quorum.” This guide walks the scheduling cycle end to end, then layers on the four controls that actually shape placement in production: affinity, topology spread constraints, taints and tolerations, and PriorityClasses with preemption. Everything here targets Kubernetes 1.29+, where minDomains and matchLabelKeys for topology spread are enabled by default; check the feature state for your exact version, since minDomains only graduated to stable in 1.30.

1. The scheduling cycle: filtering, scoring, binding

kube-scheduler runs one pod at a time off the head of a priority queue. For each pod it executes a two-phase cycle built on the scheduler framework, a set of extension points (plugins) that the in-tree behavior itself is implemented against.

Phase	Extension points	What happens
Scheduling cycle	`PreFilter`, `Filter`, `PostFilter`, `PreScore`, `Score`, `Reserve`, `Permit`	Pick exactly one node, synchronously
Binding cycle	`PreBind`, `Bind`, `PostBind`	Persist the assignment, possibly asynchronously

The mental model that matters:

Filtering answers “can this pod run here at all?” Plugins like NodeAffinity, TaintToleration, NodeResourcesFit, and PodTopologySpread each return feasible/infeasible per node. A node has to pass every filter. If zero nodes survive, the pod is Unschedulable and PostFilter runs — which is where preemption lives.
Scoring ranks the survivors. Each scoring plugin returns 0-100 per node; the framework applies plugin weights and sums them. NodeResourcesBalancedAllocation, ImageLocality, InterPodAffinity, and PodTopologySpread all contribute. The highest total wins; ties break randomly.
Binding writes the nodeName to the pod’s spec. The kubelet on that node takes over from there.

A critical consequence: filtering is hard, scoring is soft. requiredDuringSchedulingIgnoredDuringExecution rules become filters (a pod will go Pending forever rather than violate them). preferredDuringScheduling... rules become scores (the scheduler tries, then gives up and places the pod anyway). Choosing between the two is the single most important decision in every spec below.
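The hard/soft split can be sketched in a few lines. This is a toy model, not the scheduler's code: the node records, the three rule functions, and the `schedule` helper are all hypothetical names invented for illustration.

```python
def schedule(nodes, filters, scorers):
    """Return the winning node name, or None if the pod is unschedulable."""
    # Filtering is hard: a node must pass every filter.
    feasible = [n for n in nodes if all(f(n) for f in filters)]
    if not feasible:
        return None  # in the real scheduler, PostFilter (preemption) runs here

    # Scoring is soft: weighted sum of 0-100 scores, highest total wins.
    def total(node):
        return sum(weight * score(node) for score, weight in scorers)

    # max() keeps the first of equal totals; the real scheduler breaks ties randomly.
    return max(feasible, key=total)["name"]


nodes = [
    {"name": "a1", "zone": "eu-west-1a", "free_cpu": 2},
    {"name": "b1", "zone": "eu-west-1b", "free_cpu": 6},
    {"name": "c1", "zone": "eu-west-1c", "free_cpu": 8},
]


def in_allowed_zone(node):   # hard rule, like a required node affinity
    return node["zone"] in {"eu-west-1a", "eu-west-1b"}


def fits(node):              # hard rule, like NodeResourcesFit
    return node["free_cpu"] >= 4


def prefers_zone_a(node):    # soft rule, like a preferred node affinity
    return 100 if node["zone"] == "eu-west-1a" else 0


print(schedule(nodes, [in_allowed_zone, fits], [(prefers_zone_a, 1)]))  # b1
```

The pod lands on b1 even though the preference points at zone 1a: a1 was filtered out for lack of CPU, and a preference cannot bring a filtered node back.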

You can see the framework’s view of an unschedulable pod directly:

kubectl get pod payments-7d9c-abcde -o yaml | yq '.status.conditions'
# look for type: PodScheduled, status: "False", reason: Unschedulable
kubectl describe pod payments-7d9c-abcde | sed -n '/Events:/,$p'

2. Node affinity and nodeSelector vs matchLabelKeys

nodeSelector is the blunt instrument: a flat map of label key/value pairs that must all match. It is still fine for trivial cases.

spec:
  nodeSelector:
    kubernetes.io/arch: arm64
    node.kubernetes.io/instance-type: m7g.2xlarge

Node affinity is the expressive version. It supports operators (In, NotIn, Exists, DoesNotExist, Gt, Lt) and both hard and soft variants:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["on-demand"]   # prefer on-demand, tolerate spot

Two semantics worth internalizing:

Multiple nodeSelectorTerms are OR-ed; multiple matchExpressions within one term are AND-ed. This is the opposite of most people’s first guess and a common source of “why is nothing scheduling.”
IgnoredDuringExecution means the rule is evaluated at schedule time only. A node label that changes after the pod is bound does not evict it. There is no stable RequiredDuringExecution variant; do not design around eviction-on-relabel.
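The OR/AND rule is easy to misread, so here is a minimal sketch of it. The helper names are hypothetical, Gt and Lt are left out for brevity, and the label sets are made up for the example.

```python
def expr_matches(labels, expr):
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator {op!r} is not handled in this sketch")  # Gt/Lt omitted


def node_matches(labels, node_selector_terms):
    # OR across terms, AND across the expressions inside one term.
    return any(
        all(expr_matches(labels, e) for e in term["matchExpressions"])
        for term in node_selector_terms
    )


terms = [
    {"matchExpressions": [
        {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["eu-west-1a"]},
        {"key": "kubernetes.io/arch", "operator": "In", "values": ["arm64"]},
    ]},
    {"matchExpressions": [
        {"key": "nvidia.com/gpu.present", "operator": "Exists"},
    ]},
]

arm_in_1a = {"topology.kubernetes.io/zone": "eu-west-1a", "kubernetes.io/arch": "arm64"}
amd_in_1a = {"topology.kubernetes.io/zone": "eu-west-1a", "kubernetes.io/arch": "amd64"}
gpu_in_1b = {"topology.kubernetes.io/zone": "eu-west-1b", "nvidia.com/gpu.present": "true"}

print(node_matches(arm_in_1a, terms))  # True: the first term matches in full
print(node_matches(amd_in_1a, terms))  # False: zone matches, arch does not
print(node_matches(gpu_in_1b, terms))  # True: the second term matches on its own
```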

matchLabelKeys is the newer addition (enabled by default for topology spread since 1.27; for pod affinity it arrived later, as alpha in 1.29). It does not match against node labels at all. It tells the scheduler to derive part of the constraint from the incoming pod’s own labels, keyed by label name. Its primary job is to scope a constraint per rollout using pod-template-hash, so a Deployment update does not see old and new ReplicaSets as one spreading domain. You will see it used in Section 4 where it actually matters. Node affinity has no matchLabelKeys field, so node rules stay with plain matchExpressions.

3. Inter-pod affinity and anti-affinity

Pod affinity constrains placement relative to other pods, evaluated within a topologyKey — a node label that defines the domain (“same node,” “same zone”). This is how you spread replicas and co-locate caches.

Anti-affinity to spread replicas across nodes (so a node failure never takes two replicas):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: payments-api
          topologyKey: kubernetes.io/hostname

Affinity to co-locate a cache with the app it serves in the same zone (keeps traffic off cross-AZ links, which saves latency and transfer cost):

spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: payments-api
            topologyKey: topology.kubernetes.io/zone

Performance warning: inter-pod affinity and anti-affinity are expensive to evaluate, roughly O(pods x nodes), and do not scale the way topology spread does. The upstream guidance is explicit: they are not recommended in clusters larger than several hundred nodes. Use topology spread constraints for spreading at scale and reserve pod affinity for genuine co-location intent. Required hostname anti-affinity also caps your replica count at the node count, so any replica beyond that stays Pending during scale-up until a new node appears.
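The replica cap is simple counting. The sketch below is a toy greedy placement with a hypothetical helper name; it ignores resources and every other filter, and only enforces "no two matching pods on the same hostname."

```python
def place_with_hostname_anti_affinity(replicas, nodes):
    """Place replicas one per node; return (occupied nodes, pending count)."""
    occupied = set()
    pending = 0
    for _ in range(replicas):
        free = [n for n in nodes if n not in occupied]
        if free:
            occupied.add(free[0])   # any node without a matching pod is feasible
        else:
            pending += 1            # every node already hosts a replica
    return sorted(occupied), pending


print(place_with_hostname_anti_affinity(5, ["n1", "n2", "n3"]))
# (['n1', 'n2', 'n3'], 2)
```

Scaling to five replicas on three nodes leaves two pods Pending until the autoscaler adds nodes.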

4. Topology spread constraints: maxSkew, minDomains, whenUnsatisfiable

Topology spread is the purpose-built, scalable mechanism for even distribution. It evaluates the skew — the difference in matching-pod count between the most and least populated domains — and keeps it under a bound.

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      minDomains: 3
      matchLabelKeys:
        - pod-template-hash
      labelSelector:
        matchLabels:
          app: payments-api
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: payments-api

Each knob, precisely:

maxSkew is the maximum allowed (max domain count) - (min domain count). With maxSkew: 1 and three zones, replica counts like 2/2/1 are legal; 3/1/1 is not.
whenUnsatisfiable is the hard/soft switch. DoNotSchedule makes the constraint a filter (the pod goes Pending). ScheduleAnyway makes it a score (best-effort). The pattern above is the production default: hard spread across zones, soft spread across nodes.
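minDomains (enabled by default since 1.27, stable in 1.30) sets the minimum number of eligible domains. When fewer domains than that currently have eligible nodes, the scheduler treats the global minimum as 0 when computing skew. Without it, domains are only the zones that have nodes right now: if the autoscaler has provisioned a single zone so far, that zone is the only domain, skew is trivially satisfied, and a small Deployment can land entirely in it. With it, replicas beyond maxSkew per zone stay Pending until nodes in other zones exist, which is the signal the autoscaler needs to add them. Set minDomains to your zone count for any workload that needs true zonal spread from replica one. It is only valid with whenUnsatisfiable: DoNotSchedule.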
matchLabelKeys appends the named pod labels to the selector at scheduling time. Adding pod-template-hash means a rolling update computes skew per ReplicaSet, so old and new pods are not pooled together — which otherwise lets a deploy temporarily violate spread or block on it.
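The skew check for a DoNotSchedule constraint comes down to one inequality: pods already in the candidate domain, plus the incoming pod, minus the global minimum, must not exceed maxSkew. A minimal sketch, assuming the incoming pod matches its own selector; the function name and zone keys are hypothetical.

```python
def can_place(counts, zone, max_skew, min_domains=None):
    """counts: matching pods per eligible domain, e.g. {"1a": 2, "1b": 1}."""
    global_min = min(counts.values())
    # With fewer eligible domains than minDomains, the global minimum is treated as 0.
    if min_domains is not None and len(counts) < min_domains:
        global_min = 0
    return counts[zone] + 1 - global_min <= max_skew


three_zones = {"1a": 2, "1b": 2, "1c": 1}
print(can_place(three_zones, "1c", max_skew=1))   # True:  2 - 1 = 1
print(can_place(three_zones, "1a", max_skew=1))   # False: 3 - 1 = 2

one_zone = {"1a": 1}   # only one zone has nodes so far
print(can_place(one_zone, "1a", max_skew=1))                  # True:  2 - 1 = 1
print(can_place(one_zone, "1a", max_skew=1, min_domains=3))   # False: 2 - 0 = 2
```

The last two lines are the minDomains effect in miniature: without it a single-zone cluster keeps accepting replicas, with it the second replica waits for another zone.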

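Two per-constraint fields also feed in: nodeAffinityPolicy (default Honor) and nodeTaintsPolicy (default Ignore) control whether nodes the pod could not run on anyway are excluded from skew math. With the defaults, nodes excluded by the pod’s nodeAffinity or nodeSelector are left out, while tainted nodes are still counted even if the pod does not tolerate them. Set nodeTaintsPolicy: Honor on the constraint if dedicated, tainted pools are distorting your skew.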

5. Taints, tolerations, and dedicated node pools

Affinity is pod-attracts-node. Taints are the inverse — node-repels-pod — and they are how you reserve hardware. A taint has a key, value, and effect:

Effect	Behavior
`NoSchedule`	New pods without a matching toleration are not scheduled here
`PreferNoSchedule`	Soft version; scheduler avoids but may place
`NoExecute`	As `NoSchedule`, and evicts already-running pods that do not tolerate it

Taint a GPU pool so only GPU workloads land there:

kubectl taint nodes -l node.kubernetes.io/instance-type=g5.xlarge \
  nvidia.com/gpu=present:NoSchedule

Only pods that explicitly tolerate it are eligible:

spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: present
      effect: NoSchedule
  nodeSelector:
    nvidia.com/gpu.present: "true"

The pairing that trips people up: a toleration is permission, not attraction. Tolerating the GPU taint lets a pod run there, but does not stop it from being scheduled onto an ordinary node. To actually pin GPU pods to GPU nodes you need both a toleration (to get past the taint) and node affinity or a nodeSelector (to require the GPU label). For hard multi-tenant isolation, apply the same dual pattern per tenant: taint tenant=acme:NoExecute and require tenant: acme via affinity.
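A sketch of the matching rule makes the "permission, not attraction" point concrete. The helper names are hypothetical and the logic is simplified: it covers key, operator, value, and effect, and skips API validation details.

```python
def tolerates(toleration, taint):
    """True if this toleration matches this taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key (valid only with Exists) matches every key.
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def node_admits(pod_tolerations, node_taints):
    # Only NoSchedule and NoExecute act as filters; PreferNoSchedule affects scoring.
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in hard)


gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Equal",
                  "value": "present", "effect": "NoSchedule"}

print(node_admits([], [gpu_taint]))                # False: untolerated taint
print(node_admits([gpu_toleration], [gpu_taint]))  # True:  allowed onto the GPU node
print(node_admits([gpu_toleration], []))           # True:  still welcome on ordinary nodes
```

The third call is the trap: nothing in the toleration keeps the GPU pod off an untainted node, which is why the nodeSelector or affinity half of the pairing is required.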

NoExecute additionally supports tolerationSeconds, which is how the node-lifecycle controller’s node.kubernetes.io/not-ready and unreachable taints give pods a grace window (default 300s) before eviction:

  tolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 30   # evict fast for latency-critical pods

6. PriorityClasses, preemption, and protecting critical workloads

When the cluster is full, who wins? Pod priority decides. A PriorityClass maps a name to an integer; higher is more important.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Control-plane adjacent and Tier-0 platform services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: best-effort-batch
value: 10000
globalDefault: false
preemptionPolicy: Never
description: "Batch jobs that should run only on genuinely spare capacity."

Wire it into a pod with priorityClassName: platform-critical.

How preemption works: when a high-priority pod fails filtering (no feasible node), the PostFilter phase looks for a node where evicting one or more lower-priority pods would make the pending pod fit. It picks the node that minimizes disruption, records it in the pending pod’s status.nominatedNodeName, then deletes the victims (respecting their graceful termination). The victims are deleted, not moved: if a controller owns them, it creates replacements, which sit Pending until capacity appears elsewhere.
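Victim selection can be sketched as follows. This is a deliberately reduced model with hypothetical names: it considers CPU only and ranks nodes by just two criteria (lowest highest-priority victim, then fewest victims), whereas the real scheduler also weighs PDB violations and other tie-breakers.

```python
def victims_on(node, cpu_request, priority):
    """Lowest-priority pods whose removal frees enough CPU, or None if impossible."""
    free = node["free_cpu"]
    chosen = []
    candidates = sorted(
        (p for p in node["pods"] if p["priority"] < priority),
        key=lambda p: p["priority"],
    )
    for pod in candidates:
        if free >= cpu_request:
            break
        chosen.append(pod)
        free += pod["cpu"]
    return chosen if free >= cpu_request else None


def pick_node(nodes, cpu_request, priority):
    options = []
    for node in nodes:
        victims = victims_on(node, cpu_request, priority)
        if victims:  # None: cannot fit; empty list: fits without preemption
            worst = max(p["priority"] for p in victims)
            options.append((worst, len(victims), node["name"], victims))
    if not options:
        return None
    _, _, name, victims = min(options, key=lambda o: o[:3])
    return name, [p["name"] for p in victims]


nodes = [
    {"name": "n1", "free_cpu": 1, "pods": [
        {"name": "batch-1", "priority": 10000, "cpu": 2},
        {"name": "web-1", "priority": 500000, "cpu": 2},
    ]},
    {"name": "n2", "free_cpu": 0, "pods": [
        {"name": "web-2", "priority": 500000, "cpu": 4},
    ]},
]

print(pick_node(nodes, cpu_request=3, priority=1000000))  # ('n1', ['batch-1'])
print(pick_node(nodes, cpu_request=3, priority=5000))     # None
```

Evicting one batch pod on n1 is cheaper than evicting a serving pod on n2, so n1 wins; a pod with priority 5000 outranks nobody and stays Pending.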

Three guardrails that matter in production:

preemptionPolicy: Never lets a pod be high-priority for queue ordering without ever evicting anyone. This is exactly right for important-but-not-urgent batch: it jumps the line for free capacity but never knocks out a serving pod.
System-reserved classes system-cluster-critical and system-node-critical ship built in (values of 2,000,000,000 and 2,000,001,000). Your own classes cannot exceed them, because user-defined values are capped at 1,000,000,000; reserve the built-in tiers for control-plane and node agents.
Preemption respects PodDisruptionBudgets on a best-effort basis only. The scheduler prefers victims whose eviction does not violate a PDB, but if no such set exists, it will preempt across a PDB rather than leave the higher-priority pod Pending. PDBs are not a hard shield against preemption — priority is.

kubectl get pods -A \
  -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,PRIO:.spec.priority,PC:.spec.priorityClassName' \
  | sort -k3 -n -r | head

7. Pod Disruption Budgets, the descheduler, and node drains

A PodDisruptionBudget bounds voluntary disruption — kubectl drain, node-pool upgrades, the descheduler, autoscaler scale-down. It does nothing for involuntary events (a node dying) and, as noted, only soft-protects against preemption.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api
spec:
  minAvailable: 2          # or maxUnavailable: 1
  selector:
    matchLabels:
      app: payments-api

minAvailable and maxUnavailable are mutually exclusive, so pick one. A drain blocks (the eviction API returns 429 and kubectl retries) until evicting the next pod would not breach the budget:

kubectl drain ip-10-1-2-3.eu-west-1.compute.internal \
  --ignore-daemonsets --delete-emptydir-data --grace-period=120
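The headroom a drain has to work with is plain arithmetic, sketched here for integer budgets only (percentage values are not handled, and the function name is hypothetical). It mirrors what the PDB reports in status.disruptionsAllowed.

```python
def disruptions_allowed(current_healthy, expected, min_available=None, max_unavailable=None):
    """How many more pods may be evicted right now without breaching the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    return max(0, current_healthy - desired_healthy)


print(disruptions_allowed(6, 6, min_available=2))    # 4
print(disruptions_allowed(2, 6, min_available=2))    # 0, so the next eviction gets a 429
print(disruptions_allowed(6, 6, max_unavailable=1))  # 1
```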

The descheduler is the counterweight to the scheduler’s point-in-time decisions. The scheduler never moves a pod once bound, so over time you get drift — pods stranded on nodes that violate affinity after a relabel, lopsided utilization after scale events, topology skew that grew as nodes were added. The descheduler runs as a CronJob or Deployment, finds pods that would not be scheduled the same way today, and evicts them so the scheduler replaces them better. It honors PDBs and priority by default.

A profile that targets the two highest-value strategies:

apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: rebalance
    pluginConfig:
      - name: RemovePodsViolatingTopologySpreadConstraint
        args:
          constraints:
            - DoNotSchedule
      - name: LowNodeUtilization
        args:
          thresholds:
            cpu: 25
            memory: 25
          targetThresholds:
            cpu: 70
            memory: 70
    plugins:
      balance:
        enabled:
          - RemovePodsViolatingTopologySpreadConstraint
          - LowNodeUtilization

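LowNodeUtilization classifies nodes below thresholds (on every listed resource) as underutilized and nodes above targetThresholds (on any listed resource) as overutilized, then evicts pods from the overutilized nodes so the scheduler can place them on the underutilized ones. It evens out load; it does not consolidate. If the goal is to pack pods so the autoscaler can remove emptied nodes, that is the job of the separate HighNodeUtilization plugin. Run the descheduler on a schedule (every few minutes to hourly), never tighter than your rollout cadence, and always with PDBs in place so it cannot evict past your availability floor.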
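A minimal sketch of how the two threshold bands sort nodes, following the descheduler's documented LowNodeUtilization behavior: a node is underutilized only if every listed resource is below thresholds, and overutilized if any listed resource is above targetThresholds. The function name and the usage figures are hypothetical.

```python
THRESHOLDS = {"cpu": 25, "memory": 25}          # from the policy above
TARGET_THRESHOLDS = {"cpu": 70, "memory": 70}   # from the policy above


def classify(usage):
    """usage: percent of allocatable per resource, e.g. {"cpu": 20, "memory": 30}."""
    if all(usage[r] < THRESHOLDS[r] for r in THRESHOLDS):
        return "underutilized"       # destination: evicted pods should land here
    if any(usage[r] > TARGET_THRESHOLDS[r] for r in TARGET_THRESHOLDS):
        return "overutilized"        # source: pods are evicted from here
    return "appropriately utilized"  # left alone


print(classify({"cpu": 10, "memory": 20}))  # underutilized
print(classify({"cpu": 80, "memory": 40}))  # overutilized
print(classify({"cpu": 10, "memory": 40}))  # appropriately utilized
```

Evictions flow from the overutilized group toward the underutilized one, and stop when either group is empty.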

Verify

Confirm each control is doing what you intended before you trust it under load.

Spread across zones is actually achieved:

kubectl get pods -l app=payments-api --no-headers \
  -o custom-columns='NODE:.spec.nodeName' | sort | uniq -c
# that counts pods per node; join node -> zone if needed:
kubectl get nodes -L topology.kubernetes.io/zone

A pending pod’s reason, per filter, including preemption verdicts:

kubectl describe pod <pending-pod> | sed -n '/Events:/,$p'
# typical messages:
#   "0/12 nodes are available: 4 node(s) didn't match pod topology spread constraints,
#    8 node(s) had untolerated taint {nvidia.com/gpu: present}."
#   "0/12 nodes are available: ... preemption: 0/12 nodes are available:
#    12 No preemption victims found for incoming pod."

Taints and tolerations line up:

kubectl get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name)\t\(.spec.taints // [])"'

PDB headroom before a drain:

kubectl get pdb payments-api \
  -o custom-columns='NAME:.metadata.name,MIN:.spec.minAvailable,ALLOWED:.status.disruptionsAllowed,CURRENT:.status.currentHealthy'

Priority and preemption events are visible cluster-wide:

kubectl get events -A --field-selector reason=Preempted \
  --sort-by=.lastTimestamp | tail

Enterprise scenario

A payments platform team ran a regional EKS cluster across three AZs and a 6-replica Deployment of their authorization service behind a minAvailable: 4 PDB. They believed they were zone-resilient. During an eu-west-1a impairment, the service dropped below quorum and latency spiked, and the postmortem found five of six replicas had been running in eu-west-1a.

Root cause was two compounding gaps. First, their only hard rule was required pod anti-affinity on kubernetes.io/hostname, which guarantees one replica per node but says nothing about zones; the cluster autoscaler had grown the 1a node group first during a prior scale event, and the scheduler happily filled it. Second, their zone spread was soft (ScheduleAnyway), so the scheduler abandoned it the moment scoring preferred the warm, already-provisioned 1a nodes.

The fix was to make zone spread a filter, force the domain count from replica one, and scope it per rollout so deploys would not thrash:

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      minDomains: 3
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      matchLabelKeys: [pod-template-hash]
      labelSelector:
        matchLabels:
          app: authz

They downgraded the hostname anti-affinity from required to preferred (so scale-up could never wedge on it), set the authz Deployment to a platform-critical PriorityClass with preemptionPolicy: PreemptLowerPriority so it could reclaim capacity from batch during an AZ loss, and added a descheduler RemovePodsViolatingTopologySpreadConstraint pass on a 10-minute schedule to correct any drift the autoscaler reintroduced. The next quarter’s GameDay zone-kill held quorum: 2/2/2 going in, 2/0/2 surviving, which still satisfies the minAvailable: 4 PDB. One trade-off to expect with DoNotSchedule, maxSkew: 1, and minDomains: 3: the two displaced replicas cannot crowd into 1a and 1c, because a third pod in either zone would put skew at 3 against an empty or missing 1b. They stay Pending and reschedule once 1b capacity returns.

The lesson the team wrote down: soft constraints describe intent; only hard constraints survive a bad day. Zonal availability that you cannot lose is a DoNotSchedule plus minDomains, never a preference.

Checklist

Decide hard vs soft for every placement rule — required/DoNotSchedule only where a Pending pod is preferable to a misplaced one.
Use topology spread (not required pod anti-affinity) for spreading at scale; reserve pod affinity for true co-location.
Set minDomains to your zone count on any workload that must spread from the first replica.
Add matchLabelKeys: [pod-template-hash] to spread constraints so rollouts compute skew per ReplicaSet.
Pair every dedicated-pool toleration with a matching nodeSelector/affinity — tolerations grant permission, not attraction.
Keep custom PriorityClass values below system-cluster-critical; use preemptionPolicy: Never for line-jumping batch.
Define a PDB (minAvailable or maxUnavailable) for every stateful or quorum-sensitive workload, and remember it only soft-protects against preemption.
Run the descheduler with RemovePodsViolatingTopologySpreadConstraint and LowNodeUtilization, scheduled no tighter than your rollout cadence.
Validate post-deploy with a zone/node distribution check and a GameDay zone-kill before trusting the design.

Written by Vinod
