Kubernetes StatefulSets, In Depth: Stable Identity, Ordered Lifecycle & Per-Pod Storage

A Deployment treats its Pods as cattle: interchangeable, anonymous, disposable. If one dies, the controller stamps out a fresh replacement with a new random name and a new IP, and nothing downstream cares — that is exactly what you want for a stateless web server or an API. But a whole class of software refuses to be cattle. A database replica, a message broker node, a distributed-consensus member (etcd, ZooKeeper, Kafka, Cassandra) each needs to be a pet with papers: a stable name that survives restarts, a stable network address other members can find, its own private disk that follows it around, and a predictable order in which the herd is brought up, scaled, and torn down. A Postgres replica that comes back as pod-7f9c-xk2 with an empty volume is not the same replica — it has amnesia.

The StatefulSet is the workload controller built for exactly these requirements. It guarantees three things a Deployment cannot: stable, sticky identity (predictable names and stable per-Pod DNS), ordered, graceful lifecycle (Pods are created, scaled, and deleted in a deterministic sequence), and stable per-Pod storage (each replica gets its own PersistentVolumeClaim that is retained across rescheduling and even rescaling). Those three properties are what let you run consensus systems and databases on Kubernetes without losing data during the very failover you forgot to test.

This lesson covers StatefulSets exhaustively: every field that matters, what it does, the values it takes, its default, when to set it, and the gotcha that bites people in production. It is long on purpose: by the end you will be able to design a stateful workload with confidence and answer the exam questions that probe the subtle bits (the required headless Service, the partition canary, what happens to PVCs when you scale down). Everything targets Kubernetes v1.30+, where the newer pieces, persistentVolumeClaimRetentionPolicy and .spec.ordinals.start, are beta and enabled by default (both since v1.27); they graduate to stable in v1.31 (.spec.ordinals) and v1.32 (persistentVolumeClaimRetentionPolicy).

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You need a local cluster and basic comfort with kubectl and a Pod spec. If you have not set one up, do the lab in What Is Kubernetes? Control Plane, Nodes, etcd & the kubelet — it walks you through a free local cluster with kind or minikube. Because a StatefulSet’s whole point is storage that persists, you should have met PersistentVolumes, PersistentVolumeClaims and StorageClasses in Kubernetes Storage: Volumes, PV, PVC & StorageClass. And because the easiest way to understand what a StatefulSet adds is to compare it to what you already know, it helps to have met the Deployment → ReplicaSet → Pod chain.

This is Lesson W1 of the Kubernetes Zero-to-Hero course (Intermediate tier). It sits in the workloads track alongside the Jobs, CronJobs & DaemonSets deep dive, and it is the conceptual foundation for the production case study Running Stateful PostgreSQL on Kubernetes with an operator, which applies everything here to a real HA database. After this, the course moves into platform extension with Admission Control: Validating & Mutating Webhooks + ValidatingAdmissionPolicy.

Core concepts: identity, ordering and storage

A StatefulSet manages a set of Pods that are not interchangeable. To make that work it pins three properties to each Pod, indexed by a stable integer ordinal (0, 1, 2, …):

Two more load-bearing terms:

Here is the contrast that makes the whole concept click:

| Property | Deployment | StatefulSet |
|---|---|---|
| Pod names | random suffix (web-7f9c-xk2) | stable ordinal (web-0, web-1) |
| Identity after restart | new (fresh suffix + IP) | same name, same DNS, same PVC |
| Storage | shared/none, or one PVC shared by all | one PVC per Pod, sticky |
| Per-Pod DNS | no | yes (via headless Service) |
| Creation/scale-up order | all at once | one at a time, ordinal order (default) |
| Termination/scale-down order | arbitrary | one at a time, reverse order (default) |
| Update strategy | RollingUpdate/Recreate (by surge) | RollingUpdate (reverse-ordinal, in place) / OnDelete |

Stable network identity: the required headless Service & per-Pod DNS

Every StatefulSet must reference a governing Service via .spec.serviceName, and that Service should be headless (clusterIP: None). This Service is the machinery that creates each Pod’s stable DNS name. Without it, the per-Pod records never get created and your members cannot find each other by a durable address.

The minimal pair looks like this:

apiVersion: v1
kind: Service
metadata:
  name: nginx            # this becomes the DNS subdomain for every Pod
  labels:
    app: nginx
spec:
  clusterIP: None        # <-- headless: no VIP, DNS returns Pod IPs directly
  selector:
    app: nginx
  ports:
    - name: web
      port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx     # <-- must match the headless Service name
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx       # <-- must match both selectors above
    spec:
      containers:
        - name: nginx
          image: registry.k8s.io/nginx-slim:0.27
          ports:
            - containerPort: 80
              name: web

With this applied, the cluster gives you a precise, predictable set of names. Assume the namespace is default and the cluster domain is cluster.local:

| Name | Resolves to | Lifetime |
|---|---|---|
| nginx.default.svc.cluster.local | the set of all ready Pod IPs (headless: A records for web-0, web-1, web-2) | as long as Pods are ready |
| web-0.nginx.default.svc.cluster.local | the IP of Pod web-0 specifically | stable for the identity web-0, across reschedules |
| web-1.nginx.default.svc.cluster.local | the IP of Pod web-1 specifically | stable for web-1 |
| web-2.nginx.default.svc.cluster.local | the IP of Pod web-2 specifically | stable for web-2 |

The pattern is <pod-name>.<service-name>.<namespace>.svc.<cluster-domain>. Three things to internalise:

A subtlety worth knowing for production and exams: by default a Pod’s per-Pod DNS record is only published once the Pod is Ready. If a peer needs to be discoverable before it passes readiness (common during cluster bootstrap, where members find each other before they’re serving), set .spec.publishNotReadyAddresses: true on the headless Service so DNS includes not-yet-ready Pods.
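The naming pattern above is mechanical enough to generate. The following is a minimal sketch, assuming ordinals start at 0 and the defaults used in this section (namespace default, cluster domain cluster.local); pod_fqdn is a hypothetical helper written for illustration, not a kubectl feature.

```shell
#!/bin/sh
# pod_fqdn STS ORDINAL SERVICE [NAMESPACE] [CLUSTER_DOMAIN]
# Builds <pod-name>.<service-name>.<namespace>.svc.<cluster-domain>,
# where the Pod name is <statefulset-name>-<ordinal>.
pod_fqdn() {
  sts="$1"; ordinal="$2"; svc="$3"
  ns="${4:-default}"; domain="${5:-cluster.local}"
  printf '%s-%s.%s.%s.svc.%s\n' "$sts" "$ordinal" "$svc" "$ns" "$domain"
}

# Print the peer list a 3-member set named web could hand to its members.
for i in 0 1 2; do
  pod_fqdn web "$i" nginx
done
```

A list like this is what clustered software typically takes as its seed or peer configuration, which is why the names need to stay fixed across reschedules.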


Ordered, graceful lifecycle & podManagementPolicy

The second guarantee is ordering. By default a StatefulSet does not act on all Pods at once — it proceeds one Pod at a time, in order, waiting for each to be healthy before moving on. This matters because consensus systems and primary/replica databases often require members to come up in sequence (member 0 bootstraps, member 1 joins it, member 2 joins the cluster).

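The default ordering guarantees (podManagementPolicy: OrderedReady) are: Pods are created one at a time in ascending ordinal order (0, 1, 2, …); Pods are terminated one at a time in descending order, each waiting until the higher ordinals are fully gone; and before the controller acts on the next Pod, the Pods ahead of it in the sequence must be Running and Ready.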

That last point has a sharp edge: a single unhealthy Pod blocks the entire roll. If web-1 never becomes Ready, you cannot scale up to add web-3 and you cannot finish a rolling update past web-1. This is by design — for a database you generally want to stop rather than charge ahead with a broken member — but it surprises people. (Kubernetes did add a guard so that a brand-new failed Pod doesn’t permanently wedge the set: the controller will recreate a Pod that failed before ever becoming Ready. But a Pod that was Ready and then went unhealthy will hold the line.)
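To make the blocking rule concrete, here is a small sketch that models it outside the cluster. It assumes you already know each Pod's readiness (1 for Ready, 0 for not Ready, listed in ordinal order); first_unready is a hypothetical helper, and the real decision is of course made by the StatefulSet controller.

```shell
#!/bin/sh
# first_unready FLAG0 FLAG1 ...
# Prints the lowest ordinal whose flag is not 1, or "none" if all are Ready.
# Under OrderedReady, that ordinal is the one holding up scale-up.
first_unready() {
  ordinal=0
  for flag in "$@"; do
    if [ "$flag" != "1" ]; then
      echo "$ordinal"
      return 0
    fi
    ordinal=$((ordinal + 1))
  done
  echo "none"
}

# web-0 Ready, web-1 not Ready, web-2 Ready: web-1 is what blocks web-3.
first_unready 1 0 1
```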

podManagementPolicy switches this behaviour:

| podManagementPolicy | Create/scale-up order | Scale-down order | Waits for Ready between Pods? | When to use |
|---|---|---|---|---|
| OrderedReady (default) | strict ascending, one at a time | strict descending, one at a time | yes | Systems that must bootstrap in order or where one broken member should halt the roll (most databases, etcd/ZooKeeper). |
| Parallel | all at once | all at once | no | Members that are independent of one another and just need stable identity + storage (e.g. a sharded store where shards don’t bootstrap from each other, or where startup order is irrelevant). Much faster to scale. |

Two cautions on Parallel:

Termination grace and ordering interact with terminationGracePeriodSeconds in the Pod template. Because Pods come down one at a time under OrderedReady, a generous grace period on a database (to flush WAL, leave the quorum cleanly) is multiplied across the set — draining 5 replicas with a 60-second grace can take minutes. That’s usually correct for safety, but plan for it during node drains and upgrades.
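The arithmetic behind that warning is simple enough to sketch. This assumes strictly sequential termination (as in an OrderedReady scale-down) and that every Pod uses its full grace period, so it is an upper bound on the grace-period portion only; worst_case_drain_seconds is a hypothetical helper.

```shell
#!/bin/sh
# worst_case_drain_seconds REPLICAS GRACE_SECONDS
# Upper bound on time spent in grace periods when Pods stop one after another.
worst_case_drain_seconds() {
  replicas="$1"; grace="$2"
  echo $((replicas * grace))
}

total=$(worst_case_drain_seconds 5 60)
echo "up to ${total}s (~$((total / 60)) min) if every Pod uses its full grace period"
```

Pods that shut down cleanly before the deadline finish sooner, so real drains are usually faster than this bound.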


Per-Pod storage: volumeClaimTemplates

The third guarantee is stable storage, and it is delivered by volumeClaimTemplates — a list of PVC templates on the StatefulSet. For each template and each replica, the controller creates a dedicated PersistentVolumeClaim, dynamically provisioning a PersistentVolume from the named StorageClass. The PVC’s name is deterministic: <template-name>-<statefulset-name>-<ordinal>.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: nginx
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: nginx
          image: registry.k8s.io/nginx-slim:0.27
          ports:
            - containerPort: 80
              name: web
          volumeMounts:
            - name: data            # <-- matches the template name below
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard  # omit to use the cluster's default class
        resources:
          requests:
            storage: 1Gi

This produces three PVCs — data-web-0, data-web-1, data-web-2 — each bound to its own PV. The behaviour you must commit to memory:

A common point of confusion: a volumeClaimTemplate is not the same as a volume of type persistentVolumeClaim in the Pod template. A normal PVC volume references one pre-existing claim that all replicas would share, which is wrong for stateful data (and unworkable with ReadWriteOnce as soon as replicas land on different nodes, because that mode allows mounting from a single node only). The template creates one PVC per replica, which is the whole point. Reach for shared storage (one ReadWriteMany PVC mounted by every Pod) only when the workload genuinely shares a filesystem; for databases and queues you want per-Pod ReadWriteOnce.

One more for completeness: the StorageClass’s volumeBindingMode also affects StatefulSet Pods. With the recommended WaitForFirstConsumer, provisioning is delayed until the Pod is scheduled, so the PV is created in the zone the Pod lands in. That keeps a zonal disk and its Pod together, which matters on cloud providers where a disk can’t cross zones.
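Because the PVC naming rule from this section is deterministic, you can predict the claim names before applying anything. A minimal sketch, assuming ordinals start at 0 and a single claim template; pvc_names is a hypothetical helper for illustration.

```shell
#!/bin/sh
# pvc_names TEMPLATE STATEFULSET REPLICAS
# Prints one PVC name per replica: <template-name>-<statefulset-name>-<ordinal>.
pvc_names() {
  template="$1"; sts="$2"; replicas="$3"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${template}-${sts}-${i}"
    i=$((i + 1))
  done
}

# The manifest in this section: template "data", StatefulSet "web", 3 replicas.
pvc_names data web 3
```

Comparing this list against kubectl get pvc is a quick way to spot leftover claims from an earlier, larger replica count.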


Rolling out changes: updateStrategy & the partition canary

Changing the Pod template (a new image, a config change, a resource bump) triggers a rollout. StatefulSets give you two strategies via .spec.updateStrategy.type:

| updateStrategy.type | What it does | Ordering | When to use |
|---|---|---|---|
| RollingUpdate (default) | Automatically deletes and recreates Pods to the new template, one at a time | descending ordinal (highest first: N-1 … 1, 0) | Most cases: automated, ordered, in-place updates. |
| OnDelete | Does nothing automatically; the new template applies only when you delete a Pod by hand | you control it | When you need full manual control over when and which member updates (e.g. drain-then-update each DB node yourself, or operator-driven updates). |

Three things to understand about RollingUpdate:

RollingUpdate with partition

Set .spec.updateStrategy.rollingUpdate.partition: K. The rule is simple but powerful:

Only Pods with an ordinal >= K are updated to the new template. Pods with ordinal < K stay on the old template — even if you delete them, they come back on the old spec.

So with replicas: 5 and partition: 4, changing the image updates only web-4 (ordinals ≥ 4). web-0…web-3 remain on the previous version. You now have a canary: one member on the new build, the rest stable. If web-4 looks healthy, lower the partition to 3 to roll web-3, then 2, 1, and finally 0 (or jump straight to 0/remove the field to finish). If the canary misbehaves, raise the partition back up (or revert the template) and only the canary was ever affected.

spec:
  replicas: 5
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 4        # only ordinals >= 4 get the new template -> canary on web-4

| partition value (replicas=5) | Pods that update on a template change |
|---|---|
| 0 (or field absent) | all Pods (web-0 … web-4): full rollout |
| 3 | web-3, web-4 |
| 4 | web-4 only (canary) |
| 5 (= replicas) | none: template change is staged but applied to nothing yet |

Two notes. First, setting partition equal to or greater than replicas stages the new template without rolling any Pod, which is handy for pre-loading a change you’ll release later. Second, partition also applies to Pods created by a scale-up: a new Pod gets the new template only if its ordinal is >= partition, otherwise it is created from the old one. Lowering the partition is the act of “releasing” the canary to more members; you do it deliberately, observing health at each step.
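The partition rule can be checked on paper before you touch a cluster. The sketch below only models the rule "ordinal >= partition updates" and assumes ordinals start at 0; pods_to_update is a hypothetical helper, and it lists Pods highest ordinal first to mirror the order a RollingUpdate visits them.

```shell
#!/bin/sh
# pods_to_update STATEFULSET REPLICAS [PARTITION]
# Prints the Pods a template change would roll, highest ordinal first.
pods_to_update() {
  sts="$1"; replicas="$2"; partition="${3:-0}"
  i=$((replicas - 1))
  while [ "$i" -ge "$partition" ] && [ "$i" -ge 0 ]; do
    echo "${sts}-${i}"
    i=$((i - 1))
  done
}

pods_to_update web 5 4   # canary: web-4 only
pods_to_update web 5 3   # web-4, then web-3
pods_to_update web 5 5   # prints nothing: the change is staged only
```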

There is also .spec.updateStrategy.rollingUpdate.maxUnavailable, which lets a RollingUpdate take down more than one Pod at a time within the partition. It is gated by the MaxUnavailableStatefulSet feature gate, which was introduced as alpha in v1.24 and is not enabled by default on every version, so check the feature-gate status for your cluster before relying on it. The default is 1 (the classic one-at-a-time behaviour). Raise it only for sets that tolerate multiple simultaneous restarts.


Reclaiming storage: persistentVolumeClaimRetentionPolicy

Historically, StatefulSet PVCs were always retained: deleting the StatefulSet or scaling it down left the per-Pod PVCs (and their data) behind, which you then had to clean up by hand. The persistentVolumeClaimRetentionPolicy field (beta and enabled by default since Kubernetes v1.27, stable since v1.32) lets you choose what happens to the auto-created PVCs in two distinct events:

spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain     # when the whole StatefulSet is deleted
    whenScaled: Delete      # when the StatefulSet is scaled down

The two fields and their values:

| Field | Trigger | Retain (default) | Delete |
|---|---|---|---|
| whenDeleted | The StatefulSet itself is deleted | PVCs are kept; data survives; you clean up manually | PVCs (for all ordinals) are deleted after the Pods are gone |
| whenScaled | The set is scaled down (replica count reduced) | PVCs for removed ordinals are kept; scaling back up reuses them | PVCs for the removed (higher) ordinals are deleted |

Whether the underlying PersistentVolume (and the disk) is actually destroyed when a PVC is deleted depends on the PV’s reclaimPolicy (Delete vs Retain), set by the StorageClass — so Delete here plus a Delete reclaim policy genuinely erases data, while Delete here plus a Retain reclaim policy releases the PVC but keeps the PV for manual recovery.
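The interaction between the two policies reads most easily as a lookup. This is a sketch of the combinations described above, not anything the cluster runs; data_outcome is a hypothetical helper, and its first argument stands for whichever retention field applies to the event (whenDeleted or whenScaled).

```shell
#!/bin/sh
# data_outcome RETENTION RECLAIM
#   RETENTION: StatefulSet retention policy value (Retain or Delete)
#   RECLAIM:   PV reclaimPolicy inherited from the StorageClass (Retain or Delete)
data_outcome() {
  retention="$1"; reclaim="$2"
  case "${retention}/${reclaim}" in
    Retain/Retain|Retain/Delete)
      echo "PVC kept, data kept" ;;
    Delete/Retain)
      echo "PVC deleted, PV kept for manual recovery" ;;
    Delete/Delete)
      echo "PVC deleted, PV and backing disk deleted" ;;
    *)
      echo "unknown combination: ${retention}/${reclaim}" >&2
      return 1 ;;
  esac
}

data_outcome Delete Retain
data_outcome Delete Delete
```

Only the last combination destroys data, which is why it is worth checking the StorageClass before switching a retention field to Delete.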

How to choose:

Implementation detail worth knowing: the controller wires up Delete behaviour using owner references on the PVCs (pointing at the StatefulSet for whenDeleted, or at the Pod for whenScaled), so Kubernetes garbage collection does the deletion. This means the policy is honoured even if the controller restarts mid-operation.

Figure: Kubernetes StatefulSet anatomy

The diagram above ties the three guarantees together: a headless Service fronts ordinal Pods web-0…web-N, each Pod carries its stable per-Pod DNS name and its own data-web-N PVC bound to a dedicated PV, and the arrows show the lifecycle ordering — ascending for creation, descending for termination and rolling updates — with the partition line marking which ordinals a staged rollout touches.


StatefulSet vs Deployment vs DaemonSet

Choosing the right controller is the most common real-world (and interview) decision. The short version: stateless and interchangeable → Deployment; needs stable identity/order/storage → StatefulSet; one per node → DaemonSet.

| Requirement | Use | Why |
|---|---|---|
| Stateless web/API; Pods interchangeable | Deployment | Fastest rollout (surge), no identity baggage, scales freely. |
| Each replica needs a stable name + DNS | StatefulSet | Deployment Pods get random, changing identities. |
| Each replica needs its own persistent disk that follows it | StatefulSet | volumeClaimTemplates gives one sticky PVC per Pod. |
| Members must bootstrap/scale in a fixed order | StatefulSet | OrderedReady guarantees sequence; Deployments don’t. |
| Primary/replica databases, quorum systems (etcd, ZooKeeper, Kafka, Cassandra) | StatefulSet (usually via an operator) | Needs all three guarantees at once. |
| Exactly one Pod on every node (log/metrics/CNI/CSI agents) | DaemonSet | Tracks node membership; one-per-node, not a fixed count. |
| A run-once or scheduled batch task | Job / CronJob | Run-to-completion semantics, not run-forever. |

Sharp contrasts beginners blur:

And the production reality check: for anything beyond a toy, you usually don’t hand-write the StatefulSet for a database — you use an operator (CloudNativePG for Postgres, Strimzi for Kafka, the ZooKeeper operator, etc.). The operator generates a StatefulSet under the hood for identity and storage, then layers on the consensus, failover, backup and PITR logic a bare StatefulSet can’t express. This lesson teaches the primitives; the Stateful PostgreSQL with an operator lesson shows the real thing.


Common stateful patterns


Hands-on lab

Free, on your laptop, using kind (or minikube). You’ll create a headless Service + StatefulSet, observe ordered creation, prove per-Pod identity and storage stickiness, run a partition canary, scale down to see PVC retention, then experiment with the retention policy. kind ships a default standard StorageClass (Rancher local-path provisioner) so dynamic provisioning works out of the box.

1. Create a cluster:

kind create cluster --name sts-lab
kubectl get storageclass
# expect a 'standard (default)' class (rancher.io/local-path)

2. Create the headless Service and StatefulSet:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web
  labels: { app: web }
spec:
  clusterIP: None
  selector: { app: web }
  ports:
    - { name: http, port: 80 }
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  podManagementPolicy: OrderedReady
  selector:
    matchLabels: { app: web }
  updateStrategy:
    type: RollingUpdate
  persistentVolumeClaimRetentionPolicy:
    whenScaled: Retain
    whenDeleted: Retain
  template:
    metadata:
      labels: { app: web }
    spec:
      terminationGracePeriodSeconds: 5
      containers:
        - name: nginx
          image: registry.k8s.io/nginx-slim:0.27
          ports: [{ containerPort: 80, name: http }]
          volumeMounts:
            - { name: data, mountPath: /usr/share/nginx/html }
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources: { requests: { storage: 64Mi } }
EOF

3. Watch ordered, one-at-a-time creation:

kubectl get pods -l app=web -w
# web-0 -> Running/Ready, THEN web-1 -> Running/Ready, THEN web-2. Never all at once.
# Ctrl-C when 3/3 are Ready.
kubectl get statefulset web        # READY 3/3
kubectl get pvc -l app=web
# data-web-0, data-web-1, data-web-2 -> all Bound

Validation: the Pod names are web-0, web-1, web-2 (stable ordinals), and there are exactly three PVCs named data-web-<n>.

4. Prove stable per-Pod DNS and storage stickiness. Write a file into web-1’s volume, then delete the Pod and confirm the data and identity return:

# write unique content onto web-1's disk, served by nginx
kubectl exec web-1 -- sh -c 'echo "I am web-1, ordinal 1" > /usr/share/nginx/html/index.html'

# from a throwaway client Pod, resolve the per-Pod name and fetch it
kubectl run client --rm -it --image=busybox:1.36 --restart=Never -- \
  sh -c 'nslookup web-1.web.default.svc.cluster.local; wget -qO- http://web-1.web.default.svc.cluster.local'
# nslookup returns web-1's IP; wget prints: I am web-1, ordinal 1

# now delete web-1 and watch it come back with the SAME name and SAME data
kubectl delete pod web-1
kubectl wait --for=condition=Ready pod/web-1 --timeout=60s
kubectl exec web-1 -- cat /usr/share/nginx/html/index.html
# STILL prints: I am web-1, ordinal 1   <-- it re-bound data-web-1

This is the headline demonstration: a deleted StatefulSet Pod returns with its name, DNS record, and disk intact. A Deployment Pod would have come back anonymous and empty.

5. Run a partition canary. Stage a change so only the highest ordinal updates:

# set a partition so only ordinals >= 2 update
kubectl patch statefulset web --type=merge \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'

# change the image -> with partition=2, only web-2 should roll
kubectl set image statefulset/web nginx=registry.k8s.io/nginx-slim:0.28
kubectl rollout status statefulset/web --timeout=120s

# verify: web-2 on the new image, web-0/web-1 still on the old one
kubectl get pod web-0 web-1 web-2 \
  -o 'custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image'   # quoted so the shell does not glob [0]
# web-0 -> :0.27   web-1 -> :0.27   web-2 -> :0.28   (canary on web-2 only)

Now “release” the canary to the rest by lowering the partition:

kubectl patch statefulset web --type=merge \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
kubectl rollout status statefulset/web --timeout=120s
kubectl get pods -l app=web \
  -o 'custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image'   # quoted so the shell does not glob [0]
# all three now on :0.28, rolled web-1 then web-0 (reverse ordinal)

6. See PVC retention on scale-down (policy = Retain):

kubectl scale statefulset web --replicas=1
kubectl get pods -l app=web        # only web-0 remains (web-2, then web-1 terminated)
kubectl get pvc -l app=web
# data-web-0, data-web-1, data-web-2 are ALL still here and Bound -> PVCs retained

Scale back up and watch the data return:

kubectl scale statefulset web --replicas=3
kubectl wait --for=condition=Ready pod/web-1 --timeout=60s
kubectl exec web-1 -- cat /usr/share/nginx/html/index.html
# STILL "I am web-1, ordinal 1" -> web-1 re-attached its retained PVC

7. (Optional) Contrast with whenScaled: Delete:

kubectl patch statefulset web --type=merge \
  -p '{"spec":{"persistentVolumeClaimRetentionPolicy":{"whenScaled":"Delete"}}}'
kubectl scale statefulset web --replicas=1
kubectl get pvc -l app=web
# now data-web-1 and data-web-2 are GONE (deleted with their Pods); only data-web-0 remains

Cleanup:

kubectl delete statefulset web
kubectl delete service web
kubectl delete pvc -l app=web        # PVCs are NOT removed by deleting the StatefulSet (Retain)
kind delete cluster --name sts-lab

Cost note: entirely free — kind runs in Docker on your laptop and the local-path provisioner uses host disk. On a cloud provider, every volumeClaimTemplate PVC provisions a real, billed managed disk per replica (e.g. an EBS/Managed Disk volume), and with whenScaled: Retain those disks keep costing money after you scale down until you delete the PVCs — the number-one surprise on a StatefulSet bill.


Common mistakes & troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| StatefulSet stuck at 1/3 (or any partial count), higher Pods never created | OrderedReady: a lower ordinal (e.g. web-0) is not Ready, so creation halts there | kubectl describe pod web-0 / kubectl logs web-0; fix the failing Pod (image, probe, config). The roll resumes once it’s Ready. |
| Per-Pod DNS names (web-0.svc…) don’t resolve | No headless Service, serviceName doesn’t match it, or the Service isn’t clusterIP: None | Create a headless Service whose name equals .spec.serviceName; verify clusterIP: None. |
| Pods stay Pending, PVCs Pending | No default StorageClass, named class doesn’t exist, or no provisioner/capacity | kubectl get sc; set storageClassName to a real class or mark one default; check provisioner logs. |
| Peers can’t find a member during bootstrap (before it’s Ready) | Per-Pod DNS only publishes Ready Pods by default | Set publishNotReadyAddresses: true on the headless Service. |
| Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy'... | Tried to edit an immutable field (serviceName, selector, podManagementPolicy, volumeClaimTemplates) | Recreate the StatefulSet; use kubectl delete sts --cascade=orphan to keep Pods/PVCs, then re-apply the new spec. |
| Rolling update doesn’t progress past one Pod | Updated Pod isn’t becoming Ready (reverse-ordinal roll halts), or a non-zero partition is pinning lower ordinals | Fix the unhealthy Pod; check .spec.updateStrategy.rollingUpdate.partition and lower it to release more ordinals. |
| Image change applied to no Pods | partition >= replicas (staged but not released), or strategy is OnDelete | Lower partition, or for OnDelete delete the Pods you want updated. |
| Orphaned PVCs / unexpected cloud disk bill after scaling down | Default whenScaled: Retain keeps removed ordinals’ PVCs | Delete the stale PVCs, or set persistentVolumeClaimRetentionPolicy.whenScaled: Delete if losing that data is acceptable. |
| Deleting the StatefulSet left all PVCs behind | Default whenDeleted: Retain (data protection) | Delete PVCs explicitly, or set whenDeleted: Delete for ephemeral environments. |

A note on --cascade=orphan: kubectl delete statefulset web --cascade=orphan removes the StatefulSet object but leaves the Pods and PVCs running. This is the supported way to change an immutable field (e.g. podManagementPolicy) without downtime: orphan, re-apply the StatefulSet with the new spec (it adopts the existing Pods/PVCs by selector), done.
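The orphan-and-re-apply sequence can be wrapped in a small function so the two steps always run together. This is a sketch under assumptions: the manifest filename web-statefulset.yaml and the KUBECTL override are hypothetical, and the block is set up as a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
# KUBECTL is an illustrative override so the steps can be rehearsed safely.
KUBECTL="echo kubectl"   # dry run; set KUBECTL=kubectl to execute for real

# recreate_statefulset NAME MANIFEST
recreate_statefulset() {
  name="$1"; manifest="$2"
  # 1. Remove only the StatefulSet object; its Pods and PVCs keep running.
  $KUBECTL delete statefulset "$name" --cascade=orphan || return 1
  # 2. Re-create it from the edited manifest; the selector must still match
  #    the orphaned Pods so the new object adopts them.
  $KUBECTL apply -f "$manifest"
}

recreate_statefulset web web-statefulset.yaml
```

Rehearsing with the dry run first is a cheap safeguard, since forgetting --cascade=orphan on the real command would terminate the Pods.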

Best practices

Security notes

Interview & exam questions

  1. Why does a StatefulSet require a headless Service, and what is “headless”? A headless Service (clusterIP: None) has no virtual IP and doesn’t load-balance; DNS returns the Pods’ own addresses. The StatefulSet names it via .spec.serviceName, which is what causes the per-Pod DNS records (pod-N.<svc>.<ns>.svc.cluster.local) to be created — the stable network identity that lets cluster members address each other. Without it, those per-Pod names don’t exist.

  2. What three guarantees does a StatefulSet give that a Deployment does not? Stable, sticky identity (deterministic name-N that persists across restarts, with stable DNS); ordered, graceful lifecycle (ordinal-ordered create/scale/terminate); and stable per-Pod storage (one sticky PVC per replica via volumeClaimTemplates).

  3. In what order are StatefulSet Pods created and deleted by default? Created in ascending ordinal order (0,1,2…), each waiting for the previous to be Running and Ready; deleted in descending order (highest first), each waiting for the higher ones to be fully gone. This is podManagementPolicy: OrderedReady.

  4. web-0 is in CrashLoopBackOff. You scale the StatefulSet from 3 to 5. What happens? Nothing new is created. Under OrderedReady, Pod N isn’t created until 0…N-1 are Ready — so a broken web-0 blocks the entire set; web-3/web-4 never appear until web-0 is healthy.

  5. What does podManagementPolicy: Parallel change, and what does it not change? It launches and terminates Pods all at once without waiting for Ready between them (faster scaling). It does not change update ordering (still reverse-ordinal), and it does not weaken identity or storage guarantees. It’s immutable after creation.

  6. How does volumeClaimTemplates name its PVCs, and what happens to a PVC when its Pod is rescheduled? PVCs are named <template>-<statefulset>-<ordinal> (e.g. data-web-1). On reschedule, the recreated Pod re-binds to its existing PVC — same volume, same data — even on a different node. The PVC is the durable thing; the Pod is ephemeral.

  7. What happens to the PVCs when you (a) scale a StatefulSet down and (b) delete it? By default (Retain/Retain) both keep the PVCs — scale-down leaves removed ordinals’ PVCs (so scaling back up reuses the data), and deleting the StatefulSet leaves all PVCs. You change this with persistentVolumeClaimRetentionPolicy.whenScaled / whenDeleted set to Delete.

  8. Explain updateStrategy.rollingUpdate.partition. With partition: K, only Pods with ordinal ≥ K update to the new template; ordinals < K stay on the old one (and stay old even if deleted). It’s the canary/staged-rollout lever: set it high to update just the top ordinal, observe, then lower it to roll more. partition ≥ replicas stages a change without rolling anything.

  9. How does a StatefulSet rolling update differ from a Deployment’s? A StatefulSet rolls one Pod at a time, in reverse-ordinal order, in place (same name and PVC) with no surge — it can’t run two Pods for one identity or share a ReadWriteOnce disk. A Deployment can surge (briefly exceed replicas) and uses random-named replicas. A stuck Pod halts a StatefulSet roll.

  10. When would you choose a StatefulSet over a Deployment, and over a DaemonSet? Over a Deployment: when you need stable identity/DNS, ordered lifecycle, or per-Pod persistent storage (databases, queues, consensus). Over a DaemonSet: when you need a fixed count of identity-bearing replicas, not one per node (a DaemonSet is for per-node agents).

  11. Which StatefulSet fields are immutable, and how do you change one anyway? serviceName, selector, podManagementPolicy, and (mostly) volumeClaimTemplates are immutable. To change one, delete the StatefulSet with --cascade=orphan (leaving Pods and PVCs), then re-apply the modified spec, which re-adopts them.

  12. You’re upgrading a 5-member quorum system (e.g. etcd) on a StatefulSet. How do you do it safely, and why does ordering matter? Use RollingUpdate (default), upgrading one member at a time in reverse ordinal so a majority is always healthy; optionally drive it with partition to canary one member first. One-at-a-time matters because a five-member cluster needs three members for quorum: with one member restarting you can still absorb one unplanned failure, whereas taking two down deliberately leaves no margin, and a third loss stops the cluster. Never scale below majority during the roll.

Quick check

  1. What value of clusterIP makes a Service headless, and why does a StatefulSet need one?
  2. A StatefulSet named db with serviceName: db in namespace prod — what is the stable DNS name of its second replica?
  3. With podManagementPolicy: OrderedReady, can web-2 be created while web-1 is still Pending?
  4. After kubectl delete statefulset web with the default retention policy, are the PVCs gone?
  5. With replicas: 4 and updateStrategy.rollingUpdate.partition: 3, which Pods update when you change the image?

Answers

  1. clusterIP: None. It gives no VIP and makes DNS return the individual Pod addresses, which is what creates the per-Pod stable DNS records the StatefulSet’s identity guarantee relies on.
  2. db-1.db.prod.svc.cluster.local (pattern pod-name.service.namespace.svc.cluster-domain; ordinals start at 0, so the second replica is db-1).
  3. No. Under OrderedReady, web-2 isn’t created until web-0 and web-1 are both Running and Ready.
  4. No. With the default whenDeleted: Retain, deleting the StatefulSet leaves the PVCs (and data) behind; you delete them explicitly.
  5. Only web-3 (ordinals ≥ 3). web-0, web-1, web-2 stay on the old template.

Exercise

Build a small quorum-style StatefulSet on a local kind cluster and exercise every guarantee:

  1. Create a headless Service quorum and a StatefulSet quorum with replicas: 3, podManagementPolicy: OrderedReady, a volumeClaimTemplate named data (64Mi, ReadWriteOnce), and persistentVolumeClaimRetentionPolicy: { whenScaled: Delete, whenDeleted: Retain }. Use any small server image (e.g. registry.k8s.io/nginx-slim).
  2. Watch the Pods come up one at a time in ascending order; confirm three PVCs data-quorum-0..2 are Bound.
  3. Write a distinct file into each Pod’s volume (echo "member N" > …/index.html). Delete quorum-1, wait for it to return, and prove it kept its name and data.
  4. From a throwaway busybox Pod, resolve and fetch quorum-2.quorum.default.svc.cluster.local to prove per-Pod DNS works.
  5. Set updateStrategy.rollingUpdate.partition: 2, change the image, and confirm only quorum-2 updated. Then lower the partition to 0 and watch quorum-1 then quorum-0 roll (reverse ordinal).
  6. Scale to replicas: 1 and confirm — because whenScaled: Delete — that data-quorum-1 and data-quorum-2 were deleted (contrast with the default Retain).
  7. Clean up with kind delete cluster.

Success criteria: ordered bring-up; a deleted Pod returns with identical name and data; per-Pod DNS resolves; the partition canary touches exactly one ordinal; and the retention policy deletes the right PVCs on scale-down.

Certification mapping

Glossary

Next steps

You now understand the three StatefulSet guarantees and every field that delivers them. The natural next move is to see them applied to a real, production-grade workload: Running Stateful PostgreSQL on Kubernetes with an operator builds an HA Postgres cluster with synchronous replication, automated quorum failover, WAL archiving and a rehearsed point-in-time-recovery drill — everything a bare StatefulSet can’t do on its own. From there the course turns to extending the platform itself with Kubernetes Admission Control: Validating & Mutating Webhooks + ValidatingAdmissionPolicy. If you want more depth on the storage layer underneath volumeClaimTemplates, revisit Kubernetes Storage: Volumes, PV, PVC & StorageClass and CSI Volume Snapshots, Cloning, Resize & Topology.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
