Getting a container to run on Kubernetes is the easy part. A kubectl apply and a kubectl get pods showing Running looks like success, and in a demo it is. But the gap between running and production-ready is where most teams quietly accumulate outages: the pod that takes traffic before its database connection is open, the workload with no memory limit that gets OOMKilled at 3 a.m., the Deployment that drops requests during every rollout, the StatefulSet whose replicas all land on one node and disappear when that node is drained for patching.
This lesson is the Day-2 readiness checklist — the set of properties a workload needs before it carries real traffic, and the reasoning behind each one so you can defend your choices in a design review or an interview. We will work through the controls that separate a demo from production: probes (liveness, readiness, startup), resource requests and limits with the QoS classes they produce, PodDisruptionBudgets, topology spread constraints and anti-affinity, the HorizontalPodAutoscaler, graceful shutdown, the rolling-update strategy, ConfigMap and Secret hygiene, securityContext and Pod Security, NetworkPolicy, and observability. We finish with a copy-paste checklist you can put in a pull-request template and a single hardened Deployment manifest that wires almost all of it together.
The voice here is deliberately practical. Every setting below has a cost you can feel in production if you get it wrong, and an interviewer will ask you why, not just what.
Learning objectives
By the end of this lesson you can:
- Distinguish liveness, readiness and startup probes, choose the right probe type and handler, and explain the failure mode each one prevents.
- Set resource requests and limits deliberately, predict the resulting QoS class (Guaranteed, Burstable, BestEffort), and explain how QoS drives eviction order.
- Protect availability during voluntary disruptions with a PodDisruptionBudget, and spread replicas across failure domains with topology spread constraints and pod anti-affinity.
- Configure a safe rolling update (maxSurge/maxUnavailable) and implement graceful shutdown with preStop hooks and terminationGracePeriodSeconds.
- Add a HorizontalPodAutoscaler, manage configuration with ConfigMaps and Secrets, and harden the pod with securityContext, Pod Security Admission and NetworkPolicy.
- Apply a single production-readiness checklist to any workload and read a hardened Deployment manifest line by line.
Prerequisites & where this fits
You need to be comfortable with the core workload objects — Pods, ReplicaSets, Deployments and Services — and able to run kubectl apply, kubectl get, kubectl describe and kubectl logs. If those are not yet second nature, work through Pods, ReplicaSets, Deployments & Services: The Core Objects and Your First Cluster: kubectl and a Real Deploy first. You will need a cluster for the lab; a free local one from kind, minikube or k3d is enough.
This is the production-readiness checkpoint of the Kubernetes Zero-to-Hero course. It sits after the fundamentals and before you provision and operate your own clusters in Provisioning Production Kubernetes: kubeadm, HA Control Plane, etcd Backup & Upgrades. Everything here is squarely in the CKAD wheelhouse (designing resilient application deployments) and overlaps heavily with CKA (workload operations).
Core concepts: what “production-ready” actually means
Kubernetes is a declarative reconciliation engine: you describe the desired state and controllers work continuously to make actual state match. “Production-ready” means you have given those controllers enough information to make good decisions on your behalf — and protected the workload against the four things that routinely break it:
| Threat to availability | What it looks like | The control that addresses it |
|---|---|---|
| Bad rollouts | A new image crashes or serves errors, but the old version is already gone | Readiness probes + rolling-update strategy + (later) progressive delivery |
| Resource contention | A noisy neighbour starves your pod of CPU/memory; OOMKills | Requests, limits, QoS classes |
| Voluntary disruptions | A node drain (upgrade, autoscaler) takes down too many replicas at once | PodDisruptionBudget + multiple replicas |
| Involuntary disruptions | A node, rack or zone fails | Topology spread / anti-affinity across failure domains |
Two distinctions underpin the whole lesson. The first is voluntary vs involuntary disruption. Involuntary disruptions are things you do not initiate — a kernel panic, a hardware failure, a node running out of memory. Voluntary disruptions are deliberate operator actions: draining a node to patch it, scaling down a node pool, deleting a pod. You cannot prevent involuntary disruptions, only spread your blast radius; you can rate-limit voluntary disruptions with a PodDisruptionBudget. The second is desired vs actual state — the readiness signal you expose is how a pod tells Kubernetes “actual is not ready yet, do not send me traffic,” and almost every control below is ultimately about making that signal accurate.
Health probes: liveness, readiness and startup
Kubernetes cannot read your application’s mind. It knows a container’s process is alive, but not whether the app inside is healthy or ready to serve. Probes are how you tell it.
| Probe | Question it answers | On failure | Use it for |
|---|---|---|---|
| Liveness | “Is this container wedged and beyond recovery?” | The container is restarted (per restartPolicy) | Deadlocks, stuck event loops (states a restart fixes) |
| Readiness | “Should this pod receive traffic right now?” | The pod is removed from Service endpoints (not restarted) | Warm-up, lost dependency, overload, draining |
| Startup | “Has this slow-starting app finished booting yet?” | The container is restarted; gates liveness/readiness until it passes | Legacy/JVM apps with long, variable startup |
Three rules save you from the classic self-inflicted outages:
- Readiness is the one that protects users. It controls whether the pod is in the Service’s endpoint list. A readiness probe that also checks a critical downstream dependency lets a pod gracefully stop taking traffic when that dependency is gone — but be careful: if every replica checks a shared dependency and that dependency blips, you can take the entire Service out of rotation at once. Probe what this pod needs to serve, not the health of the whole world.
- Liveness must be cheap and local. If your liveness probe calls the database and the database is slow, Kubernetes will conclude the container is dead and restart it — turning a dependency hiccup into a restart storm that makes recovery harder. Liveness should answer “is this process wedged,” nothing more.
- Startup probes exist so the other two do not have to compensate. Without a startup probe, a slow app forces you to set a long initialDelaySeconds or a loose failureThreshold on liveness. The delay is a fixed guess that is either too short for a slow boot or wasted on a fast one, and a loose threshold makes liveness slow to detect real hangs for the container’s whole life. A startup probe gives the app a generous boot budget once, then hands over to a tight liveness probe.
Probe handlers come in four flavours: httpGet (a 2xx/3xx response means pass — the most common for web services), tcpSocket (the port accepts a connection — fine for non-HTTP servers), exec (a command exits 0 — flexible but the most expensive, as it forks a process each run), and grpc (native gRPC health checking, stable since v1.27). The tunables are the same for all three lifecycle probes:
| Field | Meaning | Sensible default |
|---|---|---|
| initialDelaySeconds | Wait before the first probe | Prefer a startup probe over a large value here |
| periodSeconds | How often to probe | 10 (readiness can be tighter, e.g. 5) |
| timeoutSeconds | How long to wait for a response | 1–2 (the default 1 is often too tight for HTTP) |
| failureThreshold | Consecutive failures before acting | 3 |
| successThreshold | Consecutive successes to recover | 1 (must be 1 for liveness/startup) |
A startup probe’s total budget is failureThreshold × periodSeconds — set that to comfortably exceed your worst-case boot time. Expose a lightweight /healthz (liveness) and a /readyz (readiness) in your app rather than reusing one endpoint for both; they answer different questions.
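The HTTP handler appears throughout this lesson, so as a rough illustration of the other three handlers, here is a sketch of a Pod using grpc, tcpSocket and exec probes. The Pod name, images, ports and the /tmp/ready marker file are hypothetical placeholders, and the grpc probe assumes the server implements the standard gRPC health-checking service.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-handlers-demo            # hypothetical name
spec:
  containers:
    - name: grpc-server
      image: registry.example.com/grpc-server:1.0    # hypothetical image
      ports:
        - containerPort: 9090
      startupProbe:
        grpc:
          port: 9090                   # grpc probes take a numeric port
        periodSeconds: 5
        failureThreshold: 24           # boot budget = 24 x 5s = 120s
      livenessProbe:
        grpc:
          port: 9090
        periodSeconds: 10
        failureThreshold: 3
    - name: tcp-server
      image: registry.example.com/tcp-server:1.0     # hypothetical image
      ports:
        - containerPort: 6379
      readinessProbe:
        tcpSocket:
          port: 6379                   # passes when the port accepts a connection
        periodSeconds: 5
    - name: worker
      image: registry.example.com/worker:1.0         # hypothetical image
      readinessProbe:
        exec:
          command: ["cat", "/tmp/ready"]   # passes when the command exits 0
        periodSeconds: 10              # exec forks a process on every run
```

The startup budget in the first container follows the failureThreshold × periodSeconds rule: 24 failures at 5-second intervals allow up to 120 seconds before the container is restarted.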
Resource requests, limits and QoS classes
Requests and limits are the single most consequential — and most often skipped — production setting.
- A request is what the scheduler reserves for the pod. It is the basis for bin-packing: the scheduler only places a pod on a node that has the requested CPU and memory free. Requests are also what the HorizontalPodAutoscaler measures utilisation against.
- A limit is the hard ceiling the kubelet/runtime enforces. The two resources behave very differently at the limit:
| Resource | Over the limit, what happens | Implication |
|---|---|---|
| CPU | The container is throttled (CFS quota) — slowed, never killed | Tail-latency spikes; the pod survives |
| Memory | The container is OOMKilled when it exceeds its limit | The container dies and restarts |
Because CPU throttles but memory kills, the standard guidance is: always set memory requests and limits equal for predictable workloads, set a CPU request, and be cautious with CPU limits — aggressive CPU limits cause throttling that hurts latency without any safety benefit. Many mature platforms set CPU requests but omit CPU limits for latency-sensitive services, relying on requests for fair scheduling.
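A minimal sketch of that guidance, assuming a hypothetical latency-sensitive container: memory request and limit are equal, a CPU request is set, and the CPU limit is deliberately omitted. The name, image and sizes are illustrative only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resources-demo                 # hypothetical name
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.0   # hypothetical image
      resources:
        requests:
          cpu: "250m"                  # reserved by the scheduler; basis for HPA utilisation
          memory: "256Mi"
        limits:
          memory: "256Mi"              # equal to the request: predictable OOM boundary
          # no cpu limit: the container can use spare CPU and is not throttled by a quota
```

One trade-off to be aware of: because the CPU limit is missing, this pod is not Guaranteed. If you need Guaranteed for a stateful or latency-critical workload, CPU requests and limits must also be set and equal.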
The combination of requests and limits determines the pod’s Quality of Service (QoS) class, which decides eviction order when a node runs out of memory (the kubelet evicts to reclaim resources):
| QoS class | Condition | Eviction order under node pressure |
|---|---|---|
| Guaranteed | Every container has requests equal to limits for both CPU and memory | Evicted last — most protected |
| Burstable | At least one container has a request or limit, but not Guaranteed | Evicted after BestEffort, ordered by usage above requests |
| BestEffort | No requests or limits set anywhere | Evicted first — never run critical workloads this way |
For production: give every container at least requests, and target Guaranteed for anything stateful or latency-critical. A BestEffort pod is a pod the kubelet will sacrifice without hesitation — acceptable only for throwaway batch work. You can constrain a namespace with a LimitRange (defaults and min/max per pod) and cap total consumption with a ResourceQuota; both are how platform teams stop a single team’s workloads from starving a shared cluster.
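As a sketch of the namespace guardrails mentioned above, the pair below gives containers default requests and a default memory limit, and caps the namespace total. The object names, the shop namespace and every number are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults             # hypothetical name
  namespace: shop
spec:
  limits:
    - type: Container
      defaultRequest:                  # applied when a container sets no request
        cpu: "100m"
        memory: "128Mi"
      default:                         # applied when a container sets no limit
        memory: "256Mi"
      max:
        memory: "1Gi"                  # no single container may set a higher memory limit
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota                     # hypothetical name
  namespace: shop
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    limits.memory: "8Gi"
    pods: "20"
```

The two objects work together: once a quota tracks requests.cpu or limits.memory, pods that omit those values are rejected, and the LimitRange defaults are what keep such pods admissible.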
PodDisruptionBudgets: surviving voluntary disruption
A PodDisruptionBudget (PDB) caps how many of a workload’s pods can be voluntarily disrupted at once. It does not stop a node failing — it stops kubectl drain (and the cluster autoscaler, and node-pool upgrades) from evicting too many replicas simultaneously.
You express it one of two ways, never both:
| Field | Meaning | Example |
|---|---|---|
| minAvailable | Minimum pods that must stay up during disruption | 2 or 50% |
| maxUnavailable | Maximum pods that may be down during disruption | 1 or 25% |
A PDB only has teeth if you run more than one replica. minAvailable: 1 on a single-replica Deployment means the drain blocks forever and you cannot patch the node, a common foot-gun. For a 3-replica web service, maxUnavailable: 1 (or minAvailable: 2) lets node maintenance proceed one pod at a time while keeping a majority serving. Percentages are resolved against the workload's pod count and rounded up to a whole pod: with 7 pods, minAvailable: 50% requires 4 to stay available.
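For comparison with the absolute form, here is a sketch of the percentage form for a hypothetical checkout-api workload (the name and label are not from this lesson). It is an alternative to a maxUnavailable budget, not something to apply alongside one for the same pods.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api                   # hypothetical workload
spec:
  minAvailable: "50%"                  # rounded up: with 7 pods, 4 must stay available
  selector:
    matchLabels:
      app: checkout-api
```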
Spreading replicas: topology spread and anti-affinity
Three replicas mean nothing if all three land on the same node and that node is drained. You need them spread across failure domains — nodes, then availability zones.
Topology spread constraints are the modern, preferred tool. They tell the scheduler to keep pods evenly distributed across a topology key:
| Field | What it controls |
|---|---|
| topologyKey | The domain to spread across: kubernetes.io/hostname (node) or topology.kubernetes.io/zone (zone) |
| maxSkew | The maximum allowed difference in pod count between the most and least populated domains |
| whenUnsatisfiable | DoNotSchedule (hard: pod stays Pending if it would breach skew) or ScheduleAnyway (soft: best effort) |
| labelSelector | Which pods are counted when computing the spread |
A typical production pattern spreads across zones softly (ScheduleAnyway) and across nodes more firmly, so the scheduler does not pile replicas onto one node when another is free. Pod anti-affinity is the older mechanism that achieves similar goals (preferredDuringScheduling... keeps replicas apart on a best-effort basis); prefer topology spread constraints for new work, because they are cheaper for the scheduler and express intent more directly. Use the hard variant (DoNotSchedule / requiredDuringScheduling) only when you genuinely prefer a Pending pod to a co-located one.
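Since the older mechanism still shows up in existing manifests, here is a sketch of soft pod anti-affinity on a hypothetical legacy-api Deployment. The workload name and image are placeholders; the weight of 100 is simply the maximum preference, not a recommendation from this lesson.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-api                     # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: legacy-api
  template:
    metadata:
      labels:
        app: legacy-api
    spec:
      affinity:
        podAntiAffinity:
          # soft rule: the scheduler prefers nodes without a matching pod
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: legacy-api
      containers:
        - name: legacy-api
          image: registry.example.com/legacy-api:2.3   # hypothetical image
```

Unlike maxSkew, this form only expresses "prefer not to co-locate"; it has no notion of how uneven the result may be, which is one reason topology spread reads more directly.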
Rolling updates and graceful shutdown
A Deployment’s default update strategy is RollingUpdate, governed by two knobs that, combined with readiness probes, give you zero-downtime deploys:
| Field | Meaning | Effect |
|---|---|---|
| maxSurge | Extra pods allowed above the desired count during a rollout | Higher = faster rollout, more peak capacity used |
| maxUnavailable | Pods allowed to be unavailable during a rollout | 0 = never drop below desired count (safest); requires headroom |
The safest production setting for a capacity-constrained service is maxUnavailable: 0 with maxSurge: 1: a new pod must become Ready before an old one is removed, so capacity never dips. This only works if your readiness probe is honest: if it reports ready before the app can serve, the rollout will happily replace healthy pods with broken ones. The other strategy, Recreate, kills all old pods before creating new ones (a downtime window); use it only when two versions cannot coexist, e.g. an exclusive lock or an incompatible schema.
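Graceful shutdown is the other half of zero-downtime. When a pod is deleted (a rollout, a scale-down, a drain), Kubernetes does the following. Endpoint removal (the first item) happens in parallel with the kubelet's shutdown sequence (the remaining items), which runs in order:
- The pod is marked Terminating and removed from Service endpoints (it stops being a traffic target).
- The preStop hook runs (if defined).
- SIGTERM is sent to PID 1 in each container.
- After terminationGracePeriodSeconds (default 30, counted from the start of termination, so it includes preStop time), any remaining processes get SIGKILL.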
The subtle race: endpoint removal propagates asynchronously through kube-proxy and ingress controllers, so for a brief moment a Terminating pod may still receive new connections. The standard fix is a preStop sleep (sleep 5–15) that delays SIGTERM long enough for the endpoint removal to propagate, then a graceful in-app handler that drains in-flight requests before exiting. Set terminationGracePeriodSeconds longer than your longest in-flight request plus the preStop sleep. Your app must trap SIGTERM and exit cleanly — if it ignores SIGTERM (common when the process is wrapped in a shell), every shutdown becomes a hard 30-second kill that drops requests.
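The shell-wrapper pitfall is easiest to see in the container spec itself. The sketch below assumes a hypothetical binary at /app/orders-api and a hypothetical image tag; the point is only the difference between the exec form and a shell wrapper.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sigterm-demo                   # hypothetical name
spec:
  terminationGracePeriodSeconds: 45    # 10s preStop sleep + up to 35s to drain
  containers:
    - name: app
      image: registry.example.com/orders-api:1.4.2   # hypothetical tag
      # Exec form: the binary is PID 1 and receives SIGTERM directly.
      command: ["/app/orders-api"]     # hypothetical path
      args: ["--port=8080"]            # hypothetical flag
      # If a shell wrapper is unavoidable, "exec" replaces the shell so the
      # app still ends up as PID 1:
      # command: ["/bin/sh", "-c", "exec /app/orders-api --port=8080"]
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 10"]   # requires a shell in the image
```

Without exec, the shell stays PID 1 and typically does not forward SIGTERM to the child, which is how a clean shutdown turns into a SIGKILL at the end of the grace period.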
Configuration and secrets
Hard-coding configuration into an image is the anti-pattern; externalise it:
| Mechanism | For | Inject as | Notes |
|---|---|---|---|
| ConfigMap | Non-sensitive config (flags, URLs, files) | Env vars or mounted files | Changing it does not restart pods — roll the Deployment or use a config-reloader |
| Secret | Sensitive data (tokens, passwords, keys) | Env vars or mounted files | Base64-encoded not encrypted by default; mount as files, not env, where possible |
Two production rules: prefer mounting ConfigMaps/Secrets as files over environment variables (mounted files can update live without a restart and do not leak into kubectl describe or crash dumps), and enable encryption at rest for Secrets in etcd (or use an external store via the Secrets Store CSI driver). To force a rollout when config changes, hash the config into a pod-template annotation (e.g. a checksum/config annotation) so the Deployment’s pod template changes and triggers a rolling update.
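A sketch of the file-mount approach, reusing the orders-api-config and orders-api-secrets names from this lesson. The Pod name, image and mount paths are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: config-files-demo              # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/orders-api:1.4.2   # hypothetical tag
      volumeMounts:
        - name: config
          mountPath: /etc/orders-api/config          # hypothetical path
          readOnly: true
        - name: secrets
          mountPath: /etc/orders-api/secrets         # hypothetical path
          readOnly: true
        # Avoid subPath on these mounts: subPath files do not receive live updates.
  volumes:
    - name: config
      configMap:
        name: orders-api-config        # each key becomes a file
    - name: secrets
      secret:
        secretName: orders-api-secrets # each key becomes a file, e.g. db-password
```

The application then reads /etc/orders-api/secrets/db-password at startup (or watches it for changes) instead of reading an environment variable.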
securityContext and Pod Security
A hardened pod runs as an unprivileged user, with a read-only root filesystem, no extra Linux capabilities, and no privilege escalation. The fields live at pod and container level:
| Field | Set to | Why |
|---|---|---|
| runAsNonRoot: true | always | Refuses to start a container running as UID 0 |
| runAsUser / runAsGroup | a high non-zero UID (e.g. 10001) | Drops root explicitly |
| allowPrivilegeEscalation: false | always | Blocks setuid/setgid gaining more privilege than the parent |
| readOnlyRootFilesystem: true | where feasible | Immutable container FS; mount emptyDir for writable paths |
| capabilities.drop: ["ALL"] | always | Start from zero Linux capabilities, add back only what is needed |
| seccompProfile.type: RuntimeDefault | always | Restricts the syscalls the container can make |
These are enforced cluster-side by Pod Security Admission (PSA), the built-in replacement for the removed PodSecurityPolicy. PSA applies one of three Pod Security Standards per namespace via labels:
| Standard | What it allows | Use for |
|---|---|---|
| privileged | Unrestricted | System/infra namespaces only |
| baseline | Blocks known privilege escalations | A sane minimum for most apps |
| restricted | Enforces the hardening above (non-root, drop ALL, seccomp, etc.) | The target for production workloads |
You set it with namespace labels — pod-security.kubernetes.io/enforce: restricted (plus warn and audit variants to surface violations without blocking during migration). Aim every production namespace at restricted and make the workload comply, rather than weakening the namespace to fit a lax workload.
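One possible migration posture, sketched for a hypothetical payments namespace: enforce baseline today while warn and audit report what restricted would reject, then raise enforce once the warnings are gone.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments                       # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline     # blocks today
    pod-security.kubernetes.io/warn: restricted      # warns the client at apply time
    pod-security.kubernetes.io/audit: restricted     # annotates the audit log
```

When the namespace is clean, changing the enforce label to restricted completes the migration without touching the workloads again.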
NetworkPolicy: default-deny networking
By default, every pod can talk to every other pod in the cluster — a flat network with no segmentation. A NetworkPolicy restricts ingress and egress at the pod level (enforced by your CNI — Calico, Cilium, etc.; note that some CNIs do not enforce NetworkPolicy at all, so verify yours does).
The production baseline is default-deny, then allow what is needed: apply a policy that selects all pods in a namespace and denies all ingress (and ideally egress), then add narrow allow-policies for the specific flows your app needs — e.g. “allow ingress to the API on port 8080 from pods labelled role=frontend,” and “allow egress to the database namespace on 5432 and to kube-dns on 53.” This turns a single compromised pod from a cluster-wide pivot point into a contained incident. Remember to allow DNS egress (UDP/TCP 53 to kube-system) or name resolution breaks in subtle ways.
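The two allow flows described above could look roughly like this. The sketch assumes the app: orders-api and role: frontend labels, the shop namespace, and the conventional k8s-app: kube-dns label on the cluster DNS pods; check the labels in your own cluster, especially if it runs a node-local DNS cache.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api          # hypothetical name
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: orders-api
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: frontend           # same-namespace pods with this label
      ports:
        - protocol: TCP
          port: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress               # hypothetical name
  namespace: shop
spec:
  podSelector: {}                      # every pod in the namespace
  policyTypes: ["Egress"]              # also makes all other egress default-deny
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns        # assumed DNS pod label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Note that namespaceSelector and podSelector sit in the same list item, so both must match; putting them in separate items would allow either one, which is a much wider rule.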
Observability: metrics, logs and traces
You cannot operate what you cannot see. Production-ready means the three pillars are wired in from day one, not bolted on after the first incident:
| Pillar | What it gives you | Common stack |
|---|---|---|
| Metrics | Aggregate health, alerting, autoscaling signals | Prometheus + Grafana; expose /metrics, set prometheus.io/scrape or a ServiceMonitor |
| Logs | Per-request detail, debugging | Write structured logs to stdout/stderr; collect with Fluent Bit/Loki/ELK |
| Traces | Latency across service hops | OpenTelemetry → Tempo/Jaeger |
Three minimums: log to stdout/stderr (never to a file inside the container — the platform collects stdout), emit structured (JSON) logs so they are queryable, and expose application metrics including the RED signals (Rate, Errors, Duration) so you can define SLOs and drive the HPA on a meaningful signal. Wire metrics to your readiness/SLO story so alerts fire on user-visible symptoms, not just on pod restarts.
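For the annotation route, a sketch of what the pod metadata might carry is below. These prometheus.io/* annotations are a convention rather than a Kubernetes API: they only have an effect if your Prometheus scrape configuration is set up to honour them, which is an assumption here. The Pod name, image and port are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: metrics-demo                   # hypothetical name
  labels:
    app: orders-api
  annotations:
    prometheus.io/scrape: "true"       # convention; needs matching relabel rules
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: app
      image: registry.example.com/orders-api:1.4.2   # hypothetical tag
      ports:
        - name: http
          containerPort: 8080
```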
Autoscaling: the HorizontalPodAutoscaler
The HorizontalPodAutoscaler (HPA) adds and removes pod replicas to track a target metric — most commonly CPU utilisation as a percentage of the pod’s CPU request (which is exactly why requests are non-negotiable: with no request, the HPA has nothing to compute a percentage against). It needs the metrics-server installed.
Key knobs: minReplicas/maxReplicas (the bounds), the target (e.g. averageUtilization: 70), and behavior (scale-up/down stabilisation windows and rate limits, to damp flapping). For metrics beyond CPU/memory — queue depth, requests-per-second, external signals — you graduate to KEDA, covered in Kubernetes Autoscaling: HPA, KEDA & Karpenter. Pair the HPA with a PDB and topology spread so scaling events keep replicas well distributed and respect disruption limits.
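To make the behavior knob concrete, here is a sketch of an HPA for the orders-api Deployment with damped scale-down and bounded scale-up. The window and rate values are illustrative assumptions, not tuned recommendations; treat this as a variant of the workload's single HPA, since two HPAs targeting the same Deployment will fight each other.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # act on the highest recommendation of the last 5 min
      policies:
        - type: Percent
          value: 50                    # remove at most 50% of replicas per minute
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0    # react to load immediately
      policies:
        - type: Pods
          value: 4                     # add at most 4 pods per minute
          periodSeconds: 60
```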
The diagram groups every control above into the four readiness pillars — health & lifecycle, resources & scaling, resilience & disruption, and security & networking — so you can see at a glance which knob defends against which failure mode.
The copy-paste production-readiness checklist
Paste this into your pull-request template or a READINESS.md and tick each box before a workload carries real traffic.
PRODUCTION-READINESS CHECKLIST (tick every box before go-live)
HEALTH & LIFECYCLE
[ ] Readiness probe defined; reflects "can serve traffic now" (warm-up + critical deps)
[ ] Liveness probe defined; cheap, local, no external dependency calls
[ ] Startup probe for slow-starting apps (so liveness can stay tight)
[ ] App traps SIGTERM and drains in-flight work before exit
[ ] preStop hook (sleep 5-15s) to cover async endpoint removal
[ ] terminationGracePeriodSeconds > preStop sleep + longest in-flight request
RESOURCES & SCALING
[ ] CPU + memory requests set on every container
[ ] Memory limit == memory request (predictable; avoid OOM surprises)
[ ] QoS class is Guaranteed or Burstable (never BestEffort for prod)
[ ] HPA configured with min/max and a meaningful target (CPU% or custom)
[ ] metrics-server (and Prometheus adapter / KEDA if custom metrics) installed
[ ] Namespace LimitRange + ResourceQuota in place (shared clusters)
RESILIENCE & DISRUPTION
[ ] replicas >= 2 (>=3 for quorum/HA services)
[ ] PodDisruptionBudget set (maxUnavailable or minAvailable) and not blocking drains
[ ] Topology spread across nodes (and zones) configured
[ ] RollingUpdate: maxUnavailable: 0 / maxSurge: 1 (capacity never dips), or justified
[ ] No single points of failure pinned to one node/zone
CONFIG & SECRETS
[ ] Config externalised to ConfigMap (no config baked into the image)
[ ] Secrets in Secret objects; encryption-at-rest enabled (or external store/CSI)
[ ] Secrets mounted as files where possible (not env); checksum annotation to roll on change
SECURITY
[ ] runAsNonRoot: true, runAsUser a high non-zero UID
[ ] allowPrivilegeEscalation: false; capabilities drop ALL
[ ] readOnlyRootFilesystem: true (+ emptyDir for writable paths)
[ ] seccompProfile: RuntimeDefault
[ ] Namespace at Pod Security 'restricted' (enforce)
[ ] Image pinned by digest; scanned; pulled from a trusted registry
NETWORKING
[ ] Default-deny NetworkPolicy in the namespace
[ ] Explicit allow rules for required ingress/egress (incl. DNS egress to kube-dns)
OBSERVABILITY
[ ] Logs to stdout/stderr, structured (JSON)
[ ] App metrics exposed (/metrics) incl. Rate/Errors/Duration
[ ] Dashboards + alerts on user-visible SLOs; tracing wired (OpenTelemetry)
[ ] Labels/annotations: app, version, owner, runbook link
A hardened Deployment manifest
This single manifest wires together almost every control above — probes, resources for a Guaranteed pod, graceful shutdown, a safe rolling update, externalised config, a full securityContext, and topology spread. Read it top to bottom; the inline comments map each block back to the checklist.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  labels:
    app: orders-api
    version: "1.4.2"                 # observability: every object carries app + version
spec:
  replicas: 3                        # resilience: >=3 so a PDB + spread are meaningful
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0              # capacity never dips below desired during a rollout
      maxSurge: 1                    # one new (Ready) pod created before an old one goes
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
        version: "1.4.2"
      annotations:
        checksum/config: "REPLACED_BY_CI_WITH_HASH"   # roll pods when ConfigMap changes
    spec:
      terminationGracePeriodSeconds: 45   # > preStop sleep + longest in-flight request
      securityContext:               # pod-level: applies to all containers
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # spread across zones, best effort
          labelSelector:
            matchLabels:
              app: orders-api
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule    # hard: per-node replica counts differ by at most 1
          labelSelector:
            matchLabels:
              app: orders-api
      containers:
        - name: orders-api
          image: registry.example.com/orders-api@sha256:<digest>   # pin by digest
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          envFrom:
            - configMapRef:
                name: orders-api-config       # externalised, non-sensitive config
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: orders-api-secrets    # sensitive value from a Secret
                  key: db-password
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "500m"            # requests == limits => Guaranteed QoS
              memory: "512Mi"        # memory limit == request avoids OOM surprises
          startupProbe:              # generous one-time boot budget
            httpGet: { path: /healthz, port: http }
            periodSeconds: 5
            failureThreshold: 30     # up to 150s to start, then hand over
          readinessProbe:            # gates Service endpoints
            httpGet: { path: /readyz, port: http }
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 3
          livenessProbe:             # cheap, local; restarts a wedged process
            httpGet: { path: /healthz, port: http }
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 10"]   # cover async endpoint removal
          securityContext:           # container-level hardening
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp        # writable path despite read-only root FS
      volumes:
        - name: tmp
          emptyDir: {}
Pair it with the three companion objects the checklist demands — a PDB, an HPA, and a default-deny NetworkPolicy:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api
spec:
  maxUnavailable: 1                  # node drains take at most one replica at a time
  selector:
    matchLabels:
      app: orders-api
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # 70% of the pod's CPU request
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}                    # selects every pod in the namespace
  policyTypes: ["Ingress"]           # deny all ingress; add explicit allow-policies next
Hands-on lab
You will harden a workload on a free local cluster, then prove each control works — watching a rollout stay up, a PDB block a drain, and a missing-request pod fail to autoscale. Roughly 25 minutes.
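The spread and drain steps are only observable with more than one node. If you use kind, a config along these lines is one way to get that; the file name kind-multinode.yaml is hypothetical. It assumes four workers so that, under the manifest's hard per-node spread constraint, a surge pod during a rollout and a replacement pod during a drain each have an empty worker to land on.

```yaml
# kind-multinode.yaml (hypothetical file name)
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker
  - role: worker
```

Pass it when creating the cluster in step 1: kind create cluster --name ready --config kind-multinode.yaml. A default single-node cluster still works for every step except the drain, which needs somewhere to reschedule evicted pods.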
1. Create a cluster and a namespace
# kind (or: minikube start / k3d cluster create ready)
kind create cluster --name ready
kubectl create namespace shop
kubectl label namespace shop \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/warn=restricted
Labelling the namespace restricted means Pod Security Admission will reject any pod that is not hardened — a fast way to verify your manifest actually complies.
2. Try an unhardened pod (and watch it get rejected)
kubectl -n shop run nginx --image=nginx:1.27
Expected: the request is denied with a message listing violations (allowPrivilegeEscalation != false, unrestricted capabilities, runAsNonRoot != true, seccompProfile). This is Pod Security doing its job — proof that “restricted” is enforced.
3. Deploy the hardened workload
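Save the hardened Deployment above as orders-api.yaml, together with the PDB and HPA. Swap the image for a runnable hardened one: ghcr.io/nginxinc/nginx-unprivileged:1.27 listens on 8080 and runs as non-root, so point all three probes (startup, readiness and liveness) at /. The manifest also references a ConfigMap named orders-api-config and a Secret named orders-api-secrets; create them first, for example with kubectl -n shop create configmap orders-api-config --from-literal=LOG_LEVEL=info and kubectl -n shop create secret generic orders-api-secrets --from-literal=db-password=changeme (the LOG_LEVEL key and both values are placeholders), or the pods will sit in CreateContainerConfigError. Then apply: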
kubectl -n shop apply -f orders-api.yaml
kubectl -n shop rollout status deploy/orders-api
kubectl -n shop get pods -o wide # confirm spread across nodes
Expected: three pods reach Running and READY 1/1. On a multi-node cluster the -o wide output shows them on different nodes (topology spread). Confirm the QoS class is Guaranteed:
kubectl -n shop get pod -l app=orders-api \
-o jsonpath='{.items[0].status.qosClass}{"\n"}'
# -> Guaranteed
4. Watch a zero-downtime rollout
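# In terminal 1, expose the Deployment, then hammer the Service.
# The curl pod runs in the default namespace: a plain `kubectl run` pod has no
# securityContext and would be rejected by the restricted policy enforced on shop.
kubectl -n shop expose deploy/orders-api --port=80 --target-port=8080
kubectl run curl --image=curlimages/curl --restart=Never -it --rm -- \
sh -c 'while true; do curl -s -o /dev/null -w "%{http_code}\n" orders-api.shop; sleep 0.5; done'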
# In terminal 2, trigger a rollout:
kubectl -n shop set image deploy/orders-api orders-api=ghcr.io/nginxinc/nginx-unprivileged:1.26
Expected: the curl loop keeps printing 200 throughout — maxUnavailable: 0 plus a working readiness probe means no request is dropped.
5. Prove the PodDisruptionBudget protects you
NODE=$(kubectl -n shop get pod -l app=orders-api \
-o jsonpath='{.items[0].spec.nodeName}')
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
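Expected: on a multi-node cluster the drain evicts the workload's pods one at a time, waiting for replacements to become Ready, because maxUnavailable: 1 forbids taking down more than one at once. On a single-node cluster the first evicted pod has no other node to land on, so the drain stalls at the PDB; interrupt it with Ctrl-C. With a single replica and minAvailable: 1, this command would block from the start, and that is the foot-gun to avoid. Uncordon when done: kubectl uncordon "$NODE".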
6. See why requests matter for autoscaling
kubectl -n shop describe hpa orders-api | grep -A3 Metrics
If metrics-server is installed you will see a CPU percentage; if you had omitted CPU requests, the HPA would report <unknown> and refuse to scale — the concrete reason requests are non-negotiable. (On kind, install metrics-server with --kubelet-insecure-tls to see live numbers.)
Cleanup
kubectl delete namespace shop
kind delete cluster --name ready # or: minikube delete / k3d cluster delete ready
Cost note
Everything here runs on a free local cluster (kind/minikube/k3d) on your laptop — zero cloud spend. The only cost is the few hundred MB of RAM the control plane and three small pods use.
Common mistakes & troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Requests dropped during every rollout | No readiness probe, or it reports ready too early | Add an honest readiness probe gating real serving capability; set maxUnavailable: 0 |
| Restart storm during a dependency outage | Liveness probe calls the slow/down dependency | Make liveness cheap and local; check dependencies in readiness, not liveness |
| Pod OOMKilled, restarts repeatedly | Memory limit set below the app's real working set | Set memory request == limit to the observed working set; right-size with VPA recommendations |
| kubectl drain hangs forever | PDB cannot be satisfied (e.g. single replica, minAvailable: 1) | Run >=2 replicas; relax PDB; or --disable-eviction only as a last resort |
| All replicas on one node; node drain caused an outage | No topology spread / anti-affinity | Add topology spread on kubernetes.io/hostname (and zone) |
| Pod rejected at apply with policy violations | Namespace enforces restricted; manifest not hardened | Add the full securityContext (non-root, drop ALL, seccomp, no priv-esc) |
| Requests fail briefly during deploys, scale-downs or drains | Connections to Terminating pods during async endpoint removal | Add a preStop sleep; ensure the app traps SIGTERM and drains |
| HPA shows <unknown> targets, never scales | No CPU/memory request, or metrics-server missing | Set requests; install metrics-server |
| Config change not picked up | ConfigMap updated but pods not restarted | Add a checksum/config annotation to the pod template to force a rollout |
| DNS resolution fails after adding NetworkPolicy | Default-deny egress blocks port 53 to kube-dns | Add an egress allow rule to kube-system DNS on UDP/TCP 53 |
Best practices
- Make readiness honest and liveness cheap. Readiness gates user traffic; liveness only restarts a wedged process. Never let liveness depend on an external system.
- Always set requests; set memory limits equal to memory requests. Be deliberate (and often sparing) with CPU limits — throttling hurts latency without preventing failure.
- Run at least two (ideally three) replicas for anything that serves traffic, and back them with a PDB plus topology spread so neither maintenance nor a node failure can take you down.
- Roll out with maxUnavailable: 0 / maxSurge: 1 for capacity-sensitive services, and pair it with graceful shutdown (preStop + SIGTERM handling + a sufficient grace period).
- Externalise config and secrets, mount secrets as files, enable encryption at rest, and force rollouts on config change with a checksum annotation.
- Pin images by digest, scan them, and standardise labels (app, version, owner, runbook link) so observability and incident response have something to key on.
- Treat the checklist as a gate, not a wishlist: enforce it with Pod Security Admission and policy-as-code (Kyverno/OPA Gatekeeper) so non-compliant workloads cannot reach production.
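The checksum annotation works because any change to the pod template, including an annotation value, causes the Deployment to start a new rollout. A minimal sketch of the relevant fragment follows. The key checksum/config is a convention rather than something Kubernetes interprets, and the value is a placeholder that your deploy tooling would have to compute, for example as a SHA-256 of the rendered ConfigMap.

```yaml
# Fragment of a Deployment; only the fields relevant to the technique are shown.
spec:
  template:
    metadata:
      annotations:
        # Assumption: the key name is a convention, and the value is a placeholder
        # that CI/CD recomputes from the ConfigMap contents on every deploy.
        checksum/config: "<sha256-of-configmap-contents>"
```

When the ConfigMap content changes, the recomputed hash changes the pod template, and the normal rolling-update strategy replaces the pods.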
Security notes
Production-readiness is security here. Three points deserve emphasis. First, Secrets are base64, not encrypted, by default — anyone with get secret RBAC or etcd access can read them; enable encryption at rest, prefer mounting over env vars, and consider an external store via the Secrets Store CSI driver. Second, the default flat network is a lateral-movement highway — a default-deny NetworkPolicy turns a single compromised pod into a contained incident instead of a cluster-wide pivot; just remember to allow DNS egress. Third, restricted Pod Security is the floor, not the ceiling — a non-root, read-only, capability-stripped pod with RuntimeDefault seccomp removes the most common container-escape and privilege-escalation paths; layer on Pod Security Admission to enforce it cluster-side. Combine least-privilege RBAC, image provenance (signed, scanned, digest-pinned), and these pod-level controls for defence in depth.
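Cluster-side enforcement of the third point is configured with namespace labels. The sketch below assumes a hypothetical namespace name. The three label keys are the standard Pod Security Admission modes: enforce rejects non-compliant pods, while warn and audit only report them.

```yaml
# Sketch: a namespace that rejects pods violating the restricted standard.
apiVersion: v1
kind: Namespace
metadata:
  name: prod-apps                 # hypothetical namespace name
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject violating pods
    pod-security.kubernetes.io/warn: restricted      # warn the client at apply time
    pod-security.kubernetes.io/audit: restricted     # record violations in the audit log
```

When migrating an existing namespace, one cautious approach is to set only warn and audit first, fix what they report, and add enforce afterwards.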
Interview & exam questions
- What is the difference between a liveness and a readiness probe, and what happens when each fails?
  Liveness answers “is this container wedged?”, and on failure the container is restarted. Readiness answers “should this pod get traffic?”, and on failure the pod is removed from Service endpoints but not restarted. Liveness fixes hangs; readiness controls traffic during warm-up, overload or dependency loss.
- When and why would you add a startup probe?
  For slow-starting apps (JVM, legacy). It gives a generous one-time boot budget and gates liveness/readiness until it passes, so you can keep the liveness probe tight for the rest of the container’s life instead of inflating initialDelaySeconds.
- Why should a liveness probe never call an external dependency?
  If the dependency is slow or down, the probe fails, Kubernetes restarts the container, and you get a restart storm that makes recovery harder, turning a dependency blip into a self-inflicted outage. Liveness must be cheap and local.
- What determines a pod’s QoS class, and why does it matter?
  The relationship between requests and limits. Guaranteed = requests equal limits for both CPU and memory in every container; Burstable = at least one request or limit is set, but the Guaranteed criteria are not met; BestEffort = none set. QoS sets eviction order under node memory pressure: BestEffort is evicted first, Guaranteed last.
- What happens when a container exceeds its CPU limit versus its memory limit?
  Over the CPU limit it is throttled (slowed, never killed). Over the memory limit it is OOMKilled and restarted. Hence: be cautious with CPU limits (throttling hurts latency); set memory limit equal to request for predictability.
- What does a PodDisruptionBudget protect against, and what does it not?
  It limits voluntary disruptions (drains, autoscaler scale-down, node-pool upgrades) so too many replicas are not evicted at once. It does not protect against involuntary disruptions (node/hardware failure); spread (topology/anti-affinity) handles those. And it only works with >1 replica.
- How do you achieve a zero-downtime rolling update?
  Run multiple replicas, set maxUnavailable: 0 and maxSurge: 1 (a new Ready pod before removing an old one), back it with an honest readiness probe, and implement graceful shutdown (preStop sleep + SIGTERM handling + adequate terminationGracePeriodSeconds).
- Why might requests still reach a pod after it enters Terminating?
  Endpoint removal propagates asynchronously through kube-proxy and ingress controllers, so for a short window a terminating pod can still be a target. Mitigate with a preStop sleep that delays SIGTERM until the removal has propagated, plus in-app connection draining.
- Prefer topology spread constraints or pod anti-affinity, and why?
  Topology spread constraints for new work: they express “spread evenly across this domain” directly with maxSkew, are cheaper for the scheduler, and support soft/hard via whenUnsatisfiable. Anti-affinity is the older, more expensive mechanism for keeping pods apart.
- How does the HorizontalPodAutoscaler use resource requests?
  CPU utilisation is computed as a percentage of the pod’s CPU request, so without a request the HPA has no denominator and reports <unknown>, refusing to scale. This is a key reason requests are mandatory. The HPA also needs metrics-server.
- What replaced PodSecurityPolicy, and how do you enforce hardening cluster-side?
  Pod Security Admission (PSA), applied per namespace via labels (pod-security.kubernetes.io/enforce: restricted, with warn/audit for migration). It enforces the Pod Security Standards (privileged / baseline / restricted); restricted requires non-root, dropped capabilities, seccomp RuntimeDefault, no privilege escalation, etc.
- What is the default pod-to-pod network behaviour, and how do you secure it?
  By default every pod can reach every other pod. Apply a default-deny NetworkPolicy (select all pods, deny ingress/egress), then add narrow allow-rules per required flow, remembering to allow DNS egress to kube-dns on port 53. Enforcement depends on a CNI that supports NetworkPolicy.
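To make the HPA answer above concrete, here is a minimal sketch. It assumes a Deployment named web (hypothetical) whose pods each request 250m CPU, and the replica bounds are illustrative. With averageUtilization: 70, the HPA aims for roughly 175m average CPU usage per pod (70% of the 250m request) and adjusts the replica count to track that.

```yaml
# Sketch: scale a Deployment on CPU utilisation relative to its CPU request.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web                       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                     # hypothetical target Deployment
  minReplicas: 3                  # illustrative bounds
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # percent of each pod's CPU request
```

If the target containers had no CPU request, the percentage would have no denominator, which is the <unknown> case described in the answer.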
Quick check
- Which probe controls whether a pod appears in a Service’s endpoint list?
- A pod has CPU/memory requests equal to its limits. What QoS class is it, and where does it sit in eviction order?
- You set minAvailable: 1 on a single-replica Deployment and then run kubectl drain. What happens?
- What two rolling-update fields give you “never drop below desired capacity,” and what value does each take?
- Name the three minimum observability practices for a production workload.
Answers
- The readiness probe — on failure the pod is removed from Service endpoints (it is not restarted).
- Guaranteed, and it is evicted last under node memory pressure (most protected).
- The drain blocks indefinitely: evicting the only replica would breach minAvailable: 1, so the node cannot be drained. Run at least two replicas.
- maxUnavailable: 0 (no pod may be unavailable) and maxSurge: 1 (one extra Ready pod is created before an old one is removed).
- Log to stdout/stderr, emit structured (JSON) logs, and expose application metrics (Rate/Errors/Duration) for SLOs and autoscaling.
Exercise
Take an unhardened Deployment of your choice (or the bare nginx from the lab) and bring it to production-readiness against the checklist, proving each control:
- Add liveness, readiness and startup probes pointing at real endpoints; demonstrate that failing readiness drops the pod from Service endpoints (kubectl get endpoints) without a restart.
- Set requests and limits to land the pod in Guaranteed QoS; verify with kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'.
- Scale to three replicas, add a PDB (maxUnavailable: 1) and topology spread across nodes; drain a node and show eviction proceeds one pod at a time.
- Configure maxUnavailable: 0 / maxSurge: 1, add a preStop sleep and a sensible grace period, and show a rollout that keeps a curl loop returning 200 throughout.
- Move all config to a ConfigMap and any secret to a Secret; apply the full restricted securityContext and confirm the pod is admitted into a restricted namespace.
- Add a default-deny NetworkPolicy plus the minimum allow-rules (including DNS egress) and confirm the app still works.
Write a short READINESS.md recording which checklist items you completed and the command that proves each one — exactly what a reviewer would ask for.
Certification mapping
| Exam | Where this lesson maps |
|---|---|
| CKAD | Application Design and Build (probes, multi-container patterns, config), Application Deployment (rolling updates, deployment strategies), Application Observability and Maintenance (probes, logging, monitoring), Services & Networking (NetworkPolicy) — this is core CKAD territory |
| CKA | Workloads & Scheduling (deployments, rolling updates, resource limits, PDBs, topology), Services & Networking (NetworkPolicy), Troubleshooting (probe and resource failures) |
| CKS | Minimize Microservice Vulnerabilities (securityContext, Pod Security Standards), System Hardening and Cluster Hardening (NetworkPolicy default-deny, least privilege) |
| KCNA | Conceptual coverage of probes, resources, scaling and observability for the entry-level exam |
Glossary
- Liveness probe — a check that, on failure, restarts the container; for detecting wedged processes.
- Readiness probe — a check that, on failure, removes the pod from Service endpoints; for controlling traffic.
- Startup probe — a one-time boot-budget check that gates liveness/readiness for slow-starting apps.
- Request — the CPU/memory the scheduler reserves for a container; the basis for bin-packing and HPA percentages.
- Limit — the hard ceiling enforced by the kubelet/runtime; CPU is throttled, memory is OOMKilled.
- QoS class — Guaranteed / Burstable / BestEffort, derived from requests vs limits; sets eviction order.
- PodDisruptionBudget (PDB) — caps how many pods may be voluntarily disrupted at once.
- Topology spread constraint — scheduler rule to distribute pods evenly across a topology domain (node, zone).
- Voluntary vs involuntary disruption — operator-initiated (drain, scale-down) vs unplanned (node failure).
- Graceful shutdown — endpoint removal → preStop → SIGTERM → (grace period) → SIGKILL; lets a pod drain cleanly.
- securityContext — pod/container security settings (non-root, capabilities, read-only FS, seccomp).
- Pod Security Admission (PSA) — built-in admission controller enforcing the Pod Security Standards per namespace.
- NetworkPolicy — pod-level ingress/egress firewall rules enforced by the CNI.
- HorizontalPodAutoscaler (HPA) — scales replica count to track a target metric (often CPU% of request).
Next steps
You can now take any workload from “it runs” to “it is production-ready.” Next, learn to build and operate the cluster itself — HA control planes, etcd backup and safe upgrades — in Provisioning Production Kubernetes: kubeadm, HA Control Plane, etcd Backup & Upgrades. To go deeper on individual controls, see Kubernetes Autoscaling: HPA, KEDA & Karpenter, Right-Sizing with the Vertical Pod Autoscaler, Default-Deny Network Policies with Cilium, and Pod Security Admission: Baseline to Restricted.