
Provisioning Production Kubernetes: kubeadm, HA Control Plane, etcd Backup & Upgrades

Up to now in this course you have used clusters — deploying workloads, making them production-ready, troubleshooting them. This lesson is about building and running the cluster itself. When you let a cloud provider hand you a managed control plane (EKS, AKS, GKE), an enormous amount of careful engineering is hidden from you: how the API server is made redundant, where the cluster’s data actually lives, how the TLS certificates that hold the whole thing together are issued and rotated, and how you upgrade a live cluster without dropping a single request. The day you self-manage — on-prem, at the edge, in an air-gapped environment, or simply to understand what the managed service is doing for you — all of that becomes your job.

We will use kubeadm, the official, opinionated cluster-bootstrapping tool. kubeadm is not a full installer (it deliberately leaves CNI, OS prep, and infrastructure to you), but it is the reference for how a conformant cluster is assembled, and it is exactly what the CKA exam expects you to drive. By the end you will understand a highly available (HA) control plane built from three or more control-plane nodes behind a load balancer; the crucial choice between stacked and external etcd; how to back up and restore etcd (the single most important operational skill for a self-managed cluster); how to safely take nodes in and out of service; and how to perform a staged, version-skew-aware upgrade of a running cluster. You will do the hands-on parts free and locally so the commands are real, not abstract.

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You should be comfortable with the control-plane components and the reconciliation loop from Kubernetes Architecture Deep-Dive, and with the workload concepts from Pods, Deployments & Services and the Day-2 readiness checklist. You need a terminal, and for the hands-on lab a free local cluster tool — kind, minikube, or k3d — plus kubectl and etcdctl. This is the Architecture tier of the Kubernetes Zero-to-Hero course: it is where you stop being a consumer of clusters and start being the person who can build, operate, recover, and upgrade one. It is also the densest single block of CKA exam content, so the commands here are worth committing to muscle memory.

A note on honesty before we start: you will almost never hand-build a production cluster the way this lesson does, command by command — you will use a higher-level installer (Cluster API, kubespray, Rancher, or a managed service). But you will be operating, recovering, and upgrading such clusters, and to do that well you must understand the anatomy underneath. That anatomy is the whole point of this lesson.

Core concepts: what a cluster actually is

Strip away the abstractions and a Kubernetes cluster is two things: a control plane that holds the desired state and makes decisions, and a set of worker nodes that run your Pods. The control plane is itself a small set of programs:

| Component | Role | How kubeadm runs it |
| --- | --- | --- |
| kube-apiserver | The front door: the only component that talks to etcd; everything else talks to it | Static Pod on each control-plane node |
| etcd | The cluster’s database: all state lives here as key/value data | Static Pod (stacked) or separate machines (external) |
| kube-scheduler | Assigns Pods to nodes | Static Pod on each control-plane node (leader-elected) |
| kube-controller-manager | Runs the reconciliation controllers | Static Pod on each control-plane node (leader-elected) |
| kubelet | Node agent: talks to the container runtime via CRI; runs the static Pods above | systemd service on every node |
| kube-proxy | Programs Service networking (iptables/IPVS) | DaemonSet |

Two ideas do most of the heavy lifting in this lesson:

Jargon check. Quorum is the minimum number of etcd members that must agree (a majority) for the cluster’s data store to accept writes. PKI (public-key infrastructure) is the set of certificate authorities and certificates that let cluster components prove who they are to each other over TLS. Version skew is the set of rules about how far apart component versions are allowed to drift during an upgrade.

kubeadm: what init and join actually do

kubeadm does the fiddly, security-critical bootstrap and nothing more. It is worth knowing exactly what each command produces, because half of CKA cluster questions are really “do you understand the artefacts kubeadm created?”

kubeadm init on the first control-plane node performs these phases (you can run them individually with kubeadm init phase ...):

  1. Preflight checks — kernel modules, swap off, ports free, container runtime reachable, correct cgroup driver (systemd is the modern default).
  2. Generate the PKI — a self-signed cluster CA at /etc/kubernetes/pki/, plus the etcd CA, the front-proxy CA, and the service-account signing keypair. Every component cert is signed by one of these.
  3. Write kubeconfigs: admin.conf (your cluster-admin credential; this is what you copy to ~/.kube/config), plus controller-manager.conf, scheduler.conf, and the kubelet’s bootstrap config.
  4. Write static Pod manifests to /etc/kubernetes/manifests/ for the apiserver, controller-manager, scheduler, and (in the stacked topology) etcd. The kubelet starts them.
  5. Bootstrap tokens & add-ons — install the cluster DNS (CoreDNS) and kube-proxy, and print a kubeadm join command containing a token and the CA cert hash.

A minimal init for an HA cluster names the load balancer up front so the certificates and kubeconfigs point at the stable address, not at one node:

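# On the FIRST control-plane node
#   --control-plane-endpoint : the LB DNS:port (REQUIRED for HA)
#   --upload-certs           : stash certs in a Secret so other CP nodes can pull them
#   --pod-network-cidr       : must match your CNI
# (Comments sit above the command: a comment after a trailing backslash breaks the line continuation.)
sudo kubeadm init \
  --control-plane-endpoint "k8s-api.internal:6443" \
  --upload-certs \
  --pod-network-cidr "10.244.0.0/16" \
  --kubernetes-version v1.30.2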

--control-plane-endpoint is the make-or-break flag for HA: it bakes the load balancer’s address into every cert and kubeconfig. Omit it and you can never add a second control-plane node without regenerating certificates. --upload-certs puts the control-plane certificates into a short-lived (2-hour) Secret so additional control-plane nodes can fetch them with a --certificate-key instead of you copying files by hand.

kubeadm join comes in two flavours, and the difference is one flag:

# Join a WORKER node
sudo kubeadm join k8s-api.internal:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>

# Join an additional CONTROL-PLANE node (note --control-plane + --certificate-key)
sudo kubeadm join k8s-api.internal:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <key-from-upload-certs>

The --discovery-token-ca-cert-hash is what stops a worker from being tricked into joining an imposter API server — it pins the cluster CA. Tokens expire after 24 hours by default; regenerate one any time with kubeadm token create --print-join-command.
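To make the CA pinning concrete, here is a minimal sketch of how the sha256 value can be recomputed on a control-plane node from the cluster CA certificate. It follows the openssl pipeline given in the kubeadm documentation; the function name ca_cert_hash is a hypothetical helper, and the sketch assumes the CA uses an RSA key (the kubeadm default) and that openssl is installed.

```shell
# ca_cert_hash [CERT]: print "sha256:<hex>" for the public key of a CA certificate.
# Hypothetical helper; defaults to the kubeadm CA path. Assumes an RSA key.
ca_cert_hash() {
  cert="${1:-/etc/kubernetes/pki/ca.crt}"
  hash=$(openssl x509 -pubkey -noout -in "$cert" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | openssl dgst -sha256 -hex \
    | sed 's/^.* //')        # keep only the hex digest, drop the "(stdin)= " prefix
  echo "sha256:$hash"
}
```

Running ca_cert_hash on a control-plane node should print the same value that kubeadm token create --print-join-command embeds in the join command, because both are derived from the CA's public key rather than from the token.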

Designing an HA control plane

A single control-plane node is a single point of failure: lose it and you cannot schedule, scale, or change anything (existing Pods keep running, but the cluster is “frozen”). HA means three or more control-plane nodes so the cluster keeps making decisions through the loss of one. Two things must be made redundant: the API server (stateless, easy — just run several behind a load balancer) and etcd (stateful, the hard part — covered next).

The non-negotiable pieces of an HA control plane:

| Piece | Requirement | Why |
| --- | --- | --- |
| Control-plane nodes | Odd number ≥ 3 (3 is standard; 5 for large/critical) | etcd needs a majority; odd counts maximise fault tolerance per node |
| Load balancer | TCP (layer 4) LB in front of every apiserver on :6443 | Gives one stable --control-plane-endpoint; health checks take dead apiservers out of rotation |
| etcd | 3 or 5 members with quorum | The data store must survive a node loss |
| Spread | Across racks/AZs/failure domains | A correlated failure must not take a majority at once |

Why odd numbers and why three? etcd uses the Raft consensus protocol and must have a majority (quorum) of members available to accept writes. The fault tolerance is (n-1)/2 rounded down:

| etcd members | Quorum (majority) | Failures tolerated |
| --- | --- | --- |
| 1 | 1 | 0 |
| 2 | 2 | 0 (worse than 1: you doubled risk for no gain) |
| 3 | 2 | 1 |
| 4 | 3 | 1 (no better than 3, more overhead) |
| 5 | 3 | 2 |
| 7 | 4 | 3 |

The killer insight in that table: two members tolerate zero failures (you need both, so either dying loses quorum), and four are no more resilient than three. Even numbers buy you nothing but cost and write latency, so you always run an odd number, and three is the sweet spot for most clusters. Beyond seven, write latency (every write must reach a majority) outweighs the extra resilience, so etcd recommends a maximum of seven members.
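The quorum arithmetic is simple enough to check by hand or in a few lines of shell. The sketch below uses two hypothetical helper functions, quorum and failures_tolerated, which apply the majority rule and the (n-1)/2 rounded-down rule from the text using integer division.

```shell
# quorum N: the smallest majority of an N-member etcd cluster.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# failures_tolerated N: members that can be lost while a majority survives,
# i.e. (N-1)/2 rounded down (shell integer division already rounds down).
failures_tolerated() {
  echo $(( ($1 - 1) / 2 ))
}

# Print one line per cluster size discussed above.
for n in 1 2 3 4 5 7; do
  printf 'members=%d quorum=%d tolerated=%d\n' \
    "$n" "$(quorum "$n")" "$(failures_tolerated "$n")"
done
```

The loop prints the same quorum and tolerance figures as the table, which makes the even-number penalty easy to see: going from 3 to 4 members raises the quorum from 2 to 3 while the tolerated failures stay at 1.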

The load balancer should be layer 4 (TCP), not layer 7, because the API server speaks TLS end-to-end and you do not want the LB terminating it. Its health check should probe /readyz (or /livez; the older /healthz endpoint is deprecated) on :6443 so a wedged apiserver is taken out of rotation. On-prem this is commonly HAProxy + keepalived (a virtual IP that floats to a healthy node); in cloud it is a network load balancer.

Stacked vs external etcd: the architecture decision

This is the most consequential architecture choice in this lesson, and a classic interview question. kubeadm supports two topologies for where etcd runs.

Stacked etcd (the kubeadm default): each control-plane node also runs an etcd member, co-located as a static Pod on the same machine. Three control-plane nodes give you a three-member etcd cluster “for free.”

External etcd: etcd runs on its own dedicated set of machines, separate from the control-plane nodes. The API servers point at this external etcd cluster.

Stacked (3 nodes):                     External (3 + 3 nodes):
┌─────────────┐                        ┌─────────────┐   ┌──────────┐
│ CP1: api,   │                        │ CP1: api,   │   │ etcd-1   │
│   sched,    │                        │   sched,    │──▶│          │
│   ctrl,etcd │                        │   ctrl      │   │ etcd-2   │
├─────────────┤   etcd cluster         ├─────────────┤──▶│          │
│ CP2: ...etcd│   (co-located)         │ CP2: ...    │   │ etcd-3   │
├─────────────┤                        ├─────────────┤──▶│ (own     │
│ CP3: ...etcd│                        │ CP3: ...    │   │ machines)│
└─────────────┘                        └─────────────┘   └──────────┘

The comparison that matters:

| Dimension | Stacked etcd | External etcd |
| --- | --- | --- |
| Machines needed | 3 (CP = etcd) | 6+ (3 CP + 3 etcd) |
| Cost & ops complexity | Lower: fewer machines, kubeadm manages it | Higher: separate cluster to provision, secure, back up, upgrade |
| Blast radius | Losing a node loses an apiserver and an etcd member together | etcd and control-plane failures are decoupled |
| etcd I/O isolation | etcd competes with apiserver for CPU/disk on the same box | etcd gets dedicated, fast disks (it is latency-sensitive) |
| Scaling etcd independently | No; tied to control-plane node count | Yes |
| kubeadm default | Yes | No (you provision etcd first, then point kubeadm at it) |
| Best for | Most clusters; simplicity wins | Large/critical clusters; strict isolation; very high object counts |

The right default is stacked — it is simpler, cheaper, and what most clusters run. Reach for external when you need to isolate etcd’s disk I/O (etcd is acutely sensitive to disk latency — slow disks cause leader elections and cluster-wide slowdowns), when your blast-radius requirements forbid coupling a control-plane failure with an etcd-member failure, or when a very large cluster pushes etcd hard enough to warrant its own tuned hardware. The trade-off is real operational weight: a second cluster to secure (its own TLS), back up, and upgrade.

For external etcd, you set up the etcd cluster yourself (kubeadm can help through its etcd phase, kubeadm init phase etcd local, run on each etcd host), then feed kubeadm a config file pointing at it:

# kubeadm-config.yaml (external etcd)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.30.2
controlPlaneEndpoint: "k8s-api.internal:6443"
etcd:
  external:
    endpoints:
      - https://10.0.1.10:2379
      - https://10.0.1.11:2379
      - https://10.0.1.12:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key

etcd backup and restore: the skill that saves your job

If you remember one thing from this lesson, make it this. An etcd snapshot is the difference between a five-minute recovery and a destroyed cluster. Managed Kubernetes does this for you invisibly; on a self-managed cluster it is your responsibility, and “we never took a backup” is how clusters die permanently.

A snapshot is a point-in-time copy of the entire key-value store — every object in the cluster. Take it with etcdctl, pointing at a live etcd member with that member’s client certificates:

# Back up etcd (run on a control-plane node; v3 API is required)
sudo ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is intact and see how many keys it holds
sudo ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db --write-out=table

In production you run that on a schedule (a CronJob or a systemd timer), copy the snapshot off the cluster to object storage, and test restoring it periodically — an untested backup is a rumour, not a backup.
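A scheduled backup usually needs two things beyond the snapshot command itself: timestamped file names and a retention rule. The sketch below is one possible shape for a script driven by a systemd timer or cron. The directory /var/backups/etcd, the retention count, and the function names are assumptions for illustration; copying the file to object storage is left to whatever tool your environment provides.

```shell
#!/usr/bin/env bash
# Hypothetical defaults; override via the environment.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
KEEP="${KEEP:-7}"

# snapshot_path: a UTC-timestamped file name, so names sort chronologically.
snapshot_path() {
  echo "$BACKUP_DIR/etcd-$(date -u +%Y%m%dT%H%M%SZ).db"
}

# take_snapshot FILE: the same etcdctl invocation shown above.
take_snapshot() {
  ETCDCTL_API=3 etcdctl snapshot save "$1" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}

# prune_snapshots DIR KEEP: delete all but the newest KEEP snapshots,
# relying on the sortable timestamp in each file name.
prune_snapshots() {
  ls -1 "$1"/etcd-*.db 2>/dev/null | sort -r | tail -n +"$(( $2 + 1 ))" \
    | while IFS= read -r old; do
        rm -f -- "$old"
      done
}

# Only act when invoked with the argument "run" (as root), so the
# functions can also be sourced and exercised without touching etcd.
if [ "${1:-}" = "run" ]; then
  mkdir -p "$BACKUP_DIR"
  file=$(snapshot_path)
  # Prune only after a successful snapshot, so a failing backup job
  # never deletes the last good copies.
  take_snapshot "$file" && prune_snapshots "$BACKUP_DIR" "$KEEP"
fi
```

Pruning only after a successful save is the important design choice here: if etcd is unreachable for a week, the old snapshots stay on disk rather than being rotated away.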

Restore is the high-stakes operation. You restore the snapshot into a new data directory and point etcd at it. The exact safe sequence on a single-member (or one-at-a-time) restore:

# 1. STOP the control plane so nothing writes to etcd while you restore.
#    Move the static Pod manifests into a dedicated holding directory; the
#    kubelet stops the Pods. (Any directory outside manifests/ works. Avoid
#    bare /tmp: unrelated *.yaml files there would be swept back in at step 3.)
sudo mkdir -p /etc/kubernetes/manifests-stopped
sudo sh -c 'mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests-stopped/'

# 2. Restore the snapshot into a fresh data directory.
sudo ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored

# 3. Point etcd's static Pod manifest at the new data dir
#    (edit the hostPath volume for etcd-data to /var/lib/etcd-restored),
#    then move the manifests back so the kubelet restarts the control plane.
sudo vi /etc/kubernetes/manifests-stopped/etcd.yaml   # change the data dir volume path
sudo sh -c 'mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/'

# 4. Verify (give the control plane a minute to come back)
kubectl get nodes
kubectl get pods -A
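The manual edit in step 3 is the easiest place to slip under pressure, so it can help to script it. The sketch below assumes the default kubeadm etcd manifest, in which the etcd-data hostPath volume appears as a line reading path: /var/lib/etcd; the function name repoint_etcd_data is a hypothetical helper. It rewrites only that hostPath line and leaves the container's --data-dir flag and mountPath alone, since the path inside the container does not change.

```shell
# repoint_etcd_data MANIFEST NEW_DIR
# Rewrite the etcd-data hostPath in an etcd static Pod manifest.
# Assumes the kubeadm default layout ("path: /var/lib/etcd" on its own line).
repoint_etcd_data() {
  manifest="$1"
  new_dir="$2"
  tmp=$(mktemp)
  # Match only a line that is exactly "<indent>path: /var/lib/etcd", so the
  # PKI hostPath (/etc/kubernetes/pki/etcd) and the flags are untouched.
  sed "s|^\( *path: \)/var/lib/etcd\$|\1$new_dir|" "$manifest" > "$tmp" \
    && cat "$tmp" > "$manifest"     # overwrite in place, keeping file ownership and mode
  rm -f "$tmp"
}
```

A typical call, run as root while the manifest is parked outside /etc/kubernetes/manifests/, would be repoint_etcd_data <path-to-parked-etcd.yaml> /var/lib/etcd-restored, followed by a quick grep for etcd-restored to confirm exactly one line changed.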

The non-obvious rules that trip people up in exams and incidents:

Certificate management and rotation

The cluster is held together by TLS. kubeadm generates a cluster CA and issues short-lived (one-year) certificates to the API server, controller-manager, scheduler, etcd, and the kubelets. Certificates that silently expire are one of the most common causes of a “the cluster died overnight” incident — the API server stops trusting its own components and everything grinds to a halt.

Check expiry at any time:

sudo kubeadm certs check-expiration

This prints every certificate, its expiry date, and whether the CA that signed it is externally managed. The good news: kubeadm auto-renews all control-plane certificates on every kubeadm upgrade. A cluster you upgrade at least yearly effectively never sees an expired cert. The trap is the cluster that runs untouched for over a year — those expire and you must renew manually:

# Renew all kubeadm-managed certs (then restart the control-plane static Pods)
sudo kubeadm certs renew all

# Or one at a time, e.g. just the apiserver
sudo kubeadm certs renew apiserver

After renewing you must restart the static Pods so they pick up the new certs (move the manifests out of /etc/kubernetes/manifests/, wait for the kubelet to stop the Pods, and move them back; or reboot the node), and refresh your copy of admin.conf if that was renewed.
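Because silent expiry is the failure mode, an independent check that can feed an alert is a useful complement to kubeadm certs check-expiration. The sketch below uses openssl x509 -checkend, which tests whether a certificate expires within a given number of seconds. The function names and the 30-day threshold in the usage note are assumptions for illustration.

```shell
# cert_expires_within CERT DAYS
# Succeeds (exit 0) if CERT expires within DAYS days. An unreadable or
# invalid file also counts as "expiring", which is the safe default for alerting.
cert_expires_within() {
  ! openssl x509 -checkend "$(( $2 * 86400 ))" -noout -in "$1" >/dev/null 2>&1
}

# report_expiring DIR DAYS
# Print every *.crt under DIR that expires within DAYS days, with its end date.
report_expiring() {
  find "$1" -name '*.crt' | sort | while IFS= read -r crt; do
    if cert_expires_within "$crt" "$2"; then
      echo "EXPIRING: $crt ($(openssl x509 -enddate -noout -in "$crt" 2>/dev/null))"
    fi
  done
}
```

On a control-plane node, sudo-running report_expiring /etc/kubernetes/pki 30 from a daily timer would list any certificate inside the last 30 days of its life; empty output means nothing is close to expiry.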

The kubelet has its own certificates and is the one place automatic rotation is the norm:

The one certificate kubeadm cannot auto-rotate is the cluster root CA itself (default ten-year life) — rotating a CA is a deliberate, disruptive operation because every component must re-trust the new CA. Plan it; do not let it surprise you a decade in.

Node lifecycle: cordon, drain, taints

Operating a cluster means routinely taking nodes out of service — to patch the OS, upgrade the kubelet, replace hardware, or decommission. Do it wrong and you cause an outage; do it right and users never notice. The vocabulary:

| Operation | What it does | When |
| --- | --- | --- |
| kubectl cordon <node> | Marks the node unschedulable: no new Pods land, existing Pods stay | First step before maintenance; “stop sending me work” |
| kubectl drain <node> | Cordons and evicts existing Pods (respecting PodDisruptionBudgets) so they reschedule elsewhere | Before reboot/upgrade/decommission |
| kubectl uncordon <node> | Marks the node schedulable again | After maintenance, to return it to the pool |
| Taint | A property on a node that repels Pods unless they carry a matching toleration | Reserve nodes (GPU, control plane); auto-applied on problems |

The safe maintenance sequence is always cordon → drain → do the work → uncordon:

# Take node out of service safely (skip DaemonSets, tolerate emptyDir if you must)
kubectl drain node-3 --ignore-daemonsets --delete-emptydir-data

# ... patch / upgrade / reboot node-3 ...

kubectl uncordon node-3      # put it back in the pool

drain honours PodDisruptionBudgets — if evicting a Pod would breach a PDB, the drain blocks and waits, which is exactly the protection you want (it is why the Day-2 lesson insisted on PDBs). --ignore-daemonsets is almost always required because DaemonSet Pods are managed per-node and are not drained.
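To see the blocking behaviour for yourself, you need a PodDisruptionBudget in place before draining. The sketch below prints a minimal policy/v1 manifest for a hypothetical Deployment whose Pods carry the label app=web and which runs four replicas; the name web-pdb and the minAvailable value are assumptions for illustration.

```shell
# web_pdb_manifest: print a PodDisruptionBudget for Pods labelled app=web.
# Hypothetical example: with 4 replicas, minAvailable 3 lets a drain evict
# only one Pod at a time and makes it wait until a replacement is Ready.
web_pdb_manifest() {
  cat <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: web
EOF
}
```

Apply it with web_pdb_manifest | kubectl apply -f - and then drain a node that hosts two of the Pods: the second eviction should be refused until the first evicted Pod is running again elsewhere.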

Taints and tolerations are the complementary mechanism: a taint repels, a toleration grants permission to ignore that repulsion. Control-plane nodes carry node-role.kubernetes.io/control-plane:NoSchedule so ordinary workloads stay off them. Taints have three effects: NoSchedule (don’t place new Pods), PreferNoSchedule (avoid if possible), and NoExecute (also evict running Pods that don’t tolerate it — this is how Kubernetes evacuates a NotReady node automatically).

kubectl taint nodes gpu-1 dedicated=gpu:NoSchedule     # add
kubectl taint nodes gpu-1 dedicated=gpu:NoSchedule-    # remove (trailing minus)
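The other half of the pair is the toleration on the Pod. The sketch below prints a Pod manifest that tolerates the dedicated=gpu:NoSchedule taint from the commands above; the Pod name gpu-job and the nginx image are placeholders chosen for illustration.

```shell
# gpu_pod_manifest: print a Pod that tolerates the dedicated=gpu:NoSchedule taint.
# The Pod name and image are hypothetical placeholders.
gpu_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: main
      image: nginx
EOF
}
```

Apply it with gpu_pod_manifest | kubectl apply -f -. Note that a toleration only permits scheduling onto the tainted node; it does not attract the Pod there. To pin a workload to the GPU nodes you would normally pair the toleration with a nodeSelector or node affinity.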

To decommission a node for good: drain it, then kubectl delete node <name> to remove it from the cluster’s records, then on the node itself kubeadm reset to clean up its kubeadm state.

The version-skew policy and a staged upgrade

Upgrading a live cluster is the operation that most rewards understanding and most punishes guesswork. Two rules govern it: the version-skew policy (how far components may differ) and the upgrade order (control plane before nodes, one minor version at a time).

The version-skew policy for the components you operate:

| Relationship | Allowed skew (Kubernetes ≥ 1.28) | Plain English |
| --- | --- | --- |
| kubelet vs kube-apiserver | kubelet may be up to 3 minor versions older, never newer | Nodes may lag the control plane, but never lead it |
| kube-controller-manager / scheduler vs apiserver | Same minor, or 1 older | Control-plane peers stay close to the apiserver |
| kubectl vs apiserver | Within 1 minor (newer or older) | Your client should be near the server version |
| Across the upgrade | One minor version at a time (1.29 → 1.30, not 1.29 → 1.31) | No skipping minors |

The two load-bearing rules: the control plane is always upgraded first (the apiserver must never be older than the things talking to it), and you never skip a minor version — to go from 1.28 to 1.30 you upgrade to 1.29 first, then 1.30. Patch versions (1.30.1 → 1.30.2) are freely applied.
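These rules reduce to comparisons on the minor version number, which makes them easy to encode as a pre-flight check. The sketch below is illustrative only: the helper names are hypothetical, it compares minor versions and ignores patch levels, and it assumes version strings of the form v1.MINOR.PATCH (kubeadm upgrade plan remains the authoritative check).

```shell
# minor_of VERSION: extract the minor number from vMAJOR.MINOR[.PATCH].
minor_of() {
  v="${1#v}"        # drop a leading "v"
  v="${v#*.}"       # drop the major version and first dot
  echo "${v%%.*}"   # drop the patch version, if present
}

# kubelet_skew_ok APISERVER_VERSION KUBELET_VERSION
# True if the kubelet is 0 to 3 minors older than the apiserver, never newer.
kubelet_skew_ok() {
  diff=$(( $(minor_of "$1") - $(minor_of "$2") ))
  [ "$diff" -ge 0 ] && [ "$diff" -le 3 ]
}

# upgrade_step_ok FROM_VERSION TO_VERSION
# True if the target is the same minor (a patch change) or exactly one minor ahead.
upgrade_step_ok() {
  diff=$(( $(minor_of "$2") - $(minor_of "$1") ))
  [ "$diff" -ge 0 ] && [ "$diff" -le 1 ]
}
```

For example, upgrade_step_ok v1.28.9 v1.30.2 fails because it skips 1.29, while kubelet_skew_ok v1.30.2 v1.27.4 succeeds because a kubelet three minors behind the apiserver is still within policy.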

A safe staged upgrade with kubeadm, control plane first, then workers one node at a time:

# ── FIRST control-plane node ──
# 1. Upgrade the kubeadm binary to the target version (via your package manager), then:
sudo kubeadm upgrade plan                 # shows current vs available, checks skew
sudo kubeadm upgrade apply v1.30.2        # upgrades the control plane components + renews certs

# 2. Drain THIS node, then upgrade the kubelet + kubectl binaries, then:
kubectl drain cp-1 --ignore-daemonsets
sudo systemctl daemon-reload && sudo systemctl restart kubelet
kubectl uncordon cp-1

# ── OTHER control-plane nodes ──
sudo kubeadm upgrade node                 # NOT "apply" — "node" on the rest
# then drain → upgrade kubelet → restart → uncordon, as above

# ── WORKER nodes, one at a time ──
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data
#   (on the node) upgrade kubeadm + kubelet + kubectl packages
sudo kubeadm upgrade node
sudo systemctl daemon-reload && sudo systemctl restart kubelet
kubectl uncordon worker-1

The pattern to internalise: kubeadm upgrade apply runs exactly once, on the first control-plane node (it upgrades the cluster’s control-plane components and renews certificates). Every other node — control-plane or worker — uses kubeadm upgrade node. And every node is drained before and uncordoned after its kubelet is restarted, so the rolling upgrade never takes capacity you need. Because the version-skew policy lets kubelets lag the apiserver by up to three minors, draining workers one at a time through a single-minor bump is always safe.
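One way to internalise the ordering is to generate the plan before touching anything. The sketch below is a dry run only: it prints an abbreviated step list and executes nothing. The function names and node names are hypothetical, and the per-node steps are condensed from the sequence above.

```shell
# node_steps NODE KUBEADM_COMMAND: print the abbreviated per-node sequence.
node_steps() {
  echo "$1: $2"
  echo "$1: kubectl drain $1 --ignore-daemonsets"
  echo "$1: upgrade kubelet + kubectl packages, restart kubelet"
  echo "$1: kubectl uncordon $1"
}

# upgrade_plan VERSION FIRST_CP "OTHER_CPS" "WORKERS"
# "apply" appears exactly once, for the first control-plane node; every
# other node, control-plane or worker, gets "kubeadm upgrade node".
upgrade_plan() {
  node_steps "$2" "kubeadm upgrade apply $1"
  for node in $3 $4; do
    node_steps "$node" "kubeadm upgrade node"
  done
}

# Example: three control-plane nodes, two workers (hypothetical names).
upgrade_plan v1.30.2 cp-1 "cp-2 cp-3" "worker-1 worker-2"
```

Reviewing the printed plan with a colleague before the maintenance window is a cheap way to catch the classic mistake of running kubeadm upgrade apply on more than one node.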

Before any upgrade: take an etcd snapshot. It is your rollback if kubeadm upgrade apply goes wrong.

Managed vs self-managed: choosing well

Everything above is your job only if you self-manage. The honest architect’s question is whether you should.

| Concern | Self-managed (kubeadm/CAPI/kubespray) | Managed (EKS / AKS / GKE) |
| --- | --- | --- |
| Control-plane HA & etcd | You build and operate it | Provider runs it, multi-AZ, with an SLA |
| etcd backup/restore | Your CronJob, your restore drill | Automatic, provider-managed |
| Upgrades | You stage them (this lesson) | One click / one API call; you still upgrade nodes |
| Cert rotation | Your responsibility | Handled by the provider |
| Cost | No control-plane fee (but your machines + your time) | A control-plane fee, but far less ops labour |
| Control & portability | Total: any infra, any version, air-gapped | Constrained to the provider’s offering |
| Best for | On-prem, edge, air-gapped, deep customisation, learning | Almost every cloud workload; let the provider carry the toil |

The default recommendation for a cloud workload is managed: the control plane and etcd are precisely the parts that are hard to run well and have nothing to do with your application’s value. Self-manage when you have a reason the managed offering cannot meet — on-prem or edge with no managed option, an air-gapped environment, a regulatory or customisation requirement, or simply to learn the machinery so you can operate the managed service intelligently. Even then, prefer a higher-level tool (Cluster API, kubespray, Rancher) over raw kubeadm for fleets — but everything those tools do, they do on top of the kubeadm concepts you have just learned.

Figure: Provisioning an HA Kubernetes cluster

The diagram traces the whole picture: a load balancer fronting three control-plane nodes, the stacked-versus-external etcd split, the certificate authority issuing component certs, the etcd snapshot flowing out to backup storage, and the staged upgrade order arrowing from control plane down to workers.

Hands-on lab

You will not hand-build a multi-node HA cluster in this lab — that needs several machines or VMs. Instead you will create a real multi-node cluster locally with kind (which runs each node as a container), then practise the operational skills that transfer directly to a kubeadm cluster: inspecting the control plane, taking an etcd snapshot, checking certificate expiry, and running the node lifecycle (cordon/drain/uncordon). Everything here is free and runs on your laptop.

Prerequisites: Docker (or Podman), kind, kubectl, and etcdctl installed. Install etcdctl from the etcd release if you do not have it.

Step 1 — Create a multi-node cluster (1 control plane + 2 workers). kind uses kubeadm under the hood, so the artefacts are the real thing.

cat <<'EOF' > kind-ha.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF

kind create cluster --name provlab --config kind-ha.yaml
kubectl get nodes -o wide

Expected: three nodes — one control-plane and two worker, all Ready within a minute or two.

Step 2 — See the control plane as static Pods. These are exactly what kubeadm wrote to /etc/kubernetes/manifests/.

kubectl get pods -n kube-system -o wide | grep -E 'etcd|apiserver|scheduler|controller'

Expected: etcd-provlab-control-plane, kube-apiserver-..., kube-scheduler-..., kube-controller-manager-... all Running. Note how etcd is stacked on the control-plane node.

Step 3 — Take an etcd snapshot from inside the control-plane container. This is the production backup command, run against the real etcd.

docker exec provlab-control-plane sh -c '\
  ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key'

# Verify the snapshot and count the keys it holds
docker exec provlab-control-plane sh -c '\
  ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-backup.db --write-out=table'

Expected: Snapshot saved at /tmp/etcd-backup.db, then a table showing a non-zero TOTAL KEYS — proof you captured live cluster state. In production you would copy this file off the node to object storage.
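If the command fails because etcdctl is not present in the kind node image (worth checking with docker exec provlab-control-plane which etcdctl), one fallback is to run the etcdctl binary that ships inside the etcd static Pod. The sketch below assumes the Pod name from Step 2 and writes the snapshot under /var/lib/etcd, which is a hostPath volume and therefore also visible on the node container at the same path. The function name and the KUBECTL override variable are hypothetical conveniences.

```shell
# Allow the kubectl binary to be overridden (hypothetical convenience variable).
KUBECTL="${KUBECTL:-kubectl}"

# etcd_pod_snapshot POD TARGET
# Run "etcdctl snapshot save" inside the etcd static Pod. TARGET should live
# under /var/lib/etcd so the file lands on the node through the hostPath volume.
etcd_pod_snapshot() {
  pod="$1"
  target="$2"
  $KUBECTL -n kube-system exec "$pod" -- etcdctl snapshot save "$target" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}
```

For this lab the call would be etcd_pod_snapshot etcd-provlab-control-plane /var/lib/etcd/etcd-backup.db, after which the file can be copied from the node container with docker cp provlab-control-plane:/var/lib/etcd/etcd-backup.db . for safekeeping.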

Step 4 — Check certificate expiry. Run kubeadm’s own check inside the node.

docker exec provlab-control-plane kubeadm certs check-expiration

Expected: a table of certificates (apiserver, etcd, front-proxy, etc.) with expiry roughly one year out and CERTIFICATE AUTHORITY entries about ten years out.

Step 5 — Practise the node lifecycle. Cordon and drain a worker, watch a workload move, then return the node.

kubectl create deployment web --image=nginx --replicas=4
kubectl rollout status deployment/web

kubectl cordon provlab-worker          # no new Pods will land here
kubectl get nodes                      # note SchedulingDisabled on that node

kubectl drain provlab-worker --ignore-daemonsets --delete-emptydir-data
kubectl get pods -o wide               # web Pods have moved off the drained node

kubectl uncordon provlab-worker        # back in the pool
kubectl get nodes

Expected: after drain, no web Pods remain on provlab-worker; after uncordon it is schedulable again.

Validation. You should have: a 3-node cluster, a verified etcd snapshot with a non-zero key count, a certificate-expiry report, and a worker you successfully drained and returned to service — the four operational skills of running a self-managed cluster.

Cleanup.

kind delete cluster --name provlab
rm -f kind-ha.yaml

Cost note. Zero — kind runs entirely in local containers. The identical commands (etcdctl snapshot save, kubeadm certs check-expiration, kubectl drain) are exactly what you run on a real kubeadm cluster; only the host changes.

Common mistakes & troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Cannot add a second control-plane node | kubeadm init was run without --control-plane-endpoint; certs point at one node | Rebuild with a load-balancer endpoint, or follow the (painful) cert-regeneration procedure |
| New control-plane join fails with a cert error | The --upload-certs Secret expired (2-hour TTL) | Re-run sudo kubeadm init phase upload-certs --upload-certs to get a fresh --certificate-key |
| Cluster “froze” overnight; apiserver logs show TLS errors | Certificates expired (cluster untouched > 1 year) | kubeadm certs renew all, restart static Pods, refresh admin.conf |
| etcd cluster unavailable after losing 2 of 3 members | Quorum lost; only a minority remain | Restore from snapshot; you cannot write until a majority is restored, which is why you snapshot |
| kubectl drain hangs and never completes | A PodDisruptionBudget would be violated, or a Pod has no controller | Check kubectl get pdb; add replicas, or use --force for unmanaged Pods (knowingly) |
| Restore “succeeded” but cluster shows old/empty state | Restored into a dir the etcd Pod isn’t using, or didn’t stop the control plane first | Stop control plane, restore to a new dir, repoint the etcd static Pod’s data volume, restart |
| kubeadm upgrade apply refuses to run | Trying to skip a minor version, or kubeadm binary not upgraded first | Upgrade one minor at a time; upgrade the kubeadm package before apply |
| metrics-server / kubectl logs fail with cert errors | kubelet serving CSRs unapproved (serverTLSBootstrap) | kubectl get csr; approve with kubectl certificate approve <csr> (or run an approver) |

Best practices

Security notes

Interview & exam questions

  1. Why must you run an odd number of etcd members, and why is three the common choice? etcd needs a majority (quorum) to accept writes; fault tolerance is (n-1)/2. Three tolerates one failure; two tolerates zero (worse than one); four is no better than three. Odd numbers maximise resilience per node, and three is the cost/resilience sweet spot.

  2. Stacked vs external etcd — what’s the trade-off? Stacked co-locates etcd on the control-plane nodes (simpler, cheaper, kubeadm’s default) but couples a node loss to losing both an apiserver and an etcd member, and etcd shares disk I/O. External runs etcd on dedicated machines (decoupled blast radius, isolated fast disks, independent scaling) at the cost of a second cluster to provision, secure, back up, and upgrade.

  3. Walk me through restoring a cluster from an etcd snapshot. Stop the control plane (move static Pod manifests aside), etcdctl snapshot restore into a new data dir, repoint the etcd static Pod’s data volume at that dir, move the manifests back so the kubelet restarts the control plane, then verify kubectl get nodes/pods. The whole cluster rolls back to the snapshot’s moment.

  4. What does --control-plane-endpoint do and why is it critical for HA? It bakes the load balancer’s stable address into every certificate and kubeconfig. Without it, the certs point at a single node and you can never add another control-plane node without regenerating the PKI.

  5. A worker node needs an OS patch with zero disruption — what’s your sequence? kubectl cordon (stop new Pods) → kubectl drain --ignore-daemonsets (evict, respecting PDBs) → patch/reboot → kubectl uncordon. The drain blocks if it would violate a PodDisruptionBudget.

  6. Explain the version-skew policy for kubelet vs apiserver. The kubelet may be up to three minor versions older than the apiserver but never newer. Nodes may lag the control plane; they must never lead it.

  7. What is the correct order to upgrade a cluster, and which kubeadm command runs where? Control plane first, then workers, one minor at a time. kubeadm upgrade apply runs once on the first control-plane node; every other node (control-plane or worker) uses kubeadm upgrade node. Drain each node before restarting its kubelet.

  8. What is a static Pod and why does the control plane use them? A Pod defined by a manifest in /etc/kubernetes/manifests/ that the kubelet runs directly, with no apiserver or scheduler. It solves the bootstrap chicken-and-egg: the kubelet can start the API server before the API server exists.

  9. How does kubeadm handle certificate rotation, and what’s the gotcha? It auto-renews all control-plane certs on every kubeadm upgrade. The gotcha is a cluster left untouched over a year — its one-year certs expire and you must kubeadm certs renew all manually and restart the static Pods. The kubelet rotates its own client cert automatically; serving-cert CSRs need approval.

  10. What happens if etcd loses quorum, and how do you recover? Writes stop entirely (the store goes read-only / unavailable) until a majority is restored. If you cannot bring members back, you restore from a snapshot — which is precisely why scheduled, tested etcd backups are non-negotiable.

  11. Where do Kubernetes Secrets actually live, and how do you protect them? As keys in etcd, base64-encoded but not encrypted by default. Enable encryption at rest via an EncryptionConfiguration, lock down etcd’s ports to control-plane nodes, and treat etcd snapshots as top-secret copies of every credential.

  12. When would you choose self-managed over managed Kubernetes? When the managed offering cannot meet a requirement: on-prem/edge/air-gapped environments, strict customisation or regulatory needs, or to learn the machinery. For ordinary cloud workloads, managed is the default — the control plane and etcd are exactly the toil with no application value.

Quick check

  1. How many etcd-member failures does a five-member cluster tolerate?
  2. Which single kubeadm command runs only on the first control-plane node during an upgrade?
  3. What is the difference between cordon and drain?
  4. By how many minor versions may a kubelet trail the API server?
  5. What is the first action you take before starting any cluster upgrade?

Answers

  1. Two. Quorum of five is three, so it survives losing two.
  2. kubeadm upgrade apply (every other node uses kubeadm upgrade node).
  3. cordon only marks the node unschedulable (existing Pods stay); drain also evicts the existing Pods (respecting PDBs) so they reschedule elsewhere.
  4. Three minor versions (older only — never newer than the apiserver).
  5. Take (and verify) an etcd snapshot so you have a rollback.

Exercise

On the local kind cluster from the lab, simulate a backup-and-recovery drill and a maintenance window:

  1. Deploy a workload that writes a recognisable object (e.g. kubectl create configmap before-snap --from-literal=marker=v1).
  2. Take an etcd snapshot (Step 3 of the lab) and copy it out of the container to your host with docker cp.
  3. Create a second ConfigMap after-snap so you can prove what a restore would and would not contain.
  4. Run kubeadm certs check-expiration and note which certificates expire soonest and which CAs last longest.
  5. Drain one worker, confirm your workload’s Pods relocated, then uncordon it.

Write a short paragraph answering: if you restored the snapshot from step 2, which of the two ConfigMaps would survive, and why? What does that tell you about the difference between a backup and an undo? Then state, for your own context, whether you would self-manage or use managed Kubernetes — and the single most important reason.

Certification mapping

This lesson maps to the CKA (Certified Kubernetes Administrator) “Cluster Architecture, Installation & Configuration” domain, one of the most heavily weighted and most hands-on parts of the exam:

CKA is a performance exam: do the lab and exercise until etcdctl snapshot save, kubectl drain, and the upgrade sequence are reflex, and use kubeadm --help and the official docs (allowed in the exam) for exact flags under time pressure.

Glossary

Next steps

You can now build, recover, and upgrade a cluster. Next, zoom out from a single cluster to the full range of designs in The Kubernetes Architecting Ladder: From a Single Cluster to Multi-Region Mission-Critical, which uses requirements (RTO/RPO, scale) to drive the choice between one cluster and many. To go deeper on the specific failure modes hinted at here — etcd quorum loss, apiserver outages, certificate expiry — see Advanced Kubernetes Troubleshooting: Control-Plane, etcd & Complex Incident RCA. And to apply the multi-AZ, autoscaling, GitOps shape of a real production cluster on a managed platform, revisit Production AKS: Networking & Observability.
