Docker is a client and a daemon that wraps containerd. On Kubernetes, dockershim is gone (it was removed in 1.24), and on most clusters containerd is the actual runtime under every node. So when a node misbehaves (a stuck snapshot, an image that will not unpack, a pod that needs a hardware-isolated sandbox), you are debugging containerd whether you meant to or not. This guide operates containerd on its own terms: namespaces and snapshotters, layer encryption with ocicrypt, and per-workload sandboxing with gVisor and Kata wired through RuntimeClass. Everything assumes containerd 1.7+ on a Linux node you control.
1. The containerd object model
containerd is a daemon (containerd) exposing a gRPC API over /run/containerd/containerd.sock, organized into a small set of subsystems you will touch constantly.
| Subsystem | Responsibility | Where it lives |
|---|---|---|
| Namespaces | Hard tenancy boundary for all metadata — images, containers, snapshots | metadata DB (/var/lib/containerd/io.containerd.metadata.v1.bolt) |
| Content store | Immutable, content-addressed blobs (manifests, configs, compressed layers) | /var/lib/containerd/io.containerd.content.v1.content |
| Snapshotter | Builds the writable root filesystem from layers (overlayfs, native, stargz) | /var/lib/containerd/io.containerd.snapshotter.v1.<name> |
| Runtime (shim) | One containerd-shim-runc-v2 per container; calls the OCI runtime | per-container shim process |
| CRI plugin | Implements the Kubernetes Container Runtime Interface | in-process gRPC plugin |
The detail that surprises people: namespaces are not Linux namespaces. A containerd namespace is a metadata partition. The Docker CLI uses moby; nerdctl defaults to default; Kubernetes via CRI uses k8s.io. An image pulled in one namespace is invisible in another, even though both share the same content store on disk: blobs are deduplicated by digest, but image references are recorded per namespace. This single fact explains most “I pulled it but the pod can’t find it” tickets.
# List namespaces, then inspect what k8s.io actually holds
ctr namespace ls
ctr -n k8s.io images ls | head
ctr -n k8s.io containers ls
ctr is the low-level debug client shipped with containerd. It is intentionally unfriendly — no build, no compose, no logs. For day-to-day work use nerdctl, which is a Docker-compatible CLI that speaks directly to containerd and supports namespaces, BuildKit, encryption, and lazy pulling.
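To make the namespace partition concrete, here is a minimal sketch of a helper that reports which containerd namespaces hold a reference to a given image. The function name find_image_namespaces is an illustration, not part of containerd; it relies only on ctr namespace ls -q and ctr images ls -q, and assumes it runs as a user that can reach the containerd socket.

```shell
# find_image_namespaces REF
# Print every containerd namespace whose image list contains REF exactly.
# (Hypothetical helper name; uses only `ctr namespace ls -q` and `ctr images ls -q`.)
find_image_namespaces() {
  local ref="$1" ns
  for ns in $(ctr namespace ls -q); do
    # -F: fixed string, -x: whole-line match, -q: exit status only
    if ctr -n "$ns" images ls -q | grep -Fxq -- "$ref"; then
      printf '%s\n' "$ns"
    fi
  done
}

# Usage (as root):
#   find_image_namespaces docker.io/library/nginx:1.27
```

If the output lists default but not k8s.io, the image was pulled with nerdctl or ctr outside the namespace the kubelet looks in, which is the usual shape of the ticket described above.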
2. Driving containerd with nerdctl and ctr
Install nerdctl (the full bundle pulls in containerd, CNI, BuildKit, and RootlessKit) and confirm it talks to the daemon.
nerdctl --address /run/containerd/containerd.sock version
nerdctl info | grep -E 'Snapshotter|cgroup|Runtime'
The mental model maps cleanly onto Docker, with the namespace as an explicit flag:
# Pull, run, inspect — note -n selects the containerd namespace
nerdctl -n default pull --platform=linux/amd64 docker.io/library/nginx:1.27
nerdctl -n default run -d --name web -p 8080:80 docker.io/library/nginx:1.27
nerdctl -n default ps
nerdctl -n default inspect web --format '{{.State.Status}} {{.State.Pid}}'
When you need to see what containerd itself sees — bypassing CRI and nerdctl’s bookkeeping — drop to ctr. This is invaluable when a Kubernetes image is “present” to containerd but a pod still fails:
# What is physically in the content store for this image?
ctr -n k8s.io images ls | grep nginx
ctr -n k8s.io content ls | head
# List snapshots (key, parent, kind) to see what has actually been unpacked in this namespace
ctr -n k8s.io snapshots ls
Rule of thumb: use nerdctl to do things, use ctr to find out why containerd will not. Never mix the two for lifecycle (don’t ctr task kill a container nerdctl started); they keep separate labels.
3. Snapshotter choices: overlayfs vs stargz lazy pulling
The snapshotter assembles the container root filesystem. The default, overlayfs, is fine for steady-state but pays a tax you feel on cold start: containerd must download and fully decompress every layer before the container can start. On a 1.5 GB image where the process only reads 40 MB of files, that is mostly wasted I/O.
stargz (Seekable tar.gz, via the stargz-snapshotter) fixes this with lazy pulling: the image is stored in a seekable format, and files are fetched on demand over HTTP range requests as the container reads them. Cold start drops from “download the whole image” to “download the bytes you touch.”
Register the stargz snapshotter as a proxy plugin in /etc/containerd/config.toml:
version = 2
[proxy_plugins]
[proxy_plugins.stargz]
type = "snapshot"
address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"
[plugins."io.containerd.grpc.v1.cri".containerd]
# Make CRI (Kubernetes) use stargz, and keep passing image annotations
# (image ref, layer digests) to the snapshotter; stargz needs them to lazy-pull.
snapshotter = "stargz"
disable_snapshot_annotations = false
Then run the containerd-stargz-grpc daemon and pull an eStargz-formatted image with lazy semantics:
systemctl enable --now stargz-snapshotter
# --snapshotter=stargz selects it; eStargz images carry a TOC + landmarks
nerdctl pull --snapshotter=stargz ghcr.io/stargz-containers/python:3.12-esgz
nerdctl run --snapshotter=stargz --rm ghcr.io/stargz-containers/python:3.12-esgz python -c 'print("up")'
The trade-offs are real and you should state them to your team:
- stargz wins for large images with sparse read patterns (ML inference, JVM apps, anything with a fat base) and for autoscaling where cold-start latency dominates.
- stargz costs a format conversion step in CI (images must be repacked as eStargz), keeps a network dependency for the lifetime of the container (a layer you never read until hour three still fetches at hour three), and ties you to registries that honor range requests.
- overlayfs wins for small images, air-gapped nodes, and workloads where you’d rather pay once up front than carry a runtime registry dependency.
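One practical consequence of the conversion requirement: an image that was never repacked as eStargz still runs under the stargz snapshotter, but it is pulled in full, so you get no lazy-pull benefit. A minimal sketch of a check follows. It assumes you feed it the per-platform image manifest as JSON on stdin, and it looks for the eStargz TOC digest annotation on the layer descriptors. The function name is hypothetical.

```shell
# is_estargz_manifest
# Read an image manifest (JSON) on stdin; succeed if any layer descriptor
# carries the eStargz TOC digest annotation. (Hypothetical helper name.)
is_estargz_manifest() {
  grep -Fq '"containerd.io/snapshot/stargz/toc.digest"'
}

# Usage, assuming you already know the manifest digest (placeholder below):
#   ctr -n k8s.io content get sha256:<manifest-digest> | is_estargz_manifest \
#     && echo "lazy-pullable" || echo "plain image: will be pulled in full"
```

For a multi-platform image the top-level digest points at an index, not a manifest, so resolve the per-platform manifest first.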
4. Encrypting image layers with ocicrypt
Cosign signs an image so you can verify who built it; it does nothing to stop anyone with registry access from reading it. For images carrying proprietary models or embedded secrets, you want the layers themselves encrypted at rest in the registry. That is OCI image encryption (the ocicrypt spec), and nerdctl drives it directly.
Generate a recipient key pair. JWE (JSON Web Encryption, RSA-wrapped) is the documented, portable mode:
openssl genrsa -out mykey.pem 4096
openssl rsa -in mykey.pem -pubout -out mypubkey.pem
Encrypt an existing local image to a new tag. Encryption is per layer — you can encrypt all layers or only the ones that carry sensitive data, leaving the public base layers shared and cacheable:
# Encrypt every layer for both architectures, push-ready
nerdctl image encrypt \
--recipient=jwe:mypubkey.pem \
--platform=linux/amd64,linux/arm64 \
myapp:plain registry.example.com/myapp:encrypted
nerdctl push registry.example.com/myapp:encrypted
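Because encryption is per layer, it is worth confirming which layers actually ended up encrypted before you push. Encrypted layers carry a media type ending in +encrypted. The sketch below counts them in a manifest read from stdin; the function name is an illustration, and it assumes you pass the per-platform manifest rather than a multi-platform index.

```shell
# count_encrypted_layers
# Read an image manifest (JSON) on stdin and print how many descriptors
# have a mediaType ending in "+encrypted". (Hypothetical helper name.)
count_encrypted_layers() {
  grep -o '"mediaType"[[:space:]]*:[[:space:]]*"[^"]*[+]encrypted"' \
    | wc -l \
    | tr -d ' '
}

# Usage, with a placeholder manifest digest:
#   ctr -n default content get sha256:<manifest-digest> | count_encrypted_layers
```

A result of 0 on an image you believed was encrypted usually means you are looking at the plain tag, or at the index instead of the manifest.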
On the consuming node, decryption is transparent at run time as long as the private key is in containerd’s ocicrypt key directory. No flag, no wrapper:
# Root containerd looks here; rootless looks in ~/.config/containerd/ocicrypt/keys
sudo install -d -m 0700 /etc/containerd/ocicrypt/keys
sudo install -m 0600 mykey.pem /etc/containerd/ocicrypt/keys/myapp.pem
# Now a normal run decrypts on the fly using the key from the directory
nerdctl run --rm registry.example.com/myapp:encrypted /app/healthcheck
To inspect an encrypted image without unpacking, or to produce a decrypted copy offline:
# Pull the manifest/layers without unpacking the (still-encrypted) rootfs
nerdctl pull --unpack=false registry.example.com/myapp:encrypted
# Materialize a decrypted image locally with an explicit key
nerdctl image decrypt --key=mykey.pem registry.example.com/myapp:encrypted myapp:decrypted
The operational reality is key distribution, not the crypto. Shipping mykey.pem to every node by hand defeats the purpose. In production you wire ocicrypt to a key provider (a gRPC/exec plugin referenced by OCICRYPT_KEYPROVIDER_CONFIG) that fetches the decryption key from a KMS or Vault per pull, so the private key never lands on disk. Start with the file-based directory to prove the pipeline, then replace it with a keyprovider before you go wide.
5. Installing gVisor (runsc) and Kata as alternative runtimes
A container shares the host kernel. For genuinely untrusted code — customer-supplied images, CI of unknown PRs, multi-tenant SaaS — that shared kernel is the attack surface. Two sandboxes shrink it, with different mechanics:
- gVisor (runsc) runs an application kernel in userspace that intercepts syscalls in Go, so the workload almost never touches the host kernel directly. Low overhead, but some syscalls are unimplemented and raw I/O is slower.
- Kata Containers boots each pod inside a lightweight VM (via QEMU/Cloud Hypervisor) with its own real kernel. Stronger isolation (a true VM boundary), higher memory cost, and it needs nested virtualization or bare metal.
Both plug into containerd as runtime handlers under the CRI plugin. Install the binaries first.
# gVisor: installs runsc + the containerd shim (config.toml is wired separately)
curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg
echo "deb [arch=amd64,arm64 signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" \
| sudo tee /etc/apt/sources.list.d/gvisor.list >/dev/null
sudo apt-get update && sudo apt-get install -y runsc
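# Kata 2.x+: distro packages are often stale (kata-proxy and kata-shim are Kata 1.x
# components and do not provide containerd-shim-kata-v2). Use the static release
# tarball, which unpacks to /opt/kata, or the kata-deploy DaemonSet on Kubernetes.
# Assumes the release tarball was already downloaded as kata-static.tar.xz (placeholder name).
sudo tar -xf kata-static.tar.xz -C /
sudo ln -sf /opt/kata/bin/containerd-shim-kata-v2 /usr/local/bin/containerd-shim-kata-v2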
Now declare both handlers in /etc/containerd/config.toml. The handler name (runsc, kata) is the string Kubernetes will reference:
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
# Default, unsandboxed runtime
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
# gVisor handler -> containerd-shim-runsc-v1
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc.options]
TypeUrl = "io.containerd.runsc.v1.options"
ConfigPath = "/etc/containerd/runsc.toml"
# Kata handler -> containerd-shim-kata-v2
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata.options]
ConfigPath = "/etc/kata-containers/configuration.toml"
Restart and smoke-test each handler with ctr, which can target a runtime directly without Kubernetes in the loop:
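sudo systemctl restart containerd
# ctr run does not pull implicitly, so fetch the test image first
sudo ctr image pull docker.io/library/alpine:3.20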
# Boot a container under gVisor and prove the kernel is the sandbox kernel
sudo ctr run --rm --runtime io.containerd.runsc.v1 \
docker.io/library/alpine:3.20 gv uname -a
# Under Kata, /proc/version reports the guest VM kernel, not the host
sudo ctr run --rm --runtime io.containerd.kata.v2 \
docker.io/library/alpine:3.20 kt cat /proc/version
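The same smoke test can be turned into a pass/fail check. The sketch below compares the host's uname -r with the value reported inside a container started under a given runtime; if they match, the workload is almost certainly still on the host kernel. The function name is hypothetical, the container ID is derived from the shell PID, and the image is assumed to be present in the current namespace already.

```shell
# sandbox_kernel_differs RUNTIME [IMAGE]
# Succeed if a container started with RUNTIME reports a kernel release
# different from the host's. Returns 2 if the container could not be run.
# (Hypothetical helper; run as root so ctr can reach the containerd socket.)
sandbox_kernel_differs() {
  local runtime="$1" image="${2:-docker.io/library/alpine:3.20}" host guest
  host=$(uname -r)
  guest=$(ctr run --rm --runtime "$runtime" "$image" "kcheck-$$" uname -r) || return 2
  [ "$guest" != "$host" ]
}

# Usage:
#   sandbox_kernel_differs io.containerd.runsc.v1 && echo "gVisor is sandboxing"
#   sandbox_kernel_differs io.containerd.kata.v2  && echo "Kata guest kernel in use"
```

This is a heuristic, not proof of isolation: a Kata guest kernel could in principle share a release string with the host.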
6. Wiring RuntimeClass to schedule sandboxed pods
You do not want every pod paying the gVisor/Kata tax. RuntimeClass is the Kubernetes object that maps a friendly name to a containerd handler, so workloads opt in per pod. Create one class per handler:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc # must match the containerd runtimes.<name>
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
scheduling:
  # Only schedule kata pods onto nodes advertising VM support
  nodeSelector:
    katacontainers.io/kata-runtime: "true"
overhead:
  # Account for the VM's memory/CPU so the scheduler bin-packs correctly
  podFixed:
    memory: "160Mi"
    cpu: "250m"
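Before applying a class with a nodeSelector, it can help to confirm that at least one node actually carries the label; otherwise every pod that opts in will sit Pending. A minimal sketch, with a hypothetical function name and a hypothetical manifest filename in the usage line:

```shell
# nodes_with_label LABEL
# Print the nodes carrying LABEL (key=value form). Returns 1 if none match,
# 2 if kubectl itself failed. (Hypothetical helper name.)
nodes_with_label() {
  local label="$1" nodes
  nodes=$(kubectl get nodes -l "$label" -o name) || return 2
  [ -n "$nodes" ] || return 1
  printf '%s\n' "$nodes"
}

# Usage (runtimeclass.yaml is a placeholder filename):
#   nodes_with_label katacontainers.io/kata-runtime=true \
#     && kubectl apply -f runtimeclass.yaml
```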
A pod opts in with one field. Everything else is an ordinary pod spec, and the same field goes into the pod template of a Deployment:
apiVersion: v1
kind: Pod
metadata:
  name: untrusted-job
spec:
  runtimeClassName: gvisor # routes this pod to the runsc handler
  containers:
    - name: app
      image: registry.example.com/customer-code:latest
      resources:
        requests: { cpu: "200m", memory: "256Mi" }
Two production details that bite teams:
- scheduling.nodeSelector is mandatory in mixed clusters. If only some nodes have gVisor/Kata installed, the RuntimeClass must pin pods to them or the kubelet will fail the pod with RunContainerError: failed to create containerd task ... unknown runtime.
- overhead.podFixed matters for Kata specifically. The VM consumes memory the workload never sees; without declaring overhead, the scheduler over-packs the node and the OOM killer arrives at 3 a.m.
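The over-packing effect is easy to put numbers on. The sketch below does the memory arithmetic only, under illustrative assumptions: a node with 64 GiB allocatable and pods requesting 1 GiB each (both hypothetical figures), plus the 160Mi podFixed overhead from the RuntimeClass shown earlier. The function name is an illustration.

```shell
# pods_that_fit ALLOCATABLE_MIB REQUEST_MIB OVERHEAD_MIB
# Print how many pods fit on a node by memory alone, when each pod is
# accounted as REQUEST_MIB + OVERHEAD_MIB. (Hypothetical helper name.)
pods_that_fit() {
  local allocatable="$1" request="$2" overhead="$3"
  echo $(( allocatable / (request + overhead) ))
}

pods_that_fit 65536 1024 0     # overhead not declared -> 64
pods_that_fit 65536 1024 160   # overhead declared     -> 55
```

Under these assumptions, the nine-pod difference is roughly 1.4 GiB of guest VM memory that the scheduler would otherwise hand out a second time.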
7. Registry mirrors, hosts.toml, and pull-through caches
Hard-coding registry mirrors into config.toml means a daemon restart for every change and one giant unreadable block. The modern mechanism is the certs.d host directory: per-registry hosts.toml files that containerd reads live, no restart required.
Point containerd at the directory once (this config.toml change needs one restart; later edits under certs.d do not):
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
Then create one directory per upstream registry host. Here we front Docker Hub with a local pull-through cache and fall back to the real registry on a miss:
sudo install -d /etc/containerd/certs.d/docker.io
# /etc/containerd/certs.d/docker.io/hosts.toml
server = "https://registry-1.docker.io"
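[host."https://mirror.internal.example.com/v2/dockerhub"]
capabilities = ["pull", "resolve"]
# The host URL carries its own API path, so stop containerd from appending /v2 again
override_path = true
# skip_verify only for an internal mirror with a private CA you trust
skip_verify = false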
[host."https://registry-1.docker.io"]
capabilities = ["pull", "resolve"]
The semantics are precise and worth internalizing: server is the canonical upstream. Each [host.<url>] is tried in file order; containerd hits the mirror first, and on a miss or error falls through to the next host. capabilities gates what each host may serve: listing only ["pull", "resolve"] ensures the mirror is never used for pushes. When a host's API root is defined by its URL path instead of the standard /v2 prefix (a mirror that serves Docker Hub under a sub-path, for example), add override_path = true to the host entry. Because this lives under certs.d, edits take effect on the next pull with no systemctl restart containerd.
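If you manage more than a couple of registries, generating these files beats hand-editing them. The following is a minimal sketch of a generator that writes a mirror-first hosts.toml with the upstream as the fallback host. The function name and argument order are hypothetical, and it assumes the mirror serves the registry API at its root (no sub-path, so no override_path).

```shell
# write_hosts_toml CERTS_DIR REGISTRY SERVER_URL MIRROR_URL
# Write CERTS_DIR/REGISTRY/hosts.toml with the mirror listed first and the
# upstream server as the fallback host. (Hypothetical helper name.)
write_hosts_toml() {
  local certs_dir="$1" registry="$2" server="$3" mirror="$4"
  mkdir -p "$certs_dir/$registry" || return 1
  cat > "$certs_dir/$registry/hosts.toml" <<EOF
server = "$server"

[host."$mirror"]
  capabilities = ["pull", "resolve"]

[host."$server"]
  capabilities = ["pull", "resolve"]
EOF
}

# Usage (as root):
#   write_hosts_toml /etc/containerd/certs.d docker.io \
#     https://registry-1.docker.io https://mirror.internal.example.com
```

Host order in the file is the order containerd tries them, which is why the mirror block is emitted before the upstream block.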
Verify
Confirm each layer works before you trust it in production.
# 1) Namespaces and snapshotter are what you expect
ctr namespace ls
nerdctl info | grep -i snapshotter # -> stargz (or overlayfs)
# 2) Lazy pulling is active (stargz mounts show up, not full extractions)
ctr -n default snapshots --snapshotter stargz ls | head
# 3) Encryption round-trips: a fresh node with the key runs the image,
# and the same image WITHOUT the key fails to unpack
sudo mv /etc/containerd/ocicrypt/keys/myapp.pem /tmp/ # remove key
nerdctl rmi registry.example.com/myapp:encrypted # drop any already-unpacked local copy
nerdctl run --rm registry.example.com/myapp:encrypted true # expect: unpack error
sudo mv /tmp/myapp.pem /etc/containerd/ocicrypt/keys/ # restore
nerdctl run --rm registry.example.com/myapp:encrypted true # expect: success
# 4) Sandbox handlers are live and report a different kernel
sudo ctr run --rm --runtime io.containerd.runsc.v1 docker.io/library/alpine:3.20 g dmesg 2>&1 | head -1
sudo crictl info | grep -A3 -i runtimes # runsc + kata present to CRI
# 5) RuntimeClass routes correctly: the pod's sandbox uses the sandboxed runtime
kubectl get runtimeclass
kubectl run gv --image=alpine:3.20 --restart=Never \
--overrides='{"spec":{"runtimeClassName":"gvisor"}}' -- dmesg
kubectl logs gv | head -1 # gVisor banner, not the host kernel ring buffer
# 6) Mirror is actually serving pulls
sudo crictl pull docker.io/library/busybox:latest
sudo journalctl -u containerd --since "2 min ago" | grep -i mirror.internal
8. Debugging containerd with crictl, logs, and events
When a node is wedged, these are the four tools, in escalation order.
crictl is the CRI-level debugger — it sees exactly what the kubelet sees. Use it when kubectl shows a pod stuck and you need ground truth on the node:
# Pods, containers, and images AS CRI sees them (always k8s.io namespace)
sudo crictl pods
sudo crictl ps -a
sudo crictl images
# The single most useful command for "why won't this pull/start"
sudo crictl logs <container-id>
sudo crictl inspectp <pod-id> | jq '.status.metadata, .info.runtimeType'
The events stream is containerd’s real-time firehose. Tail it in one terminal while you reproduce a failure in another — you will see image pulls, snapshot prepares, and task exits as they happen:
ctr -n k8s.io events
# Filter the noise to just the lifecycle moments that matter
ctr -n k8s.io events | grep -E '/tasks/exit|/images/create|/snapshot/|/content/'
Daemon logs carry the snapshotter and shim errors that never surface to Kubernetes:
sudo journalctl -u containerd -f --no-pager
# Raise verbosity temporarily for a gnarly snapshot or pull bug
# (assumes [debug] level = "info" is set explicitly in config.toml)
sudo sed -i 's/level = "info"/level = "debug"/' /etc/containerd/config.toml
sudo systemctl restart containerd # remember to revert; debug is loud
The plugin sanity check catches the silent failure where a plugin failed to load (a bad stargz socket path, a missing CNI binary). A plugin in the error state explains a whole class of “containerd is running but nothing works”; skip is normal for plugins the host does not use, such as the zfs or btrfs snapshotters:
ctr plugins ls | grep -vE '[[:space:]](ok|skip)[[:space:]]*$' # header aside, anything left is your bug
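For alerting or a node health script, a filter that prints only the plugins in the error state is easier to consume than eyeballing the table. This sketch assumes the usual ctr plugins ls layout (a header row, then TYPE, ID, PLATFORMS, STATUS columns with STATUS last); the function name is hypothetical.

```shell
# failed_plugins
# Read `ctr plugins ls` output on stdin and print TYPE.ID for every plugin
# whose last column is "error". The header row is skipped.
# (Hypothetical helper name; "skip" is treated as healthy.)
failed_plugins() {
  awk 'NR > 1 && $NF == "error" { print $1 "." $2 }'
}

# Usage:
#   ctr plugins ls | failed_plugins
```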
The decision tree I give on-call: if kubectl is confused, drop to crictl. If crictl shows a clean pull but a dead container, tail ctr events. If events show a snapshot prepare that never completes, you have a snapshotter problem — check journalctl and ctr plugins ls. This walks down the stack one layer at a time, and it almost always lands on the real cause within three commands.
Enterprise scenario
A fintech platform team ran a multi-tenant CI service: customers pushed Git repos, the platform built and executed test suites from arbitrary, untrusted code on a shared EKS-on-bare-metal cluster. The constraint from their security org was absolute — no untrusted process may run on the host kernel — but a blanket switch to Kata for every workload was a non-starter, because their own trusted platform services (the controller, the queue workers) saw a 30 to 40 percent throughput drop and a memory tax under VM isolation they could not afford at their pod density.
The resolution was to make isolation a scheduling decision, not a cluster-wide default. They installed Kata via the kata-deploy DaemonSet, which labels capable nodes and registers the handler in containerd. They defined a single RuntimeClass whose scheduling.nodeSelector pinned sandboxed pods to a dedicated bare-metal node pool, and declared overhead.podFixed so the scheduler accounted for the guest VM and stopped over-packing those nodes. The controller then injected runtimeClassName: kata only into the ephemeral pods that executed customer code — every trusted platform component kept running under plain runc on the general node pool at full speed.
# Injected by the build controller onto customer-code execution pods only
apiVersion: v1
kind: Pod
metadata:
  generateName: ci-run-
  labels: { tenant-workload: "untrusted" }
spec:
  runtimeClassName: kata # VM-isolated; trusted pods omit this entirely
  automountServiceAccountToken: false
  containers:
    - name: runner
      image: ci-internal/runner:pinned
      resources:
        requests: { cpu: "1", memory: "1Gi" }
        limits: { cpu: "2", memory: "2Gi" }
The result: untrusted code never shared a kernel with the host or with other tenants, while 90 percent of the platform’s pods paid zero isolation overhead. The key architectural move was recognizing that “untrusted” is a property of a specific pod, and RuntimeClass is exactly the lever that lets you price isolation per workload instead of per cluster.