A reference you can keep open in a second tab. Grouped by tool, ordered roughly basic → advanced.
Docker — images & containers
# Build & tag
docker build -t myapp:1.0 .
docker build -t myapp:1.0 --build-arg ENV=prod --target runtime . # multi-stage target
docker buildx build --platform linux/amd64,linux/arm64 -t myapp:1.0 --push . # multi-arch
# Run
docker run -d --name web -p 8080:80 --restart unless-stopped myapp:1.0
docker run --rm -it --env-file .env myapp:1.0 sh # ephemeral debug shell
docker run -v "$(pwd)":/app -w /app node:20 npm test # bind-mount + workdir (quoted so paths with spaces work)
# Inspect & debug
docker ps -a # all containers
docker logs -f --tail 100 web # follow logs
docker exec -it web sh # shell into a running container
docker stats # live resource usage
docker inspect web | jq '.[0].NetworkSettings'
# Images & cleanup
docker images
docker image prune -a # remove all unused images, not just dangling ones
docker system prune -af --volumes # reclaim everything (careful)
docker history myapp:1.0 # see layer sizes (find bloat)
A good production Dockerfile (multi-stage, non-root, cached)
# ---- build stage ----
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
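# Install all deps (the build needs devDependencies); cached separately from source
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so only production modules reach the runtime stage
RUN npm prune --omit=dev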
# ---- runtime stage ----
FROM node:20-alpine AS runtime
ENV NODE_ENV=production
WORKDIR /app
RUN addgroup -S app && adduser -S app -G app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
# never run as root (Dockerfile comments must sit on their own line)
USER app
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
Dockerfile rules of thumb: order layers least- → most-frequently-changed; copy lock files before source; use multi-stage to keep build tools out of the runtime image; pin base image tags; run as non-root; add a HEALTHCHECK; keep a .dockerignore (node_modules, .git, dist).
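As a rough illustration of the .dockerignore rule, the sketch below writes a starter file. The helper name write_dockerignore is hypothetical, and the .env and *.log entries are assumptions beyond the three paths named above; trim the list to fit your repo.

```shell
# write_dockerignore DIR: hypothetical helper that drops a starter
# .dockerignore into DIR (defaults to the current directory).
write_dockerignore() {
  dir="${1:-.}"
  cat > "$dir/.dockerignore" <<'EOF'
# Keep the build context small and keep local artifacts out of "COPY . ."
node_modules
.git
dist
# Assumed extras, not from the rules of thumb:
.env
*.log
EOF
}

# Usage (not run here):
#   write_dockerignore .
```

Excluding node_modules and dist also means a stale local build can never leak into the image through COPY, since both are regenerated inside the build stage.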
kubectl — the daily driver
# Context & config
kubectl config get-contexts
kubectl config use-context aks-prod
kubectl config set-context --current --namespace=payments # stop typing -n
# Inspect
kubectl get pods -A -o wide
kubectl get pods -l app=payments --watch
kubectl describe pod payments-7d9 -n payments # events at the bottom = gold
kubectl get events -n payments --sort-by=.lastTimestamp
# Logs & exec
kubectl logs -f deploy/payments -n payments --all-containers
kubectl logs payments-7d9 -n payments --previous # crashed container's logs
kubectl exec -it deploy/payments -n payments -- sh
kubectl debug -it payments-7d9 --image=busybox --target=app # ephemeral debug container
# Apply / diff / rollout
kubectl apply -f k8s/ --recursive
kubectl diff -f k8s/ # preview before apply
kubectl rollout status deploy/payments -n payments
kubectl rollout undo deploy/payments -n payments # roll back
kubectl rollout restart deploy/payments -n payments # bounce pods (picks up changed ConfigMaps/Secrets)
# Scale & resources
kubectl scale deploy/payments --replicas=5 -n payments
kubectl top pods -n payments # needs metrics-server
kubectl get hpa -n payments
# Networking & access
kubectl port-forward svc/payments 8080:80 -n payments
kubectl auth can-i create deployments --as system:serviceaccount:ci:deployer
# Power moves
kubectl get pods -o jsonpath='{.items[*].metadata.name}'
kubectl get pod payments-7d9 -o yaml | kubectl neat # clean YAML (krew plugin)
kubectl explain ingress.spec.rules # schema docs inline
Troubleshooting flow when a pod won’t start:
1. kubectl get pod → status: ImagePullBackOff? CrashLoopBackOff? Pending?
2. kubectl describe pod → read Events (image pull auth, scheduling, probes).
3. Pending → kubectl describe node / check requests vs. capacity, taints, PVCs.
4. CrashLoopBackOff → kubectl logs --previous; check the liveness probe & command.
5. ImagePullBackOff → registry auth (imagePullSecrets), tag typo, private registry firewall.
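The flow above can be sketched as a small lookup that maps a pod status to the next command to run. The function name next_step and the POD placeholder are illustrative, not part of kubectl; ErrImagePull is included because it is the status a pod shows just before ImagePullBackOff.

```shell
# next_step STATUS: print the follow-up command for a pod status.
# POD is a placeholder for the real pod name.
next_step() {
  case "$1" in
    Pending)
      echo 'kubectl describe pod POD; kubectl describe node' ;;
    CrashLoopBackOff)
      echo 'kubectl logs POD --previous' ;;
    ImagePullBackOff|ErrImagePull)
      echo 'kubectl get pod POD -o jsonpath={.spec.imagePullSecrets}' ;;
    *)
      # Unknown or healthy status: Events are still the best first read.
      echo 'kubectl describe pod POD' ;;
  esac
}

# Usage (not run here): STATUS is the third column of "kubectl get pod".
#   next_step "$(kubectl get pod payments-7d9 -n payments --no-headers | awk '{print $3}')"
```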
Helm — package & release management
# Repos & search
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm search repo postgres
# Render & inspect before installing (always)
helm template myrel bitnami/postgresql -f values.yaml | less # see the YAML it will apply
helm install myrel bitnami/postgresql -f values.yaml --dry-run --debug
# Install / upgrade
helm install payments ./chart -n payments --create-namespace -f values.prod.yaml
helm upgrade --install payments ./chart -n payments -f values.prod.yaml --atomic --timeout 5m
# --install -> install if absent, else upgrade
# --atomic -> auto-rollback on failure (implies --wait)
# --wait -> block until resources are Ready, up to --timeout
# Lifecycle
helm list -A
helm history payments -n payments
helm rollback payments 3 -n payments # revert to revision 3
helm uninstall payments -n payments
# Authoring a chart
helm create mychart # scaffolds Chart.yaml, values.yaml, templates/
helm lint ./mychart
helm package ./mychart # -> mychart-0.1.0.tgz
Chart layout:
mychart/
├── Chart.yaml            # name, version, appVersion, dependencies
├── values.yaml           # default config (override per env with -f)
├── templates/
│   ├── deployment.yaml   # uses {{ .Values.* }} and {{ .Release.* }}
│   ├── service.yaml
│   ├── ingress.yaml
│   └── _helpers.tpl      # reusable template snippets (labels, names)
└── charts/               # vendored sub-charts (dependencies)
A templating snippet you’ll use constantly:
# templates/deployment.yaml
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          {{- with .Values.resources }}
          resources: {{- toYaml . | nindent 12 }}
          {{- end }}
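To make the snippet concrete, here is a sketch of a values file that supplies every key the template reads (replicaCount, image.repository, image.tag, resources). The helper name write_values, the registry host, and the resource numbers are assumptions chosen for illustration.

```shell
# write_values FILE: hypothetical helper that writes an override file
# covering the keys used in templates/deployment.yaml.
write_values() {
  cat > "$1" <<'EOF'
replicaCount: 3
image:
  repository: registry.example.com/payments   # placeholder registry
  tag: ""                                      # empty -> falls back to .Chart.AppVersion
resources:                                     # omit this key and the with-block renders nothing
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 256Mi
EOF
}

# Usage (not run here): render only the Deployment to check the result.
#   write_values values.override.yaml
#   helm template payments ./chart -f values.override.yaml --show-only templates/deployment.yaml
```

Leaving tag empty is deliberate: the default function treats an empty string as unset, so the image tag follows appVersion in Chart.yaml unless an environment overrides it.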
Enterprise scenario
A payments platform team running EKS pushed a Helm upgrade that silently wedged production. The pipeline ran helm upgrade --install payments ./chart --atomic --timeout 5m. The new revision changed a Deployment readiness probe path, so the new pods never became Ready and --atomic triggered a rollback. That rollback also timed out, because the old ReplicaSet’s pods had already been terminated. From then on Helm answered with "another operation (install/upgrade/rollback) is in progress", the release sat in pending-upgrade, and no further helm upgrade would run.
The constraint: an --atomic rollback is itself a release operation, and if it exceeds --timeout you are left with a half-applied state plus a release record stuck in a pending status, which Helm treats as a lock. The fix had two parts. First, roll back explicitly to the last known-good revision, with a longer timeout:
helm history payments -n payments # find the last revision with status "deployed" (e.g. 41)
helm rollback payments 41 -n payments --wait --timeout 10m
kubectl rollout status deploy/payments -n payments
Second, if helm rollback still refused because of the pending-upgrade status, they deleted the release Secret of the failed revision only, so Helm no longer saw an in-flight operation:
kubectl get secret -n payments -l owner=helm,name=payments \
--sort-by=.metadata.creationTimestamp
kubectl delete secret sh.helm.release.v1.payments.v42 -n payments # the failed rev only
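Before deleting anything, it helps to confirm the release really is wedged. The sketch below checks for Helm's three pending statuses; the function names is_stuck and release_status are hypothetical, and release_status assumes both helm and jq are installed.

```shell
# is_stuck STATUS: succeed when a Helm release status means an
# operation started but never finished.
is_stuck() {
  case "$1" in
    pending-install|pending-upgrade|pending-rollback) return 0 ;;
    *) return 1 ;;
  esac
}

# release_status RELEASE NAMESPACE: read the current status from Helm.
release_status() {
  helm status "$1" -n "$2" -o json | jq -r '.info.status'
}

# Usage (not run here):
#   if is_stuck "$(release_status payments payments)"; then
#     echo "payments is wedged: roll back, or remove the failed revision's Secret"
#   fi
```

For a cluster-wide sweep, helm list -A --pending lists every release currently in one of these states.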
The durable lesson: never let --timeout be shorter than a realistic rollout plus its rollback, and run helm template | kubectl diff -f - in CI so a changed probe path shows up in review before it ever reaches the cluster. The diff does not validate the path; it makes the change visible. They also added --wait-for-jobs and raised timeouts to 10m on stateful releases.
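A minimal sketch of that CI gate, assuming the same release name, chart path, and values file used earlier. It relies on kubectl diff's exit status: 0 for no differences, 1 when differences exist, and greater than 1 on error. The function names classify_diff and preview_release are hypothetical.

```shell
# classify_diff EXIT_CODE: interpret the exit status of kubectl diff.
classify_diff() {
  case "$1" in
    0) echo "no-changes" ;;
    1) echo "changes" ;;
    *) echo "error" ;;
  esac
}

# preview_release: render the chart and diff it against the live cluster.
preview_release() {
  rc=0
  # Without pipefail the pipeline's status is kubectl's; "|| rc=$?" records
  # it and keeps "set -e" from aborting when a diff is found.
  helm template payments ./chart -n payments -f values.prod.yaml \
    | kubectl diff -n payments -f - || rc=$?
  classify_diff "$rc"
}

# Usage (not run here): fail the job on "error", require review on "changes".
#   result="$(preview_release | tail -n 1)"
```

Treating exit status 1 as "needs review" rather than as a failure matters here, since a plain set -e pipeline would otherwise abort on every legitimate change.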
Quick mental map
- Docker builds and runs the image.
- kubectl talks to the cluster (imperative debugging + declarative apply).
- Helm packages many manifests into a versioned, parameterized release.
Keep kubectl diff and helm template/--dry-run in your muscle memory — previewing changes before applying them is the single habit that prevents the most production incidents.