Kubernetes Jobs, CronJobs & DaemonSets, In Depth

So far in this course you have run workloads that are meant to stay up forever: a Deployment keeps a fixed number of identical Pods alive, restarting them whenever they die. That is exactly right for a web server or an API — but a huge amount of real work is not like that. Sometimes you need to run a task once, until it finishes: a database migration, a backup, a batch import, a one-off report. Sometimes you need to run that task on a schedule — every night at 02:00, every five minutes, on the first of the month. And sometimes you need exactly the opposite of “a fixed number of Pods”: you need one Pod on every node — a log collector, a metrics agent, a storage driver — that automatically appears on new nodes and disappears when nodes leave.

Kubernetes has three workload controllers built for precisely these shapes:

Job — runs Pods until a set number of them complete successfully, then stops. The right tool for run-to-completion batch work.
CronJob — creates Jobs on a repeating cron schedule. The right tool for scheduled batch work.
DaemonSet — ensures a copy of a Pod runs on every (matching) node. The right tool for per-node agents.

This lesson covers all three exhaustively — every field, what it does, what values it takes, its default, when to set it, and the gotcha that bites people in production. It is long on purpose: by the end you will understand these three objects well enough to design batch and infrastructure workloads with confidence and to answer the exam questions that probe them. Everything targets Kubernetes v1.30+, where the newer features (completionMode: Indexed, podFailurePolicy, the .spec.suspend field, the timeZone field on CronJobs, and native sidecar containers) are stable or beta-on-by-default.

Learning objectives

By the end of this lesson you can:

Explain what a Job is, how completions and parallelism interact, and the difference between NonIndexed and Indexed completion modes.
Control Job failure handling with backoffLimit, activeDeadlineSeconds, restartPolicy, and a Pod failure policy — and clean up finished Jobs automatically with ttlSecondsAfterFinished.
Write a CronJob with a correct cron schedule and timeZone, and choose the right concurrencyPolicy (Allow, Forbid, Replace).
Understand CronJob safety valves: startingDeadlineSeconds, successfulJobsHistoryLimit, failedJobsHistoryLimit, and suspend.
Deploy a DaemonSet, target it at a subset of nodes, tolerate control-plane taints, and roll it out with RollingUpdate vs OnDelete.
Choose correctly between Job, CronJob, DaemonSet, Deployment and StatefulSet for a given requirement.

Prerequisites & where this fits

You need a local cluster and basic comfort with kubectl and a Pod spec. If you have not set up a cluster yet, do the lab in What Is Kubernetes? Control Plane, Nodes, etcd & the kubelet — it walks you through a free local cluster with kind or minikube. Because all three controllers wrap a Pod template, it helps to have met Pods and their fields (containers, restartPolicy, probes, resources) in Pods, ReplicaSets, Deployments & Services: The Core Objects. Knowing how a Deployment owns a ReplicaSet which owns Pods gives you the contrast that makes Jobs and DaemonSets click.

This is Lesson 11 of the Kubernetes Zero-to-Hero course (Foundation tier). It follows the RBAC & Service Accounts fundamentals lesson and leads into advanced scheduling — affinity, topology spread, taints and preemption, which builds directly on the node-targeting ideas you meet here with DaemonSets.

Core concepts: controllers, run-to-completion vs run-forever, and the Pod template

Every workload in Kubernetes is managed by a controller — a control-loop running in the controller manager that constantly compares desired state (what you declared) with actual state (what exists) and acts to close the gap. A Deployment’s controller says “I always want 3 Pods up”; if one dies, it makes another. The three objects in this lesson are controllers too, but with different goals:

Controller	Desired state it enforces	Stops when…	Pod `restartPolicy` allowed
Deployment (via ReplicaSet)	N identical Pods are always running	never (you delete it)	`Always` only
Job	A target number of Pods complete successfully	the target is met	`Never` or `OnFailure`
CronJob	Jobs are created on a schedule	you delete/suspend it	(inherited by the Jobs it creates)
DaemonSet	One Pod runs on every matching node	never (you delete it)	`Always` (default)

That restartPolicy column is the single most important mental model for batch work. A Deployment’s Pods run a long-lived process that should never exit on its own — so the only sensible policy is Always (Kubernetes restarts the container if it ever stops). A Job’s Pods run a process that is meant to exit — so Always is forbidden (it would restart a task that already finished). We will return to this repeatedly.

All three objects embed a Pod template under .spec.template — the same Pod spec you already know (containers, env, volumes, resources, probes). The controller stamps out Pods from that template. So you are not learning a new way to describe a Pod; you are learning three new wrappers that decide how many Pods, when, and where.

One more shared idea: labels, selectors and ownerReferences. Each controller adds labels to the Pods it creates and watches for Pods matching a selector, and each created object carries an ownerReference back to its controller. This is what lets kubectl delete job my-job garbage-collect the Pods it owns, and what links a CronJob → its Jobs → their Pods. For Jobs you almost never write the selector yourself: the Job controller generates a guaranteed-unique one for you, built on the batch.kubernetes.io/controller-uid label (whose value is the Job's own UID). Do not set .spec.selector on a Job manually unless you truly know what you are doing; getting it wrong makes a Job adopt or fight over the wrong Pods.
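
To make the ownership chain concrete, the excerpt below sketches the metadata a Job controller typically stamps onto a Pod it creates. The Job name my-job comes from the paragraph above; the Pod name suffix and the UID are placeholders rather than real output, and the exact label set can vary by Kubernetes version.

```yaml
# Illustrative excerpt of a Pod owned by Job "my-job" (not captured from a real cluster).
# The name suffix and the all-zero UID are placeholders.
metadata:
  name: my-job-abcde
  labels:
    batch.kubernetes.io/job-name: my-job
    # prefixed form of the controller-uid label the auto-generated selector matches on
    batch.kubernetes.io/controller-uid: 00000000-0000-0000-0000-000000000000
  ownerReferences:
    - apiVersion: batch/v1
      kind: Job
      name: my-job
      uid: 00000000-0000-0000-0000-000000000000   # UID of the owning Job
      controller: true
      blockOwnerDeletion: true
```

Because the selector keys on the Job's UID rather than its name, deleting a Job and recreating one with the same name does not adopt Pods left over from the old one.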

Jobs: run-to-completion work

A Job runs one or more Pods and tracks how many have completed successfully. When enough have succeeded, the Job is marked Complete and stops creating Pods. If a Pod fails, the Job (by default) makes a new one, up to a retry budget. This is the foundation for every batch task in Kubernetes — and CronJobs are just Jobs on a timer, so understanding Jobs deeply gets you most of the way through this whole lesson.

Here is the smallest useful Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      restartPolicy: Never        # required for Jobs: Never or OnFailure
      containers:
        - name: pi
          image: perl:5.34
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
  backoffLimit: 4                  # give up after 4 failed retries

Apply it and watch:

kubectl apply -f pi.yaml
kubectl get job pi -w
# NAME   STATUS     COMPLETIONS   DURATION   AGE
# pi     Complete   1/1           7s         9s
kubectl logs job/pi               # prints 2000 digits of pi

The Job spec, field by field

This is the matrix to internalise. Every field below lives under .spec of a batch/v1 Job.

Field	What it does	Values	Default	When to set	Gotcha
`template`	The Pod template the Job stamps out. Same as any Pod spec.	a Pod template	required	always	The template’s `restartPolicy` must be `Never` or `OnFailure` — `Always` is rejected.
`template.spec.restartPolicy`	What the kubelet does if the container in a Pod exits non-zero.	`Never`, `OnFailure`	(none — you must set it)	always	`OnFailure` restarts the container in place (Pod stays, restart count climbs); `Never` lets the Pod fail and the Job makes a new Pod. See the deep dive below.
`completions`	How many Pods must succeed for the Job to be `Complete`.	integer ≥ 0	`1` when `parallelism` is also unset; otherwise unset (work-queue mode)	parallel/indexed batch	With `Indexed` mode, also sets the number of indexes (0…completions-1).
`parallelism`	How many Pods may run at the same time.	integer ≥ 0	`1`	to speed up batch work	Setting it to `0` pauses the Job (no new Pods) without deleting it.
`completionMode`	How completions are counted and indexed.	`NonIndexed`, `Indexed`	`NonIndexed`	partitioned/SPMD work	`Indexed` gives each Pod a unique index via the `JOB_COMPLETION_INDEX` env var and a hostname suffix.
`backoffLimit`	How many Pod failures to tolerate before marking the Job `Failed`.	integer ≥ 0	`6`	always (tune it)	Counts failures across retries; once exceeded the Job stops and existing Pods are terminated. With `OnFailure` it counts container restarts; with `Never` it counts Pod failures.
`backoffLimitPerIndex`	(Indexed Jobs) failure budget per index instead of for the whole Job.	integer	unset	indexed jobs where one bad index shouldn’t kill all	Requires `completionMode: Indexed`. Pair with `maxFailedIndexes`.
`maxFailedIndexes`	(Indexed Jobs) how many indexes may fail before the whole Job fails.	integer	unset	indexed jobs	Lets the Job finish the good indexes even if some are doomed.
`activeDeadlineSeconds`	Wall-clock time budget for the whole Job once it starts.	integer (seconds)	unset (no limit)	any job that could hang	On expiry the Job is `Failed` with reason `DeadlineExceeded` and all Pods are killed — this overrides `backoffLimit` (a hard stop regardless of retries).
`ttlSecondsAfterFinished`	Auto-delete the Job (and its Pods) this many seconds after it finishes.	integer (seconds)	unset (kept forever)	almost always	`0` deletes immediately on completion; without it, finished Jobs pile up and clutter the namespace.
`podFailurePolicy`	Rules to react to specific failures (exit codes, conditions) instead of blindly retrying.	list of rules	unset	production batch	Requires `restartPolicy: Never`. Lets you fail fast on a non-retryable error or ignore a disruption. See below.
`suspend`	Pause the Job: terminate running Pods and create none until un-suspended.	`true`, `false`	`false`	queueing / scheduled start	Suspending an active Job deletes its running Pods (their work is lost). Resuming resets the start time.
`selector`	Label selector matching the Pods this Job manages.	label selector	auto-generated	almost never	Leave it unset. Setting it wrong causes the Job to adopt foreign Pods. To override you must also set `manualSelector: true`.
`manualSelector`	Opt out of the auto-generated, collision-free selector.	`true`, `false`	`false`	legacy/advanced only	Footgun. Only for migrating very old Jobs.
`podReplacementPolicy`	When to create a replacement Pod: as soon as the old one starts terminating or has failed (`TerminatingOrFailed`), or only once it has fully terminated and reached the `Failed` phase (`Failed`).	`TerminatingOrFailed`, `Failed`	`TerminatingOrFailed` (or `Failed`, the only permitted value, if a failure policy is set)	strict at-most-one work	`Failed` avoids briefly running two Pods for the same index.
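
As a sketch of the suspend row in the table above, a Job can be created in a paused state and released later. The Job name and image are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queued-report              # hypothetical name
spec:
  suspend: true                    # created paused: no Pods exist until this becomes false
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: report
          image: ghcr.io/example/report:1.0   # hypothetical image
```

Setting the field to false, for example with kubectl patch job queued-report --type=merge -p '{"spec":{"suspend":false}}', lets the controller create the Pods. Since resuming resets the start time, any activeDeadlineSeconds budget is measured from that moment.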

`completions` and `parallelism`: the three Job patterns

These two numbers together define the shape of your batch work. There are three classic patterns:

Pattern	`completions`	`parallelism`	Behaviour	Use for
Single Job	`1` (default)	`1` (default)	One Pod runs once; succeed → done.	A migration, a one-off report.
Fixed completion count	N	M (≤ N)	Run until N Pods succeed, up to M at a time.	Process N work items where any Pod can take the next item from a queue.
Work queue	unset	M	Pods coordinate via an external queue. Once any Pod exits 0, no new Pods are created; the Job completes when at least one Pod has succeeded and all Pods have terminated.	A shared queue where Pods pull tasks until it’s empty.
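
The work-queue row can be sketched as a manifest like the one below. The Job name, image, and the QUEUE_URL variable are hypothetical; the worker is assumed to exit 0 only when it finds the queue empty.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-drain                # hypothetical name
spec:
  parallelism: 3                   # three workers pull from the queue concurrently
  # completions is deliberately left unset: the first successful exit signals
  # that the queue is drained, and the Job completes once all Pods have terminated
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: ghcr.io/example/queue-worker:1.0    # hypothetical image
          env:
            - name: QUEUE_URL                        # hypothetical setting read by the worker
              value: "redis://queue.default.svc:6379"
```

The design trade-off is that Kubernetes no longer knows how many items exist; correctness depends on the workers agreeing, through the queue, on when the work is finished.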

A worked example — process 12 items, 4 at a time:

apiVersion: batch/v1
kind: Job
metadata:
  name: import
spec:
  completions: 12     # 12 successful Pods = done
  parallelism: 4      # at most 4 Pods running concurrently
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: ghcr.io/example/importer:1.2

The Job controller keeps 4 Pods running; each time one succeeds it starts another, until 12 have succeeded. If a Pod fails it counts toward backoffLimit and is replaced.

Indexed Jobs: giving each Pod a number

With completionMode: Indexed, the Job hands each Pod a unique index from 0 to completions-1. Each index must succeed exactly once for the Job to complete. Kubernetes exposes the index three ways: the JOB_COMPLETION_INDEX environment variable, an annotation (batch.kubernetes.io/job-completion-index), and the Pod hostname, which follows the pattern <job>-<index> (add a headless Service and a matching subdomain to make that name resolvable through DNS). This is how you partition a dataset deterministically (Pod 0 handles shard 0, Pod 1 shard 1) or run SPMD-style workloads (MPI, distributed training) where each worker needs a rank.

apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-shard
spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed       # <-- each Pod gets index 0..4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: shard
          image: busybox:1.36
          command:
            - sh
            - -c
            - 'echo "processing shard $JOB_COMPLETION_INDEX"; sleep 3'

Contrast with the default NonIndexed mode, where Pods are interchangeable and “completed” just means “this many succeeded, regardless of which Pod did what.” Use NonIndexed for a homogeneous work-queue; use Indexed when which unit of work a Pod does matters.
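
When Indexed workers need to find each other by name, the usual approach is a headless Service plus a matching subdomain in the Pod template. The following is a minimal sketch; the Service and Job names are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: shard-peers                 # hypothetical Service name
spec:
  clusterIP: None                   # headless: DNS resolves directly to the Pod IPs
  selector:
    batch.kubernetes.io/job-name: indexed-peers   # label the Job controller sets on its Pods
---
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-peers               # hypothetical Job name
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed
  template:
    spec:
      subdomain: shard-peers        # must match the Service name
      restartPolicy: Never
      containers:
        - name: peer
          image: busybox:1.36
          command:
            - sh
            - -c
            - 'echo "I am $(hostname), rank $JOB_COMPLETION_INDEX"; sleep 30'
```

With this in place, the Pod for index 0 should be reachable inside the namespace as indexed-peers-0.shard-peers, which is what rank-based frameworks generally expect.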

`restartPolicy: Never` vs `OnFailure` (the field people get wrong)

Both are legal for a Job, but they behave very differently, and backoffLimit counts different things in each case:

OnFailure — when the container exits non-zero, the kubelet restarts the container inside the same Pod. The Pod object survives; you see its RESTARTS count climb. backoffLimit here limits the container restart count. Good when retries are cheap and you want to keep the same Pod (and its node-local scratch).
Never — when the container exits non-zero, the Pod is marked Failed and stays around; the Job controller creates a brand-new Pod. backoffLimit here limits the number of failed Pods. This is the mode podFailurePolicy requires, and it gives you per-attempt Pods you can inspect after the fact.

A subtle gotcha with Never: failed Pods are not automatically deleted (so you can read their logs), which means a flapping Job can leave a litter of Failed Pods until ttlSecondsAfterFinished removes the Job or you clean up by hand. With OnFailure you instead get one Pod with a high restart count.

`backoffLimit` and the back-off timer

Retries are not instant. After each failure the Job controller waits with exponential back-off, starting at 10 seconds and doubling up to a cap of 6 minutes (10s, 20s, 40s, …, 360s). So a Job with backoffLimit: 6 that keeps failing can take many minutes to give up — budget for that. Once the limit is exceeded the Job’s .status.conditions gains a Failed condition with reason BackoffLimitExceeded, and any still-running Pods are terminated.

`activeDeadlineSeconds`: the hard stop

backoffLimit bounds failures; activeDeadlineSeconds bounds time. It is a wall-clock budget for the entire Job measured from when it starts running. If the deadline passes — even if the Job is making progress, even if backoffLimit is not exhausted — the Job is terminated with reason DeadlineExceeded. This is your safety net against a task that hangs forever (a stuck network call, an infinite loop). Note there is also a template.spec.activeDeadlineSeconds that bounds an individual Pod; the Job-level one bounds the whole Job. Set the Job-level one for “this batch must not run longer than an hour, full stop.”
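
The two deadlines can be combined, as in this sketch. The Job name is hypothetical, and the one-hour and ten-minute budgets are illustrative values, not recommendations.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bounded-batch               # hypothetical name
spec:
  activeDeadlineSeconds: 3600       # Job-level: the whole Job must finish within 1 hour
  backoffLimit: 6                   # still applies, but the deadline wins if it expires first
  template:
    spec:
      activeDeadlineSeconds: 600    # Pod-level: any single Pod is failed after 10 minutes
      restartPolicy: Never
      containers:
        - name: batch
          image: ghcr.io/example/batch:1.0
```

A Pod that hits its own deadline counts as an ordinary failure and is retried within backoffLimit; only the Job-level deadline ends the Job outright.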

`ttlSecondsAfterFinished`: clean up after yourself

By default, a finished Job (and the Pods it created) stays in the cluster forever so you can inspect it. In any real environment that means thousands of stale Jobs accumulating, slowing down kubectl get, and eventually pressuring etcd. The TTL-after-finished controller fixes this: set ttlSecondsAfterFinished and the Job is deleted that many seconds after it reaches Complete or Failed. Setting it to 0 deletes the Job the moment it finishes. A common production default is something like ttlSecondsAfterFinished: 86400 (keep finished Jobs for a day so you can debug failures, then auto-clean). CronJob history limits (below) are a separate, complementary mechanism.

Pod failure policy: react to why a Pod failed

By default a Job treats every Pod failure the same — count it, back off, retry — until backoffLimit. But not all failures are equal. An exit code 42 might mean “bad input, will never succeed — stop now”; a DisruptionTarget condition means “the node was drained — that’s not the app’s fault, don’t count it.” podFailurePolicy (stable since v1.31, requires restartPolicy: Never) lets you encode exactly that:

apiVersion: batch/v1
kind: Job
metadata:
  name: smart-retry
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
      - action: FailJob            # non-retryable error -> fail the whole Job now
        onExitCodes:
          containerName: main
          operator: In
          values: [42]
      - action: Ignore             # node disruption -> don't count toward backoffLimit
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: ghcr.io/example/batch:1.0

The action for a matched rule is one of:

`action`	Effect
`FailJob`	Fail the entire Job immediately, skipping remaining retries. Use for non-retryable errors.
`Ignore`	The failure does not count toward `backoffLimit`. Use for infrastructure disruptions (preemption, drains, spot eviction).
`Count`	Count it toward `backoffLimit` (the default behaviour, stated explicitly).
`FailIndex`	(Indexed Jobs) fail just this index, not the whole Job. Needs `backoffLimitPerIndex`.

This is the difference between a robust production batch system and one that wastes 30 minutes retrying an error that can never succeed, or that fails a perfectly good Job because a node happened to be drained.
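
For Indexed Jobs, the FailIndex action pairs with the per-index budget fields from the Job spec table. The sketch below assumes a cluster where backoffLimitPerIndex is available (it is newer than podFailurePolicy and may still be behind a feature gate on some versions); the Job name is hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-retry             # hypothetical name
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed           # required for per-index back-off
  backoffLimitPerIndex: 1           # each index gets one retry
  maxFailedIndexes: 5               # fail the whole Job once more than 5 indexes have failed
  podFailurePolicy:
    rules:
      - action: FailIndex           # exit code 42 -> mark this index failed, skip its retries
        onExitCodes:
          containerName: main
          operator: In
          values: [42]
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: ghcr.io/example/batch:1.0
```

The effect is that one poisoned shard costs at most its own retries, while the remaining indexes keep running to completion.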

Job status: how to read it

kubectl describe job <name> shows the fields that tell you what’s happening:

.status.active — Pods currently running.
.status.succeeded — Pods that completed successfully.
.status.failed — Pods that failed.
.status.startTime / .status.completionTime — wall-clock bounds.
.status.conditions — Complete or Failed (with a reason like BackoffLimitExceeded, DeadlineExceeded).
.status.completedIndexes — (Indexed) which indexes are done, as a compact range string like 0-2,4.

CronJobs: Jobs on a schedule

A CronJob does one thing: on a repeating schedule, it creates a Job. Everything you just learned about Jobs applies to the Jobs a CronJob spawns — you write the Job spec under .spec.jobTemplate. The CronJob adds the when (a cron expression and time zone) and the what-if-they-overlap safety controls.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"            # every day at 02:00
  timeZone: "Asia/Kolkata"        # interpret the schedule in IST (v1.27+ stable)
  concurrencyPolicy: Forbid       # never run two backups at once
  startingDeadlineSeconds: 300    # if we miss the slot, only start within 5 min
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:                    # <-- a full Job spec lives here
    spec:
      backoffLimit: 2
      ttlSecondsAfterFinished: 3600
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: ghcr.io/example/pg-backup:1.0

The CronJob spec, field by field

Field	What it does	Values	Default	When to set	Gotcha
`schedule`	The cron expression defining when Jobs are created.	standard 5-field cron, or macros like `@daily`	required	always	Five fields: minute hour day-of-month month day-of-week. See the cron syntax below.
`timeZone`	The IANA time zone in which `schedule` is interpreted.	e.g. `Asia/Kolkata`, `UTC`, `Etc/UTC`	the kube-controller-manager’s time zone (historically UTC)	always, for clarity	Stable since v1.27. Without it your “02:00” is in the controller’s zone, which surprises people. Use a valid IANA name (not `IST`/`PST` abbreviations).
`concurrencyPolicy`	What to do when it’s time to run but the previous Job hasn’t finished.	`Allow`, `Forbid`, `Replace`	`Allow`	overlapping/long jobs	See the table below — the most consequential CronJob field.
`startingDeadlineSeconds`	If the controller misses a scheduled time (it was down, or `concurrencyPolicy: Forbid` blocked it), how late may it still start that run?	integer (seconds)	unset (no deadline)	always recommended	If unset there is no deadline, so a missed run can start arbitrarily late once the controller recovers. If set too low (the controller checks roughly every 10 seconds), transient delays drop runs. Do not set it absurdly high: the controller only counts 100 missed schedules before giving up and logging an error.
`suspend`	Stop creating new Jobs (running ones continue).	`true`, `false`	`false`	maintenance / disable	Suspending does not stop a Job already running, only future scheduling. Schedules that pass while suspended count as missed: on resume, a missed run can start immediately unless `startingDeadlineSeconds` has already expired for it.
`successfulJobsHistoryLimit`	How many completed Jobs to keep for inspection.	integer ≥ 0	`3`	tune for retention	`0` deletes successful Jobs immediately (you lose their logs/history).
`failedJobsHistoryLimit`	How many failed Jobs to keep.	integer ≥ 0	`1`	tune for retention	Keep at least `1` so you can debug the last failure.
`jobTemplate`	The Job that gets created on each tick — a complete Job spec.	a Job template	required	always	Everything from the Job section applies here (backoffLimit, ttl, restartPolicy, podFailurePolicy).

Cron syntax, in full

The schedule uses standard cron — five space-separated fields:

┌───────────── minute        (0 - 59)
│ ┌───────────── hour        (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month   (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6, Sunday = 0)
│ │ │ │ │
* * * * *

The operators in each field:

Operator	Meaning	Example	Reads as
`*`	every value	`* * * * *`	every minute
`,`	list	`0 0,12 * * *`	at 00:00 and 12:00
`-`	range	`0 9-17 * * *`	every hour from 09:00 to 17:00
`/`	step	`*/5 * * * *`	every 5 minutes
combo	mix	`0 8-18/2 * * 1-5`	every 2 hours, 08:00–18:00, Mon–Fri

Common schedules to memorise: */5 * * * * (every 5 min), 0 * * * * (hourly, on the hour), 0 0 * * * (daily at midnight), 0 0 * * 0 (weekly, Sunday midnight), 0 0 1 * * (monthly, 1st at midnight). Kubernetes also accepts the macros @yearly/@annually, @monthly, @weekly, @daily/@midnight, @hourly. A non-standard but supported extension lets you write @every 1h30m. Use crontab.guru to sanity-check an expression — a wrong field is the single most common CronJob bug.

A classic exam trap: day-of-month and day-of-week are OR-ed, not AND-ed, when both are restricted. 0 0 13 * 5 means “at midnight on the 13th of the month or any Friday,” not “Friday the 13th.”
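
Because cron cannot express "Friday AND the 13th" on its own, one workaround is to schedule on the day-of-month only and check the weekday inside the container. This is a sketch under two assumptions: the container clock is UTC (hence timeZone: Etc/UTC), and the image's date command supports the %u format, as busybox does. The CronJob name is hypothetical.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: friday-13th                 # hypothetical name
spec:
  schedule: "0 0 13 * *"            # fire on the 13th only; day-of-week stays unrestricted
  timeZone: "Etc/UTC"               # keeps the schedule aligned with the container's UTC clock
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: busybox:1.36
              command:
                - sh
                - -c
                # date +%u prints the ISO weekday: 1 = Monday ... 5 = Friday
                - 'if [ "$(date +%u)" -eq 5 ]; then echo "running the real task"; else echo "not a Friday, skipping"; fi'
```

The Job still runs (and succeeds) on every 13th; it simply does nothing unless that day is a Friday.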

`concurrencyPolicy`: the field that matters most

What happens when a new run is due but the previous Job is still running? This is the question that decides whether your nightly backup quietly piles up ten overlapping copies and exhausts the cluster.

Policy	Behaviour	When to use	Risk if wrong
`Allow` (default)	Start the new Job regardless; multiple runs may overlap.	Jobs that are short, idempotent, and safe to run concurrently.	A slow job (or a backlog) spawns many overlapping Pods → resource exhaustion, duplicate side-effects.
`Forbid`	If the previous Job is still running, skip this run (and count it as a missed schedule for `startingDeadlineSeconds`).	Jobs that must never overlap (backups, anything that writes shared state).	If a job routinely overruns its interval, runs get skipped — watch for that.
`Replace`	Cancel the still-running previous Job and start the new one.	“Only the latest matters” jobs (e.g. a periodic full-refresh where a stale run is pointless).	The killed Job’s work is lost mid-flight; not safe for jobs with side-effects you can’t interrupt.

For almost any job that writes to shared state, Forbid is the safe default — pair it with a sensible startingDeadlineSeconds so a brief overrun doesn’t silently drop the next several runs.
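
For contrast with the Forbid backup shown earlier, here is a sketch of a job where Replace is the better fit: a periodic full refresh where only the newest run matters. The name, image, and timing values are hypothetical.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-refresh               # hypothetical name
spec:
  schedule: "*/5 * * * *"           # every 5 minutes
  timeZone: "Etc/UTC"
  concurrencyPolicy: Replace        # a run still going at the next tick is cancelled
  startingDeadlineSeconds: 120
  jobTemplate:
    spec:
      backoffLimit: 0               # no retries: the next tick rebuilds everything anyway
      activeDeadlineSeconds: 240    # safety net in case a run hangs
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: refresh
              image: ghcr.io/example/cache-refresh:1.0   # hypothetical image
```

This only works if the refresh is safe to kill at any point, for example because it writes to a temporary location and swaps it in atomically at the end.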

`startingDeadlineSeconds` and missed schedules

The CronJob controller wakes up periodically and asks “are there scheduled times I haven’t acted on yet?” If it was down, or Forbid blocked a slot, time may have passed. startingDeadlineSeconds says how late a missed run may still be started. If a scheduled time is older than that deadline, it’s skipped. With the field unset, there is effectively no deadline, but there is a separate hard limit: if the controller finds more than 100 missed schedules since it last succeeded (e.g. the CronJob was suspended for a long time, or the deadline is huge), it stops trying and records the event FailedNeedsStart rather than firing a flood of Jobs. The practical guidance: always set startingDeadlineSeconds to a value larger than the controller’s check interval (roughly 10 seconds, so single-digit values can stop the CronJob from ever firing) but smaller than your schedule interval (e.g. 100–300s for an hourly job).

CronJob history and cleanup

A CronJob keeps the last successfulJobsHistoryLimit (default 3) successful Jobs and failedJobsHistoryLimit (default 1) failed Jobs, deleting older ones, including their Pods. This is separate from a Job’s own ttlSecondsAfterFinished. Use both: history limits cap how many finished Jobs the CronJob retains by count; the Job-level TTL deletes each finished Job (and its Pods) after a fixed time, even if the history limit would still have kept it. Setting a history limit to 0 deletes that category immediately and is occasionally what you want for very high-frequency CronJobs that would otherwise generate churn.

CronJob status and suspension

kubectl get cronjob shows LAST SCHEDULE (when it last fired) and ACTIVE (how many Jobs are running right now). To temporarily stop a CronJob without deleting it (for a maintenance window, say) set spec.suspend: true (or kubectl patch cronjob nightly-backup -p '{"spec":{"suspend":true}}'). Running Jobs continue; no new ones are created. Crucially, schedules that pass while the CronJob is suspended count as missed runs: when you un-suspend a CronJob that has no startingDeadlineSeconds, the controller can start a missed run immediately instead of waiting for the next slot. Set a starting deadline if a late catch-up run would be harmful.

DaemonSets: one Pod per node

A DaemonSet ensures that a copy of a Pod runs on every node (or every node matching a selector). When a new node joins the cluster, the DaemonSet controller automatically adds the Pod there; when a node leaves, that Pod is garbage-collected. There is no replicas field — the desired count is “the number of matching nodes,” and the controller tracks it for you. This is the workload type for node-level infrastructure: log collectors (Fluent Bit), metrics agents (node-exporter), CNI network plugins (Calico, Cilium), CSI storage drivers, and security agents.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-logger
  labels:
    app: node-logger
spec:
  selector:
    matchLabels:
      app: node-logger
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: node-logger
    spec:
      tolerations:
        - operator: Exists          # tolerate ALL taints -> run on every node incl. control-plane
      containers:
        - name: logger
          image: fluent/fluent-bit:3.0
          resources:
            requests: { cpu: 50m, memory: 64Mi }
            limits:   { memory: 128Mi }
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log

The DaemonSet spec, field by field

Field	What it does	Values	Default	When to set	Gotcha
`selector`	Label selector identifying the DaemonSet’s Pods. Must match `template.metadata.labels`.	label selector	required	always	Unlike a Job, you must set this, and it is immutable after creation.
`template`	The Pod template run on each node.	a Pod template	required	always	`restartPolicy` defaults to (and must be) `Always` — these are long-lived agents.
`template.spec.nodeSelector`	Restrict the DaemonSet to nodes carrying these labels.	map of node labels	unset (= all nodes)	targeting a node subset	e.g. `kubernetes.io/os: linux` to skip Windows nodes; or a custom label like `gpu: "true"`.
`template.spec.affinity.nodeAffinity`	A richer way to target nodes (expressions, in/not-in).	node affinity rules	unset	complex targeting	Use instead of `nodeSelector` when you need OR / NotIn logic.
`template.spec.tolerations`	Lets the Pod schedule onto tainted nodes (control-plane, dedicated pools).	list of tolerations	(some added automatically — see below)	running on control-plane / tainted nodes	Without a matching toleration the DaemonSet skips tainted nodes. `operator: Exists` with no key tolerates everything.
`updateStrategy.type`	How to roll out a changed Pod template.	`RollingUpdate`, `OnDelete`	`RollingUpdate`	always consider it	See the rollout section below.
`updateStrategy.rollingUpdate.maxUnavailable`	During a rolling update, how many node Pods may be down at once.	integer or `%`	`1`	tune blast radius	Higher = faster rollout, more nodes briefly without the agent.
`updateStrategy.rollingUpdate.maxSurge`	Allow a new Pod to start on a node before the old one is gone (brief double-run per node).	integer or `%`	`0`	zero-downtime agents	Stable since v1.25. With `maxSurge > 0`, `maxUnavailable` must be `0`. Needs the agent to tolerate two copies briefly.
`minReadySeconds`	How long a new Pod must be `Ready` before it’s considered available (gates the rollout pace).	integer (seconds)	`0`	flaky agents	Slows a rollout so a crash-looping new version doesn’t take out every node at once.
`revisionHistoryLimit`	How many old `ControllerRevision` objects to keep for rollback.	integer	`10`	rarely change	Lets `kubectl rollout undo daemonset/...` work.

How DaemonSets are scheduled (and why they ignore some rules)

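Modern DaemonSets are scheduled by the default scheduler, not by a special path: for each eligible node (as filtered by your nodeSelector/affinity) the DaemonSet controller creates a Pod and injects a node affinity that pins it to that node by name, and the scheduler then binds it. Two consequences worth knowing: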

DaemonSet Pods tolerate many node conditions automatically. The controller adds tolerations for taints like node.kubernetes.io/not-ready, unreachable, disk-pressure, memory-pressure, pid-pressure, and unschedulable so that an agent keeps running on a node that’s having trouble — exactly when you most want your logging/monitoring agent present. You do not need to add these yourself.
To run on control-plane nodes you still need to tolerate their taint. Control-plane nodes carry node-role.kubernetes.io/control-plane:NoSchedule. A monitoring DaemonSet that must cover control-plane nodes needs either a specific toleration for that taint or a blanket tolerations: [{operator: Exists}].

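nodeName is not set by you: the controller targets nodes via affinity. And note that the automatic tolerations only stop node-condition taints from blocking or removing DaemonSet Pods; the kubelet can still evict them under resource pressure unless they run with a critical priority class such as system-node-critical.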
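
If the blanket operator: Exists toleration is broader than you want (it also tolerates dedicated-pool and NoExecute taints), a narrower alternative is to tolerate just the control-plane taint named above. A minimal fragment of the DaemonSet spec:

```yaml
# Fragment of a DaemonSet spec: tolerate only the control-plane taint
template:
  spec:
    tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
```

With this form the agent reaches control-plane nodes but still stays off any node that carries a custom taint you have not explicitly tolerated.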

Targeting a subset of nodes

You rarely want literally every node. Two common patterns:

# Only Linux nodes (skip Windows):
template:
  spec:
    nodeSelector:
      kubernetes.io/os: linux

# Only GPU nodes (using a custom label you applied to those nodes):
template:
  spec:
    nodeSelector:
      hardware: gpu

Apply the matching label to nodes with kubectl label node <node> hardware=gpu. Add or remove the label later and the DaemonSet adds/removes the Pod on that node automatically.
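
When a plain nodeSelector is not expressive enough, node affinity can state the same kind of rule with operators. The sketch below reuses the hardware label from the example above and targets every Linux node except those labelled hardware=gpu.

```yaml
# Fragment of a DaemonSet spec: Linux nodes that are NOT labelled hardware=gpu
template:
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:          # expressions within one term are AND-ed
                - key: kubernetes.io/os
                  operator: In
                  values: ["linux"]
                - key: hardware          # custom label applied with kubectl label node
                  operator: NotIn        # also matches nodes that lack the label entirely
                  values: ["gpu"]
```

Listing several entries under nodeSelectorTerms instead gives OR semantics, which a nodeSelector map cannot express.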

Rolling out DaemonSet changes: `RollingUpdate` vs `OnDelete`

When you change the Pod template (a new image, say), the update strategy decides what happens:

Strategy	Behaviour	When to use
`RollingUpdate` (default)	Automatically delete old Pods and create new ones, node by node, respecting `maxUnavailable`/`maxSurge`.	Almost always — controlled, automatic, observable with `kubectl rollout status`.
`OnDelete`	Do nothing on a template change; the new template is only applied to a node when you manually delete that node’s old Pod.	Sensitive agents (CNI, storage drivers) where you want to choose exactly when each node is touched, often draining first.

For RollingUpdate, track and control it just like a Deployment:

kubectl rollout status daemonset/node-logger
kubectl rollout history daemonset/node-logger
kubectl rollout undo daemonset/node-logger          # roll back to previous revision

maxUnavailable: 1 means one node loses its agent at a time during the rollout; raise it (or use a percentage) for a faster rollout at the cost of more nodes briefly missing the agent. Use maxSurge when the agent must never be absent and can tolerate a momentary second copy on the node.
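
A surge-style rollout might be configured as in the sketch below. The one non-obvious rule is that a DaemonSet rejects a non-zero maxSurge combined with a non-zero maxUnavailable, so maxUnavailable has to be set to 0 explicitly.

```yaml
# Fragment of a DaemonSet: start the replacement Pod before removing the old one.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one node at a time runs old + new side by side
      maxUnavailable: 0    # required: cannot be non-zero together with maxSurge
```

Keep in mind that during the overlap the node briefly runs two copies of the agent, so its resource usage on that node can double, and an agent that binds a fixed host port is a poor fit for surging because the second copy cannot bind the same port.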

Figure: Kubernetes core objects

The diagram above places Jobs, CronJobs and DaemonSets alongside the other core objects — notice that all three are controllers wrapping a Pod template, just like a Deployment, but each enforces a different notion of “desired state”: a count of completions (Job), a schedule (CronJob), or one Pod per node (DaemonSet).

When to use each (and when not to)

This decision table is the heart of the lesson — and a guaranteed interview question:

You need…	Use	Not	Why
A long-running service (web/API)	Deployment	Job	Jobs stop when done; Deployments stay up and self-heal.
A task that runs once and exits	Job	Deployment	A Deployment would restart the “finished” task forever.
That task on a repeating schedule	CronJob	a Job + external cron	CronJob is native, observable, and handles concurrency/history.
One agent on every node	DaemonSet	Deployment with many replicas	DaemonSet auto-scales with nodes and pins exactly one per node.
Stable identity + per-Pod storage (databases)	StatefulSet	Deployment/Job	StatefulSets give stable names, ordered rollout, and per-Pod volumes.
Batch work where each Pod needs a fixed rank/shard	Indexed Job	NonIndexed Job	Indexed assigns each Pod a deterministic index.

Two sharp contrasts beginners blur:

Job vs Deployment: a Deployment’s Pods must never exit (restartPolicy: Always); a Job’s Pods are meant to exit (Never/OnFailure). If you find yourself running a batch script inside a Deployment and watching it restart endlessly, you wanted a Job.
DaemonSet vs Deployment: a Deployment with replicas: N gives you N Pods placed wherever the scheduler likes (possibly several on one node, none on another). A DaemonSet gives you exactly one per matching node, tracking node count automatically. “An agent on every node” is a DaemonSet, full stop.

Hands-on lab

Free, on your laptop, using kind (or minikube). We will run a Job, an Indexed Job, a CronJob, and a DaemonSet, then clean up.

1. Create a cluster (with two worker nodes so the DaemonSet is interesting):

cat <<'EOF' | kind create cluster --name batch-lab --config -
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF

kubectl get nodes
# expect: 1 control-plane + 2 workers, all Ready

2. A simple Job — fixed completions with parallelism:

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-batch
spec:
  completions: 6
  parallelism: 2
  backoffLimit: 4
  ttlSecondsAfterFinished: 120
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: hello
          image: busybox:1.36
          command: ["sh", "-c", "echo done on $(hostname); sleep 2"]
EOF

kubectl get job hello-batch -w
# COMPLETIONS climbs 0/6 -> 6/6, two Pods at a time
kubectl get pods -l batch.kubernetes.io/job-name=hello-batch
kubectl logs -l batch.kubernetes.io/job-name=hello-batch --tail=1

Validation: kubectl get job hello-batch -o jsonpath='{.status.succeeded}' prints 6. After ~2 minutes the Job auto-deletes (thanks to the TTL) — re-run kubectl get job hello-batch and it’s gone.

3. An Indexed Job — each Pod gets a unique index:

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo
spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: shard
          image: busybox:1.36
          command: ["sh", "-c", "echo shard $JOB_COMPLETION_INDEX; sleep 2"]
EOF

kubectl wait --for=condition=complete job/indexed-demo --timeout=60s
kubectl logs -l batch.kubernetes.io/job-name=indexed-demo --prefix=true | sort
# four lines: shard 0, shard 1, shard 2, shard 3 (each Pod saw a distinct index)
kubectl get job indexed-demo -o jsonpath='{.status.completedIndexes}{"\n"}'
# -> 0-3

4. A CronJob — runs every minute, never overlaps:

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ping
spec:
  schedule: "*/1 * * * *"
  timeZone: "Etc/UTC"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 30
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      labels: { app: ping }        # copied onto every Job this CronJob creates
    spec:
      ttlSecondsAfterFinished: 90
      template:
        metadata:
          labels: { app: ping }    # copied onto every Pod of those Jobs
        spec:
          restartPolicy: Never
          containers:
            - name: ping
              image: busybox:1.36
              command: ["sh", "-c", "date; echo tick"]
EOF

kubectl get cronjob ping
# note LAST SCHEDULE updates each minute
# wait ~2 minutes, then:
# (Jobs carry no built-in label naming their CronJob, so select on the app label set above)
kubectl get jobs -l app=ping
kubectl logs -l app=ping --tail=2

# suspend it so it stops firing:
kubectl patch cronjob ping -p '{"spec":{"suspend":true}}'
kubectl get cronjob ping -o jsonpath='{.spec.suspend}{"\n"}'   # -> true

5. A DaemonSet — one Pod per node:

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hello-ds
spec:
  selector:
    matchLabels: { app: hello-ds }
  updateStrategy:
    type: RollingUpdate
    rollingUpdate: { maxUnavailable: 1 }
  template:
    metadata:
      labels: { app: hello-ds }
    spec:
      tolerations:
        - operator: Exists          # also land on the control-plane node
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
EOF

kubectl get daemonset hello-ds
# DESIRED CURRENT READY = 3 (one per node, control-plane included)
kubectl get pods -l app=hello-ds -o wide
# one Pod on each of the 3 nodes

Now watch a rolling update — change the image and observe node-by-node replacement:

kubectl set image daemonset/hello-ds pause=registry.k8s.io/pause:3.10
kubectl rollout status daemonset/hello-ds
kubectl rollout history daemonset/hello-ds

Validation: kubectl get ds hello-ds -o jsonpath='{.status.numberReady}' equals the node count (3). Remove the blanket toleration from the manifest and re-apply, and the DESIRED count drops to 2 (the control-plane node’s NoSchedule taint now excludes it).

Cleanup:

kubectl delete cronjob ping
kubectl delete daemonset hello-ds
kubectl delete job hello-batch indexed-demo --ignore-not-found
kind delete cluster --name batch-lab

Cost note: entirely free — kind runs the whole cluster in local Docker containers on your machine. Nothing is provisioned in any cloud, so there is no bill. The only resource consumed is your laptop’s CPU/RAM while the cluster is up; deleting the kind cluster reclaims it.

Common mistakes & troubleshooting

Symptom	Likely cause	Fix
`Job "x" is invalid: spec.template.spec.restartPolicy: Unsupported value: "Always"`	`restartPolicy` was omitted from the Job’s Pod template, so it defaulted to `Always`, which Jobs reject.	Set `restartPolicy: Never` or `OnFailure` in the Job’s Pod template.
Job never completes, Pods keep restarting with rising RESTARTS	`restartPolicy: OnFailure` plus a container that always exits non-zero.	Fix the command, or switch to `Never` to get discrete failed Pods you can inspect, and check `backoffLimit`.
Finished Jobs/Pods pile up forever	No `ttlSecondsAfterFinished`; for CronJobs, history limits too high.	Set `ttlSecondsAfterFinished` on the Job template and tune `successfulJobsHistoryLimit`/`failedJobsHistoryLimit`.
CronJob “runs at the wrong time”	Schedule interpreted in the controller’s zone, or a bad cron field.	Set `timeZone` explicitly; verify the expression on crontab.guru; remember DOM/DOW are OR-ed.
CronJob skips runs unexpectedly	`concurrencyPolicy: Forbid` and the previous Job overran the interval, or `startingDeadlineSeconds` too small.	Make the Job faster, lengthen the interval, or raise `startingDeadlineSeconds`.
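CronJob silently stopped firing	Controller was down long enough to miss >100 schedules.	Set or lower `startingDeadlineSeconds` so fewer missed runs are counted; recreate or un-suspend the CronJob; look for a warning event about too many missed start times (`FailedNeedsStart`, or `TooManyMissedTimes` on newer releases).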
DaemonSet has fewer Pods than nodes	Nodes are tainted (e.g. control-plane) and the Pod has no matching toleration, or a `nodeSelector` excludes them.	Add the right `tolerations`/`nodeSelector`; check `kubectl describe ds` events for `FailedScheduling`.
DaemonSet image change doesn’t roll out	`updateStrategy.type: OnDelete`.	Delete the old Pods manually, or switch to `RollingUpdate`.
`kubectl logs job/x` shows nothing	The Pod failed before producing output, or logs are on a deleted Pod.	Use `kubectl get pods -l batch.kubernetes.io/job-name=x` and `kubectl describe` the failed Pod; consider `restartPolicy: Never` to keep failed Pods.

Best practices

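Always set ttlSecondsAfterFinished on Jobs (and on CronJob jobTemplates). Without it, finished Jobs and their Pods stay in the API until someone deletes them (CronJob history limits prune only the Jobs a CronJob owns) and eventually put pressure on etcd. A TTL of about one day (86400 seconds) is a sane default: long enough to inspect logs, short enough to keep the cluster tidy.
Always set backoffLimit and activeDeadlineSeconds to bound failures and time. With neither set, a Job falls back to the default backoffLimit of 6, retried with exponential back-off, and has no time limit at all, so a stuck Pod can hang indefinitely.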
Set CPU/memory requests and limits on batch Pods. A parallel Job or a per-node DaemonSet without resource requests can starve your real workloads — and DaemonSet Pods run on every node, so a small leak is multiplied by node count.
Pick concurrencyPolicy deliberately. Default Allow plus a slow job is how clusters get buried in overlapping Pods. Use Forbid for anything touching shared state.
Always set timeZone on CronJobs. Relying on the controller’s zone is a portability and clarity trap.
Use podFailurePolicy for production batch: FailJob on non-retryable exit codes, Ignore for disruptions (spot eviction, drains) so node maintenance doesn’t waste your retry budget.
Make batch jobs idempotent. Pods can be retried, replaced, or (with Replace) interrupted — running the same task twice must be safe.
Use Indexed Jobs for partitioned work instead of inventing your own index via env vars or a queue — it’s built-in, deterministic, and survives retries.
For DaemonSets, prefer RollingUpdate with a small maxUnavailable so an agent regression takes out one node at a time; reserve OnDelete for CNI/CSI where you want manual, drained rollouts.
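Select Pods with the labels Kubernetes adds automatically, such as batch.kubernetes.io/job-name and batch.kubernetes.io/controller-uid, rather than hand-rolling your own. Jobs created by a CronJob do not get a built-in label naming that CronJob, so if you need to select them as a group, set your own label under jobTemplate.metadata.labels.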
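
Several of the practices above can be combined in one manifest. The following is a sketch under stated assumptions, not a reference configuration: the Job name nightly-import, the exit code 42 standing for a non-retryable error, and all the numeric values are illustrative choices you would replace with your own.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-import              # hypothetical name
spec:
  completions: 1
  backoffLimit: 3                   # bound the number of counted retries
  activeDeadlineSeconds: 1800       # hard stop after 30 minutes
  ttlSecondsAfterFinished: 86400    # delete the Job and its Pods a day after it finishes
  podFailurePolicy:
    rules:
      - action: FailJob             # assumed convention: exit code 42 is non-retryable
        onExitCodes:
          containerName: import
          operator: In
          values: [42]
      - action: Ignore              # disruptions do not consume backoffLimit
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never          # required when podFailurePolicy is set
      containers:
        - name: import
          image: busybox:1.36
          command: ["sh", "-c", "echo importing; sleep 5"]
          resources:
            requests: { cpu: 100m, memory: 64Mi }
            limits: { cpu: 500m, memory: 128Mi }
```

Rules in podFailurePolicy are evaluated in order and the first match wins, so place the most specific rule first. The same spec block can be dropped under a CronJob’s jobTemplate.spec unchanged.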

Security notes

DaemonSets often need elevated access (hostPath mounts, hostNetwork, hostPID, privileged containers) to do node-level work — and they run on every node, so a compromised DaemonSet Pod is a cluster-wide foothold. Grant the minimum: read-only hostPath where possible, drop capabilities, set a restrictive securityContext, and avoid privileged: true unless the agent genuinely requires it.
Scope ServiceAccount permissions tightly. A DaemonSet or batch Job rarely needs broad RBAC. Give each its own ServiceAccount with only the verbs it needs (see the RBAC fundamentals lesson), and set automountServiceAccountToken: false if the Pod doesn’t call the API at all.
Batch Jobs frequently handle secrets (database credentials for a migration, cloud keys for a backup). Mount them from a Secret, never bake them into the image or the command, and prefer short-lived/projected tokens.
Apply Pod Security Standards. Batch and agent Pods should meet at least the baseline, ideally restricted, profile — non-root, no privilege escalation, read-only root filesystem where feasible.
A CronJob is an automatic, recurring execution path. Treat its image and command as production code: pin image digests, scan them, and review changes — a poisoned CronJob image runs on your schedule, unattended.
Set resource limits to contain blast radius. An unbounded parallel Job or a leaky DaemonSet can become an accidental (or deliberate) denial-of-service across the cluster.
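
To make the hardening advice concrete, here is a sketch of a log-reading agent that applies most of the points above. The name node-agent, the busybox placeholder command, the UID, and the /var/log path are assumptions for illustration; a real agent will dictate its own image and mounts.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent                          # hypothetical name
spec:
  selector:
    matchLabels: { app: node-agent }
  template:
    metadata:
      labels: { app: node-agent }
    spec:
      automountServiceAccountToken: false   # this agent never calls the API
      containers:
        - name: agent
          image: busybox:1.36
          # Placeholder workload: list the mounted log directory once a minute.
          command: ["sh", "-c", "while true; do ls /host/var/log > /dev/null; sleep 60; done"]
          securityContext:
            runAsNonRoot: true
            runAsUser: 65534                # arbitrary unprivileged UID
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
            seccompProfile: { type: RuntimeDefault }
          resources:
            requests: { cpu: 50m, memory: 32Mi }
            limits: { cpu: 100m, memory: 64Mi }
          volumeMounts:
            - name: varlog
              mountPath: /host/var/log
              readOnly: true                # read-only view of the host directory
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```

One trade-off to be aware of: any hostPath volume, even a read-only one, falls outside the baseline and restricted Pod Security Standards, so a namespace that enforces them needs a deliberate exemption for agents like this. Everything else in the container’s securityContext is in line with the restricted profile.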

Interview & exam questions

What is the difference between completions and parallelism in a Job? completions is how many Pods must succeed for the Job to finish; parallelism is how many Pods may run concurrently. With completions: 12, parallelism: 4, the controller keeps 4 running until 12 have succeeded.
When would you choose restartPolicy: Never over OnFailure for a Job? Use Never when you want each attempt as a separate, inspectable Pod (failed Pods stick around) and when you need podFailurePolicy (which requires Never). Use OnFailure when retries are cheap and you’re happy to restart the container in place and watch the restart count. Note backoffLimit counts Pod failures under Never and container restarts under OnFailure.
A Job is making progress but you need it to stop after one hour no matter what — which field? activeDeadlineSeconds: 3600. It’s a wall-clock hard stop that overrides backoffLimit; on expiry the Job is Failed with reason DeadlineExceeded.
What does completionMode: Indexed give you? A unique index (0…completions-1) per Pod, exposed via JOB_COMPLETION_INDEX, an annotation, and a hostname suffix. Each index must succeed once. It’s for partitioned/sharded or SPMD work where which unit a Pod processes matters.
How do you stop finished Jobs from accumulating? ttlSecondsAfterFinished on the Job deletes it (and its Pods) N seconds after completion. For CronJobs, also tune successfulJobsHistoryLimit/failedJobsHistoryLimit.
Explain the three concurrencyPolicy values for a CronJob. Allow (default) lets runs overlap; Forbid skips a new run if the previous is still going; Replace cancels the running one and starts the new. Forbid is the safe default for state-mutating jobs.
A CronJob’s schedule is "0 0 13 * 5" — when does it run? Midnight on the 13th of every month or every Friday — day-of-month and day-of-week are OR-ed when both are restricted, not “Friday the 13th.”
What does startingDeadlineSeconds do, and what happens if a CronJob misses too many schedules? It bounds how late a missed run may still start; older misses are skipped. If more than 100 schedules have been missed since the last scheduled run (or within the startingDeadlineSeconds window, when that is set), the controller stops trying and emits FailedNeedsStart rather than firing a flood of Jobs.
Why does a DaemonSet have no replicas field? Its desired count is the number of matching nodes. The controller runs exactly one Pod per matching node and adjusts automatically as nodes join or leave.
How do you make a DaemonSet run on control-plane nodes? Add a toleration for node-role.kubernetes.io/control-plane:NoSchedule (or a blanket tolerations: [{operator: Exists}]). The controller already tolerates not-ready/pressure taints automatically, but not that one.
RollingUpdate vs OnDelete for a DaemonSet? RollingUpdate (default) replaces Pods node-by-node automatically, honouring maxUnavailable/maxSurge. OnDelete applies the new template only when you manually delete a node’s old Pod — used for CNI/CSI agents where you want to control (and drain) each node yourself.
You need an agent on every node — DaemonSet or a Deployment with replicas equal to the node count? DaemonSet. A Deployment doesn’t guarantee one-per-node (the scheduler could stack several on one node) and doesn’t track node count as nodes scale.

Quick check

True or false: a Job’s Pod template may use restartPolicy: Always.
Which CronJob field ensures two runs never overlap, skipping the new one if the old is still running?
What environment variable does an Indexed Job set in each Pod?
Which field auto-deletes a finished Job after a set time?
What’s the default updateStrategy.type for a DaemonSet?

Answers

False. A Job’s Pod template must use Never or OnFailure; Always is rejected because the task is meant to terminate.
concurrencyPolicy: Forbid.
JOB_COMPLETION_INDEX (also available as the annotation batch.kubernetes.io/job-completion-index).
ttlSecondsAfterFinished.
RollingUpdate (with maxUnavailable: 1 by default).

Exercise

Build a small “scheduled, resilient backup” workload on a local kind cluster:

Create a CronJob named db-backup that runs every 2 minutes in timeZone: Etc/UTC, with concurrencyPolicy: Forbid, startingDeadlineSeconds: 60, successfulJobsHistoryLimit: 2, and failedJobsHistoryLimit: 2.
Its jobTemplate should use restartPolicy: Never, backoffLimit: 2, activeDeadlineSeconds: 30, and ttlSecondsAfterFinished: 120. The container can be busybox running sh -c "date; echo backing up; sleep 5".
Add a podFailurePolicy that does FailJob on exit code 1 (simulate a non-retryable error) — then temporarily change the command to sh -c "exit 1" and confirm the Job fails immediately without exhausting backoffLimit.
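Separately, deploy a DaemonSet node-agent using the pause image that runs on worker nodes only and does a RollingUpdate with maxUnavailable: 1. kind does not label its workers with node-role.kubernetes.io/worker, so either add that label yourself (kubectl label node <node> node-role.kubernetes.io/worker=) and use nodeSelector: { node-role.kubernetes.io/worker: "" }, or apply a custom label of your own and select on it. Verify the Pod count equals the worker count.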
Roll the DaemonSet’s image forward with kubectl set image, watch kubectl rollout status, then kubectl rollout undo it.
Clean up with kind delete cluster.

Success criteria: the CronJob fires on schedule and keeps only the configured history; the failing variant fails fast via the failure policy; the DaemonSet lands exactly one Pod per worker node and rolls out/back cleanly.

Certification mapping

CKAD — Application Design and Build explicitly covers “understand Jobs and CronJobs.” Expect to write a Job with completions/parallelism, a CronJob with a correct schedule, and to reason about restartPolicy, backoffLimit, and activeDeadlineSeconds under time pressure. Knowing kubectl create job --from=cronjob/<name> to trigger a CronJob run on demand is a handy exam trick.
CKA — Workloads & Scheduling includes DaemonSets and understanding the different workload resources. You should deploy and update a DaemonSet, target nodes with selectors/tolerations, and explain how DaemonSets schedule onto tainted nodes.
Both exams reward fluency with the imperative generators: kubectl create job pi --image=perl -- perl -e ..., kubectl create cronjob hi --image=busybox --schedule="*/1 * * * *" -- echo hi, and kubectl create job --from=cronjob/<name> <manual-run>.

Glossary

Job — a controller that runs Pods until a target number complete successfully, then stops.
CronJob — a controller that creates Jobs on a repeating cron schedule.
DaemonSet — a controller that runs one Pod on every (matching) node, tracking node membership automatically.
Run-to-completion — a workload that is expected to finish and exit (vs run-forever services).
completions — the number of Pods that must succeed for a Job to be Complete.
parallelism — the maximum number of a Job’s Pods running at once.
completionMode — NonIndexed (interchangeable Pods) or Indexed (each Pod gets a unique index).
backoffLimit — how many failures a Job tolerates before being marked Failed.
activeDeadlineSeconds — a wall-clock time limit for a Job (or Pod); a hard stop overriding retries.
ttlSecondsAfterFinished — auto-delete delay for a finished Job and its Pods.
podFailurePolicy — rules to react to specific Pod failures (fail fast, ignore disruptions).
concurrencyPolicy — CronJob behaviour on overlap: Allow, Forbid, or Replace.
startingDeadlineSeconds — how late a missed CronJob run may still be started.
suspend — pause a Job (terminate its Pods) or a CronJob (stop creating Jobs) without deleting it.
updateStrategy — how a DaemonSet rolls out template changes: RollingUpdate or OnDelete.
maxUnavailable / maxSurge — DaemonSet rollout knobs for how many node Pods may be down / how many may briefly double up.
toleration — permission for a Pod to schedule onto a node carrying a matching taint (e.g. control-plane).
ownerReference — the link from a child object (Pod, Job) to its controller, enabling garbage collection.

Next steps

You now have the three batch and infrastructure workload controllers in your toolkit. The natural next topic is where Pods land and how to shape that placement deliberately — node and pod affinity, topology spread constraints, taints and tolerations (which you met here for DaemonSets), and priority-based preemption. Continue with Advanced Kubernetes Scheduling: Affinity, Topology Spread Constraints, Taints, and Priority-Based Preemption. If you skipped it, the prior lesson on RBAC & Service Accounts is what you’ll lean on to scope the ServiceAccounts your Jobs and DaemonSets run as.