DevOps Troubleshooting: Pipelines, Builds, Deployments, Runners & Artifacts

The engineer everyone pages when the pipeline is red is rarely the one who has memorised the most YAML keys. They are the one with a method. When the build that passed an hour ago suddenly fails, when a deploy halts at an approval that nobody can see, when a runner sits “idle” while jobs queue for twenty minutes, the strong engineer does not start blindly re-running the job and hoping — they run a short, fixed sequence that pins the failure to one stage (source, build, test, package/publish, or deploy) and one layer (the code/config, the pipeline definition, the runner/agent, the artifact registry, or the target environment), form a falsifiable hypothesis, prove it from logs, fix it with the smallest safe change, and verify with a clean run. Everyone else clicks Re-run jobs five times, then git commit --allow-empty -m "trigger CI", and burns an afternoon.

This lesson gives you that method and then turns it into playbooks: for each common failure you get a table of symptom → likely cause → the diagnostic that confirms it → the fix. We cover the failures you genuinely hit in production and that come up in interviews and on the AWS / Azure / GCP DevOps engineer exams — build failures (unpinned or unresolvable dependencies, a stale or missing cache, environment and toolchain drift, out-of-memory and disk-full, and flaky tests); pipeline and YAML errors (indentation and type-coercion traps, templating-expression mistakes, and the single biggest source of confusion — variable and secret scope); runner and agent problems (offline or unregistered, no capacity, missing permissions and labels, and the perennial Docker-in-Docker mess); artifact and registry issues (401/403 auth, pull-rate limits, and missing or immutable versions); and deployment failures (a stuck approval gate, OIDC/credential errors, ImagePullBackOff, and a rollout that never becomes healthy). We close with a dedicated flaky-pipeline playbook because intermittent failure is its own discipline. Examples are given for GitHub Actions, GitLab CI, Azure Pipelines and Jenkins, since the symptoms are universal even when the syntax differs.

Learning objectives

By the end of this lesson you can:

Apply a repeatable troubleshooting loop — observe, isolate the stage and layer, reproduce, hypothesise, fix with the smallest safe change, verify, prevent — to any CI/CD failure.
Diagnose and fix build failures: dependency resolution and lockfiles, cache hits/misses and cache poisoning, environment/toolchain drift (“works on my machine”), and resource exhaustion (OOM, disk-full).
Read pipeline/YAML errors correctly — indentation and the Norway/octal coercion traps, templating-expression mistakes, and especially variable and secret scoping across triggers, jobs, environments and forks.
Resolve runner/agent problems: offline/unregistered runners, capacity and concurrency limits, label/permission mismatches, and Docker-in-Docker vs the Docker socket.
Fix artifact and registry failures: 401 vs 403, registry pull-rate limits, missing/yanked versions, and immutable-tag pushes.
Triage deployment failures: approval/gate stalls, OIDC/credential and trust-policy errors, ImagePullBackOff/ErrImagePull, and a rollout stuck “progressing” — and run a flaky-test/flaky-pipeline quarantine-and-fix workflow.

Prerequisites & where this fits

You need a passing familiarity with a CI/CD system: a repository with a pipeline you can edit (GitHub Actions, GitLab CI, Azure Pipelines or Jenkins) and permission to view its logs and re-run jobs. The hands-on lab uses a free GitHub Actions workflow on a throwaway repository, so it costs nothing within the free tier. You should already have the core model from earlier in this course: the DevOps fundamentals (culture, CI/CD and the DevOps lifecycle); the YAML you need for pipelines (anchors, templates and the gotchas); how a pipeline is composed from stages, gates and artifacts; and the deployment strategies (rolling, blue/green, canary and rollback) whose failures we triage here. We re-explain each failure from first principles, so it is fine if some of that is still new to you. This is the Troubleshooting lesson of the DevOps Zero-to-Hero track, the bridge between building pipelines and operating them under pressure. The next lesson steps up to the DevOps architecting ladder, from a single pipeline to an internal developer platform.

The method: a loop, not a re-run

Almost every pipeline failure yields to the same loop. The discipline is to follow it in order rather than hammering the re-run button — which, for a non-flaky failure, just wastes minutes and tells you nothing new.

1. Observe: read the log to the failing line, not the summary. CI surfaces a red ✗ and an exit code, but the cause is usually a specific line a few screens up. Open the failing step, expand it, and scroll to the first error (a later error is often a consequence of the first). Note the exit code (1 = generic, 137 = OOM/SIGKILL, 143 = SIGTERM/cancelled, 124 = timeout, 127 = command-not-found). Resist re-running until you have read the actual failure.
2. Isolate the stage and the layer. Every failure lives in one stage and one layer. The stages are source (checkout, submodules, LFS), build (compile/deps), test, package/publish (artifact/registry), and deploy (the target environment); the layers are the code/config, the pipeline definition (YAML/expressions/scope), the runner/agent, the registry, and the target environment. Naming both eliminates most of the search space: a YAML parse error is the pipeline layer at parse time; ImagePullBackOff is the deploy stage at the registry/environment boundary; “no runner available” is the runner layer before any of your steps run.
3. Reproduce: shrink the loop. A 12-minute pipeline is a terrible debugger. Reproduce the failing step locally where possible (run the exact build/test command in the same container image), or shrink CI feedback: run only the failing job, add a debug step (env | sort, docker version, whoami, ls -la), or re-run with debug logging (ACTIONS_RUNNER_DEBUG/ACTIONS_STEP_DEBUG for GitHub, CI_DEBUG_TRACE: "true" for GitLab, System.Debug=true for Azure Pipelines). The goal is the shortest path from change to signal.
4. Form one hypothesis. State it as a falsifiable sentence: “the build fails because the cache restored a node_modules built for a different lockfile, so a native module is missing.” A vague hunch (“CI is broken”) cannot be tested; a specific claim can, and it points straight at the diagnostic (here: clear the cache and re-run).
5. Fix: the smallest, safest, reproducible change. Prefer pinning a version over loosening one, scoping a secret correctly over widening its scope, fixing a label over adding runs-on: self-hosted everywhere. Change one thing, and prefer a change that makes the pipeline more deterministic, not less. (Re-running until it passes is not a fix; it hides a flake. See the flaky playbook.)
6. Verify, then prevent. A genuine fix is a green run from a clean state (fresh cache where relevant), not “the re-run happened to pass”. Then ask the prevention question: what guardrail (a committed lockfile, a cache key that includes the lockfile hash, a pinned action/image digest, a required status check, a timeout, a concurrency group) stops this recurring?
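
The "read the first error, not the last" habit from the Observe step can be sketched as a small log scanner. This is a minimal illustration, not a real CI feature: the `ERROR_PATTERNS` list and the `first_error_line` helper are hypothetical, and real tools emit many other error markers that you would need to add.

```python
import re

# Hypothetical markers; extend this for the tools in your own pipeline.
ERROR_PATTERNS = re.compile(r"(error|fatal|exception|traceback|npm ERR!)", re.IGNORECASE)


def first_error_line(log_text):
    """Return (line_number, line) for the first line that looks like an error, else None."""
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERNS.search(line):
            return number, line.strip()
    return None


sample_log = """\
Installing dependencies
npm ERR! code ERESOLVE
npm ERR! could not resolve
Error: Process completed with exit code 1.
"""

# The last line is the summary; the cause is two lines earlier.
print(first_error_line(sample_log))  # (2, 'npm ERR! code ERESOLVE')
```

Scanning from the top rather than the bottom is the whole point: the final "exit code 1" line is a consequence, while the earlier line names the cause.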

[Figure: DevOps troubleshooting decision tree]

The decision tree above encodes step 2: start from the symptom at the top, branch on “which stage failed — build, pipeline parse, runner, artifact/registry, or deploy?”, and each leaf points you at the playbook below. Under pressure, walking the tree keeps you honest about which layer you are in before you change the pipeline.
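
As a rough text stand-in for the tree, the routing from symptom to stage, layer and playbook can be sketched as a lookup. The symptom fragments and the `route` helper below are hypothetical and cover only a handful of cases; the stage and layer labels follow the examples given in step 2.

```python
# Hypothetical symptom fragments mapped to (stage, layer, playbook).
# The mapping mirrors the examples in the text and is far from exhaustive.
ROUTES = [
    ("mapping values are not allowed", ("parse time", "pipeline definition", "Pipeline & YAML errors")),
    ("no runner available", ("before any step", "runner/agent", "Runner & agent problems")),
    ("imagepullbackoff", ("deploy", "registry/environment boundary", "Deployment failures")),
    ("toomanyrequests", ("package/publish", "registry", "Artifact & registry issues")),
    ("could not resolve", ("build", "code/config", "Build failures")),
]


def route(symptom):
    """Return (stage, layer, playbook) for the first matching symptom fragment."""
    text = symptom.lower()
    for fragment, destination in ROUTES:
        if fragment in text:
            return destination
    return ("unknown", "unknown", "read the first error line, then isolate by hand")


print(route("Back-off pulling image: ImagePullBackOff"))
```

The fallback branch matters as much as the matches: when nothing fits, the answer is to go back to the log, not to guess a playbook.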

The signals that answer everything

You can diagnose the large majority of failures from a handful of signals. Know precisely what each one tells you — this is what stops the flailing.

Signal	Question it answers	Where to look
The first error line in the failing step	What actually broke (vs downstream noise)?	Expand the failing step, scroll to the top of the red output
The exit code	What kind of failure? (`137` OOM, `143`/`130` cancelled, `124` timeout, `127` not-found, `126` not-executable)	The step’s “exited with code N” line
Job vs step vs setup boundary	Did your code fail, or the runner/checkout/auth before it?	Failure in a setup/auth step → infra; in your script → code/config
`env` dump (`env | sort`)	Which variables/secrets are actually present in this context?	A temporary debug step at the top of the failing job
Debug/verbose logs (`ACTIONS_STEP_DEBUG`, `CI_DEBUG_TRACE`, `System.Debug`)	What is the runner doing internally (expressions, masking, network)?	Re-run with debug; read the expanded trace
Runner/agent status	Is there a runner online, idle, with the right labels?	Settings → Actions/CI runners; the agent pool view
Timing (“works in the morning, fails at 5pm”)	Resource contention, rate limits, or a time-of-day dependency?	Compare timestamps across runs; correlate with load

A few high-leverage habits: always read the first error, not the last; check the exit code before the message (it often names the class of failure instantly); when a value is “missing”, dump the environment rather than guessing whether the secret reached the job; reach for debug logging the moment the surface log is insufficient; and remember that a failure in a setup/checkout/login step is an infrastructure problem, while a failure in your script is a code/config problem — the boundary tells you which playbook to open.
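
The "check the exit code before the message" habit can be sketched as a small classifier. The table reuses the codes listed above; the `classify_exit_code` helper is hypothetical, and the rule that codes above 128 mean "128 plus a signal number" is the usual shell convention rather than something every tool follows.

```python
import signal

# Exit codes named in the signals table; wording of the hints is illustrative.
EXIT_CODE_CLASSES = {
    1: "generic failure: read the first error line",
    124: "timeout: the command ran past its time limit",
    126: "found but not executable: check the file mode",
    127: "command not found: check PATH and the toolchain",
    130: "SIGINT: interrupted or cancelled",
    137: "SIGKILL: often OOM, check memory",
    143: "SIGTERM: cancelled or stopped by the runner",
}


def classify_exit_code(code):
    """Map a step's exit code to a failure class."""
    if code == 0:
        return "success"
    if code in EXIT_CODE_CLASSES:
        return EXIT_CODE_CLASSES[code]
    if 128 < code <= 128 + 64:
        # Shell convention: 128 + N means the process died from signal N.
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "signal %d" % signum
        return "terminated by " + name
    return "tool-specific failure: read the tool's documentation"


for code in (137, 127, 2):
    print(code, "->", classify_exit_code(code))
```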

Build failures: dependencies, cache, environment & resources

The build stage fails for four recurring reasons: it can’t resolve dependencies, the cache lied to it, the environment differs from where the code was written, or it ran out of a resource. Name which one before you touch the YAML — the fixes are completely different.

Symptom	Likely cause	Diagnostic	Fix
`npm ERR! could not resolve` / `Could not find a version that satisfies` / `404 Not Found` for a package	An unpinned or yanked dependency, a private registry not configured, or a transient registry outage	Read the exact package + version; check whether it’s private; retry to rule out a blip	Commit the lockfile and install with `npm ci` / `pip install -r … --require-hashes` / `mvn -o` where possible; configure the private registry/auth; pin the version
Build passes locally, fails in CI with a missing native module / wrong binary	The cache restored artifacts built against a different lockfile, OS, or arch (cache key too loose)	Compare the cache key to the lockfile; check the runner OS/arch vs local	Make the cache key include a hash of the lockfile (`hashFiles('**/package-lock.json')`) and the OS/arch; bust the cache once to recover
First build of the day is slow / re-downloads everything; “cache not found”	Cache expired (e.g. GitHub evicts after ~7 days unused or at the repo cache size limit) or the key changed	Look for “Cache not found for input keys” in the log	Expected behaviour — add a `restore-keys` fallback prefix so a partial cache still helps; don’t treat a cold cache as a bug
`Error: ENOSPC: no space left on device` / `no space left` during build	Disk full — a large image, build artifacts, or accumulated layers on a self-hosted runner	`df -h` as a debug step; check artifact/log sizes	On hosted runners, free space (`docker system prune`, remove unused toolchains) or use a larger runner; on self-hosted, add a cleanup step / bigger disk
Job killed with exit code 137 / “Killed” / OOMKilled	The build process exceeded the runner’s memory (a big compile, a memory-hungry test, a JVM with no heap cap)	Exit code 137 = SIGKILL (OOM); check memory use; look for the killed process	Reduce memory (cap JVM `-Xmx`, lower parallelism), or move to a larger runner; for containers raise the memory limit
`command not found` / wrong language version / “works on my machine”	Toolchain drift — the runner has a different Node/Python/JDK/Go version than your laptop	Print the version in CI (`node -v`, `python --version`); compare to local/`.tool-versions`	Pin the toolchain explicitly (`actions/setup-node@… with node-version`, a version file, or a container image that fixes every tool); don’t rely on the runner default
Intermittent network errors pulling deps (`ETIMEDOUT`, `EAI_AGAIN`, TLS errors)	Transient registry/network blip, proxy/firewall on self-hosted, or DNS	Re-run once; check whether it’s a self-hosted egress/proxy issue	Add bounded retries to the install step; fix proxy/DNS on self-hosted runners; mirror critical registries
`permission denied` running a script	The script isn’t executable, or line endings (CRLF) broke the shebang	`ls -la` the script; `file script.sh` (CRLF shows as “with CRLF line terminators”)	`chmod +x` (and commit the bit), enforce LF via `.gitattributes` (`*.sh text eol=lf`)

Two reflexes to burn in. First, make the build reproducible before you make it fast: a committed lockfile plus an install-from-lockfile command (npm ci, not npm install) plus a pinned toolchain (ideally a container image) eliminates the entire “works on my machine” class — most CI-only build failures are an environment difference, not a code bug. Second, treat the cache as a suspect, not a given: a cache key that doesn’t include the lockfile hash will happily restore stale, mismatched artifacts and produce baffling failures, so when a build fails in a way that “makes no sense”, bust the cache and re-run as your first cheap experiment — and then fix the key so it can’t poison again.
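
To make the cache-key advice concrete, here is a minimal sketch of a key that changes whenever the lockfile, OS or architecture changes. The `cache_key` helper, the `deps` prefix and the sample lockfile contents are hypothetical; on a real platform you would use its built-in hashing (for example `hashFiles` on GitHub Actions) rather than your own script.

```python
import hashlib
import platform


def cache_key(lockfile_bytes, os_name, arch, prefix="deps"):
    """Build a cache key from the OS, the architecture and a hash of the lockfile."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return "%s-%s-%s-%s" % (prefix, os_name, arch, digest)


# Two hypothetical lockfiles that differ by a single dependency version.
lock_v1 = b'{"packages": {"left-pad": "1.3.0"}}'
lock_v2 = b'{"packages": {"left-pad": "1.3.1"}}'

key_v1 = cache_key(lock_v1, platform.system(), platform.machine())
key_v2 = cache_key(lock_v2, platform.system(), platform.machine())

print(key_v1)
print(key_v2)
print("same key?", key_v1 == key_v2)  # a changed lockfile must not reuse the old cache
```

Because the digest is part of the key, a changed lockfile produces a cache miss instead of restoring mismatched artifacts, which is the poisoning case described above.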

Pipeline & YAML errors: syntax, expressions & scope

These are pipeline-layer failures — the system rejects or mis-runs the workflow because of the YAML, an expression, or (most commonly) where a variable or secret is visible. They fail differently from build errors: often the job never starts, or a value is mysteriously empty.

Symptom	Likely cause	Diagnostic	Fix
“Invalid workflow file” / “mapping values are not allowed here” / YAML parse error	Indentation error, a tab character, or a missing/extra colon — the file isn’t valid YAML	Run `yamllint`; check for tabs (YAML forbids them); validate against the schema	Fix the indentation (spaces only); use an editor/linter that flags tabs and schema errors before push
A string value comes through as `true`/`false`/`null`/a number unexpectedly (e.g. a version `1.20` becomes `1.2`)	YAML type coercion — the Norway problem (`no`→false), octal/number coercion, unquoted versions	Reproduce in `yamllint`/a YAML parser; echo the value	Quote ambiguous scalars: `"true"`, `"no"`, `"1.20"`, `"08"`; never leave version strings or country codes unquoted
An expression renders literally (e.g. you see `${{ ... }}` in output) or evaluates wrong	Wrong expression context/syntax — using `${{ }}` where it isn’t evaluated, or mixing runtime vs compile-time evaluation	Print the raw vs evaluated value; check the platform’s expression docs for that position	Use the correct syntax for the position (GitHub `${{ }}`; GitLab `$VAR`/`rules`; Azure `$( )` runtime vs `${{ }}` compile-time); don’t expect expressions in places that take plain strings
A `secret`/variable is empty in the job (auth fails, value blank)	The value isn’t in scope: wrong environment, not passed to a reusable workflow, or a secret on a different scope (org vs repo vs environment)	Add `env | sort` (a secret mapped into the job’s environment shows masked but present); check which environment/scope the job uses	Define/grant the secret at the right scope; pass `secrets:`/`with:` explicitly into reusable/called workflows; select the correct `environment:`
Secrets are empty in a fork / pull_request from a fork	By design, secrets are not exposed to workflows triggered by a fork’s PR (a security boundary)	The job is triggered by `pull_request` from a fork; secrets are blank	Don’t rely on secrets in fork PRs; use `pull_request_target` carefully (it runs with secrets but checks out base — never run untrusted code with it), or split trusted post-merge steps
A masked secret appears as `***` in logs but downstream parsing breaks	Secret masking redacted a substring that also appears in normal output	Look for `***` where a real value should be	Don’t print secrets; if a benign value collides with the mask, change it; pass secrets via files/env, not command-line echo
`set -e`/pipefail not active, so a failing command doesn’t fail the job	The shell didn’t exit on error (no `set -euo pipefail`), or an error is hidden inside a pipe	The step is green though a command clearly failed	Add `set -euo pipefail` at the top of multi-line `run` blocks; check exit codes explicitly for piped commands
`if:` condition runs the step when it shouldn’t (or skips it)	Expression evaluates a string as truthy, or wrong context (`success()`/`always()`/`failure()` misuse)	Print the condition’s operands; read the truthiness rules (non-empty string is true)	Use explicit comparisons (`if: github.ref == 'refs/heads/main'`); use the status functions deliberately (`if: always()` to run cleanup)
Matrix job explodes into too many combinations or fails only on one leg	The matrix product is larger than intended, or one OS/version combination is genuinely broken	Read the matrix expansion; identify the failing leg	Trim with `exclude:`/`include:`; mark known-bad legs `continue-on-error` while you fix them; don’t let one leg block the rest unintentionally

The single most valuable idea here is variable and secret scope. A pipeline value is not “global”: it lives at a level (organisation, repository, environment, job, step) and is visible only where that level reaches, and secrets are deliberately withheld from fork PRs as a security boundary. So when a value is “missing”, the question is never “did I set it?” but “is it in scope here?”, and the fastest answer is a one-line env | sort debug step. A secret that is mapped into the job’s environment prints masked but present, which instantly distinguishes “not in scope” from “wrong value”. The second idea is that YAML coerces types: an unquoted no, 1.20, or 08 is not the string you think. Quote ambiguous scalars by default, and run yamllint before you push so parse errors never cost you a pipeline run.
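
The "quote ambiguous scalars" rule can be approximated with a tiny check. This is a simplified heuristic under stated assumptions, not a YAML parser: the `needs_quotes` helper and its word list are hypothetical, and the exact set of coerced values differs between YAML 1.1 and 1.2 parsers.

```python
import re

# Words that at least some YAML parsers read as booleans or null (assumed list).
BOOL_OR_NULL = {"y", "n", "yes", "no", "on", "off", "true", "false", "null", "~"}

# Anything that looks like a plain number, including 1.20 and 08.
NUMERIC = re.compile(r"[-+]?\d+(\.\d*)?([eE][-+]?\d+)?")


def needs_quotes(scalar):
    """Return True if an unquoted scalar may be read as something other than a string."""
    text = scalar.strip()
    if text.lower() in BOOL_OR_NULL:
        return True
    return NUMERIC.fullmatch(text) is not None


for value in ("no", "1.20", "08", "1.2.3", "main"):
    print(repr(value), "quote it" if needs_quotes(value) else "safe unquoted")
```

Note that `1.2.3` is safe because it cannot be a number, while `1.20` is not: that asymmetry is why a two-part version silently loses its trailing zero.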

Runner & agent problems: offline, capacity, permissions & Docker

Runner problems are insidious because they fail before your steps run — the job sits queued, or dies in setup — so engineers waste time debugging code that never executed. The four buckets: there is no runner (offline/unregistered), there is no capacity (all busy / concurrency capped), the runner lacks permission or the right labels, or the build needs Docker and the runner can’t provide it.

Symptom	Likely cause	Diagnostic	Fix
Job stuck “Queued”/“Waiting for a runner” indefinitely	No runner with the requested labels is online; self-hosted runner offline/unregistered; hosted-minutes exhausted	Check the runners list (online? labels?); check billing/minutes; read `runs-on`/`pool`/`tags`	Bring a runner online / re-register it; fix the label so it matches an available runner; top up minutes or switch to self-hosted runners
Jobs queue while runners show “idle”	Label/tag mismatch — runners are online but none matches the job’s required labels	Compare the job’s `runs-on`/`tags`/`demands` to the runners’ labels	Align the labels (add the required label to a runner, or fix the job’s requested label); for GitLab check `tags` vs runner tags
Only N jobs run at once; the rest wait	Concurrency/parallelism limit — runner count, org concurrency, or `concurrency`/parallel caps	Count online runners; check org/plan concurrency and any `concurrency:` group	Add runners (or autoscale them), raise the plan’s concurrency, or accept/serialise with an explicit `concurrency` group
Self-hosted runner goes offline repeatedly	The runner service crashed, host rebooted, token expired, or the host ran out of disk/memory	Check the runner service status and logs on the host; `df -h`, `free -m`	Run the runner as a resilient service (auto-restart), monitor host resources, rotate registration tokens; prefer ephemeral autoscaled runners
Setup step fails: `permission denied` / can’t write / can’t reach a resource	The runner’s OS user or identity lacks permission (filesystem, Docker, cloud, network)	`whoami`, `id`, `ls -la` the path; test the network/cloud call in a debug step	Grant the runner user the needed permission (group membership, file mode, IAM role, firewall egress); least privilege, not `root` everywhere
`Cannot connect to the Docker daemon at unix:///var/run/docker.sock`	The build needs Docker but the runner has no Docker (or the user isn’t in the `docker` group)	`docker version` in a debug step; check the daemon and group membership	Install/enable Docker on the runner and add the user to the `docker` group, or mount the host Docker socket, or use Docker-in-Docker (see below)
Docker-in-Docker build fails / can’t reach the daemon / TLS errors (GitLab `docker:dind`)	DinD misconfigured — the `dind` service, `DOCKER_HOST`/TLS, or privileged mode is wrong	Read the `services:`/`DOCKER_*` config; check the runner is `privileged` if DinD requires it	Configure DinD correctly (`docker:dind` service + `DOCKER_HOST`/`DOCKER_TLS_CERTDIR`), or avoid it: use the host socket or a daemonless builder (Kaniko, BuildKit/buildx, Buildah)
State/files from a previous job appear in a new one (or are unexpectedly missing)	A non-ephemeral self-hosted runner reuses its workspace; or you expected persistence across ephemeral runners	Inspect the workspace; check whether the runner is reused or fresh	Use ephemeral runners for isolation; clean the workspace between jobs; pass state via artifacts/cache, never the runner’s local disk
`The job running on runner X has exceeded the maximum execution time` (timeout)	The job genuinely hung (waiting on input, a deadlock, a slow network) or the timeout is too low	Read where it stalled; check for an interactive prompt or an external wait	Fix the hang (non-interactive flags, timeouts on external calls); set a sensible `timeout-minutes`; never leave a job able to run forever

Two reflexes here. First, “queued forever” and “idle runners” are almost always a label/capacity problem, not a code problem — before you read a single line of your build script, check that a runner is online and that its labels match what the job requested; a job that never started can’t have a code bug. Second, prefer ephemeral, daemonless container builds: long-lived self-hosted runners accumulate disk cruft, leak state between jobs, and turn Docker-in-Docker into a recurring support ticket — autoscaled ephemeral runners (e.g. GitHub’s Actions Runner Controller, or Azure DevOps scale-set agents) plus a daemonless builder (Kaniko/BuildKit/Buildah) removes most of this class of failure at the root.
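
The first reflex, checking runners and labels before reading any build script, can be sketched as a short decision function. The `Runner` record and `diagnose_queue` helper are hypothetical; a real check would read the runner list from your CI platform's settings page or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    name: str
    labels: frozenset
    online: bool
    busy: bool


def diagnose_queue(required_labels, runners):
    """Explain why a job with the given required labels is (or is not) queued."""
    online = [r for r in runners if r.online]
    if not online:
        return "no runner online: bring one up or re-register it"
    matching = [r for r in online if set(required_labels) <= r.labels]
    if not matching:
        return "label mismatch: online runners exist but none has every required label"
    if all(r.busy for r in matching):
        return "no capacity: every matching runner is busy"
    return "a matching runner is free: look elsewhere"


runners = [
    Runner("build-1", frozenset({"self-hosted", "linux", "x64"}), online=True, busy=False),
    Runner("build-2", frozenset({"self-hosted", "linux", "arm64"}), online=False, busy=False),
]

# The job asks for a label that no online runner carries.
print(diagnose_queue({"self-hosted", "linux", "gpu"}, runners))
```

A job needs every label it requests, so the check is a subset test: one missing label is enough to leave runners idle while jobs queue.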

Artifact & registry issues: auth, rate limits & versions

The package/publish boundary fails for three reasons: the runner can’t authenticate to the registry, it got rate-limited, or the version/tag it wants is absent, already present, or immutable. The 401 vs 403 distinction is the one to internalise.

Symptom	Likely cause	Diagnostic	Fix
`401 Unauthorized` pulling/pushing an image or package	Not authenticated — no/expired token, not logged in, wrong registry URL	Check the login step ran; inspect the token/credential; confirm the registry host	Authenticate before the pull/push (`docker login`, registry token, cloud `… get-login-password`); refresh expired tokens; verify the registry URL
`403 Forbidden` although the login succeeded	Authenticated but the identity lacks permission (push to a protected repo, read a private package, wrong scope)	Confirm who the token is and what scope it has; check the registry’s RBAC	Grant the principal the needed permission/scope (push/pull on that repo); least privilege. (This is authorisation, not authentication)
`toomanyrequests` / `429` / “pull rate limit exceeded” (often Docker Hub)	Pull-rate limit for anonymous/free pulls, or an aggressive matrix hammering the registry	The error is explicit; correlate with many parallel jobs pulling the same image	Authenticate pulls (raises the limit), cache base images, mirror via a pull-through cache, or pin a digest and pull once; reduce redundant pulls in a matrix
`manifest unknown` / `not found` / `version X does not exist`	The tag/version was never published, was deleted/yanked, or you’re pulling from the wrong repo/registry	List the available tags/versions; check the publish job actually ran and succeeded	Publish the missing version (ensure the publish stage ran on the right trigger); fix the tag/coordinate; pull from the correct registry
`tag already exists` / `cannot overwrite` / `409 Conflict` on push	The registry has immutable tags, or you’re republishing the same version	Check whether immutability is enabled; compare the version to what’s published	Bump the version (don’t reuse a released version); push a new tag/digest; if a retag is truly needed, use a mutable tag policy deliberately
`unauthorized: authentication required` only in a fork PR	Secrets/registry creds aren’t exposed to fork PRs (security boundary) — the pull/push has no credentials	The job is a fork `pull_request`; the login step has no secret	Don’t push from fork PRs; gate publish on trusted post-merge events; for public base images use anonymous + caching
`denied: requested access to the resource is denied`	Wrong repository path/namespace, or the identity has no rights to that namespace	Compare the image reference to the registry namespace; check RBAC	Use the correct `<registry>/<namespace>/<repo>:<tag>`; grant the identity access to that namespace
Checksum/signature verification fails on pull (`cosign`/Notation, or `--require-hashes`)	The artifact was tampered with, re-published, or signed by a key you don’t trust	Verify the digest/signature against the expected key/policy	Re-pull by digest; verify the signature with the right key; if it genuinely changed, investigate the supply chain before trusting it

The two reflexes. First, 401 means “I don’t know who you are”, 403 means “I know you, but you can’t do that” — 401 sends you to the login/credential step (did it run? is the token expired? right registry?), while 403 sends you to permissions/RBAC (does this identity have push/pull on this repo?); confusing them wastes the most time at this boundary. Second, stop pulling base images anonymously at scale — Docker Hub’s anonymous pull-rate limit will eventually 429 a busy pipeline, so authenticate pulls, cache base layers, and pin by digest so you pull a fixed image once; running your own Harbor or a cloud artifact registry with a pull-through cache removes the dependency on a shared public quota entirely.
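
The 401 versus 403 split, together with the other status codes in the table above, can be written down as a lookup from response status to the next thing to check. The `triage` helper and the wording of each hint are illustrative assumptions; registries also return explanatory text in the response body that this sketch ignores.

```python
# Status codes drawn from the registry table; hints paraphrase its diagnostics.
REGISTRY_TRIAGE = {
    401: ("authentication", "did the login step run, is the token expired, is the registry URL right?"),
    403: ("authorisation", "does this identity have push/pull permission on this repository?"),
    404: ("missing version", "was the tag or version ever published, and is this the right registry?"),
    409: ("immutable or duplicate tag", "bump the version instead of republishing it"),
    429: ("rate limit", "authenticate pulls, cache base images, or use a pull-through cache"),
}


def triage(status):
    """Return (failure_class, next_question) for a registry HTTP status code."""
    return REGISTRY_TRIAGE.get(
        status, ("unknown", "read the response body and the registry documentation")
    )


for status in (401, 403):
    failure_class, question = triage(status)
    print(status, failure_class, "->", question)
```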

Deployment failures: gates, OIDC, image pull & stuck rollouts

The deploy stage spans the boundary between your pipeline and the target environment, so it fails in the richest variety of ways: the pipeline waits on a gate that never clears, it can’t authenticate to the cloud (now usually OIDC), the cluster can’t pull the image, or the rollout never becomes healthy.

Symptom	Likely cause	Diagnostic	Fix
Pipeline stuck “waiting” at a deploy stage; nothing happens	A manual approval / environment protection gate is pending and no eligible reviewer has approved	Check the environment’s protection rules / approvals; is a reviewer assigned and notified?	Approve as an eligible reviewer (or add one); for unattended deploys use an automated gate; ensure approvers are notified
Deploy fails: `Error assuming role with OIDC` / `Not authorized to perform sts:AssumeRoleWithWebIdentity` / `AADSTS70021`	OIDC trust misconfigured — the role/credential’s trust policy doesn’t match the workflow’s subject (`sub`)/audience, or `id-token: write` is missing	Read the federated identity’s trust conditions vs the token claims; check `permissions: id-token: write`	Fix the trust policy’s `sub`/`aud` to match the repo/branch/environment; grant `id-token: write`; verify the audience. (See keyless OIDC deploys)
`401`/`403`/`could not find default credentials` deploying to a cloud	No credentials in the job, expired short-lived token, wrong region/subscription/project, or missing IAM permission	`aws sts get-caller-identity` / `az account show` / `gcloud auth list` as a debug step	Wire OIDC or inject creds; select the right region/subscription/project; grant the deploy identity the minimum permission
Kubernetes pod `ImagePullBackOff` / `ErrImagePull` after deploy	The cluster can’t pull the image — wrong tag, private registry with no imagePullSecret, or the registry is unreachable	`kubectl describe pod` (the Events show the exact pull error); check the image reference	Fix the image tag/path; add/repair the `imagePullSecret` (or workload-identity registry access); ensure the cluster can reach the registry
Pod `CrashLoopBackOff` right after deploy	The app starts and crashes — bad config/secret, missing env var, failed migration, wrong command	`kubectl logs <pod> --previous`; `kubectl describe pod`; check config/secrets	Fix the app config/secret/migration; roll back to the last good revision while you fix forward
Rollout stuck “progressing”; never goes healthy; eventually times out	Readiness/health probe failing, insufficient resources to schedule, or a quota/`PodDisruptionBudget` blocking	`kubectl rollout status`; `kubectl get pods`/`describe`; check probes, requests, quotas, events	Fix the probe/threshold; provide resources/quota; for a bad release, roll back (`kubectl rollout undo` / redeploy previous)
Canary/blue-green analysis gate aborts the rollout	The progressive-delivery controller (Argo Rollouts/Flagger) saw metrics breach the threshold, or the metric query is wrong	Read the rollout/analysis status; check the metric query and thresholds	If the app is genuinely bad, the abort is correct — fix and re-release; if the query/threshold is wrong, fix the analysis template, not the gate
Deploy “succeeds” but traffic still hits the old version	Traffic not actually shifted — service/ingress/slot swap didn’t happen, or DNS/LB cached	Check the router/ingress/slot config and what it points at; verify endpoints	Complete the traffic switch (slot swap, service selector, weight shift); account for DNS TTL/LB warm-up before declaring success
Database migration fails mid-deploy and the app won’t start	A migration that isn’t backward-compatible ran against a schema the old/new pods can’t both use	Read the migration error; check the order (migrate vs deploy) and compatibility	Use expand/contract (backward-compatible) migrations; separate schema change from code change; never run a destructive migration in the same step as the cutover
Deploy works in staging, fails in prod with “resource not found”/permission	Environment-specific config/identity differs — different account, secret name, namespace, or role	Diff the per-environment variables/secrets/identity; check the target	Make environments parameterised and consistent; ensure the prod identity has the same (scoped) permissions; promote the same artifact, only config differs

The single most valuable idea is that kubectl describe and --previous logs answer almost every Kubernetes deploy failure: ImagePullBackOff is a registry/auth problem the pod Events name exactly, while CrashLoopBackOff is an application problem the previous container’s logs explain. Reading those two before changing anything tells you whether to fix the pipeline’s registry auth or the app’s config. The second idea is that OIDC failures are trust-policy mismatches: the cloud rejected the federated token because its sub/aud claims don’t match what the role/credential trusts (wrong branch, environment, or repo), or because the workflow lacks the id-token: write permission and so never obtained a token. So read the trust condition against the token’s claims rather than reaching for long-lived keys, which reintroduce the very secret-sprawl OIDC removes.
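
Reading a trust condition against the token's claims can be sketched as a comparison of `sub` and `aud`. Everything here is an illustration under assumptions: the organisation and repository names are made up, the `explain_oidc_rejection` helper is hypothetical, and `fnmatchcase` only approximates the wildcard matching a cloud provider applies to its trust conditions.

```python
from fnmatch import fnmatchcase


def explain_oidc_rejection(claims, trusted_sub_pattern, trusted_aud):
    """List the reasons a federated token would not satisfy a trust condition."""
    problems = []
    if claims.get("aud") != trusted_aud:
        problems.append("aud %r does not equal trusted %r" % (claims.get("aud"), trusted_aud))
    if not fnmatchcase(claims.get("sub", ""), trusted_sub_pattern):
        problems.append("sub %r does not match %r" % (claims.get("sub"), trusted_sub_pattern))
    return problems


# Hypothetical token issued to a workflow running on a feature branch.
token_claims = {
    "sub": "repo:example-org/example-repo:ref:refs/heads/feature-x",
    "aud": "sts.amazonaws.com",
}

# The role trusts only the main branch, so the branch is the mismatch.
for problem in explain_oidc_rejection(
    token_claims,
    trusted_sub_pattern="repo:example-org/example-repo:ref:refs/heads/main",
    trusted_aud="sts.amazonaws.com",
):
    print(problem)
```

Printing both sides of each comparison is the useful part: the mismatch is usually one segment of the subject string, such as a branch or environment name.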

The flaky-pipeline playbook

A flaky pipeline passes and fails without any change to the code — the most corrosive failure mode, because it trains the team to ignore red and to re-run on faith, which lets real failures slip through. Treat flakiness as a first-class bug with its own workflow, not as background noise.

First, prove it’s flaky, not broken. Re-run the same commit with no changes. If it sometimes passes and sometimes fails, it’s flaky; if it always fails, it’s a real bug — go back to the playbooks above. Flakiness is, by definition, non-determinism.
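The same proof can be run locally before spending CI minutes. The following is a minimal sketch, assuming bash; `flake_rate` is a hypothetical helper name, not a feature of any CI platform. It runs one command repeatedly with nothing changed in between and reports whether the outcomes were all passes, all failures, or mixed.

```shell
#!/usr/bin/env bash
# flake_rate RUNS CMD [ARGS...]
# Runs CMD RUNS times with no changes in between and prints
# "pass=<n> fail=<m> verdict=<stable|broken|flaky>".
# Hypothetical helper for illustration only.
flake_rate() {
  local runs=$1 pass=0 fail=0 i=0 verdict
  shift
  while [ "$i" -lt "$runs" ]; do
    if "$@" >/dev/null 2>&1; then
      pass=$((pass + 1))
    else
      fail=$((fail + 1))
    fi
    i=$((i + 1))
  done
  if [ "$fail" -eq 0 ]; then
    verdict=stable        # never failed in this sample
  elif [ "$pass" -eq 0 ]; then
    verdict=broken        # deterministic failure: a real bug, not a flake
  else
    verdict=flaky         # same input, different outcomes
  fi
  echo "pass=${pass} fail=${fail} verdict=${verdict}"
}
```

For example, `flake_rate 20 ./run-tests.sh` (where `run-tests.sh` stands in for your own test entry point) gives a first pass/fail count. A "stable" verdict from a small sample is weak evidence, so raise the run count for rare flakes.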

Then find the source of non-determinism. Flaky failures nearly always trace to one of a small set of causes:

Flake source	Tell-tale sign	Fix
Test ordering / shared state	Fails only in a certain order or in parallel; passes when run alone	Isolate tests (fresh fixtures per test); remove shared mutable state; randomise order in CI to surface it, then fix
Timing / race conditions	“Element not found” or “connection refused” errors that appear under load	Replace fixed `sleep` calls with explicit waits/polling for the actual condition; add retries only around genuinely async waits
External dependency (network, third-party API, registry)	Fails on `ETIMEDOUT`/`429`/5xx intermittently	Mock/stub external services in tests; add bounded retries for unavoidable network calls; mirror flaky registries
Resource contention (CPU/memory/disk on the runner)	Fails under parallelism or “at busy times”; exit 137 sometimes	Reduce parallelism or use larger/ephemeral runners; cap memory; isolate heavy jobs
Time/date/timezone & randomness	Fails near midnight, month-end, or 1-in-N	Inject a fixed clock/seed; never assert on real `now()`/random without control
Non-deterministic ordering (maps, sets, concurrency)	Assertions on order that isn’t guaranteed	Sort before asserting; assert on sets, not sequences, where order is irrelevant
Port/resource collisions	“address already in use” intermittently	Allocate ephemeral ports; ensure teardown; use unique names per job
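The "wait on the actual condition" fix from the timing row can be sketched as a small polling helper. This is an illustrative sketch assuming bash; `wait_for` is a hypothetical name. It polls a command until it succeeds or a deadline passes, so the job waits only as long as it needs to and fails loudly when the condition never becomes true.

```shell
#!/usr/bin/env bash
# wait_for TIMEOUT_SECONDS CMD [ARGS...]
# Polls CMD once per second until it succeeds or TIMEOUT_SECONDS have elapsed.
# Returns 0 as soon as CMD succeeds, 1 on timeout. Hypothetical helper name.
wait_for() {
  local timeout=$1 deadline
  shift
  deadline=$((SECONDS + timeout))     # SECONDS is bash's elapsed-time counter
  until "$@" >/dev/null 2>&1; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep 1
  done
}
```

A typical use would be `wait_for 30 curl -fsS http://localhost:8080/healthz` before running integration tests; the URL here is an assumed example, not something defined in this lesson. Unlike a fixed `sleep 30`, this returns as soon as the service answers and turns "never came up" into an explicit failure.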

Then quarantine, fix, and re-admit — in that order. The workflow that actually clears flakiness:

Detect & measure. Track per-test/per-job pass rate over many runs (many CI platforms and test frameworks flag flaky tests automatically). You cannot fix what you don’t measure, and “it feels flaky” isn’t a metric.
Quarantine the worst offenders. Move proven-flaky tests to a non-blocking lane (a separate job, continue-on-error, or a quarantine tag) so they stop blocking everyone — but keep running them so they’re still visible. Quarantine is a holding pen, not a delete.
Fix the root cause, not the symptom. Resist the cheap “wrap it in a retry” fix for anything that isn’t a genuine async wait — a blanket retry hides real regressions. Fix the determinism (isolate state, wait on conditions, mock externals, control the clock/seed).
Re-admit with evidence. Only move a test back to the blocking lane once it has passed N consecutive runs. Then add the guardrail: a flaky-test detector in CI, a policy that new tests must be deterministic, and a budget that caps allowed flake rate.

The meta-rule: a retry is a painkiller, not a cure. Bounded retries are legitimate for genuinely unavoidable non-determinism (a network call, an eventually-consistent cloud API), but reaching for retries: 3 to silence a flaky unit test just buries a real bug and erodes trust in the pipeline. Every retry you add should come with a written reason for why the underlying operation is legitimately non-deterministic.
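For the cases where a retry is legitimate, a bounded retry with backoff might look like the sketch below. It assumes bash; `retry` is a hypothetical helper name. The attempt count is capped, the delay doubles each time, and the final failure keeps the command's real exit code so nothing is hidden.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS BASE_DELAY_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds or MAX_ATTEMPTS is reached, doubling the delay
# after each failure. On final failure, returns CMD's last exit code.
# Hypothetical helper name, for illustration only.
retry() {
  local max=$1 delay=$2 attempt=1 rc
  shift 2
  while true; do
    "$@" && return 0
    rc=$?                                   # exit code of the failed attempt
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts (exit ${rc}): $*" >&2
      return "$rc"
    fi
    echo "attempt ${attempt}/${max} failed (exit ${rc}); retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

At the call site, keep the written reason next to the retry, for example a comment such as "registry pulls are network calls and can fail transiently" above `retry 3 2 docker pull "$IMAGE"` (with `IMAGE` standing in for your own image reference). A retry with no such reason is a candidate for removal.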

Hands-on lab: break it, diagnose it, fix it

You’ll plant several classic CI faults in a free GitHub Actions workflow on a throwaway repository, then walk each through the loop. (The symptoms are identical on GitLab/Azure/Jenkins; only the YAML differs.)

1. Create a throwaway repo and a first workflow.

mkdir ci-ts-lab && cd ci-ts-lab && git init -b main
mkdir -p .github/workflows
cat > .github/workflows/ci.yml <<'EOF'
name: ci
on: [push, workflow_dispatch]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: show env
        run: env | sort | grep -E '^(GITHUB_REF|RUNNER_OS)='
      - name: build
        run: echo "build ok"
EOF
git add . && git commit -m "ci: first workflow"
# Create the repo and push (requires the gh CLI, authenticated):
gh repo create ci-ts-lab --private --source=. --push

Open the Actions tab and confirm the run is green. (No gh? Create an empty repo in the UI and git push to it.)

2. Fault A — a YAML parse error (the pipeline layer at parse time). Introduce a bad indent:

# Break indentation deliberately:
printf '      - name: oops\n         run: echo nope\n' >> .github/workflows/ci.yml
git commit -am "break: bad indentation" && git push

The run fails before any step with “Invalid workflow file”. Diagnose by validating locally — yamllint .github/workflows/ci.yml points at the exact line. Fix by correcting the indentation (spaces, consistent depth), commit, push, and confirm green. Lesson: a parse error never reaches your build — it’s the pipeline layer, caught by yamllint before you push.

3. Fault B — secret scope (the most common “missing value”). Add a step that needs a secret you haven’t set:

cat >> .github/workflows/ci.yml <<'EOF'
      - name: needs a secret
        run: |
          env | sort | grep -i token || true
          test -n "$API_TOKEN" && echo "have token" || { echo "MISSING token"; exit 1; }
        env:
          API_TOKEN: ${{ secrets.API_TOKEN }}
EOF
git commit -am "add step needing API_TOKEN" && git push

It fails with MISSING token. Diagnose: the env | sort line in that step shows API_TOKEN with no value (empty or absent, not masked as ***), because the secret was never defined and so is not in scope. Fix: gh secret set API_TOKEN --body "dummy-value", re-run, and watch the same step now show the secret masked-but-present and pass. Lesson: “missing” almost always means “not in scope”, and the env dump distinguishes “no value reached the job” from “wrong value” in seconds.
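The env dump above can be made more precise with a check that separates three states without ever printing the value. This is a minimal sketch assuming bash; `secret_state` is a hypothetical helper name. It is worth knowing the "empty" case because, on GitHub Actions, an undefined secret referenced through an env: mapping generally arrives as an empty string rather than as a missing variable.

```shell
#!/usr/bin/env bash
# secret_state NAME
# Prints "absent", "empty", or "present" for the variable called NAME,
# without printing its value. Hypothetical helper name.
secret_state() {
  local name=$1
  if [ -z "${!name+x}" ]; then
    echo "absent"       # the variable is not defined in this step at all
  elif [ -z "${!name}" ]; then
    echo "empty"        # defined, but the value is an empty string
  else
    echo "present"      # defined with a non-empty value
  fi
}
```

In the lab step, `secret_state API_TOKEN` would be expected to report a no-value state before the secret is set and "present" afterwards. "Present" only proves the value reached the job; whether it is the right value is a separate question.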

4. Fault C — a non-zero exit hidden by a pipe. Show why set -euo pipefail matters:

cat >> .github/workflows/ci.yml <<'EOF'
      - name: hidden failure
        run: |
          set -euo pipefail        # remove this line to see the bug hide
          false | cat              # without pipefail, the job stays GREEN despite 'false'
          echo "this line should NOT run"
EOF
git commit -am "demonstrate pipefail" && git push

With set -euo pipefail the job correctly fails at false | cat; delete the set -euo pipefail line, push again, and watch the job go green despite a failed command: the classic silent failure. Fix: keep set -euo pipefail in every multi-line run. Before moving on, remove the false | cat line (or the whole step) so the job can reach the steps you add next. Lesson: an exit code you don’t enforce is an error you won’t see.
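The same behaviour can be reproduced on a laptop without pushing anything. The sketch below, assuming bash is installed, runs an identical pipeline with and without pipefail and then uses bash's PIPESTATUS array to show the per-stage exit codes. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Compare the exit status of the same pipeline with and without pipefail.
# Each case runs in a fresh bash so the option cannot leak between them.

without_pipefail=0
bash -c 'false | cat' || without_pipefail=$?

with_pipefail=0
bash -o pipefail -c 'false | cat' || with_pipefail=$?

echo "without pipefail: exit ${without_pipefail}"
echo "with pipefail:    exit ${with_pipefail}"

# PIPESTATUS holds the exit status of every stage of the most recent pipeline,
# which shows exactly which stage failed even when the pipeline "succeeds".
false | true | cat
stages="${PIPESTATUS[*]}"
echo "per-stage statuses: ${stages}"
```

Without pipefail the pipeline reports 0 because only the last stage (cat) counts; with pipefail it reports 1, the status of the failed `false`. The per-stage line prints `1 0 0`, which is the detail a green job would otherwise hide.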

5. Fault D — flakiness, and why blind retries lie. Add a deliberately flaky step and observe re-runs:

cat >> .github/workflows/ci.yml <<'EOF'
      - name: flaky
        run: |
          if [ $(( RANDOM % 2 )) -eq 0 ]; then echo "pass"; else echo "fail"; exit 1; fi
EOF
git commit -am "add a flaky step" && git push

Re-run the same commit several times (Re-run all jobs): it passes and fails without any change — the definition of flaky. The tempting “fix” is a retry wrapper; the real fix is removing the non-determinism (here, the RANDOM). Lesson: re-run the same commit to prove flakiness, then fix the source of non-determinism — don’t paper over it with retries.

Validation. A clean walk-through shows: Fault A caught by yamllint before push; Fault B’s env dump distinguishing absent-vs-present and the secret fixing it; Fault C going green-when-it-shouldn’t once pipefail is removed; and Fault D flapping across identical re-runs until the RANDOM is removed.

Cleanup (so nothing is left behind):

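# gh needs the delete_repo scope for this; if it is refused, run: gh auth refresh -s delete_repo
gh repo delete ci-ts-lab --yes     # remove the throwaway repo
cd .. && rm -rf ci-ts-lab          # remove the local clone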

Cost note: free. GitHub Actions on a private repo includes a monthly free-minutes allowance; this tiny workflow uses seconds per run. Delete the repo when done to leave no trace.

Common mistakes & troubleshooting

A meta-table — the mistakes engineers make while troubleshooting CI/CD, which keep them stuck or hide real problems:

Mistake	Why it bites	Do this instead
Hitting Re-run before reading the log	Wastes minutes; for a non-flaky failure it changes nothing	Read the first error and the exit code, name the stage/layer, then act
Reading the last error instead of the first	The last error is often a consequence of the first	Scroll to the top of the red output; fix the earliest cause
Assuming a “missing” value means “I forgot to set it”	It’s usually a scope problem (env/fork/reusable-workflow)	Add an `env | sort` debug step: masked-but-present vs absent tells you instantly
Treating the cache as always correct	A loose cache key restores stale/mismatched artifacts	Key the cache on the lockfile hash + OS/arch; bust it as a first cheap experiment
`npm install` (or unpinned deps) in CI	Non-reproducible builds; “works on my machine”	Commit the lockfile; install from it (`npm ci`); pin the toolchain/container
Debugging your build script when the job never started	The failure is a runner/label/capacity issue, not your code	Check a runner is online with matching labels before reading the script
Confusing `401` with `403` at a registry	Sends you to the wrong fix (creds vs permissions)	`401` → login/token; `403` → RBAC/scope
Pulling base images anonymously at scale	Docker Hub `429` rate limits flake the pipeline	Authenticate pulls, cache/pin by digest, use a pull-through cache
Wrapping a flaky unit test in `retries`	Hides real regressions; erodes trust in green	Fix the non-determinism; reserve bounded retries for genuine async waits
Leaving jobs with no timeout	A hung job ties up a runner indefinitely	Set `timeout-minutes`; use non-interactive flags and external-call timeouts
Running untrusted fork code with `pull_request_target`	It runs with secrets — a real exfiltration risk	Never check out/run untrusted code with secrets in scope; gate on trusted events

Best practices

Make pipelines reproducible by default. Commit lockfiles and install from them; pin actions by full commit SHA, images by digest, and the toolchain by version (or a fixed container); cache on a lockfile-hash key. Reproducibility is what makes failures debuggable, and most CI-only failures turn out to be an environment difference.
Fail fast and loud. Put set -euo pipefail at the top of multi-line scripts, set a sensible timeout-minutes on every job, and make required checks actually required so a red build blocks merge rather than being ignored.
Scope secrets tightly and never expose them to forks. Define secrets at the narrowest scope that works (environment over repo over org), pass them explicitly into reusable workflows, and rely on the fork-PR secret boundary — debug “missing value” with an env dump, not by widening scope.
Prefer ephemeral, daemonless runners. Autoscaled ephemeral runners isolate jobs and shed disk cruft; daemonless builders (Kaniko/BuildKit/Buildah) remove the Docker-in-Docker support burden.
Keep deploys observable and reversible. Read kubectl describe/--previous before changing anything; use backward-compatible (expand/contract) migrations; make every deploy strategy have a tested rollback; treat an aborted canary as a correct signal, not an obstacle.
Hunt flakiness as a first-class bug. Measure per-test/per-job flake rate, quarantine proven flakes to a non-blocking lane (still running), fix the determinism, and re-admit only with evidence — a retry is a painkiller, not a cure.
Use OIDC, not long-lived keys. Federate the pipeline’s identity to the cloud so a leaked log or artifact isn’t a standing breach, and so deploy-auth failures are trust-policy fixes, not key rotations.
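The cache-key advice in the reproducibility item above can be sketched as follows. This is an illustrative sketch assuming bash and either `sha256sum` or `shasum` on the PATH; `cache_key` and the `deps-` prefix are hypothetical names. The key changes whenever the lockfile, the OS, or the CPU architecture changes, so a stale or mismatched cache is never restored.

```shell
#!/usr/bin/env bash
# cache_key LOCKFILE
# Prints a cache key built from the OS, the CPU architecture, and the
# SHA-256 of the lockfile. Hypothetical helper; "deps-" is an arbitrary label.
cache_key() {
  local lockfile=$1 hash
  if command -v sha256sum >/dev/null 2>&1; then
    hash=$(sha256sum "$lockfile" | cut -d' ' -f1)
  else
    hash=$(shasum -a 256 "$lockfile" | cut -d' ' -f1)   # fallback where sha256sum is missing
  fi
  echo "deps-$(uname -s)-$(uname -m)-${hash}"
}
```

Hosted platforms usually offer a built-in equivalent; on GitHub Actions the usual approach is a key built from `runner.os` and `hashFiles('**/package-lock.json')`. The point of the sketch is the ingredients of the key, not this particular function.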

Security notes

Troubleshooting a pipeline repeatedly puts you near its most sensitive assets — credentials, the registry, and the production environment — so do it safely. Never widen a secret’s scope to make a job pass: a “missing” value is almost always a scoping problem, and the fix is to grant it at the right level, not to expose it more broadly; in particular, secrets are deliberately withheld from fork PRs, and pull_request_target runs with secrets — never use it to check out or run untrusted code, which is a real exfiltration path. Debug logs can leak secrets: ACTIONS_STEP_DEBUG/CI_DEBUG_TRACE/System.Debug print far more, and masking only catches known values — capture debug output to a controlled location, scrub before sharing, and disable it again after. When a registry or cloud call returns 403, resist granting broad permissions (push/admin everywhere, or a wildcard IAM policy) just to make it pass — confirm who the identity is and grant the minimum missing scope; over-broad CI identities are a classic escalation path. Prefer OIDC/short-lived credentials over long-lived keys so a leaked artifact isn’t a standing breach, and pin actions/images by digest so a compromised upstream tag can’t silently change what your pipeline runs. Finally, treat self-hosted runners as security boundaries: they often hold cloud credentials and run arbitrary PR code, so isolate them (ephemeral, least-privilege, no shared state), and never run untrusted forks on a runner with production access. These themes are developed in the DevSecOps pipeline lesson and the keyless OIDC deploys lesson.
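The "scrub before sharing" step can be sketched as a small filter. This is a minimal sketch assuming bash; `scrub_log` is a hypothetical helper name. Like CI masking, it only catches values it is told about, and it does not handle secrets that span multiple lines, so treat it as a last check rather than a guarantee.

```shell
#!/usr/bin/env bash
# scrub_log VAR_NAME...
# Copies stdin to stdout, replacing the current value of each named variable
# with ***. Only single-line values are handled. Hypothetical helper name.
scrub_log() {
  local line name value
  while IFS= read -r line || [ -n "$line" ]; do
    for name in "$@"; do
      value=${!name-}                     # value of the variable called $name, if any
      [ -n "$value" ] || continue         # skip unset or empty variables
      line=${line//"$value"/***}          # quoted pattern: literal match, all occurrences
    done
    printf '%s\n' "$line"
  done
}
```

A possible use is `scrub_log API_TOKEN < debug.log > debug.scrubbed.log` before attaching a log to a ticket; the file names are placeholders. Values derived from a secret (base64-encoded, URL-encoded, or truncated forms) are not caught and still need a manual read-through.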

Interview & exam questions

A pipeline that passed an hour ago is now red with no code change. Walk me through your response. First, don’t re-run blindly — re-run the same commit once to test whether it’s flaky vs genuinely broken. Read the first error and the exit code; name the stage (build/test/publish/deploy) and layer (code/pipeline/runner/registry/environment). Common no-change culprits: a moved dependency or floating tag, an expired cache, an expired token, a registry rate limit, or a flaky test. Fix the specific cause and add the guardrail (pin the version/digest, fix the cache key) so it can’t recur.
A secret is empty in a job even though you “set it”. How do you debug this? It’s almost always a scope problem, not a value you forgot to set. Add an env | sort debug step: a secret in scope prints masked but present (***), while one out of scope is absent or empty, which distinguishes the two immediately. Then grant it at the right scope (environment/repo/org), pass it explicitly into reusable/called workflows, and remember fork PRs don’t receive secrets by design.
What do exit codes 137, 143, 124, and 127 tell you? 137 = SIGKILL (128 + 9), almost always OOM (out of memory): reduce the job’s memory use or use a bigger runner. 143 = SIGTERM (128 + 15), usually a cancelled/terminated job. 124 = a timeout (the exit code of the timeout command). 127 = command not found (toolchain/PATH problem). Reading the code first often names the failure class before you read a line of the message.
A job sits “Queued” forever while the runners show “idle”. What’s happening? A label/tag mismatch: runners are online but none matches the job’s requested labels (runs-on/tags/demands). The job hasn’t started, so it can’t be a code bug. Fix by aligning labels — add the required label to a runner or correct the job’s requested label — and check capacity/concurrency if runners are genuinely all busy.
Distinguish 401 from 403 at a container registry, and how you’d fix each. 401 Unauthorized = not authenticated — no/expired token, not logged in, or wrong registry; fix the login/credential step. 403 Forbidden = authenticated but not permitted — the identity lacks push/pull on that repo; fix the RBAC/scope (least privilege). Confusing the two sends you to the wrong fix.
Your deploy fails with Not authorized to perform sts:AssumeRoleWithWebIdentity (or AADSTS70021). Cause and fix? An OIDC trust mismatch: the cloud rejected the federated token because the role/credential’s trust conditions don’t match the token’s sub/aud claims (wrong branch/environment/repo), or the workflow lacks id-token: write. Fix the trust policy to match the actual subject/audience and grant id-token: write — don’t fall back to long-lived keys.
A Kubernetes pod is ImagePullBackOff after deploy; another is CrashLoopBackOff. How do you tell them apart and fix each? ImagePullBackOff/ErrImagePull is a registry/auth problem — kubectl describe pod shows the exact pull error (wrong tag, missing imagePullSecret, unreachable registry); fix the image reference or pull credentials. CrashLoopBackOff is an application problem — kubectl logs --previous shows why it started and died (bad config/secret/migration); fix the app config and roll back while you fix forward.
What makes a pipeline “flaky”, and what’s the right workflow to deal with it? Flaky = passes/fails on the same commit with no change — non-determinism (test order/shared state, timing/races, external deps, resource contention, time/randomness). The workflow: measure flake rate, quarantine proven flakes to a non-blocking lane (still running, so still visible), fix the determinism (isolate state, wait on conditions, mock externals, control the clock/seed), and re-admit only after N green runs. A blanket retry on a unit test hides real regressions.
A build passes locally but fails in CI with a missing native module. Most likely cause? A cache restored against a different lockfile/OS/arch, or a toolchain difference between laptop and runner. Diagnose by comparing the cache key to the lockfile and printing the tool versions in CI. Fix by keying the cache on the lockfile hash + OS/arch (bust it once to recover) and pinning the toolchain/container so the environment matches.
docker: Cannot connect to the Docker daemon on a self-hosted runner. What are your options? The job can’t reach a Docker daemon: Docker isn’t installed or running on the runner, or the socket isn’t available inside the job’s container (a user missing from the docker group usually shows up as “permission denied” on the socket instead). Options: install/enable Docker and add the user to the docker group; mount the host Docker socket; use Docker-in-Docker (privileged, with DOCKER_HOST/TLS configured); or, best, use a daemonless builder (Kaniko/BuildKit/Buildah) that needs no daemon at all.
Why are secrets withheld from fork pull requests, and what’s the danger of pull_request_target? Because a fork PR contains untrusted code; exposing secrets to it would let anyone exfiltrate them by editing the workflow/build. pull_request_target runs in the base repo’s context with secrets and, by default, checks out the base branch, which is safe. The danger comes from explicitly checking out and running the fork’s code under it (e.g. building/testing the PR), which hands secrets to untrusted code. Never run untrusted code with secrets in scope.
A deploy “succeeds” but users still hit the old version. What likely went wrong? The traffic switch didn’t complete — the slot swap, service selector, or ingress/LB weight wasn’t actually changed, or DNS/LB caching is serving the old target. Verify the router/ingress/slot points at the new version and account for DNS TTL / LB warm-up before declaring success; “deployed” is not “receiving traffic”.
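The exit-code answer above can be turned into a lookup you run on a failed job's code before reading the message. This is a minimal sketch assuming bash; `classify_exit` is a hypothetical helper name, and the wording of each class is a summary of this lesson rather than any tool's official output. Codes above 128 follow the shell convention of 128 plus the signal number.

```shell
#!/usr/bin/env bash
# classify_exit CODE
# Maps a process exit code to a likely failure class.
# Hypothetical helper; codes above 128 are treated as 128 + signal number.
classify_exit() {
  local code=$1
  case "$code" in
    0)   echo "success" ;;
    124) echo "timeout (as returned by the timeout command)" ;;
    126) echo "found but not executable" ;;
    127) echo "command not found (toolchain/PATH)" ;;
    130) echo "interrupted (SIGINT)" ;;
    137) echo "killed (SIGKILL), often out of memory" ;;
    143) echo "terminated (SIGTERM), often a cancelled job" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error (read the first error in the log)"
      fi
      ;;
  esac
}
```

For example, `classify_exit 137` points at memory before you open the log, while a plain `1` sends you to the first error in the output. The mapping is a heuristic: a program is free to return any of these codes for its own reasons.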

Quick check

A job has been “Queued” for 15 minutes while the runners page shows two idle runners. Is this most likely a code bug, and what’s the first thing to check?
A step’s env | sort shows your secret as *** (masked). Is the secret in scope or out of scope, and what does that rule out?
A docker pull of a public base image starts failing intermittently with 429 toomanyrequests only when many jobs run at once. What’s the cause and two fixes?
A test passes when run alone but fails when the suite runs in parallel. What category of flake is this, and what’s the fix (not a retry)?
A deploy fails with 403 Forbidden after a successful docker login. Is this authentication or authorisation, and where do you look?

Answers

Not a code bug — a queued job never started, so your script can’t be the cause. First check is a label/tag mismatch: the idle runners’ labels don’t match the job’s runs-on/tags/demands. Align the labels (or check capacity/concurrency if they’re actually busy).
In scope — a masked *** means the secret reached the job (present but redacted). That rules out “not set / wrong scope / fork PR”; if auth still fails, the problem is the secret’s value/permission, not its visibility.
Docker Hub’s anonymous pull-rate limit, hit by many parallel pulls. Fixes: authenticate the pulls (higher limit) and cache/pin by digest (or use a pull-through-cache registry) so you pull the image once instead of per-job.
Test ordering / shared mutable state (it depends on isolation that parallelism breaks). Fix the determinism: give each test fresh fixtures and remove shared state; randomise order in CI to surface such bugs — don’t wrap it in a retry, which hides the real defect.
Authorisation — you authenticated (login succeeded), but the identity lacks permission (push/pull) on that repository. Look at the registry’s RBAC/scope for that principal and grant the minimum needed; a 401 (not 403) would have meant an authentication/credential problem instead.

Exercise

Build your own CI break-and-fix runbook (timed, free). On a fresh GitHub repo with a small Actions workflow, plant one fault per layer and prove you can diagnose each from observation alone — before you fix it.

Scaffold a workflow (checkout + a trivial build/test), confirm it’s green.
Plant five faults across the layers (two of them in the pipeline definition):
- Pipeline (parse): introduce a YAML indentation/tab error and capture how it fails before any step.
- Pipeline (scope): reference a secret you haven’t set; capture the env-dump showing it absent, then set it and show it masked-but-present.
- Build (reproducibility): use an unpinned dependency or a cache key without the lockfile hash; force a mismatch and capture the failure.
- Runner (labels): request a runs-on/label that no runner provides; capture the job stuck “queued”.
- Deploy/flake: add a RANDOM-based flaky step; re-run the same commit repeatedly and record the pass/fail flapping.
For each, write down the stage, the layer, the diagnostic signal (first error / exit code / env dump / runner status), the root cause, and the fix — before fixing. Time yourself: under 6 minutes per fault.
Fix each with the smallest, most deterministic change (correct indentation; grant the secret at the right scope; pin the dep + lockfile-hash cache key; fix the label; remove the non-determinism), and verify each yields a green run from a clean state.

Self-assess:

Criterion	Target
Identified the correct stage + layer before touching anything	All 5
Found the root cause from the log/exit-code/`env`/runner status (not guessing)	All 5
Fixed with the smallest, most deterministic change	All 5
Verified a clean green run afterwards (fresh cache where relevant)	All 5
Whole drill completed	Under 30 minutes

Cleanup: delete the throwaway repo.

Cost note: free — a tiny Actions workflow on a private repo uses seconds of the monthly free-minutes allowance; delete the repo when done.

Certification mapping

AWS DevOps Engineer (DOP-C02), Azure DevOps Engineer (AZ-400), Google Cloud DevOps Engineer — these exams test CI/CD operationally, which is exactly this lesson: diagnosing a failed pipeline, fixing OIDC/identity trust for keyless deploys, resolving artifact-registry auth (401 vs 403) and rate limits, triaging a stuck/failed deployment (ImagePullBackOff, rollout health, rollback), and securing self-hosted runners/agents. The scenario questions (“a deploy fails with an OIDC error”, “a pod won’t pull its image”, “a build is flaky”) map directly to the playbooks here.
DevOps Institute — DevOps Foundation / DevSecOps — the method (isolate the stage/layer, fix the root cause, prevent recurrence) and the secret-handling/least-privilege themes align with the foundation and DevSecOps syllabi.
CKA / CKAD — the deployment-failure playbook is core Kubernetes troubleshooting: kubectl describe/logs --previous, ImagePullBackOff vs CrashLoopBackOff, rollout status/undo, probes and resource/quota issues are directly examinable.
GitHub Actions / GitLab certifications — variable/secret scope, expressions and contexts, reusable workflows, runners/agents, and caching are tested; the pipeline-layer and runner playbooks here cover those objectives.

Glossary

Stage / job / step — the nesting of a pipeline: a stage groups jobs, a job runs on one runner and contains ordered steps. Naming where a failure sits is half the diagnosis.
Runner / agent — the machine (hosted or self-hosted) that executes a job. “Queued forever” usually means no runner with matching labels is available.
Exit code — the numeric result of a command; 0 = success. Key non-zero codes: 137 (OOM/SIGKILL), 143/130 (terminated/cancelled), 124 (timeout), 127 (command not found), 126 (not executable).
Cache (CI) — stored dependencies/artifacts reused between runs to speed builds; a key that omits the lockfile hash can restore stale artifacts and cause baffling failures.
Lockfile — the pinned dependency manifest (package-lock.json, poetry.lock, etc.); installing from it (npm ci) makes builds reproducible.
Secret scope — the level at which a secret is defined and visible (org / repo / environment / job). Secrets are deliberately withheld from fork PRs.
Type coercion (YAML) — YAML interpreting unquoted scalars as booleans/numbers/null (the Norway problem: no→false; 1.20→1.2); quote ambiguous values.
Docker-in-Docker (DinD) — running a Docker daemon inside a CI container to build images; needs privileged mode/TLS config. Daemonless builders (Kaniko/BuildKit/Buildah) avoid it.
Docker socket — /var/run/docker.sock; mounting the host’s socket lets a job use the host daemon instead of DinD (with security trade-offs).
OIDC (keyless) auth — federating a pipeline’s identity to a cloud via short-lived tokens instead of long-lived keys; failures are trust-policy (sub/aud) mismatches, not bad passwords.
401 vs 403 — 401 Unauthorized = not authenticated (login/token problem); 403 Forbidden = authenticated but not permitted (RBAC/scope problem).
Pull-rate limit — a registry cap on pulls (notably Docker Hub for anonymous/free); causes intermittent 429 toomanyrequests under parallel load.
Immutable tags — a registry policy forbidding overwriting a published tag; a re-push of the same version then fails with a conflict.
ImagePullBackOff / ErrImagePull — Kubernetes can’t pull the image (wrong tag, missing imagePullSecret, unreachable registry); the pod Events name the cause.
CrashLoopBackOff — the container starts then repeatedly crashes (bad config/secret/migration/command); kubectl logs --previous explains why.
Rollout (stuck “progressing”) — a deployment that never becomes healthy because of failing readiness probes, insufficient resources, or a blocking quota/PDB; resolved by fixing the probe/resources or rolling back.
Flaky test/pipeline — passes/fails on the same commit with no change, due to non-determinism; quarantine and fix the root cause rather than masking with retries.
Quarantine (flaky) — moving a proven-flaky test to a non-blocking lane (still running, still visible) until its determinism is fixed and it’s re-admitted with evidence.
pull_request_target — a GitHub trigger that runs in the base repo’s context with secrets; dangerous if used to run untrusted fork code.

Next steps

You can now diagnose the everyday CI/CD failures across every stage and layer — red builds, pipeline/YAML and scope errors, runner/agent problems, artifact/registry auth, and deployment failures — and run a disciplined flaky-pipeline workflow. The next lesson steps up from “fix this failure” to “design a delivery platform that rarely produces these failures in the first place”:

The DevOps Architecting Ladder: From a Single Pipeline to an Internal Developer Platform — the maturity rungs from one CI workflow to a governed, self-service platform.
Deployment Strategies: Rolling, Blue/Green, Canary, Progressive Delivery & Rollback — the strategies (and rollbacks) whose failures this lesson triages.
Building a DevSecOps Pipeline: SAST, DAST, SCA & Policy Gates — adding security gates without making the pipeline brittle.
Keyless GitHub Actions Deployments with OIDC to AWS, Azure, and GCP — the deep dive on the OIDC trust that fixes most deploy-auth failures.
Terraform Troubleshooting: State, Providers, Drift, Dependencies & Debugging — the IaC counterpart to this CI/CD playbook.