
Deployment Strategies: Rolling, Blue/Green, Canary, Progressive Delivery & Rollback

A pipeline that builds, tests, scans and publishes an artefact has done only half the job. The other half — the half that wakes you at 03:00 — is getting that artefact in front of real users without breaking them. That is the deployment strategy: the precise sequence by which new code replaces old, how traffic moves between the two, how you observe whether the new version is healthy, and how fast you can retreat if it is not.

Choosing a strategy is an exercise in trading three things against each other: blast radius (how many users a bad release can hurt), cost (how much spare capacity and tooling the strategy demands), and rollback speed (how quickly you can undo). A solo side-project can recreate-and-pray; a payments platform shifting £40m a day cannot. This lesson walks every mainstream strategy with a tradeoff table, then covers the idea that quietly changed the industry — decoupling deployment from release using feature flags — followed by rollback mechanics and the database migration patterns (expand/contract) that make any of it safe. By the end you will be able to look at a service, its SLOs, and its data model, and name the right strategy with reasons an interviewer will respect.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites

You should be comfortable with the CI/CD pipeline end-to-end — stages, quality gates and artefact promotion — as covered in CI/CD Pipeline Design: Stages, Quality Gates, Artifacts & Security Scans. A working mental model of containers and Kubernetes Deployment objects helps for the progressive-delivery sections, and a passing familiarity with a load balancer or service mesh (how traffic is weighted between backends) will make the canary discussion concrete. No specific cloud is assumed; examples lean on Kubernetes because it makes the mechanics explicit, but every pattern maps onto VMs, App Service slots, Lambda aliases and serverless revisions. This lesson sits in the Deployment module of the DevOps Zero-to-Hero course, immediately after pipeline design and before troubleshooting.

Core concepts: deploy, release, and the four levers

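Two words are used loosely in casual conversation and must be kept apart for the rest of this lesson:

Deploy: installing a new version of the binary on infrastructure.
Release: exposing that version's behaviour to users.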

The single most important idea in modern delivery is that deploy and release are separable. Once you internalise that, half the strategies below stop being mutually exclusive and become composable.

Every strategy is then a position on four levers:

Lever | Question it answers | Why it matters
Blast radius | If this release is bad, how many users/requests are harmed before we react? | Directly bounds incident severity and SLO error-budget burn.
Rollback speed | How long from “we see the problem” to “users are safe again”? | The dominant term in MTTR; a fast rollback turns an outage into a non-event.
Cost / capacity | How much extra compute, tooling and complexity does it demand? | Blue/green doubles capacity during cutover; progressive delivery needs a mesh + metrics.
Verification depth | What evidence of health do we gather before going wider? | “Pods are Ready” is weak; “p99 latency and error rate within SLO on real traffic” is strong.

Keep this table in your head; the strategy comparison later is just these four columns scored per strategy.

A few supporting terms used throughout:

Recreate (stop-the-world)

The simplest strategy: terminate every instance of the old version, then start the new one. There is a window — between the last old pod dying and the first new pod becoming Ready — during which the service is down.

In Kubernetes this is a single strategy setting:

spec:
  strategy:
    type: Recreate

When it is correct: development environments; batch workers with no live traffic; applications that cannot run two versions concurrently (for example, a singleton holding an exclusive database lock, or an old and new schema that genuinely conflict). It is also the honest choice when your data layer forbids running both versions at once and you have a maintenance window.

The tradeoffs: zero extra capacity (you never run two versions), trivially simple, but guaranteed downtime and a slow rollback (you must recreate the old version the same way). Never use it for a user-facing service with an availability SLO.

Rolling update (the default)

Replace instances incrementally: bring up some new pods, wait for them to pass readiness, retire an equal number of old pods, repeat until the fleet is converted. This is the default for a Kubernetes Deployment and for most VM scale sets and App Service multi-instance plans.

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # up to 25% extra pods during the roll
      maxUnavailable: 0    # never drop below desired capacity

Setting maxUnavailable: 0 with a positive maxSurge gives a zero-downtime roll: you always add capacity before removing any. Combine with a readinessProbe that genuinely reflects the app’s ability to serve, a preStop hook plus terminationGracePeriodSeconds for graceful connection draining, and a PodDisruptionBudget so cluster operations cannot evict too many at once.
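The arithmetic behind those two knobs can be sketched in a few lines. This is an illustration only, not the controller's implementation: it assumes the documented Kubernetes rounding (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down), and the helper names resolve and rolling_bounds, as well as the fleet size of 10, are hypothetical.

```python
from __future__ import annotations


def resolve(value: int | str, replicas: int, round_up: bool) -> int:
    """Convert an absolute count or a percentage string such as "25%" to pods."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas  # integer maths avoids float rounding
    return -(-scaled // 100) if round_up else scaled // 100


def rolling_bounds(
    replicas: int, max_surge: int | str, max_unavailable: int | str
) -> tuple[int, int]:
    """Return (most pods that may exist, fewest pods that stay available)."""
    surge = resolve(max_surge, replicas, round_up=True)            # surge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # rounds down
    return replicas + surge, replicas - unavailable


# Assumed fleet of 10 replicas with the settings shown in the manifest.
print(rolling_bounds(10, "25%", 0))  # → (13, 10)
```

With maxUnavailable at 0 the second number never drops below the desired replica count, which is the zero-downtime property; the first number is the spare capacity the cluster must be able to schedule during the roll.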

The catch — and it is the whole reason canary exists: a readiness probe proves the process is up, not that it returns correct responses, not that p99 latency is within budget, not that a downstream call still resolves. A rolling update will therefore cheerfully roll a 200-on-/healthz-but-500-on-everything-else build to 100% as fast as probes allow. There is no metric feedback loop and rollback means rolling backwards (re-deploying the previous image), which takes as long as the roll did.

When it is correct: the workhorse for stateless services where backward/forward compatibility between adjacent versions holds (it must — both run simultaneously mid-roll), and where a readiness probe is a sufficient proxy for health. For anything where “healthy” needs evidence beyond “up”, graduate to canary.

Blue/green

Run two complete environments: blue (current, live) and green (new, idle). Deploy to green, smoke-test it out of band, then flip all traffic from blue to green in a single switch — a load-balancer target change, a DNS weight, a Kubernetes Service selector edit, an App Service slot swap, or a Lambda alias repoint. Keep blue warm; if green misbehaves, flip back instantly.

# Kubernetes "poor man's" blue/green: one Service, two Deployments by label.
# Cut over by editing the selector the Service points at.
kubectl patch service shop-web -p \
  '{"spec":{"selector":{"app":"shop-web","version":"green"}}}'

# Rollback is the same command with version: blue — sub-second, no rebuild.
kubectl patch service shop-web -p \
  '{"spec":{"selector":{"app":"shop-web","version":"blue"}}}'

On Azure App Service this is the slot swap (az webapp deployment slot swap), which additionally does a warm-up against the staging slot before the swap so the first real request does not hit a cold worker.

The defining property is rollback speed: because the old version is still running and fully provisioned, rollback is the switch in reverse — effectively instant, no rebuild, no re-roll. That single trait is why blue/green is beloved for high-stakes cutovers.

The tradeoffs: you pay for double the capacity during the cutover window (two full fleets). Stateful concerns are sharp: in-flight sessions on blue, database migrations that both versions must tolerate (see expand/contract), and message-queue consumers that might double-process. And the switch is all-or-nothing — 100% of users meet the new version simultaneously, so a defect that slipped smoke-testing hits everyone until you flip back. Blue/green bounds rollback time, not blast radius. Pair it with a brief canary if you need both.

Canary

Release the new version to a small slice of real traffic (say 5%), observe, and only then widen — 5% → 25% → 50% → 100% — with a bake time at each step. The name comes from the canary in a coal mine: the small exposed group warns you before the whole mine is affected.

Manual canary in Kubernetes can be approximated with two Deployments and replica ratios (9 stable : 1 canary ≈ 10% traffic), but that conflates replica count with traffic share and is coarse. Real canary shapes traffic weight at the mesh or ingress (Istio VirtualService, NGINX ingress canary annotations, Gateway API HTTPRoute weights, or a cloud LB’s weighted target groups), independent of pod counts.

The key difference from blue/green: blast radius is bounded by the percentage exposed, not by how fast pods schedule. A bad canary harms 5% of users for the few minutes before you notice, versus 100% with blue/green’s flip. The cost is operational: you must run two versions and have a way to weight traffic, and someone or something must watch the metrics during bake. If a human watches, you have a manual canary; automate that watcher and you have progressive delivery.
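To see why replica-ratio canaries are coarse, it helps to work out how many pods a given traffic share would need. The sketch below assumes the load balancer spreads requests evenly across pods, which real balancers only approximate; smallest_fleet is a hypothetical helper, not part of any tool.

```python
from math import gcd


def smallest_fleet(canary_percent: int) -> tuple[int, int]:
    """Smallest (canary_pods, total_pods) whose ratio equals canary_percent.

    Assumes requests are spread evenly across all pods.
    """
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 1 and 99")
    divisor = gcd(canary_percent, 100)
    return canary_percent // divisor, 100 // divisor


for percent in (10, 5, 1):
    canary, total = smallest_fleet(percent)
    print(f"{percent}% needs {canary} canary pod(s) in a fleet of {total}")
```

A 1% canary by replica ratio alone would need a fleet of 100 pods, whereas a mesh or ingress weight can express 1% with a single canary pod. That gap is the practical reason to shape traffic by weight rather than by pod count.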

Progressive delivery (metric-driven canary)

Progressive delivery is canary with the human taken out of the promotion loop. A controller shifts a traffic weight, queries a metrics backend against thresholds (error rate, p99 latency, custom business KPIs), and promotes or aborts automatically. The unit of progress is a traffic weight; promotion is conditional on evidence; rollback is automatic on regression. Two CNCF tools dominate.

Argo Rollouts replaces the Deployment controller with a Rollout custom resource carrying canary (or blue/green) steps and inline AnalysisTemplate evaluation:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: checkout-canary
      stableService: checkout-stable
      trafficRouting:
        nginx:
          stableIngress: checkout
      steps:
        - setWeight: 5
        - pause: { duration: 5m }          # bake at 5%
        - analysis:                         # gate on metrics
            templates:
              - templateName: success-rate
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
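  selector:
    matchLabels:
      app: checkout            # assumed label; must match the pod template
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:v2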
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      count: 5                               # finite run, so the inline analysis step can complete
      successCondition: result[0] >= 0.99   # ≥99% non-5xx
      failureLimit: 2                        # tolerate 2 failed measurements; a 3rd aborts
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="checkout",code!~"5.."}[2m]))
            / sum(rate(http_requests_total{service="checkout"}[2m]))

When the number of failed success-rate measurements exceeds failureLimit, the Rollout aborts and shifts traffic back to the last stable ReplicaSet automatically, with no human involved and no rebuild.
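The control loop such a controller runs can be sketched independently of any tool. The following is a simplified model, not Argo Rollouts' code: run_analysis and progressive_rollout are hypothetical names, sample stands in for a metrics query, and failure_limit is treated as the number of failed measurements tolerated before aborting.

```python
from typing import Callable, Iterable, Sequence


def run_analysis(measurements: Iterable[float], threshold: float, failure_limit: int) -> str:
    """Return "abort" once failed measurements exceed failure_limit, else "promote"."""
    failures = 0
    for value in measurements:
        if value < threshold:
            failures += 1
            if failures > failure_limit:
                return "abort"
    return "promote"


def progressive_rollout(
    steps: Sequence[int],
    sample: Callable[[int], float],
    threshold: float = 0.99,
    failure_limit: int = 2,
    checks_per_step: int = 5,
) -> tuple[int, str]:
    """Walk the traffic weights; return (final canary weight, outcome)."""
    for weight in steps:
        # At each weight, take a fixed number of measurements before going wider.
        readings = [sample(weight) for _ in range(checks_per_step)]
        if run_analysis(readings, threshold, failure_limit) == "abort":
            return 0, "aborted"  # all traffic returns to the stable version
    return 100, "promoted"


# A release that only degrades once it carries 25% of traffic or more.
def degrades_under_load(weight: int) -> float:
    return 0.999 if weight < 25 else 0.90


print(progressive_rollout([5, 25, 50, 100], degrades_under_load))  # → (0, 'aborted')
```

The release in this example passes the 5% step and is stopped at 25%, so at most a quarter of traffic ever saw the regression. Stepping the weight is what buys that bound.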

Flagger (Flux’s progressive-delivery operator) takes a different stance: you keep an ordinary Deployment, and a Canary resource drives the rollout, managing the canary/primary Deployments and the mesh/ingress objects for you. It supports Istio, Linkerd, App Mesh, NGINX, Gateway API and more, with built-in webhooks for load-testing and acceptance checks:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5            # max failed checks before rollback
    maxWeight: 50
    stepWeight: 10          # 10% → 20% → ... → 50%, then promote
    metrics:
      - name: request-success-rate
        thresholdRange: { min: 99 }
        interval: 1m
      - name: request-duration
        thresholdRange: { max: 500 }   # p99 ms
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://checkout-canary.shop:8080/"

Argo Rollouts vs Flagger in one line: Argo Rollouts gives you a new workload object (Rollout) and pairs naturally with Argo CD; Flagger wraps your existing Deployment and pairs naturally with Flux. Both deliver automated, metric-gated canaries; pick by which GitOps ecosystem you already run.

Shadow (traffic mirroring)

Send a copy of live production traffic to the new version while the real responses still come from the old one. The new version processes mirrored requests; its responses are discarded (or compared offline). Users are never exposed to the shadow’s output, so the user-facing blast radius is zero even though the new code runs against real-world request patterns.

Istio expresses this with mirror on a VirtualService:

http:
  - route:
      - destination: { host: checkout, subset: stable }
        weight: 100
    mirror:
      host: checkout
      subset: canary
    mirrorPercentage:
      value: 100.0     # mirror all traffic; responses are dropped

When it shines: validating performance and correctness of a risky rewrite (a new search engine, a re-platformed pricing service) under genuine production load and edge-case inputs, before any user sees it. The big caveat: side effects. A mirrored request that writes to a database, charges a card, sends an email or enqueues a job will do real damage unless the shadow is pointed at isolated/sandboxed dependencies or made strictly read-only. Shadowing is a correctness/performance test in production, not a release mechanism — at the end you still need canary or blue/green to actually release.
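At the application level the mirroring idea reduces to: answer from the stable handler, run the shadow on a copy, and record differences without ever returning them. The sketch below is synchronous for clarity, whereas a mesh mirrors asynchronously; mirror and the handler shapes are hypothetical.

```python
from typing import Callable

Handler = Callable[[dict], dict]


def mirror(request: dict, stable: Handler, shadow: Handler, mismatches: list) -> dict:
    """Serve from stable; run shadow on a copy and log any difference."""
    response = stable(request)
    try:
        # Shallow copy so the shadow cannot mutate the live request's top-level keys.
        shadow_response = shadow(dict(request))
        if shadow_response != response:
            mismatches.append((request, response, shadow_response))
    except Exception as exc:  # a crashing shadow must never affect the user
        mismatches.append((request, response, exc))
    return response  # the shadow's output is never returned


def stable_pricing(request: dict) -> dict:
    return {"price": 100}


def rewritten_pricing(request: dict) -> dict:
    return {"price": 100 if request["sku"] == "a" else 99}


log: list = []
print(mirror({"sku": "b"}, stable_pricing, rewritten_pricing, log))  # → {'price': 100}
print(len(log))  # → 1
```

The user still receives the stable price while the discrepancy lands in the log for offline comparison. Nothing here protects against side effects: if the shadow handler wrote to a real database, the damage would already be done.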

A/B testing

Superficially similar to canary — two versions serve simultaneously — but the intent and routing key differ. Canary asks an operational question (“is v2 healthy?”) and routes by percentage. A/B testing asks a product question (“does variant B convert better than A?”) and routes by user attribute (cohort, geography, plan tier, a sticky hash of user-id) so a given user consistently sees one variant, with results measured over days against a business metric and judged for statistical significance.

A/B is usually implemented with feature flags (next section) rather than infrastructure traffic-splitting, precisely because you need attribute-based, sticky, per-user targeting — not a blunt traffic percentage. You can absolutely run an A/B experiment on top of a canary-deployed binary: deploy once, then use flags to assign cohorts. That composition is the bridge to the most important idea in this lesson.
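Sticky per-user assignment is usually done by hashing a stable identifier rather than by rolling a die on each request. A minimal sketch, assuming a string user id and an experiment name used as a salt; assign_variant is hypothetical and does not reflect any vendor's bucketing algorithm.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a user to one variant of an experiment."""
    # Salting with the experiment name keeps cohorts independent across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    index = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(variants)
    return variants[index]


# The same user always lands in the same cohort, on any server, with no stored state.
first = assign_variant("user-42", "checkout-cta")
second = assign_variant("user-42", "checkout-cta")
print(first == second)  # → True
```

hashlib is used instead of the built-in hash() because Python salts string hashes per process, which would break stickiness across restarts and across replicas.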

Strategy comparison (the tradeoff table)

This is the table to know cold for an interview. Scores are relative.

Strategy | Blast radius | Rollback speed | Extra capacity / cost | Verification depth | Downtime | Best for
Recreate | 100% (all at once) | Slow (recreate old) | None | None | Yes | Dev, batch, can’t run 2 versions
Rolling | Grows during roll | Slow (roll back) | Low (maxSurge) | Weak (readiness only) | No (with surge) | Default stateless service
Blue/green | 100% at the flip | Instant (flip back) | High (2× fleet) | Pre-cutover smoke test | No | High-stakes cutovers, fast rollback
Canary (manual) | Small (e.g. 5%) | Fast (shift weight back) | Medium (2 versions + LB) | Human watches metrics | No | Risky change, you have eyes on it
Progressive (auto) | Small + bounded | Fast + automatic | Medium (mesh + metrics) | Strong (automated canary analysis) | No | Mature teams, frequent releases
Shadow | Zero (responses dropped) | N/A (no user exposure) | High (full mirror fleet) | Strong (real traffic, no risk) | No | Validating a rewrite under load
A/B | Per-cohort | Toggle off variant | Low (flag-driven) | Business metric over days | No | Product experiments, conversion

Read it as four levers per row. Notice the standouts: blue/green wins rollback speed but loses blast radius; canary/progressive win blast radius; shadow wins verification with zero user risk but cannot release; A/B is a product tool, not an operational one. The mature default for a user-facing service is progressive delivery, often preceded by a shadow for a big rewrite and combined with feature flags for fine-grained control.

Deployment strategies compared

The diagram lays the strategies side by side — how traffic moves from old (blue) to new (green) version under each — so you can see at a glance why blast radius and rollback speed differ between them.

Feature flags: decoupling deploy from release

Every strategy above moves traffic at the infrastructure layer. Feature flags move the decision into the application layer, and in doing so deliver the separation we flagged in core concepts: deploy the binary continuously; release the behaviour on your own schedule.

A feature flag is a conditional that gates a code path at runtime:

if flags.is_enabled("new-checkout-flow", context={"user_id": user.id, "plan": user.plan}):
    return render_new_checkout()
return render_legacy_checkout()

Because the new code path is present but dark, the same binary can serve the old behaviour to everyone, then — with no deploy — be turned on for internal staff, then 1% of users, then a country, then everyone, and turned off instantly if something breaks. This unlocks a family of patterns:

Pattern | What it is | Why flags enable it
Dark launch | Ship code to production switched off; enable later. | Deploy decoupled from release; merge to trunk safely.
Ring / percentage rollout | Enable for internal → beta → 1% → 100%. | Per-attribute and percentage targeting in the flag service.
Kill switch | Instantly disable a misbehaving feature. | Toggle off in seconds, faster than any redeploy.
Operational toggle | Shed load by disabling an expensive feature under stress. | Runtime control without a release.
Entitlement | Gate a feature to a plan/tenant. | Targeting rules by user attribute.
A/B experiment | Assign sticky cohorts to variants. | Deterministic per-user bucketing + analytics.
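Several of these patterns fall out of one small evaluation function. The sketch below is an in-memory stand-in for a flag service, written to match the shape of the is_enabled call shown earlier; Flag and FlagStore are hypothetical and are not any vendor's SDK.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    enabled: bool = True   # False acts as the kill switch
    percentage: int = 0    # share of users exposed, 0-100
    allowed_plans: set = field(default_factory=set)  # entitlement-style rule


class FlagStore:
    def __init__(self) -> None:
        self._flags = {}

    def define(self, name: str, flag: Flag) -> None:
        self._flags[name] = flag

    def is_enabled(self, name: str, context: dict) -> bool:
        flag = self._flags.get(name)
        if flag is None or not flag.enabled:
            return False  # unknown or killed flags fail closed
        if context.get("plan") in flag.allowed_plans:
            return True   # ring / entitlement targeting by attribute
        # Percentage rollout: a stable bucket per (flag, user) pair.
        key = f"{name}:{context['user_id']}".encode("utf-8")
        bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100
        return bucket < flag.percentage


flags = FlagStore()
flags.define("new-checkout-flow", Flag(percentage=0, allowed_plans={"staff"}))
print(flags.is_enabled("new-checkout-flow", {"user_id": "u1", "plan": "staff"}))  # → True
print(flags.is_enabled("new-checkout-flow", {"user_id": "u2", "plan": "free"}))   # → False
```

Widening the rollout is a data change (raise percentage), and the kill switch is another (set enabled to False). Neither requires a deploy.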

Flags are also the enabler for trunk-based development: a half-finished feature can merge to main behind an off flag, keeping main always releasable without long-lived branches (see Migrating to Trunk-Based Development).

The tooling landscape

Tool | Model | Notable strengths | Watch-outs
LaunchDarkly | Commercial SaaS | Mature targeting, experimentation, streaming SDK updates, governance | Cost at scale; vendor lock-in unless abstracted
Unleash | Open-source (+ hosted) | Self-hostable, activation strategies, gradual rollout, OSS control | You operate it (in OSS mode)
Flagsmith | Open-source (+ hosted) | Self-host or SaaS, remote config, segments | Smaller ecosystem than LaunchDarkly
OpenFeature | CNCF standard / SDK spec | Vendor-neutral API + provider model; swap backends with zero app code change | A spec, not a backend; needs a provider (flagd, LaunchDarkly, Unleash, …)

OpenFeature deserves emphasis: it is not a flag service but a vendor-neutral SDK specification with a provider abstraction, so you code against one API and back it with flagd, LaunchDarkly, Unleash or others — and switch with no application change. That is the same anti-lock-in posture a senior architect applies everywhere; treat the flag vendor as replaceable infrastructure. See Building a Vendor-Neutral Feature Flag Platform with OpenFeature for a deeper build.

The discipline flags demand

Flags are debt with interest. Each one multiplies the code paths you must reason about and test (n boolean flags imply up to 2ⁿ combinations). The non-negotiable hygiene: name and own every flag, distinguish short-lived release flags (delete the moment the feature is at 100% and stable) from long-lived operational flags (kill switches, entitlements that live forever), set expiry/review dates, and run a periodic stale-flag audit so dead conditionals are removed. A flag that has been at 100% for six months is not a flag — it is a comment that lies.
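A stale-flag audit can start as a very small script over a flag inventory. The sketch below assumes each flag is recorded with an owner, a kind and a review date; FlagRecord, stale_flags and worst_case_paths are hypothetical names, and the staleness rule is one reasonable reading of the hygiene described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagRecord:
    name: str
    owner: str
    kind: str        # "release" (short-lived) or "operational" (long-lived)
    rollout: int     # current exposure, 0-100
    review_by: date  # expiry or review date agreed with the owner


def stale_flags(flags: list, today: date) -> list:
    """Names of flags that should be reviewed or deleted."""
    stale = []
    for flag in flags:
        fully_released = flag.kind == "release" and flag.rollout == 100
        overdue = flag.review_by < today
        if fully_released or overdue:
            stale.append(flag.name)
    return stale


def worst_case_paths(boolean_flags: int) -> int:
    """Upper bound on behaviour combinations for n independent boolean flags."""
    return 2 ** boolean_flags


inventory = [
    FlagRecord("new-checkout-flow", "payments", "release", 100, date(2030, 1, 1)),
    FlagRecord("search-kill-switch", "search", "operational", 100, date(2030, 1, 1)),
    FlagRecord("old-banner", "growth", "release", 10, date(2020, 1, 1)),
]
print(stale_flags(inventory, date(2025, 1, 1)))  # → ['new-checkout-flow', 'old-banner']
print(worst_case_paths(10))                      # → 1024
```

The operational kill switch is left alone even at 100%, because it is meant to live on. The release flag at 100% and the overdue one are both flagged for removal or review.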

Rollback: the mechanic that decides your MTTR

The strategy you pick is your rollback story; they are two views of the same machine. There are four rollback mechanics (three at the infrastructure layer, plus flag-off at the application layer), and which you get is dictated by how you released:

Mechanic | How it works | Speed | Comes from
Instant traffic switch | Repoint the LB/Service/DNS/alias/slot back to the old version still running. | Seconds | Blue/green, slot swap
Traffic shift back | Reduce the canary weight to 0; stable version already serving. | Seconds–minutes | Canary / progressive (auto-abort)
Redeploy previous | Re-roll or re-image the prior version; rebuild capacity. | Minutes–tens of minutes | Rolling, recreate
Flag off | Disable the feature at runtime; binary unchanged. | Seconds | Feature flags

Two principles follow. First, the strategies with the smallest blast radius or a warm old version give the fastest rollback — which is exactly why progressive delivery and blue/green dominate for critical services, and why feature flags (a flag-off is faster than any redeploy) are the ultimate safety valve. Second, forward-fix versus rollback is a judgement call: rollback is the default reflex (restore service first, diagnose later), but if the previous version has its own known-critical bug, or — crucially — if a database migration cannot be reversed, you may be forced to roll forward with a fix. That constraint is almost always the database, which is why the next section is the one that actually makes rollback safe.

Database changes: expand/contract (and why rollback breaks without it)

Here is the trap that voids every clean rollback story above: you can roll code back in seconds, but you cannot roll a dropped column back at all. During any zero-downtime strategy, two versions of the application run against one database simultaneously — mid-roll, during a canary, on both sides of a blue/green flip. The schema must satisfy both versions at once. A migration that the new code needs but the old code cannot tolerate makes the old version un-runnable, which means you can no longer roll back.

The solution is the expand/contract pattern (also called parallel change): never make a breaking schema change in one step. Split every change into backward-compatible phases, each deployable and rollback-safe on its own.

Worked example — renaming full_name to display_name without downtime:

Phase | Database action | Application action | Why it is safe
1. Expand | Add new nullable column display_name; do not touch full_name. | None yet (or dual-write begins). | Old code ignores the new column; nothing breaks.
2. Migrate / dual-write | Backfill display_name from full_name; new code writes both columns, reads display_name (falling back to full_name). | Deploy reader/writer code. | Both columns are populated and valid; either app version works.
3. Verify | Confirm 100% backfilled and dual-write steady. | Bake. | Confidence before the irreversible step.
4. Contract | Drop full_name (and the dual-write). | Deploy code that uses only display_name. | Only run after no version reads/writes the old column.

Each phase is independently deployable and, individually, rollback-safe; you never have a moment where one running version needs a column another running version forbids. The rules generalise:

Expand/contract is what converts “we have a fast rollback” from aspiration into fact. Skip it and your elegant blue/green flip becomes a one-way door the first time a release ships a schema change.
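The first three phases of the rename can be exercised end to end with SQLite from the standard library. This is a sketch under assumptions: the users table and the helper functions are hypothetical, a real migration would run through a migration tool, and the contract step is deliberately left out because it is the irreversible one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")

# Phase 1, expand: additive and nullable, so the old version is unaffected.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_read(user_id: int) -> str:
    """Old version: only knows full_name."""
    row = conn.execute("SELECT full_name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


def new_read(user_id: int) -> str:
    """New version: prefers display_name, falls back to full_name."""
    row = conn.execute(
        "SELECT COALESCE(display_name, full_name) FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


def new_write(name: str) -> int:
    """New version dual-writes so the old version can still read the row."""
    cursor = conn.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)", (name, name)
    )
    return cursor.lastrowid


# Phase 2, migrate: backfill rows written before the new column existed.
conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")


def rows_not_backfilled() -> int:
    """Phase 3, verify: must be 0 before anyone considers the contract step."""
    return conn.execute("SELECT COUNT(*) FROM users WHERE display_name IS NULL").fetchone()[0]


grace = new_write("Grace Hopper")
print(old_read(grace), "|", new_read(1), "|", rows_not_backfilled())
```

Both versions read every row correctly at every point in the script, which is the property that keeps rollback available. Dropping full_name would end that, so it waits until no running version touches the column.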

Choosing a strategy

Walk this decision sequence:

  1. Can two versions run at once (code and schema)? If no → recreate in a maintenance window (and plan to fix the coupling, usually via expand/contract, so you can do better next time).
  2. Is downtime acceptable? Dev/batch with no live traffic → recreate is fine. Otherwise continue.
  3. How critical is the service / how strict the SLO? Low-risk internal tool with a sound readiness probe → rolling is enough. High-stakes, user-facing, tight error budget → keep going.
  4. Do you need instant rollback above all, and can you afford 2× capacity? → blue/green (or slot swap).
  5. Do you need to bound blast radius and have a metrics backend + traffic-shaping? → canary, and if your team is mature enough to trust automated analysis → progressive delivery (Argo Rollouts / Flagger). This is the modern default for critical services.
  6. Is the change a risky rewrite you want to validate under real load with zero user risk first? → shadow it, then release via canary/blue/green.
  7. Is the question “which variant performs better” (product, not ops)? → A/B via feature flags.
  8. Across all of the above: put new behaviour behind a feature flag so deploy is decoupled from release and you always have a sub-second kill switch — and gate every schema change with expand/contract.

The strategies are not mutually exclusive. A strong production posture is frequently: progressive delivery of the binary, behaviour behind feature flags, schema evolved by expand/contract, with shadowing reserved for big rewrites — giving you bounded blast radius, automatic metric-gated rollback, runtime kill switches, and migrations that never trap you.
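Steps 1 to 5 of the sequence can be written down as a small function, which is a useful way to check that the reasoning is ordered the way you think it is. This is a sketch, not a rule engine: Service and choose_strategy are hypothetical, the final fallback to rolling is an assumption, and shadowing, A/B, flags and expand/contract (steps 6 to 8) are left out because they layer on top of whatever this returns.

```python
from dataclasses import dataclass


@dataclass
class Service:
    versions_can_coexist: bool = True
    downtime_acceptable: bool = False
    critical: bool = False
    needs_instant_rollback: bool = False
    can_afford_double_capacity: bool = False
    has_metrics_and_traffic_shaping: bool = False
    trusts_automated_analysis: bool = False


def choose_strategy(service: Service) -> str:
    """Follow steps 1-5 of the decision sequence, in order."""
    if not service.versions_can_coexist:      # step 1
        return "recreate"
    if service.downtime_acceptable:           # step 2
        return "recreate"
    if not service.critical:                  # step 3
        return "rolling"
    if service.needs_instant_rollback and service.can_afford_double_capacity:  # step 4
        return "blue/green"
    if service.has_metrics_and_traffic_shaping:  # step 5
        if service.trusts_automated_analysis:
            return "progressive delivery"
        return "canary"
    return "rolling"  # assumed fallback when none of the stronger options is available


payments = Service(
    critical=True,
    has_metrics_and_traffic_shaping=True,
    trusts_automated_analysis=True,
)
print(choose_strategy(payments))  # → progressive delivery
```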

Hands-on lab

We will demonstrate canary, automatic abort and a feature-flag kill switch entirely free and locally using kind (Kubernetes in Docker) and Argo Rollouts. Allow ~30 minutes.

Prerequisites: Docker, kind, kubectl, and the Argo Rollouts kubectl plugin.

# 1) Create a throwaway cluster
kind create cluster --name deploy-lab

# 2) Install Argo Rollouts (pin a real version in production; latest is fine for a lab)
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
kubectl rollout status deploy/argo-rollouts -n argo-rollouts

# 3) Install the kubectl plugin (macOS shown; Linux: download the release binary)
brew install argoproj/tap/kubectl-argo-rollouts
kubectl argo rollouts version

Create a Rollout with explicit canary steps:

cat <<'EOF' | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo
spec:
  replicas: 5
  selector:
    matchLabels: { app: demo }
  template:
    metadata:
      labels: { app: demo }
    spec:
      containers:
        - name: demo
          image: argoproj/rollouts-demo:blue
          ports: [{ containerPort: 8080 }]
          resources:
            requests: { cpu: 5m, memory: 32Mi }
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: { duration: 30s }   # bake at 20%
        - setWeight: 60
        - pause: { duration: 30s }
        - setWeight: 100
EOF

Trigger a canary by changing the image, then watch the weighted promotion:

kubectl argo rollouts set image demo demo=argoproj/rollouts-demo:yellow
kubectl argo rollouts get rollout demo --watch

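Expected output: the watch view shows demo stepping 20% → 60% → 100%, pausing for 30s after the 20% and 60% steps, with the canary ReplicaSet scaling up as the stable one scales down. Because this Rollout defines no trafficRouting, Argo Rollouts approximates each weight with replica counts (20% of 5 replicas is 1 canary pod). This is a live canary.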

Now exercise the rollback path — abort mid-canary:

kubectl argo rollouts set image demo demo=argoproj/rollouts-demo:red
# while it is paused at 20%, simulate "bad metrics" by aborting:
kubectl argo rollouts abort demo

Expected output: the Rollout immediately returns all traffic to the previous stable version — no rebuild, no re-roll — demonstrating canary’s traffic-shift-back rollback. kubectl argo rollouts get rollout demo shows status Degraded/aborted with stable serving 100%. Promote for real with kubectl argo rollouts promote demo.

Validation checklist:

# Two ReplicaSets exist during a canary (stable + canary):
kubectl get rs -l app=demo
# After abort, the stable ReplicaSet carries all 5 replicas:
kubectl argo rollouts get rollout demo

Feature-flag mini-demo (no extra infra): the same rollouts-demo image colour is effectively a runtime toggle — flipping the image colour mirrors what a flag does at the application layer, except a real flag flips with no deploy at all. To see a true flag service, follow the OpenFeature/flagd lab linked in Next steps.

Cleanup:

kind delete cluster --name deploy-lab

Cost note: £0. kind runs in local Docker and everything is torn down with the cluster. No cloud resources are created at any point.

Common mistakes & troubleshooting

Symptom | Likely cause | Fix
Rolling update ships a broken build to 100% fast | Readiness probe only checks “process up”, not real health; no metric gate | Make the probe meaningful and move to canary/progressive with automated analysis
Blue/green flip causes errors despite green passing smoke tests | A schema change green needs is incompatible with blue, or in-flight sessions broke | Apply expand/contract; drain connections (preStop + grace period); use sticky sessions only where required
Cannot roll back after a release | A destructive migration ran in lockstep with the deploy; old code now un-runnable | Never drop/rename in the same step as code; expand/contract; forward-only with corrective migrations
Canary “looks fine” then breaks at 100% | Bake time too short, or analysis query not representative of real failure modes | Lengthen bake; query the metrics that actually correlate with user pain (error rate, p99, business KPI)
Mirrored (shadow) traffic charged cards / sent emails | Shadow pointed at production side-effecting dependencies | Point shadow at sandboxed/read-only deps; never let mirrored writes hit real systems
Manual canary “10%” doesn’t match observed traffic | Replica-ratio canary conflates pod count with traffic share | Shape traffic weight at mesh/ingress, independent of replica counts
Feature flags accumulate; behaviour becomes unpredictable | No flag lifecycle; stale flags never removed | Owner + expiry per flag; separate release vs operational flags; periodic stale-flag audit
Progressive rollout never promotes / always aborts | Threshold too strict, metrics backend unreachable, or no traffic to measure | Verify Prometheus/Datadog connectivity; loosen thresholds to realistic SLOs; ensure load (loadtester webhook) during bake

Best practices

Security notes

Interview & exam questions

  1. What is the difference between deploy and release, and why does it matter? Deploy installs the binary on infrastructure; release exposes its behaviour to users. Separating them (via feature flags) lets you ship continuously, control exposure independently, dark-launch, ring-roll, run A/B tests, and kill-switch instantly without redeploying — decoupling delivery risk from deployment.

  2. Contrast blue/green and canary. When would you pick each? Blue/green flips 100% of traffic at once between two full environments — its strength is instant rollback (flip back to the still-running old version), its cost is 2× capacity and a full blast radius at cutover. Canary exposes a small slice first and widens gradually — its strength is bounded blast radius, its cost is needing traffic-shaping and metric observation. Pick blue/green when instant rollback dominates and you can afford double capacity; pick canary when limiting how many users a bad release touches matters most. Many teams combine: a brief canary, then widen.

  3. What does a rolling update not protect you from, and what fixes it? It promotes on readiness (“process up”), which says nothing about error rate, latency, or correctness — so it can roll a 200-on-healthz/500-on-everything build to 100%. Fix: progressive delivery with automated canary analysis gating promotion on real SLO metrics, with automatic abort.

  4. Explain the expand/contract pattern and the problem it solves. Because zero-downtime strategies run two app versions against one database, a breaking schema change can make a running version un-runnable and block rollback. Expand/contract splits a change into backward-compatible phases: expand (add nullable column/dual-write), migrate/verify, then contract (drop the old shape) only after no version uses it. Each phase is independently rollback-safe.

  5. What is progressive delivery, and how do Argo Rollouts and Flagger differ? Metric-driven canary: a controller shifts traffic weight, queries a metrics backend against thresholds, and promotes or aborts automatically. Argo Rollouts introduces a Rollout CRD (a Deployment replacement) and pairs with Argo CD; Flagger wraps an existing Deployment via a Canary CRD and pairs with Flux. Both automate metric-gated canaries; choose by GitOps ecosystem.

  6. How does shadow/traffic mirroring work, and what is its single biggest risk? A copy of live traffic hits the new version while real responses still come from the old; the shadow’s responses are discarded, so user-facing blast radius is zero. Biggest risk: side effects — mirrored requests that write/charge/send do real damage unless the shadow uses isolated/read-only dependencies.

  7. How is A/B testing different from canary even though both run two versions? Intent and routing. Canary is operational (“is v2 healthy?”) routed by percentage, judged in minutes on system metrics. A/B is a product experiment (“does B convert better?”) routed by user attribute (sticky per user), judged over days on a business metric for statistical significance — usually implemented with feature flags, not infra splitting.

  8. Name three rollback mechanics and which strategy each comes from. Instant traffic switch (blue/green / slot swap) — seconds, old version still warm. Traffic shift back (canary/progressive) — reduce canary weight to 0. Redeploy previous (rolling/recreate) — slowest, must rebuild capacity. Plus flag-off (feature flags) — faster than any redeploy.

  9. When is recreate the correct choice rather than a smell? When two versions genuinely cannot coexist — a singleton holding an exclusive lock, or schemas that truly conflict — and you have a maintenance window; or for dev/batch workloads with no live traffic. Elsewhere it means accepting downtime and a slow rollback.

  10. Why are feature flags described as “debt”, and how do you manage it? Each flag multiplies code paths and test combinations (2ⁿ for n flags). Manage with: an owner and expiry per flag, separating short-lived release flags (delete at 100%/stable) from long-lived operational flags (kill switches, entitlements), and a periodic stale-flag audit.

  11. What is OpenFeature and why might an architect mandate it? A CNCF vendor-neutral SDK specification with a provider abstraction. You code against one API and back it with flagd, LaunchDarkly, Unleash, etc., swapping backends with no application change — avoiding vendor lock-in on the flag layer.

  12. A release breaks at 100% but the previous version has a separate known-critical bug. Roll back or forward? You likely cannot safely roll back (old version is also broken), so roll forward with a fix — provided no irreversible migration is involved. This is exactly why expand/contract and a fast, well-rehearsed forward-fix path matter; the database constraint, not the code, usually forces the decision.
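
The expand/contract answer in question 4 can be made concrete with a small sketch. The one below is illustrative only: it uses an in-memory SQLite database, and the `customers` table, the `address` and `city` columns, and the `v1_insert` / `v2_insert` helpers are hypothetical names standing in for the old and new application versions. It stops before the contract phase and only checks whether contracting would be safe.

```python
import sqlite3

def expand(conn):
    """Expand: add the new shape alongside the old one (backward-compatible)."""
    # The new column is nullable, so inserts from the old version keep working.
    conn.execute("ALTER TABLE customers ADD COLUMN city TEXT")

def v1_insert(conn, name, address):
    """Old app version (hypothetical): knows only the original 'address' column."""
    conn.execute("INSERT INTO customers (name, address) VALUES (?, ?)", (name, address))

def v2_insert(conn, name, street, city):
    """New app version (hypothetical): dual-writes the old column and the new one."""
    conn.execute(
        "INSERT INTO customers (name, address, city) VALUES (?, ?, ?)",
        (name, f"{street}, {city}", city),
    )

def backfill(conn):
    """Migrate: populate the new column for rows written by the old version."""
    rows = conn.execute("SELECT id, address FROM customers WHERE city IS NULL").fetchall()
    for row_id, address in rows:
        # Assumes the city is the text after the last comma; real data needs real parsing.
        city = address.rsplit(",", 1)[-1].strip()
        conn.execute("UPDATE customers SET city = ? WHERE id = ?", (city, row_id))

def safe_to_contract(conn):
    """Verify: contract only once no row depends on the old shape alone."""
    (missing,) = conn.execute("SELECT COUNT(*) FROM customers WHERE city IS NULL").fetchone()
    return missing == 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
v1_insert(conn, "Ada", "1 High St, Leeds")
expand(conn)
v1_insert(conn, "Bob", "2 Low Rd, York")   # old version still runs after expand
v2_insert(conn, "Cy", "3 Mid Ln", "Bath")  # new version runs alongside it
before = safe_to_contract(conn)
backfill(conn)
after = safe_to_contract(conn)
cities = [row[0] for row in conn.execute("SELECT city FROM customers ORDER BY id")]
```

Dropping the old column (the contract step) is deliberately left out: it is the one irreversible action, so it belongs in a later release once `safe_to_contract` holds and no deployed version still reads `address`.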

Quick check

  1. Which strategy gives the fastest rollback, and why?
  2. Which strategy gives the smallest blast radius for a bad release?
  3. In a zero-downtime deploy, how many application versions run against the database at once, and what pattern keeps that safe?
  4. What is the difference in routing key between canary and A/B testing?
  5. What is the one thing shadow traffic must never do?

Answers

  1. Blue/green (and slot swap) — the old version is still running and fully provisioned, so rollback is the cutover switch in reverse, effectively instant with no rebuild. (Feature-flag off is comparably fast for flagged behaviour.)
  2. Canary / progressive delivery — only a small percentage of users meets the new version before you observe and decide, bounding harm to that slice.
  3. Two (old and new run simultaneously mid-roll / across the flip / during canary). Expand/contract (parallel change) keeps the schema backward-compatible so both versions work and rollback stays possible.
  4. Canary routes by traffic percentage (operational: is it healthy?); A/B routes by user attribute with sticky per-user assignment (product: which converts better?).
  5. Cause real side effects — its responses must be discarded and its dependencies isolated/read-only; mirrored writes, charges or emails must never reach production systems.
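
The routing-key difference in answer 4 can be sketched in a few lines. This is a simplified illustration, not how any particular load balancer or flag service works: real canaries usually rely on weighted routing in the mesh or load balancer, and the `bucket`, `canary_route` and `ab_variant` functions are hypothetical helpers. The point is only which key feeds the decision.

```python
import hashlib

def bucket(key: str, salt: str) -> int:
    """Map a string key to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
    return int(digest, 16) % 100

def canary_route(request_id: str, canary_percent: int) -> str:
    """Canary: keyed on the request, so roughly canary_percent of traffic hits v2."""
    return "v2" if bucket(request_id, "canary") < canary_percent else "v1"

def ab_variant(user_id: str, experiment: str, b_percent: int = 50) -> str:
    """A/B: keyed on the user, so the same user always sees the same variant."""
    return "B" if bucket(user_id, experiment) < b_percent else "A"
```

Because `ab_variant` hashes the user ID together with the experiment name, a user's assignment is sticky within one experiment but independent across experiments. `canary_route` makes no such promise per user: two requests from the same person can land on different versions.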

Exercise

Take a service you know (or invent a realistic one: a checkout API with a Postgres database, a 99.9% availability SLO, ~500 req/s).

  1. Pick a strategy using the decision sequence, and write two sentences justifying it in terms of blast radius, rollback speed, cost and verification depth.
  2. Design the rollback path: which of the three mechanics applies, and what is the exact command/action and expected time-to-safe?
  3. Plan a breaking schema change (e.g. splitting address into structured fields) as an expand/contract sequence — list each phase’s DB action, app action, and why it is rollback-safe.
  4. Add one feature flag: name it, classify it (release vs operational), set an expiry/owner, and state what its kill-switch protects.
  5. Bonus: specify the automated canary analysis — which metric, threshold, bake time, and abort condition — and explain why that metric (not readiness) reflects user pain.

Write it up as a one-page deployment runbook; that artefact is exactly what a hiring panel or a change-advisory board wants to see.
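
For the bonus step, it can help to write the analysis down as logic before translating it into a tool's configuration. The sketch below is a hedged illustration and not the Argo Rollouts or Flagger API: `CanarySpec`, its fields and `analyse` are hypothetical names, and the samples are assumed to be per-interval `(requests, errors)` counts pulled from a metrics backend.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CanarySpec:
    max_error_rate: float      # abort threshold, e.g. 0.01 means 1% of requests failing
    bake_intervals: int        # consecutive healthy samples required before promoting
    max_failed_intervals: int  # failed samples tolerated before aborting

def analyse(spec: CanarySpec, samples: list[tuple[int, int]]) -> str:
    """Return 'promote', 'abort' or 'continue' from (requests, errors) samples."""
    healthy_streak = 0
    failed = 0
    for requests, errors in samples:
        if requests == 0:
            # No traffic means no evidence; do not count the interval as healthy.
            healthy_streak = 0
            continue
        if errors / requests > spec.max_error_rate:
            failed += 1
            healthy_streak = 0
            if failed > spec.max_failed_intervals:
                return "abort"
        else:
            healthy_streak += 1
            if healthy_streak >= spec.bake_intervals:
                return "promote"
    return "continue"
```

With `CanarySpec(max_error_rate=0.01, bake_intervals=3, max_failed_intervals=1)`, three clean intervals in a row promote, a second failing interval aborts, and anything short of either keeps the canary at its current weight. Error rate is measured on requests users actually made, which is why it reflects user pain in a way that a passing readiness probe does not.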

Certification mapping

Glossary

Next steps
