Blue-Green on Kubernetes with Argo Rollouts: Preview Services, Analysis Gates, and Automated Promotion

Canary shifts a percentage of live traffic onto the new version and measures it under partial exposure. Blue-green does the opposite: it stands up the entire new version, validates it out of band while zero production traffic touches it, then flips 100% of traffic in a single atomic selector change. When the new code path is right, that swap is instant; when it is wrong, the rollback is equally instant because the old ReplicaSet is still running. Argo Rollouts makes this a first-class strategy with a preview service, pre-promotion analysis, and a tunable window during which the old stack stays warm. This article builds the full flow and the guardrails around it.

1. Blue-green vs. canary: when a full-environment swap is the right call

Both are progressive delivery; they fail in different shapes. Canary bounds blast radius by traffic percentage over time. Blue-green bounds it by time-to-validate before any user is exposed at all, then accepts a binary cutover.

| Dimension | Blue-green | Canary |
|---|---|---|
| Production exposure during validation | Zero (preview service only) | Live, weighted (e.g. 5% to 50%) |
| Cutover | Atomic, 100% at once | Gradual over steps |
| Rollback | Re-point active selector, instant | Set weight back to 0, near-instant |
| Cost during release | 2x replicas (both stacks full) | ~1x + canary delta |
| Best for | Schema-coupled releases, batch/stateful workloads, “validate then flip” change windows | Stateless HTTP services where you want real-traffic signal |

Reach for blue-green when partial exposure is meaningless or dangerous: a release coupled to a forward-compatible database migration you want fully smoke-tested before any user hits it, a queue consumer where “5% of traffic” is not a coherent concept, or a regulated change window where you must prove the green stack healthy before flipping. Reach for canary when real user traffic is the only honest signal and you can tolerate a small cohort seeing the new version.

Blue-green’s defining cost is that you run two full copies of the workload simultaneously. Budget the headroom, and tune scaleDownDelaySeconds (covered below) so you do not pay 2x replicas any longer than your rollback window requires.

2. Install the controller (brief)

If you have not already deployed the controller, install it and the kubectl plugin. Pin a real release tag in production rather than latest, and manage the manifest through GitOps.

kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts \
  -f https://github.com/argoproj/argo-rollouts/releases/download/v1.7.2/install.yaml
kubectl rollout status deploy/argo-rollouts -n argo-rollouts

# kubectl plugin (macOS)
brew install argoproj/tap/kubectl-argo-rollouts
kubectl argo rollouts version

3. Define the Rollout with activeService, previewService, and the blue-green strategy

Blue-green needs two Service objects pointing at the same pod label set. The controller manages their selectors by injecting a generated rollouts-pod-template-hash so that active always routes to the live ReplicaSet and preview always routes to the new one being validated.

# services.yaml -- two Services, identical app selector, no hash (the controller adds it)
apiVersion: v1
kind: Service
metadata:
  name: checkout-active
  namespace: shop
spec:
  selector:
    app: checkout
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: checkout-preview
  namespace: shop
spec:
  selector:
    app: checkout
  ports:
    - port: 80
      targetPort: 8080

The Rollout itself. Note kind: Rollout, apiVersion: argoproj.io/v1alpha1, and the strategy.blueGreen block that wires the two services together:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout
  namespace: shop
spec:
  replicas: 6
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.42.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
  strategy:
    blueGreen:
      activeService: checkout-active
      previewService: checkout-preview
      autoPromotionEnabled: false      # require an explicit promote (see step 5)
      scaleDownDelaySeconds: 600        # keep old RS warm 10 min after cutover
      prePromotionAnalysis:
        templates:
          - templateName: smoke-and-slo
        args:
          - name: preview-service
            value: checkout-preview
      postPromotionAnalysis:
        templates:
          - templateName: post-cutover-error-rate
        args:
          - name: active-service
            value: checkout-active

When you apply a new image, the controller creates a fresh ReplicaSet, points checkout-preview at it, and, because autoPromotionEnabled is false, pauses with the green stack fully scaled but receiving no production traffic. The active selector still points at the old (blue) ReplicaSet. Nothing has been promoted yet.
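To make that paused state concrete, the relevant part of each Service might look like the sketch below. This is a partial, illustrative view rather than something to apply: the hash values are made up, and the controller generates the real ones from the pod template.

```yaml
# checkout-active while the Rollout is paused: still selecting the old (blue) pods.
# Ports are omitted for brevity.
apiVersion: v1
kind: Service
metadata:
  name: checkout-active
  namespace: shop
spec:
  selector:
    app: checkout
    rollouts-pod-template-hash: 5d8f7c9b64   # hypothetical hash of the blue ReplicaSet
---
# checkout-preview at the same moment: selecting the new (green) pods.
apiVersion: v1
kind: Service
metadata:
  name: checkout-preview
  namespace: shop
spec:
  selector:
    app: checkout
    rollouts-pod-template-hash: 7b6c5d4f89   # hypothetical hash of the green ReplicaSet
```

Promotion changes only the hash on checkout-active so that it matches the one on checkout-preview.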

4. Validate the green stack with pre-promotion analysis and smoke jobs

prePromotionAnalysis runs an AnalysisRun against the preview service before the controller will cut over. If it fails, the Rollout is marked Degraded and the cutover never happens. This is where you put smoke tests and out-of-band SLO checks.

A robust template combines two measurement styles: a Job-based smoke test (run a pod, assert exit code 0) and a metric query (assert the preview is actually serving). Argo Rollouts treats an AnalysisRun as successful only when every metric meets its successCondition within the allowed failureLimit.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-and-slo
  namespace: shop
spec:
  args:
    - name: preview-service
  metrics:
    # 1) Job-based smoke test: hit the preview service end to end.
    - name: smoke
      provider:
        job:
          spec:
            backoffLimit: 1
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: smoke
                    image: registry.example.com/checkout-smoke:1.0.0
                    command: ["/bin/sh", "-c"]
                    args:
                      - |
                        set -e
                        curl -fsS http://{{args.preview-service}}/healthz
                        curl -fsS http://{{args.preview-service}}/api/cart/selftest
    # 2) Metric: preview success rate must be healthy under synthetic load.
    - name: preview-success-rate
      initialDelay: 30s
      interval: 30s
      count: 5
      successCondition: result[0] >= 0.99
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc.cluster.local:9090
          query: |
            sum(rate(http_requests_total{service="checkout-preview",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="checkout-preview"}[2m]))

For the job provider, the measurement is the pod’s exit status: a zero exit is Successful, non-zero is Failed. That makes it the natural place for contract tests, migration dry-runs, or a Postman/k6 suite packaged as an image. The metric provider, meanwhile, requires that your synthetic traffic actually exercise the preview service so the query returns a meaningful ratio – a quiet preview returns NaN and fails the successCondition, which is the safe default.
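One way to make sure the Prometheus metric has traffic to measure is to add a load-generating Job as a further metric in the same template, relying on the metrics of one AnalysisRun being measured concurrently. The entry below is a minimal sketch, not part of the original template: the metric name, the image, the request rate, and the three-minute duration are all assumptions chosen to cover the five 30-second measurements above.

```yaml
# Hypothetical extra entry under spec.metrics in the smoke-and-slo template.
- name: synthetic-load
  provider:
    job:
      spec:
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: load
                image: curlimages/curl:8.8.0   # assumed image; anything with sh and curl works
                command: ["/bin/sh", "-c"]
                args:
                  - |
                    # Send about one request per second for roughly three minutes.
                    end=$(( $(date +%s) + 180 ))
                    while [ "$(date +%s)" -lt "$end" ]; do
                      curl -fsS -o /dev/null "http://{{args.preview-service}}/api/cart/selftest" || exit 1
                      sleep 1
                    done
```

Because the loop exits non-zero on the first failed request, this Job doubles as a coarse availability check. Drop the `|| exit 1` if the load generator should only produce traffic and leave judgment to the Prometheus metric.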

5. Manual gates, autoPromotionEnabled, and scaleDownDelay tuning

These three knobs decide who promotes and how long you can roll back.

autoPromotionEnabled. With false, the Rollout pauses after pre-promotion analysis passes and waits for an explicit promote. With true (the default), it cuts over automatically the moment analysis succeeds. For a regulated change window you want false; for a fully metric-gated pipeline you may trust true. You can also set autoPromotionSeconds to auto-promote after a fixed soak even without manual action.

Promote manually with the plugin once you have eyeballed the green stack:

# Watch the rollout pause at the pre-promotion gate
kubectl argo rollouts get rollout checkout -n shop --watch

# Promote: flip active -> green
kubectl argo rollouts promote checkout -n shop

# Or abort and tear down the green stack instead
kubectl argo rollouts abort checkout -n shop

scaleDownDelaySeconds. After cutover, the old ReplicaSet is not deleted immediately – it is scaled to zero only after this delay (default 30 seconds). This is your instant-rollback window: as long as the old RS exists, undo re-points the active selector to it in one step. Set it to cover the time it takes your alerting and on-call to notice a bad release. A common production value is 300 to 900 seconds.

Trade-off: a longer scaleDownDelaySeconds means you pay for 2x replicas for that whole window. On expensive node pools, pair it with a scaleDownDelayRevisionLimit so you keep only the last N old ReplicaSets warm rather than an unbounded set.

strategy:
  blueGreen:
    activeService: checkout-active
    previewService: checkout-preview
    autoPromotionEnabled: false
    scaleDownDelaySeconds: 600
    scaleDownDelayRevisionLimit: 1   # keep only the immediately-previous RS warm
    antiAffinity:                    # optional: spread blue and green across nodes
      preferredDuringSchedulingIgnoredDuringExecution:
        weight: 100

6. Wiring Prometheus, Datadog, or Job-based providers into analysis

The same AnalysisTemplate shape supports multiple providers; you pick per metric. Job-based providers were shown in step 4. The two most common metric providers are Prometheus and Datadog.

Datadog requires a Secret holding the API and app keys. By default the controller looks for a Secret named datadog in its own namespace, as in the manifest below. Use apiVersion: v2 of the Datadog provider for the current query semantics:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: post-cutover-error-rate
  namespace: shop
spec:
  args:
    - name: active-service
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result < 0.01
      failureLimit: 2
      provider:
        datadog:
          apiVersion: v2
          interval: 5m
          query: |
            sum:trace.http.request.errors{service:checkout}.as_count() /
            sum:trace.http.request.hits{service:checkout}.as_count()
---
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: argo-rollouts        # provider reads it from the controller namespace
type: Opaque
stringData:
  api-key: "<DATADOG_API_KEY>"
  app-key: "<DATADOG_APP_KEY>"
  address: "https://api.datadoghq.com"

Two rules keep analysis honest regardless of provider. First, failureLimit should be greater than zero so a single scrape blip does not abort a good release, but small enough that a real regression trips it within a couple of intervals. Second, prefer ratios with a guard on volume: a successCondition of result[0] >= 0.99 evaluated on a ratio whose denominator is only a handful of requests can pass without proving anything, so add a separate metric asserting minimum request volume during the analysis window.
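A volume guard of that kind could be sketched as one more metric in the smoke-and-slo template. The metric name and the threshold of one request per second are assumptions; pick a floor that matches the synthetic load you actually drive.

```yaml
# Hypothetical extra entry under spec.metrics in the smoke-and-slo template.
- name: preview-min-volume
  initialDelay: 30s
  interval: 30s
  count: 5
  failureLimit: 1
  # The len() check makes an empty result (no series at all) count as a failed
  # measurement instead of an evaluation error on result[0].
  successCondition: len(result) > 0 && result[0] >= 1
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc.cluster.local:9090
      query: |
        sum(rate(http_requests_total{service="checkout-preview"}[2m]))
```

With this in place, the success-rate metric can only pass alongside evidence that the preview Service actually handled requests during the window.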

7. Traffic cutover: service selectors vs. ingress/Gateway API

The default blue-green mechanism is pure service selector swapping: the controller mutates the selector of the active Service to carry the new pod-template hash. This works with any Service type and needs no traffic-management add-on – the cutover is a single Kubernetes API write and propagates as fast as kube-proxy/endpoints reconcile.

That default does not, by itself, control an external load balancer or an L7 router. If clients reach the app through an Ingress or the Gateway API, the swap of the Service selector still works as long as the Ingress backend targets the active Service by name – the Ingress points at checkout-active, and the controller changes which pods that Service selects. For finer control (header-based preview routing, or swapping which Service the route targets) Argo Rollouts integrates with traffic routers; for blue-green the common, robust pattern is to keep the Ingress/Gateway pinned to the active Service and let selector swapping do the cutover:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout
  namespace: shop
spec:
  parentRefs:
    - name: shop-gateway
  rules:
    - backendRefs:
        - name: checkout-active     # pinned; controller swaps the Service's pods
          port: 80

Expose the preview Service to your validation tooling on a separate hostname or internal-only route so smoke jobs and manual checks can reach green without touching production ingress. Never wire the preview Service into the production route – that defeats the entire isolation guarantee of blue-green.
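A separate internal route for the preview Service might look like the following sketch. The Gateway name and hostname are hypothetical; the point is only that this route attaches to a different parent than the production one.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout-preview
  namespace: shop
spec:
  parentRefs:
    - name: internal-gateway                    # hypothetical internal-only Gateway
  hostnames:
    - checkout-preview.internal.example.com     # hypothetical internal hostname
  rules:
    - backendRefs:
        - name: checkout-preview                # always the green stack under validation
          port: 80
```

Smoke Jobs running inside the cluster can skip this route and call the Service by name; the route is mainly useful for manual checks and tooling outside the namespace.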

8. Instant rollback: re-point the active selector, preserve the old ReplicaSet

Because the old ReplicaSet stays scaled up for scaleDownDelaySeconds, rollback is a selector flip, not a redeploy. Two paths:

# If you are still within the pre-promotion pause (never cut over):
kubectl argo rollouts abort checkout -n shop

# If you already promoted but are still inside the scaleDownDelay window:
kubectl argo rollouts undo checkout -n shop                     # roll back to the previous revision
# ...or target a specific revision instead (41 is an example number):
kubectl argo rollouts undo checkout -n shop --to-revision=41

undo re-points the active Service at the previous ReplicaSet’s pods. As long as that ReplicaSet has not been scaled to zero, traffic returns to the known-good version in seconds with no image pull, no scheduling, no cold start. This is the single most important reason to set scaleDownDelaySeconds deliberately: it is your rollback budget. Once the delay elapses and the old RS scales to zero, an undo still works but now incurs a full scale-up, so it is no longer instant.

postPromotionAnalysis (wired in step 3) automates the same safety: it runs after cutover, and if the error rate violates its successCondition, the controller aborts the Rollout and switches the active Service back to the previous ReplicaSet. The Rollout is left in a Degraded state so you can investigate before trying again.

9. Dashboarding rollout state and alerting on aborted promotions

The controller exposes Prometheus metrics on port 8090; the most actionable is rollout_info with a phase label (Healthy, Paused, Degraded, Progressing). Scrape it and alert on Degraded and on aborted analysis.

groups:
  - name: argo-rollouts
    rules:
      - alert: RolloutDegraded
        expr: rollout_info{phase="Degraded"} == 1
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Rollout {{ $labels.name }} in {{ $labels.namespace }} is Degraded"
      - alert: RolloutStuckPaused
        expr: rollout_info{phase="Paused"} == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Rollout {{ $labels.name }} paused >30m (awaiting promotion?)"

A Degraded phase almost always means a pre- or post-promotion AnalysisRun failed – which is exactly the aborted-promotion signal you want to page on. Pair the metric alert with the built-in dashboard from the kubectl plugin (kubectl argo rollouts dashboard) for a live view during change windows.
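The alert rules above only fire if Prometheus is scraping the controller. A minimal static scrape job could look like the sketch below; it assumes the stock install manifest's metrics Service is named argo-rollouts-metrics in the argo-rollouts namespace, so verify the name with kubectl get svc -n argo-rollouts before relying on it.

```yaml
# Fragment of prometheus.yml; merge into your existing scrape_configs.
scrape_configs:
  - job_name: argo-rollouts
    metrics_path: /metrics
    static_configs:
      - targets:
          # Assumed Service name and namespace; port 8090 is the controller metrics port.
          - argo-rollouts-metrics.argo-rollouts.svc.cluster.local:8090
```

If you run the Prometheus Operator, the same target is usually expressed as a ServiceMonitor instead of a static job.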

Verify

Confirm the blue-green machinery behaves before you trust it with production traffic.

# 1) Trigger a release and confirm it PAUSES at the pre-promotion gate (no cutover yet)
kubectl argo rollouts set image checkout checkout=registry.example.com/checkout:1.43.0 -n shop
kubectl argo rollouts get rollout checkout -n shop
#   Expect: status Paused, BlueGreenPause, active still on old RS, preview on new RS

# 2) Confirm the active Service still selects the OLD pod hash, preview selects the NEW one
kubectl get svc checkout-active checkout-preview -n shop \
  -o custom-columns=NAME:.metadata.name,HASH:.spec.selector.rollouts-pod-template-hash

# 3) Confirm the AnalysisRun for pre-promotion ran and passed
kubectl get analysisrun -n shop
kubectl argo rollouts get rollout checkout -n shop | grep -i analysis

# 4) Promote and confirm the active Service hash flips to the new RS
kubectl argo rollouts promote checkout -n shop
kubectl get svc checkout-active -n shop -o jsonpath='{.spec.selector.rollouts-pod-template-hash}'

# 5) Confirm the OLD ReplicaSet is still up (rollback window) until scaleDownDelay elapses
kubectl get rs -n shop -l app=checkout

# 6) Force a rollback and confirm traffic returns to the previous revision instantly
kubectl argo rollouts undo checkout -n shop
kubectl argo rollouts status checkout -n shop   # Expect: Healthy on prior revision

If step 1 cuts straight to Healthy without pausing, check that autoPromotionEnabled is false. If step 3 shows the AnalysisRun as Failed, inspect it with kubectl describe analysisrun <name> -n shop – a failed smoke Job or a NaN Prometheus result are the usual causes.

Enterprise scenario

A payments platform team ran a stateful ledger-reconciliation service behind blue-green. Their constraint: every release was coupled to a database migration, and compliance required that the new version be proven correct against a replica of production data before any customer transaction touched it – partial canary exposure was explicitly disallowed by their change-control policy. Their first cut used autoPromotionEnabled: true with a single Prometheus success-rate check, and it bit them: the preview service had almost no synthetic traffic, so the ratio query returned NaN, Argo Rollouts treated the analysis window as inconclusive-but-not-failing on an early scrape, and a release with a broken migration auto-promoted.

The fix had three parts. They moved validation into a Job-based pre-promotion metric that replayed a recorded transaction corpus through the preview service and asserted ledger balances reconciled to the cent (exit 0 or fail). They set autoPromotionEnabled: false so a release engineer had to issue the promote inside the approved window. And they raised scaleDownDelaySeconds to 900 with scaleDownDelayRevisionLimit: 1, giving on-call a 15-minute instant-rollback budget while capping the 2x-replica cost to one prior revision.

strategy:
  blueGreen:
    activeService: ledger-active
    previewService: ledger-preview
    autoPromotionEnabled: false
    scaleDownDelaySeconds: 900
    scaleDownDelayRevisionLimit: 1
    prePromotionAnalysis:
      templates:
        - templateName: ledger-replay   # Job: replay corpus, assert balances; exit 0 = pass
      args:
        - name: preview-service
          value: ledger-preview
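The ledger-replay template itself is not shown in the source. A sketch of what such a Job-based template could look like follows; the namespace, image, command, flags, corpus path, and deadline are all hypothetical stand-ins for the team's real replay tooling.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ledger-replay
  namespace: payments                  # hypothetical namespace
spec:
  args:
    - name: preview-service
  metrics:
    - name: replay-and-reconcile
      provider:
        job:
          spec:
            backoffLimit: 0              # a failed replay fails the gate; no retries
            activeDeadlineSeconds: 1800  # hypothetical upper bound on replay time
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: replay
                    image: registry.example.com/ledger-replay:1.0.0   # hypothetical image
                    # Hypothetical CLI: replays the recorded corpus against the preview
                    # Service and exits non-zero if any balance fails to reconcile.
                    command: ["/replay"]
                    args:
                      - "--target=http://{{args.preview-service}}"
                      - "--corpus=/corpus/transactions.jsonl"
```

The gate depends only on the Job's exit status, so there is no ratio to misread when the preview is quiet.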

The lesson generalized across their platform: blue-green analysis is only as trustworthy as the load you drive at the preview service. A metric query over a quiet preview is not validation – the Job provider, which asserts a concrete pass/fail on real inputs, is the right primitive when a regression must be impossible to promote past.
