
Configure Harness CD Pipelines with Continuous Verification and Canary Stages

A payments platform team ships a checkout service forty times a week to a 30-node EKS cluster, and the last three Sev-2 incidents all followed the same script: a deploy went green, the pods passed their readiness probes, and only fifteen minutes later did the p99 latency and the 5xx rate creep past the threshold — by which point the bad version was already taking 100% of traffic and the on-call was reverse-engineering a rollback at 2 a.m. The mandate from the new head of platform is blunt: “No release takes full traffic until the metrics prove it is healthy, and if it is not, roll it back without a human.” That is exactly what Harness Continuous Delivery with Continuous Verification (CV) does — progressive canary rollout where each traffic increment is gated by an automated analysis of Prometheus and Datadog metrics, comparing the canary against the stable baseline and auto-rolling-back on a regression. This guide builds that pipeline end to end.

The pattern is worth naming precisely because teams conflate two things. A canary shifts a small slice of traffic to the new version. Continuous Verification is the judgment layer on top: Harness queries your observability stack during the canary window, runs anomaly detection (time-series comparison or a fixed threshold) against the metrics you nominate, assigns a risk score, and either promotes to the next phase or fails the stage and triggers rollback. Without CV, a canary is just a slower way to ship a bad build to everyone. With it, the metrics make the promotion decision.
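To make the distinction concrete, here is a minimal Python sketch of the kind of judgment a verification gate makes for a single metric. It is not Harness's actual analysis; the function name `canary_verdict`, the relative-deviation rule, and the 20% tolerance are all assumptions chosen for illustration.

```python
from statistics import mean


def canary_verdict(baseline, canary, tolerance=0.20):
    """Return 'promote' or 'rollback' for one higher-is-worse metric.

    baseline / canary: samples (for example p99 latency in ms) collected from
    the stable pods and the canary pods over the same window.
    tolerance: hypothetical allowed relative increase (0.20 means 20%).
    """
    if not baseline or not canary:
        # Missing data counts as a failure: an unverified canary must not promote.
        return "rollback"
    base, cand = mean(baseline), mean(canary)
    if base == 0:
        # Avoid dividing by zero: any non-zero canary value is a regression.
        return "rollback" if cand > 0 else "promote"
    deviation = (cand - base) / base
    return "rollback" if deviation > tolerance else "promote"


print(canary_verdict([100, 110, 105], [104, 108, 111]))  # promote
print(canary_verdict([100, 110, 105], [180, 200, 190]))  # rollback
```

The point of the sketch is the shape of the decision: the canary is judged against the stable pods over the same window, not against an absolute number, and an absent signal is treated as a failure rather than a pass.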

Prerequisites

Target topology

[Figure: target topology for the Harness CD pipeline with Continuous Verification and canary stages]

The moving parts, and where each tool earns its place. Harness Manager (SaaS control plane) holds the pipeline definition but never touches your cluster directly — every action runs through a Harness Delegate Pod inside the EKS VPC, which is what lets the metric queries hit Prometheus and Datadog over private networking. Engineers sign in to Harness through Okta, federated to Microsoft Entra ID over SAML/OIDC, so Harness RBAC keys off the same groups as the rest of the estate. The pipeline pulls its Kubernetes manifests from Git (GitHub Actions builds and pushes the image; Argo CD optionally owns the desired state if you run GitOps), and resolves all secrets — the AWS role, the Datadog keys — from HashiCorp Vault via Harness’s Vault Secret Manager, so nothing sensitive lives in the pipeline YAML. During the canary window the Delegate queries Prometheus (in-cluster, request latency and error rate) and Datadog (APM traces and host metrics) for the CV analysis. A high-risk verdict triggers automatic rollback and raises a ServiceNow change/incident record; Dynatrace (or Datadog dashboards) gives the humans the wider view. The image to deploy is the artifact built upstream; everything below is the delivery and verification layer on top of it.

1. Install and verify the Delegate

The Delegate is the workhorse. Install it once per cluster with Helm and confirm it registers before you build anything on top.

# Add the Harness Helm repo
helm repo add harness-delegate https://app.harness.io/storage/harness-download/delegate-helm-chart/
helm repo update

# Install the Delegate into a dedicated namespace.
# DELEGATE_TOKEN and the account/manager URLs come from
# Harness > Account Settings > Delegates > New Delegate (Helm).
helm upgrade --install harness-delegate harness-delegate/harness-delegate-ng \
  --namespace harness-delegate-ng --create-namespace \
  --set delegateName=eks-prod-delegate \
  --set accountId="$HARNESS_ACCOUNT_ID" \
  --set delegateToken="$DELEGATE_TOKEN" \
  --set managerEndpoint="https://app.harness.io" \
  --set delegateDockerImage="harness/delegate:25.05.85503" \
  --set replicas=2 --set upgrader.enabled=true

Verify it is healthy and connected:

kubectl -n harness-delegate-ng rollout status deploy/eks-prod-delegate
kubectl -n harness-delegate-ng get pods -l harness.io/name=eks-prod-delegate
# In the UI: Account Settings > Delegates — the row shows "Connected" (green).

Give the Delegate a Kubernetes ServiceAccount with enough RBAC to apply manifests and patch Deployments in your target namespace. A cluster-admin binding works for a lab; scope it to the app namespaces for production.

2. Connect to Entra ID SSO and wire Vault for secrets

Before any pipeline, get identity and secrets out of the pipeline body.

SSO (Okta → Entra ID → Harness). In Account Settings > Authentication, add a SAML provider pointed at Microsoft Entra ID. Because the workforce IdP is Okta, configure Okta to federate into Entra (Entra treats Okta as an external identity source), so the SAML assertion Harness receives carries the user’s Entra group claims. Map those groups to Harness User Groups, then bind roles — for example, only the platform-sre group gets the role that can approve a production canary promotion. Turn on “Enforce SAML login” so local passwords cannot bypass SSO.

Secrets (HashiCorp Vault). In Account Settings > Connectors > Secret Managers, add a HashiCorp Vault connector. Authenticate the Delegate to Vault with the Kubernetes auth method so no static Vault token is stored:

# On the Vault side: enable k8s auth and bind the Delegate's ServiceAccount
vault auth enable kubernetes
vault write auth/kubernetes/role/harness-delegate \
  bound_service_account_names=eks-prod-delegate \
  bound_service_account_namespaces=harness-delegate-ng \
  policies=harness-cd-read ttl=20m

# Policy: read-only on the paths the pipeline needs
vault policy write harness-cd-read - <<'EOF'
path "secret/data/cd/datadog/*" { capabilities = ["read"] }
path "secret/data/cd/aws/*"     { capabilities = ["read"] }
EOF

Now the AWS role ARN and the Datadog DD_API_KEY and DD_APP_KEY are referenced in pipeline YAML through secret expressions such as <+secrets.getValue("datadogApiKey")>, which the Delegate resolves from Vault at runtime. Never inline a credential in the pipeline.

3. Create the Project, Service, and Environment

Use the Harness CLI so the setup is reproducible. Authenticate once:

harness login --api-key "$HARNESS_API_KEY" --account-id "$HARNESS_ACCOUNT_ID"

Service — the what to deploy. Define a Kubernetes service whose manifests come from Git and whose image comes from your registry. Save this as service.yaml:

service:
  name: checkout
  identifier: checkout
  serviceDefinition:
    type: Kubernetes
    spec:
      manifests:
        - manifest:
            identifier: checkout_manifests
            type: K8sManifest
            spec:
              store:
                type: Github
                spec:
                  connectorRef: github_checkout   # GitHub Actions builds; this reads manifests
                  gitFetchType: Branch
                  branch: main
                  paths: [deploy/k8s]
              valuesPaths: [deploy/k8s/values.yaml]
      artifacts:
        primary:
          primaryArtifactRef: ecr_checkout
          sources:
            - identifier: ecr_checkout
              type: Ecr
              spec:
                connectorRef: aws_ecr            # role resolved from Vault
                region: ap-south-1
                imagePath: checkout              # repository name only; the connector and region identify the registry
                tag: <+input>

Environment + Infrastructure — the where. Point it at the EKS cluster via a Kubernetes connector that uses the in-cluster Delegate:

environment:
  name: prod
  identifier: prod
  type: Production
infrastructureDefinition:
  name: eks-prod
  identifier: eks_prod
  environmentRef: prod
  deploymentType: Kubernetes
  type: KubernetesDirect
  spec:
    connectorRef: eks_prod_k8s     # uses eks-prod-delegate
    namespace: checkout
    releaseName: release-<+INFRA_KEY_SHORT_ID>

Apply the service and environment definitions:

harness service     apply --file service.yaml
harness environment apply --file environment.yaml

4. Register the health sources: Prometheus and Datadog

CV needs to know which metrics decide health. In Project Setup > Monitored Services, create a Monitored Service bound to checkout + prod, then add two Health Sources.

Prometheus health source — error rate and latency from in-cluster Prometheus. Add a Prometheus connector first (it points at the kube-prometheus-stack service URL, reachable by the Delegate):

# Sanity-check the queries the health source will run, from the Delegate's vantage point.
# 5xx rate for the checkout service:
curl -s "http://prometheus-operated.monitoring:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{app="checkout",status=~"5.."}[5m]))'

# p99 latency:
curl -s "http://prometheus-operated.monitoring:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{app="checkout"}[5m])) by (le))'

In the health source, assign the 5xx query to the Errors risk category and the latency query to Performance/Response Time, and set the service instance identifier (typically the pod label) so Harness can separate canary pods from stable ones. The sanity-check queries above collapse every pod into a single series, so the queries you register need to keep that label. Enable Continuous Verification for these metrics (not just SLO) so they gate the pipeline.
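For the canary and stable pods to be compared, the registered queries have to return one series per pod. The following Python sketch builds per-pod variants of the two queries shown earlier. It assumes the scraped series carry a `pod` label (common with kube-prometheus-stack, but check your relabeling), and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Prometheus endpoint used in the sanity checks earlier in this guide.
PROM_URL = "http://prometheus-operated.monitoring:9090/api/v1/query"


def per_pod_error_rate(app, window="5m"):
    """PromQL for the 5xx rate, grouped by pod so canary and stable stay separable."""
    selector = f'http_requests_total{{app="{app}",status=~"5.."}}'
    return f"sum by (pod) (rate({selector}[{window}]))"


def per_pod_p99(app, window="5m"):
    """PromQL for p99 latency per pod; 'le' must be kept for histogram_quantile."""
    bucket = f'http_request_duration_seconds_bucket{{app="{app}"}}'
    return f"histogram_quantile(0.99, sum by (le, pod) (rate({bucket}[{window}])))"


def query_url(promql):
    """Full instant-query URL, with the PromQL expression URL-encoded."""
    return f"{PROM_URL}?{urlencode({'query': promql})}"


print(per_pod_error_rate("checkout"))
print(per_pod_p99("checkout"))
print(query_url(per_pod_error_rate("checkout")))
```

Paste the printed expressions into the Prometheus UI first; if the result has no `pod` label, the health source has nothing to split canary from stable on.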

Datadog health source — APM and host metrics. The Datadog connector reads DD_API_KEY / DD_APP_KEY from Vault:

# Example Datadog metric queries to register as CV metrics:
#   Latency (APM):     avg:trace.http.request.duration{service:checkout,env:prod}
#   Error rate (APM):  sum:trace.http.request.errors{service:checkout,env:prod}.as_rate()
#   Saturation (host): avg:kubernetes.cpu.usage.total{kube_deployment:checkout}

For each metric, choose the deviation direction that signals risk (Higher counts = higher risk for latency, error rate, and CPU), and pick the Canary analysis method so Harness compares canary-tagged series against the stable baseline rather than a fixed threshold.

5. Build the Canary pipeline with a Continuous Verification step

Now the core. A canary deployment stage in Harness has three execution steps in order: Canary Deployment (bring up N canary pods), Verify (the CV gate), then Canary Delete + Rolling/Primary promotion. Define the stage YAML:

pipeline:
  name: checkout-canary-cv
  identifier: checkout_canary_cv
  projectIdentifier: payments
  orgIdentifier: default
  stages:
    - stage:
        name: Canary to prod
        identifier: canary_prod
        type: Deployment
        spec:
          deploymentType: Kubernetes
          service: { serviceRef: checkout }
          environment:
            environmentRef: prod
            infrastructureDefinitions: [{ identifier: eks_prod }]
          execution:
            steps:
              # 5.1 — stand up canary pods (25% of replica count)
              - step:
                  name: Canary Deployment
                  identifier: canaryDeployment
                  type: K8sCanaryDeploy
                  timeout: 10m
                  spec:
                    instanceSelection:
                      type: Percentage
                      spec: { percentage: 25 }
                    skipDryRun: false
              # 5.2 — THE GATE: Continuous Verification
              - step:
                  name: Continuous Verification
                  identifier: cv
                  type: Verify
                  timeout: 30m
                  spec:
                    type: Canary
                    monitoredService:
                      type: Default     # the checkout/prod Monitored Service from step 4
                    spec:
                      sensitivity: HIGH            # HIGH fails on smaller regressions
                      duration: 15m               # analysis window
                      deploymentTag: <+artifacts.primary.tag>
                  failureStrategies:
                    - onFailure:
                        errors: [Verification]
                        action: { type: StageRollback }   # auto-rollback on bad metrics
              # 5.3 — promote to 100% only if CV passed
              - step:
                  name: Canary Delete
                  identifier: canaryDelete
                  type: K8sCanaryDelete
                  timeout: 10m
                  spec: {}
              - step:
                  name: Rolling Deployment
                  identifier: rolling
                  type: K8sRollingDeploy
                  timeout: 15m
                  spec: { skipDryRun: false }
          rollbackSteps:
            - step:
                name: Rolling Rollback
                identifier: rollingRollback
                type: K8sRollingRollback
                timeout: 15m
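                spec: {}
        # Stage-level failure strategy (Harness expects one on Deployment stages)
        failureStrategies:
          - onFailure:
              errors: [AllErrors]
              action: { type: StageRollback }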

The decisive lines are the Verify step and its failure strategy. type: Canary tells CV to compare canary-tagged metric series against the stable baseline; sensitivity: HIGH fails the stage on a smaller deviation; and onFailure → StageRollback is what removes the human from the 2 a.m. equation: a high-risk verdict rolls the stage back automatically, and the bad version never runs on more than the canary pods (25% of the replica count).
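Because the canary step sizes the canary by pod count rather than by routed traffic, the share of requests it actually receives depends on how the Service spreads load. The Python sketch below estimates that footprint. It assumes the stable pods keep running at full count, the percentage is rounded up with a floor of one pod, and requests are balanced evenly across pods; none of these is confirmed Harness behaviour, so treat the numbers as an approximation.

```python
import math


def canary_footprint(stable_replicas, percentage=25):
    """Return (canary_pods, approx_traffic_share) for a pod-count canary.

    Assumptions (illustrative only): canary pods are added alongside the
    stable pods, the pod count is rounded up with a minimum of one, and
    traffic is spread evenly over every pod behind the Service.
    """
    canary = max(1, math.ceil(stable_replicas * percentage / 100))
    share = canary / (stable_replicas + canary)
    return canary, share


print(canary_footprint(8))  # (2, 0.2)
print(canary_footprint(3))  # (1, 0.25)
```

Under these assumptions, a 25% canary on 8 stable replicas serves about a fifth of requests, which is worth knowing when you judge whether a 15-minute window sees enough canary traffic to be statistically meaningful.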

Apply and trigger:

harness pipeline apply --file pipeline.yaml
harness pipeline execute --pipeline-id checkout_canary_cv \
  --inputs-yaml-file run-inputs.yaml     # supplies the image tag to deploy

6. Add the approval and notification path (ServiceNow + GitOps)

Production changes usually need a paper trail and, often, a human checkpoint before the canary even starts.

Insert an Approval stage ahead of the canary that blocks until a ServiceNow change request is approved. The approval step watches an existing ticket, so create the change request first (a ServiceNow Create step in the same stage can do it) and pass its number in. That ITSM record is the audit artifact compliance wants, and on a CV-triggered rollback Harness updates the same record to an incident:

    - stage:
        name: Change Approval
        identifier: change_approval
        type: Approval
        spec:
          execution:
            steps:
              - step:
                  name: ServiceNow Change
                  identifier: snowChange
                  type: ServiceNowApproval
                  timeout: 1d
                  spec:
                    connectorRef: servicenow_prod
                    ticketNumber: <+input>            # the change request to watch
                    ticketType: change_request
                    approvalCriteria:
                      type: KeyValues
                      spec:
                        conditions:
                          - { key: state, operator: equals, value: Implement }

If you run GitOps, the relationship to Argo CD is clean: GitHub Actions builds and pushes the image and bumps the tag in the Git manifest; Argo CD reconciles desired state to the cluster; and the Harness Canary + CV pipeline orchestrates how that change rolls out and verifies it, rather than letting Argo sync 100% at once. Use Harness GitOps (its packaged Argo CD) for the canary-with-CV flow, and keep plain Argo CD for the always-on infra apps.

Validation

Prove the gate works before you trust it in anger.

# 1. Happy path: deploy a known-good tag, watch the canary come up.
kubectl -n checkout get pods -l harness.io/track=canary -w
# Expect 25% canary pods, then CV runs for 15m, then promotion to stable.

# 2. Confirm the CV verdict in the UI:
#    Pipeline execution > Continuous Verification step shows per-metric
#    risk (green/amber/red) for every Prometheus and Datadog query.

# 3. Negative test — THE important one. Deploy an image that injects
#    latency or 5xx (a fault-injection build), and assert auto-rollback:
kubectl -n checkout get events --sort-by=.lastTimestamp | grep -i rollback
kubectl -n checkout rollout history deploy/checkout
# The Verify step should fail, StageRollback should fire, and stable
# replicas should be restored with zero canary pods remaining.

Independently confirm the metrics the gate saw by re-running the same Prometheus and Datadog queries from step 4 for the canary window — they should line up with the risk Harness reported. If they do not, your deploymentTag / canary label filter is wrong and CV is analysing the wrong series.
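One way to re-run a query for exactly the canary window is Prometheus's range-query endpoint, which takes `query`, `start`, `end`, and `step` parameters. The Python sketch below builds those parameters for a 15-minute window. The helper name, the 30-second step, and the example start time are assumptions; substitute the start time shown on the Verify step of your execution.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Range-query endpoint on the same Prometheus used in step 4.
RANGE_URL = "http://prometheus-operated.monitoring:9090/api/v1/query_range"


def canary_window_params(promql, verify_start, minutes=15, step="30s"):
    """Build query_range parameters covering the verification window."""
    end = verify_start + timedelta(minutes=minutes)
    return {
        "query": promql,
        "start": verify_start.timestamp(),  # Unix seconds
        "end": end.timestamp(),
        "step": step,
    }


# Hypothetical start time; take the real one from the pipeline execution.
start = datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc)
params = canary_window_params(
    'sum(rate(http_requests_total{app="checkout",status=~"5.."}[5m]))', start
)
print(f"{RANGE_URL}?{urlencode(params)}")
```

Fetch the printed URL with curl from a pod that can reach Prometheus and compare the returned series with the per-metric risk shown in the Harness UI.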

Rollback and teardown

Manual rollback of a release that slipped through:

# Re-run the pipeline targeting the previous good tag, OR roll the Deployment directly:
kubectl -n checkout rollout undo deploy/checkout
kubectl -n checkout rollout status deploy/checkout

Teardown of the whole setup (lab cleanup):

# Remove the workload and its canary remnants
kubectl delete namespace checkout

# Remove the Delegate
helm uninstall harness-delegate -n harness-delegate-ng
kubectl delete namespace harness-delegate-ng

# In Harness UI: delete the Pipeline, Monitored Service, Environment,
# Service, and the Prometheus/Datadog/Vault/ServiceNow connectors.
# In Vault: revoke the role.
vault delete auth/kubernetes/role/harness-delegate

Common pitfalls

Security notes

Keep the Delegate’s Kubernetes RBAC least-privilege — scoped to the app namespaces it deploys to, not cluster-admin — and run it on a dedicated node pool. All credentials (AWS role, Datadog API/App keys, ServiceNow auth) resolve from HashiCorp Vault at runtime via the Kubernetes auth method with short TTLs, so nothing static sits in Harness or Git. Human access is gated by Okta → Entra ID SSO with group-mapped Harness roles, so only platform-sre can approve or override a production promotion. Pair this with your existing posture and runtime controls — Wiz (and Wiz Code scanning the manifests/IaC in the repo) flags a misconfigured Deployment or an over-broad RBAC binding before it ships, and CrowdStrike Falcon sensors on the EKS nodes catch runtime threats on the canary and stable workloads alike. Pin the Delegate image to an explicit version (as above) rather than a floating tag so the control plane in your cluster does not drift.

Cost notes

The expensive failure mode is the one this pipeline prevents — a bad release reaching 100% of traffic and causing an outage — so CV pays for itself on the first auto-rollback. Beyond that: size the Delegate modestly (two replicas, ~1 vCPU / 2 GiB each is plenty for this throughput) and let it idle between deploys. Canary pods are short-lived and only 25% of replica count, so the extra compute during the verification window is marginal. The real cost lever is observability: Datadog bills on custom metrics and APM host count, so register only the handful of CV metrics that actually decide health rather than shipping every series, and lean on in-cluster Prometheus (effectively free compute you already run) for the high-cardinality latency/error queries, reserving Datadog for the APM traces and cross-service view. Keep CV analysis windows tight enough to be decisive but no longer than they need to be — a 4-hour window is not more correct than 15 minutes, just more expensive in deploy lead time.
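The lead-time trade-off is simple arithmetic. This small Python sketch uses the figures already quoted in this guide (forty deploys a week, 15-minute versus 4-hour windows) and assumes deploys run one after another, so verification time adds up linearly; the function name is hypothetical.

```python
def weekly_verification_hours(deploys_per_week, window_minutes):
    """Total hours per week that releases spend waiting in the CV window."""
    return deploys_per_week * window_minutes / 60


print(weekly_verification_hours(40, 15))   # 10.0
print(weekly_verification_hours(40, 240))  # 160.0
```

At forty deploys a week, a 4-hour window would need 160 hours of verification time, close to the 168 hours a week contains, so the release cadence itself would have to drop.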
