A health-insurance company’s claims-platform team ships once a sprint, on a Thursday night, with the whole squad on a bridge call and a runbook open. It works until it doesn’t: last quarter a release to the member-portal API changed a benefits-eligibility response shape, the on-call missed it during the manual smoke test, and for ninety minutes members were told they had no coverage. In a regulated payer that is not an inconvenience — it is a reportable incident, a spike in call-center volume, and a compliance officer asking why a change touching protected health information went out with a person clicking through a checklist. The mandate that came down was specific: no human pushes to production, every change is auditable, and a bad release rolls itself back before most members ever see it. This article is the reference architecture for that mandate — a progressive-delivery pipeline that builds and signs artifacts in GitHub Actions, deploys them through Argo CD GitOps onto GKE, rolls them out as Argo Rollouts canaries, and refuses to promote anything that fails a policy gate or a live health signal.
The pressures are the ones every platform team eventually hits, just sharper in a payer. Safety: a regression in eligibility or claims logic has real-world consequences, so a release has to prove itself on a sliver of traffic before it owns all of it. Auditability: a HIPAA-regulated shop needs to answer “what was running at 14:32, who approved it, and what scanned clean” without spelunking through shell history. Velocity: the business still wants to ship daily, which is impossible if every deploy needs a bridge call. Blast radius: when something does break, it should degrade a percentage of traffic for two minutes, not the whole member base for ninety. Progressive delivery — ship to a few, watch real signals, promote or abort automatically — is the pattern that satisfies all four, and GitOps is what makes it auditable by construction.
Why not the obvious shortcuts
Three cheaper approaches will get proposed in the first planning meeting, and each fails in a way worth naming.
“Just kubectl apply from the CI job.” Push-based CD hands your CI runners cluster-admin-grade credentials, makes the pipeline the source of truth for what’s deployed (so cluster drift is invisible), and leaves you with no record of desired state other than a job log that rotates out in 30 days. The first time someone hotfixes the cluster by hand, your CI and reality silently diverge.
“Blue-green the whole service.” Standing up a full parallel copy and flipping a load balancer is a real strategy, but it’s all-or-nothing: the new version takes 100% of traffic the instant you cut over, so a subtle eligibility bug hits every member at once — exactly the ninety-minute outage we’re trying to kill. It also doubles capacity cost during the cutover window.
“Add more manual gates.” More checklists and more bridge calls slow velocity without improving safety, because the failure mode was a human missing a regression, and humans miss regressions. The fix is not more humans; it’s machine-evaluated gates on real signals.
Progressive delivery with GitOps threads the needle. Git is the single source of truth and the audit log. Argo CD continuously reconciles the cluster to Git, so drift self-heals and “what’s running” is always answerable. Argo Rollouts shifts traffic in small increments and consults health analysis between steps, so a bad version is caught at 5% and rolled back automatically. And policy gates — in CI and at the cluster admission boundary — stop a non-compliant artifact from ever reaching the canary in the first place.
Architecture overview
The platform has two cleanly separated halves that meet at exactly one place — the Git repository — and never share credentials. The CI half lives in GitHub Actions: it builds, scans, signs, and then writes a desired-state change to Git. The CD half lives in the cluster: Argo CD watches Git and pulls changes in; nothing pushes to the cluster from outside. This separation is the whole game. CI never holds cluster credentials, and the cluster never reaches back into CI. The boundary between “build the thing” and “run the thing” is a Git commit, which is also the audit record.
The defining property the compliance team cares about is provenance you can prove: every image is built by a known workflow with no long-lived cloud keys, scanned by two independent tools, cryptographically signed, and admitted to the cluster only if its signature and policy checks pass. A push to production is no longer an action a person takes — it is a state that Git describes and the cluster converges to.
CI path, following the control flow:
- A developer merges a PR to main. A GitHub Actions workflow starts and authenticates to Google Cloud via Workload Identity Federation (OIDC): it exchanges its short-lived OIDC token for a GCP access token, so there is no stored service-account JSON key to leak. The same OIDC trust lets it pull from and push to Artifact Registry.
- The workflow scans infrastructure-as-code before it builds anything: Checkov lints the Terraform and Kubernetes manifests for misconfigurations (public buckets, privileged pods, missing encryption), and Wiz Code runs in the pipeline to catch IaC and dependency risk with the context of how the resource is actually exposed in the running cloud. A finding Wiz flags as “internet-reachable + critical CVE” is a hard stop, where a buried lab finding is not.
- The application image builds, and the workflow signs it with Cosign (keyless, using the same OIDC identity, recorded in a transparency log) and generates an SBOM and SLSA provenance attestation. Signature and attestation are pushed alongside the image in Artifact Registry.
- The workflow then makes its only change to the deploy surface: it bumps the image digest in the GitOps config repository — a separate repo holding Kustomize/Helm manifests — via a commit (often a PR for prod, auto-merged for lower environments). CI’s job ends here. It has touched Git, not the cluster.
CD path, pull-based and continuous:
- Argo CD, running inside GKE, detects the new commit in the config repo and reconciles. For the member-portal API it renders an Argo Rollouts Rollout resource (not a vanilla Deployment) pointing at the new image digest.
- Before any pod schedules, the OPA Gatekeeper admission webhook evaluates the manifests against cluster policy: image must come from the approved Artifact Registry, must carry a valid Cosign signature, must set resource limits, must not run privileged. A manifest that violates policy is rejected at admission, so a bad change fails loudly in Argo CD’s sync status instead of quietly running.
- Argo Rollouts begins the canary: it routes a small slice of traffic to the new version and pauses. During each pause it runs an AnalysisRun that queries Datadog (error rate, p95 latency, and a custom eligibility-success metric) against defined thresholds.
- If the analysis stays green across the canary steps, Rollouts promotes the new version to 100% and the old ReplicaSet scales down. If any step breaches a threshold, Rollouts aborts and rolls back automatically to the last good version, and a ServiceNow incident is opened from the Datadog monitor so on-call has a ticket, not just a page.
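The pull-based half of this flow can be sketched as a single Argo CD Application. This is a minimal illustration, not the platform's actual manifest: the repository URL, the overlays/prod path, the project, and the namespaces are all placeholder assumptions. What it shows is the pair of settings the CD path relies on, automated sync and self-heal.

```yaml
# Hypothetical Argo CD Application for the member-portal API.
# repoURL, path, project, and namespaces are illustrative placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: member-portal-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/claims-config.git
    targetRevision: main          # branch Argo CD watches for new commits
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: member-portal
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert out-of-band cluster changes to the Git state
```

With this shape, the only credential involved is Argo CD's read access to the config repo; nothing outside the cluster needs a kubeconfig.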
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| CI / build | GitHub Actions | Build, scan, sign, then commit desired state to Git | OIDC to GCP (no JSON keys); environment protection on prod repo |
| Cloud auth | Workload Identity Federation | Exchange GitHub OIDC token for short-lived GCP creds | Provider scoped to repo + branch; no exportable keys |
| IaC scanning | Checkov | Static policy scan of Terraform + K8s manifests | Fail build on HIGH; custom checks for payer controls |
| Code + cloud risk | Wiz Code | IaC/dependency scan with runtime exposure context | Block on internet-reachable critical; PR annotations |
| Signing / provenance | Cosign + SLSA attestation | Keyless image signing, SBOM, build provenance | Keyless OIDC; signature + attestation in Artifact Registry |
| Image registry | Artifact Registry | Stores images, signatures, SBOMs | Immutable tags; vulnerability scanning on push |
| GitOps controller | Argo CD | Reconcile cluster to the config repo; the audit surface | App-of-apps; auto-sync + self-heal; SSO via Okta/Entra |
| Progressive delivery | Argo Rollouts | Canary traffic shifting + automated analysis/rollback | Canary steps with pauses; Datadog AnalysisTemplate |
| Admission policy | OPA Gatekeeper | Cluster-side gate: signed, sourced, constrained workloads only | ConstraintTemplates; enforce in prod, dryrun to roll out |
| Service mesh / traffic | GKE + Istio (or Gateway API) | Weighted traffic split for the canary | Subset routing by Rollouts; mTLS between services |
| Deployment monitoring | Datadog | Health signals that gate promotion; deployment markers | Monitors as AnalysisTemplate metrics; deployment tracking |
| Identity / SSO | Okta + Entra ID | SSO into Argo CD and GitHub; RBAC by group | OIDC; group claims map to Argo CD roles |
| Secrets | HashiCorp Vault | App secrets to pods; pipeline secrets to CI | Vault Agent / Secrets Operator; short-lived leases |
| ITSM | ServiceNow | Incident + change record on rollback or prod promote | Auto-incident from Datadog; change gate on prod PR |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on GKE nodes + workloads | Sensor via DaemonSet; detections to the SOC |
A few of these choices carry the why, because they’re where teams go wrong.
Why two IaC scanners, not one. Checkov and Wiz Code overlap but answer different questions. Checkov is a fast, free, deterministic policy linter — perfect as a cheap pre-build gate and easy to extend with payer-specific checks (e.g., “every storage class touching claims data must be encrypted with a CMEK key”). Wiz Code adds the context Checkov can’t have: it correlates an IaC finding with the live cloud, so it can tell you a misconfiguration is actually internet-reachable and tied to a critical CVE versus a theoretical lab risk. Running both means cheap deterministic gates plus prioritization by real exposure, and you only hard-fail the build on findings that are genuinely high-severity in context — otherwise you train developers to ignore the scanner.
Why keyless signing and admission verification together. Signing an image proves who built it; it does nothing if the cluster will run any image. The value comes from pairing Cosign signatures with an OPA Gatekeeper (or Kyverno/Binary Authorization) admission check that requires a valid signature from your build identity before a pod schedules. That closes the loop: an attacker who pushes a malicious image to the registry can’t get it admitted, because it isn’t signed by the trusted CI OIDC identity. Provenance is only worth the bytes if something enforces it at the door.
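On the CI side, the same identity check can be exercised before the image is ever handed to GitOps. The step below is a sketch under assumptions: the organization, repository, and workflow path in the identity string are placeholders, and `$IMG` is assumed to be set earlier in the job. It uses Cosign's keyless verification flags to insist that the signing certificate was issued to one specific GitHub Actions workflow on one branch.

```yaml
# Hypothetical extra step for the build job. The org, repo, and workflow path
# in the identity string are placeholders, not values from this platform.
- name: Verify signature against the expected build identity
  run: |
    cosign verify \
      --certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
      --certificate-identity "https://github.com/example-org/member-portal-api/.github/workflows/build.yaml@refs/heads/main" \
      "$IMG"
```

Whatever admission-side verifier is used should be configured with the same issuer and identity, so CI and the cluster agree on what "signed by us" means.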
Implementation guidance
Wire OIDC first, and prove no static keys remain. The single biggest security win here is that GitHub Actions never holds a downloadable GCP key — the leaked-credentials lesson the platform team intends never to repeat. Configure a Workload Identity Pool scoped to your repo and branch, and the auth step needs no secret at all:
```yaml
# .github/workflows/build.yaml
permissions:
  contents: read
  id-token: write   # required for OIDC
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/4711/locations/global/workloadIdentityPools/gh/providers/repo
          service_account: ci-build@claims-prod.iam.gserviceaccount.com
      - name: IaC scan (Checkov)
        run: checkov -d ./infra --hard-fail-on HIGH
      - name: Build, sign, attest
        run: |
          docker build -t "$IMG" .
          docker push "$IMG"
          cosign sign --yes "$IMG"   # keyless, uses the OIDC identity
          cosign attest --yes --predicate sbom.json --type spdx "$IMG"
```
The provider is constrained so only this repository on a protected branch can assume the build service account; a fork or a feature branch gets nothing.
Separate the app repo from the config repo. Keep application source in one repository and the rendered deploy manifests (Kustomize bases/overlays or a Helm chart) in another. CI’s last step writes the new image digest into the config repo. This separation gives you environment-scoped review (a human approves the prod overlay change while dev/stage auto-merge), a clean per-environment audit trail, and an Argo CD that watches exactly one source of truth per environment.
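As a rough illustration of what CI's last step touches, a prod overlay in the config repo might look like the following. The file path, base layout, and image names are assumptions for the sketch, and the digest is a placeholder that CI would overwrite on each release.

```yaml
# overlays/prod/kustomization.yaml (hypothetical path and image names)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: member-portal-api   # image name as referenced in the base manifests
    newName: us-docker.pkg.dev/claims-prod/portal/member-portal-api
    digest: "sha256:<digest-written-by-ci>"   # placeholder; CI replaces this on each release
```

The whole release then reduces to a one-line diff on the digest field, which is what the reviewer of the prod PR actually approves. One way to make that change from the workflow is `kustomize edit set image`, though any tool that rewrites the field and commits it would do.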
Model the rollout, not a deployment. The member-portal API is an Argo Rollouts Rollout with explicit canary steps and a Datadog analysis gating promotion. The steps below mean: take 5% of traffic, hold for three minutes, run the Datadog analysis, then move to 25%, then 50%, then full, aborting the moment analysis fails.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: member-portal-api }
spec:
  strategy:
    canary:
      canaryService: member-portal-canary
      stableService: member-portal-stable
      trafficRouting: { istio: { virtualService: { name: member-portal-vs } } }
      steps:
        - setWeight: 5
        - pause: { duration: 3m }
        - analysis:   # query Datadog; abort on breach
            templates: [{ templateName: datadog-health }]
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 5m }
```
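The Rollout references an Istio VirtualService by name, and that object has to exist with a route to both Services before the first canary. A minimal sketch follows, assuming an in-mesh host called member-portal and a single HTTP route (both are illustrative choices). Argo Rollouts rewrites the two weights at each setWeight step, so the values here are only the starting point.

```yaml
# Hypothetical VirtualService backing the Rollout's trafficRouting block.
# The host name and route name are assumptions for illustration.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: member-portal-vs
spec:
  hosts:
    - member-portal
  http:
    - name: primary
      route:
        - destination: { host: member-portal-stable }
          weight: 100   # Rollouts lowers this as the canary weight rises
        - destination: { host: member-portal-canary }
          weight: 0     # Rollouts sets this to 5, 25, 50 during the steps
```

Because the weights are controller-managed, Argo CD usually needs to be told to ignore differences on those fields, otherwise self-heal and the rollout will fight over them.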
The AnalysisTemplate is where the safety lives — it queries Datadog for the metrics that actually matter to members, not just CPU:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: datadog-health }
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 3          # assumed value: bounds the run so the inline analysis step can finish
      failureLimit: 0   # no failed measurement is tolerated: the first breach aborts + rolls back
      provider:
        datadog:
          query: "sum:portal.http.5xx{service:member-portal,version:canary}.as_rate()"
      failureCondition: "result > 0.01"    # >1% 5xx fails the canary
    - name: eligibility-success
      provider:
        datadog:
          query: "sum:portal.eligibility.ok{version:canary} / sum:portal.eligibility.total{version:canary}"
      failureCondition: "result < 0.995"   # the gate that was missing last quarter
```
That second metric, the eligibility-success rate, is the signal that was missing when the original incident went unnoticed. Encoding it as a hard gate means the exact regression that took ninety minutes to notice now aborts the canary in about three minutes.
Enforce policy at admission, and roll it out in dry-run. OPA Gatekeeper ConstraintTemplates define the rules; start every new constraint in dryrun so you see what would be blocked without breaking deploys, then flip to enforce once the violations are clean. A representative set for this platform: images only from *-docker.pkg.dev/claims-prod/*, a valid Cosign signature present, CPU/memory limits set, no privileged: true, no latest tag. The signed-image constraint is what makes the whole supply-chain story enforceable rather than aspirational.
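To make the dry-run-then-enforce workflow concrete, here is a sketch of the registry-source rule only, modeled on the allowed-repos pattern from the public Gatekeeper policy library. The constraint name, the namespace, and the single us-docker.pkg.dev prefix are assumptions; signature verification needs additional machinery and is not shown.

```yaml
# Hypothetical ConstraintTemplate: pods may only use images from approved prefixes.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
      validation:
        openAPIV3Schema:
          type: object
          properties:
            repos:
              type: array
              items: { type: string }
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedrepos

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not allowed(container.image)
          msg := sprintf("image %v is not from an approved registry", [container.image])
        }

        allowed(image) {
          startswith(image, input.parameters.repos[_])
        }
---
# Hypothetical Constraint using the template, started in dry-run.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: prod-approved-registry
spec:
  enforcementAction: dryrun   # switch to deny once audit shows no violations
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["member-portal"]
  parameters:
    repos:
      - "us-docker.pkg.dev/claims-prod/"
```

In dry-run, violations show up in the constraint's status from Gatekeeper's audit rather than as rejected deploys, which is what makes the rollout of a new rule safe.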
Enterprise considerations
Security & supply chain. The design is defense-in-depth across the lifecycle: no static cloud keys (OIDC federation), two-tool IaC scanning (Checkov for deterministic gates, Wiz Code for exposure-aware prioritization), keyless signing with SLSA provenance, and OPA Gatekeeper verifying signatures and constraints at admission so only known-good artifacts run. At runtime, CrowdStrike Falcon sensors run as a DaemonSet on the GKE node pools for threat detection on the workloads themselves, feeding the payer’s SOC. HashiCorp Vault holds both application secrets (delivered to pods via the Vault Secrets Operator with short leases, never as static Kubernetes Secrets) and any residual pipeline secrets. Access to Argo CD and GitHub federates through Okta as the workforce IdP, brokered to Entra ID where Azure-side RBAC is needed, so an engineer’s group membership maps directly to what they can sync, override, or approve — and a single source of identity means offboarding actually removes access everywhere.
Cost optimization. Progressive delivery is cheaper than blue-green precisely because it never runs a full second copy of the service. The levers that matter on GKE:
| Lever | Mechanism | Typical effect |
|---|---|---|
| Canary vs blue-green | Run a small extra ReplicaSet, not a full parallel stack | Avoids ~2× capacity during cutover |
| Spot/Preemptible pools | Run stateless canary + batch on Spot node pools | 60–80% cheaper on that capacity |
| Right-sized requests | Enforce limits via Gatekeeper; tune from Datadog usage | Stops over-provisioned requests wasting nodes |
| Fail fast | Abort bad canaries in minutes, not after full rollout | Cuts wasted compute + incident cost |
| Cluster autoscaler | Scale node pools to canary + stable demand | Pay for traffic, not for peak headroom |
Datadog’s usage metrics feed the right-sizing, and because rollbacks are automatic and fast, a bad release burns a few minutes of canary capacity instead of a full deploy plus an emergency redeploy.
Scalability. Each half scales on its own axis. GitHub Actions parallelizes across repos and runners, so build throughput grows with concurrency, not a shared bottleneck. Argo CD scales to hundreds of applications with the app-of-apps pattern and sharded application controllers; for many clusters, point one Argo CD at all of them or run ApplicationSets to template environments. The canary mechanism is per-service, so adding services adds independent rollouts, not coordination overhead. The natural ceilings are the GKE control-plane and node quotas and the Datadog metrics query volume during many simultaneous analyses; plan for both as the service count grows.
Failure modes, and what each one looks like. Name them before they page you.
- Datadog metrics gap during a canary: if the query returns no data (an agent hiccup, a renamed metric), a naive analysis treats “no data” as “healthy” and promotes blind. Mitigation: set failureLimit/inconclusiveLimit so missing data is inconclusive, not pass, and the rollout pauses for a human rather than promoting.
- Argo CD / Git drift: someone hotfixes the cluster with kubectl edit and self-heal silently reverts it mid-incident, or auto-sync fights a manual change. Mitigation: self-heal on for prod, a documented break-glass that disables sync deliberately, and alerts on out-of-sync status.
- A stuck canary: analysis is inconclusive and the rollout sits paused at 25% indefinitely. Mitigation: progressDeadlineSeconds to bound the pause and a Datadog monitor that pages when a rollout exceeds its expected window.
- Gatekeeper webhook down: if the admission webhook is unreachable with failurePolicy: Fail, all deploys block; with Ignore, policy silently stops being enforced. Mitigation: run the webhook HA across zones, scope Fail to the namespaces that truly need it, and monitor webhook health as a first-class signal.
- Signature verification false-negative: a legitimately built image fails admission because the signing identity or trust root drifted. Mitigation: verify signatures in CI (fail fast) and keep the Gatekeeper trust config in the same GitOps repo so it changes reviewably.
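The first mitigation in that list can be sketched as a stricter variant of the eligibility gate. It assumes the Argo Rollouts expression helper `default()` and the behavior that a measurement meeting neither condition counts as inconclusive; the template name is hypothetical and the metric names are the same illustrative ones used earlier.

```yaml
# Sketch only: treat "no data" as inconclusive instead of healthy.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: datadog-health-strict }
spec:
  metrics:
    - name: eligibility-success
      interval: 1m
      count: 3
      failureLimit: 0        # any real breach fails the run
      inconclusiveLimit: 0   # any inconclusive measurement ends the run as inconclusive
      provider:
        datadog:
          query: "sum:portal.eligibility.ok{version:canary} / sum:portal.eligibility.total{version:canary}"
      # When the query returns nothing, result is nil. The two defaults are
      # chosen so that neither condition holds in that case.
      successCondition: "default(result, 0) >= 0.995"
      failureCondition: "default(result, 1) < 0.995"
```

An inconclusive run pauses the rollout rather than promoting or aborting it, which is why this needs to be paired with an alert on rollouts that stay paused too long.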
Reliability & DR (RTO/RPO). GitOps gives you a strong recovery story almost for free: because Git is the desired state, rebuilding a cluster is “stand up GKE, install Argo CD, point it at the config repo, let it reconcile.” Decide the numbers per tier — for this platform, RTO 30 minutes to reconstitute a cluster from Git and RPO near zero for desired state (it’s all in Git, replicated by the Git host). Stateful dependencies (databases, Vault) have their own backup/replication SLAs that dominate the real RPO; the deploy layer itself is reproducible from source. Run Argo CD HA, and keep the config repo and signing trust roots backed up off the primary Git host.
Observability. The pipeline emits a deployment marker to Datadog at the start of every rollout, so a latency or error-rate change on a dashboard is visually correlated with the exact release that caused it. Instrument the canary’s AnalysisRun results as first-class events, and track rollout success rate, mean time to rollback, canary abort rate, and lead time from merge to 100%: the DORA-style metrics the platform team reports up. Argo CD’s own UI and audit log answer “what’s running and who synced it” for any point in time. A ServiceNow change record is opened on every prod promotion and an incident on every automated rollback, giving compliance the documented trail the mandate demanded.
Governance. Pin everything that can drift: image digests not tags in the config repo (an immutable reference, never a moving latest), Argo CD app revisions to specific Git SHAs for prod, and Gatekeeper policies in version control so a rule change is a reviewed PR. Promotion through environments is a Git PR with Okta-backed approval and a ServiceNow change gate on prod, so “who approved this” is answerable by design. Log every signature verification and policy decision for audit. The combination — Git as the record, signatures as provenance, admission as enforcement, automated analysis as the safety net — is what lets a payer say “yes” to shipping daily on a system that touches PHI.
Explicit tradeoffs
Accept these or don’t build it. Progressive delivery adds real moving parts: a service mesh or Gateway API for weighted traffic, an analysis configuration you must tune (too-tight thresholds abort good releases and erode trust; too-loose ones let regressions through), and the discipline of two repositories and signed artifacts. Canaries also make every deploy slower by design — a release that used to flip instantly now takes fifteen-plus minutes of staged rollout and analysis. That is the point for a regulated, member-facing API, and it is pure overhead for an internal batch job that no member ever sees. The GitOps model means there is no kubectl apply shortcut when you’re firefighting; you change Git and wait for reconcile, which is safer and occasionally maddening, so the break-glass path has to be documented and practiced before you need it.
The alternatives, and when they win. If your service is stateless, idempotent, and has no meaningful “half-deployed” risk, rolling updates are simpler and need none of this machinery. If you genuinely need an instant, atomic switch with trivial rollback and can pay for double capacity during cutover, blue-green beats canary — it’s the right call for a database schema migration coordinated with a deploy. If you’re a small team optimizing for speed over control, push-based CD straight from GitHub Actions stands up in an afternoon; graduate to GitOps and admission policy when audit, supply-chain, or blast-radius requirements demand it. And if you’re already deep in a single cloud, the cloud-native equivalents (Cloud Deploy, Binary Authorization) cover much of this — the Argo stack earns its place when you want the same delivery model across clusters and clouds without re-platforming per provider.
The shape of the win
For the claims-platform team, the payoff is not “fancier CI.” It is that a developer merges a PR on a Tuesday afternoon, the image builds keylessly, scans clean on Checkov and Wiz Code, gets signed, and lands in Git — and then nobody touches the cluster. Argo CD reconciles, Gatekeeper admits only the signed image, Argo Rollouts shifts 5% of member traffic to the new version, Datadog watches the eligibility-success rate, and when that metric dips the canary aborts and rolls back in three minutes with a ServiceNow ticket already filed — before the call center notices anything. The ninety-minute outage that started this project becomes a three-minute blip on a dashboard that auto-recovered. Everything upstream — the OIDC federation, the dual IaC scans, the Cosign signatures, the Gatekeeper admission, the Datadog gates — exists so that “ship daily on a system touching PHI” and “no human pushes to production” are the same sentence. Start narrower if you must, but for a regulated, member-facing service at velocity, this is where progressive delivery has to land.