
GitHub Actions to Argo CD: Progressive Delivery with Policy Gates

A health-insurance company’s claims-platform team ships once a sprint, on a Thursday night, with the whole squad on a bridge call and a runbook open. It works until it doesn’t: last quarter a release to the member-portal API changed a benefits-eligibility response shape, the on-call missed it during the manual smoke test, and for ninety minutes members were told they had no coverage. In a regulated payer that is not an inconvenience — it is a reportable incident, a spike in call-center volume, and a compliance officer asking why a change touching protected health information went out with a person clicking through a checklist. The mandate that came down was specific: no human pushes to production, every change is auditable, and a bad release rolls itself back before most members ever see it. This article is the reference architecture for that mandate — a progressive-delivery pipeline that builds and signs artifacts in GitHub Actions, deploys them through Argo CD GitOps onto GKE, rolls them out as Argo Rollouts canaries, and refuses to promote anything that fails a policy gate or a live health signal.

The pressures are the ones every platform team eventually hits, just sharper in a payer. Safety: a regression in eligibility or claims logic has real-world consequences, so a release has to prove itself on a sliver of traffic before it owns all of it. Auditability: a HIPAA-regulated shop needs to answer “what was running at 14:32, who approved it, and what scanned clean” without spelunking through shell history. Velocity: the business still wants to ship daily, which is impossible if every deploy needs a bridge call. Blast radius: when something does break, it should degrade a percentage of traffic for two minutes, not the whole member base for ninety. Progressive delivery — ship to a few, watch real signals, promote or abort automatically — is the pattern that satisfies all four, and GitOps is what makes it auditable by construction.

Why not the obvious shortcuts

Three cheaper approaches will get proposed in the first planning meeting, and each fails in a way worth naming.

“Just kubectl apply from the CI job.” Push-based CD hands your CI runners cluster-admin-grade credentials, makes the pipeline the source of truth for what’s deployed (so cluster drift is invisible), and leaves you with no record of desired state other than a job log that ages out with the CI system’s retention window. The first time someone hotfixes the cluster by hand, your CI and reality silently diverge.

“Blue-green the whole service.” Standing up a full parallel copy and flipping a load balancer is a real strategy, but it’s all-or-nothing: the new version takes 100% of traffic the instant you cut over, so a subtle eligibility bug hits every member at once — exactly the ninety-minute outage we’re trying to kill. It also doubles capacity cost during the cutover window.

“Add more manual gates.” More checklists and more bridge calls slow velocity without improving safety, because the failure mode was a human missing a regression, and humans miss regressions. The fix is not more humans; it’s machine-evaluated gates on real signals.

Progressive delivery with GitOps threads the needle. Git is the single source of truth and the audit log. Argo CD continuously reconciles the cluster to Git, so drift self-heals and “what’s running” is always answerable. Argo Rollouts shifts traffic in small increments and consults health analysis between steps, so a bad version is caught at 5% and rolled back automatically. And policy gates — in CI and at the cluster admission boundary — stop a non-compliant artifact from ever reaching the canary in the first place.

Architecture overview

Figure: architecture of the GitHub Actions to Argo CD progressive-delivery pipeline with policy gates.

The platform has two cleanly separated halves that meet at exactly one place — the Git repository — and never share credentials. The CI half lives in GitHub Actions: it builds, scans, signs, and then writes a desired-state change to Git. The CD half lives in the cluster: Argo CD watches Git and pulls changes in; nothing pushes to the cluster from outside. This separation is the whole game. CI never holds cluster credentials, and the cluster never reaches back into CI. The boundary between “build the thing” and “run the thing” is a Git commit, which is also the audit record.

The defining property the compliance team cares about is provenance you can prove: every image is built by a known workflow with no long-lived cloud keys, scanned by two independent tools, cryptographically signed, and admitted to the cluster only if its signature and policy checks pass. A push to production is no longer an action a person takes — it is a state that Git describes and the cluster converges to.

CI path, following the control flow:

  1. A developer merges a PR to main. A GitHub Actions workflow starts and authenticates to Google Cloud via Workload Identity Federation (OIDC) — it exchanges its short-lived OIDC token for a GCP access token, so there is no stored service-account JSON key to leak. The same OIDC trust lets it pull from and push to Artifact Registry.
  2. The workflow scans infrastructure-as-code before it builds anything: Checkov lints the Terraform and Kubernetes manifests for misconfigurations (public buckets, privileged pods, missing encryption), and Wiz Code runs in the pipeline to catch IaC and dependency risk with the context of how the resource is actually exposed in the running cloud — a finding Wiz flags as “internet-reachable + critical CVE” is a hard stop, where a buried lab finding is not.
  3. The application image builds, and the workflow signs it with Cosign (keyless, using the same OIDC identity, recorded in a transparency log) and generates an SBOM and SLSA provenance attestation. Signature and attestation are pushed alongside the image in Artifact Registry.
  4. The workflow then makes its only change to the deploy surface: it bumps the image digest in the GitOps config repository — a separate repo holding Kustomize/Helm manifests — via a commit (often a PR for prod, auto-merged for lower environments). CI’s job ends here. It has touched Git, not the cluster.

CD path, pull-based and continuous:

  1. Argo CD, running inside GKE, detects the new commit in the config repo and reconciles. For the member-portal API it renders an Argo Rollouts Rollout resource (not a vanilla Deployment) pointing at the new image digest.
  2. Before any pod schedules, the OPA Gatekeeper admission webhook evaluates the manifests against cluster policy — image must come from the approved Artifact Registry, must carry a valid Cosign signature, must set resource limits, must not run privileged. A manifest that violates policy is rejected at admission, so a bad change fails loudly in Argo CD’s sync status instead of quietly running.
  3. Argo Rollouts begins the canary: it routes a small slice of traffic to the new version and pauses. During each pause it runs an AnalysisRun that queries Datadog — error rate, p95 latency, and a custom eligibility-success metric — against defined thresholds.
  4. If the analysis stays green across the canary steps, Rollouts promotes the new version to 100% and the old ReplicaSet scales down. If any step breaches a threshold, Rollouts aborts and rolls back automatically to the last good version, and a ServiceNow incident is opened from the Datadog monitor so on-call has a ticket, not just a page.
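
As a rough sketch of the pull-based half described above, the Argo CD side could be declared as an Application along these lines. The application name, repository URL, path, and namespace are placeholders invented for illustration; only the auto-sync and self-heal behavior comes from the design in this article.

```yaml
# Hypothetical Argo CD Application for the member-portal API (names and URL are assumptions).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: member-portal-api-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/claims-config.git  # placeholder config repo
    targetRevision: main          # for prod, a pinned Git SHA is the stricter choice
    path: overlays/prod           # assumed Kustomize overlay path
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: member-portal
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual changes made directly in the cluster
```

With selfHeal enabled, a hand-applied hotfix is reverted on the next reconcile, which is what keeps Git and the cluster from silently diverging.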

Component breakdown

| Component | Service / tool | Role in the platform | Key configuration choices |
| --- | --- | --- | --- |
| CI / build | GitHub Actions | Build, scan, sign, then commit desired state to Git | OIDC to GCP (no JSON keys); environment protection on prod repo |
| Cloud auth | Workload Identity Federation | Exchange GitHub OIDC token for short-lived GCP creds | Provider scoped to repo + branch; no exportable keys |
| IaC scanning | Checkov | Static policy scan of Terraform + K8s manifests | Fail build on HIGH; custom checks for payer controls |
| Code + cloud risk | Wiz Code | IaC/dependency scan with runtime exposure context | Block on internet-reachable critical; PR annotations |
| Signing / provenance | Cosign + SLSA attestation | Keyless image signing, SBOM, build provenance | Keyless OIDC; signature + attestation in Artifact Registry |
| Image registry | Artifact Registry | Stores images, signatures, SBOMs | Immutable tags; vulnerability scanning on push |
| GitOps controller | Argo CD | Reconcile cluster to the config repo; the audit surface | App-of-apps; auto-sync + self-heal; SSO via Okta/Entra |
| Progressive delivery | Argo Rollouts | Canary traffic shifting + automated analysis/rollback | Canary steps with pauses; Datadog AnalysisTemplate |
| Admission policy | OPA Gatekeeper | Cluster-side gate: signed, sourced, constrained workloads only | ConstraintTemplates; enforce in prod, dryrun to roll out |
| Service mesh / traffic | GKE + Istio (or Gateway API) | Weighted traffic split for the canary | Subset routing by Rollouts; mTLS between services |
| Deployment monitoring | Datadog | Health signals that gate promotion; deployment markers | Monitors as AnalysisTemplate metrics; deployment tracking |
| Identity / SSO | Okta + Entra ID | SSO into Argo CD and GitHub; RBAC by group | OIDC; group claims map to Argo CD roles |
| Secrets | HashiCorp Vault | App secrets to pods; pipeline secrets to CI | Vault Agent / Secrets Operator; short-lived leases |
| ITSM | ServiceNow | Incident + change record on rollback or prod promote | Auto-incident from Datadog; change gate on prod PR |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on GKE nodes + workloads | Sensor via DaemonSet; detections to the SOC |

A few of these choices carry the why, because they’re where teams go wrong.

Why two IaC scanners, not one. Checkov and Wiz Code overlap but answer different questions. Checkov is a fast, free, deterministic policy linter: perfect as a cheap pre-build gate and easy to extend with payer-specific checks (e.g., “every storage class touching claims data must be encrypted with a customer-managed encryption key (CMEK)”). Wiz Code adds the context Checkov can’t have: it correlates an IaC finding with the live cloud, so it can tell you a misconfiguration is actually internet-reachable and tied to a critical CVE versus a theoretical lab risk. Running both means cheap deterministic gates plus prioritization by real exposure, and you only hard-fail the build on findings that are genuinely high-severity in context. Fail on everything and you train developers to ignore the scanner.

Why keyless signing and admission verification together. Signing an image proves who built it; it does nothing if the cluster will run any image. The value comes from pairing Cosign signatures with an OPA Gatekeeper (or Kyverno/Binary Authorization) admission check that requires a valid signature from your build identity before a pod schedules. That closes the loop: an attacker who pushes a malicious image to the registry can’t get it admitted, because it isn’t signed by the trusted CI OIDC identity. Provenance is only worth the bytes if something enforces it at the door.

Implementation guidance

Wire OIDC first, and prove no static keys remain. The single biggest security win here is that GitHub Actions never holds a downloadable GCP key, so there is no service-account JSON file to leak from a repository secret or a runner. Configure a Workload Identity Pool provider whose attribute condition is scoped to your repo and branch, and the auth step needs no secret at all:

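# .github/workflows/build.yaml
on:
  push:
    branches: [main]
permissions:
  contents: read
  id-token: write          # required for OIDC (GCP auth and keyless Cosign)
env:
  # Illustrative image reference (an assumption); use your own Artifact Registry path.
  IMG: us-docker.pkg.dev/claims-prod/portal/member-portal-api:${{ github.sha }}
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/4711/locations/global/workloadIdentityPools/gh/providers/repo
          service_account: ci-build@claims-prod.iam.gserviceaccount.com
      - uses: google-github-actions/setup-gcloud@v2
      - uses: sigstore/cosign-installer@v3
      - name: IaC scan (Checkov)
        run: |
          pip install checkov
          # Severity thresholds such as HIGH depend on Checkov having severity data (platform API key).
          checkov -d ./infra --hard-fail-on HIGH
      - name: Build, sign, attest
        run: |
          gcloud auth configure-docker us-docker.pkg.dev --quiet
          docker build -t "$IMG" .
          docker push "$IMG"
          # Sign and attest the immutable digest rather than the mutable tag.
          REF="$(docker inspect --format '{{index .RepoDigests 0}}' "$IMG")"
          cosign sign --yes "$REF"            # keyless, uses the OIDC identity
          # sbom.json is assumed to be an SPDX SBOM generated by an earlier step.
          cosign attest --yes --predicate sbom.json --type spdx "$REF"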

The provider is constrained so only this repository on a protected branch can assume the build service account; a fork or a feature branch gets nothing.

Separate the app repo from the config repo. Keep application source in one repository and the rendered deploy manifests (Kustomize bases/overlays or a Helm chart) in another. CI’s last step writes the new image digest into the config repo. This separation gives you environment-scoped review (a human approves the prod overlay change while dev/stage auto-merge), a clean per-environment audit trail, and an Argo CD that watches exactly one source of truth per environment.
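
To make the digest bump concrete, here is a minimal sketch of what the prod overlay in the config repo might look like, assuming Kustomize. The file path, base location, image name, and registry path are illustrative assumptions, and the digest is a placeholder that CI would overwrite.

```yaml
# overlays/prod/kustomization.yaml (hypothetical path in the config repo)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                      # assumed location of the shared base manifests
images:
  - name: member-portal-api         # image name as written in the base (assumption)
    newName: us-docker.pkg.dev/claims-prod/portal/member-portal-api  # illustrative registry path
    # Placeholder digest; CI's final step rewrites this line with the digest it just signed.
    digest: sha256:0000000000000000000000000000000000000000000000000000000000000000
```

Because the only thing CI changes is the digest line, the prod pull request is a one-line diff that a reviewer can approve or reject on its own.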

Model the rollout, not a deployment. The member-portal API is an Argo Rollouts Rollout with explicit canary steps and a Datadog analysis between them. The steps below mean: take 5% of traffic, hold while analysis runs, then 25%, then 50%, then full — aborting the moment analysis fails.

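apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: member-portal-api }
spec:
  # replicas, selector, and the pod template are omitted for brevity
  strategy:
    canary:
      canaryService: member-portal-canary
      stableService: member-portal-stable
      trafficRouting: { istio: { virtualService: { name: member-portal-vs } } }
      steps:
        - setWeight: 5
        - pause: { duration: 3m }
        - analysis:                      # query Datadog; abort on breach
            templates: [{ templateName: datadog-health }]
        - setWeight: 25
        - pause: { duration: 5m }
        - analysis:                      # re-check health before widening further
            templates: [{ templateName: datadog-health }]
        - setWeight: 50
        - pause: { duration: 5m }
        - analysis:                      # final gate before promotion to 100%
            templates: [{ templateName: datadog-health }]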
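
The Rollout above references an Istio VirtualService named member-portal-vs, which Argo Rollouts expects to exist already so it can rewrite the weights. A minimal sketch of that object might look like the following; the host name and route name are assumptions made for illustration.

```yaml
# Hypothetical VirtualService that Argo Rollouts manages during the canary.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: member-portal-vs
spec:
  hosts:
    - member-portal.example.internal   # placeholder host
  http:
    - name: primary                    # single route, so the Rollout need not name it
      route:
        - destination:
            host: member-portal-stable # matches stableService in the Rollout
          weight: 100
        - destination:
            host: member-portal-canary # matches canaryService in the Rollout
          weight: 0                    # Rollouts raises this at each setWeight step
```

The weights here are only the starting state; the controller edits them in place as the canary advances and resets them on abort.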

The AnalysisTemplate is where the safety lives — it queries Datadog for the metrics that actually matter to members, not just CPU:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: datadog-health }
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 3                           # bounded, so the inline analysis step can finish
      failureLimit: 0                    # the first breach aborts + rolls back
      provider:
        datadog:
          # portal.http.requests is an assumed name for the total-request counter
          query: "sum:portal.http.5xx{service:member-portal,version:canary}.as_count() / sum:portal.http.requests{service:member-portal,version:canary}.as_count()"
      failureCondition: "result > 0.01"  # >1% 5xx fails the canary
    - name: eligibility-success
      interval: 1m
      count: 3
      failureLimit: 0
      provider:
        datadog:
          query: "sum:portal.eligibility.ok{version:canary} / sum:portal.eligibility.total{version:canary}"
      failureCondition: "result < 0.995" # the gate that was missing last quarter

That second metric, the eligibility-success rate, is the one whose absence caused the original incident. Encoding it as a hard gate means the exact regression that took ninety minutes to notice now aborts the canary about three minutes in, at the first analysis after the 5% pause.
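
If the goal is to have analysis running during the pauses rather than between them, Argo Rollouts also supports background analysis declared on the canary strategy itself. The fragment below is a sketch of that variant, not the configuration used in this article; it assumes the template's metrics set an interval and omit count, so they keep measuring until the rollout finishes.

```yaml
# Sketch of a background-analysis variant (fragment of a Rollout spec).
strategy:
  canary:
    canaryService: member-portal-canary
    stableService: member-portal-stable
    analysis:
      templates:
        - templateName: datadog-health
      startingStep: 1          # begin once the first setWeight has been applied
    steps:
      - setWeight: 5
      - pause: { duration: 3m }
      - setWeight: 25
      - pause: { duration: 5m }
      - setWeight: 50
      - pause: { duration: 5m }
```

The tradeoff is that a failing measurement aborts the rollout at whatever step it is on, which shortens detection time but makes the thresholds more sensitive to noise at low traffic weights.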

Enforce policy at admission, and roll it out in dry-run. OPA Gatekeeper ConstraintTemplates define the rules; start every new constraint in dryrun so you see what would be blocked without breaking deploys, then flip to enforce (enforcementAction: deny) once the violations are clean. A representative set for this platform: images only from *-docker.pkg.dev/claims-prod/*, a valid Cosign signature present, CPU/memory limits set, no privileged: true, no latest tag. Rego alone cannot verify a signature, so the signed-image constraint relies on Gatekeeper’s external-data feature calling a verification provider (Ratify is one option). That constraint is what makes the whole supply-chain story enforceable rather than aspirational.
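
As an illustration of the simplest rule in that set, the approved-registry check, here is a minimal ConstraintTemplate and Constraint pair modeled on the pattern in the Gatekeeper policy library. The constraint name, namespace, and the single us-docker.pkg.dev prefix are assumptions; a real deployment would list every regional prefix it uses and would also need to cover init containers.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
      validation:
        openAPIV3Schema:
          type: object
          properties:
            repos:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedrepos

        # Flag any container whose image does not start with an approved prefix.
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not allowed(container.image)
          msg := sprintf("image %v is not from an approved repository", [container.image])
        }

        allowed(image) {
          startswith(image, input.parameters.repos[_])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: claims-prod-registry-only      # hypothetical constraint name
spec:
  enforcementAction: dryrun            # switch to deny once audit shows no violations
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["member-portal"]      # assumed namespace
  parameters:
    repos:
      - "us-docker.pkg.dev/claims-prod/"   # illustrative approved prefix
```

While the constraint is in dryrun, violations show up in the constraint's status and in Gatekeeper's audit results without blocking any deploy.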

Enterprise considerations

Security & supply chain. The design is defense-in-depth across the lifecycle: no static cloud keys (OIDC federation), two-tool IaC scanning (Checkov for deterministic gates, Wiz Code for exposure-aware prioritization), keyless signing with SLSA provenance, and OPA Gatekeeper verifying signatures and constraints at admission so only known-good artifacts run. At runtime, CrowdStrike Falcon sensors run as a DaemonSet on the GKE node pools for threat detection on the workloads themselves, feeding the payer’s SOC. HashiCorp Vault holds both application secrets (delivered to pods via the Vault Secrets Operator with short leases, never as hand-managed, long-lived Kubernetes Secrets) and any residual pipeline secrets. Access to Argo CD and GitHub federates through Okta as the workforce IdP, brokered to Entra ID where Azure-side RBAC is needed, so an engineer’s group membership maps directly to what they can sync, override, or approve. A single source of identity also means offboarding actually removes access everywhere.

Cost optimization. Progressive delivery is cheaper than blue-green precisely because it never runs a full second copy of the service. The levers that matter on GKE:

| Lever | Mechanism | Typical effect |
| --- | --- | --- |
| Canary vs blue-green | Run a small extra ReplicaSet, not a full parallel stack | Avoids ~2× capacity during cutover |
| Spot/Preemptible pools | Run stateless canary + batch on Spot node pools | 60–80% cheaper on that capacity |
| Right-sized requests | Enforce limits via Gatekeeper; tune from Datadog usage | Stops over-provisioned requests wasting nodes |
| Fail fast | Abort bad canaries in minutes, not after full rollout | Cuts wasted compute + incident cost |
| Cluster autoscaler | Scale node pools to canary + stable demand | Pay for traffic, not for peak headroom |

Datadog’s usage metrics feed the right-sizing, and because rollbacks are automatic and fast, a bad release burns a few minutes of canary capacity instead of a full deploy plus an emergency redeploy.

Scalability. Each half scales on its own axis. GitHub Actions parallelizes across repos and runners, so build throughput grows with concurrency, not a shared bottleneck. Argo CD scales to hundreds of applications with the app-of-apps pattern and sharded application controllers; for many clusters, point one Argo CD at all of them or run ApplicationSets to template environments. The canary mechanism is per-service, so adding services adds independent rollouts, not coordination overhead. The natural ceilings are the GKE control-plane and node quotas and the Datadog metrics query volume during many simultaneous analyses, both of which need capacity planning as the service count grows.
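
For the templating case, an ApplicationSet with a list generator is one way to stamp out the same application per environment. This is a sketch under assumptions: the cluster URLs, repository URL, overlay paths, and namespace are placeholders, not values from this platform.

```yaml
# Hypothetical ApplicationSet generating one Application per environment.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: member-portal-api
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: stage
            cluster: https://stage-cluster.example.internal   # placeholder API server URL
          - env: prod
            cluster: https://prod-cluster.example.internal    # placeholder API server URL
  template:
    metadata:
      name: 'member-portal-api-{{env}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/claims-config.git  # placeholder config repo
        targetRevision: main
        path: 'overlays/{{env}}'       # assumed per-environment overlay layout
      destination:
        server: '{{cluster}}'
        namespace: member-portal
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Adding an environment then becomes a reviewed change to the elements list rather than a new hand-written Application.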

Failure modes, and what each one looks like. Name them before they page you.

Reliability & DR (RTO/RPO). GitOps gives you a strong recovery story almost for free: because Git is the desired state, rebuilding a cluster is “stand up GKE, install Argo CD, point it at the config repo, let it reconcile.” Decide the numbers per tier — for this platform, RTO 30 minutes to reconstitute a cluster from Git and RPO near zero for desired state (it’s all in Git, replicated by the Git host). Stateful dependencies (databases, Vault) have their own backup/replication SLAs that dominate the real RPO; the deploy layer itself is reproducible from source. Run Argo CD HA, and keep the config repo and signing trust roots backed up off the primary Git host.

Observability. The pipeline emits a deployment marker to Datadog at the start of every rollout, so a latency or error-rate change on a dashboard is visually correlated with the exact release that caused it. Instrument the canary’s AnalysisRun results as first-class events, and track rollout success rate, mean time to rollback, canary abort rate, and lead time from merge to 100%: the DORA-style metrics the platform team reports up. Argo CD’s own UI and audit log answer “what’s running and who synced it,” with Git history as the durable record for any point in time. A ServiceNow change record is opened on every prod promotion and an incident on every automated rollback, giving compliance the documented trail the mandate demanded.

Governance. Pin everything that can drift: image digests not tags in the config repo (an immutable reference, never a moving latest), Argo CD app revisions to specific Git SHAs for prod, and Gatekeeper policies in version control so a rule change is a reviewed PR. Promotion through environments is a Git PR with Okta-backed approval and a ServiceNow change gate on prod, so “who approved this” is answerable by design. Log every signature verification and policy decision for audit. The combination — Git as the record, signatures as provenance, admission as enforcement, automated analysis as the safety net — is what lets a payer say “yes” to shipping daily on a system that touches PHI.

Explicit tradeoffs

Accept these or don’t build it. Progressive delivery adds real moving parts: a service mesh or Gateway API for weighted traffic, an analysis configuration you must tune (too-tight thresholds abort good releases and erode trust; too-loose ones let regressions through), and the discipline of two repositories and signed artifacts. Canaries also make every deploy slower by design — a release that used to flip instantly now takes fifteen-plus minutes of staged rollout and analysis. That is the point for a regulated, member-facing API, and it is pure overhead for an internal batch job that no member ever sees. The GitOps model means there is no kubectl apply shortcut when you’re firefighting; you change Git and wait for reconcile, which is safer and occasionally maddening, so the break-glass path has to be documented and practiced before you need it.

The alternatives, and when they win. If your service is stateless, idempotent, and has no meaningful “half-deployed” risk, rolling updates are simpler and need none of this machinery. If you genuinely need an instant, atomic switch with trivial rollback and can pay for double capacity during cutover, blue-green beats canary — it’s the right call for a database schema migration coordinated with a deploy. If you’re a small team optimizing for speed over control, push-based CD straight from GitHub Actions stands up in an afternoon; graduate to GitOps and admission policy when audit, supply-chain, or blast-radius requirements demand it. And if you’re already deep in a single cloud, the cloud-native equivalents (Cloud Deploy, Binary Authorization) cover much of this — the Argo stack earns its place when you want the same delivery model across clusters and clouds without re-platforming per provider.

The shape of the win

For the claims-platform team, the payoff is not “fancier CI.” It is that a developer merges a PR on a Tuesday afternoon, the image builds keylessly, scans clean on Checkov and Wiz Code, gets signed, and lands in Git — and then nobody touches the cluster. Argo CD reconciles, Gatekeeper admits only the signed image, Argo Rollouts shifts 5% of member traffic to the new version, Datadog watches the eligibility-success rate, and when that metric dips the canary aborts and rolls back in three minutes with a ServiceNow ticket already filed — before the call center notices anything. The ninety-minute outage that started this project becomes a three-minute blip on a dashboard that auto-recovered. Everything upstream — the OIDC federation, the dual IaC scans, the Cosign signatures, the Gatekeeper admission, the Datadog gates — exists so that “ship daily on a system touching PHI” and “no human pushes to production” are the same sentence. Start narrower if you must, but for a regulated, member-facing service at velocity, this is where progressive delivery has to land.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
