A mid-sized health-insurance payer is two weeks out from its annual HITRUST and SOC 2 Type II audit, and the auditor has asked a question the platform team cannot currently answer cleanly: show me, for any production deployment in the last quarter, who approved it, what exactly was deployed, and proof the running cluster matches what was approved. Today the answer is a mess of Jenkins console logs, a kubectl apply someone ran from a laptop during an incident, and a ServiceNow change ticket that says “deploy v3.4” with no link to an artifact. The deployment that fixed a claims-processing outage at 2 a.m. has no change record at all. The CISO knows this is the finding that turns a clean audit into a qualified one — and in a regulated payer, a qualified SOC 2 is something the sales team has to explain to every prospect for a year.
The mandate that comes down is precise: every change to production must be reviewed, approved, recorded, reproducible, and continuously verified against drift — without slowing the forty-engineer delivery org to a crawl. This article is the reference architecture for that mandate: a Jenkins-to-Argo CD GitOps pipeline where Jenkins owns build, test, and signing; Git is the single source of truth for what production should look like; Argo CD continuously reconciles the cluster to Git; ServiceNow holds the auditable change record; and HashiCorp Vault means no long-lived secret ever sits in a pipeline. It is the design that lets the platform team answer the auditor’s question with a single Git history and a single Argo CD screen.
Why GitOps, and why the split between Jenkins and Argo CD
The instinct in a regulated shop is to bolt approval gates onto the existing push-based Jenkins pipeline — Jenkins builds, then Jenkins runs kubectl apply after a manual approval step. That fails the audit for a structural reason: a push pipeline gives Jenkins standing credentials to production, the approval is a moment in time that nothing re-verifies, and the moment someone runs a manual kubectl out-of-band, the cluster silently diverges from any record. You cannot prove the running state matches the approved state because nothing is continuously checking.
GitOps inverts the model. The desired state of production lives in Git as declarative manifests. An in-cluster agent — Argo CD — pulls that state and reconciles the cluster to match it, continuously. Three properties fall out of this, and each maps directly to a control an auditor cares about:
- Git is the audit log. Every change to production is a commit: authored, reviewed via pull request, timestamped, signed, immutable. “Who approved this and what changed” is git log.
- Drift is detected and corrected. If someone hand-edits a Deployment at 2 a.m., Argo CD flags the cluster as OutOfSync and can auto-revert it. The 2 a.m. fix now has to go through Git, or it gets undone, which is exactly the control the payer was missing.
- No CI system holds production credentials. Argo CD pulls from inside the cluster; Jenkins never touches the EKS API. The blast radius of a compromised Jenkins shrinks from “owns production” to “can write a manifest PR that still needs approval.”
The division of labor is the heart of the design. Jenkins is the continuous-integration plane: it builds the image, runs tests, scans, signs the artifact, and proposes a change by writing to Git. Argo CD is the continuous-delivery plane: it takes approved Git state and makes the cluster real. CI proposes; CD disposes. Keeping these planes separate — separate credentials, separate identities, separate audit trails — is what produces clean separation of duties.
Architecture overview
The platform is built around two Git repositories with deliberately different governance, because conflating them is the most common GitOps mistake and the one that quietly breaks separation of duties.
The application repository holds source code and the Jenkinsfile. Developers own it; they push features and open PRs here all day. The config repository (the GitOps repo) holds the Kubernetes manifests that describe what runs in each environment — and it is governed like production, because it is production’s source of truth. A change to prod/ in the config repo requires a reviewed, approved PR; that PR is the change approval the auditor wants.
The build-and-propose flow (Jenkins, the CI plane):
- A developer merges to main in the application repo. A webhook triggers the Jenkins pipeline, which runs on ephemeral Kubernetes agents (the Jenkins Kubernetes plugin spins a fresh pod per build, so no build state lingers and no agent accumulates secrets).
- The pipeline, defined almost entirely in a Jenkins shared library so all forty services get identical, centrally-governed behavior rather than forty hand-copied Jenkinsfiles, runs unit and integration tests, builds the container image, and pushes it to Amazon ECR tagged with the immutable Git SHA, never latest.
- The pipeline pulls every credential it needs (the ECR push token, the signing key reference, the config-repo deploy key) from HashiCorp Vault at runtime via the Vault Jenkins integration, authenticated with the Kubernetes auth method tied to the build pod’s service account. Nothing sensitive is stored in Jenkins itself.
- Wiz Code scans the Terraform/Helm IaC and the application dependencies on the pull request, and the image is scanned for CVEs; a critical finding fails the build before anything is signed.
- The pipeline signs the image with Sigstore Cosign (keyless, using an OIDC identity, or a Vault-held key), producing a cryptographic attestation that this artifact came from this pipeline. An SBOM is generated and attached.
- The pipeline opens a pull request against the config repository that bumps the image tag for the target environment — typically by patching a Kustomize overlay or Helm values file. This PR is the proposed change. Jenkins’ job ends here: it has proposed, not deployed.
The approve-and-reconcile flow (Argo CD, the CD plane):
- The config-repo PR triggers a ServiceNow change record automatically (via the ServiceNow DevOps integration or a webhook), linking the change to the artifact, the SBOM, the signer, and the Wiz scan result. For lower environments, a standard (pre-approved) change auto-approves on green checks; for production, a normal change requires CAB or delegated approval, which gates the PR merge.
- On approval and merge to the config repo, Argo CD, watching that repo, detects the new desired state. When it applies the manifests, the cluster’s admission control (a Kyverno verifyImages policy) verifies the Cosign signature before any Pod is admitted, so an unsigned or tampered image never runs on the cluster even if the manifest somehow references it.
- Argo CD reconciles the EKS cluster to match Git: it applies the manifests, waits for the rollout to become healthy, and reports Synced / Healthy. If reconciliation fails, Argo CD surfaces the failed sync or Degraded health. Argo CD does not roll back on its own after a failed automated sync, so recovery is a revert of the offending commit to the last-known-good Git revision, which Argo CD then reconciles.
- From this point on, Argo CD continuously compares cluster state to Git. Any manual drift is flagged OutOfSync and, where self-heal is enabled, reverted. The drift event is pushed to Datadog and raised as a ServiceNow incident, because in a regulated environment an unexplained production change is a security event, not a curiosity.
The two planes never share credentials. Jenkins can write a PR to the config repo; only Argo CD can talk to the cluster; only an approved human can merge to prod/. That triangle is separation of duties expressed in infrastructure.
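One way to pin the Argo CD corner of that triangle down is an AppProject that restricts which repository production applications may read from and where they may deploy. This is a minimal sketch, not the payer's actual configuration: the project name, repository URL, and claims namespace are taken from the examples in this article, and the platform-auditors group is a hypothetical SSO group name.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: payer-prod
  namespace: argocd
spec:
  description: Production workloads reconciled from the config repo only
  # Applications in this project may only source manifests from the GitOps repo.
  sourceRepos:
    - git@github.com:payer/gitops-config.git
  # Applications in this project may only deploy to these cluster/namespace pairs.
  destinations:
    - server: https://kubernetes.default.svc
      namespace: claims
  roles:
    - name: read-only
      description: View-only access for audit and review
      policies:
        - p, proj:payer-prod:read-only, applications, get, payer-prod/*, allow
      groups:
        - platform-auditors   # hypothetical SSO group
```

An Application that points at any other repository, or at a namespace not listed here, is rejected by Argo CD before it ever syncs, which keeps the "only approved Git reaches the cluster" claim checkable in one file.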
Component breakdown
| Component | Service / tool | Role in the pipeline | Key configuration choices |
|---|---|---|---|
| CI orchestration | Jenkins | Build, test, scan, sign, propose | Kubernetes plugin (ephemeral agents); shared library for all pipelines |
| Pipeline logic | Jenkins shared library | One governed pipeline definition for every service | Versioned, code-reviewed; pinned by version, e.g. @Library('platform-pipeline@v2.4') |
| Image registry | Amazon ECR | Immutable artifact store | SHA-tagged images; scan-on-push; lifecycle policy; cross-account pull role |
| Secrets | HashiCorp Vault | Runtime secret broker for the pipeline | Kubernetes auth method; short TTL leases; no static creds in Jenkins |
| IaC / dep scanning | Wiz Code | Shift-left scan of IaC + dependencies on PR | Blocking gate on critical findings; PR annotations |
| Image signing | Sigstore Cosign | Cryptographic provenance + SBOM attestation | Keyless OIDC or Vault-held key; Rekor transparency log |
| GitOps engine | Argo CD | Pull-based reconciliation of cluster to Git | App-of-apps; self-heal + auto-prune in prod; SSO via Okta |
| Runtime cluster | Amazon EKS | Where workloads actually run | IRSA for pod identity; private API endpoint; managed node groups |
| Admission policy | Kyverno | Verify signatures + policy at admission | verifyImages for Cosign; deny unsigned/unscanned images |
| Change management | ServiceNow | Auditable change record + approval gate | DevOps integration; CAB gate on prod; auto-incident on drift |
| Identity / SSO | Okta | SSO + RBAC for Jenkins, Argo CD, ServiceNow | SAML/OIDC; group-mapped roles; MFA; deprovisioning on offboard |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on EKS workloads | Sensor as DaemonSet; detections piped to SOC/SIEM |
| Observability | Datadog | Pipeline + deployment + drift telemetry | Deployment markers; DORA metrics; drift + sync-failure monitors |
| Configuration | Kustomize / Helm | Environment-specific manifest rendering | Base + per-env overlays; image tag is the only per-deploy diff |
A few choices are load-bearing and worth the why, because teams routinely get them wrong.
Why a Jenkins shared library, not forty Jenkinsfiles. In a forty-service org, copy-pasted pipelines drift instantly — one team skips the scan, another signs with a stale key, a third logs a secret. A shared library centralizes the entire build-sign-propose flow into versioned Groovy that every repo references with one line. Want to add an SBOM step or tighten the Wiz gate across the whole estate? Change the library, bump the version, done. It also means the pipeline itself is code-reviewed and audited, which is its own control: an auditor can read one library instead of forty files.
    // Jenkinsfile in every application repo: thin by design
    @Library('platform-pipeline@v2.4') _
    regulatedDelivery(
        service: 'claims-adjudication',
        env: 'prod',
        configRepo: 'git@github.com:payer/gitops-config.git',
        sign: true,
        wizPolicy: 'block-critical'
    )
Why sign images and verify at admission, not just at build. Signing in CI proves provenance, but proof is worthless if nothing checks it at the point of deployment. The pair is the control: Cosign signs in Jenkins, and Kyverno’s verifyImages policy rejects any image at the EKS admission webhook whose signature does not validate against the expected identity. An attacker who slips a malicious image into ECR cannot get it scheduled, because the cluster refuses to run anything this pipeline did not sign. Provenance becomes enforceable, not aspirational.
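The admission half of that pair might look like the Kyverno policy below. Treat it as a sketch under stated assumptions rather than a drop-in policy: the ECR registry, the OIDC issuer, and the signer subject are hypothetical placeholders, keyless signing is assumed (a Vault-held key would use a key-based attestor entry instead), and newer Kyverno releases prefer a rule-level failure-action field over the spec-level one shown here.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce   # reject at admission, do not merely audit
  background: false
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            # hypothetical registry; only images from here are in scope
            - "123456789012.dkr.ecr.us-east-1.amazonaws.com/*"
          attestors:
            - entries:
                - keyless:
                    # hypothetical issuer and subject for the CI signing identity
                    issuer: "https://oidc.example.internal"
                    subject: "https://jenkins.example.internal/job/claims-adjudication/*"
                    rekor:
                      url: https://rekor.sigstore.dev
```

The subject and issuer are what bind "signed" to "signed by this pipeline": a valid signature from any other identity still fails the check.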
Why ServiceNow is the change record and not Jenkins’ own approval step. Jenkins can pause for a manual approval, but that approval lives in a console log no auditor trusts and no CAB process governs. Routing the change through ServiceNow ties the deployment to the organization’s real change-management workflow — CAB review for production, an approver who is not the author (separation of duties), and a permanent record linked to the artifact, SBOM, and scan. The 2 a.m. emergency fix becomes an emergency change in ServiceNow with retroactive review, instead of an untracked kubectl apply. The change ticket and the Git PR reference each other, so the audit trail is closed on both ends.
Implementation guidance
Stand up the GitOps repo and Argo CD with the app-of-apps pattern. A root Application (app-of-apps) or an ApplicationSet lets one Git-managed object generate every service’s deployment declaratively, so onboarding a new service is a Git commit, not a console click. Enable selfHeal and prune for production so drift is actively corrected and orphaned resources are removed:
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: claims-adjudication-prod
      namespace: argocd
    spec:
      project: payer-prod
      source:
        repoURL: git@github.com:payer/gitops-config.git
        targetRevision: main
        path: apps/claims-adjudication/overlays/prod
      destination:
        server: https://kubernetes.default.svc
        namespace: claims
      syncPolicy:
        automated:
          selfHeal: true   # revert manual drift back to Git
          prune: true      # delete resources removed from Git
        syncOptions:
          - CreateNamespace=false
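For the "onboarding is a Git commit" claim, an ApplicationSet with a Git directory generator is one option: it stamps out an Application like the one above for every service directory it finds. The sketch below assumes the apps/SERVICE/overlays/prod layout implied by the path above, and it assumes each service deploys to a namespace named after its directory, which is a simplification (the example above uses claims for claims-adjudication).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: prod-services
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: git@github.com:payer/gitops-config.git
        revision: main
        directories:
          # one Application per service that has a prod overlay
          - path: apps/*/overlays/prod
  template:
    metadata:
      # path[1] is the second path segment, i.e. the service directory name
      name: '{{path[1]}}-prod'
    spec:
      project: payer-prod
      source:
        repoURL: git@github.com:payer/gitops-config.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path[1]}}'   # assumption: namespace matches service name
      syncPolicy:
        automated:
          selfHeal: true
          prune: true
```

Adding a forty-first service then means adding one directory to the config repo; no one creates an Application by hand.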
Wire Vault into Jenkins with the Kubernetes auth method, and keep TTLs short. The build pod’s Kubernetes service account is the Vault identity; the pipeline exchanges its projected SA token for a short-lived Vault token, reads exactly the secrets it needs, and the token expires on its short TTL whether or not anything cleans up after the build. No secret is ever written to a Jenkins credential store, an environment file, or a job config. This is the control that turns “a compromised Jenkins leaks every production secret” into “a compromised build pod has a few minutes of narrowly-scoped access.” For the config-repo write, prefer a deploy key scoped to a single repo (or a short-lived GitHub App token), never a personal access token with org-wide reach.
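On the Kubernetes side, the identity handed to Vault can be a projected service account token with its own audience and a short lifetime. The pod template below is a sketch of what the Jenkins Kubernetes plugin could be given for build agents; the service account name, the build image, and the vault audience are hypothetical and would have to match the role configured in Vault's Kubernetes auth method.

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: jenkins-build-agent
spec:
  serviceAccountName: jenkins-build   # hypothetical; bound to a Vault auth role
  restartPolicy: Never
  containers:
    - name: build
      # hypothetical build-tools image
      image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/build-tools:1.0
      command: ["sleep"]
      args: ["infinity"]
      volumeMounts:
        - name: vault-token
          mountPath: /var/run/secrets/vault
          readOnly: true
  volumes:
    - name: vault-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              audience: vault          # assumption: audience expected by the Vault role
              expirationSeconds: 600   # shortest lifetime Kubernetes allows
```

The token at /var/run/secrets/vault/token is only useful for logging in to Vault, and only for ten minutes, so a leaked copy is worth far less than a static credential.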
Separate the two repos’ branch protection deliberately. The application repo can be relatively permissive — developers iterate fast. The config repo’s prod path must require: a reviewed PR, an approver who is not the author, a passing status check that the ServiceNow change is approved, and signed commits. This is where separation of duties is enforced in practice. A developer can ship code freely; they cannot put it into production without an independent approval recorded in both Git and ServiceNow.
Keep the per-deploy diff to one line. The only thing Jenkins changes in the config repo on a routine deploy is the image tag (a Kustomize images: patch or a Helm image.tag value). Everything else about the environment — replicas, resources, network policy — changes through deliberate, separately-reviewed PRs. This makes the routine path trivially auditable: a deploy PR that touches more than the image tag is a red flag a reviewer can spot in seconds.
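With Kustomize, that one-line diff can live in the production overlay's images entry. The overlay below is illustrative: the base path, the ECR registry, and the SHA are hypothetical values, and only newTag is expected to change on a routine deploy.

```yaml
# apps/claims-adjudication/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base   # assumption: shared manifests live in a base directory
images:
  - name: claims-adjudication
    # hypothetical registry path
    newName: 123456789012.dkr.ecr.us-east-1.amazonaws.com/claims-adjudication
    # the only line Jenkins rewrites on a routine deploy: the Git SHA tag
    newTag: "9f2c1e7a4b6d8035c1a2e4f6b8d0c3e5a7f91b24"
```

A reviewer who sees anything other than newTag change in a deploy PR knows immediately that it is not a routine deploy.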
Enterprise considerations
Security and separation of duties. The architecture is least-privilege by construction. Jenkins never holds an EKS credential — it can build, sign, and open a PR, nothing more. Argo CD is the only system with cluster write access, and it only applies what is in approved Git. Okta provides SSO and group-mapped RBAC across Jenkins, Argo CD, and ServiceNow, with MFA and automatic deprovisioning when an engineer offboards — so access reviews are a single source, not three. Layer on Wiz Code as the shift-left gate that blocks risky IaC and vulnerable dependencies before merge, and Wiz CSPM watching the running EKS estate for posture drift and exposure. CrowdStrike Falcon runs as a DaemonSet on the nodes for runtime threat detection, streaming to the SOC. Sigstore Cosign + Kyverno close the supply-chain loop: only signed, scanned artifacts run. The threat model worth naming explicitly is the insider or compromised-CI push — and the answer is that even a fully-owned Jenkins cannot deploy to production, because the merge to prod/ needs an independent human and the cluster rejects unsigned images.
Cost optimization. A GitOps pipeline’s cost is mostly compute for builds and the control planes; the levers are straightforward.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Ephemeral build agents | Jenkins Kubernetes plugin; pods exist only during a build | No idle agent fleet; pay per build-second |
| Spot for CI | Run build agents on EKS Spot node groups | ~60–70% cheaper build compute |
| ECR lifecycle policies | Expire untagged + old images automatically | Bounds registry storage growth |
| Right-sized Argo CD | One Argo CD per cluster, app-of-apps, not per-team installs | Avoids control-plane sprawl |
| Cache layers | Build-cache + dependency cache in the shared library | Shorter builds = less compute + faster lead time |
Spot interruptions on CI are low-stakes — a build simply retries — which makes CI an ideal Spot workload and an easy FinOps win to point at.
Scalability. Each plane scales independently. Jenkins scales horizontally on ephemeral Kubernetes agents, so build concurrency is bounded by cluster capacity, not a fixed agent pool. Argo CD scales with ApplicationSets to manage hundreds of applications, and shards its controller across replicas for very large estates or multiple clusters (a hub Argo CD managing many spoke EKS clusters via cluster registrations). The shared library means adding the forty-first service costs one thin Jenkinsfile and one config-repo directory — onboarding is O(1), which is the whole point of investing in the platform.
Failure modes, and what each one looks like. Name them before they page you.
- Config-repo PR merged but Argo CD won’t sync: usually an invalid manifest or a failing health check. Argo CD shows OutOfSync / Degraded and keeps reporting it until Git changes; recovery is a git revert to the last-good commit. Mitigation: render-and-validate manifests in the Jenkins PR check so a broken manifest never merges.
- Image signed but Kyverno rejects it: the signing identity doesn’t match the verify policy (common after a key or OIDC issuer rotation). The Pods are never created; the admission error appears as FailedCreate events on the ReplicaSet, or as a failed sync if the policy also matches the Deployment. Mitigation: test the verify policy in a lower environment on every signing-config change; alert on admission denials in Datadog.
- Vault unavailable mid-build: the pipeline can’t fetch the ECR or deploy-key secret and fails closed. This is correct behavior (no secret, no build) but it makes Vault a build-time dependency. Mitigation: run Vault HA (Raft) with auto-unseal; cache nothing, retry with backoff.
- Drift storm: a controller or operator outside Argo CD keeps re-writing a field, and self-heal fights it forever. Mitigation: ignoreDifferences for fields legitimately owned by another controller; alert if an app flaps sync state.
- ServiceNow integration down: change records don’t open, blocking the prod merge gate. Mitigation: a documented break-glass emergency change path with mandatory retroactive review, so a Sev-1 fix is never stuck behind a ticketing outage.
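The drift-storm mitigation is a small addition to the Application spec. The fragment below is a sketch that assumes the competing controller is a HorizontalPodAutoscaler owning the replica count; any other field would be listed the same way by its JSON pointer.

```yaml
# Fragment of an Argo CD Application spec
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas   # owned by the autoscaler, not by Git
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
    syncOptions:
      # also leave the ignored fields alone when a sync is applied
      - RespectIgnoreDifferences=true
```

Keep this list short and reviewed: every ignored field is a field where drift is no longer detected.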
Reliability and rollback. Rollback in GitOps is a git revert — re-point the config repo at the previous known-good image tag, and Argo CD reconciles back. Because Git is immutable history, every prior production state is recoverable by commit, which makes “roll back to exactly what ran last Tuesday” a one-line operation with a full audit trail. For progressive safety on critical services, pair Argo CD with Argo Rollouts for canary or blue-green deploys so a bad version is caught at small blast radius before full rollout. Argo CD’s reconciliation is itself the reliability mechanism: even if the cluster is partially clobbered, it converges back to the declared state.
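A canary with Argo Rollouts replaces the Deployment with a Rollout that shifts traffic in steps. The manifest below is a minimal sketch: the replica count, step weights, pause durations, and image reference are illustrative assumptions, and without a traffic router the weights are approximated by the ratio of canary to stable Pods.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: claims-adjudication
  namespace: claims
spec:
  replicas: 10
  selector:
    matchLabels:
      app: claims-adjudication
  template:
    metadata:
      labels:
        app: claims-adjudication
    spec:
      containers:
        - name: app
          # hypothetical registry path and SHA tag
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/claims-adjudication:9f2c1e7a4b6d8035c1a2e4f6b8d0c3e5a7f91b24
  strategy:
    canary:
      steps:
        - setWeight: 10          # roughly one Pod in ten runs the new version
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        # after the last step the rollout proceeds to 100%
```

The Rollout still lives in the config repo and is still reconciled by Argo CD, so the audit trail is unchanged; only the way a new version takes over from the old one differs.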
Observability and DORA. Instrument the full lead time in Datadog: pipeline duration, queue time, scan results, and — critically — deployment markers on the Argo CD sync so a latency or error regression in production can be correlated to the exact deploy that caused it. Emit the four DORA metrics the engineering leadership reports on — deployment frequency, lead time for changes, change-failure rate, and time-to-restore — straight from Git and Argo CD events, since GitOps makes all four directly measurable from commit and sync history. Monitor sync failures and drift events as first-class alerts: in a regulated payer, an unexplained OutOfSync is a potential security incident and auto-raises a ServiceNow ticket, not just a Slack message.
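One way to raise the ServiceNow record for drift is Argo CD's notifications controller with a webhook service. The ConfigMap below is a sketch under several assumptions: the ServiceNow instance URL is a placeholder, the token is expected under the key servicenow-token in the notifications secret, and the trigger and template names are made up for this example.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  # Webhook service definition; $servicenow-token is resolved from argocd-notifications-secret
  service.webhook.servicenow: |
    url: https://example.service-now.com
    headers:
      - name: Content-Type
        value: application/json
      - name: Authorization
        value: Bearer $servicenow-token
  # Fire when an application's live state diverges from Git
  trigger.on-drift: |
    - when: app.status.sync.status == 'OutOfSync'
      send: [drift-incident]
  # Incident payload posted to the ServiceNow Table API
  template.drift-incident: |
    webhook:
      servicenow:
        method: POST
        path: /api/now/table/incident
        body: |
          {
            "short_description": "Argo CD drift: {{.app.metadata.name}} is OutOfSync",
            "description": "Live state differs from Git revision {{.app.status.sync.revision}}"
          }
```

An application opts in with the annotation notifications.argoproj.io/subscribe.on-drift.servicenow set to an empty string. Note that an application is also briefly OutOfSync between a legitimate merge and the sync that follows it, so in practice the trigger condition or the receiving workflow needs to separate expected deploys from unexplained drift.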
Explicit tradeoffs
Accept these or do not build it. GitOps adds a second repository and a mental model shift — engineers must internalize that you change production by changing Git, never by touching the cluster, and that hand-running kubectl will be reverted by the platform. That is a feature for compliance and an adjustment for muscle memory; expect a few weeks of “why did my manual change disappear” before the team trusts it. The Jenkins-and-Argo-CD split means two systems to operate instead of one all-in-one pipeline — more moving parts, two upgrade cadences, two RBAC surfaces — justified only when separation of duties and continuous drift-correction actually matter to you. They emphatically do for a regulated payer; they may not for a three-person startup. The signing-and-admission machinery (Cosign, Kyverno, key rotation) is real overhead you can skip for a hobby cluster and cannot skip when an auditor will ask whether unsigned images can run. And routing every production change through ServiceNow CAB adds latency to the deploy — deliberately, because the whole point is that production changes are reviewed and recorded.
The alternatives, and when they win. If you do not need a separate CI plane and Argo CD’s ecosystem, Flux is a lighter pull-based GitOps engine that some teams prefer for its Kubernetes-native, controller-only footprint. If your build tooling already lives in GitHub, GitHub Actions feeding Argo CD removes Jenkins entirely and is simpler for GitHub-centric shops — Jenkins earns its place here because the payer has years of shared-library investment and pipelines for non-containerized legacy systems too. If you genuinely have one small cluster and one team, a single push pipeline with a manual gate is less machinery — but it cannot answer the auditor’s drift question, which is the line that decides it for a regulated org. And if you want the change-management rigor without GitOps, Spinnaker offers managed-delivery pipelines with built-in approval stages, at the cost of a heavier platform to run.
The shape of the win
For the payer’s platform team, the payoff is not “faster deploys” — though lead time does drop once the routine path is one-line PRs. The payoff is the audit conversation. When the auditor asks who approved this production change, what exactly shipped, and prove the cluster matches it, the answer is now a single Git history (authored, reviewed, signed), a single ServiceNow change record (CAB-approved, linked to a signed artifact and an SBOM), and a single Argo CD screen showing Synced / Healthy with drift actively corrected. The 2 a.m. emergency fix is an emergency change with retroactive review, not a ghost. Everything upstream — the Jenkins shared library, the Vault-injected secrets, the Cosign signatures, the Wiz Code gate, the Okta-governed access — exists to make that one conversation end with a clean opinion instead of a qualified one. In a regulated payer, a clean SOC 2 and HITRUST report is not a compliance checkbox; it is something the sales team puts in front of every prospect, and that is what funds the platform.