A 200-engineer platform team is bleeding money on a row of permanently-on Jenkins agent VMs that sit at 8% utilization overnight and still buckle at 9am when every squad pushes at once. Worse, the agents have drifted: one has JDK 17, another JDK 21, a third has a stale Trivy binary, and “works on the build server” has stopped meaning anything. The fix is to stop treating build capacity as a fleet of pets and start treating it as ephemeral pods — the Jenkins Kubernetes plugin asks the cluster for a fresh agent pod when a job needs one, runs the build inside per-job container templates pinned to exact tool versions, and deletes the pod the instant the build finishes. You pay only for the seconds a build actually runs, every build starts from an identical image, and the controller config lives in version control as code. This guide walks the whole thing end to end: a Helm-installed controller, JCasC (Jenkins Configuration as Code) to declare the cloud and pod templates, a real multi-container pipeline, then validation, rollback, and the security/cost notes that keep it production-grade.
Prerequisites
- A Kubernetes cluster, 1.28+ (EKS, AKS, GKE, or on-prem) with a working kubectl context and cluster-admin for the initial install.
- Helm 3.12+ and the jenkinsci/helm-charts repo reachable.
- A default StorageClass for the controller’s persistent home (e.g. gp3 on EKS, managed-csi on AKS).
- An ingress controller (NGINX or your cloud’s) and a DNS name you control, fronted by Akamai for edge TLS termination, WAF, and bot mitigation so the controller UI is never raw-exposed.
- An OIDC identity provider — Okta as the workforce IdP (federated to Microsoft Entra ID where Azure RBAC is in play) — to back Jenkins SSO instead of local accounts.
- HashiCorp Vault reachable from the cluster for build secrets (registry creds, signing keys) injected at runtime rather than baked into images.
- A container registry (ECR/ACR/GHCR) for your agent images, and Terraform already managing the cluster and its node pools.
Target topology
The shape is deliberately simple and that is the point. A single long-lived Jenkins controller runs as a StatefulSet in a jenkins namespace, with its $JENKINS_HOME on a persistent volume so jobs, build history, and credentials survive a pod restart. The controller holds no build capacity of its own — it is a scheduler and a UI. When a pipeline requests an agent, the Kubernetes plugin calls the Kubernetes API and creates an ephemeral agent pod in a separate jenkins-agents namespace. That pod contains a jnlp container (the Jenkins agent process that phones home over JNLP/WebSocket) plus one or more per-job tool containers — maven, node, docker/kaniko, trivy, whatever the job’s podTemplate declares. The build steps run inside those containers; when the pipeline ends, the plugin deletes the pod and the node-level autoscaler reclaims the node minutes later. Argo CD keeps the controller’s Helm release and JCasC in sync with Git (GitOps), so the controller is reproducible and drift-free. Identity flows Okta → Entra → Jenkins OIDC; secrets flow Vault → agent pod at runtime; and Wiz, CrowdStrike Falcon, and Datadog observe the whole namespace.
1. Create namespaces and the agent service account
Isolate the controller from the throwaway agents. The agents get their own namespace and a tightly-scoped service account — they should never be able to mutate the controller.
kubectl create namespace jenkins
kubectl create namespace jenkins-agents
# Service account the agent pods run as
kubectl -n jenkins-agents create serviceaccount jenkins-agent
Grant the controller’s service account permission to create and delete pods in the agents namespace only — least privilege, scoped by RoleBinding, never a cluster-wide ClusterRoleBinding:
# agent-rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: jenkins-agent-manager
  namespace: jenkins-agents
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: jenkins-controller-manages-agents
  namespace: jenkins-agents
subjects:
  - kind: ServiceAccount
    name: jenkins # created by the Helm chart in step 2
    namespace: jenkins
roleRef:
  kind: Role
  name: jenkins-agent-manager
  apiGroup: rbac.authorization.k8s.io
kubectl apply -f agent-rbac.yaml
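Before moving on, it can help to confirm that the binding grants what the plugin needs and nothing more. The sketch below wraps `kubectl auth can-i` with service-account impersonation. The helper names `can_i` and `check_agent_rbac` are illustrative, and the sketch assumes your own credentials are allowed to impersonate (cluster-admin is).

```shell
# Sketch: check what the controller's ServiceAccount may do, via impersonation.
# can_i and check_agent_rbac are hypothetical helper names, not kubectl features.

CONTROLLER_SA="system:serviceaccount:jenkins:jenkins"

# Prints "yes" or "no" for one verb/resource/namespace triple.
can_i() {
  local verb="$1" resource="$2" namespace="$3"
  # `kubectl auth can-i` exits non-zero on "no", so ignore the status
  # and rely on the printed answer instead.
  kubectl auth can-i "$verb" "$resource" \
    --namespace "$namespace" --as "$CONTROLLER_SA" 2>/dev/null || true
}

check_agent_rbac() {
  local failed=0 verb
  # The plugin must be able to manage pods in the agents namespace...
  for verb in create delete get list watch; do
    if [ "$(can_i "$verb" pods jenkins-agents)" != "yes" ]; then
      echo "MISSING: $verb pods in jenkins-agents"
      failed=1
    fi
  done
  # ...and this Role should not let it create pods anywhere else.
  if [ "$(can_i create pods default)" = "yes" ]; then
    echo "TOO BROAD: controller SA can create pods in default"
    failed=1
  fi
  [ "$failed" -eq 0 ] && echo "RBAC OK"
  return "$failed"
}
```

Calling `check_agent_rbac` after the apply should print `RBAC OK`; any `MISSING` or `TOO BROAD` line points at the Role or at an unexpected cluster-wide binding.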
2. Install the Jenkins controller with Helm
Use the official chart. The key is to pin the agents into the right namespace, give the controller a persistent home, and turn on the plugins we need — the Kubernetes plugin, JCasC, and the OIDC plugin for Okta/Entra SSO.
helm repo add jenkins https://charts.jenkins.io
helm repo update
# values.yaml (managed by Terraform/Argo CD in production)
controller:
  image:
    tag: "2.452.3-lts-jdk17"
  installPlugins:
    - kubernetes:4253.v7700d91739e8
    - configuration-as-code:1810.v9b_c30a_249a_4c
    - workflow-aggregator:600.vb_57cdd26fdd7 # Pipeline
    - git:5.2.2
    - oic-auth:4.418.vccc7061f5b_6d # OIDC for Okta/Entra
    - hashicorp-vault-plugin:370.va_4c5fe9f9a_69 # Vault credentials
    - role-strategy # needed by the roleBased authorizationStrategy in step 3; pin a version you have tested
  JCasC:
    defaultConfig: true
    configScripts:
      jenkins-casc: |
        # filled in step 3
  serviceType: ClusterIP # exposed via Ingress + Akamai, not LoadBalancer
  resources:
    requests: { cpu: "1", memory: "2Gi" }
    limits: { cpu: "2", memory: "4Gi" }
persistence:
  enabled: true
  storageClass: "gp3"
  size: "30Gi"
agent:
  enabled: false # we declare agents in JCasC, not chart defaults
serviceAccount:
  create: true
  name: jenkins
Install it:
helm upgrade --install jenkins jenkins/jenkins \
--namespace jenkins \
--values values.yaml \
--wait --timeout 10m
Fetch the initial admin password (you will replace this with OIDC in step 3):
kubectl -n jenkins exec -it sts/jenkins -c jenkins -- \
cat /run/secrets/additional/chart-admin-password
3. Configure the cloud and pod templates with JCasC
This is the heart of the setup. Everything below goes in the JCasC.configScripts.jenkins-casc block from step 2 (or as a separate ConfigMap that the chart mounts). JCasC declares the Kubernetes cloud (how the controller talks to the API and where agents land) and one or more reusable pod templates.
jenkins:
  clouds:
    - kubernetes:
        name: "k8s"
        serverUrl: "https://kubernetes.default.svc"
        namespace: "jenkins-agents"
        jenkinsUrl: "http://jenkins.jenkins.svc.cluster.local:8080"
        jenkinsTunnel: "jenkins-agent.jenkins.svc.cluster.local:50000"
        directConnection: false # WebSocket/JNLP through the controller svc
        containerCapStr: "50" # hard ceiling on concurrent agent pods
        connectTimeout: 100
        readTimeout: 200
        podRetention: "never" # delete the pod the moment the build ends
        templates:
          - name: "base"
            label: "k8s-base"
            serviceAccount: "jenkins-agent"
            idleMinutes: 0 # do not keep idle agents warm
            yamlMergeStrategy: "merge"
            containers:
              - name: "jnlp"
                image: "jenkins/inbound-agent:3261.v9c670a_4748a_9-1"
                resourceRequestCpu: "500m"
                resourceRequestMemory: "512Mi"
                resourceLimitCpu: "1"
                resourceLimitMemory: "1Gi"
  securityRealm:
    oic:
      clientId: "${OIDC_CLIENT_ID}"
      clientSecret: "${OIDC_CLIENT_SECRET}"
      wellKnownOpenIDConfigurationUrl: "https://your-org.okta.com/.well-known/openid-configuration"
      userNameField: "preferred_username"
      groupsFieldName: "groups"
  authorizationStrategy:
    roleBased:
      roles:
        global:
          - name: "admin"
            permissions: ["Overall/Administer"]
            assignments: ["platform-admins"] # Okta/Entra group claim
          - name: "developer"
            permissions: ["Overall/Read", "Job/Build", "Job/Read"]
            assignments: ["developers"]
unclassified:
  location:
    url: "https://jenkins.example.com/"
A few choices that teams get wrong, called out explicitly:
- podRetention: never and idleMinutes: 0 are what make agents genuinely ephemeral. Setting retention to onFailure keeps failed pods around “for debugging” and quietly fills your node pool, so only enable it temporarily.
- containerCapStr is your blast-radius limit. Without it, a flood of queued jobs can try to schedule hundreds of pods and exhaust the cluster.
- jenkinsTunnel must point at the 50000 agent port service the Helm chart creates (jenkins-agent), or agents connect to the UI port and silently fail to register.
- The OIDC realm replaces local accounts entirely: Okta (federated to Entra ID for Azure-backed RBAC) issues the token, and the groups claim drives Jenkins’ role-based authorization, so access is managed in the IdP, not in Jenkins.
The ${OIDC_CLIENT_SECRET} and any registry/signing secrets are not hardcoded — they are pulled from HashiCorp Vault via the Vault plugin and exposed to JCasC as environment variables, so no secret is ever written into the Helm values or Git.
Apply by upgrading the release, then reload config without a restart:
helm upgrade jenkins jenkins/jenkins -n jenkins --values values.yaml --wait
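# JCasC reloads on chart upgrade via the sidecar; or force it.
# The endpoint expects the token configured through the casc.reload.token system
# property; CASC_RELOAD_TOKEN is a placeholder you set to that value first.
kubectl -n jenkins exec sts/jenkins -c jenkins -- \
  curl -s -X POST "localhost:8080/reload-configuration-as-code/?casc-reload-token=${CASC_RELOAD_TOKEN}"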
4. Build a per-job pod template into a Pipeline
Now use it. A pipeline declares its own pod inline with a podTemplate so each job gets exactly the tool containers it needs — a Maven build, a Kaniko image build with no Docker daemon, and a Trivy scan, all in one ephemeral pod sharing the workspace.
// Jenkinsfile
pipeline {
  agent {
    kubernetes {
      yaml '''
apiVersion: v1
kind: Pod
spec:
  serviceAccountName: jenkins-agent
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
  containers:
    - name: maven
      image: maven:3.9.6-eclipse-temurin-17
      command: ["cat"]
      tty: true
      resources:
        requests: { cpu: "1", memory: "1Gi" }
        limits: { cpu: "2", memory: "2Gi" }
    - name: kaniko
      image: gcr.io/kaniko-project/executor:v1.23.2-debug
      command: ["sleep"]
      args: ["9999999"]
      # Kaniko's executor must run as root inside its own (unprivileged)
      # container, so override the pod-level non-root setting for it only.
      securityContext:
        runAsNonRoot: false
        runAsUser: 0
    - name: trivy
      image: aquasec/trivy:0.53.0
      command: ["cat"]
      tty: true
'''
    }
  }
  stages {
    stage('Build & test') {
      steps {
        container('maven') {
          sh 'mvn -B -ntp clean verify'
        }
      }
    }
    stage('Image build (no daemon)') {
      steps {
        container('kaniko') {
          // registry creds injected from Vault, mounted at /kaniko/.docker
          sh '''/kaniko/executor \
            --context=`pwd` \
            --dockerfile=Dockerfile \
            --destination=ghcr.io/acme/app:${BUILD_NUMBER} \
            --cache=true'''
        }
      }
    }
    stage('Scan') {
      steps {
        container('trivy') {
          sh 'trivy image --exit-code 1 --severity HIGH,CRITICAL ghcr.io/acme/app:${BUILD_NUMBER}'
        }
      }
    }
  }
}
The same job, run via GitHub Actions for ephemeral runners or promoted by Argo CD into the cluster, would reuse these container images — keeping the toolchain identical across CI systems. Terraform (or Ansible for any node-level config) owns the node pools these pods land on, so capacity is also code.
5. Wire in secrets, identity, and the operating stack
Glue the ephemeral pods into the platform so they are governed, not just functional.
- HashiCorp Vault issues short-lived registry and signing credentials. The agent pod authenticates to Vault with its Kubernetes service-account JWT (Vault’s kubernetes auth method), leases a token, and mounts the secret, so nothing long-lived sits in a Secret object. Configure the role to bind exactly the jenkins-agent SA in jenkins-agents:
  vault write auth/kubernetes/role/jenkins-agent \
    bound_service_account_names=jenkins-agent \
    bound_service_account_namespaces=jenkins-agents \
    policies=ci-registry-read,ci-signing \
    ttl=20m
- Okta → Entra ID is the only way humans log in (step 3); Jenkins local auth stays disabled. Group claims map to Jenkins roles, so onboarding/offboarding happens in the IdP.
- Wiz / Wiz Code scans the agent and controller images in the registry and runs CSPM over the jenkins / jenkins-agents namespaces, alerting if a pod escapes its scoped RBAC or an image ships a critical CVE. It is the posture backstop behind the Trivy gate in the pipeline.
- CrowdStrike Falcon sensors on the node pool give runtime threat detection on the ephemeral pods themselves (crypto-miner in a poisoned dependency, unexpected egress) and feed detections to the SOC.
- Datadog (or Dynatrace) collects the Jenkins Prometheus metrics, agent pod lifecycle events, and build-stage timings, so queue time, pod-startup latency, and per-pipeline duration are dashboards, not guesses.
- ServiceNow receives an auto-raised change record when the controller’s Helm/JCasC release is promoted, and an incident ticket on a guardrail breach (a Falcon detection, a sustained scan failure), which gives compliance a documented gate.
- Internal training for new engineers on this pipeline lives as a course in Moodle, and any legacy build that still needs a Windows toolchain runs on a virtual appliance node attached to the cluster as a separate, labeled pod template until it can be containerized.
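To make the Vault step concrete, here is a minimal sketch of the login an agent container could perform against Vault's Kubernetes auth endpoint (`POST /v1/auth/kubernetes/login`). It assumes `VAULT_ADDR` is set and that `curl` and `jq` exist in the container image; the function names are illustrative, and a Vault Agent sidecar or the Jenkins Vault plugin may do this for you instead.

```shell
# Sketch: exchange the pod's service-account JWT for a short-lived Vault token.
# Assumptions: VAULT_ADDR is exported; curl and jq are available in the image.

SA_TOKEN_FILE="${SA_TOKEN_FILE:-/var/run/secrets/kubernetes.io/serviceaccount/token}"

# Build the JSON body for the Kubernetes auth login call.
vault_login_payload() {
  local role="$1" jwt="$2"
  printf '{"role":"%s","jwt":"%s"}' "$role" "$jwt"
}

# Log in with the given Vault role and print the client token.
vault_k8s_login() {
  local role="$1" jwt
  jwt="$(cat "$SA_TOKEN_FILE")"
  curl --silent --fail --request POST \
    --data "$(vault_login_payload "$role" "$jwt")" \
    "${VAULT_ADDR}/v1/auth/kubernetes/login" \
    | jq -r '.auth.client_token'
}
```

A build step could then run `VAULT_TOKEN="$(vault_k8s_login jenkins-agent)"` and read its registry credentials with that token, which expires with the 20-minute TTL set on the role.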
Validation
Confirm the controller is healthy and that agents are genuinely ephemeral — created on demand, gone after.
# Controller pod Running, PVC bound
kubectl -n jenkins get pods,pvc
kubectl -n jenkins logs sts/jenkins -c jenkins | grep -i "Configuration as Code"
# The k8s cloud is registered and reachable
kubectl -n jenkins exec sts/jenkins -c jenkins -- \
curl -s localhost:8080/manage/cloud/k8s/ -o /dev/null -w "%{http_code}\n"
Trigger a build, then watch a pod appear in the agents namespace and disappear when it ends:
# In one terminal — watch agents come and go
kubectl -n jenkins-agents get pods -w
You should see a pod named like app-build-7-xxxxx-yyyyy go Pending → Running → Completed/Terminating within the build’s lifetime, then vanish. Then verify that no agents linger idle:
# After a few builds: zero idle agent pods should remain
kubectl -n jenkins-agents get pods --no-headers | wc -l # expect 0 between builds
A green run with the Maven, Kaniko, and Trivy stages all passing — and the pod gone afterward — is the success criterion.
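The "pod gone afterward" check can be scripted so it fits into a smoke test. The sketch below polls the agents namespace until it is empty or a deadline passes; the function names, the 300-second default timeout, and the 5-second interval are assumptions to adjust.

```shell
# Sketch: wait until no agent pods remain, or give up after a timeout.

# Number of pods currently in the agents namespace (0 when there are none).
count_agent_pods() {
  kubectl -n jenkins-agents get pods --no-headers 2>/dev/null | grep -c . || true
}

# Usage: wait_for_no_agents [timeout_seconds] [poll_interval_seconds]
wait_for_no_agents() {
  local timeout_s="${1:-300}" interval_s="${2:-5}" waited=0
  while [ "$(count_agent_pods)" -gt 0 ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "agent pods still present after ${timeout_s}s"
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "no agent pods remain"
}
```

Running `wait_for_no_agents 120` right after a build finishes gives a pass/fail signal; a failure usually points at pod retention or idleMinutes settings.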
Rollback / teardown
Because the controller is Helm-managed and the agents are stateless, rollback is clean.
# Roll the controller back to the previous release revision
helm history jenkins -n jenkins
helm rollback jenkins <PREVIOUS_REVISION> -n jenkins --wait
# Or fully tear down — agents first, then controller, then namespaces
kubectl -n jenkins-agents delete pods --all # kill any in-flight agents
helm uninstall jenkins -n jenkins
kubectl delete -f agent-rbac.yaml
kubectl delete namespace jenkins-agents
kubectl delete namespace jenkins # this deletes the PVC and JENKINS_HOME — back it up first
If you only need to disable ephemeral agents temporarily (e.g. cluster maintenance), set containerCapStr: "0" in JCasC and reload — the controller stays up but schedules no new pods, draining gracefully.
Common pitfalls
- Agents stuck Pending or never connecting. A pod that sits Pending is usually unschedulable (no node with enough free CPU/memory) or cannot pull its image. If no pod appears at all, the controller SA lacks create pods in jenkins-agents; if the pod runs but the agent stays offline, jenkinsTunnel points at the wrong service/port (must be the 50000 agent port). Check kubectl -n jenkins-agents describe pod <agent> events.
- Pods never get deleted. podRetention left at onFailure/always, or a finalizer hanging. Set never for production and reserve retention for ad-hoc debugging.
- jnlp container overridden by accident. If your podTemplate names a container jnlp, you replace the agent itself. Name tool containers maven/node/etc. and let the plugin inject jnlp.
- Workspace not shared. All containers in a podTemplate share the workspace volume automatically; if a step in container('trivy') can’t see the artifact container('maven') built, you likely overrode the workspace mount.
- Image pulls throttle builds. Cold pulls of large tool images dominate startup. Pre-pull with a DaemonSet or use a pull-through cache so agent startup is seconds, not minutes.
- Clock/timeout flakiness at scale. Under a thundering herd, bump connectTimeout/readTimeout and raise containerCapStr deliberately rather than letting jobs queue invisibly.
Security notes
Run agent pods as non-root with a restricted securityContext (runAsNonRoot: true, drop all capabilities, read-only root filesystem where the toolchain allows), and never mount the host Docker socket — use Kaniko or BuildKit rootless for image builds so a poisoned build cannot escape to the node. Keep the controller off the public internet: ClusterIP service behind Ingress, fronted by Akamai for TLS and WAF. Disable Jenkins local auth and gate every login through Okta/Entra OIDC with group-driven RBAC. Pull all build secrets from HashiCorp Vault at runtime with short TTLs, so nothing durable lands in a Secret. Let Wiz scan images and namespace posture and CrowdStrike Falcon watch runtime, with breaches auto-ticketed in ServiceNow. Apply a NetworkPolicy so agent pods can reach only the registry, Vault, and the controller — not each other or the wider cluster.
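The NetworkPolicy mentioned above could look roughly like the sketch below, which writes a manifest that denies all ingress to agent pods and allows egress only to the controller, cluster DNS, and a placeholder address range. The policy name, the output file name, and the 203.0.113.0/24 CIDR (a documentation range) are assumptions. Standard NetworkPolicy matches IP ranges, not hostnames, so a hosted registry may need your CNI's FQDN-based policy instead.

```shell
# Sketch: generate a restrictive NetworkPolicy for the agents namespace.
# The ipBlock CIDR is a PLACEHOLDER for your registry and Vault addresses.
write_agent_netpol() {
  local out="${1:-agent-netpol.yaml}"
  cat > "$out" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agents-restricted
  namespace: jenkins-agents
spec:
  podSelector: {}              # applies to every pod in jenkins-agents
  policyTypes: ["Ingress", "Egress"]
  ingress: []                  # nothing may connect in, including other agents
  egress:
    - to:                      # controller HTTP and agent ports
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: jenkins
      ports:
        - { protocol: TCP, port: 8080 }
        - { protocol: TCP, port: 50000 }
    - to:                      # cluster DNS
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
    - to:                      # PLACEHOLDER: registry and Vault
        - ipBlock: { cidr: 203.0.113.0/24 }
      ports:
        - { protocol: TCP, port: 443 }
        - { protocol: TCP, port: 8200 }
EOF
  echo "$out"
}
```

Apply it with `kubectl apply -f "$(write_agent_netpol)"` once the placeholder range is replaced, and confirm your CNI actually enforces NetworkPolicy before relying on it.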
Cost notes
This is where the design pays for itself. Permanently-on agents bill 24/7 regardless of load; ephemeral pods bill only for the seconds a build runs. Pair the pod model with a cluster autoscaler (or Karpenter on EKS) on a dedicated, Spot/Preemptible node pool labeled for agents — builds are interruptible and idempotent, so a 60–80% discount is realistic. Right-size resourceRequest/limit per container so the scheduler bin-packs tightly instead of stranding capacity. Set idleMinutes: 0 and podRetention: never so nothing idles. Use Datadog to chart cost-relevant signals — node-pool utilization, pod-startup latency, and build minutes per team — and feed per-team build minutes into chargeback so each squad owns its spend. The combined effect for the team in the scenario: the overnight VM bill goes to near zero, peak capacity becomes elastic instead of a fixed ceiling, and every build runs on an identical, version-pinned toolchain — the “works on the build server” problem solved by construction.
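The comparison can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only: every input in the usage note is a hypothetical figure, not a measurement from the scenario, and the model ignores node scale-down lag and image-pull time.

```shell
# Sketch: compare a fixed VM fleet with pay-per-build capacity.
# All inputs are supplied by the caller; nothing here is measured data.

# cost = vm_count * hours_per_month * hourly_rate
always_on_cost() {
  awk -v n="$1" -v h="$2" -v r="$3" 'BEGIN { printf "%.2f\n", n * h * r }'
}

# cost = build_hours_per_month * hourly_rate * (1 - spot_discount)
ephemeral_cost() {
  awk -v b="$1" -v r="$2" -v d="$3" 'BEGIN { printf "%.2f\n", b * r * (1 - d) }'
}
```

With made-up inputs, `always_on_cost 10 730 0.20` prints 1460.00 (ten VMs at $0.20/hour for a 730-hour month), while `ephemeral_cost 400 0.20 0.70` prints 24.00 (400 build-hours at the same rate with a 70% Spot discount). Substitute your own fleet size, rates, and build minutes from Datadog to get a figure worth quoting.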