A payments platform team gets the findings back from their first real supply-chain audit: anyone with kubectl apply access can run :latest from an arbitrary public registry, half the pods have no CPU/memory limits so one bad deploy noisy-neighbours an entire node, and a third of workloads run as root with hostPath mounts. The CISO’s instruction is blunt: “nothing runs in production unless it is our signed image, it stays inside its limits, and it cannot get root on the node.” You can chase that with code review and good intentions, or you can make the cluster itself refuse the bad manifest at the API server. This guide does the latter with Kyverno, the Kubernetes-native policy engine, enforcing three controls as a single admission gate: image signature verification (Cosign), resource limits (mutate + validate), and the restricted Pod Security Standard. Every command below is real and runnable against any conformant cluster (AKS, EKS, GKE, or vanilla).
Prerequisites
- A Kubernetes cluster, v1.27+ (Kyverno 1.13 tracks recent APIs), with cluster-admin via kubectl.
- helm 3.12+, kubectl, cosign 2.x, and jq on your workstation.
- An OCI registry you control (GHCR, ECR, ACR, Artifact Registry). Examples use ghcr.io/kloudvin.
- A CI system that builds and signs images. Examples use GitHub Actions; Jenkins works identically with the Cosign CLI.
- Cluster egress (or a mirror) to ghcr.io/kyverno for the controller images.
- Optional but assumed in the operating model: HashiCorp Vault (or a KMS) holding the Cosign private key, Argo CD for GitOps delivery of policies, and Wiz / a SIEM consuming policy reports.
Target topology
Kyverno installs as a set of controllers in the kyverno namespace and registers two webhooks with the API server: a ValidatingWebhookConfiguration (deny on policy violation) and a MutatingWebhookConfiguration (inject defaults, verify-and-rewrite image digests). Every CREATE/UPDATE of a Pod-bearing resource flows API server → Kyverno admission controller → your ClusterPolicy rules → allow / mutate / deny. A separate reports controller writes PolicyReport objects continuously so you have a posture view even for resources admitted before a policy existed. Three independent control planes feed in:
- CI (GitHub Actions / Jenkins) builds the image and signs it with Cosign, whose private key is issued just-in-time from HashiCorp Vault (or keyless signing via OIDC) — so the signing secret never lives in a runner.
- GitOps (Argo CD) delivers the Kyverno policies themselves from a Git repo, making the policy set auditable and revertable, and Terraform/Ansible provisions the cluster add-on and namespace labels underneath.
- Posture/SOC: Wiz correlates the admission policies with cloud posture and flags drift, PolicyReports stream to Datadog/Dynatrace dashboards, and a hard denial can open a ServiceNow ticket. CrowdStrike Falcon stays on the nodes for runtime defense — Kyverno gates admission, Falcon watches what runs after.
1. Install Kyverno
Install via the official Helm chart. Run admission in high availability (3 replicas) for any cluster that matters — a single Kyverno pod is a single point of admission failure.
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno \
--namespace kyverno --create-namespace \
--version 3.3.4 \
--set admissionController.replicas=3 \
--set backgroundController.replicas=2 \
--set reportsController.replicas=2 \
--set cleanupController.replicas=2
Wait for the controllers and confirm the webhooks registered:
kubectl -n kyverno rollout status deploy/kyverno-admission-controller
kubectl get pods -n kyverno
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep kyverno
A critical safety setting before you write any policy: decide what happens if Kyverno itself is down. The default failurePolicy: Fail means admission requests are rejected when the webhook is unreachable — safe, but it can wedge a cluster. Set it deliberately per policy (below). Also confirm Kyverno excludes its own and system namespaces so you cannot deadlock the control plane:
kubectl get configmap kyverno -n kyverno -o jsonpath='{.data.webhooks}' ; echo
# Expect kube-system / kyverno excluded by namespaceSelector
2. Set up Cosign signing in CI
Image-signature enforcement is worthless if your own images are unsigned, so build the signing side first. Generate a key pair, or — preferred — use keyless signing where Cosign gets a short-lived certificate from Fulcio bound to your CI’s OIDC identity, leaving no long-lived key to leak.
Key-based, with the private key stored in HashiCorp Vault (never in the repo or a plain CI secret):
# One-time: generate and push the public half to the registry/Git; private half to Vault
cosign generate-key-pair
vault kv put secret/cosign/payments cosign.key=@cosign.key password='<passphrase>'
shred -u cosign.key # do not keep the private key on disk
The CI job pulls the key from Vault at build time and signs the digest (never a tag):
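# .github/workflows/build-sign.yml (GitHub Actions)
name: build-sign
on:
  push:
    branches: [main]          # assumed trigger; adjust to your release flow
permissions:
  contents: read
  packages: write             # push the image and its signature to GHCR
  id-token: write             # required for keyless / Vault OIDC auth
jobs:
  build-sign:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        id: build
        with: { push: true, tags: "ghcr.io/kloudvin/api:${{ github.sha }}" }
      - uses: sigstore/cosign-installer@v3
      # Option A, keyless (recommended): identity is the GitHub OIDC token
      - run: |
          cosign sign --yes \
            "ghcr.io/kloudvin/api@${{ steps.build.outputs.digest }}"
      # Option B, key held in Vault. hashivault:// resolves against Vault's Transit
      # engine (with VAULT_ADDR and VAULT_TOKEN set in the job), not a KV path:
      # - run: cosign sign --yes --key "hashivault://payments/cosign" \
      #     "ghcr.io/kloudvin/api@${{ steps.build.outputs.digest }}"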
Verify locally so you know the exact identity strings the cluster policy must match:
cosign verify \
--certificate-identity-regexp "https://github.com/kloudvin/.+" \
--certificate-oidc-issuer "https://token.actions.githubusercontent.com" \
ghcr.io/kloudvin/api@<digest> | jq '.[0].optional.Subject'
3. Enforce image signatures with verifyImages
Now the gate. This ClusterPolicy uses Kyverno’s verifyImages rule to require a valid Cosign signature for any image from your registry. Start in Audit so you can see the blast radius before you block anything.
# policies/verify-images.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
  annotations:
    policies.kyverno.io/severity: high
spec:
  validationFailureAction: Audit   # flip to Enforce in step 7
  failurePolicy: Fail
  webhookTimeoutSeconds: 30        # signature checks are slower than plain validation
  background: false                # verifyImages cannot run as a background scan
  rules:
    - name: verify-ghcr-cosign-keyless
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "ghcr.io/kloudvin/*"   # only OUR registry; pin public ones separately
          failureAction: Audit
          # Kyverno's policy validation rejects mutateDigest: true while the failure
          # action is Audit, so keep it false here and set it to true when you flip
          # to Enforce in step 7 (it then rewrites the verified tag to an immutable @sha256).
          mutateDigest: false
          required: true
          attestors:
            - count: 1
              entries:
                - keyless:
                    subject: "https://github.com/kloudvin/*"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
If you signed with a Vault/KMS key instead of keyless, swap the attestor entry for the public key:
entries:
  - keys:
      publicKeys: |-
        -----BEGIN PUBLIC KEY-----
        MFkwEwYHKoZIzj0CAQ...your cosign.pub...
        -----END PUBLIC KEY-----
      rekor:
        url: https://rekor.sigstore.dev
Apply it and watch the reports:
kubectl apply -f policies/verify-images.yaml
kubectl get clusterpolicy require-signed-images
kubectl get policyreport -A | head # PASS/FAIL counts per namespace
Once the policy is enforcing, mutateDigest: true does quiet, important work: after verification, Kyverno rewrites :tag to the pinned @sha256:... digest in the pod spec, so what runs is provably the bytes you signed. That closes the tag-mutation window where an attacker re-pushes a tag after verification.
4. Mutate in default resource limits
A pod with no limits can starve a node. Use a mutate rule to inject sane defaults when the author omits them — non-destructive, and far better adoption than rejecting every under-specified deployment on day one.
# policies/default-resources.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-resources
spec:
  rules:
    - name: set-default-requests-limits
      match:
        any:
          - resources:
              kinds: [Pod]
      mutate:
        foreach:
          - list: "request.object.spec.containers"
            patchStrategicMerge:
              spec:
                containers:
                  - name: "{{ element.name }}"
                    resources:
                      requests:
                        +(memory): "128Mi"   # +(...) = add only if absent
                        +(cpu): "100m"
                      limits:
                        +(memory): "512Mi"
                        +(cpu): "500m"
The +(...) anchor means Kyverno only adds the field if it is missing — it never overwrites an explicit value the author set on purpose.
5. Require resource limits with validate
Defaulting is a safety net, not a rule. Pair it with a validate rule so a container that explicitly omits limits in a namespace you care about is rejected outright — defence in depth against someone setting limits: null.
# policies/require-limits.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-cpu-mem-limits
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "CPU and memory limits are required on every container."
        foreach:
          - list: "request.object.spec.containers"
            deny:
              conditions:
                any:
                  - key: "{{ element.resources.limits.memory || '' }}"
                    operator: Equals
                    value: ""
                  - key: "{{ element.resources.limits.cpu || '' }}"
                    operator: Equals
                    value: ""
Order matters: Kyverno runs mutate rules before validate, so the step-4 defaults are applied first and only a container that cannot be defaulted (e.g. an explicit null) trips this deny.
6. Enforce restricted Pod Security
PodSecurityPolicy was removed in Kubernetes 1.25, so on the v1.27+ clusters this guide targets it is not an option. Kyverno’s podSecurity subrule takes its place and maps directly to the upstream Pod Security Standards. This single rule enforces the entire restricted profile: no root, no privilege escalation, dropped capabilities, seccomp, no host namespaces.
# policies/pod-security-restricted.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: pod-security-restricted
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: restricted-profile
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        podSecurity:
          level: restricted
          version: latest
          # Targeted, auditable exemptions instead of a blanket opt-out:
          exclude:
            - controlName: "Capabilities"
              images: ["ghcr.io/kloudvin/net-tools:*"]
Why Kyverno over the built-in Pod Security Admission: PSA operates per-namespace at fixed levels, its exemptions are coarse (users, runtime classes, namespaces), and it cannot mutate or report centrally. Kyverno gives you per-image exemptions, the same PolicyReport stream as your other controls, and a single place for security to review. Apply all the policies through Argo CD rather than kubectl in production so the policy set is the Git-tracked source of truth:
kubectl apply -f policies/ # or sync the Argo CD Application
kubectl get cpol # all four ClusterPolicies, READY=true
7. Promote from Audit to Enforce
Never go straight to Enforce on a live cluster. Run in Audit, read the reports, fix the offenders, then flip. Find what would be blocked:
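# Aggregate failing rules across the cluster.
# Recent Kyverno releases write one report per resource and identify it in .scope;
# older reports list it under results[].resources, so fall back between the two.
kubectl get policyreport -A -o json \
| jq -r '.items[] | .scope as $s | .results[]? | select(.result=="fail")
| ((.resources // [])[0] // $s) as $res
| "\(.policy)/\(.rule)\t\($res.namespace)/\($res.name)"' \
| sort | uniq -c | sort -rn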
Once the failures are down to known exemptions, flip each policy to enforcing:
kubectl patch clusterpolicy require-signed-images \
--type merge -p '{"spec":{"validationFailureAction":"Enforce"}}'
# repeat for the verifyImages rule's own failureAction: Enforce
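The rule-level change mentioned in the comment above cannot be expressed as a simple merge patch, because rules and verifyImages are lists. One way to sketch it is a JSON Patch that addresses the entries by index. This assumes the verifyImages entry sits at rules[0].verifyImages[0], as in the step-3 policy; the file name and the function name are hypothetical.

```shell
set -eu

# JSON Patch (RFC 6902). "add" on an existing object member replaces its value,
# so both operations work whether or not the field is already present.
# Assumption: the entry to change is /spec/rules/0/verifyImages/0.
cat > enforce-verify-images.json <<'EOF'
[
  { "op": "add", "path": "/spec/rules/0/verifyImages/0/failureAction", "value": "Enforce" },
  { "op": "add", "path": "/spec/rules/0/verifyImages/0/mutateDigest", "value": true }
]
EOF

# Wrapped in a function so nothing touches the cluster until you call it.
enforce_signed_images() {
  kubectl patch clusterpolicy require-signed-images \
    --type json --patch-file enforce-verify-images.json
}
```

Call enforce_signed_images once the Audit reports are clean. If the policies are delivered through Argo CD, the same two field changes belong in the Git copy of policies/verify-images.yaml instead, otherwise the next sync reverts the patch.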
For high-risk control-plane namespaces, keep failurePolicy: Fail; for application namespaces during rollout, Ignore avoids an outage if Kyverno blips. Make that choice consciously, per policy.
Validation
Prove the gate works with a deliberately bad pod — every one of these must be rejected once policies are enforcing:
# 1. Unsigned image under our registry -> blocked by verifyImages
#    (the rule only matches ghcr.io/kloudvin/*, so an image from another registry is
#    not checked by it; unsigned-test is a placeholder for any image pushed unsigned)
kubectl run bad-unsigned --image=ghcr.io/kloudvin/unsigned-test:latest
# Expect a denial that names require-signed-images
# 2. Signed image with NO limits -> defaulted by mutate, or denied if null
#    (kubectl run sets no securityContext, so run this before step 6 is enforcing;
#    afterwards the dry run is denied by pod-security-restricted instead)
kubectl run noreq --image=ghcr.io/kloudvin/api@<digest> --dry-run=server -o yaml \
| grep -A4 resources # see injected requests/limits
# 3. Root / privileged pod -> blocked by restricted profile
kubectl run rooty --image=ghcr.io/kloudvin/api@<digest> \
--privileged --dry-run=server
# Error: ... pod-security-restricted: privileged containers are not allowed
# 4. A correctly signed, limited, non-root pod -> ADMITTED
kubectl apply -f tests/good-pod.yaml # should succeed
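The tests/good-pod.yaml fixture used in check 4 is not shown in this guide. A minimal sketch of what it might contain follows, written as a heredoc so it can be pasted into a shell. The container name, the UID 10001, and the resource values are assumptions (the image must be able to run as a non-root user), and the digest placeholder has to be replaced with the digest you signed.

```shell
set -eu
mkdir -p tests

# Hypothetical fixture: a pod that should satisfy all three controls.
cat > tests/good-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: good-pod
spec:
  securityContext:
    runAsNonRoot: true            # restricted profile: no root
    runAsUser: 10001              # assumed UID; must exist or be usable in the image
    seccompProfile:
      type: RuntimeDefault        # restricted profile: seccomp required
  containers:
    - name: api
      image: ghcr.io/kloudvin/api@<digest>   # substitute the signed digest
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
      resources:                  # explicit, so the step-4 defaults are not needed
        requests: { cpu: "100m", memory: "128Mi" }
        limits: { cpu: "500m", memory: "512Mi" }
EOF
```

Setting the requests and limits explicitly keeps the fixture independent of the mutate policy, so a failure here points at signature or Pod Security checks rather than at defaulting.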
Confirm the digest rewrite actually happened on the admitted pod:
kubectl get pod good-pod -o jsonpath='{.spec.containers[0].image}'; echo
# Expect ghcr.io/kloudvin/api@sha256:... (a digest, not a tag)
Run Kyverno’s own test harness in CI so policy changes are unit-tested before they ship via Argo CD:
kyverno test ./policies/ # asserts expected pass/fail per fixture
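The kyverno test command looks for a kyverno-test.yaml manifest under the directory it is given. A minimal sketch of such a manifest is below; it is placed in a hypothetical policies/tests/ subdirectory so that a non-recursive kubectl apply -f policies/ does not try to apply it. The test name and relative paths are assumptions, it expects a tests/good-pod.yaml fixture at the repository root, and the signature policy is left out because verifying it needs registry access.

```shell
set -eu
mkdir -p policies/tests

# Hypothetical Kyverno CLI test manifest; paths are relative to this file.
cat > policies/tests/kyverno-test.yaml <<'EOF'
apiVersion: cli.kyverno.io/v1alpha1
kind: Test
metadata:
  name: admission-gate-tests
policies:
  - ../require-limits.yaml
  - ../pod-security-restricted.yaml
resources:
  - ../../tests/good-pod.yaml
results:
  - policy: require-resource-limits
    rule: require-cpu-mem-limits
    kind: Pod
    resources: [good-pod]
    result: pass
  - policy: pod-security-restricted
    rule: restricted-profile
    kind: Pod
    resources: [good-pod]
    result: pass
EOF
```

Adding a deliberately bad fixture with result: fail for each rule is the natural next step, so a policy that silently stops matching is caught in CI as well.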
Rollback / teardown
Policies are declarative, so rollback is fast — switch back to Audit first if a policy is over-blocking in production, then remove if needed:
# Soft rollback: stop denying, keep reporting
for p in require-signed-images require-resource-limits pod-security-restricted; do
kubectl patch cpol "$p" --type merge -p '{"spec":{"validationFailureAction":"Audit"}}'
done
# Remove a single policy
kubectl delete clusterpolicy pod-security-restricted
# Full uninstall (also removes both webhooks, so admission stops gating)
helm uninstall kyverno -n kyverno
kubectl delete ns kyverno
If you delivered policies via Argo CD, do the rollback in Git (revert the commit) and let the sync remove them — never kubectl delete out of band, or Argo will flag drift and may re-create them.
Common pitfalls
- Cluster wedge from failurePolicy: Fail. If Kyverno is unreachable and a policy fails closed, new pods (including Kyverno’s own on a cold start) can be blocked. Always exclude kube-system and kyverno, and keep app namespaces on Ignore during rollout.
- verifyImages can’t background-scan. Signature rules only run at admission (background: false); existing pods are not retro-verified. Re-roll deployments after enabling, or pods admitted earlier keep running unsigned images.
- Identity string mismatch. Keyless subject/issuer must match the certificate exactly; a wrong issuer URL or a stray branch in the subject regex silently fails every verify. Confirm with cosign verify first (step 2).
- Forgetting to pin digests. Without mutateDigest: true, an attacker re-pushing a verified tag bypasses the check. Always rewrite to a digest.
- Mutate vs validate ordering surprises. Defaults from a mutate rule appear before your validate rule runs; test the combined result with --dry-run=server, not the policies in isolation.
- initContainers and ephemeralContainers. A foreach over spec.containers misses init/ephemeral containers; add them explicitly or attackers hide privileged work there.
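For the last pitfall, one way to widen the step-5 limits rule is a JMESPath multiselect that flattens init containers and regular containers into a single list. The sketch below writes a variant of the policy to a hypothetical drafts/ path for review. Ephemeral containers are left out on purpose, since the Kubernetes API does not allow resources to be set on them; for security-context checks they would need to be included in the list.

```shell
set -eu
mkdir -p drafts

# Hypothetical variant of policies/require-limits.yaml covering init containers too.
cat > drafts/require-limits-all-containers.yaml <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-cpu-mem-limits
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "CPU and memory limits are required on every container and init container."
        foreach:
          # Multiselect + flatten: a missing initContainers field evaluates to null
          # and is dropped, so pods without init containers still evaluate cleanly.
          - list: "request.object.spec.[initContainers, containers][]"
            deny:
              conditions:
                any:
                  - key: "{{ element.resources.limits.memory || '' }}"
                    operator: Equals
                    value: ""
                  - key: "{{ element.resources.limits.cpu || '' }}"
                    operator: Equals
                    value: ""
EOF
```

The step-4 mutate rule only defaults spec.containers, so with this variant an init container that omits limits is denied rather than defaulted unless the mutate rule is widened in the same way.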
Security notes
This is a Zero-Trust admission control: the cluster trusts no image it cannot cryptographically tie to your CI identity, runs nothing as root, and pins every workload to a signed digest. Keep the Cosign private key in HashiCorp Vault or use keyless signing so there is no long-lived secret to steal; rotate the key and update the policy’s public key together. Feed every PolicyReport to Wiz (to correlate admission posture with cloud misconfig and attack paths) and your SIEM, and let a hard denial open a ServiceNow incident so security gets a ticket, not just a log line. Remember the boundary: Kyverno gates admission — CrowdStrike Falcon on the nodes covers runtime (a compromise after a pod is admitted), and the two together close the gap.
Cost notes
Kyverno’s own footprint is small — the HA controllers run comfortably in roughly 0.5 vCPU / 512Mi per replica, a rounding error against the workloads they protect. The real saving is indirect: the step-4/5 resource-limit policies stop unbounded pods from triggering node autoscale events and cluster overprovisioning, which is usually a far larger line item than the controller. Watch one operational cost — verifyImages adds a Rekor/registry round-trip per new image, so size webhookTimeoutSeconds (step 3) generously and run an in-cluster registry mirror if your image pull volume is high, both to cut latency and to avoid public-registry rate limits.