Every cluster accumulates rules that live in tribal memory: images must come from our registry, every Pod needs resource limits, no latest tags, each namespace gets a default-deny NetworkPolicy. Documented in a wiki, enforced by hope. Policy-as-code moves those rules into the admission path so they are evaluated on every kubectl apply, every Argo sync, every Helm upgrade – before the object is persisted. Kyverno is the policy engine I reach for first because its policies are Kubernetes resources written in YAML, not a separate language, and because it covers ground Gatekeeper does not: it generates resources and treats mutation as a first-class verb, rather than only validating. This is the end-to-end workflow, from a first validate rule to verifying cosign signatures inline.
1. Kyverno vs OPA/Gatekeeper, and the admission flow
Both Kyverno and OPA/Gatekeeper run as ValidatingWebhookConfiguration (and Kyverno also MutatingWebhookConfiguration) targets that the API server calls during admission. The difference is the authoring model and the verb coverage.
| Dimension | Kyverno | OPA/Gatekeeper |
|---|---|---|
| Policy language | Kubernetes YAML, overlay/pattern style | Rego (separate DSL) |
| Validate | Yes | Yes |
| Mutate | Yes | Limited (Assign/ModifySet) |
| Generate downstream resources | Yes | No |
| Image signature verification | Built in (verifyImages) | Requires external integration |
| Mental model | “Looks like the resource it governs” | General-purpose policy engine |
The admission flow is the same shape for both. The API server authenticates and authorizes the request, runs mutating webhooks (Kyverno injects defaults here), persists nothing yet, runs validating webhooks (Kyverno enforces here), and only then writes to etcd. Order matters: a mutate rule that adds runAsNonRoot: true runs before a validate rule that requires it, so a single Kyverno install can both fix and gate the same field.
Install with the official Helm chart. Pin the chart version in production.
helm repo add kyverno https://kyverno.github.io/kyverno/
helm repo update
helm install kyverno kyverno/kyverno \
  --namespace kyverno --create-namespace \
  --version 3.2.6 \
  --set admissionController.replicas=3 \
  --set backgroundController.replicas=2
kubectl -n kyverno get pods
kubectl get crd | grep kyverno.io
Since Kyverno 1.10 the controllers are split – admission, background, cleanup, and reports controllers run as separate Deployments. Run at least 3 admission replicas so a node drain never leaves the webhook unbacked; with failurePolicy: Fail, an unbacked webhook causes the API server to reject every request for the resources it matches.
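The same sizing can live in a values file instead of --set flags, which also gives a place to turn on a PodDisruptionBudget for the admission controller. This is a minimal sketch, assuming your pinned chart exposes admissionController.podDisruptionBudget (confirm with helm show values kyverno/kyverno); the file name is hypothetical.

```yaml
# values-kyverno.yaml (hypothetical file name)
admissionController:
  replicas: 3
  podDisruptionBudget:
    enabled: true     # assumption: this key exists in your pinned chart version
    minAvailable: 2   # keep at least two webhook backends up during a drain
backgroundController:
  replicas: 2
```

Pass it with -f values-kyverno.yaml on the helm install or helm upgrade command so the settings are reviewed like any other change.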
2. Writing validate rules with patterns, anchors, and good messages
A ClusterPolicy contains rules; each rule has a match block and exactly one of validate, mutate, generate, or verifyImages. The most common validate style is pattern: you write a fragment that the resource must match. This policy requires CPU and memory limits on every container.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: require-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required on every container."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
The anchors are the part people get wrong. ?* means “any non-empty value.” Inside a list, the default behavior makes the pattern apply to every element – so the rule above requires limits on all containers, not just the first. Other anchors:
- =(field) (equality anchor) – the field is optional, but if present it must match. Useful for fields that may be absent.
- (field) (conditional anchor) – “if this matches, then the sibling keys must also match.” Drives conditional logic.
- X(field) (negation anchor) – the field must not exist.
- +(field) (add anchor) – mutate-only; adds the field if absent.
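To make the every-element behavior concrete, here is a hypothetical Pod (the name and images are illustrative, not from any real deployment) that the require-limits pattern should reject: the first container satisfies the pattern, the second does not.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limits-demo                          # hypothetical name
spec:
  containers:
    - name: app
      image: registry.internal/web:1.4.2     # illustrative image
      resources:
        limits:
          cpu: "500m"
          memory: "256Mi"
    - name: sidecar
      image: registry.internal/proxy:0.9.0   # illustrative image
      resources:
        limits:
          memory: "64Mi"                     # cpu limit missing, so this element fails the pattern
```

Because the list pattern is applied to each container, one incomplete element is enough to fail the whole Pod.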
A conditional example using the equality anchor: whenever a container sets a securityContext, privileged must be false if it is specified, and readOnlyRootFilesystem must be true.
validate:
  message: "Non-privileged containers must set readOnlyRootFilesystem: true."
  pattern:
    spec:
      containers:
        - =(securityContext):
            =(privileged): "false"
            readOnlyRootFilesystem: true
For logic that patterns cannot express, use deny with conditions. This blocks the :latest tag and untagged images using a JMESPath query over the parsed image references of the Pod's containers.
validate:
  message: "Using a mutable ':latest' or untagged image is not allowed."
  deny:
    conditions:
      any:
        - key: "{{ images.containers.*.tag }}"
          operator: AnyIn
          value:
            - "latest"
        - key: "{{ images.containers.*.tag || '' }}"
          operator: AnyIn
          value:
            - ""
Failure messages are a UX surface. The message is what a developer sees when their deploy is rejected. “validation error: rule require-limits failed” plus a clear sentence beats a wall of YAML. Write the message as an instruction, not a complaint.
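Messages can also interpolate request variables, which helps a developer see exactly which object was rejected. A minimal sketch of an instruction-style message for the limits rule; the wording is illustrative, and it assumes the standard request.object and request.namespace variables are available to the rule.

```yaml
validate:
  # Name the offending object, then state what to change
  message: >-
    {{ request.object.metadata.name }} in namespace {{ request.namespace }}
    is missing CPU or memory limits. Add resources.limits.cpu and
    resources.limits.memory to every container.
  pattern:
    spec:
      containers:
        - resources:
            limits:
              memory: "?*"
              cpu: "?*"
```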
3. Mutate rules: inject defaults, labels, and sidecars
Mutation is where Kyverno pulls ahead of Gatekeeper. The most useful pattern is defaulting – supplying a sane value so the validate rule never has to fail. Here we set imagePullPolicy: IfNotPresent and a default runAsNonRoot when they are missing, using the add anchor +.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-security-context
spec:
  rules:
    - name: default-run-as-non-root
      match:
        any:
          - resources:
              kinds: [Pod]
      mutate:
        patchStrategicMerge:
          spec:
            securityContext:
              # Anchor only the leaf so runAsNonRoot is defaulted even when
              # securityContext already exists with other fields set
              +(runAsNonRoot): true
            containers:
              - (name): "?*"
                +(imagePullPolicy): IfNotPresent
The (name): "?*" is a conditional anchor that matches every container by name, then + adds the field only if absent – so we never clobber an explicit setting. Strategic merge respects list merge keys, which is why we anchor on name.
For injecting a sidecar or any structural change, patchesJson6902 (JSON Patch, RFC 6902) gives precise control:
mutate:
  patchesJson6902: |-
    - op: add
      path: "/spec/containers/-"
      value:
        name: logging-sidecar
        image: registry.internal/fluent-bit:2.2
        resources:
          limits: { cpu: "100m", memory: "128Mi" }
For resources that already exist, mutateExistingOnPolicyUpdate plus a targets block lets a policy retroactively patch them when the policy changes – handy for rolling a new label across live namespaces without re-applying every manifest. Use it sparingly; it generates write load against the API server.
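As a rough illustration of that shape, the sketch below labels existing Namespaces when the policy is created or updated. It is untested and rests on several assumptions: the policy name and label key are hypothetical, the field placement follows the spec-level mutateExistingOnPolicyUpdate form (newer Kyverno releases may prefer a rule-level setting), and the background controller's ServiceAccount has been granted permission to update Namespaces.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: label-existing-namespaces       # hypothetical name
spec:
  mutateExistingOnPolicyUpdate: true    # also evaluate existing triggers when the policy changes
  rules:
    - name: add-managed-label
      match:
        any:
          - resources:
              kinds: [Namespace]
      mutate:
        targets:                        # the existing objects to patch
          - apiVersion: v1
            kind: Namespace
            name: "{{ request.object.metadata.name }}"
        patchStrategicMerge:
          metadata:
            labels:
              +(platform.acme.io/managed): "true"   # hypothetical label key
```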
4. Generate rules: auto-create NetworkPolicies and ConfigMaps per namespace
generate rules create downstream resources in response to a trigger. The canonical use is a default-deny NetworkPolicy in every new namespace, so security posture is correct by construction rather than by a checklist.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-deny-netpol
spec:
  rules:
    - name: deny-all-ingress
      match:
        any:
          - resources:
              kinds: [Namespace]
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{ request.object.metadata.name }}"
        synchronize: true
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
Two flags carry the weight. synchronize: true means Kyverno reconciles the generated resource – if someone deletes or edits the NetworkPolicy, the background controller restores it from the policy definition. data embeds the resource inline; the alternative, clone, copies from an existing source object, which is the right choice for distributing a shared registry pull-secret or CA bundle into every namespace:
generate:
  apiVersion: v1
  kind: Secret
  name: regcred
  namespace: "{{ request.object.metadata.name }}"
  synchronize: true
  clone:
    namespace: platform
    name: regcred
Generate rules need RBAC. The background controller can only create what its ServiceAccount is permitted to, so to generate NetworkPolicies you must grant the kyverno:background-controller an aggregated ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kyverno:generate-netpol
  labels:
    app.kubernetes.io/component: background-controller
    rbac.kyverno.io/aggregate-to-background-controller: "true"
rules:
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["create", "update", "delete", "get", "list", "watch"]
The rbac.kyverno.io/aggregate-to-background-controller label is what wires this into Kyverno’s role. Forget it and your generate rule silently produces no resources – check kubectl describe clusterpolicy events and the background controller logs first when generation appears to do nothing.
5. Image verification: enforcing cosign signatures and attestations inline
This is the supply-chain payoff. The verifyImages rule blocks any image that does not carry a valid cosign signature, evaluated at admission against the registry’s OCI signature artifact. Keyless verification (Fulcio/Rekor) keys on the signing identity rather than a static public key:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  failurePolicy: Fail
  rules:
    - name: verify-signed-by-ci
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "registry.internal/*"
          mutateDigest: true
          verifyDigest: true
          attestors:
            - count: 1
              entries:
                - keyless:
                    subject: "https://github.com/acme/*"
                    issuer: "https://token.actions.githubusercontent.com"
                    rekor:
                      url: https://rekor.sigstore.dev
Three behaviors are worth calling out. mutateDigest: true rewrites the verified tag to its immutable digest in the admitted Pod spec, closing the tag-mutability gap – the image that was verified is provably the image that runs. verifyDigest: true requires the admitted image reference to carry a digest, so a tag-only reference is rejected when it could not be resolved and rewritten. And count sets how many of the listed attestor entries must verify: with several entries and a higher count you can require m-of-n signers, e.g. CI plus a release-approver identity.
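A sketch of the m-of-n case, requiring both the CI identity and a second signer. The release-approver subject and its issuer are hypothetical placeholders; substitute whatever identity your approvers actually sign with.

```yaml
attestors:
  - count: 2   # both entries below must verify (2-of-2)
    entries:
      - keyless:
          subject: "https://github.com/acme/*"
          issuer: "https://token.actions.githubusercontent.com"
          rekor:
            url: https://rekor.sigstore.dev
      - keyless:
          subject: "release-approver@acme.example"   # hypothetical identity
          issuer: "https://accounts.google.com"      # assumption: approver signs via this OIDC issuer
          rekor:
            url: https://rekor.sigstore.dev
```

Lowering count to 1 with the same two entries turns this into an either-signer policy.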
For static keys, swap keyless for keys:
- keys:
    publicKeys: |-
      -----BEGIN PUBLIC KEY-----
      MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
      -----END PUBLIC KEY-----
    rekor:
      url: https://rekor.sigstore.dev
You can go further and require attestations – demand that an SBOM or SLSA provenance predicate exists and satisfies a condition, for example that the build came from the expected source repository:
verifyImages:
  - imageReferences: ["registry.internal/*"]
    attestations:
      - type: https://slsa.dev/provenance/v1
        attestors:
          - count: 1
            entries:
              - keyless:
                  issuer: "https://token.actions.githubusercontent.com"
                  subject: "https://github.com/acme/*"
        conditions:
          - all:
              - key: "{{ buildDefinition.externalParameters.workflow.repository }}"
                operator: Equals
                value: "https://github.com/acme/payments"
This couples admission to provenance: not just “is it signed” but “was it built from the repo we expect, on the runner we expect.” That is the difference between supply-chain theater and supply-chain control.
6. Policy reporting, background scans, and audit vs enforce
validationFailureAction is the single most important field for safe rollout. Audit allows the resource and records a result; Enforce rejects it. Always start in Audit.
In Audit mode (and for background: true policies generally), Kyverno’s reports controller writes results to PolicyReport (namespaced) and ClusterPolicyReport objects, and re-scans existing resources on a schedule – so you see which already-running workloads would fail before you flip to Enforce.
# Per-namespace pass/fail/warn/error tallies
kubectl get policyreport -A
# Drill into one namespace's failing results
kubectl get policyreport -n payments -o yaml \
  | yq '.items[].results[] | select(.result == "fail")'
# Cluster-scoped resources (namespaces, nodes, CRDs)
kubectl get clusterpolicyreport
Background scanning is what makes Audit useful: policies are evaluated against resources that already exist in the cluster on a periodic scan, not only on admission. So a policy you apply today reports your historical debt after the next scan, not just your future compliance. Watch the fail counts trend toward zero, then promote.
7. Testing policies with the Kyverno CLI and wiring into CI
Never let a cluster be the first place a policy meets a manifest. The kyverno CLI applies policies to resource files offline and kyverno test runs a declarative test suite with expected results.
# Apply a policy to a manifest and print the verdict
kyverno apply policies/require-limits.yaml \
--resource manifests/deployment.yaml
# Declarative test suite defined in kyverno-test.yaml
kyverno test .
The test file pins expected outcomes so a policy change that flips a verdict fails CI:
apiVersion: cli.kyverno.io/v1alpha1
kind: Test
metadata:
  name: limits-suite
policies:
  - policies/require-limits.yaml
resources:
  - manifests/deployment.yaml
results:
  - policy: require-resource-limits
    rule: require-limits
    resource: web
    kind: Deployment
    result: fail
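For the suite above to be meaningful, the referenced manifest has to be a deliberately non-compliant fixture. A minimal sketch of what manifests/deployment.yaml might contain; the image and labels are illustrative. The policy matches Pods, and it is Kyverno's rule auto-generation for Pod controllers that lets the Deployment's Pod template be evaluated.

```yaml
# manifests/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.internal/web:1.4.2   # illustrative image
          # no resources.limits here, so require-limits is expected to fail
```

Keeping one passing and one failing fixture per rule makes a verdict flip in either direction visible in CI.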
Wire it into the pipeline so policies are unit-tested like any other code, and so application manifests are linted against the live policy set before merge:
# .github/workflows/policy.yaml
name: kyverno-policy
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Kyverno CLI
        run: |
          curl -sL https://github.com/kyverno/kyverno/releases/download/v1.12.0/kyverno-cli_v1.12.0_linux_x86_64.tar.gz \
            | tar -xz kyverno
          sudo mv kyverno /usr/local/bin/
      - name: Run policy tests
        run: kyverno test .
      - name: Lint app manifests against policies
        run: kyverno apply policies/ --resource manifests/ --warn-exit-code 0
Running the same policies in CI that run in the cluster collapses the feedback loop from “deploy rejected at 2am” to “PR check failed in 30 seconds.”
8. Performance, failurePolicy, and safe rollout across many namespaces
failurePolicy decides what happens when the webhook itself is unreachable. Fail (Kyverno's default, and the right choice for security-critical policies) means the API server rejects the request if Kyverno cannot answer – correct for image verification, dangerous if Kyverno is undersized. Ignore fails open. The honest engineering position: run image-verify policies as Fail and run enough admission replicas with a PodDisruptionBudget that the webhook is never down.
spec:
  failurePolicy: Fail
  webhookTimeoutSeconds: 15 # cap latency added to every matched request
  rules:
    - name: verify
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaceSelector:
                matchExpressions:
                  - key: kyverno.io/enforce
                    operator: In
                    values: ["true"]
Three rollout levers keep this safe at scale:
- Scope the policy. A namespaceSelector on the match means the rule only applies to opted-in namespaces. Roll out by labeling namespaces, not by editing the policy.
- Exclude system namespaces. Always exclude kube-system and kyverno from broad Pod policies, or a Kyverno restart can deadlock on its own webhook.
- Tune the timeout. Image verification reaches out to a registry and Rekor; a slow registry plus failurePolicy: Fail is an outage. Set webhookTimeoutSeconds deliberately and monitor admission latency.
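Opting a namespace in is then a one-line metadata change. A minimal sketch using the kyverno.io/enforce label key from the selector above; the namespace name is illustrative.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments               # illustrative namespace
  labels:
    kyverno.io/enforce: "true" # matches the namespaceSelector on the rule
```

Removing the label is the rollback, which keeps a bad promotion confined to the namespace that was just flipped.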
Exclude protected namespaces directly in the match block:
match:
  any:
    - resources:
        kinds: [Pod]
exclude:
  any:
    - resources:
        namespaces: ["kube-system", "kyverno", "kube-node-lease"]
Enterprise scenario
A fintech platform team I worked with had a hard control from their auditors: production workloads must run only images signed by the central CI identity, and every Pod must carry the owning team’s cost-center label. They had ~140 namespaces across three clusters and could not afford a big-bang cutover that risked blocking deploys org-wide.
The constraint that bit them was blast radius. Their first attempt enforced verifyImages cluster-wide on day one with failurePolicy: Fail. Within an hour a registry hiccup combined with the fail-closed webhook blocked every deploy across all 140 namespaces – including the platform team’s own fix. They rolled it back and rebuilt the rollout as a labeled opt-in.
The fix had three parts. First, a mutate rule defaulted the cost-center label from an existing namespace annotation, so teams that had set it once on the namespace never had to repeat it on every Pod – eliminating the most common validate failure before it happened. Second, the verifyImages policy was scoped by namespaceSelector so only namespaces labeled verify=enforce were gated; everything else ran in Audit and reported via PolicyReport. Third, they watched the cluster reports until failures hit zero per namespace, then flipped that namespace’s label.
- name: default-cost-center-from-ns
  match:
    any:
      - resources:
          kinds: [Pod]
  context:
    - name: ns
      apiCall:
        urlPath: "/api/v1/namespaces/{{ request.namespace }}"
  mutate:
    patchStrategicMerge:
      metadata:
        labels:
          +(cost-center): "{{ ns.metadata.annotations.\"acme.io/cost-center\" || 'unassigned' }}"
The context.apiCall fetches the namespace object at admission so the rule can read its annotation; the + add anchor defaults the label only when absent. Per-namespace promotion meant the blast radius of any mistake was one team, not the company. Full enforcement across all 140 namespaces took three weeks of label flips, with zero deploy-blocking incidents after the rebuild.
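For reference, the namespace side of that arrangement might look like the sketch below. The annotation key and the verify=enforce label come from the scenario; the namespace name and cost-center value are hypothetical.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments                      # hypothetical namespace
  labels:
    verify: enforce                   # opts this namespace into the verifyImages gate
  annotations:
    acme.io/cost-center: "cc-1234"    # hypothetical value, read by the mutate rule
```

A Pod created here without a cost-center label would be admitted with cost-center: cc-1234, while a namespace lacking the annotation would yield the fallback value unassigned.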
Verify
Confirm the whole pipeline end to end before declaring a policy live.
# 1. Policies are loaded and Ready
kubectl get clusterpolicy
# READY column should be true for each
# 2. A non-compliant Pod is rejected under Enforce
kubectl run bad --image=registry.internal/web:latest --dry-run=server
# Expect: admission webhook denied (latest tag / unsigned)
# 3. A mutate rule actually patched the object
kubectl get pod good -o jsonpath='{.spec.securityContext.runAsNonRoot}'
# Expect: true
# 4. Generated NetworkPolicy exists in a fresh namespace
kubectl create namespace probe
kubectl get networkpolicy -n probe default-deny
# Expect: the synchronized default-deny NetworkPolicy
# 5. Image verification rewrote the tag to a digest
kubectl get pod signed -o jsonpath='{.spec.containers[0].image}'
# Expect: registry.internal/web@sha256:...
# 6. Reports show the historical picture
kubectl get policyreport -A
If step 4 shows nothing, check background-controller RBAC and logs. If step 5 keeps the tag, mutateDigest is off or the digest could not be resolved.