An operator is not “a controller plus a CRD.” It is a promise: declare desired state in a custom resource, and the operator continuously drives the world toward it — surviving restarts, partial failures, and concurrent edits. This guide builds one for real with Kubebuilder, then hardens the parts that bite you in production: idempotency, finalizers, webhooks, and tests.
We’ll model a Cache resource that provisions a Redis-style Deployment plus Service, and (to make finalizers meaningful) registers itself with a hypothetical external metadata service on create and deregisters on delete.
1. Operator pattern fundamentals
The whole design rests on one idea: level-triggered reconciliation, not edge-triggered event handling.
Edge-triggered thinking (“on create, do X; on update, do Y”) is a trap. Events get coalesced, dropped on restart, and delivered out of order. Instead, your Reconcile function receives only a name and namespace, fetches current desired state, observes current actual state, and computes the diff. It must produce the same result whether it’s the first call or the thousandth.
            +-----------------------+
desired --> |    Reconcile(req)     | --> actual cluster state
 (spec)     | observe -> diff ->    |
            | act -> update status  |
            +-----------------------+
                 ^           |
                 +-----------+  re-queued on change, error, or interval
Three rules follow directly:
- Idempotent. Running twice with no spec change must be a no-op (modulo status).
- Self-correcting. If someone deletes the managed Deployment, the next reconcile recreates it.
- No assumptions about call cause. You never know why you were called. Always re-fetch.
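The three rules above can be sketched without any Kubernetes machinery. This is a toy model under stated assumptions: State, Reconcile, and Apply are hypothetical names, and a map of replica counts stands in for both the spec and the cluster. What matters is that Reconcile only compares desired with actual and never asks which event triggered it.

```go
package main

import (
	"fmt"
	"sort"
)

// State maps object name -> replica count. It stands in for both the
// desired state (spec) and the observed cluster state in this toy model.
type State map[string]int

// Reconcile compares desired with actual and returns the actions needed to
// converge. It has no notion of "create event" or "update event".
func Reconcile(desired, actual State) []string {
	var actions []string
	for name, want := range desired {
		got, ok := actual[name]
		switch {
		case !ok:
			actions = append(actions, fmt.Sprintf("create %s replicas=%d", name, want))
		case got != want:
			actions = append(actions, fmt.Sprintf("scale %s %d->%d", name, got, want))
		}
	}
	for name := range actual {
		if _, ok := desired[name]; !ok {
			actions = append(actions, "delete "+name)
		}
	}
	sort.Strings(actions) // deterministic order regardless of map iteration
	return actions
}

// Apply carries out the actions by making actual match desired.
func Apply(desired, actual State) {
	for name := range actual {
		if _, ok := desired[name]; !ok {
			delete(actual, name)
		}
	}
	for name, want := range desired {
		actual[name] = want
	}
}

func main() {
	desired := State{"cache-a": 3}
	actual := State{}

	fmt.Println(Reconcile(desired, actual)) // first pass: work to do
	Apply(desired, actual)
	fmt.Println(Reconcile(desired, actual)) // second pass: nothing to do

	delete(actual, "cache-a")               // someone deletes the managed object
	fmt.Println(Reconcile(desired, actual)) // drift is detected and corrected
}
```

The second call returning no actions is the idempotency rule; the third call recreating the object is the self-correction rule.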
2. Scaffold with Kubebuilder and design the CRD
Initialize the project and create the API. Pick a domain you own; the group/version/kind become your API surface.
mkdir cache-operator && cd cache-operator
go mod init github.com/acme/cache-operator
# Scaffold project layout, Makefile, manager main.go
kubebuilder init --domain acme.io --repo github.com/acme/cache-operator
# Create the API: group cache, version v1alpha1, kind Cache
kubebuilder create api --group cache --version v1alpha1 --kind Cache \
--resource --controller
Now define the type. The key design decisions live in the marker comments above the struct, not in the fields themselves.
// api/v1alpha1/cache_types.go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type CacheSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=9
	// +kubebuilder:default=1
	Replicas int32 `json:"replicas"`

	// +kubebuilder:validation:Required
	Image string `json:"image"`

	// +optional
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
}

type CacheStatus struct {
	// ObservedGeneration is the .metadata.generation the status reflects.
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
	ReadyReplicas      int32 `json:"readyReplicas,omitempty"`

	// +listType=map
	// +listMapKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
type Cache struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   CacheSpec   `json:"spec,omitempty"`
	Status CacheStatus `json:"status,omitempty"`
}
The +kubebuilder:subresource:status marker is load-bearing. It puts /status on its own endpoint, so a status write cannot accidentally clobber spec, and a status write does not bump metadata.generation (only spec changes do). That makes status.observedGeneration vs metadata.generation a reliable “have I caught up?” signal.
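The “have I caught up?” check can be modelled in a few lines. The sketch below is a toy, not API-server code: Object, UpdateSpec, and UpdateStatus are hypothetical names, and it assumes the behaviour the status subresource gives you, namely that spec changes bump the generation and status writes do not.

```go
package main

import "fmt"

// Object is a toy stand-in for a custom resource with a status subresource.
type Object struct {
	Generation         int64 // stands in for metadata.generation
	Replicas           int32 // stands in for spec.replicas
	ObservedGeneration int64 // stands in for status.observedGeneration
	ReadyReplicas      int32 // stands in for status.readyReplicas
}

// UpdateSpec models a write to the main endpoint: a real spec change bumps
// the generation.
func (o *Object) UpdateSpec(replicas int32) {
	if o.Replicas != replicas {
		o.Replicas = replicas
		o.Generation++
	}
}

// UpdateStatus models the controller's write to /status: it records which
// generation it acted on and leaves the generation itself untouched.
func (o *Object) UpdateStatus(ready int32) {
	o.ReadyReplicas = ready
	o.ObservedGeneration = o.Generation
}

// CaughtUp reports whether status reflects the latest spec.
func (o *Object) CaughtUp() bool {
	return o.ObservedGeneration == o.Generation
}

func main() {
	obj := &Object{Generation: 1, Replicas: 1}
	obj.UpdateStatus(1)
	fmt.Println("after first reconcile:", obj.CaughtUp())
	obj.UpdateSpec(3)
	fmt.Println("after spec edit:", obj.CaughtUp())
	obj.UpdateStatus(3)
	fmt.Println("after next reconcile:", obj.CaughtUp())
}
```

Clients that wait on a condition should check this equality first; otherwise they may read a condition that describes the previous spec.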
Regenerate the deepcopy code and CRD manifests after every type change:
make generate # runs controller-gen object — DeepCopy methods
make manifests # runs controller-gen crd,rbac,webhook -> config/crd, config/rbac
3. Implement an idempotent, level-triggered Reconcile
The reconcile body is a fixed skeleton. Memorize this shape; deviating from it is how bugs get in.
func (r *CacheReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := logf.FromContext(ctx)

	// 1. ALWAYS re-fetch. Never trust cached object state from the event.
	var cache cachev1alpha1.Cache
	if err := r.Get(ctx, req.NamespacedName, &cache); err != nil {
		// NotFound means it was deleted and finalizers (if any) already ran.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. (Finalizer + deletion handling goes here; see section 5.)

	// 3. Reconcile owned objects toward desired state, idempotently.
	dep := r.desiredDeployment(&cache)
	if err := ctrl.SetControllerReference(&cache, dep, r.Scheme); err != nil {
		return ctrl.Result{}, err
	}
	if err := r.applyDeployment(ctx, dep); err != nil {
		return ctrl.Result{}, err
	}

	// 4. Observe actual state and update status (on the status subresource).
	var live appsv1.Deployment
	if err := r.Get(ctx, client.ObjectKeyFromObject(dep), &live); err != nil {
		return ctrl.Result{}, err
	}
	cache.Status.ReadyReplicas = live.Status.ReadyReplicas
	cache.Status.ObservedGeneration = cache.Generation
	meta.SetStatusCondition(&cache.Status.Conditions, metav1.Condition{
		Type:    "Available",
		Status:  conditionStatus(live.Status.ReadyReplicas >= cache.Spec.Replicas),
		Reason:  "DeploymentReady",
		Message: fmt.Sprintf("%d/%d replicas ready", live.Status.ReadyReplicas, cache.Spec.Replicas),
	})
	if err := r.Status().Update(ctx, &cache); err != nil {
		return ctrl.Result{}, err
	}

	log.Info("reconciled", "ready", cache.Status.ReadyReplicas)
	return ctrl.Result{}, nil
}
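The Reconcile body calls a conditionStatus helper that the text does not define. One plausible shape is sketched below. To keep it compilable without Kubernetes modules, ConditionStatus here is a local stand-in (an assumption of this sketch); in the operator you would return metav1.ConditionTrue and metav1.ConditionFalse from k8s.io/apimachinery instead.

```go
package main

import "fmt"

// ConditionStatus is a local stand-in for metav1.ConditionStatus so this
// sketch compiles on its own.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// conditionStatus maps a boolean readiness check onto a condition status.
func conditionStatus(ok bool) ConditionStatus {
	if ok {
		return ConditionTrue
	}
	return ConditionFalse
}

func main() {
	readyReplicas, wantReplicas := int32(2), int32(3)
	fmt.Println("Available:", conditionStatus(readyReplicas >= wantReplicas))
}
```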
A few non-obvious rules baked into that code:
- Return the error, let the queue back off. controller-runtime requeues failed reconciles with exponential backoff automatically. You almost never need RequeueAfter for error handling; return err and stop.
- client.IgnoreNotFound turns the post-deletion reconcile into a clean no-op.
- Status is a separate write via r.Status().Update. Never mutate spec and status in the same Update call.
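To make the “let the queue back off” rule concrete, here is a minimal sketch of per-item exponential backoff. It is not the workqueue's code: backoff is a hypothetical helper, and the 5ms base and 1000s cap are assumptions chosen to resemble what I understand the client-go defaults to be, so check your version before relying on them.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values for illustration only.
const (
	base     = 5 * time.Millisecond
	maxDelay = 1000 * time.Second
)

// backoff returns the delay applied after `failures` earlier failures of the
// same item: base * 2^failures, clamped to max.
func backoff(failures int, base, max time.Duration) time.Duration {
	d := base
	if d >= max {
		return max
	}
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	for _, n := range []int{0, 1, 2, 5, 10, 30} {
		fmt.Printf("after %2d failures: wait %v\n", n, backoff(n, base, maxDelay))
	}
}
```

Because the delay grows on its own, a hand-rolled RequeueAfter on the error path usually just fights the queue.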
4. Owner references, server-side apply, and avoiding reconcile storms
ctrl.SetControllerReference stamps the Deployment with an owner reference back to the Cache. This does two things: garbage collection deletes the Deployment automatically when the Cache is removed, and your watch on owned objects works.
Wire that up in SetupWithManager so a change to the managed Deployment triggers a reconcile of the owner:
func (r *CacheReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&cachev1alpha1.Cache{}).
		Owns(&appsv1.Deployment{}). // maps owned obj -> owner via ownerRef
		Owns(&corev1.Service{}).
		Complete(r)
}
For the actual write, server-side apply beats the classic get-then-update dance. SSA declares the fields you own; other controllers and humans can own other fields without a write-write conflict, and there is no read-modify-write race:
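func (r *CacheReconciler) applyDeployment(ctx context.Context, dep *appsv1.Deployment) error {
	// Apply patches are rejected unless the body carries apiVersion and kind,
	// so desiredDeployment must populate dep.TypeMeta.
	return r.Patch(ctx, dep, client.Apply,
		client.FieldOwner("cache-operator"), // stable name keeps field ownership consistent across restarts
		client.ForceOwnership)               // take over conflicting fields from other managers
}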
The reconcile storm. The classic self-inflicted outage: your reconcile writes an object on every call, the write triggers a watch event, which triggers another reconcile, which writes again, a hot loop pinning a CPU. Server-side apply with a stable FieldOwner is inherently a no-op when nothing changed (the resourceVersion doesn’t move, so no event fires). If you must use CreateOrUpdate, mutate only inside its callback and never set timestamps, random values, or re-ordered slices.
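The difference between a stable and an unstable desired object is easy to demonstrate. The sketch below is a toy under stated assumptions: Desired, stableDesired, unstableDesired, and needsWrite are hypothetical names, and reflect.DeepEqual stands in for whatever diff the API server or your code performs.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
	"time"
)

// Desired is a toy stand-in for the object the operator applies.
type Desired struct {
	Image       string
	Replicas    int32
	Args        []string
	Annotations map[string]string
}

// stableDesired derives every field from the spec alone and sorts slices
// built from a map, whose iteration order is not guaranteed.
func stableDesired(image string, replicas int32, flags map[string]string) Desired {
	args := make([]string, 0, len(flags))
	for k, v := range flags {
		args = append(args, "--"+k+"="+v)
	}
	sort.Strings(args)
	return Desired{Image: image, Replicas: replicas, Args: args}
}

// unstableDesired adds a per-call stamp, so two passes never produce the
// same object even when the spec is unchanged.
func unstableDesired(image string, replicas int32, flags map[string]string, stamp string) Desired {
	d := stableDesired(image, replicas, flags)
	d.Annotations = map[string]string{"last-reconciled": stamp}
	return d
}

// needsWrite is the diff-before-write guard.
func needsWrite(live, desired Desired) bool {
	return !reflect.DeepEqual(live, desired)
}

func main() {
	flags := map[string]string{"maxmemory": "1gb", "appendonly": "yes"}

	a := stableDesired("redis:7", 3, flags)
	b := stableDesired("redis:7", 3, flags)
	fmt.Println("stable builder needs write:", needsWrite(a, b))

	c := unstableDesired("redis:7", 3, flags, time.Now().Format(time.RFC3339Nano))
	fmt.Println("stamped builder needs write:", needsWrite(a, c))
}
```

Anything observational, such as when the last reconcile ran, belongs in status rather than in the applied object.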
5. Finalizers for safe external cleanup
Owner references handle in-cluster GC. They do nothing for resources outside the cluster — the external registration in our example. For that you need a finalizer: a string in metadata.finalizers that blocks deletion until you remove it.
The deletion-handling block from section 3 expands to this. Order matters: add the finalizer before doing external work, run cleanup before removing the finalizer.
const finalizer = "cache.acme.io/finalizer"

// Inside Reconcile, after the Get:
if cache.DeletionTimestamp.IsZero() {
	// Not being deleted: ensure our finalizer is present.
	if !controllerutil.ContainsFinalizer(&cache, finalizer) {
		controllerutil.AddFinalizer(&cache, finalizer)
		if err := r.Update(ctx, &cache); err != nil {
			return ctrl.Result{}, err
		}
	}
} else {
	// Being deleted: run external cleanup, then drop the finalizer.
	if controllerutil.ContainsFinalizer(&cache, finalizer) {
		if err := r.deregisterExternal(ctx, &cache); err != nil {
			// Return the error: deletion stays blocked, we retry with backoff.
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(&cache, finalizer)
		if err := r.Update(ctx, &cache); err != nil {
			return ctrl.Result{}, err
		}
	}
	// Finalizer gone -> Kubernetes completes deletion. Stop here.
	return ctrl.Result{}, nil
}
Two failure modes worth internalizing:
- deregisterExternal must be idempotent. It will be retried. “Already deregistered” is success, not an error; otherwise the object is stuck in Terminating forever.
- Finalizer updates need RBAC on the finalizers subresource. Kubebuilder generates this when you mark the controller, but if you hand-edit RBAC, the update verb on caches/finalizers is mandatory.
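The idempotency requirement on cleanup comes down to how “not found” is classified. The sketch below uses a hypothetical in-memory Registry in place of the external metadata service; ErrNotRegistered and deregister are likewise assumptions for illustration, not part of any real client.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotRegistered is what the hypothetical external service returns when
// the entry does not exist.
var ErrNotRegistered = errors.New("not registered")

// Registry is an in-memory stand-in for the external metadata service.
type Registry struct {
	entries map[string]bool
}

// Delete removes an entry and fails if it is not there.
func (r *Registry) Delete(id string) error {
	if !r.entries[id] {
		return ErrNotRegistered
	}
	delete(r.entries, id)
	return nil
}

// deregister treats "already gone" as success so a retried finalizer
// converges instead of wedging the object in Terminating.
func deregister(r *Registry, id string) error {
	err := r.Delete(id)
	if errors.Is(err, ErrNotRegistered) {
		return nil
	}
	return err
}

func main() {
	reg := &Registry{entries: map[string]bool{"default/demo": true}}
	fmt.Println("first attempt:", deregister(reg, "default/demo"))
	fmt.Println("retry:", deregister(reg, "default/demo"))
}
```

The retry case is not hypothetical: if the process crashes between the external delete and the finalizer removal, the next reconcile runs the cleanup again.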
6. Validating and defaulting with admission webhooks
CRD OpenAPI validation (the +kubebuilder:validation markers) handles structural rules — ranges, required fields, enums. For cross-field logic, immutability, or environment-aware defaults, you need webhooks.
Scaffold them:
kubebuilder create webhook --group cache --version v1alpha1 --kind Cache \
--defaulting --programmatic-validation
Modern Kubebuilder generates the CustomDefaulter and CustomValidator interfaces (decoupled from the API type). The validator signatures return warnings plus an error:
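// internal/webhook/v1alpha1/cache_webhook.go
func (v *CacheCustomValidator) ValidateUpdate(
	ctx context.Context, oldObj, newObj runtime.Object,
) (admission.Warnings, error) {
	// Checked assertions: an unexpected type becomes an error, not a panic.
	oldC, ok := oldObj.(*cachev1alpha1.Cache)
	if !ok {
		return nil, fmt.Errorf("expected a Cache for oldObj, got %T", oldObj)
	}
	newC, ok := newObj.(*cachev1alpha1.Cache)
	if !ok {
		return nil, fmt.Errorf("expected a Cache for newObj, got %T", newObj)
	}

	// Enforce immutability that OpenAPI can't express.
	if oldC.Spec.Image != newC.Spec.Image {
		return nil, field.Forbidden(
			field.NewPath("spec", "image"),
			"image is immutable; delete and recreate the Cache")
	}
	return nil, nil
}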
Webhooks require TLS and a CABundle wired into the webhook configuration. In real deployments let cert-manager issue and rotate the serving cert, and use Kubebuilder’s [WEBHOOK] and [CERTMANAGER] kustomize patches rather than managing certs by hand. Always set failurePolicy deliberately: Fail (default) blocks writes if the webhook is down, which is safe but can wedge your cluster; Ignore is more available but lets invalid objects through.
Conversion webhooks become relevant the moment you ship a v1beta1. Set one version as the storage version (+kubebuilder:storageversion) and implement Hub/Convertible so the API server can round-trip objects between versions. Get the conversion functions wrong and you silently corrupt stored data — treat them like a database migration.
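Treating conversion like a migration mostly means testing that a round trip loses nothing. The sketch below is a simplified model with plain functions rather than the Hub/Convertible interfaces, and the v1beta1 shape is entirely hypothetical: it assumes a future version that renames Replicas to Size and serves as the storage version.

```go
package main

import "fmt"

// CacheV1alpha1 carries the spec fields used in this guide.
type CacheV1alpha1 struct {
	Image    string
	Replicas int32
}

// CacheV1beta1 is hypothetical: it renames Replicas to Size.
type CacheV1beta1 struct {
	Image string
	Size  int32
}

// ToHub converts the older shape to the storage (hub) shape.
func ToHub(src CacheV1alpha1) CacheV1beta1 {
	return CacheV1beta1{Image: src.Image, Size: src.Replicas}
}

// FromHub converts the hub shape back to the older shape.
func FromHub(src CacheV1beta1) CacheV1alpha1 {
	return CacheV1alpha1{Image: src.Image, Replicas: src.Size}
}

// RoundTrips reports whether converting to the hub and back preserves the
// object exactly.
func RoundTrips(in CacheV1alpha1) bool {
	return FromHub(ToHub(in)) == in
}

func main() {
	samples := []CacheV1alpha1{
		{Image: "redis:7", Replicas: 3},
		{Image: "redis:6", Replicas: 1},
	}
	for _, c := range samples {
		fmt.Printf("%+v round-trips: %v\n", c, RoundTrips(c))
	}
}
```

A field that exists in only one version breaks this property unless it is carried somewhere (commonly an annotation), which is exactly the kind of silent loss a round-trip test catches.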
7. Test with envtest, and emit events, conditions, and metrics
envtest runs a real kube-apiserver and etcd binary locally — no kubelet, no scheduler. You get genuine API-server validation, admission, and your CRD, which is exactly what controller logic depends on.
// suite_test.go (Ginkgo)
var _ = Describe("Cache controller", func() {
	It("creates an owned Deployment", func() {
		cache := &cachev1alpha1.Cache{
			ObjectMeta: metav1.ObjectMeta{Name: "c1", Namespace: "default"},
			Spec:       cachev1alpha1.CacheSpec{Image: "redis:7", Replicas: 3},
		}
		Expect(k8sClient.Create(ctx, cache)).To(Succeed())

		// Reconcile is async: poll with Eventually, never sleep.
		key := types.NamespacedName{Name: "c1", Namespace: "default"}
		Eventually(func() error {
			var dep appsv1.Deployment
			return k8sClient.Get(ctx, key, &dep)
		}, "10s", "200ms").Should(Succeed())
	})
})
# Downloads pinned apiserver/etcd binaries and runs the suite
make test
For observability, three signals matter and each has a built-in path:
| Signal | Mechanism | Surfaces in |
|---|---|---|
| Events | r.Recorder.Event(&cache, corev1.EventTypeNormal, "Created", "...") | kubectl describe cache |
| Conditions | meta.SetStatusCondition on status.conditions | kubectl get cache -o yaml, gates |
| Metrics | controller-runtime exposes /metrics; register custom counters with metrics.Registry | Prometheus |
controller-runtime already emits controller_runtime_reconcile_total, controller_runtime_reconcile_errors_total, and a reconcile-duration histogram. Alert on the error rate and a rising reconcile queue depth before writing any custom metric.
8. Package, generate RBAC, and distribute
RBAC is generated from markers, never written by hand. Annotate the reconciler with exactly the permissions it uses — least privilege is the default if you are honest in these comments:
// +kubebuilder:rbac:groups=cache.acme.io,resources=caches,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=cache.acme.io,resources=caches/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=cache.acme.io,resources=caches/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// Owns(&corev1.Service{}) needs at least list/watch on services; the write verbs cover managing the Service.
// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=events,verbs=create;patch
make manifests regenerates config/rbac/role.yaml from these. Then build, push, and deploy:
make docker-build docker-push IMG=ghcr.io/acme/cache-operator:v0.1.0
make deploy IMG=ghcr.io/acme/cache-operator:v0.1.0 # applies CRDs + RBAC + manager
For distribution, pick based on audience:
- Helm chart: pragmatic for internal platform teams. Kubebuilder can scaffold one (kubebuilder edit --plugins=helm/v1-alpha); ship the CRD, RBAC, and Deployment together and let consumers helm upgrade.
- OLM bundle: the right choice if you publish to OperatorHub or target OpenShift. OLM manages install, upgrade graphs, and dependency resolution via a ClusterServiceVersion. More machinery, but it handles version-to-version upgrade edges a Helm chart leaves to you.
Ship CRDs out-of-band from the operator Deployment when you can. Helm’s handling of CRD upgrades is notoriously weak (it does not upgrade CRDs in the crds/ directory on helm upgrade), and a CRD is cluster-scoped shared state, so treat its lifecycle more carefully than the workload.
Enterprise scenario
A payments platform ran a multi-tenant operator that provisioned a Cache per tenant. After a routine helm upgrade bumped the manager image, every reconcile across ~1,400 Cache objects fired at once on leader-election handover. The controller hammered the API server, the workqueue depth alert paged, and controller_runtime_reconcile_total spiked while apiserver_request_duration_seconds for PATCH deployments blew past 2s. Root cause was a non-idempotent write: the team had added a last-reconciled annotation set to time.Now() inside the deployment spec on every pass, so server-side apply saw a diff every call — a textbook reconcile storm, dormant until a mass re-sync exposed it.
Two fixes shipped. First, the timestamp moved out of the applied object and into status only. Second — the part most teams miss — they bounded concurrency and client throughput so a future re-sync degrades gracefully instead of self-DoSing:
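// cmd/main.go: manager setup
// Rate-limit the client first. The manager builds its clients from cfg
// inside NewManager, so set these before that call.
cfg.QPS, cfg.Burst = 30, 50
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
	// Cap parallel reconciles so a mass re-sync cannot flood the API server.
	Controller: config.Controller{MaxConcurrentReconciles: 4},
})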
They also added a PrometheusRule alerting on rate(controller_runtime_reconcile_errors_total[5m]) and workqueue_depth, with the depth alert firing before saturation rather than after. The lesson: idempotency isn’t a code-review nicety — at fleet scale, one time.Now() in an applied field is a latent outage waiting for a leader change.
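The effect of a concurrency cap during a mass re-sync can be seen in a small stand-alone model. This is not controller-runtime code: runBounded is a hypothetical helper that uses a buffered channel as a semaphore, with 1,400 keys and a limit of 4 chosen to mirror the scenario above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runBounded processes every key with at most `limit` workers in flight and
// returns the highest concurrency it observed.
func runBounded(keys []string, limit int, work func(string)) int64 {
	var inFlight, peak int64
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup

	for _, k := range keys {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` workers are busy
		go func(key string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the work is counted out

			n := atomic.AddInt64(&inFlight, 1)
			for { // record a new peak if this worker raised it
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			work(key)
			atomic.AddInt64(&inFlight, -1)
		}(k)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	keys := make([]string, 1400)
	for i := range keys {
		keys[i] = fmt.Sprintf("tenant-%04d", i)
	}

	var reconciled int64
	peak := runBounded(keys, 4, func(string) { atomic.AddInt64(&reconciled, 1) })
	fmt.Printf("reconciled=%d peak concurrency=%d\n", reconciled, peak)
}
```

All keys still get processed; the cap only spreads the burst out over time, which is the graceful degradation the team was after.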
Verify
Confirm the operator works end to end against a real cluster (a kind cluster is fine):
# 1. CRD is registered
kubectl get crd caches.cache.acme.io
# 2. Create an instance and watch the operator converge
kubectl apply -f - <<'EOF'
apiVersion: cache.acme.io/v1alpha1
kind: Cache
metadata:
  name: demo
spec:
  image: redis:7
  replicas: 3
EOF
# 3. Owner-referenced Deployment appears and becomes ready
kubectl get deploy -l app=demo -o wide
kubectl get cache demo # printer columns show Replicas / Ready
# 4. Status subresource reflects reality
kubectl get cache demo -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'
# 5. Self-healing: delete the managed Deployment, it comes back
kubectl delete deploy demo
kubectl get deploy -l app=demo -w # recreated by the next reconcile
# 6. Webhook rejects an immutable change (run while the Cache still exists)
kubectl patch cache demo --type=merge -p '{"spec":{"image":"redis:6"}}'
# expect: admission webhook ... image is immutable
# 7. Finalizer blocks deletion until external cleanup runs
kubectl delete cache demo # blocks briefly in Terminating, then clears
kubectl get cache demo # NotFound
Check the operator’s own health:
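kubectl logs -n cache-operator-system deploy/cache-operator-controller-manager -c manager
# The manager pod serves its own /metrics; kubectl get --raw /metrics returns the API server's metrics instead.
# Port and auth depend on your --metrics-bind-address setup; plain HTTP on 8080 is assumed here.
kubectl port-forward -n cache-operator-system deploy/cache-operator-controller-manager 8080:8080 &
curl -s localhost:8080/metrics | grep controller_runtime_reconcile_errors_total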
Production-readiness checklist
Pitfalls
The mistakes that cause real incidents, ranked by how often I see them:
- Non-idempotent writes -> reconcile storms. Anything that changes on every call (timestamps, generated names, re-sorted lists) creates an infinite hot loop. Diff before you write.
- Stuck Terminating. A finalizer whose cleanup function can permanently fail wedges deletion forever. Make cleanup idempotent and bounded; “not found externally” is success.
- Mutating spec in the status update path (or vice versa). Keep the two writes strictly separate, and only ever write status via r.Status().
- Webhook failurePolicy: Fail with no availability plan. If the webhook pod is down, all writes to that resource, including those by the operator itself, are blocked. Run multiple replicas or scope the webhook narrowly.
- Hand-edited RBAC drifting from code. Always regenerate with make manifests; a missing verb surfaces only at runtime, as a permission error mid-reconcile, not at deploy time.
Get the reconcile skeleton and idempotency right first — everything else is hardening on top of a loop that already converges correctly.