
Building a Kubernetes Operator with Kubebuilder: CRDs, Reconciliation & Production Hardening

An operator is not “a controller plus a CRD.” It is a promise: declare desired state in a custom resource, and the operator continuously drives the world toward it — surviving restarts, partial failures, and concurrent edits. This guide builds one for real with Kubebuilder, then hardens the parts that bite you in production: idempotency, finalizers, webhooks, and tests.

We’ll model a Cache resource that provisions a Redis-style Deployment plus Service, and (to make finalizers meaningful) registers itself with a hypothetical external metadata service on create and deregisters on delete.

1. Operator pattern fundamentals

The whole design rests on one idea: level-triggered reconciliation, not edge-triggered event handling.

Edge-triggered thinking (“on create, do X; on update, do Y”) is a trap. Events get coalesced, dropped on restart, and delivered out of order. Instead, your Reconcile function receives only a name and namespace, fetches current desired state, observes current actual state, and computes the diff. It must produce the same result whether it’s the first call or the thousandth.

            +-----------------------+
desired --> |  Reconcile(req)       | --> actual cluster state
(spec)      |  observe -> diff ->   |
            |  act -> update status |
            +-----------------------+
                    ^      |
                    +------+  re-queued on change, error, or interval
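To make the level-triggered idea concrete, here is a minimal, standard-library-only sketch. The State map and this Reconcile signature are hypothetical stand-ins, not controller-runtime types; the point is only that the function looks at current desired and actual levels and never at which event fired.

```go
package main

import "fmt"

// State is a hypothetical stand-in for "what exists": a replica count
// per named Cache. It is not a Kubernetes type.
type State map[string]int

// Reconcile is level-triggered: it receives only a name, reads desired
// and actual state, and closes the gap. It reports whether it had to
// change anything.
func Reconcile(name string, desired, actual State) bool {
	want, exists := desired[name]
	if !exists {
		// Desired object is gone: make sure the actual one is gone too.
		if _, ok := actual[name]; ok {
			delete(actual, name)
			return true
		}
		return false
	}
	if got, ok := actual[name]; ok && got == want {
		return false // already converged: repeat calls are no-ops
	}
	actual[name] = want
	return true
}

func main() {
	desired := State{"demo": 3}
	actual := State{}

	fmt.Println(Reconcile("demo", desired, actual)) // first call acts
	fmt.Println(Reconcile("demo", desired, actual)) // second call is a no-op

	// Three edits arrive before the next reconcile; only the final level matters.
	desired["demo"] = 5
	desired["demo"] = 2
	desired["demo"] = 4
	fmt.Println(Reconcile("demo", desired, actual), actual["demo"])
}
```

Because the function never depends on event history, dropped or coalesced events cost nothing: the next call still converges on the latest desired value.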

Three rules follow directly:

2. Scaffold with Kubebuilder and design the CRD

Initialize the project and create the API. Pick a domain you own; the group/version/kind become your API surface.

mkdir cache-operator && cd cache-operator
go mod init github.com/acme/cache-operator

# Scaffold project layout, Makefile, manager main.go
kubebuilder init --domain acme.io --repo github.com/acme/cache-operator

# Create the API: group cache, version v1alpha1, kind Cache
kubebuilder create api --group cache --version v1alpha1 --kind Cache \
  --resource --controller

Now define the type. The key design decisions live in the marker comments above the struct, not in the fields themselves.

// api/v1alpha1/cache_types.go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type CacheSpec struct {
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=9
	// +kubebuilder:default=1
	Replicas int32 `json:"replicas"`

	// +kubebuilder:validation:Required
	Image string `json:"image"`

	// +optional
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
}

type CacheStatus struct {
	// ObservedGeneration is the .metadata.generation the status reflects.
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`

	ReadyReplicas int32 `json:"readyReplicas,omitempty"`

	// +listType=map
	// +listMapKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
type Cache struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   CacheSpec   `json:"spec,omitempty"`
	Status CacheStatus `json:"status,omitempty"`
}

The +kubebuilder:subresource:status marker is load-bearing. It puts /status on its own endpoint, so a status write cannot accidentally clobber spec, and a status write does not bump metadata.generation (only spec changes do). That makes status.observedGeneration vs metadata.generation a reliable “have I caught up?” signal.
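The generation bookkeeping can be modeled in a few lines. This is a sketch under stated assumptions: the Object type and its methods are hypothetical and only imitate how the API server treats spec and status writes once the status subresource is enabled.

```go
package main

import "fmt"

// Object is a hypothetical, stripped-down model of a resource with a
// status subresource. Only the counters that matter here are kept.
type Object struct {
	Generation         int64 // bumped by the API server on spec changes
	Replicas           int32 // stands in for .spec
	ObservedGeneration int64 // written by the controller into .status
	ReadyReplicas      int32 // stands in for the rest of .status
}

// UpdateSpec models a write to the main resource: generation moves
// only when the spec actually changes.
func (o *Object) UpdateSpec(replicas int32) {
	if o.Replicas != replicas {
		o.Replicas = replicas
		o.Generation++
	}
}

// UpdateStatus models a write to /status: generation never moves.
func (o *Object) UpdateStatus(ready int32) {
	o.ReadyReplicas = ready
	o.ObservedGeneration = o.Generation
}

// CaughtUp reports whether status reflects the latest spec.
func (o *Object) CaughtUp() bool {
	return o.ObservedGeneration == o.Generation
}

func main() {
	o := &Object{Generation: 1, Replicas: 1}
	o.UpdateStatus(1)
	fmt.Println("after first reconcile:", o.CaughtUp())

	o.UpdateSpec(3) // user scales up; controller has not run yet
	fmt.Println("after spec edit:", o.CaughtUp())

	o.UpdateStatus(3) // controller reconciles and reports
	fmt.Println("after second reconcile:", o.CaughtUp())
}
```

Clients that wait on a Cache can apply the same comparison before trusting any condition in status.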

Regenerate the deepcopy code and CRD manifests after every type change:

make generate   # runs controller-gen object — DeepCopy methods
make manifests  # runs controller-gen crd,rbac,webhook -> config/crd, config/rbac

3. Implement an idempotent, level-triggered Reconcile

The reconcile body is a fixed skeleton. Memorize this shape; deviating from it is how bugs get in.

func (r *CacheReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := logf.FromContext(ctx)

	// 1. ALWAYS re-fetch. Never trust cached object state from the event.
	var cache cachev1alpha1.Cache
	if err := r.Get(ctx, req.NamespacedName, &cache); err != nil {
		// NotFound means it was deleted and finalizers (if any) already ran.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. (Finalizer + deletion handling goes here — see section 5.)

	// 3. Reconcile owned objects toward desired state, idempotently.
	dep := r.desiredDeployment(&cache)
	if err := ctrl.SetControllerReference(&cache, dep, r.Scheme); err != nil {
		return ctrl.Result{}, err
	}
	if err := r.applyDeployment(ctx, dep); err != nil {
		return ctrl.Result{}, err
	}

	// 4. Observe actual state and update status (on the status subresource).
	var live appsv1.Deployment
	if err := r.Get(ctx, client.ObjectKeyFromObject(dep), &live); err != nil {
		return ctrl.Result{}, err
	}

	cache.Status.ReadyReplicas = live.Status.ReadyReplicas
	cache.Status.ObservedGeneration = cache.Generation
	meta.SetStatusCondition(&cache.Status.Conditions, metav1.Condition{
		Type:    "Available",
		Status:  conditionStatus(live.Status.ReadyReplicas >= cache.Spec.Replicas),
		Reason:  "DeploymentReady",
		Message: fmt.Sprintf("%d/%d replicas ready", live.Status.ReadyReplicas, cache.Spec.Replicas),
	})
	if err := r.Status().Update(ctx, &cache); err != nil {
		return ctrl.Result{}, err
	}

	log.Info("reconciled", "ready", cache.Status.ReadyReplicas)
	return ctrl.Result{}, nil
}

A few non-obvious rules baked into that code:
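One detail the skeleton leaves open is the conditionStatus helper, which the source calls but does not define. The sketch below shows one plausible shape, together with a simplified model of why setting a condition on every pass is safe: the transition time moves only when Status flips, which is the same rule meta.SetStatusCondition follows. The Condition struct here is a hypothetical local mirror (with a plain integer timestamp), not metav1.Condition.

```go
package main

import "fmt"

// Condition is a hypothetical local mirror of the metav1.Condition
// fields that matter for this sketch.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	Reason             string
	LastTransitionUnix int64 // simplified timestamp
}

// conditionStatus maps a bool to the string form used in conditions.
func conditionStatus(ok bool) string {
	if ok {
		return "True"
	}
	return "False"
}

// setCondition inserts or updates a condition by Type. The transition
// time moves only when Status flips, so repeated reconciles with the
// same outcome leave the slice unchanged. It reports whether anything changed.
func setCondition(conds *[]Condition, c Condition, now int64) bool {
	for i := range *conds {
		existing := &(*conds)[i]
		if existing.Type != c.Type {
			continue
		}
		changed := false
		if existing.Status != c.Status {
			existing.Status = c.Status
			existing.LastTransitionUnix = now
			changed = true
		}
		if existing.Reason != c.Reason {
			existing.Reason = c.Reason
			changed = true
		}
		return changed
	}
	c.LastTransitionUnix = now
	*conds = append(*conds, c)
	return true
}

func main() {
	var conds []Condition
	ready := Condition{Type: "Available", Status: conditionStatus(true), Reason: "DeploymentReady"}

	fmt.Println(setCondition(&conds, ready, 100)) // first pass adds the condition
	fmt.Println(setCondition(&conds, ready, 160)) // same outcome: nothing changes
	fmt.Println(conds[0].LastTransitionUnix)      // still the first timestamp
}
```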

4. Owner references, server-side apply, and avoiding reconcile storms

ctrl.SetControllerReference stamps the Deployment with an owner reference back to the Cache. This does two things: garbage collection deletes the Deployment automatically when the Cache is removed, and your watch on owned objects works.

Wire that up in SetupWithManager so a change to the managed Deployment triggers a reconcile of the owner:

func (r *CacheReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&cachev1alpha1.Cache{}).
		Owns(&appsv1.Deployment{}).   // maps owned obj -> owner via ownerRef
		Owns(&corev1.Service{}).
		Complete(r)
}
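Conceptually, Owns(...) maps an event on an owned object back to a reconcile request for its controlling owner. The sketch below models that lookup with hypothetical local types (OwnerRef, Object, Request); it illustrates the idea and is not controller-runtime's implementation.

```go
package main

import "fmt"

// OwnerRef and Object are hypothetical, minimal mirrors of the
// metadata fields involved in owner-based event mapping.
type OwnerRef struct {
	Kind       string
	Name       string
	Controller bool // true for the single managing owner
}

type Object struct {
	Kind      string
	Namespace string
	Name      string
	Owners    []OwnerRef
}

// Request identifies the object to reconcile, similar in spirit to ctrl.Request.
type Request struct {
	Namespace, Name string
}

// mapToOwner turns an event on an owned object into a reconcile request
// for its controlling owner of the given kind. Objects without such an
// owner produce no request.
func mapToOwner(obj Object, ownerKind string) (Request, bool) {
	for _, ref := range obj.Owners {
		if ref.Controller && ref.Kind == ownerKind {
			// The owner lives in the same namespace as the owned object.
			return Request{Namespace: obj.Namespace, Name: ref.Name}, true
		}
	}
	return Request{}, false
}

func main() {
	owned := Object{
		Kind: "Deployment", Namespace: "default", Name: "demo",
		Owners: []OwnerRef{{Kind: "Cache", Name: "demo", Controller: true}},
	}
	stray := Object{Kind: "Deployment", Namespace: "default", Name: "other"}

	req, ok := mapToOwner(owned, "Cache")
	fmt.Println(req, ok)

	req, ok = mapToOwner(stray, "Cache")
	fmt.Println(req, ok)
}
```

This is also why a Deployment created without SetControllerReference never triggers a reconcile of the Cache: there is no owner reference to follow.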

For the actual write, server-side apply beats the classic get-then-update dance. SSA declares the fields you own; other controllers and humans can own other fields without a write-write conflict, and there is no read-modify-write race:

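func (r *CacheReconciler) applyDeployment(ctx context.Context, dep *appsv1.Deployment) error {
	// Server-side apply sends dep as the patch body, so desiredDeployment must set
	// TypeMeta (apiVersion: apps/v1, kind: Deployment) and leave managedFields empty.
	return r.Patch(ctx, dep, client.Apply,
		client.FieldOwner("cache-operator"),
		client.ForceOwnership)
}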

The reconcile storm. The classic self-inflicted outage: your reconcile writes an object on every call, the write triggers a watch event, which triggers another reconcile, which writes again, and the result is a hot loop pinning a CPU. Server-side apply with a stable FieldOwner is a no-op when the applied content is unchanged (the resourceVersion doesn’t move, so no event fires), which only helps if the object you build is deterministic. If you must use CreateOrUpdate, mutate only inside its callback and never set timestamps, random values, or re-ordered slices.
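The storm and its fix can be shown with a tiny model. Everything here is hypothetical: Store stands in for the API server and counts writes, and DeploymentSpec is a simplified desired-state struct. The sketch normalizes the desired object and skips the write when nothing differs.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// DeploymentSpec is a hypothetical, simplified desired-state struct.
type DeploymentSpec struct {
	Image       string
	Replicas    int32
	PullSecrets []string // order carries no meaning
}

// Store is a hypothetical API-server stand-in that counts writes.
// In a real cluster every write emits a watch event and another reconcile.
type Store struct {
	current DeploymentSpec
	Writes  int
}

// normalize makes the spec deterministic so equal intent compares equal.
// It copies the slice so the caller's value is not mutated.
func normalize(s DeploymentSpec) DeploymentSpec {
	secrets := append([]string(nil), s.PullSecrets...)
	sort.Strings(secrets)
	s.PullSecrets = secrets
	return s
}

// Apply writes only when the normalized desired state differs from
// what is stored, so a converged object produces no write and no event.
func (st *Store) Apply(desired DeploymentSpec) {
	desired = normalize(desired)
	if reflect.DeepEqual(st.current, desired) {
		return
	}
	st.current = desired
	st.Writes++
}

func main() {
	st := &Store{}
	a := DeploymentSpec{Image: "redis:7", Replicas: 3, PullSecrets: []string{"b", "a"}}
	b := DeploymentSpec{Image: "redis:7", Replicas: 3, PullSecrets: []string{"a", "b"}}

	for i := 0; i < 1000; i++ {
		st.Apply(a)
		st.Apply(b) // same intent, different slice order
	}
	fmt.Println("writes after 2000 applies:", st.Writes)

	a.Replicas = 5
	st.Apply(a)
	fmt.Println("writes after a real change:", st.Writes)
}
```

Adding a field set to the current time to DeploymentSpec would make every Apply a write again, which is the failure mode described above.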

5. Finalizers for safe external cleanup

Owner references handle in-cluster GC. They do nothing for resources outside the cluster — the external registration in our example. For that you need a finalizer: a string in metadata.finalizers that blocks deletion until you remove it.

The deletion-handling block from section 3 expands to this. Order matters: add the finalizer before doing external work, run cleanup before removing the finalizer.

const finalizer = "cache.acme.io/finalizer"

// Inside Reconcile, after the Get:

if cache.DeletionTimestamp.IsZero() {
	// Not being deleted: ensure our finalizer is present.
	if !controllerutil.ContainsFinalizer(&cache, finalizer) {
		controllerutil.AddFinalizer(&cache, finalizer)
		if err := r.Update(ctx, &cache); err != nil {
			return ctrl.Result{}, err
		}
	}
} else {
	// Being deleted: run external cleanup, then drop the finalizer.
	if controllerutil.ContainsFinalizer(&cache, finalizer) {
		if err := r.deregisterExternal(ctx, &cache); err != nil {
			// Return the error — deletion stays blocked, we retry with backoff.
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(&cache, finalizer)
		if err := r.Update(ctx, &cache); err != nil {
			return ctrl.Result{}, err
		}
	}
	// Finalizer gone -> Kubernetes completes deletion. Stop here.
	return ctrl.Result{}, nil
}
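The source does not show deregisterExternal. A minimal sketch of an idempotent version follows, assuming a hypothetical external client that returns a sentinel ErrNotRegistered error when the entry is already gone; the Registry type is an in-memory stand-in for that service.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotRegistered is a hypothetical error the external metadata
// service client returns when the entry does not exist.
var ErrNotRegistered = errors.New("not registered")

// Registry is a hypothetical in-memory stand-in for the external service.
type Registry struct {
	entries map[string]bool
}

// Delete removes an entry or reports that it was not there.
func (r *Registry) Delete(id string) error {
	if !r.entries[id] {
		return ErrNotRegistered
	}
	delete(r.entries, id)
	return nil
}

// deregisterExternal is safe to call repeatedly: "already gone" counts
// as success, so a retry after a crash between cleanup and finalizer
// removal does not wedge deletion. Other errors are returned for retry.
func deregisterExternal(r *Registry, id string) error {
	err := r.Delete(id)
	if err == nil || errors.Is(err, ErrNotRegistered) {
		return nil
	}
	return fmt.Errorf("deregister %q: %w", id, err)
}

func main() {
	reg := &Registry{entries: map[string]bool{"default/demo": true}}

	fmt.Println(deregisterExternal(reg, "default/demo")) // removes the entry
	fmt.Println(deregisterExternal(reg, "default/demo")) // already gone: still success
}
```

Treating "not found" as success is what lets the finalizer be removed on the retry path instead of leaving the Cache stuck in Terminating.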

Two failure modes worth internalizing:

6. Validating and defaulting with admission webhooks

CRD OpenAPI validation (the +kubebuilder:validation markers) handles structural rules — ranges, required fields, enums. For cross-field logic, immutability, or environment-aware defaults, you need webhooks.

Scaffold them:

kubebuilder create webhook --group cache --version v1alpha1 --kind Cache \
  --defaulting --programmatic-validation

Modern Kubebuilder generates the CustomDefaulter and CustomValidator interfaces (decoupled from the API type). The validator signatures return warnings plus an error:

// internal/webhook/v1alpha1/cache_webhook.go

func (v *CacheCustomValidator) ValidateUpdate(
	ctx context.Context, oldObj, newObj runtime.Object,
) (admission.Warnings, error) {
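	// Checked assertions: a bare oldObj.(*T) would panic on an unexpected type.
	oldC, ok := oldObj.(*cachev1alpha1.Cache)
	if !ok {
		return nil, fmt.Errorf("expected a Cache for oldObj, got %T", oldObj)
	}
	newC, ok := newObj.(*cachev1alpha1.Cache)
	if !ok {
		return nil, fmt.Errorf("expected a Cache for newObj, got %T", newObj)
	}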

	// Enforce immutability that OpenAPI can't express.
	if oldC.Spec.Image != newC.Spec.Image {
		return nil, field.Forbidden(
			field.NewPath("spec", "image"),
			"image is immutable; delete and recreate the Cache")
	}
	return nil, nil
}

Webhooks require TLS and a CABundle wired into the webhook configuration. In real deployments let cert-manager issue and rotate the serving cert, and use Kubebuilder’s [WEBHOOK] and [CERTMANAGER] kustomize patches rather than managing certs by hand. Always set failurePolicy deliberately: Fail (default) blocks writes if the webhook is down — safe but can wedge your cluster; Ignore is more available but lets invalid objects through.

Conversion webhooks become relevant the moment you ship a v1beta1. Set one version as the storage version (+kubebuilder:storageversion) and implement Hub/Convertible so the API server can round-trip objects between versions. Get the conversion functions wrong and you silently corrupt stored data — treat them like a database migration.
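A round-trip check is the cheapest guard against lossy conversion. The sketch below is purely illustrative: the source defines only v1alpha1, so the v1beta1 shape (Size, Persistent), the annotation key, and the ToHub/FromHub function names are all assumptions. It uses the common pattern of stashing a field the older version cannot express in an annotation.

```go
package main

import (
	"fmt"
	"reflect"
)

// Both types are hypothetical. The v1beta1 fields are invented to
// illustrate conversion and do not come from the source.
type CacheV1alpha1 struct {
	Annotations map[string]string
	Image       string
	Replicas    int32
}

// CacheV1beta1 plays the hub (storage version) in this sketch.
type CacheV1beta1 struct {
	Annotations map[string]string
	Image       string
	Size        int32 // renamed from Replicas
	Persistent  bool  // has no v1alpha1 field
}

// persistentAnnotation carries the field v1alpha1 cannot express.
const persistentAnnotation = "cache.acme.io/persistent"

// copyWithout copies annotations, skipping one key. It returns nil when
// nothing is copied so nil maps round-trip as nil.
func copyWithout(in map[string]string, skip string) map[string]string {
	var out map[string]string
	for k, v := range in {
		if k == skip {
			continue
		}
		if out == nil {
			out = make(map[string]string)
		}
		out[k] = v
	}
	return out
}

// ToHub converts v1alpha1 to v1beta1, restoring Persistent from the annotation.
func ToHub(src CacheV1alpha1) CacheV1beta1 {
	return CacheV1beta1{
		Annotations: copyWithout(src.Annotations, persistentAnnotation),
		Image:       src.Image,
		Size:        src.Replicas,
		Persistent:  src.Annotations[persistentAnnotation] == "true",
	}
}

// FromHub converts v1beta1 to v1alpha1, stashing Persistent in an
// annotation so the value survives a trip through the older version.
func FromHub(src CacheV1beta1) CacheV1alpha1 {
	out := CacheV1alpha1{
		Annotations: copyWithout(src.Annotations, persistentAnnotation),
		Image:       src.Image,
		Replicas:    src.Size,
	}
	if src.Persistent {
		if out.Annotations == nil {
			out.Annotations = map[string]string{}
		}
		out.Annotations[persistentAnnotation] = "true"
	}
	return out
}

func main() {
	hub := CacheV1beta1{
		Annotations: map[string]string{"team": "payments"},
		Image:       "redis:7",
		Size:        3,
		Persistent:  true,
	}
	back := ToHub(FromHub(hub))
	fmt.Println("round trip lossless:", reflect.DeepEqual(hub, back))
}
```

Dropping the annotation step makes the same check fail for any object with Persistent set, which is the kind of silent data loss a conversion test should catch before release.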

7. Test with envtest, and emit events, conditions, and metrics

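envtest runs a real kube-apiserver and etcd binary locally, with no kubelet, no scheduler, and no built-in controllers. You get genuine API-server validation, admission, and your CRD, which is exactly what controller logic depends on. The flip side is that Pods never start and garbage collection never runs, so assert on the objects your controller creates rather than on Deployment readiness or cascading deletes.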

// suite_test.go (Ginkgo)
var _ = Describe("Cache controller", func() {
	It("creates an owned Deployment", func() {
		cache := &cachev1alpha1.Cache{
			ObjectMeta: metav1.ObjectMeta{Name: "c1", Namespace: "default"},
			Spec:       cachev1alpha1.CacheSpec{Image: "redis:7", Replicas: 3},
		}
		Expect(k8sClient.Create(ctx, cache)).To(Succeed())

		// Reconcile is async — poll with Eventually, never sleep.
		key := types.NamespacedName{Name: "c1", Namespace: "default"}
		Eventually(func() error {
			var dep appsv1.Deployment
			return k8sClient.Get(ctx, key, &dep)
		}, "10s", "200ms").Should(Succeed())
	})
})

# Downloads pinned apiserver/etcd binaries and runs the suite
make test

For observability, three signals matter and each has a built-in path:

Signal     | Mechanism                                                                           | Surfaces in
-----------+-------------------------------------------------------------------------------------+---------------------------------
Events     | r.Recorder.Event(&cache, corev1.EventTypeNormal, "Created", "...")                  | kubectl describe cache
Conditions | meta.SetStatusCondition on status.conditions                                        | kubectl get cache -o yaml, gates
Metrics    | controller-runtime exposes /metrics; register custom counters with metrics.Registry | Prometheus

controller-runtime already emits controller_runtime_reconcile_total, controller_runtime_reconcile_errors_total, and a reconcile-duration histogram. Alert on the error rate and a rising reconcile queue depth before writing any custom metric.

8. Package, generate RBAC, and distribute

RBAC is generated from markers, never written by hand. Annotate the reconciler with exactly the permissions it uses — least privilege is the default if you are honest in these comments:

// +kubebuilder:rbac:groups=cache.acme.io,resources=caches,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=cache.acme.io,resources=caches/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=cache.acme.io,resources=caches/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=events,verbs=create;patch

make manifests regenerates config/rbac/role.yaml from these. Then build, push, and deploy:

make docker-build docker-push IMG=ghcr.io/acme/cache-operator:v0.1.0
make deploy IMG=ghcr.io/acme/cache-operator:v0.1.0   # applies CRDs + RBAC + manager

For distribution, pick based on audience:

Ship CRDs out-of-band from the operator Deployment when you can. Helm’s handling of CRD upgrades is notoriously weak (it does not upgrade CRDs in the crds/ directory on helm upgrade), and a CRD is cluster-scoped shared state — treat its lifecycle more carefully than the workload.

Enterprise scenario

A payments platform ran a multi-tenant operator that provisioned a Cache per tenant. After a routine helm upgrade bumped the manager image, every reconcile across ~1,400 Cache objects fired at once on leader-election handover. The controller hammered the API server, the workqueue depth alert paged, and controller_runtime_reconcile_total spiked while apiserver_request_duration_seconds for PATCH deployments blew past 2s. Root cause was a non-idempotent write: the team had added a last-reconciled annotation set to time.Now() inside the deployment spec on every pass, so server-side apply saw a diff every call — a textbook reconcile storm, dormant until a mass re-sync exposed it.

Two fixes shipped. First, the timestamp moved out of the applied object and into status only. Second — the part most teams miss — they bounded concurrency and client throughput so a future re-sync degrades gracefully instead of self-DoSing:

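// cmd/main.go, manager setup
// Rate-limit the client itself. Set this before NewManager, which builds its clients from cfg.
cfg.QPS, cfg.Burst = 30, 50

mgr, err := ctrl.NewManager(cfg, ctrl.Options{
    // config is sigs.k8s.io/controller-runtime/pkg/config
    Controller: config.Controller{MaxConcurrentReconciles: 4},
})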

They also added a PrometheusRule alerting on rate(controller_runtime_reconcile_errors_total[5m]) and workqueue_depth, with the depth alert firing before saturation rather than after. The lesson: idempotency isn’t a code-review nicety — at fleet scale, one time.Now() in an applied field is a latent outage waiting for a leader change.
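To see what QPS 30 and Burst 50 mean during a mass re-sync, here is a deterministic token-bucket sketch. It is a simplified model with hypothetical names (TokenBucket, simulate), not client-go's limiter; in particular, the real client waits for a token rather than rejecting the call. Time is passed in explicitly so the behavior can be checked without sleeping.

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket is a hypothetical, deterministic token bucket: burst is
// the bucket size and qps the refill rate per second.
type TokenBucket struct {
	qps    float64
	burst  float64
	tokens float64
	last   time.Time
}

// NewTokenBucket starts with a full bucket at time now.
func NewTokenBucket(qps float64, burst int, now time.Time) *TokenBucket {
	return &TokenBucket{qps: qps, burst: float64(burst), tokens: float64(burst), last: now}
}

// Allow reports whether one request may proceed at time now.
func (b *TokenBucket) Allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.qps
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// simulate fires the given number of requests at t=0 and again at t=1s
// and returns how many were admitted each time.
func simulate(qps float64, burst, requests int) (immediate, afterOneSecond int) {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	b := NewTokenBucket(qps, burst, start)
	for i := 0; i < requests; i++ {
		if b.Allow(start) {
			immediate++
		}
	}
	later := start.Add(time.Second)
	for i := 0; i < requests; i++ {
		if b.Allow(later) {
			afterOneSecond++
		}
	}
	return immediate, afterOneSecond
}

func main() {
	// A re-sync of ~1,400 objects asks for one write each at the same instant.
	immediate, later := simulate(30, 50, 1400)
	fmt.Println("admitted immediately:", immediate)
	fmt.Println("admitted one second later:", later)
}
```

Under this model the burst absorbs the first 50 writes and the rest drain at 30 per second, so a full pass over 1,400 objects takes on the order of 45 seconds instead of landing on the API server all at once.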

Verify

Confirm the operator works end to end against a real cluster (a kind cluster is fine):

# 1. CRD is registered
kubectl get crd caches.cache.acme.io

# 2. Create an instance and watch the operator converge
kubectl apply -f - <<'EOF'
apiVersion: cache.acme.io/v1alpha1
kind: Cache
metadata:
  name: demo
spec:
  image: redis:7
  replicas: 3
EOF

# 3. Owner-referenced Deployment appears and becomes ready
kubectl get deploy -l app=demo -o wide
kubectl get cache demo   # printer columns show Replicas / Ready

# 4. Status subresource reflects reality
kubectl get cache demo -o jsonpath='{.status.conditions[?(@.type=="Available")].status}{"\n"}'

# 5. Self-healing: delete the managed Deployment, it comes back
kubectl delete deploy demo
kubectl get deploy -l app=demo -w   # recreated by the next reconcile

# 6. Webhook rejects an immutable change (run this while the Cache still exists)
kubectl patch cache demo --type=merge -p '{"spec":{"image":"redis:6"}}'
# expect: admission webhook ... image is immutable

# 7. Finalizer blocks deletion until external cleanup runs
kubectl delete cache demo            # blocks briefly in Terminating, then clears
kubectl get cache demo               # NotFound

Check the operator’s own health:

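kubectl logs -n cache-operator-system deploy/cache-operator-controller-manager -c manager
# The manager pod serves its own /metrics (kubectl get --raw /metrics returns the API server's). Port 8080 is an assumption: match --metrics-bind-address and add TLS/auth if your scaffold enables it.
kubectl port-forward -n cache-operator-system deploy/cache-operator-controller-manager 8080:8080 &
curl -s localhost:8080/metrics | grep controller_runtime_reconcile_errors_total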

Production-readiness checklist

Pitfalls

The mistakes that cause real incidents, ranked by how often I see them:

  1. Non-idempotent writes -> reconcile storms. Anything that changes on every call (timestamps, generated names, re-sorted lists) creates an infinite hot loop. Diff before you write.
  2. Stuck Terminating. A finalizer whose cleanup function can permanently fail wedges deletion forever. Make cleanup idempotent and bounded; “not found externally” is success.
  3. Mutating spec in the status update path (or vice versa). Keep the two writes strictly separate, and only ever write status via r.Status().
  4. Webhook failurePolicy: Fail with no availability plan. If the webhook pod is down, all writes to that resource — including by the operator itself — are blocked. Run multiple replicas or scope the webhook narrowly.
  5. Hand-edited RBAC drifting from code. Always regenerate with make manifests; a missing verb surfaces only at runtime, as a Forbidden error mid-reconcile, not at deploy time.

Get the reconcile skeleton and idempotency right first — everything else is hardening on top of a loop that already converges correctly.
