GitOps with Flux: Image Update Automation, OCI Artifact Sources, and Hard Multi-Tenancy

Argo CD gets the conference talks, but Flux quietly runs a lot of the largest GitOps platforms because its controllers compose. Each one does a narrow job and reconciles a single CRD, which means you can wire image scanning to commit automation to OCI distribution without a monolith in the middle. The cost is that you have to understand the pieces. This is a platform-engineer’s tour: image automation that writes tags back to Git, manifests shipped as OCI artifacts instead of cloned repos, and the RBAC plumbing that makes hard multi-tenancy actually hold.

I’m assuming Flux v2 (the GitOps Toolkit, apiVersion group *.toolkit.fluxcd.io), the flux CLI v2.x, and a cluster you have admin on.

1. The controllers and what each reconciles

Flux is six controllers, each owning a set of CRDs. Four are installed by default; the two image controllers are optional:

Controller                    Reconciles                                            Job
source-controller             GitRepository, OCIRepository, HelmRepository, Bucket  Fetch and expose artifacts
kustomize-controller          Kustomization                                         Build kustomize overlays and apply
helm-controller               HelmRelease                                           Render charts and manage releases
notification-controller       Provider, Alert, Receiver                             Dispatch events and handle webhooks
image-reflector-controller    ImageRepository, ImagePolicy                          Scan registries, select tags
image-automation-controller   ImageUpdateAutomation                                 Write selected tags back to Git

The mental model: source-controller produces artifacts, kustomize/helm controllers consume them and apply to the cluster, and the two image controllers form a separate loop that scans registries and pushes commits. They communicate through the Kubernetes API, not direct calls, so a controller can be down and the rest degrade gracefully rather than cascade.

2. Bootstrap declaratively and structure for tenants

flux bootstrap is imperative-feeling but its job is to make Flux manage its own installation from Git. Bootstrap against GitHub:

export GITHUB_TOKEN=ghp_...
flux bootstrap github \
  --owner=acme-platform \
  --repository=fleet-infra \
  --branch=main \
  --path=clusters/prod \
  --components-extra=image-reflector-controller,image-automation-controller \
  --personal=false

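--components-extra is the part people miss: the two image controllers are not installed by default, and neither are their CRDs. Without them, applying an ImageRepository, ImagePolicy, or ImageUpdateAutomation fails with a "no matches for kind" error, and the Kustomization that carries those objects never becomes ready.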

For many tenants, separate the cluster’s own config from tenant config. A structure that scales:

fleet-infra/
  clusters/prod/
    flux-system/            # bootstrap-managed
    tenants.yaml            # one Kustomization per tenant, applied by Flux
  tenants/
    base/
      team-a/
        rbac.yaml           # ServiceAccount + RoleBinding
        sync.yaml           # GitRepository + Kustomization (impersonated)
      team-b/
    production/
      team-a/
        kustomization.yaml  # patches base for prod

The platform team owns clusters/ and tenants/base/*/rbac.yaml. Tenants own their own application repos, which the per-tenant GitRepository points at. flux create tenant scaffolds the namespace, service account, and a RoleBinding to a role you provide:

flux create tenant team-a \
  --with-namespace=team-a \
  --cluster-role=tenant-app-admin \
  --export > tenants/base/team-a/rbac.yaml
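
As a rough sketch of what clusters/prod/tenants.yaml might contain, here is one platform-owned Kustomization that applies a tenant's overlay from the fleet repo. The object name tenant-team-a and the interval are assumptions for illustration; the GitRepository named flux-system is the one that bootstrap creates.

```yaml
# clusters/prod/tenants.yaml (sketch; object name and interval are illustrative)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: tenant-team-a          # hypothetical name
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: flux-system          # the fleet-infra source created by flux bootstrap
  path: ./tenants/production/team-a
  prune: true
  # Platform-owned, so it runs with the controller's own identity. If a
  # restricted default service account is configured on kustomize-controller,
  # this must be set explicitly or the tenant namespace and RBAC cannot be created.
  serviceAccountName: kustomize-controller
```

Keeping one such object per tenant means a broken tenant overlay fails on its own Kustomization instead of blocking the whole cluster sync.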

3. Image automation: scan, select, commit

Three objects drive automated image updates. First, scan the registry:

apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: podinfo
  namespace: team-a
spec:
  image: ghcr.io/acme-platform/podinfo
  interval: 5m
  secretRef:
    name: ghcr-auth

Then declare which tag wins. The policy is where correctness lives – get the ordering wrong and you ship the wrong image:

apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: podinfo
  namespace: team-a
spec:
  imageRepositoryRef:
    name: podinfo
  filterTags:
    pattern: '^main-[a-f0-9]+-(?P<ts>[0-9]+)$'
    extract: '$ts'
  policy:
    numerical:
      order: asc

This filters to main-<sha>-<timestamp> tags, extracts the timestamp, and picks the numerically highest. For real semver releases, use policy.semver with a range like >=1.0.0 instead – never sort semver lexically. Mark the deployment field Flux should rewrite with a setter marker:

    spec:
      containers:
        - name: podinfo
          image: ghcr.io/acme-platform/podinfo:main-abc123-1718000000 # {"$imagepolicy": "team-a:podinfo"}
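
For the semver case mentioned above, a minimal sketch of the policy could look like this. The policy name podinfo-releases is hypothetical, and it assumes the registry actually carries semver tags such as 1.4.2.

```yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: podinfo-releases       # hypothetical name
  namespace: team-a
spec:
  imageRepositoryRef:
    name: podinfo
  policy:
    semver:
      range: ">=1.0.0"         # highest version satisfying the range wins
```

A setter marker for this policy would reference it as team-a:podinfo-releases.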

Finally, the automation that commits the change back:

apiVersion: image.toolkit.fluxcd.io/v1beta2   # .Changed in messageTemplate needs v1beta2
kind: ImageUpdateAutomation
metadata:
  name: team-a-images
  namespace: team-a
spec:
  interval: 30m
  sourceRef:
    kind: GitRepository
    name: team-a
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        name: fluxcdbot
        email: fluxcdbot@acme.example
      messageTemplate: |
        Automated image update
        {{ range .Changed.Changes }}{{ .OldValue }} -> {{ .NewValue }}
        {{ end }}
    push:
      branch: flux-image-updates
  update:
    path: ./apps/team-a
    strategy: Setters

Pushing to a dedicated flux-image-updates branch instead of main is the pattern I steer teams toward: it forces image bumps through a PR with branch protection and CODEOWNERS, so a registry push can’t silently mutate production. Flux only pushes the branch; opening the PR is left to a CI job or a person. The GitRepository your Kustomization reconciles still tracks main, so nothing deploys until the PR merges.
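
For the push to work, the GitRepository that the automation references needs credentials with write access. A minimal sketch, assuming a hypothetical repo URL and a Secret named team-a-git-auth that holds a username and a token with write permission:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: team-a
  namespace: team-a
spec:
  interval: 5m
  url: https://github.com/acme-platform/team-a-apps   # hypothetical repo
  ref:
    branch: main              # what gets deployed; the automation pushes elsewhere
  secretRef:
    name: team-a-git-auth     # hypothetical; username/password keys, token must allow writes
```

The same object serves both roles: kustomize-controller reads main from it, and image-automation-controller uses its credentials to push flux-image-updates.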

4. Manifests as OCI artifacts

Cloning Git on every reconcile across hundreds of tenants is load you don’t need, and it couples deploys to your Git host’s availability. Flux can treat any OCI registry as a source. Push your built manifests as an artifact in CI:

flux push artifact \
  oci://ghcr.io/acme-platform/manifests/team-a:$(git rev-parse --short HEAD) \
  --path=./deploy \
  --source="$(git config --get remote.origin.url)" \
  --revision="$(git rev-parse HEAD)"

flux tag artifact \
  oci://ghcr.io/acme-platform/manifests/team-a:$(git rev-parse --short HEAD) \
  --tag=latest
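
Wired into CI, the two commands above might look like the following GitHub Actions workflow. This is a sketch under several assumptions: GitHub Actions as the CI system, the fluxcd/flux2 and sigstore/cosign-installer setup actions, jq being available on the runner, and keyless cosign signing. The file name and job names are illustrative.

```yaml
# .github/workflows/publish-manifests.yaml (sketch; names are illustrative)
name: publish-manifests
on:
  push:
    branches: [main]
permissions:
  contents: read
  packages: write     # push to ghcr.io
  id-token: write     # OIDC token for keyless cosign signing
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: fluxcd/flux2/action@main
      - uses: sigstore/cosign-installer@v3
      - name: Log in to GHCR
        run: echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
      - name: Push, sign, and tag manifests
        run: |
          ref="oci://ghcr.io/acme-platform/manifests/team-a:$(git rev-parse --short HEAD)"
          # Capture repository@digest so the signature is bound to immutable content.
          digest_url=$(flux push artifact "$ref" \
            --path=./deploy \
            --source="$(git config --get remote.origin.url)" \
            --revision="$(git rev-parse HEAD)" \
            --output json | jq -r '.repository + "@" + .digest')
          cosign sign --yes "$digest_url"
          flux tag artifact "$ref" --tag=latest
```

Signing the digest rather than the tag matters because latest moves; the signature stays attached to the exact content that was built.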

Consume it with OCIRepository instead of GitRepository:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: OCIRepository
metadata:
  name: team-a
  namespace: team-a
spec:
  interval: 10m
  url: oci://ghcr.io/acme-platform/manifests/team-a
  ref:
    tag: latest   # CI above tags artifacts with a short SHA and latest, not semver
  secretRef:
    name: ghcr-auth
  verify:
    provider: cosign
    secretRef:
      name: cosign-pub

The verify block is the reason to bother with OCI even if you keep Git: Flux refuses to reconcile an artifact whose cosign signature doesn’t validate. With the public key in cosign-pub, or with keyless signing in CI if you swap the secretRef for an OIDC identity match, you get a supply-chain gate where an unsigned or tampered artifact never reaches the cluster. GitRepository can verify PGP-signed commits, but that attests to who committed, not to the artifact your CI built.
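
If CI signs keylessly, verification is expressed as an identity match instead of a public-key Secret. A sketch of that variant, assuming signatures are produced by GitHub Actions workflows in the acme-platform org; the subject pattern is an assumption and should be narrowed to the specific workflow you trust:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: OCIRepository
metadata:
  name: team-a
  namespace: team-a
spec:
  interval: 10m
  url: oci://ghcr.io/acme-platform/manifests/team-a
  ref:
    tag: latest
  secretRef:
    name: ghcr-auth
  verify:
    provider: cosign
    # No secretRef here: keyless signatures are checked against the signer's OIDC identity.
    matchOIDCIdentity:
      - issuer: "^https://token.actions.githubusercontent.com$"
        subject: "^https://github.com/acme-platform/.*$"   # assumed org; tighten in practice
```

Both fields are regular expressions, so anchoring them with ^ and $ avoids accidentally trusting a look-alike issuer or repository.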

5. Hard multi-tenancy with impersonation

Soft multi-tenancy (namespaces, NetworkPolicy) is not enough when tenants can author Kustomizations. By default kustomize-controller applies with its own powerful service account, so a tenant manifest could create a ClusterRoleBinding and escalate. Hard multi-tenancy closes this by making Flux impersonate a per-tenant service account that only has namespace-scoped rights:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a
  namespace: team-a
spec:
  serviceAccountName: team-a   # impersonate this SA
  sourceRef:
    kind: OCIRepository
    name: team-a
  path: ./
  prune: true
  interval: 10m
  targetNamespace: team-a

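spec.serviceAccountName is the linchpin. kustomize-controller applies the manifests as team-a, so anything the tenant tries that exceeds that SA’s RBAC fails at apply time. Guard against the field being omitted by setting --default-service-account on kustomize-controller and helm-controller, so a missing serviceAccountName falls back to a powerless SA rather than the controller’s own identity. flux bootstrap has no flag for controller arguments; add them as a patch in the bootstrap-managed flux-system/kustomization.yaml:

# entry under patches: in clusters/prod/flux-system/kustomization.yaml
- target: {kind: Deployment, name: "(kustomize-controller|helm-controller)"}
  patch: |
    - {op: add, path: /spec/template/spec/containers/0/args/-, value: "--default-service-account=fluxcd-noop"}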
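
One way to apply that flag, following the pattern in the Flux multi-tenancy lockdown guidance, is a patch in the kustomization.yaml that bootstrap generates. This is a sketch assuming the default bootstrap layout, with gotk-components.yaml and gotk-sync.yaml in the same directory; fluxcd-noop is the service account name used above.

```yaml
# clusters/prod/flux-system/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Fall back to a powerless identity when serviceAccountName is omitted.
  - target:
      kind: Deployment
      name: "(kustomize-controller|helm-controller)"
    patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --default-service-account=fluxcd-noop
  # Flux's own sync must keep the controller identity, or it can no longer manage itself.
  - target:
      kind: Kustomization
      name: flux-system
    patch: |
      - op: add
        path: /spec/serviceAccountName
        value: kustomize-controller
```

Commit the file and let Flux reconcile it; because the patch lives in Git, it survives upgrades and re-runs of bootstrap.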

The tenant’s RoleBinding must stay namespace-scoped:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-reconciler
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tenant-app-admin   # a role WITHOUT rbac/clusterrole verbs
subjects:
  - kind: ServiceAccount
    name: team-a
    namespace: team-a
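
The tenant-app-admin role itself is not shown here, so the following is only a sketch of what such a ClusterRole could look like. The resource list is an assumption and should be trimmed or extended to what tenants actually deploy. The point is what it leaves out: nothing from the rbac.authorization.k8s.io group and nothing cluster-scoped.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tenant-app-admin
rules:
  # Core workload plumbing (illustrative selection)
  - apiGroups: [""]
    resources: ["configmaps", "secrets", "services", "serviceaccounts"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments", "statefulsets"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # Deliberately absent: rbac.authorization.k8s.io (roles, bindings) and
  # cluster-scoped resources such as namespaces, nodes, and CRDs.
```

Because it is bound through a RoleBinding, the grants apply only inside team-a even though the role is defined cluster-wide; that lets one role definition serve every tenant.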

6. Block cross-namespace references and lock sources

Impersonation stops privilege escalation but not cross-tenant access. A tenant could point a Kustomization in their namespace at another tenant’s GitRepository via a cross-namespace sourceRef, or repoint their own source at a registry you don’t control. Two controls shut both doors. First, disable cross-namespace references on every controller that resolves them. Like other controller arguments, this is set with a patch in flux-system/kustomization.yaml rather than a bootstrap flag:

# entry under patches: in clusters/prod/flux-system/kustomization.yaml
- target:
    kind: Deployment
    name: "(kustomize-controller|helm-controller|notification-controller|image-reflector-controller|image-automation-controller)"
  patch: |
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --no-cross-namespace-refs=true

With this set, a sourceRef may only target objects in the same namespace, so a tenant cannot reference another tenant’s source. Then lock down which URLs sources may use with Kyverno, so a tenant can’t repoint their own OCIRepository at an arbitrary registry:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-flux-source-urls
spec:
  validationFailureAction: Enforce
  rules:
    - name: oci-url-allowlist
      match:
        any:
          - resources:
              kinds: ["OCIRepository"]
      validate:
        message: "OCIRepository url must be under the platform registry"
        pattern:
          spec:
            url: "oci://ghcr.io/acme-platform/manifests/*"
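
If tenants also use Git sources, the same idea extends to GitRepository. A sketch, assuming tenant repos live under the acme-platform GitHub org and are fetched over HTTPS; the policy name is hypothetical, and flux-system is excluded so the bootstrap-managed source is not blocked:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-flux-git-urls      # hypothetical name
spec:
  validationFailureAction: Enforce
  rules:
    - name: git-url-allowlist
      match:
        any:
          - resources:
              kinds: ["GitRepository"]
      exclude:
        any:
          - resources:
              namespaces: ["flux-system"]   # platform-owned sources are out of scope
      validate:
        message: "GitRepository url must be under the acme-platform GitHub org"
        pattern:
          spec:
            url: "https://github.com/acme-platform/*"
```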

Together these three controls – impersonation, no cross-namespace refs, and a source-URL allowlist – are what I mean by hard multi-tenancy. Any one alone leaves a gap.

7. Progressive delivery with Flagger

Flux applies the desired state; it does not do canaries. Flagger fills that gap and reads the same Deployment Flux reconciles, so the GitOps loop stays the source of truth while Flagger owns the rollout:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: team-a
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m

Once the image-update PR merges and Flux applies the new tag, the Deployment spec changes. Flagger detects the change, shifts traffic in 10% steps up to 50%, checks the success-rate metric every minute, and rolls back once five checks fall below 99%. From registry push to canary, the only human step is the PR review, and every stage is observable and reversible.

8. Drift detection, health, and alerts

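kustomize-controller corrects drift by default: every Kustomization re-applies its manifests on each interval, reverting manual kubectl edit. helm-controller does not; a HelmRelease only corrects in-cluster drift when spec.driftDetection.mode is enabled. Gate “done” on real health, not just “applied,” with health checks: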

spec:
  wait: true
  timeout: 5m
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: podinfo
      namespace: team-a

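wait: true holds the Kustomization in Reconciling until every object it applied reports healthy or the timeout expires, so a bad rollout surfaces as a failed reconciliation instead of a green-but-broken deploy. With wait enabled the healthChecks list is ignored; drop wait if you only want to gate on the listed objects. Route those failures to Slack with Provider and Alert: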

apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: slack
  namespace: team-a        # same namespace as the objects it reports on
spec:
  type: slack
  channel: platform-alerts
  secretRef:
    name: slack-url
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: tenant-failures
  namespace: team-a        # eventSources default to the Alert's own namespace
spec:
  providerRef:
    name: slack
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
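
For HelmRelease, helm-controller exposes drift handling through spec.driftDetection. A sketch of switching it on explicitly, assuming the helm.toolkit.fluxcd.io/v2 API (older Flux 2.x releases serve the same field under v2beta2) and a hypothetical HelmRepository named podinfo in the tenant namespace:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: podinfo
  namespace: team-a
spec:
  interval: 10m
  serviceAccountName: team-a        # same impersonation rule as Kustomizations
  chart:
    spec:
      chart: podinfo
      sourceRef:
        kind: HelmRepository
        name: podinfo               # hypothetical source in the same namespace
  driftDetection:
    mode: enabled                   # "warn" reports drift without correcting it
```

Starting with mode: warn for a while is a reasonable way to find out which controllers or operators legitimately mutate released objects before turning correction on.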

Verify

Confirm the pipeline end to end:

# Controllers and CRDs healthy
flux check

# Sources are pulling artifacts
flux get sources oci --all-namespaces
flux get sources git --all-namespaces

# Image scan picked the expected tag
flux get image policy podinfo -n team-a
# LATEST IMAGE should show ghcr.io/.../podinfo:main-...

# Automation committed back to Git
flux get image update team-a-images -n team-a

# Impersonation is in effect (should be the tenant SA, not flux-system)
kubectl get kustomization team-a -n team-a -o jsonpath='{.spec.serviceAccountName}'

# Cross-namespace refs are blocked
kubectl get deploy kustomize-controller -n flux-system \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep cross-namespace

# Force a reconcile and watch health gating
flux reconcile kustomization team-a -n team-a --with-source

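A correctly wired tenant shows Ready=True with a recent Applied revision, and the ImagePolicy reports a LATEST IMAGE. A cross-namespace sourceRef is rejected by the controller at reconcile time and leaves the Kustomization not-ready, while an OCIRepository URL outside the allowlist is rejected by Kyverno at admission.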
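
To check what the impersonated identity can actually do, rather than just that the field is set, you can ask the API server directly with a SubjectAccessReview. This is a sketch; the file name sar.yaml is arbitrary, and the result depends on the RBAC actually bound to the tenant SA.

```yaml
# sar.yaml: may the tenant SA create a ClusterRoleBinding?
# Submit with: kubectl create -f sar.yaml -o yaml
# With RBAC scoped as described above, status.allowed should come back false.
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: system:serviceaccount:team-a:team-a
  resourceAttributes:
    group: rbac.authorization.k8s.io
    resource: clusterrolebindings
    verb: create
```

Swapping the resourceAttributes for something the tenant should be allowed to do, such as creating deployments in namespace team-a, gives the positive half of the same check.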

Enterprise scenario

A fintech platform team ran 80+ product squads on shared clusters under PCI scope. The audit finding that triggered the work: a squad’s Kustomization had created a ClusterRoleBinding granting cluster-admin, because kustomize-controller applied with its own identity and nothing stopped it. Worse, two squads were reconciling from each other’s Git repos via cross-namespace sourceRef, so one team’s broken manifest had taken down another’s service.

They couldn’t move squads to separate clusters – the per-cluster control-plane and node overhead was rejected on cost. So they hardened the shared model: impersonation on every Kustomization via --default-service-account=fluxcd-restricted, --no-cross-namespace-refs=true on the kustomize, helm, image-automation, and notification controllers, and a Kyverno policy pinning each OCIRepository URL to that squad’s own path. The single highest-leverage change was the default service account, because it closed the escalation path even for Kustomizations that omitted serviceAccountName:

# entry under patches: in clusters/prod/flux-system/kustomization.yaml
- target:
    kind: Deployment
    name: kustomize-controller
  patch: |
    - op: add
      path: /spec/template/spec/containers/0/args/-
      value: --default-service-account=fluxcd-restricted

fluxcd-restricted was a ServiceAccount with no RoleBindings at all, so any Kustomization that forgot to impersonate a real tenant SA could create exactly nothing. When the team re-ran the pen test, the escalation path was closed and the cross-tenant blast radius was gone, and the cluster bill didn’t move.
