Scaling GitOps with Argo CD: App-of-Apps, ApplicationSets, and Multi-Cluster Fan-Out

Argo CD scales fine to a handful of apps. The trouble starts at the third cluster and the fiftieth app, when hand-authored Application manifests become the thing you spend your weekends reconciling. This is the topology, generator strategy, and guardrail set I reach for when a platform has to fan a few hundred workloads across many clusters without turning into a drift factory.

1. Repo topology: where environment config actually lives

The first decision dominates everything downstream. You are choosing between a monorepo and a polyrepo, and separately deciding where per-environment values live.

My default for a platform team is a small number of repos with clear ownership:

platform-gitops — Argo CD bootstrap, AppProject definitions, and the ApplicationSets that generate everything else. Owned by the platform team.
app-config — per-app, per-environment overlays (Kustomize) or values files (Helm). Owned by app teams, gated by CODEOWNERS.
Application source repos — the actual app code and its Helm chart or base Kustomize. Owned by the service team.

The non-negotiable rule: rendered desired state is keyed by (cluster, environment, app) and lives in Git, never in cluster annotations. A common layout in app-config:

app-config/
  apps/
    checkout/
      base/                 # kustomization.yaml + manifests, or a Helm chart ref
      overlays/
        dev/
        staging/
        prod-eu/
        prod-us/
    inventory/
      base/
      overlays/
        ...

Monorepo vs polyrepo is less about scale and more about blast radius and review ownership. A monorepo gives you atomic cross-cutting changes and one place to grep; a polyrepo gives you hard RBAC and per-team CI. Pick monorepo unless your org chart forces isolation, and use directory-scoped CODEOWNERS to recover most of the isolation benefit.

2. The app-of-apps pattern, and exactly where it breaks

App-of-apps is one parent Application whose source is a directory of child Application manifests. Argo CD syncs the parent, the children appear, and they sync their own targets.

# platform-gitops/bootstrap/root-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://github.com/acme/platform-gitops.git
    targetRevision: main
    path: apps            # a directory full of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

This is great until you are copy-pasting child manifests. The pattern breaks down when:

You add a cluster and have to hand-write N new child apps.
The only difference between children is a cluster name and a values file — pure boilerplate.
You want a child app to appear only on clusters with a given label.

That boilerplate is precisely what ApplicationSet exists to eliminate. Treat app-of-apps as the bootstrap mechanism (one root app, committed once) and let ApplicationSets generate the leaves.

3. ApplicationSet generators in depth

An ApplicationSet is a controller-managed template plus one or more generators that produce parameters. The controller renders one Application per generated parameter set. Here are the four I use most.

Git generator (directories)

Generate one app per directory found in a repo. Add a directory under apps/, get an app for free.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: tenant-apps
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
    - git:
        repoURL: https://github.com/acme/app-config.git
        revision: main
        directories:
          - path: apps/*
  template:
    metadata:
      name: '{{.path.basename}}'
    spec:
      project: tenants
      source:
        repoURL: https://github.com/acme/app-config.git
        targetRevision: main
        path: '{{.path.path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{.path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

Enable goTemplate: true on every new ApplicationSet. The legacy fasttemplate syntax cannot do conditionals or safe nested lookups, and missingkey=error turns a typo into a render failure instead of a field that silently renders as the literal string <no value>.

Cluster generator

Generate one app per registered cluster, optionally filtered by label. This is the heart of multi-cluster fan-out. Argo CD stores each cluster as a Secret labeled argocd.argoproj.io/secret-type: cluster; you add your own labels there.

generators:
  - clusters:
      selector:
        matchLabels:
          environment: prod
          region: eu

Inside the template you reference {{.name}}, {{.server}}, and any label as {{index .metadata.labels "region"}} (with goTemplate). Label your clusters once at registration and the selector does the routing.
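For illustration, a declaratively registered cluster might look like the sketch below. The Secret name, cluster name, server URL, and credential placeholders are all hypothetical; only the secret-type label and the name/server/config keys are what Argo CD expects.

```yaml
# Hypothetical cluster registration; every value here is a placeholder.
apiVersion: v1
kind: Secret
metadata:
  name: cluster-prod-eu-1
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster   # marks this Secret as a cluster
    environment: prod                          # matched by the selector above
    region: eu
type: Opaque
stringData:
  name: prod-eu-1                              # becomes {{.name}} in templates
  server: https://prod-eu-1.example.com        # becomes {{.server}} in templates
  config: |
    {
      "bearerToken": "<service-account-token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded-ca-certificate>"
      }
    }
```

Keeping these Secrets in Git (sealed or externally sourced) means the labels that drive routing go through review like everything else.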

Matrix generator

The workhorse for “every app on every matching cluster.” A matrix takes the Cartesian product of two child generators — typically git (apps) crossed with clusters.

generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/acme/app-config.git
            revision: main
            directories:
              - path: apps/*
        - clusters:
            selector:
              matchLabels:
                environment: prod
  template:
    metadata:
      name: '{{.path.basename}}-{{.name}}'
    spec:
      project: tenants
      source:
        repoURL: https://github.com/acme/app-config.git
        targetRevision: main
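        # Resolves to overlays/prod-eu or overlays/prod-us, matching the app-config layout.
        path: '{{.path.path}}/overlays/{{index .metadata.labels "environment"}}-{{index .metadata.labels "region"}}'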
      destination:
        server: '{{.server}}'
        namespace: '{{.path.basename}}'

One ApplicationSet, every prod cluster, every app, each pointed at its environment overlay. Add a cluster: apps appear. Add an app directory: it lands on all matching clusters. That is the whole point.

Pull-request generator

Spin up ephemeral preview environments per open PR, and let them be garbage-collected when the PR closes. Combine with a requeueAfterSeconds poll or a webhook.

generators:
  - pullRequest:
      github:
        owner: acme
        repo: checkout
        tokenRef:
          secretName: github-token
          key: token
        labels:
          - preview
      requeueAfterSeconds: 120
  template:
    metadata:
      name: 'checkout-pr-{{.number}}'
    spec:
      source:
        targetRevision: '{{.head_sha}}'
      # ...

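Keep spec.syncPolicy.preserveResourcesOnDeletion at false (the default) so that closing the PR deletes the generated Application and, through its finalizer, the resources it deployed. A namespace that exists only because of the CreateNamespace=true sync option is not one of those managed resources, so ship a Namespace manifest in the preview source if the namespace itself should disappear.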
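As a rough sketch of where that setting sits: preserveResourcesOnDeletion belongs to the ApplicationSet’s own spec.syncPolicy, not to the syncPolicy inside the template. The per-PR destination namespace below is a hypothetical choice for illustration.

```yaml
# Fragment of the preview ApplicationSet spec (illustrative).
spec:
  syncPolicy:
    preserveResourcesOnDeletion: false         # ApplicationSet-level setting; false is the default
  template:
    metadata:
      name: 'checkout-pr-{{.number}}'
    spec:
      destination:
        server: https://kubernetes.default.svc
        namespace: 'checkout-pr-{{.number}}'   # hypothetical per-PR namespace
      syncPolicy:                              # Application-level policy for each preview app
        automated:
          prune: true
```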

4. Templating overlays without duplication

The fastest way to ruin a GitOps repo is to copy a 200-line values file four times. Two clean approaches:

Helm value layering. Keep one base values.yaml plus thin per-environment files, and let Argo CD apply them in order (later wins).

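# The chart and the per-environment values live in different repos, and a
# relative path cannot leave the chart's repo. Use a multi-source Application:
# attach app-config as a ref and address its files with $values.
sources:
  - repoURL: https://github.com/acme/checkout.git
    targetRevision: main
    path: charts/checkout
    helm:
      valueFiles:
        - values.yaml
        - '$values/apps/checkout/overlays/{{.env}}/values.yaml'
      parameters:
        - name: image.tag
          value: '{{.image_tag}}'
  - repoURL: https://github.com/acme/app-config.git
    targetRevision: main
    ref: values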

Kustomize overlays. A base/ with shared manifests and overlays that only express the delta — replica counts, resource limits, ingress hosts — via patches and images: tags. The overlay should be tens of lines, not hundreds. If an overlay starts to look like a full copy of the base, your base is under-parameterized.
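A minimal sketch of such an overlay, assuming the base defines a Deployment named checkout that already sets replicas and a memory limit. The image name, tag, and numbers are hypothetical.

```yaml
# app-config/apps/checkout/overlays/prod-eu/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: ghcr.io/acme/checkout          # hypothetical image name
    newTag: "1.42.0"                     # hypothetical tag
patches:
  - target:
      kind: Deployment
      name: checkout
    patch: |-
      # JSON 6902 "replace" requires the path to exist in the base already.
      - op: replace
        path: /spec/replicas
        value: 6
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 1Gi
```

Everything a reviewer needs to judge the prod-eu delta fits on one screen, which is the property worth protecting.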

Do not mix both engines for the same app. Pick Helm or Kustomize per app and keep the override surface as small as possible. The override file is the diff a reviewer reads to approve a prod change; keep it readable.

5. Sync waves and hooks for ordered, stateful workloads

Argo CD applies resources in waves. Lower wave numbers go first, and Argo CD waits for each wave to become healthy before starting the next. This is how you sequence a database ahead of the app that depends on it.

metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # CRDs, namespaces, operators first

A pragmatic ordering:

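Wave	Resources
-2	CRDs, namespaces
-1	Operators, secrets/config, PersistentVolumeClaims
0	StatefulSets (databases, brokers)
1	Schema migration `Job` (Sync hook)
2	Deployments, Services
3	Ingress
PostSync	Smoke-test `Job` (PostSync hook, runs once every wave above is healthy)

Hooks run Jobs at defined points in the sync. The phases execute in order (PreSync, Sync, PostSync), and wave numbers only order resources within a phase. A PreSync hook therefore runs before wave -2 of the Sync phase whatever wave it carries, which is too early for a migration against a database that this same Application creates in wave 0. Declaring the migration as a Sync hook in wave 1 places it after the database; PreSync remains the right phase when the database is managed outside this Application.

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/sync-wave: "1"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded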

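Waves order resources within a single sync of one Application. They do not order separate Applications. For cross-app ordering (the operator app must be healthy before tenant apps sync), put sync-wave annotations on the child Application objects themselves in the app-of-apps directory, or split the work into stacked ApplicationSets and gate on health. The first option only waits if the parent can assess the health of an Application resource; that built-in check was removed in Argo CD 1.8 and has to be restored as a custom health check in argocd-cm. Do not assume two Applications honor each other’s wave numbers; they do not.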
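A sketch of the app-of-apps variant, assuming Argo CD 1.8 or later. The Lua snippet mirrors the health check suggested in the Argo CD health documentation; the child Application name is hypothetical.

```yaml
# Restore health assessment for Application resources so the parent's
# sync waves actually wait for a child to become Healthy.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  resource.customizations.health.argoproj.io_Application: |
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil then
      if obj.status.health ~= nil then
        hs.status = obj.status.health.status
        if obj.status.health.message ~= nil then
          hs.message = obj.status.health.message
        end
      end
    end
    return hs
---
# Child Application in the app-of-apps directory (metadata fragment only).
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: operators                          # hypothetical child app
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"     # synced, and waited on, before wave 0 children
```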

6. Detecting and remediating drift

Drift is any divergence between Git and the live cluster. Three controls govern how Argo CD responds.

selfHeal — when the live state drifts from Git (someone ran kubectl edit), Argo CD reverts it. Turn this on for prod.
prune — when a resource is deleted from Git, Argo CD deletes it from the cluster. Without prune you accumulate orphans.
ignoreDifferences — tells Argo CD to stop fighting controllers that legitimately mutate the spec (HPA editing replicas, a webhook injecting a sidecar, a CA injecting a bundle).

spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - PruneLast=true                  # prune after everything else applies
      - ServerSideApply=true            # cleaner field ownership on large CRDs
      - RespectIgnoreDifferences=true   # leave ignored fields untouched when a sync runs
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas          # let the HPA own replica count
    - group: ""
      kind: Secret
      jqPathExpressions:
        - '.data["ca.crt"]'       # ignore a CA-injected field

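selfHeal without scoped ignoreDifferences will war with your HPA and flap forever: Argo CD reverts the replica count, the HPA re-scales, repeat. Tune ignoreDifferences first, then enable self-heal. ignoreDifferences only affects the diff, so a sync triggered by any other change would still write the Git value of an ignored field; the RespectIgnoreDifferences=true sync option makes the sync leave those fields alone. PruneLast=true is cheap insurance against an ordering bug deleting a still-referenced resource mid-sync.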

7. Securing the control plane: projects and RBAC

Argo CD’s AppProject is the multi-tenancy boundary. A project restricts which source repos its Applications may pull from, which destination clusters and namespaces they may deploy to, and which resource kinds they may create.

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/acme/app-config.git
    - https://github.com/acme/checkout.git
  destinations:
    - server: https://prod-eu.example.com
      namespace: 'payments-*'
  clusterResourceWhitelist: []          # deny all cluster-scoped resources
  namespaceResourceBlacklist:
    - group: ""
      kind: ResourceQuota
  roles:
    - name: deployer
      policies:
        - p, proj:team-payments:deployer, applications, sync, team-payments/*, allow
      groups:
        - acme:payments-engineers

Layer RBAC on top via argocd-rbac-cm, mapping your SSO groups to actions. A workable model:

Platform team: role:admin.
App teams: project-scoped sync/get on their project only; no create/delete on Applications (those come from ApplicationSets the platform owns).
Everyone else: role:readonly.
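One way this model could be expressed in argocd-rbac-cm is sketched below. The platform group name and the role:payments-deployer role are hypothetical; the payments group and project name come from the AppProject above.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly            # everyone else
  policy.csv: |
    # Platform team: full admin (group name is hypothetical)
    g, acme:platform-engineers, role:admin
    # Payments engineers: read and sync inside their project, nothing else
    p, role:payments-deployer, applications, get, team-payments/*, allow
    p, role:payments-deployer, applications, sync, team-payments/*, allow
    g, acme:payments-engineers, role:payments-deployer
```

Because no create or delete permission is granted, app teams can operate what the ApplicationSets generate but cannot add Applications of their own.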

The single most effective guardrail is an empty clusterResourceWhitelist plus a namespace-scoped destinations glob per team. It means a compromised or fat-fingered app repo cannot create a ClusterRole, escape its namespace, or deploy to another team’s cluster — the project rejects the sync before anything is applied.

8. Promotion across clusters: config repos vs rendered manifests

Two schools of thought for moving a known-good version from staging to prod.

Config repo (templated). Promotion is a one-line change — bump image_tag in the prod overlay — and Argo CD renders Helm/Kustomize at sync time. Simple, but the cluster runs whatever the templating engine produces now, which can differ from what you reviewed if a chart dependency moved.

Rendered-manifests pattern. CI renders the chart to plain YAML and commits the fully-expanded manifests to an environment branch or directory. Argo CD points at raw YAML, so what you see in Git is byte-for-byte what runs. Promotion becomes a Git diff/merge between environment branches — auditable and reproducible, at the cost of a noisier repo and a rendering step in CI.

For regulated or large fleets I lean rendered-manifests: the diff a reviewer approves is the exact thing that hits prod, and rollbacks are a revert. For smaller teams, a config repo with pinned chart versions and targetRevision set to a tag (never a moving branch) is enough.
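The two styles differ mainly in what the Application source points at. The fragments below are a sketch; the tag name, environment branch, and rendered/ directory are hypothetical conventions, not something Argo CD prescribes.

```yaml
# Config-repo style: pin the Git revision to an immutable tag and promote by changing it.
spec:
  source:
    repoURL: https://github.com/acme/app-config.git
    targetRevision: checkout-prod-2024.05.1    # hypothetical tag
    path: apps/checkout/overlays/prod-eu
---
# Rendered-manifests style: CI commits plain YAML to an environment branch,
# and Argo CD applies the directory as-is with no templating at sync time.
spec:
  source:
    repoURL: https://github.com/acme/app-config.git
    targetRevision: env/prod-eu                # hypothetical environment branch
    path: rendered/checkout                    # hypothetical output directory
    directory:
      recurse: true
```

In the rendered style the branch does move, but only through reviewed merges of already-expanded YAML, which is what makes the promotion diff trustworthy.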

Enterprise scenario

A fintech platform team I worked with ran one matrix ApplicationSet (git apps × environment: prod clusters) fanning ~180 apps across 14 EKS clusters in two regions. They added a third region by registering five new cluster Secrets at once. The controller dutifully rendered ~900 new Applications, every one flipped to OutOfSync, and the controller hammered every source repo on the same reconcile tick. GitHub returned 403 secondary rate limit, the argocd-repo-server cache thrashed, and sync latency for the existing fleet blew past 20 minutes. The root cause was unbounded fan-out with no rollout gating: ApplicationSet’s default behavior applies all generated changes simultaneously.

The fix was Progressive Syncs plus concurrency limits. We enabled the rollout strategy so new clusters drained in controlled steps instead of all at once, and capped repo-server parallelism.

spec:
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: region
              operator: In
              values: [ap-south-1]   # one new region at a time
        - matchExpressions:
            - key: region
              operator: In
              values: [eu-west-1, us-east-1]

We also capped manifest generation on the repo server at 8 concurrent requests (the --parallelismlimit flag on argocd-repo-server, or reposerver.parallelism.limit in argocd-cmd-params-cm) and switched from polling to webhooks so reconciles spread out. Bringing a region online went from a 900-app thundering herd to a gated, observable rollout. The lesson: at fleet scale a matrix generator is a loaded gun, so gate the rollout before you pull the trigger on a new cluster label.
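Two details are easy to miss when reproducing the RollingSync setup. Progressive Syncs has to be switched on for the ApplicationSet controller, and the matchExpressions in each step select on labels of the generated Applications, not on the cluster Secrets. A sketch, assuming the matrix template shown earlier and a region label on every cluster Secret:

```yaml
# 1. Enable the feature for the ApplicationSet controller (restart it afterwards).
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  applicationsetcontroller.enable.progressive.syncs: "true"
---
# 2. Fragment of the ApplicationSet: copy the cluster's region label onto each
#    generated Application so the RollingSync steps have something to match.
spec:
  template:
    metadata:
      name: '{{.path.basename}}-{{.name}}'
      labels:
        region: '{{index .metadata.labels "region"}}'
```

As far as I understand the feature, the controller takes over triggering syncs for Applications under RollingSync and overrides any automated sync policy on them, so check the behavior in your Argo CD version before enabling it on a production set.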

Verify

Confirm the system behaves as designed before trusting it.

# ApplicationSets generated the expected Applications
kubectl get applicationset -n argocd
kubectl get applications -n argocd -o wide

# A specific app is Synced and Healthy on the right cluster
argocd app get checkout-prod-eu

# Diff live state against Git without syncing
argocd app diff checkout-prod-eu

# Confirm clusters are registered and labeled
argocd cluster list
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster \
  -o custom-columns=NAME:.metadata.name,LABELS:.metadata.labels

# Prove self-heal works: mutate live state, then wait until the app is Synced and Healthy again
kubectl scale deploy/checkout -n payments --replicas=99
argocd app wait checkout-prod-eu --sync --health

A correctly configured platform shows every Application Synced/Healthy, the manual scale reverts within the reconciliation window (unless replicas is in ignoreDifferences), and adding a cluster Secret with the right labels makes apps appear with no manual manifest edits.

Pitfalls

Self-heal flapping. Enabling selfHeal before scoping ignoreDifferences makes Argo CD fight your HPA and admission webhooks indefinitely. Tune ignores first.
Waves do not cross Applications. sync-wave orders resources inside one Application’s sync only. Sequence Applications via the app-of-apps layer, not by hoping wave numbers compose.
Moving targetRevision in prod. Pointing prod at main means a merge anywhere can roll out unreviewed. Pin to a tag or SHA and promote by changing the pin.
Generator typos fail silently. Without missingkey=error, a misspelled parameter renders as the literal string <no value> and produces a broken-but-accepted Application. Always set it.
Orphaned resources. Skipping prune leaves deleted-from-Git resources running in the cluster forever; Git stops being the source of truth.

Next steps

Wire ApplicationSet and webhook events into notifications (argocd-notifications) so generation failures page someone, add Argo CD’s Prometheus metrics to your dashboards to watch sync latency as the fleet grows, and adopt Progressive Syncs on critical ApplicationSets to roll changes cluster-by-cluster instead of all at once.