Argo CD scales fine to a handful of apps. The trouble starts at the third cluster and the fiftieth app, when hand-authored Application manifests become the thing you spend your weekends reconciling. This is the topology, generator strategy, and guardrail set I reach for when a platform has to fan a few hundred workloads across many clusters without turning into a drift factory.
1. Repo topology: where environment config actually lives
The first decision dominates everything downstream. You are choosing between a monorepo and a polyrepo, and separately deciding where per-environment values live.
My default for a platform team is a small number of repos with clear ownership:
- platform-gitops: Argo CD bootstrap, AppProject definitions, and the ApplicationSets that generate everything else. Owned by the platform team.
- app-config: per-app, per-environment overlays (Kustomize) or values files (Helm). Owned by app teams, gated by CODEOWNERS.
- Application source repos: the actual app code and its Helm chart or base Kustomize. Owned by the service team.
The non-negotiable rule: rendered desired state is keyed by (cluster, environment, app) and lives in Git, never in cluster annotations. A common layout in app-config:
app-config/
  apps/
    checkout/
      base/          # kustomization.yaml + manifests, or a Helm chart ref
      overlays/
        dev/
        staging/
        prod-eu/
        prod-us/
    inventory/
      base/
      overlays/
    ...
Monorepo vs polyrepo is less about scale and more about blast radius and review ownership. A monorepo gives you atomic cross-cutting changes and one place to grep; a polyrepo gives you hard RBAC and per-team CI. Pick monorepo unless your org chart forces isolation, and use directory-scoped CODEOWNERS to recover most of the isolation benefit.
2. The app-of-apps pattern, and exactly where it breaks
App-of-apps is one parent Application whose source is a directory of child Application manifests. Argo CD syncs the parent, the children appear, and they sync their own targets.
# platform-gitops/bootstrap/root-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://github.com/acme/platform-gitops.git
    targetRevision: main
    path: apps   # a directory full of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
This is great until you are copy-pasting child manifests. The pattern breaks down when:
- You add a cluster and have to hand-write N new child apps.
- The only difference between children is a cluster name and a values file — pure boilerplate.
- You want a child app to appear only on clusters with a given label.
That boilerplate is precisely what ApplicationSet exists to eliminate. Treat app-of-apps as the bootstrap mechanism (one root app, committed once) and let ApplicationSets generate the leaves.
3. ApplicationSet generators in depth
An ApplicationSet is a controller-managed template plus one or more generators that produce parameters. The controller renders one Application per generated parameter set. Here are the four I use most.
Git generator (directories)
Generate one app per directory found in a repo. Add a directory under apps/, get an app for free.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: tenant-apps
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
    - git:
        repoURL: https://github.com/acme/app-config.git
        revision: main
        directories:
          - path: apps/*
  template:
    metadata:
      name: '{{.path.basename}}'
    spec:
      project: tenants
      source:
        repoURL: https://github.com/acme/app-config.git
        targetRevision: main
        path: '{{.path.path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{.path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
Enable goTemplate: true on every new ApplicationSet. The legacy fasttemplate syntax cannot do conditionals or safe nested lookups, and missingkey=error turns a typo into a render failure instead of a silently wrong field.
Cluster generator
Generate one app per registered cluster, optionally filtered by label. This is the heart of multi-cluster fan-out. Argo CD stores each cluster as a Secret labeled argocd.argoproj.io/secret-type: cluster; you add your own labels there.
generators:
  - clusters:
      selector:
        matchLabels:
          environment: prod
          region: eu
Inside the template you reference {{.name}}, {{.server}}, and any label as {{index .metadata.labels "region"}} (with goTemplate). Label your clusters once at registration and the selector does the routing.
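As a rough sketch of what that registration looks like when done declaratively, a cluster Secret carrying the routing labels might resemble the following. The cluster name prod-eu, the server URL, and the credential placeholders are illustrative assumptions, not values from a real fleet.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prod-eu                                # hypothetical cluster name
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster    # marks the Secret as a cluster
    environment: prod                          # custom labels the selector matches on
    region: eu
type: Opaque
stringData:
  name: prod-eu                                # exposed to templates as the cluster name
  server: https://prod-eu.example.com          # exposed to templates as the server URL
  config: |
    {
      "bearerToken": "<token>",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "<base64-encoded CA certificate>"
      }
    }
```

Because the cluster generator reads these Secrets directly, changing a label here is enough to move a cluster in or out of a selector.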
Matrix generator
The workhorse for “every app on every matching cluster.” A matrix takes the Cartesian product of two child generators — typically git (apps) crossed with clusters.
generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/acme/app-config.git
            revision: main
            directories:
              - path: apps/*
        - clusters:
            selector:
              matchLabels:
                environment: prod
template:
  metadata:
    name: '{{.path.basename}}-{{.name}}'
  spec:
    project: tenants
    source:
      repoURL: https://github.com/acme/app-config.git
      targetRevision: main
      # the overlay directory name must match the cluster's environment label value
      path: '{{.path.path}}/overlays/{{index .metadata.labels "environment"}}'
    destination:
      server: '{{.server}}'
      namespace: '{{.path.basename}}'
One ApplicationSet, every prod cluster, every app, each pointed at its environment overlay. Add a cluster: apps appear. Add an app directory: it lands on all matching clusters. That is the whole point.
Pull-request generator
Spin up ephemeral preview environments per open PR, and let them be garbage-collected when the PR closes. Combine with a requeueAfterSeconds poll or a webhook.
generators:
  - pullRequest:
      github:
        owner: acme
        repo: checkout
        tokenRef:
          secretName: github-token
          key: token
        labels:
          - preview
      requeueAfterSeconds: 120
template:
  metadata:
    name: 'checkout-pr-{{.number}}'
  spec:
    source:
      targetRevision: '{{.head_sha}}'
      # ...
Pair this with spec.syncPolicy.preserveResourcesOnDeletion: false so closing the PR tears the namespace down.
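A minimal sketch of where that setting sits, assuming each preview lands in its own per-PR namespace (the ApplicationSet name and the namespace naming scheme are assumptions). preserveResourcesOnDeletion belongs to the ApplicationSet's own syncPolicy, not to the template's. Argo CD only deletes resources the Application manages, so for the namespace itself to disappear it most likely needs to be part of the rendered manifests.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: checkout-previews                      # hypothetical name
  namespace: argocd
spec:
  goTemplate: true
  syncPolicy:
    preserveResourcesOnDeletion: false         # the default, shown for clarity
  # generators: the pullRequest generator shown earlier goes here
  template:
    metadata:
      name: 'checkout-pr-{{.number}}'
    spec:
      destination:
        server: https://kubernetes.default.svc
        namespace: 'checkout-pr-{{.number}}'   # assumed per-PR namespace
      syncPolicy:
        automated:
          prune: true
```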
4. Templating overlays without duplication
The fastest way to ruin a GitOps repo is to copy a 200-line values file four times. Two clean approaches:
Helm value layering. Keep one base values.yaml plus thin per-environment files, and let Argo CD apply them in order (later wins).
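# Value files must live in a repo Argo CD has checked out for this Application.
# A relative path cannot reach a second repo, so reference it as a named source.
sources:
  - repoURL: https://github.com/acme/checkout.git
    targetRevision: main
    path: charts/checkout
    helm:
      valueFiles:
        - values.yaml
        - $values/apps/checkout/overlays/{{.env}}/values.yaml
      parameters:
        - name: image.tag
          value: '{{.image_tag}}'
  - repoURL: https://github.com/acme/app-config.git
    targetRevision: main
    ref: values   # exposes this repo's root as $values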
Kustomize overlays. A base/ with shared manifests and overlays that only express the delta — replica counts, resource limits, ingress hosts — via patches and images: tags. The overlay should be tens of lines, not hundreds. If an overlay starts to look like a full copy of the base, your base is under-parameterized.
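To make "tens of lines" concrete, here is a sketch of what a prod-eu overlay for checkout might contain. The image name, tag, replica count, and ingress host are all hypothetical, and it assumes the base defines a Deployment and an Ingress both named checkout.

```yaml
# app-config/apps/checkout/overlays/prod-eu/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: ghcr.io/acme/checkout       # hypothetical image name
    newTag: "1.42.0"                  # hypothetical version
replicas:
  - name: checkout                    # assumed Deployment name in the base
    count: 6
patches:
  - target:
      kind: Ingress
      name: checkout                  # assumed Ingress name in the base
    patch: |-
      - op: replace
        path: /spec/rules/0/host
        value: checkout.eu.example.com
```

Everything a reviewer needs to approve is visible in one screen: the version, the scale, and the hostname.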
Do not mix both engines for the same app. Pick Helm or Kustomize per app and keep the override surface as small as possible. The override file is the diff a reviewer reads to approve a prod change; keep it readable.
5. Sync waves and hooks for ordered, stateful workloads
Argo CD applies resources in waves. Lower wave numbers go first, and Argo CD waits for each wave to become healthy before starting the next. This is how you sequence a database ahead of the app that depends on it.
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # CRDs, namespaces, operators first
A pragmatic ordering:
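| Wave | Resources |
|---|---|
| -2 | CRDs, namespaces |
| -1 | Operators, secrets/config, PersistentVolumeClaims |
| 0 | StatefulSets (databases, brokers) |
| 1 | Schema migration Job (Sync hook) |
| 2 | Deployments, Services |
| 3 | Ingress, smoke-test Job (PostSync hook) |
Hooks run Jobs at defined phases of the sync. Phases run in order (PreSync, Sync, PostSync), and wave numbers only order resources within a phase. A PreSync hook therefore runs before every wave in the table, including the database, so a migration that depends on the wave 0 StatefulSet has to be a Sync-phase hook with its own wave:
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/sync-wave: "1"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded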
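Waves order resources within a single sync of one Application. They do not order separate Applications. For cross-app ordering (operator app must be healthy before tenant apps sync), use sync waves on the child Application objects themselves in the app-of-apps directory, or split into stacked ApplicationSets and gate on health. The parent only waits on a child if Argo CD can assess Application health; since Argo CD 1.8 that health check is no longer built in and has to be restored through a resource.customizations entry in argocd-cm. Do not assume two Applications honor each other’s wave numbers: they do not.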
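A sketch of the app-of-apps variant, assuming a hypothetical operators child app that must go first. The first document puts a wave annotation on the child Application itself. The second restores the Application health assessment in argocd-cm, along the lines of the customization shown in the Argo CD documentation, so the parent has a health status to wait on.

```yaml
# platform-gitops/apps/operators.yaml (hypothetical child of the root app)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: operators
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # applied before wave 0 children
spec:
  project: platform
  source:
    repoURL: https://github.com/acme/platform-gitops.git
    targetRevision: main
    path: operators                      # hypothetical path
  destination:
    server: https://kubernetes.default.svc
    namespace: operators                 # hypothetical namespace
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Lua health check for Application objects
  resource.customizations.health.argoproj.io_Application: |
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil and obj.status.health ~= nil then
      hs.status = obj.status.health.status
      if obj.status.health.message ~= nil then
        hs.message = obj.status.health.message
      end
    end
    return hs
```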
6. Detecting and remediating drift
Drift is any divergence between Git and the live cluster. Three controls govern how Argo CD responds.
- selfHeal: when the live state drifts from Git (someone ran kubectl edit), Argo CD reverts it. Turn this on for prod.
- prune: when a resource is deleted from Git, Argo CD deletes it from the cluster. Without prune you accumulate orphans.
- ignoreDifferences: tells Argo CD to stop fighting controllers that legitimately mutate the spec (HPA editing replicas, a webhook injecting a sidecar, a CA injecting a bundle).
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - PruneLast=true                 # prune after everything else applies
      - ServerSideApply=true           # cleaner field ownership on large CRDs
      - RespectIgnoreDifferences=true  # do not overwrite ignored fields during a sync
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas               # let the HPA own replica count
    - group: ""
      kind: Secret
      jqPathExpressions:
        - '.data["ca.crt"]'            # ignore a CA-injected field
selfHeal without scoped ignoreDifferences will war with your HPA and flap forever: Argo CD reverts the replica count, the HPA re-scales, repeat. Tune ignoreDifferences first, then enable self-heal. PruneLast=true is cheap insurance against an ordering bug deleting a still-referenced resource mid-sync.
7. Securing the control plane: projects and RBAC
Argo CD’s AppProject is the multi-tenancy boundary. A project restricts which source repos its Applications may pull from, which destination clusters and namespaces they may deploy to, and which resource kinds they may create.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/acme/app-config.git
    - https://github.com/acme/checkout.git
  destinations:
    - server: https://prod-eu.example.com
      namespace: 'payments-*'
  clusterResourceWhitelist: []   # deny all cluster-scoped resources
  namespaceResourceBlacklist:
    - group: ""
      kind: ResourceQuota
  roles:
    - name: deployer
      policies:
        - p, proj:team-payments:deployer, applications, sync, team-payments/*, allow
      groups:
        - acme:payments-engineers
Layer RBAC on top via argocd-rbac-cm, mapping your SSO groups to actions. A workable model:
- Platform team: role:admin.
- App teams: project-scoped sync/get on their project only; no create/delete on Applications (those come from ApplicationSets the platform owns).
- Everyone else: role:readonly.
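One way that model could be written down in argocd-rbac-cm is sketched below. The acme:platform-engineers group and the payments-dev role name are assumptions for illustration; acme:payments-engineers is the group from the AppProject above.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly        # everyone else
  policy.csv: |
    # Platform team: full admin (group name is hypothetical)
    g, acme:platform-engineers, role:admin
    # App team: get and sync on their own project only, no create/delete
    p, role:payments-dev, applications, get, team-payments/*, allow
    p, role:payments-dev, applications, sync, team-payments/*, allow
    g, acme:payments-engineers, role:payments-dev
```

Leaving create and delete out of the app-team role is what keeps Application lifecycle in the hands of the platform-owned ApplicationSets.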
The single most effective guardrail is an empty clusterResourceWhitelist plus a namespace-scoped destinations glob per team. It means a compromised or fat-fingered app repo cannot create a ClusterRole, escape its namespace, or deploy to another team’s cluster: the project rejects the sync before anything is applied.
8. Promotion across clusters: config repos vs rendered manifests
Two schools of thought for moving a known-good version from staging to prod.
Config repo (templated). Promotion is a one-line change — bump image_tag in the prod overlay — and Argo CD renders Helm/Kustomize at sync time. Simple, but the cluster runs whatever the templating engine produces now, which can differ from what you reviewed if a chart dependency moved.
Rendered-manifests pattern. CI renders the chart to plain YAML and commits the fully-expanded manifests to an environment branch or directory. Argo CD points at raw YAML, so what you see in Git is byte-for-byte what runs. Promotion becomes a Git diff/merge between environment branches — auditable and reproducible, at the cost of a noisier repo and a rendering step in CI.
For regulated or large fleets I lean rendered-manifests: the diff a reviewer approves is the exact thing that hits prod, and rollbacks are a revert. For smaller teams, a config repo with pinned chart versions and targetRevision set to a tag (never a moving branch) is enough.
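The two styles differ mainly in what the Application source points at. The fragments below are a sketch only: the tag, the environment branch name, and the rendered/ directory are hypothetical and would follow whatever convention your CI uses.

```yaml
# Config-repo style: chart rendered at sync time, revision pinned to a tag
source:
  repoURL: https://github.com/acme/checkout.git
  targetRevision: v1.42.0            # hypothetical release tag
  path: charts/checkout
---
# Rendered-manifests style: plain YAML committed by CI, no templating at sync
source:
  repoURL: https://github.com/acme/app-config.git
  targetRevision: env/prod-eu        # hypothetical environment branch
  path: rendered/checkout            # hypothetical directory of rendered YAML
  directory:
    recurse: true
```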
Enterprise scenario
A fintech platform team I worked with ran one matrix ApplicationSet (git apps × environment: prod clusters) fanning ~180 apps across 14 EKS clusters in two regions. They added a third region by registering five new cluster Secrets at once. The controller dutifully rendered ~900 new Applications, every one flipped to OutOfSync, and the controller hammered every source repo on the same reconcile tick. GitHub returned 403 secondary rate limit, the argocd-repo-server cache thrashed, and sync latency for the existing fleet blew past 20 minutes. The root cause was unbounded fan-out with no rollout gating: ApplicationSet’s default behavior applies all generated changes simultaneously.
The fix was Progressive Syncs plus concurrency limits. We enabled the rollout strategy so new clusters drained in controlled steps instead of all at once, and capped repo-server parallelism.
spec:
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:
            - key: region
              operator: In
              values: [ap-south-1]   # one new region at a time
        - matchExpressions:
            - key: region
              operator: In
              values: [eu-west-1, us-east-1]
We also capped the repo server’s concurrent manifest generation at 8 (the --parallelismlimit flag on argocd-repo-server, or reposerver.parallelism.limit in argocd-cmd-params-cm) and switched from polling to a webhook so reconciles spread out. Bringing a region online went from a 900-app thundering herd to a gated, observable rollout. The lesson: at fleet scale a matrix generator is a loaded gun, so gate the rollout before you pull the trigger on a new cluster label.
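For reference, a sketch of the supporting configuration, under two assumptions worth checking against your Argo CD version: Progressive Syncs may need to be switched on explicitly for the ApplicationSet controller, and RollingSync steps match labels on the generated Applications rather than on the cluster Secrets, so the template has to copy the region label across.

```yaml
# argocd-cmd-params-cm: controller and repo-server tuning
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  reposerver.parallelism.limit: "8"
  applicationsetcontroller.enable.progressive.syncs: "true"
---
# ApplicationSet template fragment: stamp each Application with its region
template:
  metadata:
    name: '{{.path.basename}}-{{.name}}'
    labels:
      region: '{{index .metadata.labels "region"}}'
```

The components read these settings at startup, so expect to restart them after changing the ConfigMap.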
Verify
Confirm the system behaves as designed before trusting it.
# ApplicationSets generated the expected Applications
kubectl get applicationset -n argocd
kubectl get applications -n argocd -o wide

# A specific app is Synced and Healthy on the right cluster
argocd app get checkout-prod-eu

# Diff live state against Git without syncing
argocd app diff checkout-prod-eu

# Confirm clusters are registered and labeled
argocd cluster list
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster \
  -o custom-columns=NAME:.metadata.name,LABELS:.metadata.labels

# Prove self-heal works: mutate live state, watch it revert
kubectl scale deploy/checkout -n payments --replicas=99
argocd app wait checkout-prod-eu --health
A correctly configured platform shows every Application Synced/Healthy, the manual scale reverts within the reconciliation window (unless replicas is in ignoreDifferences), and adding a cluster Secret with the right labels makes apps appear with no manual manifest edits.
Pitfalls
- Self-heal flapping. Enabling selfHeal before scoping ignoreDifferences makes Argo CD fight your HPA and admission webhooks indefinitely. Tune ignores first.
- Waves do not cross Applications. sync-wave orders resources inside one Application’s sync only. Sequence Applications via the app-of-apps layer, not by hoping wave numbers compose.
- Moving targetRevision in prod. Pointing prod at main means a merge anywhere can roll out unreviewed. Pin to a tag or SHA and promote by changing the pin.
- Generator typos fail silently. Without missingkey=error, a misspelled parameter renders a placeholder value instead of failing and produces a broken-but-accepted Application. Always set it.
- Orphaned resources. Skipping prune leaves deleted-from-Git resources running in the cluster forever; Git stops being the source of truth.
Next steps
Wire ApplicationSet and webhook events into notifications (argocd-notifications) so generation failures page someone, add Argo CD’s Prometheus metrics to your dashboards to watch sync latency as the fleet grows, and adopt Progressive Syncs on critical ApplicationSets to roll changes cluster-by-cluster instead of all at once.