Most “run microservices on Kubernetes” guides stop at kubectl apply against a managed cluster and call it production. The gap between that and an audited, multi-team, multi-AZ platform that an enterprise will actually trust with revenue traffic is enormous: who can pull which image, which pod can assume which IAM role, how 200 microservices talk to each other with mTLS and retries, how the cluster scales from 12 nodes at 3 a.m. to 140 nodes during a sale without paging anyone, and how every change reaches the cluster through a reviewed Git commit rather than someone’s laptop.
This article is a concrete, opinionated reference architecture for that platform on Amazon EKS, built from six load-bearing pieces: EKS as the compute substrate, ECR as the signed image supply chain, IRSA as the pod-to-AWS identity bridge, App Mesh for service-to-service traffic management, Karpenter for just-in-time node provisioning, and GitOps (Argo CD) as the only path into the cluster. It is designed to be cloned by a 30-engineer startup and to keep working when that startup becomes a 2,000-engineer org running 18 product teams on one platform.
The flow that makes this production-grade, end to end:
- A squad pushes to Git, the only way to change what runs. No `kubectl apply` from laptops.
- CI builds and signs the image, pushing it to ECR; only signed images are admitted.
- Argo CD continuously reconciles the cluster to the Git desired state (drift is auto-corrected).
- The Kubernetes API schedules workloads; admission control verifies image signatures and IRSA bindings.
- Karpenter provisions just-in-time EC2 capacity across three AZs and bin-packs to cut idle cost.
- Pods assume scoped AWS permissions through IRSA — no long-lived keys — while App Mesh enforces mTLS, retries and circuit-breaking on all east-west traffic.
The business scenario
Picture a mid-market commerce and logistics company — call the pattern “the platform team’s dilemma.” You’ve outgrown a monolith on a handful of EC2 instances. Forty-plus services now exist: checkout, pricing, inventory, fulfilment, notifications, a recommendations API, fraud scoring, and a long tail of internal tools. Three to twelve product squads ship them, each wanting to deploy several times a day without filing a ticket with a central ops team.
The pain that forces an architecture decision is rarely “we need Kubernetes.” It’s a cluster of concrete failures:
- Deploys are scary and serial. A bad config push takes down checkout because there’s no canary, no automatic rollback, and no mesh-level traffic shifting. Changes are applied imperatively, so nobody can answer “what is actually running in prod right now?” without SSH-ing in.
- Capacity is either wasteful or fragile. Static Auto Scaling Groups are sized for peak, so you pay for idle `c5.4xlarge` instances all night; or they’re sized for average and fall over during a flash sale. Bin-packing across instance types is done by hand.
- Identity is a mess of long-lived keys. Services hold AWS access keys in environment variables to reach S3, DynamoDB, SQS and Secrets Manager. Those keys leak, never rotate, and grant far more than any one service needs. (This is exactly the class of incident this org has been burned by before: leaked long-lived credentials in source control. Eliminating static keys is therefore a hard requirement, not a nice-to-have.)
- East-west traffic is a black box. When the pricing service gets slow, it silently takes checkout down with it. There are no per-route timeouts, no retries with budgets, no circuit breaking, and no consistent mTLS between services. Security can’t prove that service-to-service traffic is encrypted.
- Supply chain is unverified. Images are pulled `:latest` from a public registry, unsigned and unscanned. There’s no answer to “was this image built by our pipeline and not tampered with?”
The business goal is a self-service internal platform: squads own their services and deploy via Git; a small platform team owns the substrate, guardrails, and golden paths. The non-functional targets are typical for this size of company: 99.95% availability for tier-1 services, p99 latency budgets enforced at the mesh, RTO ≈ 30 minutes / RPO ≈ 5 minutes for the cluster control plane and workloads, costs that scale roughly linearly with traffic (not with peak), and an audit trail that satisfies SOC 2 and PCI-DSS scope for the checkout path. EKS plus the five focus components is precisely the combination that hits those targets without a 40-person platform org.
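To make the 99.95% availability target concrete, it can help to convert it into an error budget. The sketch below is plain arithmetic; the 30-day window and the function name are assumptions for illustration, not something the targets above specify.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


budget = downtime_budget_minutes(99.95)
print(f"99.95% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under those assumptions a tier-1 service has roughly 21.6 minutes of budget per 30 days, which is why automatic rollback matters more than fast manual response.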
Architecture overview
The end-to-end request path and the end-to-end change path are two different flows, and a good EKS platform treats them as equally first-class. Picture the diagram in two layers stacked on the same VPC.
The runtime (request) path — north-south then east-west. A client request hits Amazon Route 53, which resolves to an AWS WAF-protected Application Load Balancer (or an NLB for gRPC/TLS passthrough). The ALB is provisioned and reconciled by the AWS Load Balancer Controller running in-cluster, driven by Ingress/Gateway objects — so the edge is declared in Git, not clicked in the console. Traffic lands on the App Mesh ingress (Envoy) gateway, the single managed entry point into the mesh. From there every hop is east-west traffic inside the mesh: the ingress gateway routes to a VirtualService (say checkout), which resolves through a VirtualRouter to one or more VirtualNodes backed by Kubernetes Services and pods. Each pod runs an Envoy sidecar injected by the App Mesh controller; sidecars carry mTLS (certs from AWS Private CA via the cert-manager integration), per-route timeouts, retries with budgets, and outlier-detection circuit breaking. When checkout calls pricing, that call is another mesh hop with the same guarantees. Pods reach AWS data services — DynamoDB, Aurora, S3, SQS, Secrets Manager — using IRSA (IAM Roles for Service Accounts): the pod’s ServiceAccount is annotated with an IAM role ARN, the pod gets a projected OIDC token, and the AWS SDK exchanges it via STS for short-lived, scoped credentials. No static AWS keys exist anywhere in the cluster.
The compute substrate. All of this runs on an EKS cluster (Kubernetes control plane managed by AWS, spread across three AZs) with worker capacity supplied two ways. A tiny managed node group (2–3 on-demand nodes) hosts the things that must always be up and must not be churned by the autoscaler: Argo CD, the Karpenter controller itself, CoreDNS, and the App Mesh/LB/EBS controllers. Everything else — all application pods — runs on nodes that Karpenter provisions just-in-time. When pods go Pending, Karpenter reads their CPU/memory/affinity/topology requirements, picks the cheapest viable instance types (mixing Spot and On-Demand, Graviton and x86), launches the node in seconds, and consolidates (terminates and repacks) underused nodes minutes later. This is what makes capacity track traffic instead of peak.
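The instance-selection idea can be illustrated with a deliberately simplified model. This is a sketch, not Karpenter’s actual scheduling algorithm: the instance catalogue, its prices, and the `cheapest_fit` helper are hypothetical, and the real controller also weighs affinity, topology, daemonset overhead, and Spot availability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float
    memory_gib: float
    hourly_usd: float  # hypothetical price, for illustration only


# Hypothetical catalogue; real names and prices vary by region and capacity type.
CATALOGUE = [
    InstanceType("small-arm", 2, 4, 0.03),
    InstanceType("medium-arm", 4, 8, 0.06),
    InstanceType("medium-x86", 4, 16, 0.09),
    InstanceType("large-x86", 8, 32, 0.18),
]


def cheapest_fit(pending_pods: list[tuple[float, float]]) -> Optional[InstanceType]:
    """Return the cheapest single instance that fits the summed (vCPU, GiB) requests."""
    need_cpu = sum(cpu for cpu, _ in pending_pods)
    need_mem = sum(mem for _, mem in pending_pods)
    candidates = [
        it for it in CATALOGUE
        if it.vcpu >= need_cpu and it.memory_gib >= need_mem
    ]
    return min(candidates, key=lambda it: it.hourly_usd, default=None)


# Three pending pods, each requesting (vCPU, GiB of memory).
choice = cheapest_fit([(1, 2), (1, 2), (1.5, 3)])
print(choice.name if choice else "no single instance fits")
```

The point of the toy model is the ordering: fit is decided first from the pods’ requests, and price breaks the tie, which is the opposite of sizing a fixed ASG shape and hoping the pods fit.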
The change (GitOps) path. Developers never run kubectl apply against prod. A squad merges a PR to an app repo; CI builds a container image, runs SCA/SBOM generation, pushes to ECR, where the image is scanned (Amazon Inspector / ECR enhanced scanning) and signed. CI then bumps an image digest in a config repo (Helm values or Kustomize overlay). Argo CD, watching that config repo, detects drift and reconciles the desired state into the cluster — applying Deployments, VirtualService/VirtualRouter mesh routes, HPAs, and NetworkPolicies. Progressive delivery (Argo Rollouts) shifts traffic 5% → 25% → 100% by reweighting the mesh VirtualRouter, watching golden-signal metrics, and auto-rolling-back on SLO breach. Git is the single source of truth; the cluster is a cache of Git.
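The canary loop described above can be sketched in a few lines. This is a minimal illustration of the control flow only, not Argo Rollouts’ implementation: `set_canary_weight` and `slo_healthy` are hypothetical callbacks standing in for the mesh reweighting and the metric check.

```python
from typing import Callable

CANARY_STEPS = (5, 25, 100)  # percent of traffic sent to the new version


def run_canary(
    set_canary_weight: Callable[[int], None],
    slo_healthy: Callable[[], bool],
    steps: tuple[int, ...] = CANARY_STEPS,
) -> bool:
    """Walk through the weight steps; roll back to 0% on the first SLO breach.

    Returns True if the rollout reached the final step, False if it rolled back.
    """
    for weight in steps:
        set_canary_weight(weight)
        if not slo_healthy():
            set_canary_weight(0)  # send all traffic back to the stable version
            return False
    return True


# Usage with stand-in callbacks: the SLO check passes at 5% and fails at 25%.
history: list[int] = []
checks = iter([True, False])
promoted = run_canary(history.append, lambda: next(checks))
print(promoted, history)
```

In this run the recorded weights are 5, 25, then 0, so the new version never receives more than a quarter of traffic before the rollback.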
So the picture is: Route 53 → WAF → ALB (LB Controller) → App Mesh ingress → Envoy-sidecar mesh of services on Karpenter-provisioned EKS nodes → AWS data services via IRSA, with ECR + Argo CD forming the supply chain and reconciliation loop that feeds the whole thing from Git.
Component breakdown
| Component | Role in this architecture | Key configuration choices |
|---|---|---|
| Amazon EKS | Managed Kubernetes control plane across 3 AZs; the compute substrate everything else sits on. | API endpoint set to private (or public-restricted to CI/CD + admin CIDRs); OIDC provider enabled (prerequisite for IRSA); control-plane logs (api, audit, authenticator) shipped to CloudWatch; EKS access entries + cluster access management instead of hand-edited aws-auth; version N-1 with managed add-ons (VPC CNI, CoreDNS, kube-proxy, EBS CSI). |
| Karpenter | Just-in-time node autoscaler; replaces Cluster Autoscaler + multiple ASGs. | One or more NodePools with requirements spanning Graviton (arm64) and x86, On-Demand + Spot via karpenter.sh/capacity-type; consolidation policy WhenEmptyOrUnderutilized; EC2NodeClass pinning a hardened AMI, IMDSv2-only, and the node IAM role; disruption budgets so prod doesn’t lose too many nodes at once; per-team taints/labels for isolation. |
| AWS App Mesh | Service mesh: mTLS, traffic routing, resilience for east-west calls. | Envoy sidecar auto-injection via the App Mesh controller; VirtualService → VirtualRouter → VirtualNode topology; per-route timeouts, retry policies with budgets, outlier detection (circuit breaking); mTLS with certs from AWS Private CA (cert-manager); a single mesh ingress gateway; access-logging to Envoy → FireLens. (Note: App Mesh is in maintenance/deprecation track — see “When to use it” for the Istio/Cilium migration path; the patterns here transfer directly.) |
| Amazon ECR | Private OCI registry and the start of the verified supply chain. | Enhanced scanning (Amazon Inspector) on push, continuous CVE re-scan; immutable tags; image signing + verification gate (cosign/Notation) enforced by a Kyverno/admission policy; lifecycle policies to expire untagged/old images; cross-region replication for DR; pulls authenticated by the node IAM role over VPC endpoints, not static registry credentials. |
| IRSA (IAM Roles for Service Accounts) | Pod-level AWS identity; eliminates static keys. | ServiceAccount annotated eks.amazonaws.com/role-arn; IAM trust policy scoped to the cluster OIDC provider + exact sub (system:serviceaccount:namespace:serviceaccount); one role per service, least-privilege; (EKS Pod Identity as the newer alternative for simpler cross-account/role association, plus session tags that let policies constrain by namespace or service account). |
| Argo CD (GitOps) | Continuous reconciliation of desired state from Git into the cluster. | App-of-apps / ApplicationSets to onboard teams; config repo separate from app repos; auto-sync + self-heal + prune; sync waves for ordering (CRDs/mesh before workloads); SSO via OIDC + RBAC per team/namespace; Argo Rollouts for canary/blue-green driven through App Mesh VirtualRouter weights. |
| AWS Load Balancer Controller | Reconciles ALB/NLB from Ingress/Service objects; the north-south edge as code. | ALB with WAF + ACM TLS termination for HTTP(S); NLB for gRPC/passthrough; IP target mode (targets pods directly, bypassing node hops); IRSA-scoped permissions; integrates with the mesh ingress gateway. |
| Supporting data + observability | The stateful and telemetry plane the platform depends on. | DynamoDB/Aurora/S3/SQS/Secrets Manager reached via IRSA + VPC endpoints (no NAT egress for AWS APIs); CloudWatch Container Insights + ADOT (OpenTelemetry) → AMP (Prometheus) + AMG (Grafana); AWS X-Ray traces stitched through Envoy; FireLens → CloudWatch Logs/OpenSearch. |
A few of these choices deserve a sentence of “why.” Private API endpoint + OIDC is the foundation that makes IRSA and a no-public-control-plane posture possible at once. Karpenter over Cluster Autoscaler matters because Karpenter chooses instance types per-pod and consolidates aggressively, which is where the real cost savings live. Mesh ingress gateway as the single entry keeps all routing/resilience policy in one mental model instead of split between the ALB and the mesh. And a separate config repo is what lets Argo CD’s self-heal be safe: prod state is whatever is in config-repo@main, full stop.
Implementation guidance
Provision in layers with Terraform, then hand the cluster to GitOps. A clean separation is: Terraform owns everything up to and including a bootstrapped cluster with its core add-ons; Argo CD owns everything inside the cluster from then on. Mixing the two (Terraform managing app Deployments) is the classic anti-pattern that causes drift fights.
Layer 1 — network & cluster (Terraform). Use the community modules as a backbone: terraform-aws-modules/vpc/aws for a 3-AZ VPC with private/public/intra subnets and the required kubernetes.io/role/internal-elb + karpenter.sh/discovery subnet tags; terraform-aws-modules/eks/aws for the cluster, the small managed node group, the OIDC provider, EKS access entries, and managed add-ons. Pin the Kubernetes version and turn on the audit/authenticator control-plane logs here.
Layer 2 — IAM & supply chain (Terraform). Create the Karpenter controller IRSA role and node IAM role/instance profile, the AWS Load Balancer Controller and EBS CSI IRSA roles, the ECR repositories with scan-on-push + immutability + lifecycle + replication, and AWS Private CA for mesh mTLS. Generate per-service IRSA roles with a small module that takes (namespace, serviceaccount, policy_json) and emits a role whose trust policy hard-codes the OIDC sub — this is the security crux:
# Trust policy condition that makes IRSA least-privilege:
# only THIS serviceaccount in THIS namespace can assume the role.
StringEquals = {
  "${oidc}:sub" = "system:serviceaccount:checkout:checkout-api"
  "${oidc}:aud" = "sts.amazonaws.com"
}
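For context, the condition above sits inside a full IAM trust policy. The following sketch builds that document as the JSON that IAM expects, for one ServiceAccount; the account ID, the OIDC issuer value, and the `irsa_trust_policy` helper name are placeholders assumed for illustration.

```python
import json


def irsa_trust_policy(account_id: str, oidc_issuer: str, namespace: str, service_account: str) -> dict:
    """Build an IAM trust policy that only one ServiceAccount can satisfy.

    oidc_issuer is the cluster's OIDC issuer without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{oidc_issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_issuer}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


# Placeholder account ID and issuer, for illustration only.
issuer = "oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE0123456789"
policy = irsa_trust_policy("111122223333", issuer, "checkout", "checkout-api")
print(json.dumps(policy, indent=2))
```

Using `StringEquals` on the exact `sub` is the important design choice: a wildcard match would let any ServiceAccount that fits the pattern assume the role.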
Layer 3 — in-cluster platform (GitOps, bootstrapped once). Terraform installs only the Argo CD Helm release and a single “root” Application; from there Argo CD installs Karpenter, the App Mesh controller + CRDs, the LB controller, cert-manager, ADOT, Kyverno, and the per-team ApplicationSets. Use sync waves so CRDs and the mesh control plane land before any workload that references a VirtualService. Karpenter NodePool/EC2NodeClass and App Mesh objects are plain manifests in the config repo.
Networking & identity wiring — the load-bearing details.
- VPC CNI in prefix-delegation mode to raise pod density per node (so Karpenter’s bin-packing isn’t capped by ENI limits); enable Network Policy support (VPC CNI or Cilium) and ship `NetworkPolicy` objects via Argo for default-deny east-west, layered under the mesh.
- VPC endpoints (Interface for ECR API/DKR, STS, Secrets Manager, CloudWatch, App Mesh/Envoy management; Gateway for S3/DynamoDB) so pulls, STS `AssumeRoleWithWebIdentity`, and AWS API calls never traverse a NAT gateway. This cuts NAT data-processing cost and shrinks the egress attack surface.
- IRSA end-to-end: ServiceAccount → role-arn annotation → projected token → SDK exchanges via STS. Validate with a throwaway pod that runs `aws sts get-caller-identity` and confirms it returns the service’s role, not the node role. For new clusters consider EKS Pod Identity (no per-cluster OIDC trust plumbing, easier cross-account); the manifest changes but the “no static keys” principle is identical.
- App Mesh sidecar injection is namespace-labelled; the init container programs iptables to redirect traffic through Envoy. Wire AWS Private CA + cert-manager so every `VirtualNode` gets a workload cert and mTLS is `STRICT`.
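The IRSA validation step in the list above can be automated. The sketch below checks the `Arn` value that `aws sts get-caller-identity` reports for an assumed role; the role names, the sample ARN, and both helper functions are hypothetical.

```python
import re

# Assumed-role ARNs look like: arn:aws:sts::<account>:assumed-role/<role>/<session>
ASSUMED_ROLE_ARN = re.compile(
    r"^arn:aws[a-z-]*:sts::\d{12}:assumed-role/(?P<role>[^/]+)/(?P<session>.+)$"
)


def assumed_role_name(caller_arn: str) -> str:
    """Extract the IAM role name from an STS assumed-role ARN."""
    match = ASSUMED_ROLE_ARN.match(caller_arn)
    if match is None:
        raise ValueError(f"not an assumed-role ARN: {caller_arn}")
    return match.group("role")


def uses_service_role(caller_arn: str, expected_role: str) -> bool:
    """True if the pod runs as the service's own role rather than, say, the node role."""
    return assumed_role_name(caller_arn) == expected_role


# Hypothetical ARN as it might appear in the "Arn" field of the CLI output.
sample = "arn:aws:sts::111122223333:assumed-role/checkout-api-irsa/botocore-session-1700000000"
print(uses_service_role(sample, "checkout-api-irsa"))
```

A check like this fits naturally into a post-deploy smoke test, so a missing annotation shows up as a failed check instead of a pod that silently falls back to the node role.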
CI/CD shape. App repo CI: build → unit/integration → SBOM (Syft) → push to ECR → Inspector scan gate → cosign sign. A separate job (or Argo CD Image Updater) writes the new digest into the config repo behind a PR. Argo CD syncs; Argo Rollouts runs the canary by reweighting the mesh VirtualRouter and watching AMP metrics, auto-aborting on SLO breach. Kyverno at admission refuses any image that isn’t from your ECR and signed — closing the loop so even a manual kubectl can’t run an unverified image.
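The admission rule at the end of that pipeline boils down to two checks. This is a simplified sketch of the decision logic, not a Kyverno policy: the registry account and region, the `SIGNED_DIGESTS` set (standing in for real cosign signature verification), and the `admit` function are all assumptions.

```python
import re

# Hypothetical account and region; ECR hosts follow <account>.dkr.ecr.<region>.amazonaws.com
ALLOWED_REGISTRY = "111122223333.dkr.ecr.eu-west-1.amazonaws.com"

# Stand-in for signature verification: digests the pipeline is known to have signed.
SIGNED_DIGESTS = {"sha256:" + "a" * 64}

IMAGE_BY_DIGEST = re.compile(
    r"^(?P<registry>[^/]+)/(?P<repo>[^@]+)@(?P<digest>sha256:[0-9a-f]{64})$"
)


def admit(image: str) -> bool:
    """Admit only digest-pinned images from our registry with a known signature."""
    match = IMAGE_BY_DIGEST.match(image)
    if match is None:  # tag-based references such as :latest are rejected outright
        return False
    return (
        match.group("registry") == ALLOWED_REGISTRY
        and match.group("digest") in SIGNED_DIGESTS
    )


good = f"{ALLOWED_REGISTRY}/checkout/api@sha256:{'a' * 64}"
print(admit(good), admit("docker.io/library/nginx:latest"))
```

Requiring a digest rather than a tag is what makes the signature meaningful: a tag can be repointed after signing, while a digest identifies exactly the bytes that were signed.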
(On other IaC: Bicep is Azure-native and Deployment Manager is GCP-native, so neither targets AWS. For this stack Terraform or AWS CDK/CloudFormation are the right tools; the layering above maps cleanly onto CDK constructs if you prefer typed IaC.)
Enterprise considerations
Security & Zero Trust. The architecture is built to assume breach at every layer. Identity: IRSA/Pod Identity means zero long-lived AWS keys; each pod gets least-privilege, short-lived STS credentials scoped to one ServiceAccount, so there is no static AWS key left to leak into Git. Network: private API endpoint, default-deny NetworkPolicy, and mesh mTLS (STRICT) so all east-west traffic is mutually authenticated and encrypted, and security can prove it. Supply chain: ECR enhanced scanning + cosign signatures + a Kyverno admission gate enforce “only our pipeline’s signed, unexpired images run here.” Edge: WAF on the ALB, TLS via ACM, and the mesh ingress gateway as the only door. Secrets: pulled at runtime from Secrets Manager via IRSA (or the Secrets Store CSI driver), never baked into images or env in Git. Map these to PCI-DSS scope by isolating the checkout namespace (dedicated NodePool taint, tighter NetworkPolicy, separate IRSA boundary) so the cardholder path is a small, auditable blast radius.
Cost optimization. This is where Karpenter earns its place. Right-sizing per pod: Karpenter picks the cheapest instance that fits, mixing types instead of one ASG shape. Spot for stateless: run the bulk of stateless services on Spot with On-Demand fallback via capacity-type requirements; Spot at ~70% off is the single biggest lever. Graviton: allow arm64 in NodePools — typically ~20% better price/performance for services with multi-arch images. Consolidation: WhenEmptyOrUnderutilized continuously repacks and terminates waste, so the cluster shrinks at night automatically. No-NAT AWS traffic: VPC endpoints remove NAT data-processing charges for ECR/STS/S3/DynamoDB. ECR lifecycle policies stop image storage from creeping. Track it all with Kubecost/OpenCost + cost-allocation tags per namespace so each squad sees its bill. Net effect: spend tracks traffic, and the 3-a.m. cluster is a fraction of the peak cluster.
Scalability. Three independent axes scale cleanly: pods (HPA on CPU/custom AMP metrics, or KEDA on SQS depth/event lag), nodes (Karpenter, seconds to provision), and traffic shaping (mesh VirtualRouter weights + retries absorb partial failures). The control plane is AWS-managed and scales itself. The ceiling you actually hit first is usually pod IP density (solved by prefix delegation) or a noisy-neighbour data service (solved by per-service throttling at the mesh).
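For the pod axis, the Horizontal Pod Autoscaler’s documented core rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A small sketch of that arithmetic follows; the function name and the example numbers are illustrative, and the real controller additionally applies a tolerance band, min/max bounds, and stabilization windows.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core HPA scaling rule, without tolerance, bounds, or stabilization."""
    return math.ceil(current_replicas * current_metric / target_metric)


# 10 pods averaging 90% CPU against a 60% target.
print(hpa_desired_replicas(10, 90, 60))
```

With those example numbers the rule asks for 15 replicas; if the extra pods do not fit on existing nodes they go Pending, which is the signal Karpenter acts on.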
Reliability & DR (RTO ≈ 30 min / RPO ≈ 5 min). Multi-AZ is table stakes: 3-AZ node spread via topology-spread constraints, PodDisruptionBudgets, and Karpenter disruption budgets so consolidation/Spot reclaims never take down a quorum. For regional DR, the cluster is rebuildable from Git in minutes — Terraform recreates the cluster, Argo CD reconciles every workload from the config repo — which is the whole point of GitOps. ECR cross-region replication ensures images exist in the DR region; data RPO comes from the data services (Aurora cross-region replicas / DynamoDB global tables / S3 CRR), not the cluster. So RTO is dominated by Terraform cluster-create + Argo sync (≈30 min, faster with a warm standby cluster), and RPO is whatever the replicated data tier gives you (≈5 min or better with global tables).
Observability. Golden signals from ADOT/OpenTelemetry → Amazon Managed Prometheus, dashboards in Amazon Managed Grafana, container/node metrics from CloudWatch Container Insights. Envoy emits per-route latency/error/retry/circuit-breaker stats and X-Ray spans, so a slow pricing call is visible before it cascades. Logs via FireLens → CloudWatch/OpenSearch. SLOs (availability, p99) are defined as Prometheus recording rules and are the same signals Argo Rollouts uses to gate canaries — one source of truth for “healthy.”
Governance. Argo CD RBAC + SSO scopes each team to its namespaces; Kyverno policies enforce required labels, resource limits, signed images, no-:latest, no-privileged. EKS audit logs + CloudTrail give a full who-did-what trail. Because every change is a reviewed Git commit, change management and the audit story are the same artifact.
Reference enterprise example
Meridian Freight is a (fictional) logistics-and-parcel marketplace: shippers post loads, carriers bid, the platform handles pricing, matching, tracking and settlement. They run 52 microservices across 9 squads, ~1.8 million API calls/hour at baseline with 5–7x peaks during weekday morning dispatch and quarter-end settlement runs. They came from a Rails monolith plus a sprawl of EC2 ASGs, and two incidents pushed the migration: a flash-sale-style dispatch spike that exhausted statically-sized capacity and dropped 8% of bookings, and a near-miss where an AWS access key for the tracking service was found committed in a config repo (matching this org’s prior leaked-credentials scar tissue — hence the hard “no static keys” mandate).
What they built. One EKS cluster per environment (dev/stage/prod), 3-AZ, private API endpoint. A 3-node On-Demand managed group hosts Argo CD, Karpenter, and the controllers; everything else is Karpenter-provisioned. NodePools allow c/m/r families on both Graviton and x86, Spot-first with On-Demand fallback, except a tainted payments NodePool that’s On-Demand-only and PCI-isolated for settlement. App Mesh fronts all 52 services with STRICT mTLS (AWS Private CA), per-route timeouts, retry budgets, and outlier detection; the matching service — historically the cascade culprit — now has a 250 ms timeout and a 3-attempt retry budget so a slow matcher fails fast instead of taking dispatch down. ECR holds all images with Inspector scanning, immutable tags, cosign signatures, and a Kyverno gate. Every service has a least-privilege IRSA role: tracking can read/write exactly two DynamoDB tables and one S3 prefix — nothing else — so the previously leaked key pattern is structurally impossible. Nine ApplicationSets in Argo CD onboard the squads; canaries shift 5/25/100% via mesh weights with auto-rollback on a p99 or error-rate breach.
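A retry cap like the matcher’s puts a hard ceiling on how long a caller can be held up. The sketch below assumes the 250 ms timeout applies per attempt and that retries are issued back to back with no backoff; neither detail is stated above, so treat both as assumptions. The helper names and the stand-in `flaky_matcher` are hypothetical.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def worst_case_wait_ms(per_try_timeout_ms: int, max_attempts: int) -> int:
    """Upper bound on time spent waiting when every attempt times out."""
    return per_try_timeout_ms * max_attempts


def call_with_retries(call: Callable[[], T], max_attempts: int = 3) -> T:
    """Try the call up to max_attempts times; give up with TimeoutError after that."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return call()
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError(f"gave up after {max_attempts} attempts") from last_error


# Usage: a stand-in call that times out twice, then succeeds on the third attempt.
attempts: list[int] = []


def flaky_matcher() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("matcher too slow")
    return "matched"


result = call_with_retries(flaky_matcher, max_attempts=3)
print(result, len(attempts), worst_case_wait_ms(250, 3))
```

Under these assumptions the caller waits at most 750 ms before failing fast, which is the bound dispatch can plan around.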
The numbers and the outcome.
| Dimension | Before (EC2 ASGs + monolith) | After (EKS + this stack) |
|---|---|---|
| Compute cost (steady state) | ~$41k/mo, sized near peak | ~$23k/mo (Spot + Graviton + consolidation) |
| Scale-up for dispatch peak | manual ASG bumps, ~10 min lag | Karpenter, nodes in ~45–90 s |
| Deploy frequency | ~5/week, central ops ticketed | ~140/week, squad self-service via Git |
| Failed-deploy blast radius | full-service outage, manual rollback | canary auto-rollback, at most the 5% canary slice affected |
| Static AWS keys in use | dozens (env vars, leaked once) | zero (IRSA only) |
| East-west encryption | partial/none, unprovable | 100% mTLS, audit-attestable |
| Cluster regional rebuild | days (snowflake infra) | ~28 min (Terraform + Argo CD from Git) |
The headline outcome wasn’t just the ~44% compute saving or the move from 5 to 140 deploys a week — it was that the platform team shrank its toil: capacity, rollbacks, and credential rotation became properties of the system rather than human chores, and the settlement path became a small, provable PCI island instead of an audit nightmare.
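The headline percentages follow directly from the table. A quick check of the arithmetic, using only the figures above (the helper name is illustrative):

```python
def percent_saving(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


compute_saving = percent_saving(41_000, 23_000)  # monthly compute cost, USD
deploy_multiplier = 140 / 5                      # deploys per week, after vs before

print(f"compute saving: {compute_saving:.1f}%")
print(f"deploy frequency: {deploy_multiplier:.0f}x")
```

That works out to a saving of about 43.9%, which rounds to the ~44% quoted, and a 28-fold increase in deploy frequency.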
When to use it
Use this architecture when you have a genuine multi-team, multi-service platform (roughly 10+ services and/or 3+ squads) that needs self-service deploys, fine-grained pod identity, real east-west resilience, and traffic-tracking cost. It shines precisely where static ASGs and imperative kubectl break down: spiky load, many small teams, strict identity/audit requirements, and a need to rebuild the world from Git.
Trade-offs and anti-patterns to avoid.
- Don’t reach for it too early. For 3–5 services and one team, this is over-engineered. AWS App Runner, ECS on Fargate, or EKS Auto Mode (which bundles Karpenter-style provisioning and core add-ons with far less to operate) will get you to production faster with a fraction of the cognitive load. Adopt the mesh and GitOps machinery when team count — not service count — demands it.
- Don’t let Terraform and Argo CD both own workloads. Terraform up to the bootstrapped cluster, GitOps inside. Overlapping ownership = perpetual drift.
- Don’t run a mesh you don’t need. A sidecar per pod has CPU/memory/latency overhead and real operational weight. If you only need network segmentation and TLS at the edge, VPC CNI NetworkPolicy + ALB may be enough (NetworkPolicy restricts who can talk to whom but does not encrypt or authenticate traffic); add the mesh when mTLS, per-route retries/timeouts/circuit-breaking and traffic-shifting canaries are genuinely required.
- Mind the App Mesh lifecycle. AWS has announced end of support for App Mesh, effective September 30, 2026. For new builds, prefer Istio (ambient mode trims sidecar overhead) or Cilium service mesh (eBPF, sidecar-less). The architecture here is mesh-agnostic: the `VirtualService`/router/resilience concepts and the GitOps-driven traffic shifting map directly onto Istio `VirtualService`/`DestinationRule` or Cilium policies. Treat the App Mesh specifics as illustrative of the pattern, and pick the in-support mesh for greenfield.
- Don’t put stateful databases in-cluster to “simplify.” Keep Aurora/DynamoDB/S3 as managed services reached via IRSA; that’s what makes the cluster disposable and DR fast.
Alternatives in brief. Pure ECS/Fargate — simpler, no Kubernetes to operate, but you give up the CNCF ecosystem, fine-grained mesh control, and portability. EKS Auto Mode — the same EKS core with AWS operating Karpenter/add-ons for you; an excellent on-ramp that you can graduate from into this fuller architecture as you need more control. Self-managed Kubernetes on EC2 — maximum control, maximum toil; rarely worth it versus EKS. Knative/serverless containers — great for bursty, scale-to-zero event workloads, but not a fit for always-on tier-1 request services. The sweet spot for this reference architecture is the broad middle: organizations large enough to feel the multi-team pain, not so constrained that a managed, lower-control option is clearly better.