The three-tier web app is the most-deployed and most-misbuilt architecture on the planet. Almost everyone draws the same three boxes (web, app, database) and almost everyone then quietly couples them: the app servers keep session state in memory so they can never be replaced without logging users out, the database is a single instance because “we’ll add a replica later,” credentials live in environment variables baked into the AMI, and the whole thing sits in one Availability Zone because the demo worked. It runs fine until an AZ blips, a deploy needs to scale, or the single database reboots; then it is down, and the post-mortem says “we always meant to make it resilient.” This reference architecture is the version that is resilient by construction: a public Application Load Balancer spreading traffic across multiple AZs, a stateless application tier on ECS Fargate that the platform can kill and replace at will, ECS Service Auto Scaling that tracks real demand, Amazon RDS in a Multi-AZ deployment with automatic failover, and Amazon ElastiCache so that session load and most hot reads never reach the database. It scales down to a single team running one product and up to a regulated enterprise running a fleet of these behind one platform. The diagram is the same; what changes is the instance sizes, the number of accounts, and the strictness of the guardrails, not the shape.
This article follows the format of the major architecture centers — the scenario, the end-to-end request and data path, a component-by-component breakdown, concrete implementation and Terraform wiring, the enterprise concerns (security, cost, reliability, observability, governance), a named worked example with real monthly numbers, and an honest section on when not to build this.
The business scenario
Picture a product that has graduated from “it runs on one box” to “it cannot be down.” It might be a Series-A SaaS company’s customer portal, a retailer’s storefront, an insurer’s quote-and-bind app, or an internal HR system that 40,000 employees use on Monday morning. The technology underneath is almost always the same shape — a load balancer, some application servers, a relational database — and the shape of the pain is the same at both ends of the size range:
- A single point of failure that everyone knows about and nobody has fixed. There is one database instance. When it patches, reboots, or its host degrades, the application is down — and “the database is the app” because there is no read replica, no standby, and no cache to soften the blow. The business has now had two outages traced to the same single instance and wants it to stop.
- Sessions are sticky, so the app tier is a pet. The application keeps user sessions in process memory, so the load balancer must pin each user to one server. That single decision poisons everything downstream: you cannot deploy without logging people out, you cannot scale in without dropping sessions, and a single instance failure takes its users with it. The “stateless web tier” on the diagram is a lie the moment session state lives in RAM.
- Scaling is manual and always late. Capacity is a fixed fleet someone sized for last quarter. A marketing email or a Monday-morning login storm saturates it; an engineer SSHes in to add instances after the incident has already started. Off-peak, the same fleet runs at 8% utilization, paying full price to do nothing.
- Deploys are scary and the AMI is a snowflake. New versions go out by building an AMI by hand, or by SSHing in and pulling code. Rollback is “rebuild the old AMI.” Two instances drift apart over time, and “is prod actually running the version we think it is?” has no confident answer.
- Secrets and blast radius are an afterthought. The database password is an environment variable in the launch template, the security groups allow 0.0.0.0/0 on the database port “temporarily,” and the app servers have an IAM role that can read every bucket in the account. An auditor asking “what can a compromised web server reach?” gets a long, uncomfortable pause.
The problem this architecture solves is precise: serve a stateful web application with no single point of failure, a genuinely stateless and disposable application tier, demand-tracking elasticity, a managed database that fails over automatically, and a cache that protects both the database and the user experience — all behind least-privilege identity and network boundaries, deployed as code, for a cost that tracks load. The non-goals matter too. This is not a globally-distributed, multi-region, active-active system (that is a heavier, more expensive article, and most apps do not need it). It is not a microservices platform with a service mesh (also a different article). It is the smallest coherent architecture that makes one important web application highly available within a region, elastic, and operable — the workhorse that the fancier patterns are usually an over-reaction to.
Architecture overview
The organizing idea is separation of failure domains and separation of state. The three tiers — load balancing/edge, stateless compute, and stateful data — each live in their own subnet tier and their own security boundary, every tier is spread across at least two (ideally three) Availability Zones, and all durable state is pushed down to managed services (RDS, ElastiCache) so that the compute tier in the middle owns nothing it would be sad to lose. That single discipline — the app tier is stateless, the data tier is managed and multi-AZ — is what turns “three boxes” into “resilient three tiers.”
The whole thing sits inside one VPC with a conventional three-subnet-tier layout, replicated across AZs: a public subnet tier (just the load balancer and NAT), a private application subnet tier (the Fargate tasks, with no public IPs), and a private data subnet tier (RDS and ElastiCache, reachable from nowhere but the app tier).
The request path, end to end, for a user hitting the application:
- A client resolves the application hostname via Amazon Route 53 to the Application Load Balancer (ALB). In front of the ALB sits AWS WAF (OWASP managed rule groups, rate-based rules, bot control) and, for a public internet app, optionally Amazon CloudFront for TLS termination at the edge and caching of static assets. TLS certificates come from AWS Certificate Manager (ACM) and are attached to the ALB’s HTTPS listener; HTTP is redirected to HTTPS.
- The ALB lives in the public subnets across multiple AZs and is the only internet-facing component. It terminates TLS, evaluates listener rules (host/path routing), and forwards each request to a healthy target chosen from its target group, without sticky sessions because the app tier is stateless. The ALB’s health checks continuously probe each task’s /healthz; an unhealthy task is taken out of rotation automatically, and an unhealthy AZ simply stops receiving traffic.
- The target is an ECS Fargate task running the application container in a private application subnet with no public IP. ECS spreads tasks across AZs; ECS Service Auto Scaling (target-tracking on ALB request-count-per-target and/or CPU/memory) adds and removes tasks as load changes, and ECS registers and deregisters them with the ALB target group automatically. The task has a least-privilege IAM task role: it can reach exactly the AWS APIs it needs and nothing else.
- The application handles the request without keeping any user state in its own memory. Session data, user carts, feature flags, and other per-user context live in Amazon ElastiCache (Redis/Valkey), reached over the private data subnets. Because state is external, any task can serve any user’s next request — which is precisely what lets the platform kill, replace, scale, and deploy tasks freely.
- For data the cache can serve, the app reads from ElastiCache first (cache-aside): hot product data, rendered fragments, rate-limit counters, idempotency keys. On a cache miss, or for any write, the app talks to the database.
- The database is Amazon RDS (PostgreSQL or MySQL) in a Multi-AZ deployment in the private data subnets. The application connects through the DB instance endpoint (or the writer endpoint, for a Multi-AZ DB cluster); RDS keeps a synchronous standby in another AZ and, on a primary failure, fails over automatically by repointing that DNS endpoint to the promoted standby, after which the app reconnects and continues. Optional read replicas take read-heavy traffic off the primary; each replica of a Multi-AZ instance has its own endpoint, and a single reader endpoint is available only with a Multi-AZ DB cluster or Aurora. The database credential is never in the app’s config: the task fetches it at startup from AWS Secrets Manager (or uses RDS IAM authentication for a passwordless token), over a private path.
- Telemetry flows out continuously: the ALB, ECS tasks, RDS, and ElastiCache publish metrics to Amazon CloudWatch; the application and the ALB access logs stream to CloudWatch Logs (and the ALB logs to S3); AWS X-Ray (or OpenTelemetry via the ADOT collector) traces requests across the tiers; and CloudWatch alarms drive both auto scaling and on-call alerting.
The data and deployment path, briefly, because resilience includes how state and code move:
- Static assets (JS/CSS/images, user uploads) live in Amazon S3, served through CloudFront with Origin Access Control — they never touch the app tier, which keeps the compute tier truly stateless and cheap.
- Code ships as an immutable container image in Amazon ECR, built and scanned in CI; a deploy is “register a new task definition revision and let ECS do a rolling (or blue/green via CodeDeploy) update” — old tasks drain, new tasks come up healthy, and rollback is “point back to the previous task definition.” There is no AMI to bake and no SSH.
The mental model: the ALB spreads traffic across AZs and hides unhealthy targets; Fargate runs a stateless, disposable app tier that Auto Scaling sizes to demand; ElastiCache holds the session and hot-read state so the tier can stay stateless and the database stays unburdened; and RDS Multi-AZ owns the durable data and fails over on its own. No single instance, in any tier, is something the system depends on.
Component breakdown
| Component | AWS service | What it does | Key configuration choices |
|---|---|---|---|
| DNS & edge | Route 53 + CloudFront | Hostname resolution; global TLS/cache edge | Route 53 alias record to CloudFront/ALB; health-check-based failover records for DR; CloudFront for static-asset caching and TLS at the edge; HTTP→HTTPS redirect |
| Web Application Firewall | AWS WAF | Filters malicious requests at the edge | AWS Managed Rules (Core/OWASP, Known-Bad-Inputs, IP reputation); rate-based rule per IP; optional Bot Control; associate to CloudFront and/or ALB |
| Load balancer | Application Load Balancer (ALB) | L7 entry, TLS termination, health-checked routing across AZs | Internet-facing, in ≥2 public subnets/AZs; HTTPS listener with ACM cert + modern TLS policy; deletion protection + access logs to S3; no stickiness (stateless app); deregistration delay tuned for graceful drain |
| Compute (app tier) | Amazon ECS on AWS Fargate | Runs the stateless application containers | Tasks in private subnets, no public IP; awsvpc networking (each task its own ENI + security group); spread across AZs; task role (app perms) separate from execution role (pull image, read secrets); rolling or blue/green (CodeDeploy) deploys; Fargate platform version pinned |
| Elasticity | ECS Service Auto Scaling (Application Auto Scaling) | Adds/removes tasks to track demand | Target-tracking on ALBRequestCountPerTarget (primary) + CPU/memory; sensible min/max task count; scale-in cooldown > scale-out; optional scheduled scaling for known peaks; Fargate Spot capacity provider for a fraction of tasks |
| Database | Amazon RDS (PostgreSQL/MySQL), Multi-AZ | Durable relational store with automatic failover | Multi-AZ (standby in another AZ, synchronous replication, automatic failover); storage-encrypted (KMS); automated backups + PITR; deletion protection; enhanced monitoring + Performance Insights; read replicas for read scale; credentials in Secrets Manager or IAM auth; gp3 storage; managed minor-version upgrades in a maintenance window |
| Cache / session store | Amazon ElastiCache (Redis/Valkey) | Session store + cache-aside hot data | Multi-AZ with automatic failover, ≥1 replica per shard; cluster mode for horizontal scale; encryption in transit + at rest, AUTH/RBAC; TTLs on session keys; subnet group in data subnets only |
| Network | Amazon VPC | Isolated network, three subnet tiers × AZs | 3 subnet tiers (public / app / data) across ≥2 AZs; NAT Gateway per AZ for app-tier egress; security groups as the primary firewall (ALB→app→data, each referencing the prior SG); VPC endpoints (ECR, S3, Secrets Manager, CloudWatch) so traffic stays on the AWS network |
| Secrets & keys | AWS Secrets Manager + KMS | Stores DB credentials; manages encryption keys | Secrets Manager with automatic rotation for the DB credential (Lambda rotation), or RDS IAM auth for no stored password at all; KMS CMKs for RDS, ElastiCache, S3, secrets; least-privilege key policies |
| Static assets / logs | Amazon S3 | User uploads, static site assets, ALB/audit logs | CloudFront + Origin Access Control for assets (bucket stays private); versioning + lifecycle for logs; SSE-KMS; Block Public Access on |
| CI/CD | ECR + CodePipeline/CodeBuild (or GitHub Actions) + CodeDeploy | Build, scan, ship immutable images | ECR image scanning (enhanced/Inspector) + immutable tags; task-definition-per-revision; blue/green via CodeDeploy with ALB test listener for safe cutover |
| Observability | CloudWatch + X-Ray | Metrics, logs, traces, alarms, dashboards | Container Insights on the ECS cluster; RDS Performance Insights; ALB/app access logs; X-Ray / ADOT tracing; alarms drive scaling and paging; dashboards as code |
A few choices deserve the “why,” because they are exactly where the textbook three-tier diagram quietly betrays you.
Why the app tier must be stateless — and why that means ElastiCache, not stickiness. The single most consequential decision in this whole architecture is that an application task holds no user state. The lazy alternative is ALB sticky sessions, which pin a user to one task so in-memory session state “works.” It is a trap: stickiness means a task you remove (to deploy, to scale in, or because it died) takes its users’ sessions with it, your traffic is unevenly distributed, and an AZ failure logs out everyone who was pinned there. Externalizing session state to ElastiCache breaks that coupling completely — any task can serve any request, so the platform is free to kill, replace, scale, and rebalance tasks at will, which is the entire source of the architecture’s resilience and elasticity. If you remember one thing: state in the cache is what makes the compute tier disposable, and a disposable compute tier is what makes the system resilient.
Why RDS Multi-AZ rather than a read replica you promote manually. People conflate two different things. A read replica scales reads (asynchronous, can lag, must be promoted by hand on failure — minutes of human-in-the-loop downtime). A Multi-AZ deployment is for availability: a synchronously-replicated standby that RDS promotes automatically, repointing the endpoint DNS, typically in 60–120 seconds, with no human and no data loss (RPO ≈ 0 for committed transactions). You want both for different reasons — Multi-AZ for HA, replicas for read scale — and you must not substitute one for the other. The “Multi-AZ DB cluster” variant (two readable standbys) lowers failover time further and gives you reader endpoints too, at higher cost. Crucially, the application should connect through the endpoint name and use a connection pool that reconnects on failure, so the failover is a brief blip, not an outage.
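The difference is visible in the Terraform itself. The fragment below is a minimal sketch, assuming the primary is declared as `aws_db_instance.app` and the data-tier security group as `aws_security_group.data` (the names used by the skeletons in this article); the replica identifier and instance class are illustrative.

```hcl
# Availability is a flag on the primary (multi_az = true): the standby is
# synchronous and promoted by RDS on failure.
# Read scale is a separate instance that replicates asynchronously and
# must be promoted by hand if it is ever to become a writer.
resource "aws_db_instance" "app_replica" {
  identifier             = "web-prod-db-replica-1"         # hypothetical name
  replicate_source_db    = aws_db_instance.app.identifier  # same-region source
  instance_class         = "db.r6g.large"                  # illustrative size
  vpc_security_group_ids = [aws_security_group.data.id]

  performance_insights_enabled = true
  auto_minor_version_upgrade   = true
  skip_final_snapshot          = true # a replica holds no unique data to snapshot
}

# Each replica exposes its own DNS name; read-only queries connect here.
output "replica_address" {
  value = aws_db_instance.app_replica.address
}
```

Engine, storage, and encryption settings are inherited from the source instance, which is why the replica block is so much shorter than the primary's.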
Why ECS Fargate over EC2 Auto Scaling Groups (and where ASGs still win). Both can run a stateless app tier behind an ALB. Fargate removes the EC2 layer entirely — no AMIs to patch, no node Auto Scaling Group to manage, no SSH, per-task isolation with its own ENI and security group, and you pay per-task vCPU/memory by the second. EC2 with an Auto Scaling Group wins when you need a specific instance type or GPU, want maximum cost control via Reserved/Savings-Plan-backed instances at steady high utilization, or run software that assumes a host. For the typical resilient web app, Fargate’s operational simplicity is worth the modest per-unit premium; this article uses it as the default and notes the ASG alternative where it matters. Either way the pattern — stateless tasks/instances, ALB target group, target-tracking auto scaling across AZs — is identical.
Why security groups that reference each other, not CIDRs. The data tier’s security group should allow the database port from the app tier’s security group, not from a subnet CIDR and certainly not from 0.0.0.0/0. Referencing the source SG means the rule says “only the application may reach the database,” it stays correct as IPs change, and it makes the blast radius legible: a compromised web request can reach the app SG, the app can reach the data SG, and nothing else. This SG-chaining (ALB-SG → app-SG → data-SG) is the network segmentation.
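The chain can be sketched in Terraform as follows. This assumes the VPC is `aws_vpc.this` and the app listens on port 8080, as in the skeletons in this article; the group names are illustrative.

```hcl
resource "aws_security_group" "alb" {
  name   = "web-prod-alb-sg"
  vpc_id = aws_vpc.this.id
}

resource "aws_security_group" "app" {
  name   = "web-prod-app-sg"
  vpc_id = aws_vpc.this.id
}

resource "aws_security_group" "data" {
  name   = "web-prod-data-sg"
  vpc_id = aws_vpc.this.id
}

# Internet -> ALB: the only ingress rule that names a CIDR.
resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}

# ALB -> app: the source is a security group, not an address range.
resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.alb.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}

# app -> data: PostgreSQL and Redis/Valkey, from the app tier only.
resource "aws_vpc_security_group_ingress_rule" "data_from_app" {
  for_each = { postgres = 5432, cache = 6379 }

  security_group_id            = aws_security_group.data.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = each.value
  to_port                      = each.value
}
```

Egress rules are omitted for brevity. Terraform removes the default allow-all egress rule from groups it creates, so matching `aws_vpc_security_group_egress_rule` resources are still needed before traffic flows.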
Implementation guidance
Provision in layers, each with its own Terraform stack and state, so the long-lived network and the shorter-lived app can evolve independently. Terraform is the common choice on AWS (the AWS CDK or CloudFormation are equally valid; on multi-cloud teams Terraform wins). The layering matters more than the tool:
- Layer 0: Network & shared (platform). VPC, three subnet tiers across AZs, route tables, NAT Gateway per AZ, Internet Gateway, VPC endpoints (S3 gateway endpoint; interface endpoints for ECR, Secrets Manager, CloudWatch, ECS), Route 53 zone, ACM certificate. Long-lived; changes rarely.
- Layer 1: Data tier (platform/DBA). RDS Multi-AZ instance/cluster, ElastiCache replication group, their subnet groups and security groups, KMS keys, the Secrets Manager secret + rotation. Stateful; treat with care (prevent_destroy).
- Layer 2: App tier (app team). ECS cluster, task definition, service, target group, ALB + listeners, WAF web ACL, Application Auto Scaling policies, IAM task/execution roles. Shorter-lived; deploys often.
- Layer 3: Edge & DNS. CloudFront distribution, S3 asset bucket + OAC, Route 53 alias/failover records. Often co-owned.
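As one illustration of Layer 0, the VPC endpoints might be declared as below. This is a sketch under assumptions: `var.region`, `aws_route_table.app`, and `aws_security_group.endpoints` (a group allowing 443 from the app tier) are hypothetical names, while `aws_vpc.this` and `aws_subnet.app` follow the skeletons in this article.

```hcl
variable "region" {
  type    = string
  default = "us-east-1"
}

# Gateway endpoint: S3 traffic (including ECR image layers) bypasses the NAT.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.this.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.app[*].id # hypothetical per-AZ route tables
}

# Interface endpoints: one ENI per app subnet for each listed service.
resource "aws_vpc_endpoint" "interface" {
  for_each = toset(["ecr.api", "ecr.dkr", "secretsmanager", "logs"])

  vpc_id              = aws_vpc.this.id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.app[*].id
  security_group_ids  = [aws_security_group.endpoints.id] # hypothetical
  private_dns_enabled = true # SDK calls resolve to the endpoint with no code change
}
```

The service list is a starting point and can be extended the same way for any other AWS API the tasks call.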
A representative Terraform skeleton for the load-balanced, auto-scaled Fargate app tier (Layer 2) — note the stateless target group (no stickiness), the AZ-spread service, and the target-tracking policy on request-count:
```hcl
# --- ALB across the PUBLIC subnets in every AZ ---
resource "aws_lb" "app" {
  name                       = "web-prod-alb"
  load_balancer_type         = "application"
  internal                   = false
  subnets                    = aws_subnet.public[*].id # one per AZ
  security_groups            = [aws_security_group.alb.id]
  enable_deletion_protection = true
  drop_invalid_header_fields = true

  access_logs {
    bucket  = aws_s3_bucket.lb_logs.id
    enabled = true
  }
}

resource "aws_lb_target_group" "app" {
  name                 = "web-prod-tg"
  port                 = 8080
  protocol             = "HTTP"
  target_type          = "ip" # awsvpc / Fargate registers task ENIs
  vpc_id               = aws_vpc.this.id
  deregistration_delay = 30 # graceful drain on scale-in / deploy

  health_check {
    path                = "/healthz"
    healthy_threshold   = 3
    unhealthy_threshold = 3
    interval            = 15
    matcher             = "200"
  }
  # NOTE: no stickiness block; the app tier is stateless (state in ElastiCache)
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.app.arn

  # Multi-argument blocks must span several lines in HCL.
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# --- ECS Fargate service: tasks in PRIVATE app subnets, spread across AZs ---
resource "aws_ecs_service" "app" {
  name            = "web-prod"
  cluster         = aws_ecs_cluster.this.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = aws_subnet.app[*].id # private, no public IP
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "web"
    container_port   = 8080
  }

  deployment_controller {
    type = "ECS" # or CODE_DEPLOY for blue/green
  }

  # Circuit breaker applies to the ECS rolling-update controller only.
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
}

# --- Auto Scaling: track requests-per-target (primary), 3..20 tasks ---
resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.this.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 3
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "rps" {
  name               = "track-requests-per-target"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = aws_appautoscaling_target.app.scalable_dimension
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.app.arn_suffix}/${aws_lb_target_group.app.arn_suffix}"
    }
    target_value       = 1000 # requests/target; scale out above this
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```
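The component table lists CPU as a secondary scaling signal alongside request count. A second target-tracking policy on the same scalable target is one way to express that; the fragment below is a sketch, and the 60% target is an illustrative value rather than a recommendation.

```hcl
resource "aws_appautoscaling_policy" "cpu" {
  name               = "track-cpu"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = aws_appautoscaling_target.app.scalable_dimension
  service_namespace  = aws_appautoscaling_target.app.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 60 # percent average CPU across tasks; illustrative
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```

When several target-tracking policies share a target, Application Auto Scaling scales out if any one of them calls for it and scales in only when all of them agree, so the two policies do not fight each other.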
The data tier (Layer 1) is where Multi-AZ and “no password in the app” earn their keep:
```hcl
resource "aws_db_instance" "app" {
  identifier                          = "web-prod-db"
  engine                              = "postgres"
  engine_version                      = "16"
  instance_class                      = "db.r6g.large"
  allocated_storage                   = 100
  storage_type                        = "gp3"
  multi_az                            = true # synchronous standby + auto-failover
  db_subnet_group_name                = aws_db_subnet_group.data.name
  vpc_security_group_ids              = [aws_security_group.data.id]
  storage_encrypted                   = true
  kms_key_id                          = aws_kms_key.rds.arn
  backup_retention_period             = 14 # PITR window
  deletion_protection                 = true
  performance_insights_enabled        = true
  iam_database_authentication_enabled = true # passwordless token option

  # The master username is set here (illustrative value); the password is
  # generated, stored, and rotated by RDS in Secrets Manager.
  username                    = "app_admin"
  manage_master_user_password = true

  auto_minor_version_upgrade = true

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_elasticache_replication_group" "sessions" {
  replication_group_id       = "web-prod-cache"
  description                = "session + cache-aside"
  engine                     = "valkey"
  node_type                  = "cache.r7g.large"
  num_node_groups            = 1
  replicas_per_node_group    = 2
  automatic_failover_enabled = true # promote a replica on AZ loss
  multi_az_enabled           = true
  subnet_group_name          = aws_elasticache_subnet_group.data.name
  security_group_ids         = [aws_security_group.data.id]
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
}
```
Networking and identity wiring, the load-bearing rules:
- Three subnet tiers, every tier in every AZ. Public subnets hold only the ALB and the NAT Gateways. The app subnets hold the Fargate task ENIs (no public IP, egress via the per-AZ NAT or, better, VPC endpoints). The data subnets hold RDS and ElastiCache and have no route to the internet at all. Use one NAT Gateway per AZ so a single AZ’s failure cannot take out egress for the others (a shared single NAT is a sneaky single point of failure people leave in).
- Security groups chain, default-deny. alb-sg allows 443 from the internet (or only from CloudFront’s managed prefix list). app-sg allows the app port from alb-sg only. data-sg allows 5432/6379 from app-sg only. Nothing references a CIDR for east-west traffic; nothing allows 0.0.0.0/0 inbound except the ALB. That chain is your microsegmentation.
- VPC endpoints keep traffic private and cut NAT cost. Add a gateway endpoint for S3 and interface endpoints for ECR (api + dkr), Secrets Manager, CloudWatch Logs, and ECS. Now image pulls, secret fetches, and log shipping never traverse the NAT/Internet: lower latency, lower data-transfer cost, smaller attack surface.
- Identity is task-scoped and secret-less. The ECS execution role can pull from ECR and read the one secret the container needs; the task role grants the application exactly the AWS APIs it uses (e.g., s3:GetObject on one bucket prefix, kms:Decrypt on one key), not account-wide access. Prefer RDS IAM authentication (the task requests a short-lived auth token via its task role) so there is no database password anywhere; where a password is unavoidable, source it from Secrets Manager with rotation, never an environment variable in the task definition.
- Connect through endpoints, pool, and reconnect. The app uses the RDS endpoint DNS name (never a hardcoded IP) and a connection pool (e.g., PgBouncer/RDS Proxy) configured to re-resolve and reconnect on error, so a 90-second Multi-AZ failover surfaces as a brief retry, not a crash. RDS Proxy is worth adding for connection multiplexing and even faster failover handling under many concurrent tasks.
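The identity rules above can be sketched as a task role policy plus a task definition that references a secret by ARN. Everything not in the skeletons is an assumption: `var.region`, `var.image_uri`, `aws_iam_role.task`, `aws_iam_role.execution`, `aws_s3_bucket.assets`, the `app_user` database user, and `aws_secretsmanager_secret.app_db` (a JSON secret with a `password` key) are hypothetical names.

```hcl
data "aws_caller_identity" "current" {}

# Task role: exactly the APIs the application calls, scoped to named resources.
data "aws_iam_policy_document" "task" {
  statement {
    sid       = "ReadAssetsPrefix"
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.assets.arn}/public/*"]
  }

  statement {
    sid     = "RdsIamConnect" # lets the task mint a short-lived DB auth token
    actions = ["rds-db:connect"]
    resources = [
      "arn:aws:rds-db:${var.region}:${data.aws_caller_identity.current.account_id}:dbuser:${aws_db_instance.app.resource_id}/app_user"
    ]
  }
}

resource "aws_iam_role_policy" "task" {
  name   = "web-prod-task"
  role   = aws_iam_role.task.id
  policy = data.aws_iam_policy_document.task.json
}

resource "aws_ecs_task_definition" "app" {
  family                   = "web-prod"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  task_role_arn            = aws_iam_role.task.arn      # what the app may call
  execution_role_arn       = aws_iam_role.execution.arn # pull image, read secret

  container_definitions = jsonencode([{
    name         = "web"
    image        = var.image_uri
    essential    = true
    portMappings = [{ containerPort = 8080, protocol = "tcp" }]
    environment  = [{ name = "DB_HOST", value = aws_db_instance.app.address }]
    # Only the secret's ARN is stored here; ECS resolves the value at task start.
    secrets = [{
      name      = "DB_PASSWORD"
      valueFrom = "${aws_secretsmanager_secret.app_db.arn}:password::"
    }]
  }])
}
```

With RDS IAM authentication in use, the `secrets` entry can be dropped entirely and the `rds-db:connect` statement becomes the only database credential the task holds.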
Deployment: ship an immutable image to ECR (scanned by Inspector), register a new task-definition revision, and let ECS do a rolling update with the deployment circuit breaker on (auto-rollback if new tasks fail health checks), or a blue/green deploy via CodeDeploy with an ALB test listener for a zero-downtime cutover you validate before shifting production traffic. There is no AMI and no SSH — rollback is selecting the previous task definition.
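On the registry side, immutability and scanning are a few lines of configuration. This is a minimal sketch; the repository name is illustrative, and the registry-wide enhanced-scanning rule is one possible setup rather than the only one.

```hcl
resource "aws_ecr_repository" "app" {
  name                 = "web-prod"
  image_tag_mutability = "IMMUTABLE" # a pushed tag can never be overwritten

  image_scanning_configuration {
    scan_on_push = true
  }
}

# Enhanced (Inspector-backed) scanning is a registry-level setting.
resource "aws_ecr_registry_scanning_configuration" "this" {
  scan_type = "ENHANCED"

  rule {
    scan_frequency = "CONTINUOUS_SCAN"
    repository_filter {
      filter      = "*"
      filter_type = "WILDCARD"
    }
  }
}
```

Immutable tags are what make "select the previous task definition" a trustworthy rollback: the image a revision points to cannot have changed since it was deployed.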
Enterprise considerations
Security and Zero Trust. Apply Zero Trust at each tier. Network: the only public component is the ALB (ideally only reachable from CloudFront’s prefix list); the app tier has no public IP; the data tier has no internet route; security groups chain ALB→app→data and deny everything else. Identity: the app runs under a least-privilege task role, the database uses IAM auth or a rotated Secrets Manager credential (no password in config), and human access to the database is via short-lived credentials, not a shared root login. Edge: AWS WAF with managed OWASP, known-bad-inputs, and IP-reputation rule groups plus a rate-based rule blunts L7 attacks and credential-stuffing; AWS Shield Standard is automatic and Shield Advanced is available for DDoS-sensitive apps. Data: encryption at rest (KMS CMKs) on RDS, ElastiCache, S3, and secrets, and encryption in transit everywhere (TLS at the ALB, TLS to RDS, transit encryption on ElastiCache). Posture: turn on GuardDuty (threat detection on VPC flow logs, DNS, and CloudTrail), Security Hub (CIS/AWS Foundational benchmark scoring), and Inspector (image and instance CVEs), and treat findings as a backlog. The blast radius of a compromised web request is now bounded by exactly three security-group hops and one narrowly-scoped IAM role.
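The edge controls can be sketched as a regional web ACL attached to the ALB. This assumes the load balancer is `aws_lb.app` as in the skeleton; the ACL name, rule priorities, and the 2,000-request limit are illustrative.

```hcl
resource "aws_wafv2_web_acl" "app" {
  name  = "web-prod-acl"
  scope = "REGIONAL" # use "CLOUDFRONT" when attaching to a CloudFront distribution

  default_action {
    allow {}
  }

  # AWS-managed core rule set (OWASP-style protections).
  rule {
    name     = "aws-common-rules"
    priority = 10

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        vendor_name = "AWS"
        name        = "AWSManagedRulesCommonRuleSet"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "aws-common-rules"
      sampled_requests_enabled   = true
    }
  }

  # Block any single IP exceeding the limit within the evaluation window.
  rule {
    name     = "rate-limit-per-ip"
    priority = 20

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = 2000 # requests per 5-minute window per IP
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "rate-limit-per-ip"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "web-prod-acl"
    sampled_requests_enabled   = true
  }
}

resource "aws_wafv2_web_acl_association" "alb" {
  resource_arn = aws_lb.app.arn
  web_acl_arn  = aws_wafv2_web_acl.app.arn
}
```

The known-bad-inputs and IP-reputation groups follow the same `managed_rule_group_statement` pattern with a different `name` and a distinct priority.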
Cost optimization (FinOps). Elasticity is the headline lever: target-tracking Auto Scaling means you stop paying for the off-peak fleet that the static estate ran at 8% utilization — capacity tracks load. Beyond that: (1) Fargate Spot for a fraction of the tasks (interruptible web capacity at up to ~70% off — keep a Fargate-on-demand floor for the baseline so a Spot reclaim never drops you below safe capacity); (2) Compute Savings Plans for the steady-state Fargate baseline and Reserved Instances for the long-lived RDS and ElastiCache nodes (these run 24×7, so 1- or 3-year reservations cut 30–60%); (3) gp3 storage (cheaper and independently tunable IOPS vs gp2) and right-sized instance classes guided by Performance Insights and Compute Optimizer; (4) VPC endpoints + CloudFront/S3 for static assets to slash NAT and data-transfer charges (data transfer is the line item teams forget); (5) scheduled scaling to shrink non-prod overnight and weekends; (6) per-team cost allocation tags and a CloudWatch + Cost Explorer showback so each service sees its own bill. The cache itself is a cost optimizer: every read it serves is an RDS query you did not run, which lets the database be smaller.
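The on-demand floor plus Spot burst from item (1) maps onto a capacity provider strategy. The fragment below is a sketch, assuming the cluster is `aws_ecs_cluster.this`; the base of 3 and the 1:1 weights are illustrative.

```hcl
resource "aws_ecs_cluster_capacity_providers" "this" {
  cluster_name       = aws_ecs_cluster.this.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  # The first 3 tasks always run on on-demand Fargate (the safe floor).
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 3
    weight            = 1
  }

  # Above the base, new tasks split 1:1 between on-demand and Spot.
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
  }
}
```

A service only picks up this default strategy when it does not set `launch_type`, so the `launch_type = "FARGATE"` line in the service skeleton would have to be removed (or replaced with explicit `capacity_provider_strategy` blocks) for Spot to take effect.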
Scalability. Independent axes: the app tier scales horizontally via ECS Service Auto Scaling on request-count and CPU (seconds to minutes, scaling to a floor rather than to zero for an always-on web app); the database read path scales by adding read replicas (and by leaning on the cache); the cache scales by enabling cluster mode and adding shards; writes scale vertically (bigger RDS instance) and via RDS Proxy to multiplex connections. When single-writer RDS becomes the ceiling, Aurora (separated storage, up to 15 low-lag replicas, faster failover) is the next step without changing the architecture’s shape. Design every request to be servable by any task (stateless), and keep transactions short so the database stays the part you scale last, not first.
Reliability and DR (RTO/RPO). Inside the region this architecture is HA by design: the ALB and Fargate tasks span multiple AZs, RDS Multi-AZ fails over automatically in ~60–120 seconds (RPO ≈ 0 for committed transactions), ElastiCache Multi-AZ promotes a replica on node/AZ loss, and one NAT Gateway per AZ removes the shared-egress SPOF. Losing an entire AZ degrades capacity but does not cause an outage, and Auto Scaling backfills the lost tasks in the surviving AZs. Region loss is the harder case and a deliberate trade-off: this is a single-region design, so cross-region DR is warm standby or backup-and-restore, not active-active. The pragmatic posture has two parts. For the data, use cross-region automated backups or RDS snapshot copy, plus a read replica in a second region; your RPO is the replica lag (seconds to low minutes), or the snapshot cadence for backup-and-restore. For the app tier, use Terraform and the ECR image to stand it up in the DR region; your RTO is “promote the cross-region replica + terraform apply the app stack + flip the Route 53 health-check failover record,” realistically 15–45 minutes warm, hours cold. Replicate (or copy) the ECR image and the secrets to the DR region. The reliability discipline that makes this real is rehearsal: run an AZ-failure game day (kill an AZ’s tasks, force an RDS failover) on a schedule and confirm the blip is a blip, and run a region game day at least annually.
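The Route 53 failover record in that RTO sequence might be wired as below. This is a sketch under assumptions: `aws_route53_zone.this`, the hostname `shop.example.com`, and `aws_lb.dr` (an ALB stood up in the DR region) are hypothetical; `aws_lb.app` is the primary ALB from the skeleton.

```hcl
# Health check against the primary ALB's health endpoint.
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_lb.app.dns_name
  type              = "HTTPS"
  port              = 443
  resource_path     = "/healthz"
  request_interval  = 30
  failure_threshold = 3
}

resource "aws_route53_record" "primary" {
  zone_id         = aws_route53_zone.this.zone_id
  name            = "shop.example.com"
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}

# Served only while the primary health check is failing.
resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.this.zone_id
  name           = "shop.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.dr.dns_name
    zone_id                = aws_lb.dr.zone_id
    evaluate_target_health = true
  }
}
```

In a warm-standby posture the secondary record can exist ahead of time, so the DNS "flip" happens on its own once the DR app stack is healthy.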
Observability. Three signals plus the load balancer’s view. Metrics: Container Insights for ECS task/cluster CPU/memory and task counts; the ALB’s TargetResponseTime, HTTPCode_Target_5XX_Count, RequestCount, HealthyHostCount, and RejectedConnectionCount; RDS CPU, connections, replica lag, freeable memory, and read/write latency; ElastiCache evictions, hit-rate, and CPU. Logs: application logs and ALB access logs to CloudWatch Logs / S3 (queried with Logs Insights / Athena). Traces: X-Ray / ADOT stitch a request across ALB → task → cache → database so you can see which tier a slow request spent its time in. Alarms do double duty: they drive Auto Scaling and page on-call. Alert on SLOs and symptoms (5XX rate, p99 latency, HealthyHostCount dropping, RDS replica lag, cache hit-rate collapse), not raw CPU; a cache hit-rate that falls off a cliff is an early warning that the database is about to be hammered.
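A symptom-based alarm of the kind described here can be built with CloudWatch metric math. The fragment below is a sketch: `aws_sns_topic.oncall` is a hypothetical paging topic, and the 1% threshold over three minutes is illustrative.

```hcl
resource "aws_cloudwatch_metric_alarm" "target_5xx_rate" {
  alarm_name          = "web-prod-target-5xx-rate"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 1 # percent of requests; illustrative
  evaluation_periods  = 3
  treat_missing_data  = "notBreaching" # no traffic is not an incident
  alarm_actions       = [aws_sns_topic.oncall.arn]

  # The alarm evaluates this expression, not a raw metric.
  metric_query {
    id          = "rate"
    expression  = "100 * errors / requests"
    label       = "Target 5XX rate (%)"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "HTTPCode_Target_5XX_Count"
      period      = 60
      stat        = "Sum"
      dimensions  = { LoadBalancer = aws_lb.app.arn_suffix }
    }
  }

  metric_query {
    id = "requests"
    metric {
      namespace   = "AWS/ApplicationELB"
      metric_name = "RequestCount"
      period      = 60
      stat        = "Sum"
      dimensions  = { LoadBalancer = aws_lb.app.arn_suffix }
    }
  }
}
```

Alarming on the ratio rather than the raw count keeps the threshold meaningful at both off-peak and peak request volumes.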
Governance. Enforce, do not document. Land this in an AWS Organizations member account governed by Control Tower, with Service Control Policies denying the obvious foot-guns (no public RDS, no 0.0.0.0/0 on database ports, no disabling encryption, region restrictions). AWS Config rules (or conformance packs) continuously assert the invariants (rds-multi-az-support, rds-storage-encrypted, elasticache-redis-cluster-automatic-backup-check, alb-http-to-https-redirection-check, vpc-sg-open-only-to-authorized-ports) and report drift. IAM Identity Center governs human access via groups and short-lived sessions. And the infrastructure-as-code repo’s PR history is your change-management record: every production change is a reviewed, attributed, revertable commit, which is exactly what an auditor wants and exactly what “SSH in and edit the AMI” never provides.
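Those managed rules can be enabled from the same repository. The fragment below is a sketch that assumes an AWS Config configuration recorder is already running in the account; the map keys are the rule names, and the values are the managed-rule identifiers as best understood here, which should be checked against the current managed-rule list before use.

```hcl
locals {
  # rule name => AWS managed rule identifier
  config_rules = {
    "rds-multi-az-support"                             = "RDS_MULTI_AZ_SUPPORT"
    "rds-storage-encrypted"                            = "RDS_STORAGE_ENCRYPTED"
    "elasticache-redis-cluster-automatic-backup-check" = "ELASTICACHE_REDIS_CLUSTER_AUTOMATIC_BACKUP_CHECK"
    "alb-http-to-https-redirection-check"              = "ALB_HTTP_TO_HTTPS_REDIRECTION_CHECK"
    "vpc-sg-open-only-to-authorized-ports"             = "VPC_SG_OPEN_ONLY_TO_AUTHORIZED_PORTS"
  }
}

resource "aws_config_config_rule" "invariant" {
  for_each = local.config_rules

  name = each.key

  source {
    owner             = "AWS" # AWS-managed rule, no custom Lambda
    source_identifier = each.value
  }
}
```

Keeping the list in one map means adding an invariant is a one-line, reviewable change.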
Reference enterprise example
Larkfield Outfitters is a (fictional) mid-market direct-to-consumer outdoor-gear retailer running its storefront and account portal on AWS — roughly 180,000 daily active shoppers, spiky around weekend sales and a brutal Black-Friday peak, run by one platform team and three product squads. They started on a pair of hand-built EC2 instances behind a classic load balancer with a single MySQL box. Two incidents in one quarter forced the rebuild: a database reboot during a patch window took the entire site down for 22 minutes, and a flash-sale login storm saturated the fixed two-instance fleet while an engineer scrambled to launch more by hand. A PCI assessment also flagged the plaintext database password baked into the launch template and a database security group open to the whole VPC.
What they built. One VPC in us-east-1 spanning three AZs, three subnet tiers (public/app/data). The app, a containerized Node storefront and a Java account service, runs on ECS Fargate in the private app subnets, stateless, behind an internet-facing ALB with a public subnet in each of the three AZs, with CloudFront + WAF (OWASP + a rate-based rule at 2,000 req/5-min/IP) in front and ACM TLS. They deleted ALB stickiness and moved all session state to ElastiCache (Valkey, Multi-AZ, 1 primary + 2 replicas), which doubled as the cache-aside store for the product catalog and rendered category fragments. The database became RDS PostgreSQL db.r6g.xlarge Multi-AZ with two read replicas behind a reader endpoint for the read-heavy catalog and order-history pages, storage encrypted with a KMS CMK, 14-day PITR, and the master credential managed and rotated by RDS in Secrets Manager. The PCI password finding closed itself, and the account service uses RDS IAM auth for token-based, passwordless connections. ECS Service Auto Scaling target-tracks ALBRequestCountPerTarget at 1,000, floor 4 tasks, ceiling 40, with scheduled scaling to pre-warm to 12 tasks before each weekend sale and a Fargate Spot capacity provider carrying ~40% of tasks above the on-demand floor. One NAT Gateway per AZ plus S3 gateway and ECR/Secrets/CloudWatch interface endpoints cut their data-transfer bill and kept image pulls private. Static assets and uploads moved to S3 behind CloudFront with OAC. Deploys went to blue/green via CodeDeploy with automatic rollback on failed deployments.
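To make the scaling settings concrete, here is a simplified model of how a target-tracking policy sizes the fleet. It assumes a purely proportional rule (load divided by the per-task target, clamped to the floor and ceiling); the real Application Auto Scaling behaviour also involves cooldowns, alarm evaluation periods, and more conservative scale-in, none of which is modelled here. The function name is hypothetical.

```python
import math

TARGET_PER_TASK = 1000  # ALBRequestCountPerTarget target from the example
MIN_TASKS = 4           # floor
MAX_TASKS = 40          # ceiling


def desired_tasks(total_requests_per_period: int) -> int:
    """Estimate the task count that brings per-task load back to the target."""
    needed = math.ceil(total_requests_per_period / TARGET_PER_TASK)
    # Clamp to the configured floor and ceiling.
    return max(MIN_TASKS, min(MAX_TASKS, needed))


if __name__ == "__main__":
    for load in (0, 3_500, 18_500, 100_000):
        print(load, "->", desired_tasks(load))
```

The floor is what absorbs the first seconds of a spike before scale-out completes, which is why the example also pre-warms on a schedule rather than relying on target tracking alone.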
The numbers and decisions. Roughly $5,200/month all-in at their normal weekday load: ~$1,500 Fargate (with a 1-year Compute Savings Plan on the baseline and Spot carrying the burst), ~$1,450 RDS (Multi-AZ primary + 2 replicas, on 1-year Reserved Instances), ~$520 ElastiCache (reserved nodes), ~$430 ALB + data processing, ~$520 CloudFront + WAF, ~$380 NAT + the rest. Black-Friday week multiplied the Fargate and ALB lines by roughly 2.4 as Auto Scaling rode demand to 40 tasks and then scaled back down on Monday, so they paid for the peak only while it happened. They debated EC2 Auto Scaling Groups vs Fargate and chose Fargate to delete AMI-patching toil for a team of four; they debated Aurora vs RDS PostgreSQL and chose RDS Multi-AZ for now, with Aurora noted as the upgrade path if the single writer becomes the ceiling. They debated keeping stickiness “to avoid a session-store rewrite” and rejected it outright: externalizing sessions to ElastiCache was the keystone that made every other resilience property possible.
The outcome. The database reboot that caused the 22-minute outage became a non-event: a routine RDS minor-version upgrade now triggers a ~70-second Multi-AZ failover that surfaces as a brief retry blip (the connection pool reconnects to the same endpoint) — they verified it in a game day by forcing a failover during business hours and measured 78 seconds to full recovery with zero committed-data loss. The flash-sale meltdown stopped recurring: a Monday-morning login storm now scales the app tier from 4 to 19 tasks in under three minutes automatically, and the cache absorbs the read storm so RDS CPU barely moves (catalog hit-rate ~94%). They ran an AZ-failure game day — terminated every task in one AZ and pulled its NAT — and the site stayed up on the surviving two AZs while Auto Scaling backfilled the lost tasks in ~90 seconds. For region DR they keep a cross-region read replica in us-west-2 and the ECR image + Terraform ready; a rehearsed failover (promote replica, terraform apply the app stack, flip the Route 53 failover record) measured an RTO of ~28 minutes with an RPO under 30 seconds (replica lag). The PCI assessor’s “credentials in config” and “database network exposure” findings were both closed — by Secrets Manager/IAM auth and the SG chain respectively. Net: an application that no longer has a single instance it depends on, that pays for capacity only when it is used, and that a four-person platform team can actually operate — for a bit over $5k/month at steady state.
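The "brief retry blip" during a Multi-AZ failover depends on the application retrying dropped connections rather than failing the request. A minimal sketch of that retry loop, using exponential backoff with full jitter, follows. `with_retries` is a hypothetical helper, `ConnectionError` stands in for whatever exception your database driver raises, and the attempt count and delays are placeholders that would need sizing against your measured failover time.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying connection failures with jittered backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # Full jitter: wait a random time up to the capped exponential delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("attempts must be at least 1")
```

Because the endpoint name does not change on failover, the retried call only needs a fresh connection; the jitter keeps every task from reconnecting at the same instant once the new primary is up.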
When to use it
Use this architecture when you run an important, stateful web or API application that must be highly available within a region and elastic with demand — a customer portal, a storefront, a SaaS app, a line-of-business system — and you want it built on managed services with as little undifferentiated operational toil as possible. It is the correct default for “make our web app resilient and scalable” at essentially any size: it scales down to one squad running one product on small instances and up to a regulated enterprise running a fleet of these behind Organizations/Control Tower; the diagram is the same, only the sizes, account count, and guardrail strictness change. The prerequisites are modest: the app must be (or be made) stateless, with its durable state in RDS and its session/hot-read state in ElastiCache.
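The split between durable state in the database and hot-read state in the cache is usually implemented as cache-aside reads with a TTL. The sketch below shows the pattern with an in-memory stand-in for the cache; `TtlCache`, `read_through`, and the 300-second default TTL are illustrative names and values, and a real deployment would call a Redis- or Valkey-compatible client instead.

```python
import time
from typing import Callable, Optional


class TtlCache:
    """In-memory stand-in for the cache tier: entries expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # Expired: behave like a miss.
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._items[key] = (self._clock() + ttl_seconds, value)


def read_through(
    cache: TtlCache,
    key: str,
    load_from_db: Callable[[str], str],
    ttl_seconds: float = 300.0,
) -> str:
    """Cache-aside read: try the cache, fall back to the database, then fill."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(key)  # The database remains the source of truth.
    cache.set(key, value, ttl_seconds)
    return value
```

Losing the cache in this design costs latency and database load, not data: every entry can be rebuilt from the database on the next miss.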
Trade-offs to accept going in. This is a single-region, intra-region-HA design. It survives instance and AZ failures gracefully and cheaply; surviving a region failure requires the warm-standby DR posture described above (cross-region replica/snapshots + IaC), which is real work you must build and rehearse — it is not active-active multi-region, and pretending otherwise is the most common way these architectures disappoint. You are also accepting the cost of a synchronous standby and replicas/cache nodes running 24×7 (mitigated by reservations) as the price of the availability they buy.
Anti-patterns that quietly defeat the design:
- Sticky sessions to dodge externalizing state. The single most common failure. Stickiness re-couples users to a task and re-introduces every problem the architecture exists to remove — you cannot deploy or scale without dropping sessions, and an AZ loss logs out everyone pinned there. State goes in ElastiCache; the app tier stays stateless. No exceptions.
- A single-AZ database “to start.” RDS without Multi-AZ means a routine patch, reboot, or host degradation is a full outage — the exact failure this pattern is built to eliminate. Multi-AZ from day one for anything that matters; a read replica is not a substitute for the synchronous standby.
- One NAT Gateway shared across AZs. A money-saving “optimization” that re-introduces a single point of failure for all app-tier egress — if that NAT’s AZ fails, every AZ loses egress. One NAT per AZ (or VPC endpoints to avoid NAT for AWS traffic).
- The database password in the task definition / launch template. A plaintext credential in config is the credential-leak path the architecture is meant to remove. Use RDS IAM auth, or Secrets Manager with rotation — never an environment variable.
- Security groups open to 0.0.0.0/0 (or a whole CIDR) on the data tier. This erases the segmentation that bounds your blast radius. Chain SGs by reference (ALB→app→data); only the ALB faces the internet.
- A cache used as a database, or with no failover. Putting durable data only in ElastiCache (no RDS of record) means a node loss is data loss; running a single cache node with no replica means a node loss is an outage and a thundering herd onto RDS. Cache is for speed and sessions with TTLs; RDS is the source of truth; run the cache Multi-AZ.
- Hand-baked AMIs and SSH deploys. Mutable, drifting hosts make “what is actually running?” unanswerable and rollback slow. Immutable images in ECR, task-definition revisions, blue/green or rolling deploys with auto-rollback.
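Several of these anti-patterns can be caught mechanically. As one example, the security-group rule above (data tier reachable only by SG reference, never by CIDR) can be checked with a few lines of code. The sketch assumes each group is a dict shaped like the entries returned by the EC2 DescribeSecurityGroups API (`GroupId`, `IpPermissions`, `IpRanges`, `Ipv6Ranges`, `UserIdGroupPairs`); the function name and the group IDs in the usage example are hypothetical.

```python
def cidr_ingress_findings(group: dict) -> list[str]:
    """List every ingress rule on a data-tier group that admits a CIDR range.

    A correctly chained data-tier group has only UserIdGroupPairs entries
    (references to the app-tier group), so any CIDR entry is a finding.
    """
    findings = []
    for rule in group.get("IpPermissions", []):
        ports = f"{rule.get('FromPort', 'all')}-{rule.get('ToPort', 'all')}"
        for ip_range in rule.get("IpRanges", []):
            findings.append(f"{group['GroupId']}: {ip_range['CidrIp']} on {ports}")
        for ip_range in rule.get("Ipv6Ranges", []):
            findings.append(f"{group['GroupId']}: {ip_range['CidrIpv6']} on {ports}")
    return findings


if __name__ == "__main__":
    open_group = {
        "GroupId": "sg-data-open",  # hypothetical ID
        "IpPermissions": [{
            "IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
            "Ipv6Ranges": [], "UserIdGroupPairs": [],
        }],
    }
    for finding in cidr_ingress_findings(open_group):
        print(finding)
```

A check like this is deliberately stricter than "no 0.0.0.0/0": it also flags a rule open to the whole VPC CIDR, which is the finding from the worked example.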
Alternatives, in increasing capability and operational cost:
(1) A fully serverless app (API Gateway/CloudFront + Lambda + DynamoDB), when your workload fits an event/request model and you want zero server (and zero cache/DB) operations and scale-to-zero economics; the right choice for spiky or low-baseline apps, but a poorer fit for long-lived connections, heavy relational joins, or “lift-and-shift a stateful web app.”
(2) AWS App Runner / Elastic Beanstalk: a managed, opinionated wrapper over roughly this same pattern for teams that want even less to configure and can live with less control.
(3) This article (ALB + Fargate Auto Scaling + RDS Multi-AZ + ElastiCache): the workhorse default for a resilient, elastic, stateful web app in one region.
(4) The same shape with Aurora instead of RDS, when single-writer RDS or failover speed becomes the constraint and you want up to 15 low-lag replicas and faster failover.
(5) Active-active multi-region (Route 53 latency/failover routing, Aurora Global Database or DynamoDB Global Tables), when an entire-region outage is unacceptable and you will pay the substantial complexity and cost of running everywhere at once.
Pick the lowest tier that meets your availability and statefulness requirements; most teams reach for multi-region when intra-region HA plus a rehearsed warm standby would have done, and pay for global complexity they did not need. The architecture you can actually operate and rehearse beats the one you merely drew.