AWS Enterprise Architecture: Resilient Three-Tier Web App

The three-tier web app is the most-deployed and most-misbuilt architecture on the planet. Almost everyone draws the same three boxes — web, app, database — and almost everyone then quietly couples them: the app servers keep session state in memory so they can never be replaced without logging users out, the database is a single instance because “we’ll add a replica later,” credentials live in environment variables baked into the AMI, and the whole thing sits in one Availability Zone because the demo worked. It runs fine until an AZ blips, a deploy needs to scale, or the single database reboots — and then it is down, and the post-mortem says “we always meant to make it resilient.” This reference architecture is the version that is resilient by construction: a public Application Load Balancer spreading traffic across multiple AZs, a stateless application tier on ECS Fargate that the platform can kill and replace at will, EC2/ECS Auto Scaling that tracks real demand, Amazon RDS in a Multi-AZ deployment with automatic failover, and Amazon ElastiCache so that session and hot-read load never touches the database. It scales down to a single team running one product and up to a regulated enterprise running a fleet of these behind one platform — the diagram is the same; what changes is the instance sizes, the number of accounts, and the strictness of the guardrails, not the shape.

This article follows the format of the major architecture centers — the scenario, the end-to-end request and data path, a component-by-component breakdown, concrete implementation and Terraform wiring, the enterprise concerns (security, cost, reliability, observability, governance), a named worked example with real monthly numbers, and an honest section on when not to build this.

The business scenario

Picture a product that has graduated from “it runs on one box” to “it cannot be down.” It might be a Series-A SaaS company’s customer portal, a retailer’s storefront, an insurer’s quote-and-bind app, or an internal HR system that 40,000 employees use on Monday morning. The technology underneath is almost always the same shape — a load balancer, some application servers, a relational database — and the shape of the pain is the same at both ends of the size range:

The problem this architecture solves is precise: serve a stateful web application with no single point of failure, a genuinely stateless and disposable application tier, demand-tracking elasticity, a managed database that fails over automatically, and a cache that protects both the database and the user experience — all behind least-privilege identity and network boundaries, deployed as code, for a cost that tracks load. The non-goals matter too. This is not a globally-distributed, multi-region, active-active system (that is a heavier, more expensive article, and most apps do not need it). It is not a microservices platform with a service mesh (also a different article). It is the smallest coherent architecture that makes one important web application highly available within a region, elastic, and operable — the workhorse that the fancier patterns are usually an over-reaction to.

Architecture overview

The organizing idea is separation of failure domains and separation of state. The three tiers — load balancing/edge, stateless compute, and stateful data — each live in their own subnet tier and their own security boundary, every tier is spread across at least two (ideally three) Availability Zones, and all durable state is pushed down to managed services (RDS, ElastiCache) so that the compute tier in the middle owns nothing it would be sad to lose. That single discipline — the app tier is stateless, the data tier is managed and multi-AZ — is what turns “three boxes” into “resilient three tiers.”

The whole thing sits inside one VPC with a conventional three-subnet-tier layout, replicated across AZs: a public subnet tier (just the load balancer and NAT), a private application subnet tier (the Fargate tasks, with no public IPs), and a private data subnet tier (RDS and ElastiCache, reachable from nowhere but the app tier).
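
As a rough illustration of that layout, the three subnet tiers could be carved out of one VPC range as below. This is a minimal sketch, not the article's actual network stack: the AZ names, the 10.0.0.0/16 range, and the variable name azs are assumptions, and the internet gateway, NAT Gateways, and route tables are left out for brevity.

```hcl
# Assumed inputs: three AZ names and a /16 VPC range (both placeholders).
variable "azs" {
  type    = list(string)
  default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

resource "aws_vpc" "this" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# Public tier: load balancer and NAT only (10.0.0.0/20, 10.0.16.0/20, 10.0.32.0/20).
resource "aws_subnet" "public" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.this.id
  availability_zone = var.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.this.cidr_block, 4, count.index)
}

# Private application tier: Fargate tasks, no public IPs (10.0.64.0/20 onward).
resource "aws_subnet" "app" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.this.id
  availability_zone = var.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.this.cidr_block, 4, count.index + 4)
}

# Private data tier: RDS and ElastiCache (10.0.128.0/20 onward).
resource "aws_subnet" "data" {
  count             = length(var.azs)
  vpc_id            = aws_vpc.this.id
  availability_zone = var.azs[count.index]
  cidr_block        = cidrsubnet(aws_vpc.this.cidr_block, 4, count.index + 8)
}
```

The resource names match the aws_subnet.public and aws_subnet.app references used in the Terraform skeleton later in the article; aws_subnet.data is an assumed name following the same pattern.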

Resilient three-tier AWS web app: edge (Route 53, CloudFront, WAF) into a VPC with public ALB, private Fargate app tier under Auto Scaling, and a private data tier of RDS Multi-AZ plus ElastiCache, with Secrets Manager/KMS, ECR, S3 and CloudWatch on the side; the request path is numbered 1 to 7.

The request path, end to end, for a user hitting the application:

  1. A client resolves the application hostname via Amazon Route 53 to the Application Load Balancer (ALB). In front of the ALB sits AWS WAF (OWASP managed rule groups, rate-based rules, bot control) and, for a public internet app, optionally Amazon CloudFront for TLS termination at the edge and caching of static assets. TLS certificates come from AWS Certificate Manager (ACM) and are attached to the ALB’s HTTPS listener; HTTP is redirected to HTTPS.
  2. The ALB lives in the public subnets across multiple AZs and is the only internet-facing component. It terminates TLS, evaluates listener rules (host/path routing), and forwards each request to a healthy target chosen from its target group — without sticky sessions, because the app tier is stateless. The ALB’s health checks continuously probe each task’s /healthz; an unhealthy task is taken out of rotation automatically, and an unhealthy AZ simply stops receiving traffic.
  3. The target is an ECS Fargate task running the application container in a private application subnet with no public IP. ECS spreads tasks across AZs; ECS Service Auto Scaling (target-tracking on ALB request-count-per-target and/or CPU/memory) adds and removes tasks as load changes, and the ALB registers/deregisters them automatically. The task has a least-privilege IAM task role — it can reach exactly the AWS APIs it needs and nothing else.
  4. The application handles the request without keeping any user state in its own memory. Session data, user carts, feature flags, and other per-user context live in Amazon ElastiCache (Redis/Valkey), reached over the private data subnets. Because state is external, any task can serve any user’s next request — which is precisely what lets the platform kill, replace, scale, and deploy tasks freely.
  5. For data the cache can serve, the app reads from ElastiCache first (cache-aside): hot product data, rendered fragments, rate-limit counters, idempotency keys. On a cache miss, or for any write, the app talks to the database.
  6. The database is Amazon RDS (PostgreSQL or MySQL) in a Multi-AZ deployment in the private data subnets. The application connects through the cluster/primary endpoint; RDS keeps a synchronous standby in another AZ and, on a primary failure, fails over automatically by repointing that DNS endpoint to the promoted standby — the app reconnects and continues. Read replicas (optional) take read-heavy traffic off the primary via a separate reader endpoint. The database credential is never in the app’s config: the task fetches it at startup from AWS Secrets Manager (or uses RDS IAM authentication for a passwordless token), over a private path.
  7. Telemetry flows out continuously: the ALB, ECS tasks, RDS, and ElastiCache publish metrics to Amazon CloudWatch; the application and the ALB access logs stream to CloudWatch Logs (and the ALB logs to S3); AWS X-Ray (or OpenTelemetry via the ADOT collector) traces requests across the tiers; and CloudWatch alarms drive both auto scaling and on-call alerting.

The data and deployment path, briefly, because resilience includes how state and code move:

The mental model: the ALB spreads traffic across AZs and hides unhealthy targets; Fargate runs a stateless, disposable app tier that Auto Scaling sizes to demand; ElastiCache holds the session and hot-read state so the tier can stay stateless and the database stays unburdened; and RDS Multi-AZ owns the durable data and fails over on its own. No single instance, in any tier, is something the system depends on.

Component breakdown

| Component | AWS service | What it does | Key configuration choices |
| --- | --- | --- | --- |
| DNS & edge | Route 53 + CloudFront | Hostname resolution; global TLS/cache edge | Route 53 alias record to CloudFront/ALB; health-check-based failover records for DR; CloudFront for static-asset caching and TLS at the edge; HTTP→HTTPS redirect |
| Web Application Firewall | AWS WAF | Filters malicious requests at the edge | AWS Managed Rules (Core/OWASP, Known-Bad-Inputs, IP reputation); rate-based rule per IP; optional Bot Control; associate to CloudFront and/or ALB |
| Load balancer | Application Load Balancer (ALB) | L7 entry, TLS termination, health-checked routing across AZs | Internet-facing, in ≥2 public subnets/AZs; HTTPS listener with ACM cert + modern TLS policy; deletion protection + access logs to S3; no stickiness (stateless app); deregistration delay tuned for graceful drain |
| Compute (app tier) | Amazon ECS on AWS Fargate | Runs the stateless application containers | Tasks in private subnets, no public IP; awsvpc networking (each task its own ENI + security group); spread across AZs; task role (app perms) separate from execution role (pull image, read secrets); rolling or blue/green (CodeDeploy) deploys; Fargate platform version pinned |
| Elasticity | ECS Service Auto Scaling (Application Auto Scaling) | Adds/removes tasks to track demand | Target-tracking on ALBRequestCountPerTarget (primary) + CPU/memory; sensible min/max task count; scale-in cooldown > scale-out; optional scheduled scaling for known peaks; Fargate Spot capacity provider for a fraction of tasks |
| Database | Amazon RDS (PostgreSQL/MySQL), Multi-AZ | Durable relational store with automatic failover | Multi-AZ (standby in another AZ, synchronous replication, automatic failover); storage-encrypted (KMS); automated backups + PITR; deletion protection; enhanced monitoring + Performance Insights; read replicas for read scale; credentials in Secrets Manager or IAM auth; gp3 storage; managed minor-version upgrades in a maintenance window |
| Cache / session store | Amazon ElastiCache (Redis/Valkey) | Session store + cache-aside hot data | Multi-AZ with automatic failover, ≥1 replica per shard; cluster mode for horizontal scale; encryption in transit + at rest, AUTH/RBAC; TTLs on session keys; subnet group in data subnets only |
| Network | Amazon VPC | Isolated network, three subnet tiers × AZs | 3 subnet tiers (public / app / data) across ≥2 AZs; NAT Gateway per AZ for app-tier egress; security groups as the primary firewall (ALB→app→data, each referencing the prior SG); VPC endpoints (ECR, S3, Secrets Manager, CloudWatch) so traffic stays on the AWS network |
| Secrets & keys | AWS Secrets Manager + KMS | Stores DB credentials; manages encryption keys | Secrets Manager with automatic rotation for the DB credential (Lambda rotation), or RDS IAM auth for no stored password at all; KMS CMKs for RDS, ElastiCache, S3, secrets; least-privilege key policies |
| Static assets / logs | Amazon S3 | User uploads, static site assets, ALB/audit logs | CloudFront + Origin Access Control for assets (bucket stays private); versioning + lifecycle for logs; SSE-KMS; Block Public Access on |
| CI/CD | ECR + CodePipeline/CodeBuild (or GitHub Actions) + CodeDeploy | Build, scan, ship immutable images | ECR image scanning (enhanced/Inspector) + immutable tags; task-definition-per-revision; blue/green via CodeDeploy with ALB test listener for safe cutover |
| Observability | CloudWatch + X-Ray | Metrics, logs, traces, alarms, dashboards | Container Insights on the ECS cluster; RDS Performance Insights; ALB/app access logs; X-Ray / ADOT tracing; alarms drive scaling and paging; dashboards as code |

A few choices deserve the “why,” because they are exactly where the textbook three-tier diagram quietly betrays you.

Why the app tier must be stateless — and why that means ElastiCache, not stickiness. The single most consequential decision in this whole architecture is that an application task holds no user state. The lazy alternative is ALB sticky sessions, which pin a user to one task so in-memory session state “works.” It is a trap: stickiness means a task you remove (to deploy, to scale in, or because it died) takes its users’ sessions with it, your traffic is unevenly distributed, and an AZ failure logs out everyone who was pinned there. Externalizing session state to ElastiCache breaks that coupling completely — any task can serve any request, so the platform is free to kill, replace, scale, and rebalance tasks at will, which is the entire source of the architecture’s resilience and elasticity. If you remember one thing: state in the cache is what makes the compute tier disposable, and a disposable compute tier is what makes the system resilient.

Why RDS Multi-AZ rather than a read replica you promote manually. People conflate two different things. A read replica scales reads (asynchronous, can lag, must be promoted by hand on failure — minutes of human-in-the-loop downtime). A Multi-AZ deployment is for availability: a synchronously-replicated standby that RDS promotes automatically, repointing the endpoint DNS, typically in 60–120 seconds, with no human and no data loss (RPO ≈ 0 for committed transactions). You want both for different reasons — Multi-AZ for HA, replicas for read scale — and you must not substitute one for the other. The “Multi-AZ DB cluster” variant (two readable standbys) lowers failover time further and gives you reader endpoints too, at higher cost. Crucially, the application should connect through the endpoint name and use a connection pool that reconnects on failure, so the failover is a brief blip, not an outage.
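
To make the distinction concrete: Multi-AZ is a flag on the primary, while a read replica is a separate database instance with its own endpoint. The sketch below assumes the primary is the aws_db_instance.app resource from the data-tier skeleton later in the article; the replica identifier and instance class are placeholders.

```hcl
# A same-region read replica: asynchronous, for read scale only.
# It does not replace multi_az = true on the primary, which provides the HA standby.
resource "aws_db_instance" "replica" {
  identifier                   = "web-prod-db-replica-1"               # placeholder name
  replicate_source_db          = aws_db_instance.app.identifier        # source is the Multi-AZ primary
  instance_class               = "db.r6g.large"
  vpc_security_group_ids       = [aws_security_group.data.id]
  performance_insights_enabled = true
  skip_final_snapshot          = true                                  # replicas hold no unique data
}

# Each replica exposes its own hostname; the application (or a DNS record or
# proxy in front of several replicas) sends read-only queries here.
output "replica_address" {
  value = aws_db_instance.replica.address
}
```

Engine, credentials, and storage encryption are inherited from the source instance, which is why they are not repeated on the replica.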

Why ECS Fargate over EC2 Auto Scaling Groups (and where ASGs still win). Both can run a stateless app tier behind an ALB. Fargate removes the EC2 layer entirely — no AMIs to patch, no node Auto Scaling Group to manage, no SSH, per-task isolation with its own ENI and security group, and you pay per-task vCPU/memory by the second. EC2 with an Auto Scaling Group wins when you need a specific instance type or GPU, want maximum cost control via Reserved/Savings-Plan-backed instances at steady high utilization, or run software that assumes a host. For the typical resilient web app, Fargate’s operational simplicity is worth the modest per-unit premium; this article uses it as the default and notes the ASG alternative where it matters. Either way the pattern — stateless tasks/instances, ALB target group, target-tracking auto scaling across AZs — is identical.

Why security groups that reference each other, not CIDRs. The data tier’s security group should allow the database port from the app tier’s security group, not from a subnet CIDR and certainly not from 0.0.0.0/0. Referencing the source SG means the rule says “only the application may reach the database,” it stays correct as IPs change, and it makes the blast radius legible: a compromised web request can reach the app SG, the app can reach the data SG, and nothing else. This SG-chaining (ALB-SG → app-SG → data-SG) is the network segmentation.
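
A minimal sketch of that chain follows, assuming a recent AWS provider that includes the standalone aws_vpc_security_group_ingress_rule and aws_vpc_security_group_egress_rule resources. The group names and the ports (8080 for the app, 5432 for PostgreSQL, 6379 for the cache) are illustrative, and app-tier egress on 443 to AWS APIs is omitted to keep the example short.

```hcl
resource "aws_security_group" "alb" {
  name   = "web-prod-alb"
  vpc_id = aws_vpc.this.id
}

resource "aws_security_group" "app" {
  name   = "web-prod-app"
  vpc_id = aws_vpc.this.id
}

resource "aws_security_group" "data" {
  name   = "web-prod-data"
  vpc_id = aws_vpc.this.id
}

# Internet -> ALB on 443 is the only rule that uses a CIDR.
resource "aws_vpc_security_group_ingress_rule" "alb_https" {
  security_group_id = aws_security_group.alb.id
  cidr_ipv4         = "0.0.0.0/0"
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}

# ALB -> app: both sides reference the peer security group, never a CIDR.
resource "aws_vpc_security_group_egress_rule" "alb_to_app" {
  security_group_id            = aws_security_group.alb.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}

resource "aws_vpc_security_group_ingress_rule" "app_from_alb" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.alb.id
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}

# app -> data: database and cache ports, allowed only from the app tier's group.
resource "aws_vpc_security_group_egress_rule" "app_to_db" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.data.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}

resource "aws_vpc_security_group_egress_rule" "app_to_cache" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.data.id
  ip_protocol                  = "tcp"
  from_port                    = 6379
  to_port                      = 6379
}

resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.data.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}

resource "aws_vpc_security_group_ingress_rule" "cache_from_app" {
  security_group_id            = aws_security_group.data.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 6379
  to_port                      = 6379
}
```

Standalone rule resources are used here because inline ingress and egress blocks that reference each other's groups would create a dependency cycle in Terraform.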

Implementation guidance

Provision in layers, each with its own Terraform stack and state, so the long-lived network and the shorter-lived app can evolve independently. Terraform is the common choice on AWS (the AWS CDK or CloudFormation are equally valid; on multi-cloud teams Terraform wins). The layering matters more than the tool:

A representative Terraform skeleton for the load-balanced, auto-scaled Fargate app tier (Layer 2) — note the stateless target group (no stickiness), the AZ-spread service, and the target-tracking policy on request-count:

# --- ALB across the PUBLIC subnets in every AZ ---
resource "aws_lb" "app" {
  name                       = "web-prod-alb"
  load_balancer_type         = "application"
  internal                   = false
  subnets                    = aws_subnet.public[*].id   # one per AZ
  security_groups            = [aws_security_group.alb.id]
  enable_deletion_protection = true
  drop_invalid_header_fields = true
  access_logs {
    bucket  = aws_s3_bucket.lb_logs.id
    enabled = true
  }
}

resource "aws_lb_target_group" "app" {
  name        = "web-prod-tg"
  port        = 8080
  protocol    = "HTTP"
  target_type = "ip"            # awsvpc / Fargate registers task ENIs
  vpc_id      = aws_vpc.this.id
  deregistration_delay = 30     # graceful drain on scale-in / deploy
  health_check {
    path                = "/healthz"
    healthy_threshold   = 3
    unhealthy_threshold = 3
    interval            = 15
    matcher             = "200"
  }
  # NOTE: no stickiness block — the app tier is stateless (state in ElastiCache)
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.app.arn
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# --- ECS Fargate service: tasks in PRIVATE app subnets, spread across AZs ---
resource "aws_ecs_service" "app" {
  name            = "web-prod"
  cluster         = aws_ecs_cluster.this.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 3
  launch_type     = "FARGATE"
  network_configuration {
    subnets          = aws_subnet.app[*].id          # private, no public IP
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "web"
    container_port   = 8080
  }
  deployment_controller { type = "ECS" }             # or CODE_DEPLOY for blue/green
  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }
  lifecycle { ignore_changes = [desired_count] }      # Auto Scaling owns the task count after creation
}

# --- Auto Scaling: track requests-per-target (primary), 3..20 tasks ---
resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.this.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 3
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "rps" {
  name               = "track-requests-per-target"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = aws_appautoscaling_target.app.scalable_dimension
  service_namespace  = "ecs"
  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.app.arn_suffix}/${aws_lb_target_group.app.arn_suffix}"
    }
    target_value       = 1000   # requests/target; scale out above this
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
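
The skeleton above defines only the HTTPS listener. The HTTP to HTTPS redirect described in step 1 of the request path could be sketched as a second listener on port 80; the resource name http_redirect is an assumption.

```hcl
# Port 80 exists only to redirect clients to HTTPS; it never forwards to targets.
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.app.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
```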

The data tier (Layer 1) is where Multi-AZ and “no password in the app” earn their keep:

resource "aws_db_instance" "app" {
  identifier              = "web-prod-db"
  engine                  = "postgres"
  engine_version          = "16"
  instance_class          = "db.r6g.large"
  allocated_storage       = 100
  storage_type            = "gp3"
  multi_az                = true                 # synchronous standby + auto-failover
  db_subnet_group_name    = aws_db_subnet_group.data.name
  vpc_security_group_ids  = [aws_security_group.data.id]
  storage_encrypted       = true
  kms_key_id              = aws_kms_key.rds.arn
  backup_retention_period = 14                   # PITR window
  deletion_protection     = true
  performance_insights_enabled          = true
  iam_database_authentication_enabled   = true   # passwordless token option
  # The master password is generated, stored, and rotated by RDS in Secrets Manager.
  # A username is still required; "app_admin" is a placeholder.
  username                              = "app_admin"
  manage_master_user_password           = true   # RDS-managed secret + rotation
  auto_minor_version_upgrade            = true
  lifecycle { prevent_destroy = true }
}

resource "aws_elasticache_replication_group" "sessions" {
  replication_group_id       = "web-prod-cache"
  description                = "session + cache-aside"
  engine                     = "valkey"
  node_type                  = "cache.r7g.large"
  num_node_groups            = 1
  replicas_per_node_group    = 2
  automatic_failover_enabled = true              # promote a replica on AZ loss
  multi_az_enabled           = true
  subnet_group_name          = aws_elasticache_subnet_group.data.name
  security_group_ids         = [aws_security_group.data.id]
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
}
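
Both data-tier resources above reference subnet groups that the skeleton does not show. A minimal sketch, assuming the data subnets are declared as aws_subnet.data in the same style as aws_subnet.public and aws_subnet.app, and using placeholder group names:

```hcl
# RDS places the primary and its Multi-AZ standby in subnets from this group.
resource "aws_db_subnet_group" "data" {
  name       = "web-prod-db"
  subnet_ids = aws_subnet.data[*].id
}

# ElastiCache places the primary and replica nodes in subnets from this group.
resource "aws_elasticache_subnet_group" "data" {
  name       = "web-prod-cache"
  subnet_ids = aws_subnet.data[*].id
}
```

Listing only data-tier subnets here is what keeps both services out of the public and application tiers.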

Networking and identity wiring, the load-bearing rules:

Deployment: ship an immutable image to ECR (scanned by Inspector), register a new task-definition revision, and let ECS do a rolling update with the deployment circuit breaker on (auto-rollback if new tasks fail health checks), or a blue/green deploy via CodeDeploy with an ALB test listener for a zero-downtime cutover you validate before shifting production traffic. There is no AMI and no SSH — rollback is selecting the previous task definition.
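
The task definition that each deploy revises is not shown in the skeleton. The sketch below is one way it might look, with the execution role (image pull, logs, secret injection) kept separate from the task role (what the application code may call). Several names are assumptions: the variables image_tag and region, the role names, an aws_ecr_repository.app resource, a pre-existing log group /ecs/web-prod, and the DB_HOST and DB_PASSWORD variable names. Injecting the RDS master credential is for illustration only; a dedicated application user or IAM authentication would normally be preferable.

```hcl
variable "image_tag" { type = string }

variable "region" {
  type    = string
  default = "us-east-1"
}

# Trust policy shared by both roles: only ECS tasks may assume them.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: used by ECS itself to pull the image, write logs, and read secrets.
resource "aws_iam_role" "task_execution" {
  name               = "web-prod-task-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "task_execution" {
  role       = aws_iam_role.task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role_policy" "read_db_secret" {
  name = "read-db-secret"
  role = aws_iam_role.task_execution.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = aws_db_instance.app.master_user_secret[0].secret_arn
    }]
  })
}

# Task role: permissions for the application code; left empty until the app needs an AWS API.
resource "aws_iam_role" "task" {
  name               = "web-prod-task"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_ecs_task_definition" "app" {
  family                   = "web-prod"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512
  memory                   = 1024
  execution_role_arn       = aws_iam_role.task_execution.arn
  task_role_arn            = aws_iam_role.task.arn

  container_definitions = jsonencode([{
    name         = "web"
    image        = "${aws_ecr_repository.app.repository_url}:${var.image_tag}"
    essential    = true
    portMappings = [{ containerPort = 8080, protocol = "tcp" }]
    environment  = [{ name = "DB_HOST", value = aws_db_instance.app.address }]
    # The password is injected at task start from the RDS-managed secret, never stored in config.
    secrets = [{
      name      = "DB_PASSWORD"
      valueFrom = "${aws_db_instance.app.master_user_secret[0].secret_arn}:password::"
    }]
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/web-prod"
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = "web"
      }
    }
  }])
}
```

Because ECS resolves the secret when a task starts, a rotated password reaches running tasks only after they are replaced. If the secret is encrypted with a customer-managed KMS key, the execution role also needs kms:Decrypt on that key.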

Enterprise considerations

Security and Zero Trust. Apply Zero Trust at each tier. Network: the only public component is the ALB (ideally only reachable from CloudFront’s prefix list); the app tier has no public IP; the data tier has no internet route; security groups chain ALB→app→data and deny everything else. Identity: the app runs under a least-privilege task role, the database uses IAM auth or a rotated Secrets Manager credential (no password in config), and human access to the database is via short-lived credentials, not a shared root login. Edge: AWS WAF with managed OWASP, known-bad-inputs, and IP-reputation rule groups plus a rate-based rule blunts L7 attacks and credential-stuffing; AWS Shield Standard is automatic and Shield Advanced is available for DDoS-sensitive apps. Data: encryption at rest (KMS CMKs) on RDS, ElastiCache, S3, and secrets, and encryption in transit everywhere (TLS at the ALB, TLS to RDS, transit encryption on ElastiCache). Posture: turn on GuardDuty (threat detection on VPC flow logs, DNS, and CloudTrail), Security Hub (CIS/AWS Foundational benchmark scoring), and Inspector (image and instance CVEs), and treat findings as a backlog. The blast radius of a compromised web request is now bounded by exactly three security-group hops and one narrowly-scoped IAM role.
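
The edge controls described above might be expressed roughly as follows: one AWS managed rule group plus a per-IP rate-based rule, attached to the ALB. This is a sketch under assumptions: the ACL and rule names, the priorities, and the 2,000-requests-per-IP limit are illustrative, and only the Core rule set is shown of the managed groups mentioned.

```hcl
resource "aws_wafv2_web_acl" "app" {
  name  = "web-prod-waf"
  scope = "REGIONAL" # use "CLOUDFRONT" (created in us-east-1) when attaching to a distribution

  default_action {
    allow {}
  }

  # AWS managed Core rule set (OWASP-style protections).
  rule {
    name     = "aws-common"
    priority = 10

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        vendor_name = "AWS"
        name        = "AWSManagedRulesCommonRuleSet"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "aws-common"
      sampled_requests_enabled   = true
    }
  }

  # Block any single IP that exceeds the limit within the rate window.
  rule {
    name     = "rate-limit-per-ip"
    priority = 20

    action {
      block {}
    }

    statement {
      rate_based_statement {
        limit              = 2000
        aggregate_key_type = "IP"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "rate-limit-per-ip"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "web-prod-waf"
    sampled_requests_enabled   = true
  }
}

resource "aws_wafv2_web_acl_association" "alb" {
  resource_arn = aws_lb.app.arn
  web_acl_arn  = aws_wafv2_web_acl.app.arn
}
```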

Cost optimization (FinOps). Elasticity is the headline lever: target-tracking Auto Scaling means you stop paying for the off-peak fleet that the static estate ran at 8% utilization — capacity tracks load. Beyond that: (1) Fargate Spot for a fraction of the tasks (interruptible web capacity at up to ~70% off — keep a Fargate-on-demand floor for the baseline so a Spot reclaim never drops you below safe capacity); (2) Compute Savings Plans for the steady-state Fargate baseline and Reserved Instances for the long-lived RDS and ElastiCache nodes (these run 24×7, so 1- or 3-year reservations cut 30–60%); (3) gp3 storage (cheaper and independently tunable IOPS vs gp2) and right-sized instance classes guided by Performance Insights and Compute Optimizer; (4) VPC endpoints + CloudFront/S3 for static assets to slash NAT and data-transfer charges (data transfer is the line item teams forget); (5) scheduled scaling to shrink non-prod overnight and weekends; (6) per-team cost allocation tags and a CloudWatch + Cost Explorer showback so each service sees its own bill. The cache itself is a cost optimizer: every read it serves is an RDS query you did not run, which lets the database be smaller.
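
The on-demand floor plus Spot burst in lever (1) can be sketched with a cluster-level default capacity provider strategy. The base and weights below are illustrative assumptions: the first 4 tasks always land on regular Fargate, and tasks above that split 3:2 between Fargate and Fargate Spot.

```hcl
resource "aws_ecs_cluster_capacity_providers" "this" {
  cluster_name       = aws_ecs_cluster.this.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  # On-demand floor: the first 4 tasks are never placed on Spot.
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 4
    weight            = 3
  }

  # Above the floor, roughly 2 of every 5 tasks run on interruptible Spot capacity.
  default_capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 2
  }
}
```

A service only picks up the default strategy when it does not set launch_type, so the service in the skeleton (which sets launch_type = "FARGATE") would need that argument removed or an explicit capacity_provider_strategy block instead.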

Scalability. Independent axes: the app tier scales horizontally via ECS Service Auto Scaling on request-count and CPU (seconds to minutes, scale-to-floor not zero for an always-on web app); the database read path scales by adding read replicas behind a reader endpoint (and by leaning on the cache); the cache scales by enabling cluster mode and adding shards; writes scale vertically (bigger RDS instance) and via RDS Proxy to multiplex connections. When single-writer RDS becomes the ceiling, Aurora (separated storage, up to 15 low-lag replicas, faster failover) is the next step without changing the architecture’s shape. Design every request to be servable by any task (stateless), and keep transactions short so the database stays the part you scale last, not first.
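
The skeleton earlier in the article shows only the request-count policy. A second target-tracking policy on CPU could sit beside it on the same scalable target, as sketched here; the policy name and the 60% target are assumptions to tune per workload.

```hcl
resource "aws_appautoscaling_policy" "cpu" {
  name               = "track-cpu"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = aws_appautoscaling_target.app.scalable_dimension
  service_namespace  = aws_appautoscaling_target.app.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = 60 # average CPU percent across the service's tasks
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```

With several target-tracking policies attached, Application Auto Scaling scales out when any one of them calls for it and scales in only when all of them agree, so adding the CPU policy cannot make the service shrink sooner.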

Reliability and DR (RTO/RPO). Inside the region this architecture is HA by design: the ALB and Fargate tasks span multiple AZs, RDS Multi-AZ fails over automatically in ~60–120 seconds (RPO ≈ 0 for committed transactions), ElastiCache Multi-AZ promotes a replica on node/AZ loss, and one NAT Gateway per AZ removes the shared-egress SPOF. Losing an entire AZ degrades capacity but does not cause an outage, and Auto Scaling backfills the lost tasks in the surviving AZs. Region loss is the harder case and a deliberate trade-off: this is a single-region design, so cross-region DR is warm standby or backup-and-restore, not active-active. The pragmatic posture: cross-region automated backups / RDS snapshot copy and a read replica in a second region for the data (your RPO is the replica lag, seconds to low minutes, or the snapshot cadence for backup-restore); Terraform + the ECR image to stand up the app tier in the DR region (your RTO is “promote the cross-region replica + terraform apply the app stack + flip the Route 53 health-check failover record”, realistically 15–45 minutes warm, hours cold). Geo-replicate (or copy) the ECR image and Secrets to the DR region. The reliability discipline that makes this real is rehearsal: run an AZ-failure game day (kill an AZ’s tasks, force an RDS failover) on a schedule and confirm the blip is a blip, and run a region game day at least annually.
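
Two pieces of that warm-standby posture lend themselves to a short sketch: the cross-region read replica and the Route 53 failover pair. Everything named here is an assumption for illustration: the DR region, the hostname app.example.com, and the variables for the hosted zone, the DR-region ALB, the DR KMS key, and the DR subnet group.

```hcl
variable "hosted_zone_id" { type = string }
variable "dr_alb_dns_name" { type = string }
variable "dr_alb_zone_id" { type = string }
variable "dr_kms_key_arn" { type = string }
variable "dr_db_subnet_group_name" { type = string }

provider "aws" {
  alias  = "dr"
  region = "us-west-2"
}

# Cross-region replica: asynchronous, so replica lag is the RPO.
resource "aws_db_instance" "dr_replica" {
  provider             = aws.dr
  identifier           = "web-prod-db-dr"
  replicate_source_db  = aws_db_instance.app.arn # cross-region sources are referenced by ARN
  instance_class       = "db.r6g.large"
  storage_encrypted    = true
  kms_key_id           = var.dr_kms_key_arn # a key that lives in the DR region
  db_subnet_group_name = var.dr_db_subnet_group_name
  skip_final_snapshot  = true
}

# Primary record: answers while the primary ALB reports healthy targets.
resource "aws_route53_record" "primary" {
  zone_id        = var.hosted_zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "primary"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}

# Secondary record: takes over when the primary is unhealthy.
resource "aws_route53_record" "secondary" {
  zone_id        = var.hosted_zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = var.dr_alb_dns_name
    zone_id                = var.dr_alb_zone_id
    evaluate_target_health = true
  }
}
```

Promoting the replica and applying the app stack in the DR region remain explicit runbook steps; the DNS records only move traffic once something healthy is there to receive it.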

Observability. Three signals plus the load balancer’s view. Metrics: Container Insights for ECS task/cluster CPU/memory and task counts; the ALB’s TargetResponseTime, HTTPCode_Target_5XX_Count, RequestCount, HealthyHostCount, and RejectedConnectionCount; RDS CPU, connections, replica lag, freeable memory, and read/write latency; ElastiCache evictions, hit-rate, and CPU. Logs: application logs and ALB access logs to CloudWatch Logs / S3 (queried with Logs Insights / Athena). Traces: X-Ray / ADOT stitch a request across ALB → task → cache → database so you can see which tier a slow request spent its time in. Alarms do double duty: they drive Auto Scaling and page on-call. Alert on SLOs and symptoms (5XX rate, p99 latency, HealthyHostCount dropping, RDS replica lag, cache hit-rate collapse), not raw CPU; a cache hit-rate that falls off a cliff is an early warning that the database is about to be hammered.
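
Two of those symptom alarms could be sketched as follows. The thresholds, alarm names, and the oncall_sns_topic_arn variable are assumptions; the metric names and dimensions are the standard AWS/ApplicationELB ones.

```hcl
variable "oncall_sns_topic_arn" { type = string }

# Page when p99 target latency stays above 1.5 s for five consecutive minutes.
resource "aws_cloudwatch_metric_alarm" "p99_latency" {
  alarm_name          = "web-prod-p99-latency"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "TargetResponseTime"
  extended_statistic  = "p99"
  period              = 60
  evaluation_periods  = 5
  threshold           = 1.5 # seconds
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.oncall_sns_topic_arn]

  dimensions = {
    LoadBalancer = aws_lb.app.arn_suffix
  }
}

# Page when the target group has fewer healthy tasks than the service floor.
resource "aws_cloudwatch_metric_alarm" "healthy_hosts" {
  alarm_name          = "web-prod-healthy-hosts"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HealthyHostCount"
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = 3
  threshold           = 3
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [var.oncall_sns_topic_arn]

  dimensions = {
    LoadBalancer = aws_lb.app.arn_suffix
    TargetGroup  = aws_lb_target_group.app.arn_suffix
  }
}
```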

Governance. Enforce, do not document. Land this in an AWS Organizations member account governed by Control Tower, with Service Control Policies denying the obvious foot-guns (no public RDS, no 0.0.0.0/0 on database ports, no disabling encryption, region restrictions). AWS Config rules (or conformance packs) continuously assert the invariants (rds-multi-az-support, rds-storage-encrypted, elasticache-redis-cluster-automatic-backup-check, alb-http-to-https-redirection-check, vpc-sg-open-only-to-authorized-ports) and report drift. IAM Identity Center governs human access via groups and short-lived sessions. And the infrastructure-as-code repo’s PR history is your change-management record: every production change is a reviewed, attributed, revertable commit, which is exactly what an auditor wants and exactly what “SSH in and edit the AMI” never provides.
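
Two of those invariants, expressed as AWS managed Config rules in Terraform, might look like this. The sketch assumes a Config recorder is already running in the account (Control Tower normally enables one); the resource names are placeholders.

```hcl
# Flags any RDS instance that is not deployed Multi-AZ.
resource "aws_config_config_rule" "rds_multi_az" {
  name = "rds-multi-az-support"

  source {
    owner             = "AWS"
    source_identifier = "RDS_MULTI_AZ_SUPPORT"
  }
}

# Flags any RDS instance whose storage is not encrypted.
resource "aws_config_config_rule" "rds_storage_encrypted" {
  name = "rds-storage-encrypted"

  source {
    owner             = "AWS"
    source_identifier = "RDS_STORAGE_ENCRYPTED"
  }
}
```

Config rules detect and report drift after the fact; the Service Control Policies are what prevent the change in the first place, so the two are complementary.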

Reference enterprise example

Larkfield Outfitters is a (fictional) mid-market direct-to-consumer outdoor-gear retailer running its storefront and account portal on AWS — roughly 180,000 daily active shoppers, spiky around weekend sales and a brutal Black-Friday peak, run by one platform team and three product squads. They started on a pair of hand-built EC2 instances behind a classic load balancer with a single MySQL box. Two incidents in one quarter forced the rebuild: a database reboot during a patch window took the entire site down for 22 minutes, and a flash-sale login storm saturated the fixed two-instance fleet while an engineer scrambled to launch more by hand. A PCI assessment also flagged the plaintext database password baked into the launch template and a database security group open to the whole VPC.

What they built. One VPC in us-east-1 spanning three AZs, three subnet tiers (public/app/data). The app (a containerized Node storefront and a Java account service) runs on ECS Fargate in the private app subnets, stateless, behind an internet-facing ALB across the three public AZs, with CloudFront + WAF (OWASP + a rate-based rule at 2,000 req/5-min/IP) in front and ACM TLS. They deleted ALB stickiness and moved all session state to ElastiCache (Valkey, Multi-AZ, 1 primary + 2 replicas), which doubled as the cache-aside store for the product catalog and rendered category fragments. The database became RDS PostgreSQL db.r6g.xlarge Multi-AZ with two read replicas behind a reader endpoint for the read-heavy catalog and order-history pages, storage encrypted with a KMS CMK, 14-day PITR, and the master credential managed and rotated by RDS in Secrets Manager; the PCI password finding closed itself, and the account service uses RDS IAM auth for token-based, passwordless connections. ECS Service Auto Scaling target-tracks ALBRequestCountPerTarget at 1,000, floor 4 tasks, ceiling 40, with scheduled scaling to pre-warm to 12 tasks before each weekend sale and a Fargate Spot capacity provider carrying ~40% of tasks above the on-demand floor. One NAT Gateway per AZ plus S3 gateway and ECR/Secrets/CloudWatch interface endpoints cut their data-transfer bill and kept image pulls private. Static assets and uploads moved to S3 behind CloudFront with OAC. Deploys went to blue/green via CodeDeploy with automatic rollback on a failed deployment (the ECS deployment circuit breaker applies only to rolling updates, so CodeDeploy's own rollback takes its place).

The numbers and decisions. Roughly $5,200/month all-in at their normal weekday load: ~$1,500 Fargate (with a 1-year Compute Savings Plan on the baseline and Spot carrying the burst), ~$1,450 RDS (Multi-AZ primary + 2 replicas, on 1-year Reserved Instances), ~$520 ElastiCache (reserved nodes), ~$430 ALB + data processing, ~$520 CloudFront + WAF, ~$380 NAT + the rest. Black-Friday week roughly 2.4×'d the Fargate and ALB lines as Auto Scaling rode demand to 40 tasks and then scaled back down on Monday — they paid for the peak only while it happened. They debated EC2 Auto Scaling Groups vs Fargate and chose Fargate to delete AMI-patching toil for a team of four; they debated Aurora vs RDS PostgreSQL and chose RDS Multi-AZ for now, with Aurora noted as the upgrade path if the single writer becomes the ceiling. They debated keeping stickiness “to avoid a session-store rewrite” and rejected it outright — externalizing sessions to ElastiCache was the keystone that made every other resilience property possible.

The outcome. The database reboot that caused the 22-minute outage became a non-event: a routine RDS minor-version upgrade now triggers a ~70-second Multi-AZ failover that surfaces as a brief retry blip (the connection pool reconnects to the same endpoint) — they verified it in a game day by forcing a failover during business hours and measured 78 seconds to full recovery with zero committed-data loss. The flash-sale meltdown stopped recurring: a Monday-morning login storm now scales the app tier from 4 to 19 tasks in under three minutes automatically, and the cache absorbs the read storm so RDS CPU barely moves (catalog hit-rate ~94%). They ran an AZ-failure game day — terminated every task in one AZ and pulled its NAT — and the site stayed up on the surviving two AZs while Auto Scaling backfilled the lost tasks in ~90 seconds. For region DR they keep a cross-region read replica in us-west-2 and the ECR image + Terraform ready; a rehearsed failover (promote replica, terraform apply the app stack, flip the Route 53 failover record) measured an RTO of ~28 minutes with an RPO under 30 seconds (replica lag). The PCI assessor’s “credentials in config” and “database network exposure” findings were both closed — by Secrets Manager/IAM auth and the SG chain respectively. Net: an application that no longer has a single instance it depends on, that pays for capacity only when it is used, and that a four-person platform team can actually operate — for a bit over $5k/month at steady state.

When to use it

Use this architecture when you run an important, stateful web or API application that must be highly available within a region and elastic with demand — a customer portal, a storefront, a SaaS app, a line-of-business system — and you want it built on managed services with as little undifferentiated operational toil as possible. It is the correct default for “make our web app resilient and scalable” at essentially any size: it scales down to one squad running one product on small instances and up to a regulated enterprise running a fleet of these behind Organizations/Control Tower; the diagram is the same, only the sizes, account count, and guardrail strictness change. The prerequisites are modest: the app must be (or be made) stateless, with its durable state in RDS and its session/hot-read state in ElastiCache.

Trade-offs to accept going in. This is a single-region, intra-region-HA design. It survives instance and AZ failures gracefully and cheaply; surviving a region failure requires the warm-standby DR posture described above (cross-region replica/snapshots + IaC), which is real work you must build and rehearse — it is not active-active multi-region, and pretending otherwise is the most common way these architectures disappoint. You are also accepting the cost of a synchronous standby and replicas/cache nodes running 24×7 (mitigated by reservations) as the price of the availability they buy.

Anti-patterns that quietly defeat the design:

Alternatives, in increasing capability and operational cost: (1) A fully serverless app (API Gateway/CloudFront + Lambda + DynamoDB), when your workload fits an event/request model and you want zero server (and zero cache/DB) operations and scale-to-zero economics; the right choice for spiky or low-baseline apps, but a poorer fit for long-lived connections, heavy relational joins, or “lift-and-shift a stateful web app.” (2) AWS App Runner / Elastic Beanstalk: a managed, opinionated wrapper over roughly this same pattern for teams that want even less to configure and can live with less control. (3) This article (ALB + Fargate Auto Scaling + RDS Multi-AZ + ElastiCache): the workhorse default for a resilient, elastic, stateful web app in one region. (4) The same shape with Aurora instead of RDS, when single-writer RDS or failover speed becomes the constraint and you want up to 15 low-lag replicas and faster failover. (5) Active-active multi-region (Route 53 latency/failover routing, Aurora Global Database or DynamoDB Global Tables), when an entire-region outage is unacceptable and you will pay the substantial complexity and cost of running everywhere at once. Pick the lowest tier that meets your availability and statefulness requirements; most teams reach for multi-region when intra-region HA plus a rehearsed warm standby would have done, and pay for global complexity they did not need. The architecture you can actually operate and rehearse beats the one you merely drew.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
