
Your First Highly Available Web App on AWS

A mid-sized e-learning company runs a Moodle-based course platform for about 60,000 students across a handful of universities, and right now it lives on a single beefy server one of the founders set up three years ago. It works — until it doesn’t. The afternoon before semester exams, ten thousand students log in within the same hour to download study material, the one box runs out of memory, and the whole platform goes dark for forty minutes. Worse, the database password is sitting in a config.php file on that same server, in plain text, and it was once accidentally committed to a Git repo. The new platform lead has a clear mandate: the site must survive losing a server, survive losing a whole data-centre, and stop storing the database password in a file anyone can read. This article is the reference architecture for doing exactly that on AWS — the foundational, “this is how you do it properly the first time” version. It is deliberately not exotic. It is the pattern every team should reach for before anything fancier.

The pressures here are the ones every growing app eventually hits. Availability: a single server is a single point of failure, and “the server died” cannot be the reason 60,000 students miss an exam deadline. Spiky load: traffic is flat most of the term and then spikes 20x the week before exams, so a fixed-size fleet is either wasteful most of the time or too small exactly when it matters. Security: a database password in a config file is a breach waiting to happen, and the team has already been burned by leaking credentials into Git once. And cost: this is a budget-conscious shop, so the design has to be cheap when traffic is low and only spend money when students actually show up. High availability on AWS is the pattern that answers all four at once — by spreading the app across independent failure domains and letting it grow and shrink with demand.

Why not just a bigger server

The tempting shortcut is to buy a bigger box, and it is worth naming why that fails, because someone on the team will suggest it.

One large server still has one power supply, one host, one Availability Zone, and one operating system to patch. When it goes down (and it will, whether for a hardware fault, an OS update, or an AWS maintenance event), the entire platform goes down with it. Making it bigger (vertical scaling) buys headroom but not resilience; you have spent more money to keep the same single point of failure. Two servers behind round-robin DNS are better but crude: resolvers and browsers cache DNS answers, so when one server dies, a chunk of students keep getting sent to the dead one for minutes, and plain round-robin DNS has no idea whether a server is actually healthy.

The real fix is horizontal scaling across Availability Zones with a health-aware load balancer in front. Run several identical app servers, spread them across two physically separate AWS data-centres (AZs), and put a load balancer ahead of them that constantly health-checks each one and only sends traffic to the healthy ones. Now losing a server is a non-event — the load balancer simply stops routing to it — and losing an entire data-centre still leaves you running in the other one. That is what “highly available” actually means: no single failure takes you down.
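To make the health-aware routing idea concrete, here is a toy sketch in Python. It is not how the ALB is implemented; the `Target` and `HealthAwareBalancer` names, the instance names, and the AZ labels are all made up for illustration. It only shows the behaviour described above: unhealthy targets are skipped, so losing a whole AZ leaves traffic flowing to the other one.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    az: str
    healthy: bool = True


class HealthAwareBalancer:
    """Toy round-robin balancer that skips targets marked unhealthy."""

    def __init__(self, targets):
        self.targets = list(targets)
        self._next = 0  # index where the next search starts

    def route(self) -> Target:
        # Check each target at most once, starting where the last request left off.
        for offset in range(len(self.targets)):
            index = (self._next + offset) % len(self.targets)
            candidate = self.targets[index]
            if candidate.healthy:
                self._next = (index + 1) % len(self.targets)
                return candidate
        raise RuntimeError("no healthy targets in any AZ")


# Hypothetical fleet: two instances in each of two AZs.
fleet = [
    Target("app-1", "az-a"),
    Target("app-2", "az-b"),
    Target("app-3", "az-a"),
    Target("app-4", "az-b"),
]
balancer = HealthAwareBalancer(fleet)

# Simulate losing all of az-a: requests are now served from az-b only.
for target in fleet:
    if target.az == "az-a":
        target.healthy = False
served_by = [balancer.route().name for _ in range(4)]
```

In the real service, the `healthy` flag is driven by periodic health checks against each instance, so a failed target is removed after a few failed checks rather than instantly.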

Architecture overview

[Figure: architecture diagram for Your First Highly Available Web App on AWS]

Everything lives inside one VPC (your own private network in AWS) spread across two Availability Zones in a single region. Think of an AZ as an independent data-centre with its own power and cooling; if one catches fire, the other keeps running. The VPC is split into subnets, and the most important early decision is which tiers are public and which are private.

This three-tier split (load balancer → app → database, each more private than the last) is the backbone of the whole design.

The request path, following a student loading a course page:

  1. The student’s browser first reaches the edge. The company runs Akamai as its enterprise edge for global TLS termination, WAF, and bot/DDoS protection (a single security perimeter that fronts this app and the company’s other properties), and Akamai hands traffic on to CloudFront, AWS’s CDN, and the AWS origin. Static assets (the Moodle theme CSS, JavaScript, course images, lecture PDFs) are served from CloudFront’s edge cache close to the student, pulling from an S3 bucket as the origin. This means the bulk of the bytes never touch your servers at all.
  2. The dynamic request (the actual page logic — “show me my enrolled courses”) goes to the ALB in the public subnets. The ALB terminates HTTPS (its certificate managed by AWS Certificate Manager) and looks at its target group of healthy app servers.
  3. The ALB forwards the request to one healthy EC2 instance in the private subnets, choosing among instances in both AZs. The instance is part of an Auto Scaling Group (ASG) that keeps the fleet at the right size and replaces any instance that fails its health check.
  4. The Moodle code on that instance needs to read or write data, so it connects to the RDS database. Crucially, it does not read the password from a file. At startup the instance fetches the DB credentials from AWS Secrets Manager, using its IAM instance role — no password is ever stored on disk or in the code.
  5. RDS runs as Multi-AZ: a primary in one AZ with a synchronous standby in the other. The app only ever talks to a single DNS endpoint; if the primary fails, AWS promotes the standby and re-points that endpoint automatically.

The response flows back the same way: app server → ALB → student. Static bytes came from CloudFront; only the dynamic, personalised part of the page involved your EC2 fleet.

The components, and why each one is here

| Tier | Service | What it does here | Why it gives you HA |
|---|---|---|---|
| Edge / CDN | Akamai → CloudFront + S3 | Serve and cache static assets close to students | Offloads servers; survives origin blips from cache |
| Front door | Application Load Balancer | Single HTTPS entry; health-checks and routes to app servers | Stops sending traffic to instances that fail health checks |
| Compute | EC2 + Auto Scaling Group | Run the Moodle app; grow/shrink with load; self-heal | Spread across 2 AZs; replaces failed instances |
| Database | RDS Multi-AZ | Managed SQL database for courses, users, grades | Synchronous standby in a second AZ; auto-failover |
| Secrets | AWS Secrets Manager | Holds the DB password; rotates it | No plaintext credential on any server |
| Identity (app) | IAM roles | Lets EC2 fetch the secret with no stored keys | Removes long-lived credentials entirely |
| Identity (people) | Okta / Entra ID | SSO for staff into Moodle and the AWS Console | Central control; no shared local logins |

A few of these deserve the why, because they are the choices junior teams most often get wrong.

Why the database is Multi-AZ, not just backed up. A nightly backup protects you from data loss, but restoring it takes time, and your platform is down while you do it. Multi-AZ is about availability: RDS keeps a hot standby copy in the second AZ, synchronously replicated from the primary, and if the primary dies it fails over to the standby, typically within 60–120 seconds, behind the same endpoint name. You want both: Multi-AZ for staying up, and automated backups (plus point-in-time recovery) for getting data back if something corrupts it. They solve different problems.
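A failover still means the endpoint refuses connections for a short window, so the application has to retry rather than give up on the first error. The sketch below shows one common way to do that, capped exponential backoff, in Python. Moodle itself is PHP, so treat this purely as an illustration of the logic; `FlakyEndpoint`, `connect_with_backoff`, and the delay values are all hypothetical.

```python
import time


class FlakyEndpoint:
    """Stand-in for the database endpoint: refuses connections during a failover window."""

    def __init__(self, failures_before_recovery: int):
        self.remaining_failures = failures_before_recovery

    def connect(self) -> str:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("endpoint not accepting connections yet")
        return "connected"


def connect_with_backoff(endpoint, max_attempts=6, base_delay=1.0,
                         max_delay=30.0, sleep=time.sleep):
    """Retry connect() with capped exponential backoff.

    Returns (result, delays) where delays lists the waits between attempts.
    Re-raises ConnectionError if the final attempt also fails.
    """
    delays = []
    for attempt in range(max_attempts):
        try:
            return endpoint.connect(), delays
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Double the wait each time, but never exceed max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            delays.append(delay)
            sleep(delay)


# Simulate an endpoint that recovers after three refused connections.
# sleep is stubbed out so the example runs instantly.
result, waited = connect_with_backoff(FlakyEndpoint(3), sleep=lambda seconds: None)
```

Because the app always reconnects to the same endpoint name, no configuration change is needed after a failover; the retry loop simply succeeds once the standby has been promoted.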

Why Secrets Manager, not a config file or an environment variable. The password in config.php was the bug that started this project. Secrets Manager stores the credential encrypted, hands it out only to identities you authorise via IAM, logs every access in CloudTrail, and can rotate the password automatically on a schedule — when it rotates, it updates both Secrets Manager and RDS together so nothing breaks. The app fetches it at runtime over a private call. There is no file to leak and no password to commit to Git.
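As a rough sketch of what "fetch it at runtime" looks like, the Python below reads a JSON secret through a Secrets Manager client and parses it in memory. The client is passed in so the example runs without AWS access; in a real deployment you would pass `boto3.client("secretsmanager")`, whose `get_secret_value(SecretId=...)` call is the one used here. The secret name `moodle/prod/db` and the JSON key names are assumptions for illustration, and Moodle itself would do the equivalent in PHP.

```python
import json


def load_db_credentials(secrets_client, secret_id: str) -> dict:
    """Fetch a JSON secret and return connection settings; nothing is written to disk."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return {
        "host": secret["host"],
        "port": int(secret["port"]),
        "user": secret["username"],
        "password": secret["password"],
    }


class FakeSecretsClient:
    """Test double that mimics the single client call used above."""

    def get_secret_value(self, SecretId):
        # Hypothetical secret payload; a real one lives encrypted in Secrets Manager.
        payload = {
            "host": "db.example.internal",
            "port": 3306,
            "username": "moodle",
            "password": "not-a-real-password",
        }
        return {"SecretString": json.dumps(payload)}


creds = load_db_credentials(FakeSecretsClient(), "moodle/prod/db")
```

On an EC2 instance with an instance role attached, the SDK picks up temporary credentials automatically, which is why no access key appears anywhere in this code.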

Why an Auto Scaling Group, even at minimum size. Even if you never scaled up, an ASG earns its keep: if an instance crashes or fails its health check, the ASG automatically launches a fresh one from your launch template to restore the desired count. That self-healing costs nothing extra, because the ASG itself is free and you pay only for the instances it runs. On top of that, scaling policies let the fleet grow when CPU or request count climbs, which is exactly what the exam-week spike needs.

Handling the exam-week spike with Auto Scaling

The whole reason a single server failed was the 20x login surge before exams. The ASG turns that from an outage into a non-event. You attach a target-tracking scaling policy that says, in effect, “keep average CPU near 50% — add instances when it climbs, remove them when it drops.” When ten thousand students log in, CPU rises, the ASG launches more instances across both AZs, the ALB starts routing to them as soon as they pass health checks, and the platform absorbs the load. When the rush passes, the ASG scales back down so you stop paying for idle servers.

A minimal scaling policy is just a target and a metric:

{
  "PolicyType": "TargetTrackingScaling",
  "EstimatedInstanceWarmup": 120,
  "TargetTrackingConfiguration": {
    "TargetValue": 50.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    }
  }
}

Two settings make this behave well in the real world. Set a sensible minimum (say 2, one per AZ, so you are always redundant) and a maximum that caps spend even under a runaway spike or an attack. And set EstimatedInstanceWarmup so the ASG waits for new instances to actually boot and warm up before judging whether it needs even more — otherwise it over-reacts and launches a stampede. For predictable calendar events like exam week, you can also add a scheduled action to pre-scale the fleet at 7am before the rush, rather than waiting for CPU to prove it is needed.
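A useful mental model for target tracking is simple proportional arithmetic: if the fleet is running at nearly twice the target CPU, you need roughly twice the instances, clamped to the minimum and maximum. The Python sketch below captures only that rough model. It is not AWS's actual algorithm (which, among other things, scales in more cautiously than it scales out), and the function name and numbers are illustrative.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     minimum: int, maximum: int) -> int:
    """Estimate the fleet size that would bring the average metric near the target.

    The raw estimate is proportional (current * metric / target), rounded up,
    then clamped to the group's minimum and maximum size.
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    raw = math.ceil(current * metric / target)
    return max(minimum, min(maximum, raw))


# Exam-week surge: 2 instances at 95% CPU against a 50% target.
surge = desired_capacity(current=2, metric=95.0, target=50.0, minimum=2, maximum=20)

# Quiet period: 10 instances idling at 10% CPU shrink back to the floor of 2.
quiet = desired_capacity(current=10, metric=10.0, target=50.0, minimum=2, maximum=20)

# Runaway spike: the maximum caps the fleet (and the bill) at 20.
capped = desired_capacity(current=12, metric=90.0, target=50.0, minimum=2, maximum=20)
```

The clamp is the part worth noticing: the minimum keeps one instance per AZ even when traffic is flat, and the maximum is what stops an attack from scaling your bill without limit.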

Security: locking the doors with security groups and IAM

Security in this design is mostly about who is allowed to talk to whom, enforced by security groups — stateful virtual firewalls attached to each tier. The rule of thumb is that each tier only accepts traffic from the tier directly in front of it. You reference security groups as the source, not IP ranges, so the rules keep working as instances come and go.

| Security group | Inbound allowed from | Effect |
|---|---|---|
| alb-sg (load balancer) | 0.0.0.0/0 on 443 (via Akamai/CloudFront) | The internet can reach only the ALB, only on HTTPS |
| app-sg (EC2 fleet) | alb-sg on the app port | Only the ALB can reach app servers; no direct internet |
| db-sg (RDS) | app-sg on 3306/5432 | Only app servers can reach the database |

This chain means even if an attacker found an app server’s private IP, they could not reach it — nothing but the ALB is allowed in. And the database is doubly protected: it is in a private subnet and only the app tier’s security group can connect.
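The chained rules are easy to reason about if you model them as data. The Python sketch below encodes the three groups from the table and checks who can reach whom; it is a toy model for review purposes, not an AWS API. The app port 8080 and the choice of 3306 (MySQL) are assumptions, and the function names are made up.

```python
# Each group maps to the set of (source, port) pairs allowed inbound.
RULES = {
    "alb-sg": {("internet", 443)},
    "app-sg": {("alb-sg", 8080)},   # 8080 is an assumed app port
    "db-sg": {("app-sg", 3306)},    # 3306 assumes MySQL; PostgreSQL would be 5432
}


def allowed(source: str, destination: str, port: int) -> bool:
    """True if the destination group's inbound rules admit this source on this port."""
    return (source, port) in RULES.get(destination, set())


def violations(rules) -> list:
    """List (group, port) pairs where a tier other than the ALB is open to the internet."""
    return [
        (group, port)
        for group, inbound in rules.items()
        for source, port in inbound
        if source == "internet" and group != "alb-sg"
    ]


internet_to_alb = allowed("internet", "alb-sg", 443)    # the only public door
internet_to_app = allowed("internet", "app-sg", 8080)   # blocked: not from alb-sg
alb_to_db = allowed("alb-sg", "db-sg", 3306)            # blocked: the ALB cannot skip a tier
problems = violations(RULES)
```

A check like `violations` is the kind of rule a posture scanner applies continuously: any group other than the load balancer's that admits the internet is flagged.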

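Two more identity layers complete the picture, and they map to two different audiences: IAM roles for the application, which let the EC2 instances fetch the secret with no stored keys, and Okta / Entra ID single sign-on for people, which gives staff centrally controlled access to Moodle and the AWS Console instead of shared local logins.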

For a security-conscious shop, two more guardrails are worth adding from day one without complicating the core design. Wiz (with Wiz Code scanning the Terraform before it is applied) runs continuous cloud-posture checks and would loudly flag exactly the mistakes that hurt before — an S3 bucket gone public, a security group opened to the world, a secret drifting into plaintext. And CrowdStrike Falcon sensors on the EC2 instances provide runtime threat detection on the servers themselves, feeding alerts to whoever is on call. Neither changes the architecture; they are the safety net that catches human error.

Cost: cheap when quiet, paying only for the spike

This is a budget-conscious team, so the design is built to be inexpensive at rest. The biggest savings come from a few deliberate choices.

| Lever | Mechanism | Effect on the bill |
|---|---|---|
| Auto Scaling to demand | ASG runs ~2 instances off-peak, many at peak | You pay for the spike only while it lasts |
| Right-size + Savings Plans | Buy a Compute Savings Plan for the always-on baseline | ~30–50% off the 2 baseline instances |
| Offload static to CloudFront/S3 | Cache assets at the edge from cheap S3 storage | Fewer/smaller EC2 instances; lower data-transfer cost |
| Single-AZ NAT, or none | Endpoints/careful routing to avoid pricey NAT data | Cuts a sneaky recurring cost |
| RDS sized to load | Start small; Multi-AZ doubles DB cost, accepted for HA | Predictable, and the price of staying up |

The one cost to go in with eyes open about is that Multi-AZ RDS roughly doubles the database bill, because you are paying for the standby that sits ready. That is the deliberate price of surviving a data-centre failure, and for a platform that 60,000 students depend on at exam time, it is worth it. Everything else (scaling compute to actual demand, serving static bytes from CloudFront instead of EC2, buying a Savings Plan for the steady baseline) keeps the day-to-day bill low and makes the cost track real usage.
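The "pay only for the spike" argument is easiest to see with a little arithmetic. The Python sketch below compares an autoscaled fleet with a fixed fleet sized for peak. Every number in it is made up for illustration: the hourly rate, the peak size of 20 instances, and the 40 peak hours per month are assumptions, not AWS prices or measurements from this platform.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing estimates


def monthly_compute_cost(hourly_rate: float, baseline: int, peak: int,
                         peak_hours: float) -> float:
    """Cost of running `baseline` instances all month plus the extra
    (peak - baseline) instances for `peak_hours` hours."""
    always_on = baseline * HOURS_PER_MONTH * hourly_rate
    burst = (peak - baseline) * peak_hours * hourly_rate
    return always_on + burst


RATE = 0.10  # hypothetical dollars per instance-hour

# Autoscaled: 2 instances all month, growing to 20 for 40 busy hours.
autoscaled = monthly_compute_cost(RATE, baseline=2, peak=20, peak_hours=40)

# Fixed fleet sized for the spike: 20 instances running all month.
fixed_for_peak = monthly_compute_cost(RATE, baseline=20, peak=20, peak_hours=0)
```

With these made-up inputs the autoscaled fleet comes to about $218 a month against $1,460 for the fixed one. The exact figures do not matter; the shape does, because the burst term is multiplied by a few dozen hours rather than the whole month.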

Operations: how the team actually runs this

Building it is half the job; running it is the other half, and the foundational version still needs a real operating model.

Everything as code with Terraform. The entire stack — VPC, subnets, ALB, ASG, RDS, security groups, the S3 bucket and CloudFront distribution — is defined in Terraform, not clicked together in the console. That means the environment is reproducible, reviewable in a pull request, and you can stand up an identical staging copy in minutes. Ansible handles the inside-the-instance configuration (installing Moodle, PHP, and the agents) so a fresh instance from the ASG comes up correctly every time. Defining infrastructure as code is also what lets Wiz Code scan it for misconfigurations before anything is deployed.

A simple CI/CD pipeline. Code changes flow through GitHub Actions (or Jenkins if the team standardises on it): on a merge, the pipeline runs tests, bakes a new application image, and rolls it out to the ASG instances behind the ALB with health checks gating each step, so a bad deploy never takes the whole fleet down at once. Teams that grow into Kubernetes later add Argo CD for GitOps-style continuous delivery, but for a foundational EC2 fleet, a straightforward pipeline that updates the launch template and triggers an instance refresh is exactly right — don’t over-build it.
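The phrase "health checks gating each step" can be sketched as a loop that replaces instances in batches and stops at the first batch whose replacements fail their checks. The Python below is a toy model of that idea, not the instance refresh API; `rolling_refresh`, the `launch_and_check` callback, and the instance IDs are all hypothetical.

```python
def rolling_refresh(instances, batch_size, launch_and_check):
    """Replace instances in batches, halting at the first unhealthy batch.

    launch_and_check(old_id) should launch a replacement for old_id and
    return True only if the replacement passes its health check.
    """
    replaced = []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        results = [launch_and_check(old_id) for old_id in batch]
        if not all(results):
            # Leave this batch and everything after it on the old version.
            return {"status": "halted", "replaced": replaced,
                    "untouched": instances[start:]}
        replaced.extend(batch)
    return {"status": "complete", "replaced": replaced, "untouched": []}


fleet = ["i-1", "i-2", "i-3", "i-4"]

# A good release: every replacement passes its health check.
good = rolling_refresh(fleet, batch_size=2, launch_and_check=lambda old_id: True)

# A bad release: the replacement for i-3 never becomes healthy.
bad = rolling_refresh(fleet, batch_size=2,
                      launch_and_check=lambda old_id: old_id != "i-3")
```

In the bad case half the fleet is still serving the old, working version, which is the property that keeps a broken deploy from becoming a full outage.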

Monitoring and alerting. CloudWatch collects the basics — CPU, request counts, ALB 5xx errors, RDS connections, healthy host count — and alarms page the on-call engineer when something is wrong. As the team matures, Datadog (or Dynatrace) layers on richer application performance monitoring: distributed tracing of a slow course-page load, dashboards the platform lead watches during exam week, and anomaly detection that surfaces a problem before students notice. Whichever you pick, the metric that matters most is healthy host count behind the ALB — if it drops, you are losing redundancy.
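To show why healthy host count makes a good paging signal, here is a simplified Python model of how a threshold alarm evaluates recent datapoints. It is a sketch of the general idea, assuming the alarm fires only when the last N datapoints all breach; the function name, threshold, and sample values are illustrative, though the three state names match the ones CloudWatch uses.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return "ALARM" if the most recent N datapoints are all below the threshold.

    Returns "INSUFFICIENT_DATA" when fewer than N datapoints exist, else "OK".
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    return "ALARM" if all(value < threshold for value in recent) else "OK"


# Healthy host count per minute; the threshold of 2 means "one instance per AZ".
sustained_drop = alarm_state([4, 4, 2, 1, 1], threshold=2, evaluation_periods=2)
brief_blip = alarm_state([4, 4, 2, 1, 2], threshold=2, evaluation_periods=2)
just_started = alarm_state([1], threshold=2, evaluation_periods=2)
```

Requiring more than one breaching period is what stops a single replaced instance from paging someone at 2am, while a sustained drop below the redundancy floor still does.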

Incidents and change control through ServiceNow. When an alarm fires or a student-affecting outage happens, an incident is raised in ServiceNow so there is a tracked ticket, an owner, and a record — not just a Slack message that scrolls away. Planned changes (a database engine upgrade, a Moodle version bump) go through ServiceNow change management too, which gives the universities the documented, auditable process they expect from a vendor.

A note on virtual appliances: if a university security team mandates a specific third-party firewall or web-application-firewall product, you can run it as a virtual appliance (a vendor’s pre-built EC2 AMI) in the public subnet and route ingress through it. For most foundational builds, AWS’s own ALB plus the Akamai/CloudFront WAF cover this, and adding an appliance introduces its own scaling and HA work — so reach for it only when a compliance requirement actually forces it.

Failure modes, and what each one looks like

Naming the failures before they happen is what separates a design that claims high availability from one that delivers it.

The one failure this single-region design does not survive is an entire AWS region going down. That is a deliberate scope choice: full multi-region active-active is a large step up in cost and complexity, and it is the right next project, not part of the foundational build. What you do have today is solid backups — automated RDS snapshots and point-in-time recovery, plus S3’s built-in durability — so even a regional disaster is recoverable, just not instantly.

Explicit tradeoffs

What this design accepts. It is two-AZ, single-region: it shrugs off a lost server or a lost data-centre, but a regional outage is a recover-from-backups event, not a seamless failover. Multi-AZ RDS doubles the database cost for a standby that mostly sits idle — you are paying an insurance premium. And there is more moving machinery than a single box: a load balancer, an Auto Scaling Group, a managed database, a secrets store, and security groups to reason about. For a junior team, that learning curve is real — but every piece here is a managed AWS service doing the heavy lifting, which is precisely why this is the foundational pattern and not an advanced one.

The alternatives, and when they win. If you genuinely have a tiny, low-stakes internal tool, a single EC2 instance with good backups is cheaper and simpler — just be honest that it has no HA. If your app can be made serverless (Lambda + API Gateway + DynamoDB), you get HA and scaling without managing servers at all, and for new greenfield apps that is often the better starting point — but Moodle is a traditional server-based PHP application, so the EC2 + RDS pattern fits what actually has to run. If you outgrow EC2 fleets and want finer-grained scaling and richer deploys, containers on ECS or EKS (with Argo CD for delivery) are the natural graduation — but moving there before you need it is complexity you will pay for and not use.

The shape of the win

For the e-learning company, the payoff is concrete: the afternoon before exams, ten thousand students log in within the hour, the Auto Scaling Group quietly adds instances across both AZs, CloudFront serves the lecture PDFs from the edge, and the platform stays fast. And when one of the app servers crashes at 2am, the on-call engineer sleeps through it because the ASG has already replaced it. The database password no longer lives in a file; it is in Secrets Manager, rotating on a schedule, fetched at runtime by an IAM role, and there is nothing left to leak into Git. That is the whole promise of foundational high availability on AWS: not that nothing ever fails, but that when things fail (and they will) your students never find out. Start exactly here. It is the right first architecture, and most of what comes later is just refinement on top of these same bones.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
