A public university’s digital learning team gets the call every August: 38,000 students are about to enrol, every one of them logging into Moodle in the same fortnight to find their courses, download reading lists, and submit the first quiz — and last year the LMS fell over on day two of orientation, the help desk took 900 tickets in a morning, and the pro-vice-chancellor wants it never to happen again. The platform also has to survive the genuinely brutal moment in the academic calendar: a 9 a.m. timed online exam where 4,000 students hit “start attempt” inside the same sixty seconds. Moodle is not a website that can shed load gracefully; a student who cannot submit an assessment before the deadline is an academic-appeals case, not a retry. This article is the reference architecture for running Moodle properly on AWS — a horizontally scalable, identity-federated, exam-season-hardened LMS that the university’s IT director and academic registrar will both sign off on.
The pressures here are specific to higher education, and they do not look like a typical SaaS load curve. Demand is spiky and calendar-driven: near-idle over summer, a wall of traffic at enrolment, and sharp synchronous spikes at every exam window. Identity is sprawling and seasonal: tens of thousands of students arrive and leave each year, plus academic and professional-services staff, and every one needs the right access on day one and zero access the day they leave. Availability is non-negotiable during assessments but the budget is a public-sector budget, so paying for peak capacity year-round is not acceptable. And the data is regulated — student records sit under data-protection law, and an exam submission is a record the university may have to defend at an appeals tribunal. The architecture has to bend to all four.
Why Moodle resists the naive lift-and-shift
The instinct is to take the single fat virtual machine the university has run for a decade and move it to one big EC2 instance. That fails for reasons worth naming, because someone on the project will propose it.
Moodle is a PHP application with two pieces of state that make horizontal scaling non-obvious. First, the moodledata directory — the “Moodle data root” — holds uploaded files, the file-cache, session data, and temporary files, and every web node must see the same directory or a file uploaded on node A is invisible on node B. Second, Moodle keeps user sessions server-side by default, so without shared session state a load balancer round-robining requests logs a student out on every other click. A single VM hides both problems by having exactly one of everything; the moment you want a second node for capacity or resilience, both surface at once.
The fix is to externalise every piece of state out of the compute tier so the web nodes become stateless and disposable — which is the precondition for auto-scaling. The moodledata directory moves to Amazon EFS, a shared NFS file system every node mounts. Sessions move to Amazon ElastiCache for Redis, configured as Moodle’s session handler so any node can serve any request. The database moves to Amazon Aurora MySQL. Once state lives outside the instances, an Auto Scaling group can add and kill web nodes freely, and that single capability is what lets the platform breathe with the academic calendar.
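The session half of that problem is easy to see in a toy simulation. The sketch below is not Moodle's session code; the `Node` class, the session ID, and the round-robin balancer are all hypothetical stand-ins, and a plain dict plays the part of Redis.

```python
from itertools import cycle


class Node:
    """A toy web node; `sessions` is wherever this node looks up session IDs."""

    def __init__(self, name, sessions):
        self.name = name
        self.sessions = sessions

    def login(self, session_id, user):
        self.sessions[session_id] = user

    def handle(self, session_id):
        # Returns the logged-in user, or None if this node cannot find the session.
        return self.sessions.get(session_id)


def count_logouts(nodes, requests=6):
    """Log in on the first node, then round-robin requests and count lost sessions."""
    nodes[0].login("sid-1", "student42")
    balancer = cycle(nodes)
    return sum(1 for _ in range(requests) if next(balancer).handle("sid-1") is None)


# Each node keeps its own sessions: requests that land on node "b" are logged out.
local = [Node("a", {}), Node("b", {})]

# Both nodes look in one shared store (standing in for Redis): nothing is lost.
shared_store = {}
shared = [Node("a", shared_store), Node("b", shared_store)]

print("local sessions, lost requests:", count_logouts(local))
print("shared sessions, lost requests:", count_logouts(shared))
```

With two nodes and node-local sessions, half the requests lose the session, which is the "logged out on every other click" behaviour described above; with the shared store none do.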
Architecture overview
The platform separates cleanly into a request-serving path that handles student and staff traffic and an identity/provisioning path that keeps the user population correct. They share nothing at runtime but the user records, and keeping them distinct in your head is the first step to operating this well.
The defining property of the request tier is that the EC2 web nodes hold no durable state. They run Moodle + PHP-FPM + a web server, mount the shared moodledata from EFS, read and write sessions in Redis, and talk to Aurora for everything relational. Any node can be terminated and replaced without a user noticing, which is exactly what an Auto Scaling group does during a scale-in after exams end.
Request path, following the flow:
- A student opens Moodle. DNS resolves to Amazon CloudFront, the CDN that fronts the platform. Static and cacheable assets (theme CSS/JS, course images, the large file downloads that dominate reading-week traffic) are served from CloudFront’s edge, taking that load off the origin entirely. CloudFront also terminates TLS and is the attachment point for AWS WAF managed rules against OWASP Top 10 attack patterns and bad-bot floods.
- Cache-miss and dynamic requests travel to an Application Load Balancer (ALB) in the public subnets. The ALB is a Layer-7 load balancer: it health-checks the web nodes (hitting a lightweight Moodle health endpoint), terminates the connection, and distributes requests across whatever nodes the Auto Scaling group currently has running.
- The request reaches an EC2 web node in a private subnet, one member of the Auto Scaling group spread across at least two Availability Zones. The node runs Moodle, mounts moodledata from EFS over NFS, and on each request reads or writes the user’s session in ElastiCache for Redis.
- Moodle queries Aurora MySQL: reads and writes go to the cluster’s writer endpoint, and you can route heavy read traffic (gradebook reports, course listings) to the reader endpoint to spare the writer.
- Moodle’s application cache (the MUC — Moodle Universal Cache) is also pointed at Redis, so expensive computed data is shared across nodes instead of recomputed per node.
- Background work — sending forum digest emails, processing the gradebook, running scheduled tasks — runs on the Moodle cron, which on a multi-node deployment must run on exactly one node (or a dedicated worker), never on all of them.
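The first step of that flow, deciding which requests the edge may cache and which must pass through to the origin, can be sketched as a simple path classifier. The patterns below are assumptions based on Moodle's versioned theme and script URLs and would need checking against the Moodle version and theme in use. Authenticated file downloads are deliberately left at the origin in this sketch, because caching them safely depends on how access control is handled.

```python
from fnmatch import fnmatchcase

# Hypothetical long-TTL patterns for Moodle's versioned theme and script URLs.
# The real list depends on the Moodle version and theme in use.
EDGE_CACHED_PATTERNS = (
    "/theme/styles.php/*",
    "/theme/image.php/*",
    "/theme/font.php/*",
    "/lib/javascript.php/*",
)


def cache_behaviour(path):
    """Return 'edge-cache' for static asset paths, 'pass-through' for everything else."""
    if any(fnmatchcase(path, pattern) for pattern in EDGE_CACHED_PATTERNS):
        return "edge-cache"
    return "pass-through"


for path in (
    "/theme/styles.php/boost/1700000000/all",
    "/mod/quiz/attempt.php",
    "/login/index.php",
):
    print(path, "->", cache_behaviour(path))
```

In CloudFront terms, each pattern would become a cache behaviour with a long TTL, and the default behaviour would forward everything else to the ALB uncached.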
Identity path, where the seasonal population is managed:
- Okta is the university’s identity provider. Students and staff authenticate to Moodle via SAML 2.0 single sign-on: Moodle is configured as a SAML service provider and Okta as the IdP, so a user logs in once with their university credentials (and Okta’s adaptive MFA) and lands in Moodle without a second password.
- Crucially, SCIM provisioning runs the other direction: Okta pushes user create/update/deactivate events into Moodle (and into the student-information feed), so when a student enrols they get a Moodle account automatically, and the day they graduate or withdraw, Okta deactivates them and their access to course material and exam systems is revoked the same day.

That joiner-mover-leaver automation is what makes 38,000 churning users operationally survivable.
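To make the leaver step concrete, here is a minimal sketch of applying a SCIM 2.0 deactivation message to a user record. It assumes some SCIM receiver sits in front of Moodle (a plugin or small connector service, which this article does not specify); the function names and the user record shape are hypothetical, and only the PatchOp message format comes from the SCIM 2.0 protocol.

```python
import copy

PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


def apply_scim_patch(user, patch):
    """Apply a SCIM 2.0 PatchOp to a user record and return an updated copy.

    Only 'replace' operations are handled, either without a path (value is a
    dict of attributes) or with a simple single-attribute path.
    """
    if PATCH_SCHEMA not in patch.get("schemas", []):
        raise ValueError("not a SCIM PatchOp message")
    updated = copy.deepcopy(user)
    for op in patch.get("Operations", []):
        if str(op.get("op", "")).lower() != "replace":
            raise ValueError(f"unsupported op: {op.get('op')!r}")
        if "path" in op:
            updated[op["path"]] = op["value"]
        else:
            updated.update(op["value"])
    return updated


def can_log_in(user):
    """A deactivated account must not be able to reach courses or exams."""
    return bool(user.get("active"))


student = {"userName": "s1234567", "active": True}
leaver_patch = {
    "schemas": [PATCH_SCHEMA],
    "Operations": [{"op": "replace", "value": {"active": False}}],
}
print(can_log_in(student), can_log_in(apply_scim_patch(student, leaver_patch)))
```

A real receiver would also authenticate the caller with the SCIM token and persist the change; the point here is only that deactivation is a single attribute flip driven by the identity provider rather than an administrator.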
Component breakdown
| Component | AWS service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| CDN & edge | Amazon CloudFront + AWS WAF | Cache static assets & large downloads, TLS, OWASP/bot protection | Long TTL on theme/static paths; pass-through dynamic; WAF managed rule groups |
| Load balancing | Application Load Balancer | L7 distribution + health checks across the web fleet | Health check on Moodle endpoint; sticky-optional (Redis makes it unnecessary); cross-zone |
| Web tier | EC2 Auto Scaling group | Stateless Moodle/PHP nodes that scale with demand | Multi-AZ; launch template w/ golden AMI; target-tracking + scheduled scaling |
| Shared files | Amazon EFS | The shared moodledata root every node mounts | General Purpose mode; mount targets per AZ; bursting/elastic throughput |
| Sessions & cache | ElastiCache for Redis | Server-side sessions + Moodle MUC application cache | Multi-AZ with automatic failover; Moodle session_redis + cache store |
| Database | Aurora MySQL | All relational data; writer + reader endpoints | Multi-AZ writer/replica; reader endpoint for reports; backtrack/automated backups |
| Identity / SSO | Okta | SAML SSO for login; SCIM for lifecycle provisioning | SAML SP in Moodle; SCIM create/deactivate; adaptive MFA; group-to-role mapping |
| Secrets | AWS Secrets Manager | DB credentials, Okta SCIM token, app secrets | Rotation on the DB secret; nodes read via IAM instance role |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on EC2 web/cron nodes | Sensor baked into the AMI; detections streamed to the SOC |
| Observability | Datadog | Metrics, logs, traces, exam-day dashboards & SLOs | Agent on nodes; RUM on the student UX; synthetics on login + submit |
| IaC & delivery | Terraform + GitHub Actions | Infrastructure as code; AMI bake + deploy pipeline | OIDC to AWS (no stored keys); plan-gate; immutable AMI roll |
| ITSM | ServiceNow | Change records for exam-window scaling, incident tickets | Change gate before exam-day config; auto-ticket on Datadog SLO breach |
A few of these choices carry the weight of the design and are the ones teams get wrong.
Why EFS for moodledata, and what it costs you. EFS is the pragmatic answer because it is a fully managed, multi-AZ, NFS-shared file system that every node mounts with no cluster software to run — the simplest way to give a fleet one shared data root. The tradeoff is honest: NFS has higher per-operation latency than a local disk, and Moodle’s file-cache is chatty. You mitigate this by serving the worst offenders — sessions and the application cache — from Redis instead of the filesystem, and by letting CloudFront absorb the large static downloads so they never hit EFS at all. With those two offloads in place, EFS comfortably handles the residual shared-file workload. The alternative — an S3-backed object file system via a Moodle plugin — reduces shared-filesystem dependence further but adds its own moving parts; EFS is the right default for most universities.
Why Redis is load-bearing, not a nice-to-have. On a single node you could let Moodle keep sessions on disk and skip Redis. On a fleet you cannot: without shared sessions a student is logged out on every other click as the ALB moves them between nodes. Pointing Moodle’s session handler and its MUC application cache at a Multi-AZ ElastiCache for Redis cluster solves both the session-consistency and the cross-node-cache problems at once, and it is the component that makes “any node serves any request” actually true.
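```php
// config.php: keep sessions in Redis, not on the shared disk.
// These lines must appear before the require_once of lib/setup.php.
// The MUC application cache is pointed at Redis separately, as a cache store.
$CFG->session_handler_class = '\core\session\redis';
$CFG->session_redis_host = 'moodle-redis.xxxx.cache.amazonaws.com';
$CFG->session_redis_port = 6379;
$CFG->session_redis_prefix = 'mdl_sess_';
$CFG->session_redis_acquire_lock_timeout = 120; // seconds to wait for the session lock
$CFG->session_redis_lock_expire = 7200;         // seconds before a held lock expires
```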
Why Aurora, not a plain RDS MySQL. Aurora MySQL gives you a storage layer that replicates six ways across three AZs, fast failover, and — critically for this workload — reader-endpoint scaling so the read-heavy reporting and course-listing queries that spike during enrolment can be offloaded from the writer. For a university LMS where the database is the single hardest tier to scale horizontally, that headroom matters.
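The reader-endpoint idea comes with one subtlety worth illustrating: a replica can lag the writer slightly, so a session that has just written should keep reading from the writer for a short window. The sketch below shows that routing rule only. The class, the endpoint names, and the one-second window are hypothetical, and in a real deployment this decision belongs to Moodle's database layer and its configuration, not to custom code.

```python
import time


class EndpointRouter:
    """Send plain reads to the reader endpoint unless this session wrote recently."""

    def __init__(self, writer, reader, lag_window=1.0, clock=time.monotonic):
        self.writer = writer
        self.reader = reader
        self.lag_window = lag_window  # seconds to stay on the writer after a write
        self._clock = clock
        self._last_write = None

    def endpoint_for(self, sql):
        statement = sql.strip().lower()
        # Locking reads must go to the writer, so they are not treated as plain reads.
        is_plain_read = statement.startswith("select") and "for update" not in statement
        now = self._clock()
        if not is_plain_read:
            self._last_write = now
            return self.writer
        if self._last_write is not None and now - self._last_write < self.lag_window:
            return self.writer  # avoid reading stale rows from a lagging replica
        return self.reader


router = EndpointRouter("writer.example.internal", "reader.example.internal")
print(router.endpoint_for("SELECT id, fullname FROM mdl_course"))
print(router.endpoint_for("UPDATE mdl_quiz_attempts SET state = 'finished' WHERE id = 1"))
print(router.endpoint_for("SELECT state FROM mdl_quiz_attempts WHERE id = 1"))
```

The first query goes to the reader; the update and the read that immediately follows it both go to the writer, which is what keeps a student from submitting an attempt and then seeing it as still open.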
Scaling for the academic calendar
This is the part of the architecture that justifies the whole exercise, because Moodle’s load is not a smooth curve — it has two distinct shapes, and you scale for each differently.
Predictable, calendar-driven peaks (enrolment, exam weeks) are handled with scheduled scaling. You know the exam timetable weeks ahead, so you tell the Auto Scaling group to raise its minimum capacity before the 9 a.m. exam window and lower it after. Reactive scaling alone is dangerous here: an exam spike arrives in under a minute, and EC2 instances take a few minutes to boot, mount EFS, and pass health checks — by the time target-tracking reacts, the exam has started and 4,000 students are queued. Pre-warming with scheduled scaling means the capacity is already running when the gun goes off.
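A pre-warm can be derived mechanically from the exam timetable. The sketch below builds the parameters for two scheduled actions per exam, one that raises the floor ahead of the start and one that releases it afterwards. The function name, the 45-minute lead, the three-hour hold, and the students-per-node figure are all assumptions; the per-node figure in particular should come from a load test, not from this example.

```python
from datetime import datetime, timedelta, timezone
from math import ceil


def prewarm_actions(group, exam_start, students, students_per_node, normal_min,
                    lead=timedelta(minutes=45), hold=timedelta(hours=3)):
    """Build two scheduled-action parameter sets: raise the floor early, release it later."""
    peak = max(normal_min, ceil(students / students_per_node))
    stamp = exam_start.strftime("%Y%m%dT%H%M")
    return [
        {   # Capacity is running well before students press "start attempt".
            "AutoScalingGroupName": group,
            "ScheduledActionName": f"exam-prewarm-{stamp}",
            "StartTime": exam_start - lead,
            "MinSize": peak,  # the group's MaxSize must already be >= this value
        },
        {   # Hand control back to target-tracking once the window has closed.
            "AutoScalingGroupName": group,
            "ScheduledActionName": f"exam-release-{stamp}",
            "StartTime": exam_start + hold,
            "MinSize": normal_min,
        },
    ]


exam = datetime(2025, 5, 12, 9, 0, tzinfo=timezone.utc)
for action in prewarm_actions("moodle-web", exam, students=4000,
                              students_per_node=100, normal_min=4):
    print(action["ScheduledActionName"], action["StartTime"].isoformat(), action["MinSize"])
```

Each dictionary is shaped so it could be passed as keyword arguments to the Auto Scaling `put_scheduled_update_group_action` call in boto3, or translated into the equivalent Terraform resource. Raising the minimum rather than the desired capacity means target-tracking cannot scale the fleet back in before the exam starts.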
Unpredictable, gradual load (a viral assignment, an unexpected reading-week surge) is handled with target-tracking scaling on a metric that reflects real saturation — average CPU across the fleet, or ALB request count per target. This adds nodes when sustained load climbs and removes them when it falls, keeping cost down during the long quiet stretches.
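A rough way to reason about what target-tracking will do is a proportional model: if the fleet is running hotter than the target, it needs proportionally more nodes. The function below is only that back-of-envelope model, not AWS's actual algorithm, which is not published in detail and also involves warm-up periods and deliberately conservative scale-in. The names and numbers are illustrative.

```python
from math import ceil


def desired_capacity(current_nodes, metric_value, target_value, min_size, max_size):
    """Proportional estimate of the fleet size that brings the metric back to target."""
    if current_nodes < 1 or target_value <= 0:
        raise ValueError("need at least one node and a positive target")
    wanted = ceil(current_nodes * metric_value / target_value)
    # The Auto Scaling group never goes outside its configured bounds.
    return max(min_size, min(max_size, wanted))


# Six nodes averaging 80% CPU against a 50% target: the model asks for ten.
print(desired_capacity(6, metric_value=80, target_value=50, min_size=2, max_size=40))
# The same fleet idling at 20% CPU: the model shrinks it to three.
print(desired_capacity(6, metric_value=20, target_value=50, min_size=2, max_size=40))
```

The same arithmetic applies to ALB request count per target; only the metric and target value change.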
A pragmatic combined policy:
| Calendar phase | Scaling approach | Web-tier posture |
|---|---|---|
| Summer / inter-term | Target-tracking only, low minimum | Small fleet, scale to floor |
| Enrolment fortnight | Scheduled min-capacity bump + target-tracking | Larger floor, headroom for surges |
| Exam window (per timed exam) | Scheduled pre-warm 30–60 min ahead, hold, then release | Peak fleet running before the start time |
| Normal teaching weeks | Target-tracking | Moderate, demand-following |
Three correctness rules keep multi-node Moodle honest at scale:
- Run cron on exactly one node, either a dedicated small instance or a single designated member, because Moodle’s scheduled tasks are not all safe to run concurrently across the fleet.
- Bake a golden AMI with Moodle, PHP, and the EFS/Redis config pre-installed so a scaling event adds a ready node in a couple of minutes rather than configuring from scratch under load.
- Mind Aurora connections. Every PHP worker on every web node can hold its own database connection, so a large scale-out can exhaust the database’s connection limit; size max_connections and the per-node worker count together, and consider a connection proxy such as Amazon RDS Proxy if the fleet gets large.
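The connection rule is simple arithmetic, and it is worth running before the exam-day fleet size is fixed. The sketch below assumes the worst case of one database connection per PHP worker and reserves a slice of the limit for cron, admin sessions, and monitoring; the function names and all the figures are hypothetical.

```python
def max_web_nodes(max_connections, workers_per_node, reserved=50):
    """Largest fleet whose PHP workers fit inside the database connection limit.

    Assumes one connection per worker and `reserved` connections kept back
    for cron, admin sessions, and monitoring.
    """
    usable = max_connections - reserved
    if workers_per_node < 1 or usable < 0:
        raise ValueError("invalid connection budget")
    return usable // workers_per_node


def fleet_fits(nodes, max_connections, workers_per_node, reserved=50):
    """True if `nodes` web nodes cannot exhaust the connection limit."""
    return nodes <= max_web_nodes(max_connections, workers_per_node, reserved)


# Illustrative figures: a 2,000-connection limit and 40 PHP workers per node.
print(max_web_nodes(max_connections=2000, workers_per_node=40))
print(fleet_fits(40, max_connections=2000, workers_per_node=40))
print(fleet_fits(60, max_connections=2000, workers_per_node=40))
```

With these figures the ceiling is 48 nodes, so a 40-node exam fleet fits and a 60-node one does not. If the planned peak exceeds the ceiling, the options are fewer workers per node, a higher connection limit, or a proxy in front of the database.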
Enterprise considerations
Security and access. The platform is defence-in-depth. Network: the web nodes sit in private subnets reachable only via the ALB; EFS, ElastiCache, and Aurora are reachable only from the web-tier security group, never the internet; egress for OS patching goes through a NAT gateway. Identity: Okta is the single front door — SAML SSO with adaptive MFA means no local Moodle passwords for the population, and Okta group-to-role mapping drives Moodle authorisation, so “members of the staff-academic Okta group become teachers, students become students” is enforced at the identity layer rather than hand-managed in Moodle. Secrets — the Aurora credentials, the Okta SCIM token — live in AWS Secrets Manager with rotation on the database secret, and nodes read them through an IAM instance role so nothing is baked into the AMI or config.php. CrowdStrike Falcon sensors run on the EC2 nodes (baked into the golden AMI) for runtime threat detection feeding the university SOC, and AWS WAF at CloudFront blocks the OWASP-pattern and credential-stuffing traffic that perpetually targets education platforms. Data at rest is encrypted with KMS on EFS, Aurora, ElastiCache, and the S3 backup targets.
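The "nothing baked into the AMI" rule for secrets looks like this in practice. The sketch is in Python for illustration, for example as part of a node bootstrap script, since Moodle itself is PHP. It assumes the database secret is stored as a JSON document with `host`, `port`, `username`, and `password` keys, and the secret name in the usage note is hypothetical. The client is injectable so the function can be exercised without AWS access.

```python
import json


def load_db_credentials(secret_id, client=None):
    """Fetch a JSON secret and return only the fields the application needs."""
    if client is None:
        # Imported lazily so the function can be tested without boto3 installed.
        # No keys are passed: credentials come from the node's IAM instance role.
        import boto3
        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return {key: secret[key] for key in ("host", "port", "username", "password")}
```

On a web node this would be called as `load_db_credentials("moodle/aurora")` at boot or deploy time, with the result used to render the database settings. Because rotation changes the password, anything that caches the result needs to re-read the secret when a connection is refused.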
Cost optimisation. A public-sector budget rewards paying only for what the calendar demands.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Scale to the calendar | Scheduled scale-in over summer/inter-term; small floor | Avoids paying for peak 12 months a year |
| Commitment for the baseline | Savings Plans / Reserved capacity on the always-on floor | Discount on the steady minimum fleet |
| Spot for stateless burst | Spot Instances for surge capacity in the ASG | Cheap headroom for non-exam burst (keep on-demand for exams) |
| EFS lifecycle | Move infrequently-accessed moodledata files to EFS-IA | Lowers storage cost on cold course archives |
| CloudFront offload | Edge-cache static + large downloads | Cuts EC2/EFS/data-transfer cost during reading weeks |
The single biggest saving is the first one: a university LMS is genuinely near-idle for months, and an architecture that scales to a small floor in those months — instead of running enrolment-day capacity year-round — is the difference that makes this design fit the budget. Surface the spend in a Datadog cost-and-utilisation dashboard the IT director reviews, and gate any exam-window capacity change behind a ServiceNow change record so the registrar’s office has a documented trail.
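The size of that first saving is easy to estimate in node-hours. Every figure in the sketch below is invented for illustration: the phase lengths, the average fleet size in each phase, and the 40-node peak are placeholders to be replaced with numbers from the university's own calendar and load tests.

```python
HOURS_PER_WEEK = 7 * 24


def node_hours(phases):
    """Total instance-hours for a list of (weeks, average_running_nodes) phases."""
    return sum(weeks * HOURS_PER_WEEK * nodes for weeks, nodes in phases)


# Invented figures for illustration only; the phases add up to 52 weeks.
calendar_scaled = [
    (14, 2),   # summer / inter-term
    (2, 16),   # enrolment fortnight
    (6, 12),   # exam weeks, averaged over pre-warmed windows and quiet hours
    (30, 6),   # normal teaching weeks
]
peak_all_year = [(52, 40)]

scaled = node_hours(calendar_scaled)
flat = node_hours(peak_all_year)
print(f"calendar-scaled: {scaled} node-hours, flat peak: {flat} node-hours")
print(f"saving: {1 - scaled / flat:.0%}")
```

With these placeholder inputs the calendar-scaled fleet uses 15% of the node-hours of a fleet held at peak all year. The exact ratio will differ, but the shape of the result is what justifies the scaling work.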
Reliability and DR (RTO/RPO). Decide the numbers per tier. The web tier is stateless and Multi-AZ by construction — lose an AZ and the Auto Scaling group replaces nodes in the survivor. Aurora is Multi-AZ with automatic failover in seconds, and automated backups plus point-in-time recovery cover data loss. EFS is regionally redundant across AZs natively. ElastiCache runs Multi-AZ with automatic failover so a session-store node loss does not log everyone out. A pragmatic posture for a single-region deployment: RTO of minutes for an AZ failure (automatic) and a documented cross-region restore path — Aurora snapshots and EFS backups (via AWS Backup) copied to a second region — for the rare regional event, accepting a larger RTO there in exchange for not paying for a hot standby region. Whatever you choose, rehearse the exam-day failure: a database failover at 9:05 a.m. mid-exam is the scenario that actually matters, so test that Moodle reconnects cleanly and in-flight attempts are not lost.
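When rehearsing that failover, it helps to turn "did Moodle reconnect cleanly" into a number that can be compared with the agreed RTO. A minimal sketch, assuming a probe that records a timestamp and a pass/fail result every few seconds during the drill; the probe itself, the five-second interval, and the 60-second RTO are hypothetical.

```python
def longest_outage(probes):
    """Longest failing stretch in seconds.

    `probes` is a time-ordered list of (timestamp_seconds, ok) pairs. An outage
    runs from the first failed probe to the next successful one.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in probes:
        if not ok and outage_start is None:
            outage_start = timestamp
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    if outage_start is not None:
        # Still failing when the drill ended: count up to the last probe.
        longest = max(longest, probes[-1][0] - outage_start)
    return longest


# Probe every 5 seconds across a forced database failover.
drill = [(0, True), (5, True), (10, False), (15, False), (20, False), (25, True), (30, True)]
rto_seconds = 60
outage = longest_outage(drill)
print(f"longest outage: {outage:.0f}s, within RTO: {outage <= rto_seconds}")
```

The measured figure is only as fine-grained as the probe interval, and it says nothing about in-flight quiz attempts, which still need to be checked by hand after the drill.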
Observability and the exam-day runbook. Instrument with Datadog: agents on the nodes for infrastructure metrics, log collection from Moodle and the web server, real-user monitoring (RUM) on the student experience, and — the ones that matter on exam morning — synthetic checks that log in via SSO and submit a test quiz every minute, so you find out the submit path is broken before a student does. Define SLOs on login success rate, page latency, and quiz-submission success, and route an SLO breach to ServiceNow as an incident so the bridge has a ticket, not just a dashboard going red. For each major exam window, run a pre-flight: confirm scheduled scaling fired and the fleet is at peak, confirm Aurora and Redis are healthy, confirm CloudFront is serving, and have the on-call bridge open before the start time.
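The SLO on quiz submission reduces to a success rate over the synthetic checks in a window. The sketch below shows the calculation only; in this architecture Datadog evaluates it and raises the ServiceNow incident. The 99% target and the function name are assumptions.

```python
def slo_status(results, target=0.99):
    """Return (success_rate, breached) for a window of synthetic check results.

    `results` is an iterable of booleans, one per check run.
    """
    results = list(results)
    if not results:
        raise ValueError("no synthetic results in the window")
    success_rate = sum(results) / len(results)
    return success_rate, success_rate < target


# A three-hour exam with one login-and-submit check per minute is 180 runs.
one_failure = [True] * 179 + [False]
two_failures = [True] * 178 + [False] * 2

for label, window in (("one failure", one_failure), ("two failures", two_failures)):
    rate, breached = slo_status(window)
    print(f"{label}: success rate {rate:.4f}, breached: {breached}")
```

At one check per minute, a 99% target over a single exam tolerates one failed run and breaches on the second, which is a useful reminder that a short window leaves very little error budget.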
Explicit tradeoffs
Accept these or do not build it this way. Making Moodle horizontally scalable is real engineering, not a checkbox: you take on a shared filesystem (EFS) with NFS latency characteristics, an external session/cache store (Redis) that is now a dependency in the login path, a multi-endpoint database, and the operational discipline of running cron on exactly one node and baking AMIs for fast scale-out. A single big VM is genuinely simpler to stand up — and for a small college with a few hundred users and no synchronous exams, it may be the right call. This architecture earns its complexity precisely at university scale, where the calendar peaks and the exam-day stakes make a single node a liability. The Okta SAML/SCIM integration adds setup work and a federation hop that a tiny deployment with local accounts would skip, but it is the only sane way to manage tens of thousands of arriving and departing users.
The alternatives, and when they win. If you would rather not operate any of this, managed Moodle hosting (a Moodle Partner / MoodleCloud) hands the whole stack to a vendor — less control, less tuning headroom, but no infrastructure to run, and the right choice for a team without cloud-ops capacity. If your institution is consolidating everything onto Kubernetes, Moodle can run as containers on EKS with the same external-state pattern (EFS via the CSI driver, Redis, Aurora) — more powerful autoscaling and bin-packing, but a steeper operational bar than an Auto Scaling group, and worth it only if you already run EKS. And if you want the identical pattern on another cloud, it maps cleanly: Azure with App Service or VM Scale Sets + Database for MySQL Flexible Server + Azure Files + Cache for Redis, or GCP with managed instance groups + Cloud SQL + Filestore + Memorystore. AWS is the worked example here; the externalise-state-then-autoscale principle is portable.
The shape of the win
For the university, the payoff is not “Moodle in the cloud.” It is that on the morning 4,000 students start a timed exam at 9 a.m. sharp, the web fleet was already pre-warmed to peak by scheduled scaling, CloudFront absorbed the rush for the question PDFs, Redis kept every session stable as the ALB spread the load, Aurora served the writes, and every student submitted before the deadline — no 900-ticket morning, no academic-appeals queue. And it is that when a student graduates in July, Okta deactivates them and their access to course material disappears the same day, with no administrator touching Moodle. Everything upstream — the stateless nodes, the EFS data root, the Redis session store, the Aurora cluster, the Okta SAML/SCIM federation, the Datadog synthetics on the submit path — exists so that the IT director, the registrar, and the budget holder each say yes. The architecture here is the destination for an LMS that has to survive the academic calendar; start with the externalised-state foundation, and this is where it has to land.