
Education Platform Scaling for a National Exam Day on GCP

A national testing authority that runs a country’s university-entrance exam gives its platform team a deadline with no slack in it: on a single Saturday in May, 1.4 million candidates will press “Start” within the same ninety-second window, because the exam is timed identically nationwide and the proctors release it on a synchronized clock. The board chair has lived through the alternative — two years ago the legacy provider’s load balancer melted under the thundering herd, candidates stared at spinners for eleven minutes, the press called it a “rigged exam,” and the resulting court case forced a full re-sit at a cost that dwarfs a decade of cloud bills. The new mandate is unambiguous: the platform must absorb a cold-start spike from near-zero to peak in under two minutes, stay up through a hostile internet, and never lose a single submitted answer — because a lost answer is not a glitch, it is a candidate’s university place. This is the reference architecture for building that platform on Google Cloud, where the engineering problem is not steady-state throughput but the brutal, unforgiving shape of the load curve.

The pressures here are unlike a normal SaaS app, and naming them sets up every decision that follows. The spike is synchronous and predictable to the minute: you know the exact Saturday and the exact second, which is a gift, because it means you can pre-warm instead of react. The load is bursty, not sustained: peak concurrency lasts three hours and traffic is near-zero the rest of the quarter, so paying for peak capacity year-round is indefensible. Correctness is sacred: answer autosave must be durable and globally consistent, because a candidate who answers in one region and reconnects through another must see their work. And the event is a magnet for attack: a DDoS during the entrance exam, or a single cheater scripting the API, is worth real money to someone, so the platform is hostile-internet-facing by definition. GCP gives you the primitives (GKE, Cloud Spanner, Memorystore, Cloud Armor), but the architecture is in how you compose them for this curve.

Why the obvious approaches fail

Three shortcuts will be proposed in the first planning meeting, and each fails in a way worth naming before someone builds it.

“Just turn on the Horizontal Pod Autoscaler and let it react.” Reactive autoscaling watches CPU, notices the spike, and then asks for pods. But GKE has to schedule those pods onto nodes, and if the nodes do not exist, the Cluster Autoscaler must first provision Compute Engine VMs; only then can the new nodes pull container images and the pods pass their readiness probes. That control loop takes minutes you do not have when the entire country starts in ninety seconds. By the time capacity arrives, the ramp is over and the candidates have already seen the spinner. Reactive scaling is a strategy for gradual load; it is a guaranteed outage for a synchronized cold start.

“Run it on a single regional SQL database.” A national exam means national users, and a single Cloud SQL primary becomes the write bottleneck for 1.4 million autosaving sessions while also being a single point of failure and a single region of latency. Sharding it by hand reintroduces every distributed-systems problem you were trying to avoid, and a failover mid-exam loses in-flight transactions — the one thing you swore would never happen.

“Cache everything and hope.” Putting a CDN in front and calling it done handles the static exam shell, but the exam interaction — fetching the next question, saving an answer, syncing a timer — is dynamic, per-candidate, and write-heavy. You cannot cache a write. The dynamic path is the whole problem, and a CDN does not touch it.

The architecture below threads the needle: pre-provision against a known schedule instead of reacting, use a horizontally-scalable consistent database built for exactly this write pattern, absorb reads at the edge and in-memory, and put the dynamic write path behind autoscaling compute that was warmed before the gun fired.

Architecture overview

[Figure: architecture diagram for Education Platform Scaling for a National Exam Day on GCP]

The platform separates three traffic classes that scale on entirely different curves, and keeping them distinct is the first discipline of operating it well: the static delivery path (the exam shell, JS bundles, instructions — heavy on read, trivially cacheable), the dynamic exam path (fetch question, autosave answer, sync timer — the write-heavy core that everything else exists to protect), and the proctoring/admin path (invigilators monitoring sessions, releasing the exam, handling incidents — low volume, high privilege).

The defining property of the whole topology is the one the board cares about: capacity is provisioned ahead of a known clock, not discovered after the spike. This is a scheduled event, so the architecture treats the exam start like a rocket launch — everything is warm, primed, and load-tested before T-0.

Dynamic exam path, following the request flow at T-0:

  1. A candidate’s browser, already showing the pre-loaded exam shell, hits Akamai at the edge. Akamai serves all static assets from cache (so 1.4 million shell loads never touch GCP), terminates TLS, and runs as the first DDoS and bot-mitigation layer with rate controls tuned for exam-flood patterns. Only genuine dynamic API calls are forwarded to the origin.
  2. Forwarded traffic reaches Google Cloud Load Balancing (global external Application Load Balancer) with Cloud Armor attached as the second, GCP-native defense layer — WAF rules, per-IP and per-token rate limiting, and Adaptive Protection ML-driven L7 DDoS detection. Cloud Armor is also where a scripted cheater hammering the answer-submit endpoint gets throttled before it reaches the application.
  3. The candidate’s session is authenticated by a token minted at login. Candidates authenticate through the exam platform’s own identity service backed by Identity Platform, while proctors and administrators sign in through Okta as the workforce IdP — Okta enforces MFA and conditional access for invigilators, and federates to GCP so an invigilator’s privileged session is a first-class, auditable identity separate from any candidate.
  4. The request lands on the exam-session service running on GKE — a regional, multi-zone cluster that was pre-scaled to peak before the exam window opened (more on the predictive mechanism below). The service reads and writes the candidate’s live exam state.
  5. Hot, per-session state — the current question pointer, the server-authoritative countdown timer, recent autosaves — lives in Memorystore for Redis, which absorbs the brutal read/write rate of timers ticking and answers saving without hammering the database on every keystroke.
  6. The durable answer of record is written to Cloud Spanner. Every autosave is an upsert into Spanner, which gives horizontal write scalability and external (globally strong) consistency — so a candidate who drops Wi-Fi and reconnects through a different region sees exactly the answers they saved, with no lost write and no stale read. This is the property that lets the authority promise “we will never lose your answer.”
  7. Secrets the services cannot derive from workload identity — third-party proctoring-vendor API keys, the Okta introspection secret, exam-content decryption keys — come from HashiCorp Vault via a sidecar with GCP-backed auth, so nothing sensitive sits in a Kubernetes Secret. Critically, the encrypted exam paper itself is decrypted only at the synchronized release time, with Vault holding the key until the clock says go.

Proctoring/admin path, lower volume and higher privilege: invigilators use a console (Okta-gated) to monitor live sessions, flag anomalies, grant time accommodations, and — the highest-stakes action — trigger the synchronized exam release. That release is a control-plane event that flips a flag in Spanner and pushes via the real-time channel to every candidate at once.

Component breakdown

| Component | Service / tool | Role on exam day | Key configuration choices |
| --- | --- | --- | --- |
| Edge / static | Akamai | Serve cached exam shell to 1.4M browsers; first DDoS/bot layer | Cache static bundles; flood rules on dynamic API; origin shield to GCLB |
| L7 load balancing + WAF | Cloud Armor + Global ALB | GCP-native WAF, rate limiting, Adaptive Protection DDoS | Per-token rate limits; Adaptive Protection on; preconfigured OWASP rules |
| Candidate identity | Identity Platform | Candidate auth, session tokens, sign-in throttling | Short-lived tokens; per-IP sign-in limits; passwordless option |
| Workforce SSO | Okta | Proctor/admin SSO, MFA, conditional access, federation to GCP | OIDC federation; step-up MFA for “release exam”; group claims to authz |
| Exam-session compute | GKE (regional, multi-zone) | The dynamic exam logic: fetch question, autosave, timer sync | Pre-scaled to peak; PDB; multi-zone spread; node pools warmed |
| Hot session state | Memorystore for Redis | Live timer, question pointer, recent autosaves; read shock absorber | Standard tier (HA); read replicas; per-session keyspace |
| Durable answers | Cloud Spanner | Globally consistent answer-of-record; horizontal write scale | Multi-region or regional+read replicas; interleaved session tables |
| Real-time channel | Pub/Sub + WebSocket gateway | Synchronized release, time warnings, proctor pushes | Fan-out to candidate connections; backpressure handling |
| Secrets & exam keys | HashiCorp Vault | Vendor keys, Okta secret, exam-paper decryption key | GCP auth method; dynamic leases; key released at start time only |
| Predictive scaling | Custom controller + Cloud Monitoring | Pre-warm nodes/pods against the known schedule | Scheduled scale-up; custom metric HPA; surge node pool |
| CSPM / posture | Wiz + Wiz Code | Cloud posture, exposure, IaC scanning before exam day | Agentless scan of GKE/Spanner; Wiz Code gates Terraform PRs |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on GKE nodes | Sensor on node pools; detections to the SOC |
| Observability | Datadog | Real-time dashboards, the war-room view, SLO burn alerts | Agent on GKE; RUM on candidate client; live concurrency metric |
| ITSM / incident | ServiceNow | Exam-day incident bridge, change gate, accommodation tickets | Major-incident workflow; change freeze gate; auto-ticket on SLO breach |
| CI / IaC | GitHub Actions + Argo CD + Terraform | Build/test/load-test pipeline; GitOps deploy; infra as code | OIDC to GCP (no stored creds); Argo CD syncs cluster; load-test gate |

A few of these choices carry the architecture and deserve the why, because they are the ones teams get wrong on a spike workload.

Why predictive (scheduled) autoscaling, not reactive. Because the exact start time is known, the platform does not wait for CPU to climb — a scheduled scale-up drives the Cluster Autoscaler and HPA to peak capacity before the exam window, holds it warm through the event, and scales back down after. The Horizontal Pod Autoscaler is still configured, but on a custom metric — active exam sessions — not CPU, because CPU lags the real signal and active-session count rises with the herd. A dedicated surge node pool is provisioned and warmed (images pre-pulled, a low-priority “balloon” Deployment holding the nodes) so that when real pods land they schedule in seconds, not minutes. The mental model: you are not autoscaling into the spike, you are pre-staging for it and using autoscaling only to trim.

# HPA on a business metric (active exam sessions), with a high floor
# pre-staged for the known start time — not reacting to CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: exam-session-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: exam-session
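  # Standing floor. The scheduled job raises this before T-0 so the herd lands on
  # warm pods: 1.4M sessions at 1,200 per pod needs about 1,167 replicas.
  minReplicas: 600
  # Ceiling: 1,200 pods x 1,200 sessions = 1.44M, roughly 3% above the expected 1.4M.
  maxReplicas: 1200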
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_exam_sessions_per_pod
        target:
          type: AverageValue
          averageValue: "1200"
  behavior:
    scaleUp:
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30   # allow fast doubling if the floor is under-set
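
The article does not specify what the scheduled job that raises the floor looks like. One possible shape, sketched here under assumptions, is a Kubernetes CronJob that patches the HPA's minReplicas ahead of the window. The namespace `exam`, the ServiceAccount name `hpa-floor-setter`, the container image, and the date are all hypothetical; the value 1167 comes from dividing 1.4 million sessions by the 1,200-per-pod target in the manifest above.

```yaml
# Sketch only. Namespace, names, image, and schedule are illustrative assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hpa-floor-setter
  namespace: exam
rules:
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    resourceNames: ["exam-session-hpa"]   # scope the permission to this one HPA
    verbs: ["get", "patch"]
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: raise-exam-floor
  namespace: exam
spec:
  schedule: "0 6 17 5 *"            # hypothetical date: 06:00 on 17 May, well before T-0
  timeZone: "Etc/UTC"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          serviceAccountName: hpa-floor-setter   # must be bound to the Role above
          restartPolicy: Never
          containers:
            - name: patch-hpa
              image: registry.example.com/tools/kubectl:1.30   # placeholder image that ships kubectl
              command:
                - kubectl
                - patch
                - hpa
                - exam-session-hpa
                - --namespace=exam
                - --type=merge
                - --patch
                - '{"spec":{"minReplicas":1167}}'
```

The ServiceAccount and RoleBinding are omitted for brevity, and a mirror job with a smaller value would lower the floor after the event. Because Argo CD reconciles the cluster to Git, a live patch like this has to be squared with the GitOps flow, for example by committing the floor change to the repository instead of patching in place.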

Why Cloud Spanner instead of Cloud SQL. The autosave workload is a high-volume, geographically distributed write stream that must be strongly consistent and must never lose a committed transaction. Spanner is purpose-built for exactly this: it scales writes horizontally by splitting data across nodes, gives external consistency (the strongest guarantee, so a strong read always reflects every write committed before it, globally), and survives a zone failure, or a region failure in a multi-region configuration, without losing committed data. A single Cloud SQL primary cannot scale the writes, and its failover window is precisely when you would lose answers. The cost is real: Spanner is more expensive per hour and demands schema discipline. But for “never lose a candidate’s answer,” it is the primitive that delivers the guarantee at this write scale.

Why Memorystore sits in front of Spanner. A server-authoritative countdown timer for 1.4 million candidates, ticking and being read constantly, plus autosaves landing every few seconds, would generate a punishing read/write rate straight onto the database. Memorystore for Redis absorbs that: the live timer and current-question pointer live in Redis, autosaves write to Redis immediately for instant client acknowledgment and are then flushed durably to Spanner, and reads are served from memory. Redis is the shock absorber that keeps Spanner doing durable writes rather than serving hot per-keystroke reads. The tradeoff is a consistency seam: Redis is fast but volatile, so an autosave acknowledged from Redis is not durable until its flush commits in Spanner. Spanner therefore remains the answer of record, and the flush path is designed to never drop a write.

Implementation guidance

Provision with Terraform, gate it with Wiz Code, deploy with Argo CD. Infrastructure is code, and on a workload where a misconfiguration is a national incident, the pipeline is part of the safety case. Terraform provisions five pieces:

  1. A regional, multi-zone GKE cluster (private nodes, Workload Identity on) with two node pools: a baseline pool and a surge pool sized to peak, kept warm by a balloon Deployment.
  2. Cloud Spanner with the exam schema — interleave the per-answer rows under the session row so a candidate’s session and all their answers co-locate, keeping autosave writes to a single split where possible.
  3. Memorystore for Redis Standard tier (HA with automatic failover) with read replicas sized to the timer/read load.
  4. Cloud Armor security policy attached to the global ALB, Adaptive Protection enabled, with explicit per-token rate-limit rules on the autosave and submit endpoints.
  5. Akamai in front, caching static assets, with the global load balancer configured as its shielded origin.
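
The balloon Deployment in item 1 is named but not shown. A minimal sketch of the usual pattern follows: a negative-priority PriorityClass plus a Deployment of pause containers that reserve room on the surge nodes and are evicted as soon as real pods need the space. The object names, replica count, resource requests, and the node pool name `surge` are assumptions for illustration, not values from this design.

```yaml
# Sketch only. Names, replica count, and resource sizes are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-priority
value: -10                         # below the default pod priority of 0
preemptionPolicy: Never            # balloon pods never evict other pods
globalDefault: false
description: "Placeholder pods that hold surge nodes warm and yield to real workloads."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: surge-balloon
spec:
  replicas: 200                    # assumption: roughly one balloon per surge node
  selector:
    matchLabels:
      app: surge-balloon
  template:
    metadata:
      labels:
        app: surge-balloon
    spec:
      priorityClassName: balloon-priority
      terminationGracePeriodSeconds: 0   # release the capacity immediately when preempted
      nodeSelector:
        cloud.google.com/gke-nodepool: surge   # assumption: the surge pool is named "surge"
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "3"             # assumption: sized to reserve most of one node
              memory: 8Gi
```

The pause pods only hold the nodes; they do not pull the exam-session image, so image pre-pulling still has to be handled separately. When exam-session pods with normal priority become pending, the scheduler preempts the balloons and the real pods land on nodes that already exist.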

The pipeline runs in GitHub Actions, authenticating to GCP via OIDC Workload Identity Federation so there is no long-lived service-account key to leak — a lesson the platform team intends never to relearn. Wiz Code scans every Terraform and Kubernetes manifest PR for misconfigurations (a public Spanner endpoint, an over-broad firewall, a missing Pod Security control) and blocks the merge if it finds one. Argo CD then syncs the desired cluster state from Git, so the production cluster is always a known, reviewed commit — no out-of-band kubectl apply on exam morning.
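
As a rough illustration of the keyless authentication described above, a GitHub Actions workflow might look like the following. The project number, pool and provider IDs, and service account address are placeholders, and the Wiz Code and load-test gates are left out because their exact invocation is not given here.

```yaml
# Sketch only. Project number, pool/provider IDs, and service account are placeholders.
name: terraform-plan
on:
  pull_request:
permissions:
  contents: read
  id-token: write                  # lets the job request an OIDC token from GitHub
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github/providers/github-oidc
          service_account: terraform-ci@example-project.iam.gserviceaccount.com
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform plan -input=false
```

The `id-token: write` permission is what allows the auth step to exchange GitHub's short-lived OIDC token for Google credentials, so no service-account key is ever stored in repository secrets.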

The load test is the deliverable, not an afterthought. You cannot claim to survive 1.4 million synchronized starts; you have to demonstrate it. A required pipeline gate runs a distributed load test that reproduces the curve — near-zero to peak in ninety seconds — against a production-clone environment, asserting p99 autosave latency and zero dropped writes. A scale event that has not been load-tested at the real shape is an unvalidated assumption, and on this platform unvalidated assumptions become court cases.

Synchronized release without a thundering-herd self-DDoS. The exam content is encrypted at rest; Vault holds the decryption key and releases it only at the start time, so even an insider cannot read the paper early. At T-0 the release flips a flag in Spanner and fans out over Pub/Sub to a WebSocket gateway that pushes to already-connected candidates — candidates are connected and idle before the start, so the gun does not trigger 1.4 million simultaneous new connections, only a lightweight push over existing ones. Pre-connecting the herd is what converts a connection storm into a trivial fan-out.

Enterprise considerations

Security & Zero Trust. The platform is hostile-internet-facing on its biggest day, so defense is layered and identity is strict. Akamai and Cloud Armor form two independent DDoS/WAF tiers — Akamai absorbs volumetric and static-flood attacks at the edge, Cloud Armor’s Adaptive Protection catches application-layer (L7) attacks and rate-limits the autosave/submit endpoints so a scripted cheater is throttled before reaching the app. Privileged actions are gated hard: invigilators authenticate through Okta with MFA, and the highest-stakes action — releasing the exam — requires step-up authentication, so no single stolen session can leak the paper. HashiCorp Vault holds the exam-paper key and releases it only at start time, making “early access to questions” a non-event. Wiz runs continuous CSPM across GKE and Spanner, alerting on any drift to public exposure or an over-permissive IAM binding, while Wiz Code shifts that same checking left into the IaC pipeline. CrowdStrike Falcon sensors on the GKE node pools provide runtime threat detection feeding the testing authority’s SOC, and any security event auto-raises a ServiceNow incident so there is a ticket and a bridge, not just a log line. Least-privilege IAM scopes each service to exactly the resources it needs, and candidate and proctor identities are separate trust domains by construction.
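
The implementation guidance attaches the Cloud Armor policy to the load balancer in Terraform. If the load balancer were instead created by GKE Ingress, which is an assumption and not something this design states, the same attachment could be expressed from the cluster side with a BackendConfig. The policy name, namespace, labels, and ports below are hypothetical.

```yaml
# Sketch only. Applies when the ALB is managed by GKE Ingress; names and ports are assumptions.
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: exam-session-backend
  namespace: exam
spec:
  securityPolicy:
    name: exam-armor-policy        # hypothetical Cloud Armor policy, defined elsewhere
---
apiVersion: v1
kind: Service
metadata:
  name: exam-session
  namespace: exam
  annotations:
    cloud.google.com/backend-config: '{"default": "exam-session-backend"}'
    cloud.google.com/neg: '{"ingress": true}'   # container-native load balancing
spec:
  type: ClusterIP
  selector:
    app: exam-session
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

Either route ends in the same place: the rate-limit and Adaptive Protection rules are evaluated at the load balancer, before a request reaches an exam-session pod.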

Cost optimization. The economics of a once-a-quarter spike are the whole game: peak capacity for three hours, near-zero for months. Engineer for the curve, not the peak.

| Lever | Mechanism | Typical effect on exam-day economics |
| --- | --- | --- |
| Scheduled scale-up/down | Pre-warm to peak before T-0, scale to a small floor after | Pay for peak only during the ~4-hour window, not year-round |
| Surge node pool on Spot | Run the warmed surge pool on Spot/preemptible where the workload tolerates it | Large discount on the burst capacity that is idle most of the time |
| Spanner sizing per event | Scale Spanner nodes up for the event, down after; use processing units granularly | Avoid paying for peak write throughput when no exam is running |
| Memorystore offload | Serve timers/reads from Redis so Spanner is sized for durable writes only | Smaller, cheaper Spanner footprint for the same correctness |
| Edge offload | Akamai serves 100% of static shell loads | 1.4M shell fetches never bill as GCP egress or compute |
| Commitment vs. on-demand | Committed-use discounts for the steady baseline, on-demand for the surge | Lowest blended rate across a spiky year |

The discipline is to treat the surge as ephemeral: provision it the day before, tear it down the day after, and never let “we might need it” leave peak capacity running into the next quarter. Pipe the per-window cost to Datadog so finance sees exactly what each exam day costs.

Scalability and the spike, concretely. Each tier scales on its own axis. GKE scales pods on the active-sessions custom metric and nodes via the Cluster Autoscaler against the warmed surge pool, so capacity is present before load. Spanner scales by adding nodes/processing units, raised for the event window. Memorystore scales reads via read replicas. The real ceiling to plan against is not CPU but the regional quotas — Compute Engine instance quota, in-use IP addresses, Spanner node limits — which is why the load test runs against the real project with the real quotas, and why a quota-increase request goes in weeks early. A spike architecture that hits an unraised quota at T-0 fails exactly as badly as one with no autoscaling at all.

Failure modes, and what each one looks like on the day. Name them before they page the war room.

Reliability & DR (RTO/RPO). The numbers are decided per tier and the bar is exceptional because a failed exam is not retryable in the moment. Cloud Spanner multi-region gives synchronous replication and survives a full region loss with zero RPO for committed answers — the non-negotiable guarantee. GKE is regional and multi-zone, so a zone failure is transparent; a region failure fails over to a warm standby cluster the load balancer can route to. Memorystore Standard tier provides automatic cross-zone failover. A pragmatic target for the exam-day service: RTO under 5 minutes, RPO zero for submitted answers, with the explicit design principle that a submitted answer is durable the instant Spanner commits it, region failure included. The exam itself can tolerate a brief connectivity blip for a candidate (the client buffers and re-syncs) far more easily than it can tolerate a single lost commit — so the architecture spends its reliability budget on durability first.
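
A multi-zone cluster only makes a zone failure transparent if the exam-session pods are actually spread across zones, and the component table lists both a PDB and multi-zone spread for that tier. A hedged sketch of the two settings follows; the `app: exam-session` label, the namespace, the image, the probe path, and the 5% budget are assumptions.

```yaml
# Sketch only. Labels, namespace, image, probe path, and the 5% budget are assumptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: exam-session-pdb
  namespace: exam
spec:
  maxUnavailable: "5%"             # caps voluntary evictions such as node drains
  selector:
    matchLabels:
      app: exam-session
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: exam-session
  namespace: exam
spec:                              # replicas omitted: the HPA owns the count
  selector:
    matchLabels:
      app: exam-session
  template:
    metadata:
      labels:
        app: exam-session
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # prefer even spread, but never block scheduling
          labelSelector:
            matchLabels:
              app: exam-session
      containers:
        - name: exam-session
          image: registry.example.com/exam/exam-session:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz       # hypothetical health endpoint
              port: 8080
```

The PodDisruptionBudget guards against voluntary disruptions only; it does nothing for a zone outage, which is what the spread constraint addresses. `ScheduleAnyway` is chosen here so that losing a zone mid-exam does not leave replacement pods unschedulable; `DoNotSchedule` would enforce the spread more strictly at the cost of that flexibility.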

Observability and the war room. Exam day is run from a live war room, and Datadog is the single pane: real-time active-concurrency (the metric that proves the herd arrived and capacity held), p99 autosave latency, Spanner CPU and commit latency, Redis hit rate, Cloud Armor blocked-request rate, and SLO burn-rate alerts that page before candidates feel pain. Real User Monitoring on the candidate client surfaces what actual browsers experience, not just server health. A blocked-request spike or an SLO breach auto-opens a ServiceNow major incident with the bridge details, so the org is in incident response within seconds. The principle: on a scheduled high-stakes event, you watch leading indicators (concurrency rising, latency creeping) and act before the lagging ones (errors, abandons) ever move.

Governance. A strict change freeze goes into effect days before the exam — enforced as a ServiceNow change gate that blocks any non-emergency deploy, with Argo CD ensuring production matches a reviewed Git commit and nothing else. Exam content keys, IAM bindings, and Cloud Armor policies are all version-controlled and reviewed. Every privileged proctor action (release, accommodation grant, session intervention) is logged immutably for audit, because the integrity of the exam — and its defensibility in the inevitable challenge — depends on a complete record of who did what when.
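
The freeze above is enforced as a ServiceNow change gate. As an optional second lock on the cluster side, and purely as an assumption beyond what this design prescribes, Argo CD can also refuse automated syncs during the freeze using a deny sync window on the project. The repository URL, project name, dates, and duration are hypothetical.

```yaml
# Sketch only. Repo URL, project name, and freeze dates are illustrative assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: exam-platform
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/example-org/exam-platform.git
  destinations:
    - server: https://kubernetes.default.svc
      namespace: exam
  syncWindows:
    - kind: deny
      schedule: "0 0 14 5 *"       # hypothetical: freeze begins 00:00 on 14 May
      duration: 96h                # covers the days before and the exam day itself
      timeZone: "Etc/UTC"
      applications:
        - "*"
      manualSync: true             # an approved emergency change can still be synced by hand
```

With `manualSync: true`, automated syncs are blocked for the window while a human can still push an emergency fix, which matches the "non-emergency deploy" wording of the change gate.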

Explicit tradeoffs

Accept these or do not build it. This architecture optimizes for a brutal, scheduled spike, and that focus has costs. Cloud Spanner is more expensive and more demanding than a single SQL database — you pay for horizontal write scale and global consistency in dollars and in schema discipline (get the key design wrong and you hot-spot), and for a small, steady workload it is overkill. Predictive scaling trades simplicity for safety: you must know and trust the schedule and pre-provision against it, which means a balloon Deployment and a warmed surge pool burning some money before the event and the operational ritual of scaling up and tearing down each cycle. The layered Akamai + Cloud Armor defense is two products to license, tune, and test rather than one. The Redis-in-front-of-Spanner design adds a consistency seam you must engineer carefully so a fast acknowledgment never becomes a lost answer. And the whole thing demands a real load test at the real shape, which is itself a significant engineering investment — but skipping it is how the legacy provider ended up in court.

The alternatives, and when they win. If your exams are asynchronous — candidates start whenever they like across a window — the synchronized-spike problem evaporates and you can lean on ordinary reactive autoscaling and a regional database; this entire architecture is overkill. If you are running a small institutional exam (a few thousand students, one university), a single Cloud SQL instance and a modest GKE deployment are simpler, cheaper, and entirely sufficient — graduate to this design only when scale and simultaneity demand it. If the platform is a learning management system rather than a high-stakes exam — think Moodle delivering courseware and quizzes where a momentary blip is an annoyance, not a lawsuit — the correctness and DR bar drops dramatically and a managed Moodle on autoscaled GKE with Cloud SQL is the pragmatic fit. And if you genuinely cannot predict the spike timing, you fall back to reactive scaling with a generous floor and accept the cold-start risk this design is built to eliminate. The architecture here is the destination for a national, synchronized, high-stakes exam; the right starting point depends on which of those three words actually apply to you.

The shape of the win

For the testing authority, the payoff is not “a website that stayed up.” It is that on the Saturday in May, 1.4 million candidates pressed “Start” inside the same ninety seconds, the first question rendered in under a second, every autosave landed durably in Spanner the instant it was made, a volumetric attack at 09:01 was absorbed by Akamai and Cloud Armor without a candidate noticing, and at the end of the three hours not one answer was lost and not one candidate saw a spinner — so the result stood, unchallenged, and there was no re-sit. That last clause is what funds the platform. Everything upstream — the scheduled pre-warm, the surge node pool, Spanner’s external consistency, Memorystore’s shock absorption, the layered DDoS defense, Vault holding the paper key until the gun, the load test at the real shape, the Datadog war room — exists so that a candidate, a regulator, and a court each conclude the exam was fair. Start narrower if your problem is smaller, but for a country’s entrance exam on a single synchronized morning, this is where the architecture has to land.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
