
GCP Well-Architected: Reliability — User-Experience SLOs, Error Budgets, Redundancy Across Failure Domains, Graceful Degradation, Failure Recovery, Chaos Testing & Capacity Planning

Where this fits

The Google Cloud Architecture Framework organizes Google’s guidance into pillars — System Design, Operational Excellence, Security/Privacy/Compliance, Reliability, Cost Optimization, and Performance Optimization — and Reliability is part 4 of this series. It sits deliberately after System Design (part 1) and Operational Excellence (part 2), because reliability is not a property you bolt on: it is the discipline of taking the topology you chose in System Design and the observability and automation you built in Operational Excellence, and turning a vague “it must be highly available” into an evidenced, measured, and tested set of promises. Google frames the pillar around four focus areas — Scoping, Observation, Response, and Learning — and a set of nine principles, all anchored on one reframing that runs through the whole pillar: reliability is defined by the user’s experience, not by server-side health. This article walks the seven engineering sub-components that operationalize those principles: the reliability principles themselves, SLIs/SLOs/error budgets, redundancy, graceful degradation, failure recovery, chaos testing, and capacity planning.

Google Cloud Architecture Framework — animated overview

Reliability principles — the design philosophy and the four focus areas

What it is. The Reliability pillar is expressed as nine principles grouped under four focus areas. The focus areas are the lifecycle frame — Scoping (understand the system, its components, and how they fail), Observation (detect problems proactively), Response (recover efficiently, ideally automatically), and Learning (stop failures recurring). The nine principles are the concrete practices that live inside those areas:

| # | Principle (official title) | Focus area | The sub-component it drives |
|---|---|---|---|
| 1 | Define reliability based on user-experience goals | Scoping | SLIs/SLOs |
| 2 | Set realistic targets for reliability | Scoping | SLOs & error budgets |
| 3 | Build highly available systems through resource redundancy | Scoping | Redundancy |
| 4 | Take advantage of horizontal scalability | Scoping | Redundancy / capacity |
| 5 | Detect potential failures by using observability | Observation | Detection (feeds all) |
| 6 | Design for graceful degradation | Response | Graceful degradation |
| 7 | Perform testing for recovery from failures | Response | Failure recovery / chaos |
| 8 | Perform testing for recovery from data loss | Response | Failure recovery (data) |
| 9 | Conduct thorough postmortems | Learning | Learning loop |

A tenth principle that Google publishes alongside these in the pillar’s recommendations — Manage capacity and quota — is the capacity-planning sub-component this article also covers.

Why it matters. The principles encode the single most important shift in the pillar: principle 1 insists you define reliability from the user’s perspective, not from CPU, memory, or “is the task running.” A node can be pegged at 100% CPU, a pod can be crash-looping, and a task can have died — and none of that is an outage if user requests are still succeeding within latency targets. Conversely, every dependency can report “healthy” while users see 503s. Server-centric metrics routinely mislead teams into firefighting non-incidents and missing real ones; the principles redirect attention to the success ratio of user requests and the latency of user journeys. The rest of the pillar — redundancy, degradation, recovery, capacity — exists only to keep that user-facing success ratio above an agreed line.

How to do it well. Treat the four focus areas as the order of work: scope the critical user journeys and their dependencies first; stand up observation that measures the journeys (not just the servers); design the response mechanisms (redundancy, degradation, self-healing, failover); and close the learning loop with blameless postmortems. Use Cloud Trace to walk a real user’s request through the system and find the latency-contributing hops, because that trace is the map of what your SLIs must measure. The anti-pattern is jumping straight to “deploy in three zones” before you have written down which journeys matter and what success means for them.

Artifacts & GCP tooling. The principles are operationalized through the Architecture Framework / Well-Architected content and the Cloud Well-Architected review as a design gate; Active Assist and Recommender surface reliability anti-patterns per resource. The output of this sub-component is a critical-user-journey (CUJ) inventory and a reliability requirements document recording the user-experience goals each journey must honor.

SLIs, SLOs, and error budgets — making reliability measurable and negotiable

What it is. This is the heart of Google’s SRE practice and of principles 1 and 2. The terms below are precise and must not be blurred:

| Term | Definition | The question it answers | GCP expression |
|---|---|---|---|
| SLI (indicator) | A quantitative measure of some aspect of service level, usually good events ÷ valid events | “How good is the service right now?” | A request-based or windows-based ratio in Cloud Monitoring service monitoring |
| SLO (objective) | A target value or range for an SLI over a rolling window | “What’s the line we promise to stay above?” | An SLO object on a Service in Cloud Monitoring |
| SLA (agreement) | A contract with consequences (credits) if the agreed service level is missed | “What do we owe if we breach?” | Commercial contract; SLO should be stricter than SLA |
| Error budget | 100% − SLO over the window: the allowed unreliability | “How much failure can we spend?” | Derived from the SLO; tracked as burn rate |

Why it matters. The error budget is the pillar’s most powerful idea because it converts reliability from a moral argument into an accounting one. The gap between 100% and your SLO is a budget you are allowed to spend on risky deploys, experiments, and feature velocity. A concrete anchor: a 99.99% availability SLO over a rolling 30 days yields an error budget of about 4.3 minutes of allowed downtime; a 99.9% SLO yields roughly 43 minutes. That single number ends most “let’s just promise five nines” conversations: four nines already leaves essentially no room for a bad deploy, a noisy dependency, or a maintenance window, and five nines shrinks the budget to about 26 seconds a month. The budget also resolves the eternal dev-vs-SRE tension with a rule, not a fight: while budget remains, ship features; when it is exhausted, the policy freezes risky changes and the team’s priority shifts to reliability work until the budget recovers.
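The arithmetic behind those anchors is short enough to keep next to the SLO specification. The sketch below is illustrative only: the function name `error_budget_minutes` is an assumption of this example, not a Cloud Monitoring API, and it treats the budget purely as downtime minutes over a rolling window.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a rolling window.

    Hypothetical helper for illustration; the budget is (100% - SLO) of the window.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # Print the budget for a few common targets over a 30-day window.
    for slo in (99.5, 99.9, 99.95, 99.99, 99.999):
        print(f"{slo}% -> {error_budget_minutes(slo):.2f} min per 30 days")
```

Running the same function with a 28-day or 90-day window shows how much the chosen window changes the absolute number even when the percentage stays fixed.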

Choosing the SLI — the request/success model. Google’s canonical SLI types map to the journey:

| SLI type | Good ÷ valid definition | Best for |
|---|---|---|
| Availability | successful responses ÷ valid requests | Almost every request/response flow |
| Latency | requests faster than threshold ÷ valid requests | User-perceived speed (set a threshold, e.g. 95% < 400 ms) |
| Quality | full-fidelity responses ÷ valid responses | Services that degrade (full vs degraded answers) |
| Freshness | data younger than threshold ÷ valid data | Pipelines, caches, replicas |
| Correctness / coverage | correct records ÷ records processed | Batch/data jobs |

You compute these from real signals — request-based SLIs divide good requests by total; windows-based SLIs count good time-windows. In Cloud Monitoring you build them from Application Load Balancer request metrics, Cloud Run / GKE request counts, Cloud Trace latency distributions, or log-based metrics.
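To make the request-based versus windows-based distinction concrete, here is a minimal sketch that computes both from the same per-window counts. The `Window` type, the function names, and the 0.99 per-window threshold are assumptions made for this example; Cloud Monitoring performs the equivalent calculation for you once the SLI is defined.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Window:
    """Counts for one evaluation window (hypothetical shape, for illustration)."""
    good: int   # requests that met the success criterion
    valid: int  # requests that count toward the SLI


def request_based_sli(windows: Iterable[Window]) -> float:
    """Good requests divided by valid requests across the whole period."""
    windows = list(windows)
    valid = sum(w.valid for w in windows)
    good = sum(w.good for w in windows)
    return good / valid if valid else 1.0


def windows_based_sli(windows: Iterable[Window], per_window_threshold: float = 0.99) -> float:
    """Fraction of windows whose own good/valid ratio met the threshold."""
    scored = [w for w in windows if w.valid > 0]
    if not scored:
        return 1.0
    good_windows = sum(1 for w in scored if w.good / w.valid >= per_window_threshold)
    return good_windows / len(scored)
```

The two views can disagree: a low-traffic window with a 50% failure rate barely moves the request-based ratio, yet it counts as one whole bad window in the windows-based ratio. Which one matches the user's experience depends on the journey.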

How to do it well.

Artifacts & GCP tooling. Produce an SLO specification per CUJ (SLI definition, target, window, rationale) and a composite-availability calculation from the dependency chain. The native tool is Cloud Monitoring service-level objectives (define a Service, attach SLOs, view error-budget burn on the SLO dashboard, and configure burn-rate alerting policies). For GKE/microservices, Cloud Service Mesh can auto-discover services and help author SLOs from telemetry. Error-budget policy (what happens when it is spent) is a written document, not a tool — and it is the artifact most teams skip.
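The composite-availability calculation and the burn-rate figure used for alerting both reduce to a few lines. This is a sketch under stated assumptions: the helper names are invented for the example, dependencies are treated as serial and independent (real failures are often correlated, so the result is an optimistic bound), and the burn rate is defined as the observed error ratio divided by the budgeted error ratio.

```python
from math import prod
from typing import Sequence


def composite_availability(hard_dependency_availabilities: Sequence[float]) -> float:
    """Ceiling on a journey's availability when every hard dependency must succeed.

    Assumes independent failures, which is an approximation.
    """
    return prod(hard_dependency_availabilities)


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return observed_error_ratio / (1.0 - slo)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the whole window's budget spent after `hours` at a given burn rate."""
    return rate * hours / (window_days * 24)


if __name__ == "__main__":
    # Six hard dependencies at 99.9% each cap the journey below 99.5%.
    print(f"{composite_availability([0.999] * 6):.4%}")
    # One hour at a 14.4x burn rate spends 2% of a 30-day budget.
    print(f"{budget_consumed(14.4, hours=1):.2%}")
```

Removing a dependency from the list (by making it soft) raises the ceiling, which is the calculation that justifies the degradation work later in the pillar.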

Redundancy across failure domains — turning resilience into availability

What it is. Principles 3 (“build highly available systems through resource redundancy”) and 4 (“take advantage of horizontal scalability”). Redundancy means mapping your failure domains — from a single VM, to a zone, to a region — and deploying replicated, independent copies of every component on a critical path so the loss of one domain does not take the journey down. On Google Cloud the domain hierarchy is explicit: a zone is an isolated failure domain within a region (independent power/cooling/network at the relevant layer), and a region is an independent geography composed of multiple zones.

Why it matters. Redundancy is the mechanism that produces the availability your SLO promises. But the level of redundancy is where most reliability money is won or lost: putting a flow that tolerates an hour of downtime into an active-active multi-region topology is money lit on fire, while running a database that backs a four-nines journey in a single zone is a breach waiting to happen. The level must be driven by the SLO and RTO/RPO, and it must respect data gravity — stateless tiers go multi-zone almost for free, while stateful tiers force the hard replication and consistency decisions.

How to do it well — choose the redundancy tier per component.

| Tier | Protects against | GCP building blocks | Cost/complexity step |
|---|---|---|---|
| Single zone | Nothing beyond instance/disk faults | Zonal MIG, zonal PD, single-zone GKE | Baseline; avoid for critical paths |
| Multi-zone (regional) | Loss of one zone | Regional managed instance group, regional GKE / Autopilot, regional Persistent Disk / Hyperdisk (synchronous cross-zone), Cloud SQL HA (regional, standby in 2nd zone), zonal NEGs behind a regional LB | The modern default for “in-region HA” |
| Multi-region | Loss of a whole region | Spanner multi-region, Cloud Storage dual/multi-region, BigQuery multi-region, Firestore multi-region, cross-region replicas, global LB across regional backends | Forces replication, consistency, and routing decisions |
| Global (anycast frontend) | Regional ingress / latency | Global External Application Load Balancer + Cloud CDN + Cloud Armor on Google’s anycast edge | The front door that ties regions together |

Then confront the active-active vs active-passive decision explicitly, because it sets your routing and data-consistency design:

| Decision | Active-passive (warm standby) | Active-active |
|---|---|---|
| Cost | One region full, one minimal | Two+ regions at full freight |
| RTO | Minutes (failover required) | Near-zero (already serving) |
| Data | Async cross-region replication (non-zero RPO) | Multi-write (Spanner) or partitioned-by-region |
| Complexity | Failover orchestration & testing | Conflict resolution, global routing, split-brain |
| Routing | DNS / global LB failover on health | Global LB active load distribution across regions |
| When | Most enterprise workloads | Only when SLO + RTO/RPO truly demand it |

The decisive GCP-specific moves: use regional resources by default (a regional MIG keeps capacity across zones; regional Persistent Disk gives synchronous two-zone block replication so a database VM survives a zone loss; Cloud SQL HA runs a standby in a second zone with automatic failover), reach for Spanner when you need horizontal scale with strong consistency and multi-region availability (its multi-region configurations carry a 99.999% availability SLA and remove application-level sharding), and front everything with the global External Application Load Balancer so traffic shifts off an unhealthy region automatically. Principle 4’s horizontal-scalability angle matters here too: redundant replicas should also be the unit of scale — stateless, fungible instances behind an autoscaler — so the same design that survives a failure also absorbs load.

Artifacts & GCP tooling. Deliver a per-tier redundancy decision record (zone / region / global + active-active vs passive), a failure-domain map per critical journey, a region-pair selection with data-residency justification, and the Terraform / Infrastructure Manager that encodes zone and region spread so it is repeatable and reviewable. The non-negotiable rule: redundancy you have never failed over is a hypothesis, not a control — which is why principle 3 is inert without principle 7.
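One check worth automating against the failure-domain map is whether a tier still meets its required capacity after losing its largest domain. The sketch below is a hedged illustration: the function name and the plain dictionary of per-zone instance counts are assumptions of this example, not a Compute Engine or GKE API, and it models capacity as simple instance counts.

```python
from typing import Mapping


def survives_single_domain_loss(capacity_by_domain: Mapping[str, int], required: int) -> bool:
    """True if the remaining domains still provide `required` capacity
    after the largest single domain (zone or region) is lost."""
    total = sum(capacity_by_domain.values())
    worst_loss = max(capacity_by_domain.values(), default=0)
    return total - worst_loss >= required


if __name__ == "__main__":
    even_spread = {"zone-a": 4, "zone-b": 4, "zone-c": 4}
    skewed = {"zone-a": 6, "zone-b": 3, "zone-c": 3}
    print(survives_single_domain_loss(even_spread, required=8))  # even spread keeps 8
    print(survives_single_domain_loss(skewed, required=8))       # skewed spread keeps only 6
```

The same total instance count passes or fails depending on how evenly it is spread, which is why the spread belongs in the reviewed infrastructure code and not in tribal knowledge.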

Graceful degradation — failing partial instead of total

What it is. Principle 6, in the Response focus area. Graceful degradation is designing a system so that under overload or partial failure it continues to serve, with reduced functionality, fidelity, or freshness, rather than failing completely — and recovers full function automatically when conditions normalize. It is the structural complement to the quality SLI: a degraded-but-serving response is still a good-ish event, and a well-degraded system keeps its availability SLO even while its quality SLO dips.

Why it matters. Without degradation, every dependency is a hard dependency — its failure is your failure, and the composite-availability multiplication drags your achievable SLO down. The single highest-leverage reliability move in many designs is converting a hard dependency to a soft one: cache the product catalog so the catalog service’s outage downgrades to stale-but-served; queue the order-confirmation email so the mail provider’s brownout never blocks checkout; return last-known position when the live-tracking feed stalls. Each conversion removes that dependency from the availability multiplication entirely. Degradation is also the antidote to retry storms and cascading failure: a system that sheds load gracefully protects the very dependencies that are struggling.

How to do it well — the GCP toolkit of degradation patterns.

| Pattern | What it does | GCP mechanism |
|---|---|---|
| Caching / serve-stale | Answer from cache when source is down | Memorystore (Redis/Valkey), Cloud CDN, HTTP cache headers, stale-while-revalidate |
| Asynchronous decoupling | Buffer work so a slow consumer deepens a queue, not a failure | Pub/Sub between tiers; queue-based load leveling |
| Circuit breaker + backoff | Stop hammering a sick dependency; fail fast to a fallback | App-level breakers; Cloud Service Mesh / Envoy outlier detection & retries with backoff |
| Load shedding & rate limiting | Drop low-priority work to protect the core | Cloud Armor rate-limiting/throttling rules at the edge; priority queues |
| Bulkheads / isolation | Contain a failure to one partition | Separate pools, cell-based partitioning, per-tenant quotas |
| Timeouts & fallbacks | Bound waits; return a default/partial answer | Aggressive client timeouts; feature flags to disable non-critical features |
| Autoscaler-aware throttling | Cap concurrency to protected backends | Cloud Run max-instances / concurrency; GKE HPA + PodDisruptionBudgets |

How to apply it. Walk each critical journey and classify every dependency as hard or soft; for each currently hard dependency, decide whether a cache, queue, default value, or feature flag can make it soft, and design the fallback explicitly (what does “degraded” actually return?). Pair every retry with a circuit breaker and exponential backoff with jitter — retries without a breaker convert a dependency outage into a self-inflicted DDoS that takes the dependency down harder. Put Cloud Armor rate-limiting at the edge so a traffic spike or abusive client is shed before it reaches origin, and use Pub/Sub to absorb downstream slowness as backlog rather than front-door errors. Surface degradation explicitly via the quality SLI so you can see when the system is running degraded and for how long.
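The pairing of retries, jittered exponential backoff, a circuit breaker, and an explicit fallback can be sketched at application level as follows. This is a minimal, single-threaded illustration under stated assumptions: every class and function name here is hypothetical, the thresholds are arbitrary, and a service mesh or an established resilience library would normally supply this behavior in production.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 5.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def call_with_fallback(fn: Callable[[], T], fallback: Callable[[], T],
                       breaker: CircuitBreaker, max_attempts: int = 3,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Try `fn` with bounded retries; fail fast to `fallback` when the breaker is open."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            return fallback()
        try:
            result = fn()
        except Exception:
            breaker.record_failure()
            if attempt + 1 < max_attempts and breaker.allow():
                sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
    return fallback()
```

A caller would wrap the soft dependency, for example `call_with_fallback(fetch_live_catalog, serve_cached_catalog, breaker)`, where both function names are placeholders. Counting how often the fallback branch runs is one way to feed the quality SLI.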

Artifacts & GCP tooling. Produce a hard/soft dependency map per journey, a degradation design per critical dependency (fallback behavior + the feature flags that toggle it), and the quality SLI definitions that measure degraded-vs-full responses. Tooling: Memorystore, Cloud CDN, Pub/Sub, Cloud Armor, Cloud Service Mesh (retries, timeouts, outlier detection), and feature-flag configuration (e.g. via a config service or Firebase Remote Config for client tiers).

Failure recovery — RTO/RPO, automated response, and recovering from data loss

What it is. Principles 7 (“perform testing for recovery from failures”) and 8 (“perform testing for recovery from data loss”), in the Response focus area. Recovery is governed by two numbers set per critical journey: RTO (Recovery Time Objective) — the maximum acceptable downtime before recovery — and RPO (Recovery Point Objective) — the maximum acceptable data loss measured in time. RTO selects your recovery mechanism and speed; RPO selects your replication and backup cadence. They are distinct knobs that, together, select your DR architecture and its bill.

Why it matters. RTO and RPO are not aspirations to minimize blindly. Driving both to zero is the most expensive mistake in the pillar, because zero RPO and surviving a full region loss pull against each other: zero RPO requires synchronous replication, and the asynchronous cross-region replicas that stores such as Cloud SQL rely on always leave a non-zero RPO, so for those stores zero data loss is an in-region property. The exception is a store built for synchronous multi-region commits, such as Spanner multi-region, which pays for that guarantee in write latency and cost. Stating honest, per-journey RTO/RPO is what justifies the cheapest sufficient DR pattern instead of defaulting to active-active everywhere. Equally important is principle 8’s distinction: DR is not backup. Cross-region replication faithfully copies a logical corruption or a fat-fingered DELETE to the standby almost immediately, so you also need point-in-time recovery and immutable backups to recover from data loss or corruption, which is a different failure class from infrastructure loss.

How to do it well — map each RTO/RPO pair to a GCP pattern.

| Pattern | Typical RTO | Typical RPO | GCP mechanism | Relative cost |
|---|---|---|---|---|
| Backup & restore | Hours–days | Hours (last backup) | Backup and DR Service, Cloud SQL automated backups, Cloud Storage snapshots/exports | Lowest |
| Pilot light | Tens of min | Minutes | Core data replicated (async) to a 2nd region; compute scaled up on failover | Low |
| Warm standby (active-passive) | Minutes | Seconds–minutes | Cross-region read replica promoted; global LB shifts traffic | Medium |
| Hot / active-active | Near-zero | Near-zero (in-region sync) | Spanner multi-region; partitioned multi-region writes; global LB across live regions | Highest |
| In-region synchronous HA | Seconds | Zero | Cloud SQL HA (standby zone), regional Persistent Disk, AlloyDB HA | Medium (no region-loss cover) |
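Picking the cheapest sufficient pattern for a journey can be expressed as a lookup over the region-loss rows of the table. The sketch below is illustrative only: the numeric RTO/RPO values are assumptions chosen to stand in for the table's ranges ("hours", "tens of minutes", "near-zero") and should be replaced by figures measured in your own drills. The in-region HA row is left out because it does not cover region loss.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrPattern:
    name: str
    typical_rto_s: float  # assumed representative value, not a guarantee
    typical_rpo_s: float  # assumed representative value, not a guarantee


# Ordered cheapest first, mirroring the "Relative cost" column.
PATTERNS = [
    DrPattern("backup-and-restore", typical_rto_s=24 * 3600, typical_rpo_s=24 * 3600),
    DrPattern("pilot-light", typical_rto_s=30 * 60, typical_rpo_s=5 * 60),
    DrPattern("warm-standby", typical_rto_s=5 * 60, typical_rpo_s=60),
    DrPattern("active-active", typical_rto_s=30, typical_rpo_s=1),
]


def cheapest_sufficient(rto_s: float, rpo_s: float) -> Optional[DrPattern]:
    """Return the first (cheapest) pattern whose typical RTO and RPO fit the targets."""
    for pattern in PATTERNS:
        if pattern.typical_rto_s <= rto_s and pattern.typical_rpo_s <= rpo_s:
            return pattern
    return None  # no listed pattern fits; revisit the target or the data store
```

A `None` result is useful in itself: it flags an RTO/RPO pair that none of the listed region-loss patterns meets under these assumptions, which is the conversation to have before any infrastructure is built.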

The GCP recovery toolkit:

The brownout caveat. A failover that triggers only on a hard outage does not protect against a sick-but-alive primary (high-latency brownout). Pair the failover trigger with a synthetic probe (an uptime check / SLI-based signal) so a primary that is up-but-slow is detected and, if it breaches the journey’s latency SLO, fails over.

Artifacts & GCP tooling. Deliver an RTO/RPO register per journey, a DR runbook (structured, documented, and tested) covering both region-loss and data-loss scenarios, and the backup/replication policy per data store. Record the RTA (Recovery Time Actual) measured in drills next to the RTO so the gap is visible. Tooling: Backup and DR Service, Cloud SQL/AlloyDB HA + cross-region replicas + PITR, Spanner backups/PITR, regional Persistent Disk, Cloud Storage versioning/Bucket Lock, regional MIG autohealing, and global LB / Cloud DNS for traffic shifting.

Chaos testing — proving the recovery design with deliberate fault injection

What it is. The testing half of principles 7 and 8 — periodically and deliberately injecting the faults you enumerated (zone failovers, region failovers, release rollbacks, dependency outages, data restores) and verifying the system withstands the fault, scales under demand, and recovers within its RTO/RPO. Google’s guidance is explicit that the objectives include validating RTO and RPO, assessing fault tolerance under varied failure scenarios, and confirming automated failover mechanisms actually work.

Why it matters. A redundancy design (principle 3), a self-healing loop (principle 5), and an RTO/RPO register are all hypotheses until you fail them on purpose and watch the response. Chaos testing is the one activity that validates every prior sub-component at once: it proves the redundancy fails over inside RTO, the degradation actually sheds load instead of cascading, the recovery automation fires, and the SLO/error-budget instrumentation actually detects the event. The pass criterion is not “it recovered” but “it recovered within RTO/RPO and inside the error budget.”

How to do it well — disciplined experiments, not breakage.

  1. Form a hypothesis tied to an SLO: “loss of zone asia-south1-a fails the Cloud SQL primary over to its standby in under 60 s with no breach of the booking journey’s 99.95% SLO.”
  2. Bound the blast radius — non-production first, then a controlled production game day with the error budget watched live and an abort condition ready (and only run in production if budget remains).
  3. Inject a fault from the failure-domain map — kill a GKE pod (kubectl delete pod) to test probes; cordon/drain a node; sever a dependency with a firewall rule; inject latency; trigger a Cloud SQL failover; force a regional MIG instance recreation; restore a backup into a scratch project to time RPO/RTO.
  4. Measure RTA against RTO and confirm self-healing and degradation behaved as designed.
  5. Capture results in a reliability scorecard and feed gaps back into the failure-domain map and the DR runbook.
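Steps 4 and 5 can be captured as a small pass/fail evaluation so every drill is scored the same way. This is a sketch with assumed names and fields (`DrillResult`, `evaluate`); the criterion it encodes is the one stated above: recovery within RTO and RPO, with error budget still remaining.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DrillResult:
    """One game-day measurement (hypothetical record shape, for illustration)."""
    experiment: str
    rto_s: float
    rta_s: float             # Recovery Time Actual measured in the drill
    rpo_s: float
    measured_rpo_s: float    # data-loss window actually observed
    budget_remaining: float  # fraction of the error budget left after the drill


def evaluate(result: DrillResult) -> List[str]:
    """Return the list of failed criteria; an empty list means the drill passed."""
    failures = []
    if result.rta_s > result.rto_s:
        failures.append(f"RTA {result.rta_s:.0f}s exceeds RTO {result.rto_s:.0f}s")
    if result.measured_rpo_s > result.rpo_s:
        failures.append(f"measured RPO {result.measured_rpo_s:.0f}s exceeds RPO {result.rpo_s:.0f}s")
    if result.budget_remaining < 0:
        failures.append("error budget exhausted during the drill")
    return failures
```

Storing the returned list alongside the experiment name gives the scorecard a concrete gap to feed back into the failure-domain map and the DR runbook.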

The GCP reality on tooling. Google does not ship a first-party managed chaos service equivalent to a hosted fault-injection product; chaos engineering on GCP is assembled from primitives and open-source tools. The practical toolkit:

| Capability | How you do it on GCP |
|---|---|
| Kill/disrupt workloads | kubectl delete pod, node cordon/drain, GKE + open-source Chaos Mesh or LitmusChaos for pod/network/IO/stress faults |
| Network faults / partition | VPC firewall rules to block a dependency; service-mesh fault injection (Cloud Service Mesh / Envoy abort & delay) |
| Stress / resource pressure | stress-ng, Chaos Mesh stress experiments, latency injection at the mesh |
| Zone / region failover | Trigger Cloud SQL / AlloyDB failover; recreate a zone’s MIG instances; drain a region’s backend from the global LB |
| Data-loss drill | Restore a PITR / Backup and DR recovery point into an isolated project and measure RPO/RTO |
| Scale-under-demand | Load tests (open-source k6/Locust on GKE, or Cloud Load Balancing synthetic traffic) to validate horizontal scaling and capacity |

Artifacts & GCP tooling. Deliverables: a chaos experiment catalog mapped one-to-one to failure-domain rows, game-day runbooks with explicit abort criteria, and a reliability scorecard per journey (SLO attainment, error budget remaining, zone/region failover RTA, RPO measured). Validate the “scale under demand” half with load testing against autoscaled Cloud Run / GKE tiers, and watch the run on the Cloud Monitoring SLO dashboards so the experiment’s impact on the error budget is visible in real time.

Capacity and quota planning — having the resources when the failure happens

What it is. The Manage capacity and quota principle. Capacity planning ensures that sufficient capacity is reserved in every region you intend to fail into, or that the risks of relying on emergency autoscaling are explicitly accepted. Quota management ensures the per-project, per-region service limits that gate your scaling paths are raised ahead of need. The two are tightly coupled to reliability because a failover into a region that lacks capacity — or a scale-up that hits a quota wall — is an outage, no matter how good the redundancy design looked on paper.

Why it matters. This is the sub-component teams most often discover during an incident. Two classic failure modes: (1) you designed active-passive DR but never reserved capacity in the standby region, so when the primary fails and everyone fails over at once, the standby cannot get enough instances and your beautiful failover degrades into a brownout; (2) a traffic surge or a failover doubles your instance count and you hit a Compute Engine CPU quota or a Cloud Run max-instances ceiling, and the autoscaler simply stops scaling. Google’s guidance is blunt: do data-driven capacity planning using load tests and traffic forecasts, and either reserve capacity or consciously accept the autoscaling risk — silence is not an option.

How to do it well.

| Lever | Reliability purpose | GCP mechanism |
|---|---|---|
| Capacity reservation | Guarantee standby-region capacity for failover | Compute Engine reservations, CUDs |
| Autoscaling | Absorb demand and replace failed capacity | MIG autoscaler, GKE cluster autoscaler / NAP, Cloud Run autoscaling, HPA |
| Quota headroom | Stop scaling from hitting a wall | Cloud Quotas, quota increase requests, usage alerts |
| Load testing & forecasting | Make the capacity number evidence-based | k6/Locust on GKE, traffic forecasts from Monitoring history |
| Headroom for scale-up latency | Cover the seconds/minutes new capacity takes | Min-instances / warm pools / pre-provisioned standby |

Artifacts & GCP tooling. Produce a capacity plan per journey (forecast peak, reserved vs autoscaled, standby-region capacity strategy), a quota register mapping every scaling path to its per-region limit and headroom, and load-test results validating the forecast. Tooling: Compute Engine reservations / CUDs, MIG & GKE autoscaling, Cloud Run scaling controls, Cloud Quotas with monitoring alerts, and Cloud Monitoring as the source of historical traffic for forecasting.
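A quota register becomes actionable once each entry is checked against the projected failover surge. The sketch below is a hedged illustration: the `QuotaEntry` shape, the scaling-path labels, the 2x surge multiplier, and the 20% minimum headroom are all assumptions for this example, and the limits would in practice be read from Cloud Quotas rather than typed in.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class QuotaEntry:
    """One row of the quota register (hypothetical shape, for illustration)."""
    scaling_path: str   # free-text label, not an official quota metric name
    region: str
    limit: float
    peak_usage: float   # forecast peak from load tests and traffic history


def headroom_after_surge(entry: QuotaEntry, surge_multiplier: float = 2.0) -> float:
    """Fraction of the limit still free after the surge; negative means the wall is hit."""
    projected = entry.peak_usage * surge_multiplier
    return (entry.limit - projected) / entry.limit


def at_risk(register: Iterable[QuotaEntry], surge_multiplier: float = 2.0,
            min_headroom: float = 0.2) -> List[QuotaEntry]:
    """Entries whose post-surge headroom falls below the agreed minimum."""
    return [e for e in register
            if headroom_after_surge(e, surge_multiplier) < min_headroom]
```

Running `at_risk` over the register before each peak event yields the list of quota increase requests to file ahead of need, instead of discovering the limit during a failover.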

Real-world enterprise scenario

Aerolux Airways is a fictional full-service carrier headquartered in Dubai, running its flight-booking and check-in platform on GKE and Cloud SQL in me-central1 (Doha) with europe-west1 (Belgium) as a secondary, fronted by the global External Application Load Balancer and Cloud Armor. Booking volume triples around holiday peaks and major fare sales; a regional control-plane event the previous year took bookings down for ~70 minutes during a sale and cost an estimated $1.9M in lost bookings and goodwill. The VP of Engineering mandates a Google Cloud Architecture Framework Reliability review before the next sale.

Reliability principles. The platform team starts in the Scoping focus area: they inventory three critical user journeys — Search & Book, Check-in, and Loyalty/Miss-fare reporting — and run Cloud Trace to map each journey’s dependency chain. Applying define reliability based on user-experience goals, they refuse a blanket “five nines” and instead measure the success ratio of booking requests, not node CPU; a prior habit of paging on CPU saturation (which routinely self-corrected) is dropped in favor of journey-based signals.

SLIs, SLOs, error budgets. They set per-journey SLOs in Cloud Monitoring service monitoring: Search & Book — 99.95% availability + 95% of requests < 500 ms (30-day window, ~22 min/month budget); Check-in — 99.9% availability (degrades to cached boarding-pass data); Reporting — 99.5%. The composite math on Search & Book exposes six hard dependencies capping it near 99.5%, so they convert the fare-rules lookup and seat-map service from hard synchronous calls to Memorystore-cached soft dependencies, lifting the achievable ceiling above 99.95%. Multi-window burn-rate alerts (14.4x fast-burn paging; 3x slow-burn ticketing) replace static error thresholds, and a written error-budget policy freezes risky deploys when a journey’s budget is exhausted.

Redundancy across failure domains. Decision record: all stateless tiers and the GKE node pools go regional across three zones; Cloud SQL moves to regional HA (standby in a second zone) for in-region zero-RPO; the DR posture for region loss is active-passive Doha → Belgium with a promotable cross-region read replica; ingress stays on the global External Application Load Balancer so traffic shifts off an unhealthy region automatically. The loyalty datastore, needing global scale with strong consistency, is moved to Spanner (multi-region) — eliminating the application-level sharding the old system carried.

Graceful degradation. The hard/soft map drives concrete fallbacks: Memorystore serves stale fare/seat data when those services are down; Pub/Sub decouples booking-confirmation emails and loyalty accrual so a downstream brownout never blocks a booking; Cloud Armor rate-limiting sheds abusive/burst traffic at the edge; and a quality SLI tracks how often Search & Book is serving degraded (cached) results so degradation is visible, not silent. Every external call gets a circuit breaker + backoff via Cloud Service Mesh.

Failure recovery. The RTO/RPO register: Search & Book RTO 10 min / RPO 0 (bookings cannot be lost) → Cloud SQL regional HA for zero in-region RPO plus the cross-region replica for region loss (accepting a few seconds’ RPO only in the catastrophic case); Check-in RTO 15 min / RPO 5 min; Reporting RTO 24 h / RPO 24 h → Backup and DR Service restore. They explicitly document that zero-RPO and region-survival cannot both hold, and add Cloud SQL PITR + Cloud Storage Object Versioning with Bucket Lock to defend against data corruption (a logical-delete that replication would otherwise propagate). Regional MIG/GKE autohealing and automatic Cloud SQL failover handle the common faults without a human.

Chaos testing. Using Chaos Mesh on GKE plus native triggers, they catalog experiments straight from the failure-domain map: pod kills, a firewall-rule partition simulating loss of the seat-map service, a Cloud SQL failover flip, draining the Doha backend from the global LB, and a PITR restore into a scratch project to time RPO. The first production game day reveals the cross-region replica promotion RTA is 12 minutes against a 10-minute RTO — a fail — fixed by keeping the Belgium compute warm (pilot-light → warm-standby) and pre-promoting on a synthetic-probe latency breach rather than only on hard outage.

Capacity and quota. A load test to forecast holiday-sale peak shows a near-2x instance surge on failover; they place Compute Engine reservations in europe-west1 so the standby actually has capacity, raise Compute CPU, in-use IP, and Cloud Run instance quotas to 2x peak in both regions via Cloud Quotas, and wire quota-usage alerts. CUDs cover the steady baseline.

Outcome. Going into the next sale, Search & Book held 99.97% through a real single-zone incident with zero data loss (regional HA), the fare-rules circuit breaker prevented a repeat cascade, the rehearsed regional failover RTA came down to 6m 10s — inside RTO and inside the error budget — and the standby region’s reserved capacity absorbed the failover surge without hitting a quota wall. The per-journey reliability scorecard, reviewed each release with the error-budget policy, became the artifact the business uses to steer peak-readiness.

Deliverables & checklist

By the end of the Reliability phase you should hold:

Common pitfalls

What’s next

Part 5 of the Google Cloud Architecture Framework series turns to the Cost Optimization pillar — aligning the redundancy, capacity reservations, and DR posture you just designed with a sustainable spend through right-sizing, committed-use discounts, autoscaling efficiency, and FinOps practices on Google Cloud.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.


// part 4 of 6 · Google Cloud Architecture Framework
