GCP Well-Architected: System Design — Core Principles, Geography & Regions, the Resource Hierarchy, Networking Foundations, and Choosing Compute, Storage & Databases

Where this fits

The Google Cloud Architecture Framework is Google’s body of guidance for building and running workloads on Google Cloud, organized into pillars — System Design, Operational Excellence, Security, Privacy & Compliance, Reliability, Cost Optimization, and Performance Optimization — sitting on top of a set of cross-cutting core principles. System Design is the foundational pillar — part 1 of the series — and it is deliberately first, because it is where you make the structural, hard-to-reverse decisions that every other pillar later inherits: where your data physically lives (geography and regions), how your environment is organized for governance and billing (the resource hierarchy), how packets move and what is reachable (networking foundations), and which managed primitives you build on (compute, storage, and databases). Get System Design right and reliability, cost, and performance become tuning problems; get it wrong and they become migration projects.

Core principles — the design philosophy underneath every decision

The Architecture Framework’s core principles are the lens you apply before you reach for a service. They are not pillar-specific; they are the shared design philosophy that keeps the pillars coherent, and the System Design pillar is where you operationalize them first.

The principles you are applying. Across Google’s guidance the load-bearing principles are:

Principle	What it means in practice	The System Design consequence
Design for change	Requirements, traffic, and Google’s own services evolve; bake in the ability to evolve	Favor loosely coupled, API-fronted services; avoid hard-wiring regions or instance types
Document your architecture	An architecture that lives only in someone’s head cannot be reviewed, audited, or evolved	Produce diagrams, an ADR (architecture decision record) log, and IaC as the source of truth
Simplify and use managed services	Undifferentiated heavy lifting (patching, replication, failover) is Google’s job, not yours	Prefer Cloud Run, GKE Autopilot, BigQuery, Spanner over self-managed equivalents
Decouple your architecture	Tight coupling turns a local failure into a global one and blocks independent scaling	Insert Pub/Sub, queues, and well-defined service boundaries between components
Use a stateless architecture	Stateless tiers scale horizontally and recover by replacement, not repair	Push state to managed data stores; keep compute fungible
Automate and use IaC	Manual, click-ops change is unreproducible and drift-prone	Terraform / Infrastructure Manager, Config Controller, CI/CD for infra

Why it matters. These principles are what stop System Design from degenerating into “pick a VM and a database.” Design for change is the difference between a workload you can move to a second region in a sprint and one that needs a re-platform. Simplify and use managed services is usually the single highest-leverage decision a team makes on Google Cloud, because Google’s managed primitives (Spanner, BigQuery, Cloud Run, GKE Autopilot) absorb exactly the operational toil that sinks projects.

How to do it well. Treat the principles as a checklist you run against every significant decision and record the answer. The concrete artifacts are an Architecture Decision Record (ADR) log (one short document per significant, hard-to-reverse choice, with context, options, decision, and consequences), a set of reference architecture diagrams, and an IaC repository that is the environment rather than describing it. The framework’s own Architecture Center, Cloud Well-Architected content, and the Google Cloud Architecture Diagramming tool are the canonical references; Active Assist and Recommender later surface where reality has drifted from these principles.
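To make the ADR artifact concrete, here is a minimal sketch of one record as a small Python data structure that renders to plain text. The field names follow the context / options / decision / consequences shape described above; the class name, the numbering format, and the sample values are illustrative assumptions, not a format prescribed by the framework.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ADR:
    """One architecture decision record; field names are illustrative."""
    number: int
    title: str
    context: str
    options: list[str]
    decision: str
    consequences: list[str]
    status: str = "accepted"
    decided_on: date = field(default_factory=date.today)

    def render(self) -> str:
        """Render the record as plain text for a version-controlled ADR log."""
        lines = [
            f"ADR-{self.number:04d}: {self.title}",
            f"Status: {self.status} ({self.decided_on.isoformat()})",
            "",
            "Context",
            self.context,
            "",
            "Options considered",
            *[f"- {option}" for option in self.options],
            "",
            "Decision",
            self.decision,
            "",
            "Consequences",
            *[f"- {item}" for item in self.consequences],
        ]
        return "\n".join(lines)


# Hypothetical example values, chosen only to show the shape of a record.
adr = ADR(
    number=1,
    title="Primary region for the payments workload",
    context="Users are concentrated in India; data must stay in-country.",
    options=["asia-south1", "asia-southeast1"],
    decision="Use asia-south1 as the primary region.",
    consequences=["The secondary region must also be in India."],
    decided_on=date(2024, 1, 15),
)
print(adr.render())
```

Keeping records this small is deliberate: a short, uniform structure is easier to review in a pull request next to the IaC change it justifies.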

Geography and regions — where your data and compute physically live

Region and zone selection is the most physical decision in System Design: it sets latency to your users, your data-residency and compliance posture, your blast radius, and a meaningful slice of your cost. It is also one of the hardest to reverse, because data has gravity — once petabytes live in europe-west1, moving them is a project, not a config change.

The hierarchy of physical placement.

A region (e.g., asia-south1 = Mumbai, us-central1 = Iowa, europe-west1 = Belgium) is an independent geographic area, itself made of zones.
A zone (e.g., asia-south1-a) is a deployment area within a region, isolated for failure-domain purposes; Google guidance is to spread across at least two, preferably three, zones so the loss of one zone does not take the workload down.
Multi-region locations (e.g., US, EU, ASIA in Cloud Storage) exist for specific services such as Cloud Storage, BigQuery, Spanner, and Firestore, and replicate data across multiple regions in a geography for durability and availability; the available location names differ by service.

The decisions you are actually making.

Decision driver	What it forces you to evaluate	GCP levers
Latency to users	Pick regions close to the user population; multi-region front door	Cloud CDN, global External Application Load Balancer, Network Service Tiers (Premium vs Standard)
Data residency / sovereignty	Some data legally cannot leave a country/continent	Regional resources, Organization Policy location constraints, Assured Workloads, Sovereign Controls
Service availability	Not every product or machine type exists in every region	Per-region product availability, GPU/TPU availability
Reliability target	Single-zone vs multi-zone vs multi-region	Zonal vs regional resources (regional MIGs, regional persistent disk)
Carbon footprint	Regions differ in carbon intensity	Per-region carbon-free energy % published by Google
Cost	Pricing differs by region; egress between regions is charged	Per-region pricing, network egress matrix

How to do it well. Choose a primary region by latency and residency, then a secondary region for DR that sits inside the same residency boundary (the same country where the law requires it, otherwise the same continent to keep cross-region latency low). Make every resource’s location an explicit, IaC-set value; never accept a default region. For data-residency-bound workloads, enforce placement with the Organization Policy constraint gcp.resourceLocations (allowed/denied locations) rather than trusting engineers to remember, and consider Assured Workloads to get a compliance-scoped folder with location and personnel controls baked in. Artifacts: a region-selection rationale (latency, residency, CFE%, cost) recorded as an ADR, a primary/secondary region pair per workload, and Organization Policy location constraints applied at the right folder.
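The effect of a location allow-list can be rehearsed locally before any policy is applied. The sketch below is a plain-Python simulation, not the Organization Policy API: the allow-list, the resource names, and the helper functions are all hypothetical, and the only Google Cloud fact it relies on is that a zone name is its region name plus a one-letter suffix.

```python
import re

# Hypothetical allow-list mirroring what a location constraint might permit.
ALLOWED_REGIONS = {"asia-south1", "asia-south2"}

# A zone is "<region>-<letter>", for example asia-south1-a.
_ZONE_SUFFIX = re.compile(r"-[a-z]$")


def region_of(location: str) -> str:
    """Map a zone to its region; regions and multi-regions pass through lowercased."""
    return _ZONE_SUFFIX.sub("", location.lower())


def violations(resources: dict[str, str], allowed: set[str] = ALLOWED_REGIONS) -> list[str]:
    """Return the names of resources whose location is outside the allow-list."""
    return sorted(
        name for name, location in resources.items()
        if region_of(location) not in allowed
    )


# Hypothetical planned resources: name -> declared location.
planned = {
    "auth-service": "asia-south1",
    "ledger-disk": "asia-south2-a",
    "scratch-bucket": "US",
}
print(violations(planned))  # ['scratch-bucket']
```

A check like this fits naturally in CI against the planned IaC output, with the Organization Policy constraint remaining the actual enforcement point.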

The resource hierarchy — the structural backbone for governance, billing, and isolation

The Google Cloud resource hierarchy is the single most important governance artifact you create, because almost everything else attaches to it: IAM policies, Organization Policies, billing, networking scope, and quotas all inherit down it. Designing it deliberately, up front, is the difference between governance-by-design and governance-by-cleanup-project-two-years-later.

The levels, top to bottom.

Organization — the root node, tied 1:1 to a Cloud Identity or Google Workspace account; it represents the company and is where org-wide IAM and Organization Policy live.
Folders — optional grouping nodes (nestable) that typically model departments, teams, environments, or legal entities; they are the natural attach point for delegated administration and environment-specific policy.
Projects — the fundamental unit of resource ownership, billing, quota, and isolation; every resource lives in exactly one project, and a project is the boundary for APIs, service accounts, and most IAM.
Resources — the VMs, buckets, databases, etc., inside projects.

Why it matters. The hierarchy is how inheritance works. An IAM role granted at a folder flows to every project beneath it; an Organization Policy set at the org constrains everything below unless explicitly overridden. This is enormously powerful and equally dangerous: a project landing in the wrong folder silently inherits the wrong access and the wrong guardrails. The project is also the billing and quota boundary — billing rolls up to a Cloud Billing account linked per project, and quotas are largely per-project-per-region, so your project topology directly shapes both your cost reporting and your scaling ceilings.
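The inheritance behavior described above can be modeled as a small tree in which a node’s effective access is the union of its own bindings and those of every ancestor. This is a toy model under stated assumptions: it covers only additive allow bindings, ignores deny policies, conditions, and Organization Policy overrides, and the node names and groups are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """An organization, folder, or project in a toy resource hierarchy."""
    name: str
    parent: Optional["Node"] = None
    bindings: dict[str, set[str]] = field(default_factory=dict)  # role -> members

    def grant(self, role: str, member: str) -> None:
        """Attach a role binding directly to this node."""
        self.bindings.setdefault(role, set()).add(member)

    def effective_bindings(self) -> dict[str, set[str]]:
        """Union of the bindings on this node and on every ancestor."""
        merged: dict[str, set[str]] = {}
        node: Optional[Node] = self
        while node is not None:
            for role, members in node.bindings.items():
                merged.setdefault(role, set()).update(members)
            node = node.parent
        return merged


# Hypothetical hierarchy: organization -> folder -> project.
org = Node("example.com")
prod = Node("prod", parent=org)
auth = Node("auth-prod", parent=prod)

org.grant("roles/logging.viewer", "group:sec@example.com")
prod.grant("roles/run.admin", "group:platform@example.com")
auth.grant("roles/run.invoker", "group:auth-team@example.com")

effective = auth.effective_bindings()
print(sorted(effective))
```

Moving auth under a different folder in this model changes its effective bindings without touching the project itself, which is exactly the "wrong folder, wrong access" risk the paragraph above warns about.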

How to do it well. Follow the Google enterprise foundations / landing zone pattern:

Separate environments (prod / non-prod / dev) into different folders and different projects — never share a project across environments — so blast radius, IAM, and billing are cleanly split.
Put shared infrastructure (host VPC, logging, monitoring, security tooling) in its own dedicated projects (e.g., a vpc-host-prod, a logging sink project, a security project), distinct from workload projects.
Use a resource naming and labeling convention and apply labels (env, cost-center, owner, data-classification) consistently so billing export and policy can slice by them.
Bootstrap with the Cloud Foundation Toolkit Terraform modules (published under terraform-google-modules) and the enterprise foundations blueprint, and govern ongoing structure with Organization Policy Service constraints (e.g., disable default network, restrict external IPs, restrict resource locations, require OS Login).
Route org-wide audit data with an aggregated log sink at the org or folder level into a central logging project / BigQuery / Cloud Storage.

Artifact	Purpose	Tool
Org → folder → project diagram	The map everything inherits from	Architecture diagram + Terraform
Folder-per-environment layout	Clean prod/non-prod isolation	Resource Manager, folders
Org Policy constraint set	Org-wide guardrails by inheritance	Organization Policy Service
Billing accounts + budgets	Cost ownership and alerting	Cloud Billing, budgets/alerts, BigQuery billing export
Foundation IaC	Reproducible, reviewable structure	Cloud Foundation Toolkit, Infrastructure Manager/Terraform

Artifacts to produce: an organization/folder/project topology diagram, an Organization Policy baseline, a naming-and-labeling standard, the billing-account-to-project mapping with budgets and alerts, and the foundation expressed as version-controlled Terraform.
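A labeling standard is only useful if it is checked. The following sketch validates a resource’s labels against the four keys suggested above. The required-key list comes from this section; the regular expressions are a simplified, ASCII-only approximation of the label syntax (lowercase letters, digits, underscores, and dashes, up to 63 characters, keys starting with a letter), and the function name is hypothetical.

```python
import re

# Keys this section suggests applying consistently.
REQUIRED_LABELS = ("env", "cost-center", "owner", "data-classification")

# Simplified ASCII-only approximation of the label syntax rules.
_KEY = re.compile(r"[a-z][a-z0-9_-]{0,62}")
_VALUE = re.compile(r"[a-z0-9_-]{0,63}")


def label_problems(labels: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the labels pass."""
    problems = [
        f"missing required label: {key}"
        for key in REQUIRED_LABELS
        if key not in labels
    ]
    for key, value in labels.items():
        if not _KEY.fullmatch(key):
            problems.append(f"invalid key: {key!r}")
        if not _VALUE.fullmatch(value):
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


# Hypothetical resource with two missing labels and one malformed value.
for problem in label_problems({"env": "prod", "owner": "Payments Team"}):
    print(problem)
```

Running a check of this kind against planned resources in CI keeps billing export and policy queries reliable, because every resource then carries the same keys.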

Networking foundations — VPC design, hybrid connectivity, and reachability

Networking is the substrate every workload runs on, and on Google Cloud it has one property that makes it different from most clouds: the VPC is a global resource whose subnets are regional. That single fact reshapes how you design — you do not need a separate VPC per region, and you can route between regions over Google’s backbone without peering meshes.

The foundational VPC decisions.

VPC mode: custom, not auto. Always create custom-mode VPCs (and disable the default network via Organization Policy) so you control every subnet and CIDR explicitly. Auto-mode creates a subnet in every region with fixed ranges — convenient and wrong for enterprise IP planning.
Shared VPC vs VPC Peering. Shared VPC lets a host project own the network and service projects attach to it — the standard enterprise pattern, giving central network teams control while application teams own their workloads. VPC Network Peering connects two separate VPCs with non-transitive routing, used when teams or business units need full network autonomy. Network Connectivity Center provides hub-and-spoke transitive connectivity when you need many VPCs/on-prem sites to interconnect.
IP address planning. Allocate non-overlapping RFC 1918 CIDRs centrally with room for growth, and reserve dedicated secondary ranges for GKE Pods and Services (VPC-native/alias IP). Overlapping ranges are the classic blocker to future peering, hybrid connectivity, and acquisitions — plan generously now.
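Overlap checks and subnet carving of the kind described in the IP address planning item can be done with Python’s standard ipaddress module. The plan below is a hypothetical example (range names and CIDRs are invented for illustration); it deliberately includes one on-premises range that collides with a node range.

```python
import ipaddress
from itertools import combinations, islice


def overlaps(plan: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named ranges in the plan that overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def carve(block: str, new_prefix: int, count: int) -> list[str]:
    """Carve the first `count` subnets of size /new_prefix out of a larger block."""
    subnets = ipaddress.ip_network(block).subnets(new_prefix=new_prefix)
    return [str(net) for net in islice(subnets, count)]


# Hypothetical allocation plan: primary subnet, GKE secondary ranges, on-prem.
plan = {
    "asia-south1-nodes": "10.10.0.0/20",
    "asia-south1-pods": "10.12.0.0/14",
    "asia-south1-services": "10.10.16.0/20",
    "onprem-dc": "10.10.8.0/21",
}

print(overlaps(plan))               # [('asia-south1-nodes', 'onprem-dc')]
print(carve("10.20.0.0/16", 20, 3))  # ['10.20.0.0/20', '10.20.16.0/20', '10.20.32.0/20']
```

Running the overlap check whenever the allocation plan changes catches collisions before a subnet exists, which is far cheaper than renumbering after a peering or hybrid link fails to come up.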

Connectivity, ingress, and security.

Concern	The right Google Cloud building block
Hybrid connectivity	Cloud Interconnect (Dedicated or Partner) for private, high-bandwidth links; Cloud VPN (HA VPN) as backup or for lower bandwidth
Dynamic routing	Cloud Router with BGP for hybrid and for regional/global dynamic routing mode
Outbound from private VMs	Cloud NAT (managed, no NAT instances) so private VMs reach the internet without external IPs
Private access to Google APIs	Private Google Access and Private Service Connect to reach BigQuery, Cloud Storage, etc. without traversing the public internet (pair with VPC Service Controls, which restricts data movement rather than providing connectivity)
Global load balancing / ingress	Cloud Load Balancing (global External Application LB, regional, internal, network LB) on Google’s anycast frontend
Edge protection	Cloud Armor (WAF, DDoS, geo/rate rules), Cloud CDN for caching
East-west security	VPC firewall rules, hierarchical firewall policies (org/folder-level), and network firewall policies, with tags or service accounts as rule targets
Exfiltration control	VPC Service Controls service perimeters around sensitive data services
DNS	Cloud DNS (public + private zones, DNS peering/forwarding for hybrid)

Why it matters and how to do it well. Network topology is a foundational, slow-to-change decision; a flat single VPC with overlapping ranges and public IPs everywhere is a security and scaling dead end. The enterprise-standard pattern is: Shared VPC with a host project per environment, custom-mode subnets with deliberate CIDR allocation (including GKE secondary ranges), Cloud NAT for egress, Private Google Access / Private Service Connect so workloads never need public IPs to reach Google services, hierarchical firewall policies for org-wide baseline rules, VPC Service Controls perimeters around data, and HA VPN + Cloud Interconnect for redundant hybrid links terminated on Cloud Router. Artifacts: a CIDR/IP allocation plan, a Shared VPC host/service-project design, a network topology diagram (regions, subnets, connectivity, LB ingress), firewall and hierarchical-policy definitions, and a hybrid-connectivity design with redundancy.

Choosing compute, storage, and databases — selecting the right managed primitives

This is where System Design becomes concrete: matching each workload to the Google Cloud compute, storage, and data services whose operational model, scaling behavior, and consistency guarantees fit the requirement. The framework’s bias is explicit — prefer the most managed option that meets the requirement — because every operational concern you hand to Google is one your team does not run at 2 a.m.

Compute — the managed-vs-control spectrum

Service	Model	Best for	You manage
Cloud Run	Serverless containers, scale-to-zero	Stateless HTTP/event services, APIs, web apps, jobs	Just the container
Cloud Run functions (Cloud Functions)	Event-driven functions (FaaS)	Glue, event handlers, lightweight endpoints	Just the function code
App Engine	PaaS (standard/flexible)	Classic web apps wanting a fully managed platform	App + minimal config
GKE Autopilot	Managed Kubernetes, Google runs nodes	Containerized platforms wanting K8s API without node ops	Workloads + manifests
GKE Standard	Managed Kubernetes, you size node pools	K8s needing node-level control, GPUs/TPUs, custom networking	Node pools + workloads
Compute Engine (MIGs)	VMs / managed instance groups	Lift-and-shift, licensed software, full OS control, specialized hardware	OS, patching, scaling config
Batch / Dataflow / Dataproc	Managed batch & data processing	HPC/batch jobs, streaming/batch ETL, Spark/Hadoop	Job definition

How to choose. Walk the spectrum from most-managed to least: Can it run as a stateless container? → Cloud Run (with scale-to-zero and request-based autoscaling). Does it need the Kubernetes API but not node control? → GKE Autopilot. Does it need node-level control, GPUs/TPUs, or a service mesh you tune? → GKE Standard. Is it a VM-shaped, lift-and-shift, or license-bound workload? → Compute Engine with regional managed instance groups for multi-zone redundancy and autoscaling. For VM cost/efficiency, layer machine families (E2/N2/N2D general purpose, C2/C2D/H3 compute-optimized, M-series memory-optimized), Spot VMs for fault-tolerant work, and committed use discounts for the steady baseline.
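The walk along the spectrum can be written down as a small decision function. This is a sketch of the questions in this section, not an official selection tool: the Workload fields are hypothetical flags, and the function tests the most restrictive requirement first so that the result is the most managed option that still satisfies every flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical requirement flags for one workload."""
    stateless_container: bool = False
    needs_kubernetes_api: bool = False
    needs_node_control: bool = False  # GPUs/TPUs, custom networking, tuned mesh
    vm_shaped: bool = False           # lift-and-shift, licensing, full OS control


def choose_compute(w: Workload) -> str:
    """Return the most managed service that satisfies every stated requirement."""
    if w.vm_shaped:
        return "Compute Engine (regional MIG)"
    if w.needs_node_control:
        return "GKE Standard"
    if w.needs_kubernetes_api:
        return "GKE Autopilot"
    if w.stateless_container:
        return "Cloud Run"
    # Nothing stated: fall back to the option with the fewest assumptions.
    return "Compute Engine (regional MIG)"


print(choose_compute(Workload(stateless_container=True)))
print(choose_compute(Workload(stateless_container=True, needs_node_control=True)))
```

Recording the flags that forced each step down the spectrum gives the ADR its rationale for free.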

Storage — match the access pattern, not the habit

Service	Type	Best for	Notes
Cloud Storage	Object	Unstructured blobs, data lake, backups, static assets	Storage classes: Standard / Nearline / Coldline / Archive; Autoclass; regional/dual-region/multi-region
Persistent Disk / Hyperdisk	Block (VM-attached)	Boot disks, databases on VMs, low-latency block	Zonal vs regional PD (synchronous cross-zone replication)
Local SSD	Ephemeral block	Scratch, caches, very high IOPS	Data lost on stop/terminate
Filestore	Managed NFS	Shared POSIX file systems, lift-and-shift apps, GKE RWX	Tiers from Basic to Enterprise
Cloud Storage FUSE / Parallelstore	File-over-object / HPC parallel FS	ML training data, HPC scratch	High-throughput AI/HPC

How to choose. Decide by access shape and durability need: object (Cloud Storage) for anything blob-like or lake-like, picking the storage class by access frequency (or Autoclass to let Google move objects automatically and avoid early-deletion fees); block (Persistent Disk/Hyperdisk, using regional PD when a database VM must survive a zone failure) for VM-attached low-latency storage; managed NFS (Filestore) only when you genuinely need shared POSIX semantics. Govern object data with lifecycle policies, Object Versioning, retention policies/Bucket Lock (WORM), and uniform bucket-level access.
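Picking a storage class by access frequency reduces to a threshold lookup. The sketch below uses the 30, 90, and 365 day minimum storage durations of Nearline, Coldline, and Archive as cut-offs; using those durations as the decision thresholds is a simplifying assumption, and the function name is hypothetical. Autoclass is a bucket setting rather than a class, so the function returns it only as a recommendation when the access pattern is unknown.

```python
def suggest_storage_class(days_between_reads: float, pattern_known: bool = True) -> str:
    """Suggest a Cloud Storage class from the expected gap between reads of an object."""
    if not pattern_known:
        # Let the service move objects between classes automatically.
        return "Autoclass"
    if days_between_reads < 30:
        return "Standard"
    if days_between_reads < 90:
        return "Nearline"
    if days_between_reads < 365:
        return "Coldline"
    return "Archive"


for days in (1, 45, 180, 400):
    print(days, suggest_storage_class(days))
print(suggest_storage_class(0, pattern_known=False))
```

The same thresholds can be expressed as lifecycle rules on the bucket; the function is useful mainly for reasoning about which rule set a dataset needs.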

Databases — pick by data model and consistency, not by familiarity

Service	Model	Consistency / scale	Best for
Cloud SQL	Managed MySQL / PostgreSQL / SQL Server	Regional HA (multi-zone), read replicas; vertical scale	Lift-and-shift relational, classic OLTP
AlloyDB for PostgreSQL	Managed PostgreSQL-compatible	High-performance HA, columnar engine for analytics	Demanding PostgreSQL OLTP/HTAP
Spanner	Distributed relational	Horizontal scale + strong consistency, regional to global, up to 99.999% availability SLA in multi-region configurations (99.99% regional)	Global OLTP, financial/inventory, no-sharding scale
Firestore	Document NoSQL	Serverless, regional/multi-region, real-time	Mobile/web app data, real-time sync
Bigtable	Wide-column NoSQL	Massive scale, low-latency, high-throughput	Time-series, IoT, ad-tech, large analytical KV
Memorystore	Managed Redis / Valkey / Memcached	In-memory cache	Caching, sessions, leaderboards
BigQuery	Serverless data warehouse	Petabyte-scale analytics, separation of storage/compute	Analytics, BI, ELT, ML on data (BigQuery ML)

How to choose. Start from the data model and consistency requirement: relational + needs to scale globally with strong consistency → Spanner; relational, regional, lift-and-shift → Cloud SQL (or AlloyDB when you need more performance/HTAP from PostgreSQL); document with real-time sync → Firestore; huge-scale low-latency key/wide-column (time-series, IoT) → Bigtable; analytical/warehouse → BigQuery; hot-path caching → Memorystore. Capture for each store the RPO/RTO, HA topology (multi-zone vs multi-region), read-replica strategy, and consistency model, because those are the System Design facts the Reliability pillar will later depend on. Artifacts: a per-workload service-selection matrix (compute/storage/database) with the rationale, a data-classification and residency note per data store, and the HA/replication topology for each stateful service.
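The selection rules in this section can be captured as a lookup keyed on the data model, with two extra flags for the relational case. This is a sketch of the reasoning above under simplifying assumptions: the enum members and parameter names are hypothetical, and real selection would also weigh RPO/RTO, cost, and team skills.

```python
from enum import Enum


class Model(Enum):
    """Data model of the store being selected (illustrative categories)."""
    RELATIONAL = "relational"
    DOCUMENT = "document"
    WIDE_COLUMN = "wide-column"
    ANALYTICAL = "analytical"
    CACHE = "cache"


_NON_RELATIONAL = {
    Model.DOCUMENT: "Firestore",
    Model.WIDE_COLUMN: "Bigtable",
    Model.ANALYTICAL: "BigQuery",
    Model.CACHE: "Memorystore",
}


def choose_database(model: Model, global_scale: bool = False,
                    demanding_postgres: bool = False) -> str:
    """Map a data model and consistency/scale need to a managed database."""
    if model is Model.RELATIONAL:
        if global_scale:
            # Horizontal scale with strong consistency.
            return "Spanner"
        return "AlloyDB for PostgreSQL" if demanding_postgres else "Cloud SQL"
    return _NON_RELATIONAL[model]


print(choose_database(Model.RELATIONAL, global_scale=True))
print(choose_database(Model.WIDE_COLUMN))
```

One row per data store, produced this way and stored with its RPO/RTO and HA topology, is a reasonable starting point for the service-selection matrix.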

Foundational system-design decisions — the cross-cutting choices

Some System Design decisions cut across all the sub-components above and set the trajectory of the whole platform. These are the ones to make consciously, early, and record as ADRs.

Identity foundation. Stand up Cloud Identity / Workspace with single sign-on federated to your IdP (SAML/OIDC), or use Workforce Identity Federation where you prefer not to sync user accounts into Google; use groups as the unit of IAM grants (never grant to individuals); and adopt Workload Identity Federation so workloads (and CI/CD) use short-lived federated credentials instead of long-lived service-account keys.
Single vs multi-region topology. Decide per workload tier whether it is zonal, regional, or multi-region, and write down the resulting availability ceiling — this is the decision the Reliability pillar inherits wholesale.
Managed-first vs control. Default to serverless/managed (Cloud Run, GKE Autopilot, Spanner, BigQuery) and justify any move toward self-managed VMs with a concrete requirement (licensing, OS control, specialized hardware).
Consistency and coupling model. Choose where you accept eventual consistency and asynchronous, Pub/Sub-decoupled processing versus where you require strong consistency (Spanner) — this shapes both performance and reliability.
Quota and scaling ceilings. Inventory the per-project, per-region quotas that gate your scaling paths and request increases ahead of need; project topology and region choice both move these ceilings.
Tagging/labeling and FinOps hooks. Bake in labels and billing export to BigQuery from day one so cost, residency, and ownership are queryable, not retrofitted.
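Writing down an availability ceiling, as the topology decision above asks, is simple arithmetic. The sketch below assumes independent failures, which real systems only approximate, so treat the results as upper bounds; the function names and the sample figures are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial(*availabilities: float) -> float:
    """Availability of a request path that needs every dependency to be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant(availability: float, copies: int) -> float:
    """Availability when any one of `copies` independent replicas suffices."""
    return 1 - (1 - availability) ** copies


# A four-nines target leaves under an hour of downtime per year.
print(f"{downtime_minutes_per_year(0.9999):.2f}")  # 52.56

# A path through two dependencies is less available than either one alone.
print(f"{serial(0.9999, 0.99999):.6f}")

# Two independent 99% replicas behave like one 99.99% component.
print(f"{redundant(0.99, 2):.4f}")
```

The serial case is the one to record in the ADR: every synchronous dependency on the request path lowers the ceiling, which is one reason to decouple non-critical work asynchronously.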

Real-world enterprise scenario

Meridian Pay is a fictional pan-Asian digital-payments and merchant-settlement company headquartered in Bengaluru. Their platform handles real-time card authorizations, a merchant ledger, a settlement engine, and a fraud-scoring service. Peak load reaches ~30,000 authorizations/second during festival-sale windows across India and Southeast Asia. Regulation requires that Indian payment data remain resident in India, and the board has set a target of four-nines (99.99%) availability for the authorization path with a regional active topology plus tested cross-region DR. They are migrating from an on-prem data centre and are net-new on Google Cloud.

Core principles. The platform team writes an ADR log from day one and adopts a managed-first rule: nothing self-hosted unless a licensing or hardware requirement forces it. All infrastructure is defined in Terraform and deployed through Infrastructure Manager. Authorization, ledger, and settlement are decoupled with Pub/Sub, so a slow downstream deepens a queue rather than failing a card swipe.

Geography and regions. They choose asia-south1 (Mumbai) as primary and asia-south2 (Delhi) as secondary — both in India, satisfying data residency while giving a true second region for DR. The authorization tier runs across three zones in asia-south1. An Organization Policy gcp.resourceLocations constraint, set at the production folder, hard-blocks any resource outside the two Indian regions, and the regulated ledger sits inside an Assured Workloads folder.

Resource hierarchy. Under the org they create folders prod, nonprod, and shared. Workloads land in dedicated projects (auth-prod, ledger-prod, settlement-prod), while shared concerns get their own projects: vpc-host-prod (Shared VPC host), logging (aggregated org-level log sink to BigQuery), and security. An Organization Policy baseline disables the default network, blocks external IPs on VMs, enforces OS Login, and restricts locations. Labels (env, cost-center, data-classification=pci) are mandatory, and Cloud Billing budgets with alerts are wired per project, with billing export to BigQuery.

Networking foundations. A custom-mode Shared VPC lives in vpc-host-prod; the three workload projects attach as service projects. Non-overlapping /16s are allocated per region with dedicated GKE secondary ranges for Pods/Services. Cloud NAT provides egress so no VM has a public IP; Private Service Connect and Private Google Access reach BigQuery and Cloud Storage privately; VPC Service Controls wraps the ledger and fraud data in a perimeter to block exfiltration. Ingress is a global External Application Load Balancer fronted by Cloud Armor (rate-limiting + geo rules) and Cloud CDN for static merchant assets. Redundant HA VPN plus a Partner Interconnect link, terminated on Cloud Router, connect the remaining on-prem reconciliation systems.

Compute, storage, databases. The authorization service and merchant dashboard run on Cloud Run (request-based autoscaling, scale-to-zero for non-prod). The fraud-scoring platform, needing GPUs and a service mesh, runs on GKE Standard. The merchant ledger moves to Spanner for horizontal scale with strong consistency and 99.999% availability — no application-level sharding — which is the decision that lets the auth path hit four-nines. Reporting and settlement analytics land in BigQuery (with BigQuery ML for fraud features); the hot authorization cache uses Memorystore for Redis; transaction-evidence blobs and statements go to Cloud Storage with Bucket Lock (WORM) for the regulator’s retention requirement.

Foundational decisions. Workforce sign-in federates the corporate IdP through Workforce Identity Federation; all IAM is granted to groups, and CI/CD uses Workload Identity Federation (zero downloaded service-account keys). Per-project, per-region quotas on Cloud Run instances and Spanner nodes are raised to 2x projected peak ahead of the festival window.

Outcome. Meridian Pay went live with 99.99% measured availability on the authorization path over its first two quarters, passed a residency audit cleanly (the Organization Policy location constraint produced zero out-of-region resources), and executed a DR drill that failed asia-south1 over to asia-south2 with Spanner multi-region promotion and a load-balancer traffic shift in under 8 minutes. Choosing Spanner over sharded Cloud SQL at System Design time is what the team credits for avoiding the re-platform that the old on-prem ledger would have required to scale.

Common pitfalls

Accepting default regions and the default network. Resources land wherever a quickstart put them, creating a residency violation or a flat, overlapping-CIDR network. Avoid it: set every location explicitly in IaC, disable the default network via Organization Policy, and enforce gcp.resourceLocations.
Retrofitting the resource hierarchy. Teams start in one shared project and discover months later that IAM, billing, and blast radius are hopelessly entangled. Avoid it: design the org/folder/project topology and Organization Policy baseline first, using the enterprise foundations blueprint, with one project per environment per workload.
Overlapping or stingy IP ranges. A flat or overlapping CIDR plan blocks future peering, hybrid links, GKE growth, and acquisitions. Avoid it: allocate non-overlapping RFC 1918 ranges centrally with generous headroom and dedicated GKE secondary ranges before the first subnet is built.
Sharded relational DB where Spanner fits. Building application-level sharding on Cloud SQL to chase global scale creates years of operational pain. Avoid it: when you need horizontal scale with strong consistency, choose Spanner at design time rather than re-platforming later.
Reaching for VMs by habit. Lifting everything onto Compute Engine ignores the managed-first principle and signs the team up for patching, scaling, and failover toil. Avoid it: walk the managed spectrum (Cloud Run → GKE Autopilot → GKE Standard → Compute Engine) and justify each step down with a concrete requirement.
Long-lived service-account keys. Downloaded JSON keys leak and never rotate, becoming a standing breach. Avoid it: use Workload Identity Federation for workloads and CI/CD and grant IAM to groups, not individuals.

What’s next

Part 2 of the Google Cloud Architecture Framework series turns to the Operational Excellence pillar — building the observability, automation, incident-management, and operational-readiness practices that keep the system you have just designed running reliably in production.