Where this fits
The Google Cloud Architecture Framework is Google’s body of guidance for building and running workloads on Google Cloud. It is organized into six pillars (System Design; Operational Excellence; Security, Privacy & Compliance; Reliability; Cost Optimization; and Performance Optimization) that sit on top of a set of cross-cutting core principles. System Design is the foundational pillar and part 1 of this series. It comes first deliberately, because it is where you make the structural, hard-to-reverse decisions that every other pillar later inherits: where your data physically lives (geography and regions), how your environment is organized for governance and billing (the resource hierarchy), how packets move and what is reachable (networking foundations), and which managed primitives you build on (compute, storage, and databases). Get System Design right and reliability, cost, and performance become tuning problems; get it wrong and they become migration projects.

Core principles — the design philosophy underneath every decision
The Architecture Framework’s core principles are the lens you apply before you reach for a service. They are not pillar-specific; they are the shared design philosophy that keeps the pillars coherent, and the System Design pillar is where you operationalize them first.
The principles you are applying. Across Google’s guidance the load-bearing principles are:
| Principle | What it means in practice | The System Design consequence |
|---|---|---|
| Design for change | Requirements, traffic, and Google’s own services evolve; bake in the ability to evolve | Favor loosely coupled, API-fronted services; avoid hard-wiring regions or instance types |
| Document your architecture | An architecture that lives only in someone’s head cannot be reviewed, audited, or evolved | Produce diagrams, an ADR (architecture decision record) log, and IaC as the source of truth |
| Simplify and use managed services | Undifferentiated heavy lifting (patching, replication, failover) is Google’s job, not yours | Prefer Cloud Run, GKE Autopilot, BigQuery, Spanner over self-managed equivalents |
| Decouple your architecture | Tight coupling turns a local failure into a global one and blocks independent scaling | Insert Pub/Sub, queues, and well-defined service boundaries between components |
| Use a stateless architecture | Stateless tiers scale horizontally and recover by replacement, not repair | Push state to managed data stores; keep compute fungible |
| Automate and use IaC | Manual, click-ops change is unreproducible and drift-prone | Terraform / Infrastructure Manager, Config Controller, CI/CD for infra |
Why it matters. These principles are what stop System Design from degenerating into “pick a VM and a database.” Design for change is the difference between a workload you can move to a second region in a sprint and one that needs a re-platform. Simplify and use managed services is usually the single highest-leverage decision a team makes on Google Cloud, because Google’s managed primitives (Spanner, BigQuery, Cloud Run, GKE Autopilot) absorb exactly the operational toil that sinks projects.
How to do it well. Treat the principles as a checklist you run against every significant decision and record the answer. The concrete artifacts are an Architecture Decision Record (ADR) log (one short document per significant, hard-to-reverse choice, with context, options, decision, and consequences), a set of reference architecture diagrams, and an IaC repository that defines the environment rather than merely describing it. The framework’s own Architecture Center, Cloud Well-Architected content, and the Google Cloud Architecture Diagramming tool are the canonical references; Active Assist and Recommender later surface where reality has drifted from these principles.
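As a rough illustration of the ADR structure described above (context, options, decision, consequences), the sketch below models one record and renders it to Markdown. The `ADR` class, its field names, the numbering format, and the sample content are all assumptions made for this example; they are not a format prescribed by the framework.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ADR:
    """One architecture decision record (hypothetical structure)."""
    number: int
    title: str
    context: str
    options: list[str]
    decision: str
    consequences: str
    decided_on: date

    def to_markdown(self) -> str:
        """Render the record as a short Markdown document."""
        option_lines = "\n".join(f"- {opt}" for opt in self.options)
        return (
            f"# ADR-{self.number:04d}: {self.title}\n\n"
            f"Date: {self.decided_on.isoformat()}\n\n"
            f"## Context\n{self.context}\n\n"
            f"## Options considered\n{option_lines}\n\n"
            f"## Decision\n{self.decision}\n\n"
            f"## Consequences\n{self.consequences}\n"
        )


# Illustrative content only; the date and wording are made up for the example.
adr = ADR(
    number=1,
    title="Compute platform for the public API",
    context="The API is a stateless HTTP service with spiky traffic.",
    options=["Cloud Run", "GKE Autopilot", "Compute Engine MIG"],
    decision="Run the API on Cloud Run.",
    consequences="No node operations; revisit if node-level control is needed.",
    decided_on=date(2024, 1, 15),
)
print(adr.to_markdown())
```

Keeping the records as small files next to the IaC means a decision and the code that implements it are reviewed in the same pull request.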
Geography and regions — where your data and compute physically live
Region and zone selection is the most physical decision in System Design: it sets latency to your users, your data-residency and compliance posture, your blast radius, and a meaningful slice of your cost. It is also one of the hardest to reverse, because data has gravity — once petabytes live in europe-west1, moving them is a project, not a config change.
The hierarchy of physical placement.
- A region (e.g., asia-south1 = Mumbai, us-central1 = Iowa, europe-west1 = Belgium) is an independent geographic area, itself made of zones.
- A zone (e.g., asia-south1-a) is a deployment area within a region, isolated for failure-domain purposes; Google guidance is to spread across at least two, preferably three, zones so the loss of one zone does not take the workload down.
- Multi-region locations (e.g., US, EU, ASIA) exist for specific services (Cloud Storage, BigQuery, Spanner, Firestore) and replicate data across multiple regions in a geography for durability and availability.
The decisions you are actually making.
| Decision driver | What it forces you to evaluate | GCP levers |
|---|---|---|
| Latency to users | Pick regions close to the user population; multi-region front door | Cloud CDN, global External Application Load Balancer, Network Service Tiers (Premium vs Standard) |
| Data residency / sovereignty | Some data legally cannot leave a country/continent | Regional resources, Organization Policy location constraints, Assured Workloads, Sovereign Controls |
| Service availability | Not every product or machine type exists in every region | Per-region product availability, GPU/TPU availability |
| Reliability target | Single-zone vs multi-zone vs multi-region | Zonal vs regional resources (regional MIGs, regional persistent disk) |
| Carbon footprint | Regions differ in carbon intensity | Per-region carbon-free energy % published by Google |
| Cost | Pricing differs by region; egress between regions is charged | Per-region pricing, network egress matrix |
How to do it well. Choose a primary region by latency and residency, then a secondary region for DR in the same geography: inside the same legal jurisdiction when residency rules apply, and close enough to keep cross-region latency low. Make every resource’s location an explicit, IaC-set value and never accept a default region. For data-residency-bound workloads, enforce placement with the Organization Policy constraint gcp.resourceLocations (allowed/denied locations) rather than trusting engineers to remember, and consider Assured Workloads to get a compliance-scoped folder with location and personnel controls baked in. Artifacts: a region-selection rationale (latency, residency, carbon-free energy %, cost) recorded as an ADR, a primary/secondary region pair per workload, and Organization Policy location constraints applied at the right folder.
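The idea of treating every location as an explicit, checkable value can be sketched as a small pre-merge lint. This is only an illustration: the allowed-region pair, the resource names, and the helper functions are assumptions, and a check like this complements the gcp.resourceLocations constraint rather than replacing it.

```python
# Example allow-list; the region pair is an assumption for illustration.
ALLOWED_REGIONS = {"asia-south1", "asia-south2"}


def region_of(location: str) -> str:
    """Return the region for a zone ('asia-south1-a') or a region ('asia-south1').

    Multi-region names such as 'US' are returned lowercased and unchanged.
    """
    parts = location.lower().split("-")
    # Zone names append a single-letter suffix to the region name.
    if len(parts) == 3 and len(parts[2]) == 1 and parts[2].isalpha():
        return "-".join(parts[:2])
    return location.lower()


def violations(resources: dict[str, str], allowed: set[str]) -> list[str]:
    """List resource names whose location falls outside the allowed regions."""
    return sorted(
        name for name, loc in resources.items() if region_of(loc) not in allowed
    )


# Hypothetical planned resources mapped to their declared locations.
planned = {
    "auth-service": "asia-south1",
    "ledger-replica": "asia-south2-b",
    "scratch-bucket": "US",         # multi-region, outside the allow-list
    "test-vm": "us-central1-a",     # zone in a disallowed region
}
print(violations(planned, ALLOWED_REGIONS))
```

Running the check in CI catches a stray location before apply time; the Organization Policy constraint remains the enforcement point for anything created outside the pipeline.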
The resource hierarchy — the structural backbone for governance, billing, and isolation
The Google Cloud resource hierarchy is the single most important governance artifact you create, because almost everything else attaches to it: IAM policies, Organization Policies, billing, networking scope, and quotas all inherit down it. Designing it deliberately, up front, is the difference between governance-by-design and governance-by-cleanup-project-two-years-later.
The levels, top to bottom.
- Organization — the root node, tied 1:1 to a Cloud Identity or Google Workspace account; it represents the company and is where org-wide IAM and Organization Policy live.
- Folders — optional grouping nodes (nestable) that typically model departments, teams, environments, or legal entities; they are the natural attach point for delegated administration and environment-specific policy.
- Projects — the fundamental unit of resource ownership, billing, quota, and isolation; every resource lives in exactly one project, and a project is the boundary for APIs, service accounts, and most IAM.
- Resources — the VMs, buckets, databases, etc., inside projects.
Why it matters. The hierarchy is how inheritance works. An IAM role granted at a folder flows to every project beneath it; an Organization Policy set at the org constrains everything below unless explicitly overridden. This is enormously powerful and equally dangerous: a project landing in the wrong folder silently inherits the wrong access and the wrong guardrails. The project is also the billing and quota boundary — billing rolls up to a Cloud Billing account linked per project, and quotas are largely per-project-per-region, so your project topology directly shapes both your cost reporting and your scaling ceilings.
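The inheritance behaviour described above can be modelled in a few lines. The sketch below is a deliberately simplified mental model, not the IAM API: it treats allow bindings as additive down the tree and ignores deny policies, conditions, and Organization Policy overrides. The `Node` class and the group addresses are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A node in a simplified org -> folder -> project tree."""
    name: str
    parent: Optional[Node] = None
    bindings: set[tuple[str, str]] = field(default_factory=set)  # (role, member)


def effective_bindings(node: Node) -> set[tuple[str, str]]:
    """Union the bindings set on this node with those of every ancestor."""
    result: set[tuple[str, str]] = set()
    current: Optional[Node] = node
    while current is not None:
        result |= current.bindings
        current = current.parent
    return result


# Hypothetical hierarchy: one grant at each level.
org = Node("org", bindings={("roles/logging.viewer", "group:secops@example.com")})
prod = Node("prod", parent=org,
            bindings={("roles/compute.networkAdmin", "group:netops@example.com")})
auth = Node("auth-prod", parent=prod,
            bindings={("roles/run.developer", "group:auth-team@example.com")})

for role, member in sorted(effective_bindings(auth)):
    print(role, member)
```

Moving `auth` under a different folder in this model changes its effective bindings without touching the project itself, which is exactly the "wrong folder, wrong access" risk the paragraph warns about.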
How to do it well. Follow the Google enterprise foundations / landing zone pattern:
- Separate environments (prod / non-prod / dev) into different folders and different projects — never share a project across environments — so blast radius, IAM, and billing are cleanly split.
- Put shared infrastructure (host VPC, logging, monitoring, security tooling) in its own dedicated projects (e.g., a vpc-host-prod, a logging sink project, a security project), distinct from workload projects.
- Use a resource naming and labeling convention and apply labels (env, cost-center, owner, data-classification) consistently so billing export and policy can slice by them.
- Bootstrap with the Cloud Foundation Toolkit / Terraform terraform-google-modules and the enterprise foundations blueprint, and govern ongoing structure with Organization Policy Service constraints (e.g., disable default network, restrict external IPs, restrict resource locations, require OS Login).
- Route org-wide audit data with an aggregated log sink at the org or folder level into a central logging project / BigQuery / Cloud Storage.
| Artifact | Purpose | Tool |
|---|---|---|
| Org → folder → project diagram | The map everything inherits from | Architecture diagram + Terraform |
| Folder-per-environment layout | Clean prod/non-prod isolation | Resource Manager, folders |
| Org Policy constraint set | Org-wide guardrails by inheritance | Organization Policy Service |
| Billing accounts + budgets | Cost ownership and alerting | Cloud Billing, budgets/alerts, BigQuery billing export |
| Foundation IaC | Reproducible, reviewable structure | Cloud Foundation Toolkit, Infrastructure Manager/Terraform |
Artifacts to produce: an organization/folder/project topology diagram, an Organization Policy baseline, a naming-and-labeling standard, the billing-account-to-project mapping with budgets and alerts, and the foundation expressed as version-controlled Terraform.
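A naming-and-labeling standard is easier to keep when it is checked mechanically. The sketch below validates a label set against the four labels named in this section. The character rules encoded in the regular expressions (lowercase letters, digits, underscores, and dashes; keys start with a letter; at most 63 characters) reflect my understanding of Google Cloud label requirements and should be confirmed against current documentation; the function name is hypothetical.

```python
import re

# Required keys taken from the labeling convention in this section.
REQUIRED_LABELS = ("env", "cost-center", "owner", "data-classification")

# Assumed label format rules; verify against the current Google Cloud docs.
KEY_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")
VALUE_RE = re.compile(r"[a-z0-9_-]{0,63}")


def label_problems(labels: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the labels pass."""
    problems = [
        f"missing required label: {key}"
        for key in REQUIRED_LABELS
        if key not in labels
    ]
    for key, value in labels.items():
        if not KEY_RE.fullmatch(key):
            problems.append(f"invalid key: {key!r}")
        if not VALUE_RE.fullmatch(value):
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


good = {
    "env": "prod",
    "cost-center": "cc-1042",
    "owner": "payments-platform",
    "data-classification": "pci",
}
bad = {"env": "prod", "owner": "Alice@example.com"}

print(label_problems(good))
print(label_problems(bad))
```

One practical consequence of the format rules: an email address is not a usable label value, so an `owner` label typically carries a team or group slug instead.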
Networking foundations — VPC design, hybrid connectivity, and reachability
Networking is the substrate every workload runs on, and on Google Cloud it has one property that makes it different from most clouds: the VPC is a global resource whose subnets are regional. That single fact reshapes how you design — you do not need a separate VPC per region, and you can route between regions over Google’s backbone without peering meshes.
The foundational VPC decisions.
- VPC mode: custom, not auto. Always create custom-mode VPCs (and disable the default network via Organization Policy) so you control every subnet and CIDR explicitly. Auto-mode creates a subnet in every region with fixed ranges — convenient and wrong for enterprise IP planning.
- Shared VPC vs VPC Peering. Shared VPC lets a host project own the network and service projects attach to it — the standard enterprise pattern, giving central network teams control while application teams own their workloads. VPC Network Peering connects two separate VPCs with non-transitive routing, used when teams or business units need full network autonomy. Network Connectivity Center provides hub-and-spoke transitive connectivity when you need many VPCs/on-prem sites to interconnect.
- IP address planning. Allocate non-overlapping RFC 1918 CIDRs centrally with room for growth, and reserve dedicated secondary ranges for GKE Pods and Services (VPC-native/alias IP). Overlapping ranges are the classic blocker to future peering, hybrid connectivity, and acquisitions — plan generously now.
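The IP-planning advice above lends itself to an automated check. The sketch below uses Python's standard `ipaddress` module to flag overlapping ranges, flag anything outside RFC 1918 space, and carve subnets from a regional block. The plan itself (names and CIDRs) is invented for illustration, including one deliberate collision.

```python
import ipaddress
from itertools import combinations

# Hypothetical allocation plan; names and ranges are illustrative only.
PLAN = {
    "region-a/nodes": "10.10.0.0/16",
    "region-a/gke-pods": "10.20.0.0/14",
    "region-a/gke-services": "10.24.0.0/20",
    "region-b/nodes": "10.11.0.0/16",
    "on-prem": "10.0.0.0/13",
    "acquired-co": "10.10.128.0/17",  # deliberately collides with region-a/nodes
}

RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def find_overlaps(plan: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named ranges that share at least one address."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


def outside_rfc1918(plan: dict[str, str]) -> list[str]:
    """Return the names of ranges not fully contained in RFC 1918 space."""
    return [
        name
        for name, cidr in plan.items()
        if not any(ipaddress.ip_network(cidr).subnet_of(block) for block in RFC1918)
    ]


print(find_overlaps(PLAN))
print(outside_rfc1918(PLAN))

# Carve the first three /20 subnets out of a regional /16.
first_three = list(ipaddress.ip_network("10.10.0.0/16").subnets(new_prefix=20))[:3]
print([str(net) for net in first_three])
```

Keeping the plan as data in the same repository as the network IaC lets a check like this run on every change, before the first overlapping subnet exists.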
Connectivity, ingress, and security.
| Concern | The right Google Cloud building block |
|---|---|
| Hybrid connectivity | Cloud Interconnect (Dedicated or Partner) for private, high-bandwidth links; Cloud VPN (HA VPN) as backup or for lower bandwidth |
| Dynamic routing | Cloud Router with BGP for hybrid and for regional/global dynamic routing mode |
| Outbound from private VMs | Cloud NAT (managed, no NAT instances) so private VMs reach the internet without external IPs |
| Private access to Google APIs | Private Google Access, Private Service Connect, VPC Service Controls to reach BigQuery, Cloud Storage, etc. without traversing the public internet |
| Global load balancing / ingress | Cloud Load Balancing (global External Application LB, regional, internal, network LB) on Google’s anycast frontend |
| Edge protection | Cloud Armor (WAF, DDoS, geo/rate rules), Cloud CDN for caching |
| East-west security | VPC firewall rules, hierarchical firewall policies (org/folder-level), and network firewall policies, with tags/service accounts as rule targets |
| Exfiltration control | VPC Service Controls service perimeters around sensitive data services |
| DNS | Cloud DNS (public + private zones, DNS peering/forwarding for hybrid) |
Why it matters and how to do it well. Network topology is a foundational, slow-to-change decision; a flat single VPC with overlapping ranges and public IPs everywhere is a security and scaling dead end. The enterprise-standard pattern is: Shared VPC with a host project per environment, custom-mode subnets with deliberate CIDR allocation (including GKE secondary ranges), Cloud NAT for egress, Private Google Access / Private Service Connect so workloads never need public IPs to reach Google services, hierarchical firewall policies for org-wide baseline rules, VPC Service Controls perimeters around data, and HA VPN + Cloud Interconnect for redundant hybrid links terminated on Cloud Router. Artifacts: a CIDR/IP allocation plan, a Shared VPC host/service-project design, a network topology diagram (regions, subnets, connectivity, LB ingress), firewall and hierarchical-policy definitions, and a hybrid-connectivity design with redundancy.
Choosing compute, storage, and databases — selecting the right managed primitives
This is where System Design becomes concrete: matching each workload to the Google Cloud compute, storage, and data services whose operational model, scaling behavior, and consistency guarantees fit the requirement. The framework’s bias is explicit — prefer the most managed option that meets the requirement — because every operational concern you hand to Google is one your team does not run at 2 a.m.
Compute — the managed-vs-control spectrum
| Service | Model | Best for | You manage |
|---|---|---|---|
| Cloud Run | Serverless containers, scale-to-zero | Stateless HTTP/event services, APIs, web apps, jobs | Just the container |
| Cloud Run functions (Cloud Functions) | Event-driven functions (FaaS) | Glue, event handlers, lightweight endpoints | Just the function code |
| App Engine | PaaS (standard/flexible) | Classic web apps wanting a fully managed platform | App + minimal config |
| GKE Autopilot | Managed Kubernetes, Google runs nodes | Containerized platforms wanting K8s API without node ops | Workloads + manifests |
| GKE Standard | Managed Kubernetes, you size node pools | K8s needing node-level control, GPUs/TPUs, custom networking | Node pools + workloads |
| Compute Engine (MIGs) | VMs / managed instance groups | Lift-and-shift, licensed software, full OS control, specialized hardware | OS, patching, scaling config |
| Batch / Dataflow / Dataproc | Managed batch & data processing | HPC/batch jobs, streaming/batch ETL, Spark/Hadoop | Job definition |
How to choose. Walk the spectrum from most-managed to least: Can it run as a stateless container? → Cloud Run (with scale-to-zero and request-based autoscaling). Does it need the Kubernetes API but not node control? → GKE Autopilot. Does it need node-level control, GPUs/TPUs, or a service mesh you tune? → GKE Standard. Is it a VM-shaped, lift-and-shift, or license-bound workload? → Compute Engine with regional managed instance groups for multi-zone redundancy and autoscaling. For VM cost/efficiency, layer machine families (E2/N2/N2D general purpose, C-series compute-optimized, M-series memory-optimized), Spot VMs for fault-tolerant work, and committed use discounts for steady baseline.
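The walk down the spectrum can be written as a small decision function. This is a simplification under stated assumptions: the `Workload` fields are hypothetical flags, and the checks run from the hardest requirement upward so that a concrete need (for example GPUs) wins over "it could run as a container". The result is the same as starting at Cloud Run and stepping down only when a requirement forces it.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical requirement flags for one workload."""
    stateless_container: bool = False
    needs_kubernetes_api: bool = False
    needs_node_control: bool = False  # GPUs/TPUs, tuned service mesh, node config
    vm_shaped: bool = False           # lift-and-shift, license-bound, full OS control


def choose_compute(w: Workload) -> str:
    """Return the most managed option that still satisfies every requirement."""
    if w.vm_shaped:
        return "Compute Engine (regional MIG)"
    if w.needs_node_control:
        return "GKE Standard"
    if w.needs_kubernetes_api:
        return "GKE Autopilot"
    if w.stateless_container:
        return "Cloud Run"
    return "needs review"  # none of the documented paths applies


api = Workload(stateless_container=True)
platform = Workload(stateless_container=True, needs_kubernetes_api=True)
gpu_scoring = Workload(needs_kubernetes_api=True, needs_node_control=True)
legacy_erp = Workload(vm_shaped=True)

for name, w in [("api", api), ("platform", platform),
                ("gpu_scoring", gpu_scoring), ("legacy_erp", legacy_erp)]:
    print(name, "->", choose_compute(w))
```

Recording which flag forced each step down gives the justification the managed-first rule asks for.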
Storage — match the access pattern, not the habit
| Service | Type | Best for | Notes |
|---|---|---|---|
| Cloud Storage | Object | Unstructured blobs, data lake, backups, static assets | Storage classes: Standard / Nearline / Coldline / Archive; Autoclass; regional/dual-region/multi-region |
| Persistent Disk / Hyperdisk | Block (VM-attached) | Boot disks, databases on VMs, low-latency block | Zonal vs regional PD (synchronous cross-zone replication) |
| Local SSD | Ephemeral block | Scratch, caches, very high IOPS | Data lost on stop/terminate |
| Filestore | Managed NFS | Shared POSIX file systems, lift-and-shift apps, GKE RWX | Tiers from Basic to Enterprise |
| Cloud Storage FUSE / Parallelstore | File-over-object / HPC parallel FS | ML training data, HPC scratch | High-throughput AI/HPC |
How to choose. Decide by access shape and durability need: object (Cloud Storage) for anything blob-like or lake-like, picking the storage class by access frequency (or Autoclass to let Google move objects automatically and avoid early-deletion fees); block (Persistent Disk/Hyperdisk, using regional PD when a database VM must survive a zone failure) for VM-attached low-latency storage; managed NFS (Filestore) only when you genuinely need shared POSIX semantics. Govern object data with lifecycle policies, Object Versioning, retention policies/Bucket Lock (WORM), and uniform bucket-level access.
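Picking a storage class by access frequency can be expressed as a rule of thumb. In the sketch below the thresholds mirror the minimum storage durations of the colder classes (30 days for Nearline, 90 for Coldline, 365 for Archive); using those durations as decision thresholds is my simplification, and the function name is hypothetical. When the access pattern is unknown, it suggests enabling Autoclass, which is a bucket feature rather than a storage class.

```python
from typing import Optional


def suggest_storage_class(days_between_reads: Optional[float]) -> str:
    """Suggest a Cloud Storage class from the expected gap between reads.

    Pass None when the access pattern is unknown or unpredictable.
    """
    if days_between_reads is None:
        return "AUTOCLASS"  # bucket feature, not a storage class
    if days_between_reads < 0:
        raise ValueError("days_between_reads must be non-negative")
    if days_between_reads >= 365:
        return "ARCHIVE"
    if days_between_reads >= 90:
        return "COLDLINE"
    if days_between_reads >= 30:
        return "NEARLINE"
    return "STANDARD"


for gap in (1, 45, 120, 400, None):
    print(gap, "->", suggest_storage_class(gap))
```

The same thresholds can drive lifecycle rules, so objects that start in Standard move to colder classes as they age.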
Databases — pick by data model and consistency, not by familiarity
| Service | Model | Consistency / scale | Best for |
|---|---|---|---|
| Cloud SQL | Managed MySQL / PostgreSQL / SQL Server | Regional HA (multi-zone), read replicas; vertical scale | Lift-and-shift relational, classic OLTP |
| AlloyDB for PostgreSQL | Managed PostgreSQL-compatible | High-performance HA, columnar engine for analytics | Demanding PostgreSQL OLTP/HTAP |
| Spanner | Distributed relational | Horizontal scale + strong consistency, regional or multi-region; availability SLA up to 99.999% in multi-region configurations (99.99% regional) | Global OLTP, financial/inventory, no-sharding scale |
| Firestore | Document NoSQL | Serverless, regional/multi-region, real-time | Mobile/web app data, real-time sync |
| Bigtable | Wide-column NoSQL | Massive scale, low-latency, high-throughput | Time-series, IoT, ad-tech, large analytical KV |
| Memorystore | Managed Redis / Valkey / Memcached | In-memory cache | Caching, sessions, leaderboards |
| BigQuery | Serverless data warehouse | Petabyte-scale analytics, separation of storage/compute | Analytics, BI, ELT, ML on data (BigQuery ML) |
How to choose. Start from the data model and consistency requirement: relational + needs to scale globally with strong consistency → Spanner; relational, regional, lift-and-shift → Cloud SQL (or AlloyDB when you need more performance/HTAP from PostgreSQL); document with real-time sync → Firestore; huge-scale low-latency key/wide-column (time-series, IoT) → Bigtable; analytical/warehouse → BigQuery; hot-path caching → Memorystore. Capture for each store the RPO/RTO, HA topology (multi-zone vs multi-region), read-replica strategy, and consistency model, because those are the System Design facts the Reliability pillar will later depend on. Artifacts: a per-workload service-selection matrix (compute/storage/database) with the rationale, a data-classification and residency note per data store, and the HA/replication topology for each stateful service.
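The selection rules above reduce to a lookup on data model plus a couple of qualifiers. The sketch below is one way to encode them for a service-selection matrix; the model names and keyword arguments are hypothetical, and a real decision would also weigh the RPO/RTO and HA topology noted above.

```python
def choose_database(
    model: str,
    *,
    global_scale: bool = False,
    needs_more_pg_performance: bool = False,
) -> str:
    """Map a data model and its qualifiers to a managed database service."""
    if model == "relational":
        if global_scale:
            return "Spanner"
        if needs_more_pg_performance:
            return "AlloyDB for PostgreSQL"
        return "Cloud SQL"

    by_model = {
        "document": "Firestore",
        "wide-column": "Bigtable",
        "analytical": "BigQuery",
        "cache": "Memorystore",
    }
    if model not in by_model:
        raise ValueError(f"unknown data model: {model!r}")
    return by_model[model]


print(choose_database("relational", global_scale=True))
print(choose_database("relational", needs_more_pg_performance=True))
print(choose_database("relational"))
print(choose_database("wide-column"))
```

Raising on an unknown model, rather than defaulting, forces the exception to be discussed and recorded instead of silently landing on a familiar service.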
Foundational system-design decisions — the cross-cutting choices
Some System Design decisions cut across all the sub-components above and set the trajectory of the whole platform. These are the ones to make consciously, early, and record as ADRs.
- Identity foundation. Stand up Cloud Identity / Workspace federated to your IdP (e.g., via Workforce Identity Federation / SAML), use groups as the unit of IAM grants (never grant to individuals), and adopt Workload Identity Federation so workloads (and CI/CD) use short-lived federated credentials instead of long-lived service-account keys.
- Single vs multi-region topology. Decide per workload tier whether it is zonal, regional, or multi-region, and write down the resulting availability ceiling — this is the decision the Reliability pillar inherits wholesale.
- Managed-first vs control. Default to serverless/managed (Cloud Run, GKE Autopilot, Spanner, BigQuery) and justify any move toward self-managed VMs with a concrete requirement (licensing, OS control, specialized hardware).
- Consistency and coupling model. Choose where you accept eventual consistency and asynchronous, Pub/Sub-decoupled processing versus where you require strong consistency (Spanner) — this shapes both performance and reliability.
- Quota and scaling ceilings. Inventory the per-project, per-region quotas that gate your scaling paths and request increases ahead of need; project topology and region choice both move these ceilings.
- Tagging/labeling and FinOps hooks. Bake in labels and billing export to BigQuery from day one so cost, residency, and ownership are queryable, not retrofitted.
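Writing down the "availability ceiling" for a topology is mostly arithmetic. The sketch below shows three standard calculations: the yearly downtime budget for a target, the combined availability of components in series, and the availability of redundant copies. It assumes failures are independent, which real systems only approximate, and the input figures are illustrative rather than SLA quotes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every component to be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` independent replicas suffices."""
    return 1 - (1 - single) ** copies


# Four nines leaves roughly 52.6 minutes of downtime per year.
print(round(downtime_budget_minutes(0.9999), 2))

# Two four-nines dependencies in series fall below four nines.
print(round(serial_availability(0.9999, 0.9999), 6))

# Two independent 99% replicas reach about 99.99% together.
print(round(redundant_availability(0.99, 2), 6))
```

The serial case is the one to watch: every synchronous dependency added to a request path lowers its ceiling, which is one argument for the asynchronous, Pub/Sub-decoupled processing mentioned above.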
Real-world enterprise scenario
Meridian Pay is a fictional pan-Asian digital-payments and merchant-settlement company headquartered in Bengaluru. Their platform handles real-time card authorizations, a merchant ledger, a settlement engine, and a fraud-scoring service. Peak load reaches ~30,000 authorizations/second during festival-sale windows across India and Southeast Asia. Regulation requires that Indian payment data remain resident in India, and the board has set a target of four-nines (99.99%) availability for the authorization path with a regional active topology plus tested cross-region DR. They are migrating from an on-prem data centre and are net-new on Google Cloud.
Core principles. The platform team writes an ADR log from day one and adopts a managed-first rule: nothing self-hosted unless a licensing or hardware requirement forces it. All infrastructure is defined in Terraform and deployed via Infrastructure Manager. Authorization, ledger, and settlement are decoupled with Pub/Sub, so a slow downstream deepens a queue rather than failing a card swipe.
Geography and regions. They choose asia-south1 (Mumbai) as primary and asia-south2 (Delhi) as secondary — both in India, satisfying data residency while giving a true second region for DR. The authorization tier runs across three zones in asia-south1. An Organization Policy gcp.resourceLocations constraint, set at the production folder, hard-blocks any resource outside the two Indian regions, and the regulated ledger sits inside an Assured Workloads folder.
Resource hierarchy. Under the org they create folders prod, nonprod, and shared. Workloads land in dedicated projects (auth-prod, ledger-prod, settlement-prod), while shared concerns get their own projects: vpc-host-prod (Shared VPC host), logging (aggregated org-level log sink to BigQuery), and security. An Organization Policy baseline disables the default network, blocks external IPs on VMs, enforces OS Login, and restricts locations. Labels (env, cost-center, data-classification=pci) are mandatory, and Cloud Billing budgets with alerts are wired per project, with billing export to BigQuery.
Networking foundations. A custom-mode Shared VPC lives in vpc-host-prod; the three workload projects attach as service projects. Non-overlapping /16s are allocated per region with dedicated GKE secondary ranges for Pods/Services. Cloud NAT provides egress so no VM has a public IP; Private Service Connect and Private Google Access reach BigQuery and Cloud Storage privately; VPC Service Controls wraps the ledger and fraud data in a perimeter to block exfiltration. Ingress is a global External Application Load Balancer fronted by Cloud Armor (rate-limiting + geo rules) and Cloud CDN for static merchant assets. Redundant HA VPN plus a Partner Interconnect link, terminated on Cloud Router, connect the remaining on-prem reconciliation systems.
Compute, storage, databases. The authorization service and merchant dashboard run on Cloud Run (request-based autoscaling, scale-to-zero for non-prod). The fraud-scoring platform, needing GPUs and a service mesh, runs on GKE Standard. The merchant ledger moves to Spanner for horizontal scale with strong consistency and 99.999% availability — no application-level sharding — which is the decision that lets the auth path hit four-nines. Reporting and settlement analytics land in BigQuery (with BigQuery ML for fraud features); the hot authorization cache uses Memorystore for Redis; transaction-evidence blobs and statements go to Cloud Storage with Bucket Lock (WORM) for the regulator’s retention requirement.
Foundational decisions. Identity federates the corporate IdP into Cloud Identity via Workforce Identity Federation; all IAM is granted to groups, and CI/CD uses Workload Identity Federation (zero downloaded service-account keys). Per-project, per-region quotas on Cloud Run instances and Spanner nodes are raised to 2x projected peak ahead of the festival window.
Outcome. Meridian Pay went live with 99.99% measured availability on the authorization path over its first two quarters, passed a residency audit cleanly (the Organization Policy location constraint produced zero out-of-region resources), and executed a DR drill that failed asia-south1 over to asia-south2 with Spanner multi-region promotion and a load-balancer traffic shift in under 8 minutes. Choosing Spanner over sharded Cloud SQL at System Design time is what the team credits for avoiding the re-platform that the old on-prem ledger would have required to scale.
Deliverables & checklist
Common pitfalls
- Accepting default regions and the default network. Resources land wherever a quickstart put them, creating a residency violation or a flat, overlapping-CIDR network. Avoid it: set every location explicitly in IaC, disable the default network via Organization Policy, and enforce gcp.resourceLocations.
- Retrofitting the resource hierarchy. Teams start in one shared project and discover months later that IAM, billing, and blast radius are hopelessly entangled. Avoid it: design the org/folder/project topology and Organization Policy baseline first, using the enterprise foundations blueprint, with one project per environment per workload.
- Overlapping or stingy IP ranges. A flat or overlapping CIDR plan blocks future peering, hybrid links, GKE growth, and acquisitions. Avoid it: allocate non-overlapping RFC 1918 ranges centrally with generous headroom and dedicated GKE secondary ranges before the first subnet is built.
- Sharded relational DB where Spanner fits. Building application-level sharding on Cloud SQL to chase global scale creates years of operational pain. Avoid it: when you need horizontal scale with strong consistency, choose Spanner at design time rather than re-platforming later.
- Reaching for VMs by habit. Lifting everything onto Compute Engine ignores the managed-first principle and signs the team up for patching, scaling, and failover toil. Avoid it: walk the managed spectrum (Cloud Run → GKE Autopilot → GKE Standard → Compute Engine) and justify each step down with a concrete requirement.
- Long-lived service-account keys. Downloaded JSON keys leak and never rotate, becoming a standing breach. Avoid it: use Workload Identity Federation for workloads and CI/CD and grant IAM to groups, not individuals.
What’s next
Part 2 of the Google Cloud Architecture Framework series turns to the Operational Excellence pillar — building the observability, automation, incident-management, and operational-readiness practices that keep the system you have just designed running reliably in production.