GCP Enterprise Architecture: Hybrid & Multi-Cloud

The hardest question in any multi-cloud strategy is not “how do I run a container on three clouds” — Kubernetes solved that years ago — it is “how do I operate, secure, and reason about thirty clusters spread across two data centers, AWS, and GCP as one thing, without thirty separate audit stories, thirty RBAC models, and thirty CI/CD snowflakes.” The moment you have more than a couple of clusters, the cluster stops being the unit of management and becomes a liability: every new one multiplies the policy drift, the patching surface, the “wait, which cluster is that service even on” confusion. The discipline that turns a sprawl of clusters into a governable platform is fleet management — treating a set of clusters as a single object with one identity model, one policy source, one service-mesh control plane, and one place to ask “is everything compliant right now.” On Google Cloud the product that does this is GKE Enterprise (the capability formerly and still widely branded Anthos): a fleet abstraction that spans GKE on Google Cloud, GKE on AWS/Azure, and Google Distributed Cloud (GDC) on-prem (VMware or bare metal), with Config Sync for GitOps policy, Policy Controller for guardrails, Cloud Service Mesh for identity-based traffic, and fleet Workload Identity so every workload on every cluster proves who it is the same way.

That fleet only means something if the clusters can actually reach each other and reach Google’s APIs over a private, reliable network — which is the other half of this article. A multi-cloud control plane riding over flaky public-internet tunnels is a demo, not a platform. So the connectivity spine here is Cloud Interconnect (Dedicated or Partner) for the high-bandwidth, low-jitter link to the data center, HA VPN as the encrypted day-one and permanent-backup path (and the primary link into AWS), Network Connectivity Center as the hub that ties the spokes together, and Cloud DNS for the hybrid name resolution that makes service discovery work across the boundary. The thesis of the article is that these two halves — the fleet (control) and the connectivity (data path) — are a single architecture: you cannot govern what you cannot reach, and reaching without governing is how you end up with the very sprawl you were trying to escape.

The business scenario

Helvetia Risk is a fictional but representative mid-market specialty insurer: roughly 3,400 employees, writing commercial property and marine cover, with around ₹4,100 crore (~USD 490M) in annual gross written premium. Their estate is the textbook “we ended up multi-cloud by accident, and now it’s strategy” shape that lands on a platform team three years into a cloud journey:

The mandate is therefore not “migrate to GCP.” It is: operate all three Kubernetes environments as one fleet with one policy source, one identity model, and one compliance dashboard; connect Zurich on-prem to GCP over a private, high-bandwidth, redundant link; bring the AWS workloads under the same governance and onto the same private network as the GCP and on-prem ones; give services a consistent, identity-based way to call each other across all three clouds with mTLS; and make “prove this control is enforced everywhere” a dashboard query, not a project. Every one of those maps onto a specific GKE Enterprise or connectivity primitive — that is what makes this a hybrid/multi-cloud architecture and not just “we have clusters in three places.”

Crucially the shape scales both ways. A 15-person team with one on-prem cluster and one GKE cluster deploys the identical pattern — a single fleet, Config Sync pointing at one Git repo, HA VPN to the data center, Cloud Service Mesh across two members — for a modest monthly spend, and grows into the Interconnect-backed, multi-cloud, dozens-of-clusters version without redrawing the diagram or relearning the model. That is what makes it a reference architecture rather than a megacorp special.

Architecture overview

The architecture has two planes that are easy to conflate and must be kept distinct: the management plane (the fleet — how clusters are registered, governed, and observed as one) and the data plane (the network — how packets and service calls actually traverse on-prem, AWS, and GCP). The fleet is “Google Cloud’s view of all your clusters”; the network is “how those clusters and Google’s APIs are privately reachable.” You design them together but you reason about them separately.

Figure: GCP hybrid and multi-cloud reference architecture. A single GKE Enterprise fleet (Config Sync, Policy Controller, Cloud Service Mesh, Workload Identity) spans on-prem VMware, AWS EKS and GCP GKE on top, over a private network spine (Shared VPC, Cloud Interconnect, HA VPN, Network Connectivity Center, Cloud DNS) on the bottom, with the numbered cross-cloud request flow.

The management plane — one fleet, many members. Everything hangs off a single fleet (a fleet is scoped to one GCP project, the fleet host project). Into that fleet you register each cluster as a member, regardless of where it runs:

  1. GKE on Google Cloud clusters in europe-west6 register natively.
  2. EKS clusters in AWS register as attached clusters through the Connect Agent, a small deployment in the cluster that establishes an outbound mTLS tunnel to Google and makes the cluster manageable from the fleet without opening any inbound port on the AWS side. (Google calls this GKE attached clusters; it was previously branded Anthos attached clusters.)
  3. On-prem clusters run as Google Distributed Cloud (software, on VMware), which is GKE’s on-prem distribution, and register the same way through Connect.

Once registered, every member gets the same fleet-wide capabilities, turned on as fleet features rather than per-cluster bolt-ons:

The data plane — a private spine, not the public internet. The fleet’s control traffic is outbound-only mTLS to Google, but the workloads need real network reachability across the three environments. That spine is built in layers:

  1. A Shared VPC in GCP (host project owns the network; service projects attach) is the GCP-side backbone. Workload subnets for europe-west6 live here, with Private Google Access so nodes reach Google APIs over internal IPs, and Private Service Connect endpoints for consuming Google and partner services privately.
  2. Cloud Interconnect (Dedicated, 2 × 10 Gbps in two edge availability domains) to Zurich on-prem. This is the high-bandwidth, low-jitter, private link the pricing engine uses to read policyholder/claims data. Two connections in two edge availability domains of the same metro qualify for the 99.9% SLA; the 99.99% SLA requires at least four connections, two in each of two metros. Cloud Router runs BGP to exchange routes dynamically between on-prem and the VPC.
  3. HA VPN as the always-on backup to on-prem and the primary path to AWS. HA VPN is a managed VPN that carries a 99.99% SLA when tunnels are configured on both gateway interfaces. To on-prem it advertises the same prefixes as the Interconnect but with a less-preferred BGP route priority (a higher MED), so a total Interconnect failure fails over automatically with no operator action. To AWS, HA VPN peers with two AWS Site-to-Site VPN connections, one per HA VPN interface and four tunnels in total, with BGP between Cloud Router and an AWS Transit Gateway or VGW; that is the private path between the GCP VPC and the AWS VPC where EKS runs.
  4. Network Connectivity Center (NCC) is the hub that makes this a network, not a set of point links. The on-prem (via VLAN attachments) and the AWS (via the HA VPN) connections become spokes on an NCC hub, so on-prem, AWS, and GCP can route to each other transitively through Google’s backbone — rather than you hand-stitching every pair.
  5. Cloud DNS with forwarding and peering resolves names across the boundary: GCP private zones, inbound/outbound DNS forwarding to the on-prem resolver, and Cloud DNS for the mesh / GKE so a service in AWS can resolve and reach a GKE service by name.
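The hub-and-spoke layer in the list above could be sketched in Terraform roughly as follows. This is a minimal sketch under stated assumptions, not the article's actual configuration: the resource names and every `var.*` input are hypothetical, the VLAN attachments and VPN tunnels are assumed to exist already, and site-to-site data transfer is only offered in some locations, so its availability in europe-west6 should be checked before relying on it.

```hcl
# Hypothetical inputs: none of these names come from the article.
variable "network_project" { type = string }              # Shared VPC host project ID
variable "onprem_attachment_uris" { type = list(string) } # self-links of the Zurich VLAN attachments
variable "aws_vpn_tunnel_uris" { type = list(string) }    # self-links of the HA VPN tunnels to AWS

# One hub for the whole estate.
resource "google_network_connectivity_hub" "spine" {
  project     = var.network_project
  name        = "ncc-hub-helvetia"
  description = "Transitive routing between Zurich on-prem, AWS and GCP"
}

# Spoke 1: the on-prem Interconnect VLAN attachments.
resource "google_network_connectivity_spoke" "onprem" {
  project  = var.network_project
  name     = "spoke-zurich-interconnect"
  location = "europe-west6"
  hub      = google_network_connectivity_hub.spine.id

  linked_interconnect_attachments {
    uris                       = var.onprem_attachment_uris
    site_to_site_data_transfer = true # lets on-prem reach AWS through the hub
  }
}

# Spoke 2: the HA VPN tunnels to AWS.
resource "google_network_connectivity_spoke" "aws" {
  project  = var.network_project
  name     = "spoke-aws-havpn"
  location = "europe-west6"
  hub      = google_network_connectivity_hub.spine.id

  linked_vpn_tunnels {
    uris                       = var.aws_vpn_tunnel_uris
    site_to_site_data_transfer = true
  }
}
```

Keeping the hub and its spokes in the same state as the links they reference means a new environment is one more spoke resource rather than a new set of pairwise routes.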

How a real cross-cloud request flows. Trace a quote: a broker hits the customer portal on EKS (AWS). The portal needs a price, so it calls the pricing service on GKE (GCP). Name resolution comes first: hybrid Cloud DNS resolves the pricing service to its mesh endpoint. The packet then leaves the EKS pod and crosses the HA VPN tunnel (an NCC spoke) onto the GCP Shared VPC. Because both clusters are members of one fleet with Cloud Service Mesh and a shared CA, the call is an mTLS connection authenticated by SPIFFE identity (spiffe://.../ns/pricing/sa/pricing-svc), not a trust-the-IP hop. The pricing service, in turn, needs policyholder history that lives on-prem in Zurich: that call goes out over Cloud Interconnect (BGP-routed via Cloud Router), reaching the Db2-fronting service in the VMware cluster, which is also a fleet member, so even that hop is mesh-secured and policy-governed. The response retraces the path. At no point does the traffic touch the public internet, and at every hop the calling workload proves its identity. Meanwhile, entirely out-of-band, Config Sync keeps every one of those clusters running the network policies and pod-security settings the Git repo mandates, and Policy Controller continuously audits that none of them has drifted.

The diagram in words. Picture three columns — On-prem (Zurich VMware), AWS (eu-west-1, EKS), GCP (europe-west6, GKE) — each containing one or more Kubernetes clusters. Across the top, spanning all three columns, sits a single horizontal band labeled Fleet (GKE Enterprise): Config Sync ← one Git repo, Policy Controller, Cloud Service Mesh (one CA), Fleet Workload Identity, and the unified Cloud Logging/Monitoring + fleet dashboard. Dashed lines run from every cluster up into that band (the Connect Agent’s outbound mTLS). Across the bottom, spanning all three columns, sits the network spine: GCP Shared VPC in the middle, a Cloud Interconnect (solid, fat) line to the on-prem column, an HA VPN (solid) line to the AWS column and a parallel HA VPN backup line to on-prem, all converging on a Network Connectivity Center hub, with Cloud DNS forwarding threaded across. Workload-to-workload calls are horizontal arrows along the bottom spine (mesh mTLS); governance flows are vertical arrows into the top band. The two bands never cross — that separation is the architecture.

Component breakdown

| Component | GCP / service | What it does here | Key configuration choices |
|---|---|---|---|
| Fleet | GKE Enterprise fleet (fleet host project) | The single object that groups all clusters; the unit of identity, policy, and observability | One fleet per org platform; fleet scopes to segment teams/environments; members named by purpose, not location |
| GKE clusters | GKE on Google Cloud (Autopilot or Standard) | The GCP-native compute for new strategic workloads | Private clusters (no public node IPs); regional control plane for HA; Autopilot where you want zero node ops |
| AWS clusters | EKS, registered as attached members (Connect Agent) | Brings existing AWS Kubernetes under one fleet without re-platforming | Connect Agent outbound-only; no inbound port; fleet WI federated to AWS IAM via OIDC |
| On-prem clusters | Google Distributed Cloud (software / VMware) | GKE’s on-prem distribution next to the regulated data | Bundled load balancing (MetalLB/Seesaw) or F5; registered via Connect; admin + user cluster split |
| GitOps engine | Config Sync | Continuously reconciles every member against one Git repo; auto-corrects drift | Unstructured repo with per-cluster/per-scope overlays; RootSync for cluster-scoped, RepoSync for namespace teams |
| Policy guardrails | Policy Controller (managed Gatekeeper/OPA) | Admission-time enforcement + continuous audit of constraints fleet-wide | Start in dryrun/audit, then enforce; use the policy bundles (CIS, PCI, pod-security) as a baseline |
| Service mesh | Cloud Service Mesh (managed Istio) | Identity-based mTLS, traffic management, telemetry across all members | Shared CA / trust domain across clusters; mesh expansion for the VMware/EKS members; STRICT mTLS in PII namespaces |
| Identity | Fleet Workload Identity | One identity pool so any workload on any member authenticates as itself, no keys | Federate the fleet WI pool to AWS IAM (OIDC) so GKE/EKS workloads assume AWS roles keylessly |
| GCP backbone | Shared VPC (host + service projects) | The GCP-side network all workloads and links attach to | Host project owns subnets/firewall/routing; Private Google Access; PSC for private service consumption |
| On-prem link | Cloud Interconnect (Dedicated) + Cloud Router | Private, high-bandwidth, low-jitter path to Zurich | 2 connections in 2 edge availability domains of one metro for 99.9% (99.99% needs 4 connections across 2 metros); BGP via Cloud Router; VLAN attachments |
| Encrypted / backup link | HA VPN + Cloud Router | Always-on encrypted backup to on-prem; primary private path to AWS | Tunnels on both gateway interfaces; same prefixes as Interconnect at a less-preferred route priority (higher MED) for auto-failover |
| Multi-cloud hub | Network Connectivity Center | Ties on-prem, AWS, and GCP into one transitive routing domain | On-prem + AWS connections as spokes; lets the three environments route through Google’s backbone |
| Hybrid DNS | Cloud DNS (private zones + forwarding) | Cross-boundary name resolution for service discovery | Inbound/outbound forwarding to the on-prem resolver; private zones peered to the Shared VPC |
| Observability | Cloud Logging + Monitoring (fleet) + Managed Prometheus | One pane of glass across all members; SLOs and dashboards | GKE Enterprise dashboards; Managed Service for Prometheus scrapes every member; alerting on policy/sync drift |

A handful of these choices carry the design and deserve the why, not just the what.

The fleet is the unit of management — not the cluster, not the project. The single most important mental shift is to stop operating clusters and start operating a fleet. Once a cluster is a member, you turn on Config Sync, Policy Controller, and the mesh as fleet-default configurations, so a newly registered cluster inherits the entire governance posture automatically — it joins the fleet already compliant rather than being configured into compliance afterward. Fleet scopes then let you carve the fleet into bounded sets (e.g. team-pricing, env-prod) so a team’s namespaces and the services they can see in the mesh are governed together across whatever clusters they span. The payoff is that “onboard the thirty-first cluster” and “onboard the third” are the same one-step operation, and the auditor’s “is X enforced everywhere” becomes a single dashboard, because everywhere is now a defined, enumerable thing.
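To make the fleet-scope idea concrete, here is a hedged Terraform sketch of one scope, one cluster bound into it, and one fleet namespace. The scope name `team-pricing` comes from the paragraph above; the membership ID variable, the binding ID, and the `global` membership location are assumptions for illustration (a regionally registered membership would use its region instead).

```hcl
# Hypothetical inputs for illustration.
variable "fleet_host_project" { type = string }
variable "gke_membership_id" { type = string } # ID of an already-registered fleet membership

# A scope is a named slice of the fleet that a team is governed through.
resource "google_gke_hub_scope" "team_pricing" {
  project  = var.fleet_host_project
  scope_id = "team-pricing"
}

# Bind a member cluster into the scope; repeat per cluster the team spans.
resource "google_gke_hub_membership_binding" "pricing_on_gke" {
  project               = var.fleet_host_project
  membership_binding_id = "pricing-on-gke"
  scope                 = google_gke_hub_scope.team_pricing.name
  membership_id         = var.gke_membership_id
  location              = "global" # assumption: the membership is registered globally
}

# A fleet namespace is created on every cluster bound to the scope.
resource "google_gke_hub_namespace" "pricing" {
  project            = var.fleet_host_project
  scope_namespace_id = "pricing"
  scope_id           = google_gke_hub_scope.team_pricing.scope_id
  scope              = google_gke_hub_scope.team_pricing.name
}
```

Adding a cluster to the team's footprint is then one more membership binding, and the namespace follows it.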

Config Sync + Policy Controller is how “the standard” stops living in a wiki. The reason every cluster drifted is that the standard was documentation, not code — and documentation does not reconcile. Config Sync makes the Git repo the desired state: it continuously applies it and reverts anything that drifts, so a hand-edit on a cluster is undone within minutes. Policy Controller is the complementary half — it both blocks non-compliant resources at admission and continuously audits what already exists, reporting violations to the fleet dashboard. Together they convert governance from a periodic human review into a closed control loop: desired state declared once in Git, enforced and verified everywhere, with the dashboard as the live compliance report. This is the mechanism that makes the security team’s mandate (“prove the control is enforced”) a query rather than a quarter.

Cloud Service Mesh with a shared CA is what makes “trust the IP” go away. The reason cross-cloud calls today are secured by IP range is that there was no common identity across the three Kubernetes worlds. Cloud Service Mesh issues every workload a SPIFFE identity from one shared root of trust spanning all members, so a service on EKS and a service on GKE authenticate to each other by identity (spiffe://trust-domain/ns/.../sa/...) over mTLS, regardless of cloud. Setting STRICT mutual TLS in the PII-handling namespaces means a call without a valid workload certificate is refused — which is the actual, enforceable version of the control the auditor was asking about. The mesh also gives you uniform traffic management (retries, canaries, locality-aware routing) and golden-signal telemetry across the fleet, but the identity story is the load-bearing one for a regulated multi-cloud estate.

Interconnect for bandwidth and jitter; HA VPN for resilience and AWS reach. They are not interchangeable. The on-prem path carries chatty, latency-sensitive reads of policyholder and claims data, so it wants Dedicated Interconnect: private, predictable single-digit-millisecond latency, and 10–100 Gbps. But Interconnect has weeks of lead time, no inherent encryption, and a single circuit is not redundant. So HA VPN is the day-one bring-up, the permanent encrypted backup, and the primary private path to AWS, where you do not have a colo cross-connect at all. As the backup it advertises the same prefixes at a less-preferred route priority (Cloud Router expresses this as a higher MED, not as local-preference), so failover is automatic once the Interconnect BGP sessions drop, typically within a minute. Letting Cloud Router run BGP on every link is what makes all of this dynamic: routes propagate, failover is a routing decision rather than a ticket, and Network Connectivity Center turns the collection of links into a single transitive hub so on-prem and AWS reach each other through Google’s backbone instead of via a separate, hand-built mesh.
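The AWS leg of that design might look like the following in Terraform. Treat it as a sketch under assumptions: the variable names, resource names, and the ASN 65001 are hypothetical, the AWS side (two Site-to-Site VPN connections, giving four tunnel outside IPs) is assumed to be created separately, and the Cloud Router interfaces and BGP peers for each tunnel are omitted for brevity.

```hcl
# Hypothetical inputs; the AWS side is assumed to exist already.
variable "network_id" { type = string }                   # Shared VPC network self-link
variable "aws_tunnel_outside_ips" { type = list(string) } # four AWS tunnel outside IPs, in order
variable "aws_tunnel_psks" {
  type      = list(string) # one pre-shared key per tunnel, same order as the IPs
  sensitive = true
}

resource "google_compute_ha_vpn_gateway" "to_aws" {
  name    = "havpn-aws-euw6"
  region  = "europe-west6"
  network = var.network_id
}

# AWS is modelled as an external gateway with four interfaces (one per AWS tunnel).
resource "google_compute_external_vpn_gateway" "aws" {
  name            = "aws-vpn-euw1"
  redundancy_type = "FOUR_IPS_REDUNDANCY"

  dynamic "interface" {
    for_each = var.aws_tunnel_outside_ips
    content {
      id         = interface.key
      ip_address = interface.value
    }
  }
}

resource "google_compute_router" "aws" {
  name    = "cr-aws-euw6"
  region  = "europe-west6"
  network = var.network_id
  bgp {
    asn = 65001 # assumption: any private ASN that differs from the AWS-side ASN
  }
}

# Tunnels 0-1 use HA VPN interface 0, tunnels 2-3 use interface 1.
resource "google_compute_vpn_tunnel" "aws" {
  count                           = 4
  name                            = "tun-aws-${count.index}"
  region                          = "europe-west6"
  vpn_gateway                     = google_compute_ha_vpn_gateway.to_aws.id
  vpn_gateway_interface           = floor(count.index / 2)
  peer_external_gateway           = google_compute_external_vpn_gateway.aws.id
  peer_external_gateway_interface = count.index
  shared_secret                   = var.aws_tunnel_psks[count.index]
  router                          = google_compute_router.aws.id
  ike_version                     = 2
}
```

Spreading the tunnels across both HA VPN interfaces is what the availability SLA depends on; a single interface is a single point of failure regardless of tunnel count.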

Implementation guidance

Stand up the fleet and its host project first (the management plane before the workloads).
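As a starting point, the host project needs the fleet-related APIs switched on before any membership or feature can be created. The sketch below is an assumption about which APIs the features in this article rely on; the variable name is hypothetical and the list should be checked against each feature's prerequisites.

```hcl
variable "fleet_host_project" { type = string } # hypothetical input

locals {
  # Assumed set of APIs for fleet, Connect, Config Sync, Policy Controller,
  # Cloud Service Mesh and attached clusters.
  fleet_apis = [
    "gkehub.googleapis.com",
    "gkeconnect.googleapis.com",
    "connectgateway.googleapis.com",
    "anthos.googleapis.com",
    "anthosconfigmanagement.googleapis.com",
    "anthospolicycontroller.googleapis.com",
    "mesh.googleapis.com",
    "gkemulticloud.googleapis.com",
  ]
}

resource "google_project_service" "fleet" {
  for_each           = toset(local.fleet_apis)
  project            = var.fleet_host_project
  service            = each.value
  disable_on_destroy = false # leave APIs enabled if this resource is removed from state
}
```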

GitOps repository layout (Config Sync) — get this shape right or you will fight overlays forever.

Identity wiring — fleet Workload Identity end to end, no static keys (this is the single biggest incident-class you design out).
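One keyless binding, sketched in Terraform: a Kubernetes service account in the fleet's workload identity pool is allowed to impersonate a Google service account, so no key file is ever issued. The namespace and service-account name `pricing/pricing-svc` follow the SPIFFE example earlier in the article; the variable name and the Google service account are hypothetical. Workloads on attached or on-prem members also need their credential configuration set up, which is not shown here.

```hcl
variable "fleet_host_project" { type = string } # hypothetical input

# Google service account the pricing workload acts as when calling Google APIs.
resource "google_service_account" "pricing" {
  project      = var.fleet_host_project
  account_id   = "pricing-svc"
  display_name = "Pricing service (fleet workloads)"
}

# Allow the Kubernetes service account pricing/pricing-svc, on any fleet member,
# to impersonate it. The fleet workload identity pool is <project>.svc.id.goog.
resource "google_service_account_iam_member" "pricing_workload_identity" {
  service_account_id = google_service_account.pricing.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.fleet_host_project}.svc.id.goog[pricing/pricing-svc]"
}
```

Because the member string names the pool rather than a cluster, the same binding covers the workload wherever in the fleet it runs.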

Networking and DNS wiring (the data plane).
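For the DNS half, a hedged sketch of the two directions of hybrid resolution: an outbound forwarding zone that sends an on-prem domain to the Zurich resolvers, and an inbound policy that lets on-prem resolvers query Cloud DNS. The domain `corp.helvetia.example.`, the resource names, and the variables are placeholders, not values from the article.

```hcl
# Hypothetical inputs.
variable "network_id" { type = string }                # Shared VPC network self-link
variable "onprem_resolver_ips" { type = list(string) } # on-prem DNS resolver addresses

# Outbound: queries for the on-prem domain are forwarded to the on-prem resolvers.
resource "google_dns_managed_zone" "onprem_forward" {
  name       = "fwd-corp-helvetia"
  dns_name   = "corp.helvetia.example." # placeholder domain
  visibility = "private"

  private_visibility_config {
    networks { network_url = var.network_id }
  }

  forwarding_config {
    dynamic "target_name_servers" {
      for_each = var.onprem_resolver_ips
      content { ipv4_address = target_name_servers.value }
    }
  }
}

# Inbound: gives on-prem resolvers an entry point into Cloud DNS private zones.
resource "google_dns_policy" "inbound" {
  name                      = "inbound-from-onprem"
  enable_inbound_forwarding = true
  networks { network_url = var.network_id }
}
```

The on-prem resolver then needs a conditional forwarder for the GCP private zones pointing at the inbound forwarder addresses Cloud DNS allocates in the VPC.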

Mesh expansion across clouds. Bring the EKS and VMware members into Cloud Service Mesh with a shared trust domain so identities are comparable everywhere. Set mutual TLS to STRICT in the PII-handling namespaces (policyholder, claims, pricing) so any non-mTLS call is rejected — the enforceable form of the audit control. Use the mesh’s locality-aware routing so a service prefers an in-region/in-cloud backend and only spills cross-cloud on failure, which keeps the expensive cross-cloud hops rare.
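Enabling the mesh as a fleet default can be sketched as below. This assumes the `servicemesh` fleet feature with automatic management; the variable name is hypothetical. Automatic management applies to GKE clusters on Google Cloud, while the EKS and VMware members typically need the in-cluster control plane installed through their own procedure, so this block alone does not complete mesh expansion.

```hcl
variable "fleet_host_project" { type = string } # hypothetical input

# Fleet-level Cloud Service Mesh feature; new GKE members inherit the default.
resource "google_gke_hub_feature" "servicemesh" {
  name     = "servicemesh"
  project  = var.fleet_host_project
  location = "global"

  fleet_default_member_config {
    mesh {
      management = "MANAGEMENT_AUTOMATIC"
    }
  }
}
```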

A few IaC snippets (Terraform) capturing the wiring people most often get wrong. First, registering an EKS cluster as a fleet member and turning on fleet features as defaults — the part teams do per-cluster by hand and then drift:

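# The fleet lives in the host project.
resource "google_gke_hub_fleet" "platform" {
  project      = var.fleet_host_project
  display_name = "helvetia-platform"
}

# Register the existing EKS cluster as an ATTACHED cluster. Attaching installs the
# Connect Agent (outbound-only) and creates the fleet membership in one step.
# var.attached_cluster_location, var.attached_platform_version and
# var.fleet_host_project_number are inputs you must supply for your environment.
resource "google_container_attached_cluster" "eks_quoting" {
  name             = "eks-quoting-euw1"
  project          = var.fleet_host_project
  location         = var.attached_cluster_location # a Google Cloud region that supports attached clusters
  distribution     = "eks"
  platform_version = var.attached_platform_version # must be valid for the cluster's Kubernetes version

  # OIDC issuer of the EKS cluster: registration is keyless.
  oidc_config {
    issuer_url = "https://oidc.eks.eu-west-1.amazonaws.com/id/${var.eks_oidc_id}"
  }

  fleet {
    project = "projects/${var.fleet_host_project_number}"
  }
}

# Turn Config Sync ON for the whole fleet as a DEFAULT,
# so any future member inherits the governance posture automatically.
resource "google_gke_hub_feature" "configmanagement" {
  name     = "configmanagement"
  project  = var.fleet_host_project
  location = "global"
  fleet_default_member_config {
    configmanagement {
      config_sync {
        source_format = "unstructured"
        git {
          sync_repo   = "https://github.com/helvetia/fleet-config"
          sync_branch = "main"
          policy_dir  = "fleet"
          secret_type = "token" # or gcpserviceaccount; never a static key in the manifest
        }
      }
    }
  }
}

# Policy Controller is its own fleet feature, with its own fleet default.
resource "google_gke_hub_feature" "policycontroller" {
  name     = "policycontroller"
  project  = var.fleet_host_project
  location = "global"
  fleet_default_member_config {
    policycontroller {
      policy_controller_hub_config {
        install_spec           = "INSTALL_SPEC_ENABLED"
        audit_interval_seconds = 60
      }
    }
  }
}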

Second, the redundant on-prem link — two Interconnect VLAN attachments plus an HA-VPN backup advertising the same routes at a lower priority, which is what makes failover automatic:

resource "google_compute_router" "onprem" {
  name    = "cr-onprem-euw6"
  region  = "europe-west6"
  network = google_compute_network.shared_vpc.id
  bgp {
    asn = 64512 # private ASN for the Google side of the BGP sessions
  }
}

# Interconnect VLAN attachment (one of two; the second is in a different edge domain).
resource "google_compute_interconnect_attachment" "vlan_a" {
  name              = "vlan-zurich-a"
  region            = "europe-west6"
  type              = "DEDICATED"
  router            = google_compute_router.onprem.id
  interconnect      = google_compute_interconnect.dedicated_a.id
  vlan_tag8021q     = 401
  candidate_subnets = ["169.254.10.0/29"]
}

# HA VPN tunnel to on-prem: same prefixes, but advertised at a WORSE priority,
# so Interconnect wins and VPN takes over only on failure.
# This sets the MED on routes Google advertises to on-prem; the on-prem router
# must apply the matching preference to the routes it advertises back.
resource "google_compute_router_peer" "vpn_backup_peer" {
  name                      = "onprem-vpn-backup"
  router                    = google_compute_router.onprem.name
  region                    = "europe-west6"
  interface                 = google_compute_router_interface.vpn_if.name
  peer_ip_address           = var.onprem_vpn_peer_ip
  peer_asn                  = 65010
  advertised_route_priority = 200 # higher number = less preferred than Interconnect (default 100)
}

Third, a Policy Controller constraint (delivered through Config Sync) that bans public LoadBalancers in prod and is rolled out safely in audit mode first:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sNoExternalServices          # from the Policy Controller constraint template library; verify the name against your installed bundle
metadata:
  name: no-public-lb-in-prod
spec:
  enforcementAction: dryrun          # start in audit; flip to "deny" after reviewing the dashboard
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Service"]
    namespaceSelector:               # only namespaces labelled env=prod
      matchExpressions:
        - { key: env, operator: In, values: ["prod"] }

And the mesh PeerAuthentication that turns “trust the IP” into “prove your identity” for PII namespaces:

apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: strict-mtls
  namespace: pricing                 # PII-handling namespace
spec:
  mtls:
    mode: STRICT                     # any non-mTLS call is refused, on every member

Deployment and rollout. Stand up the fleet and the network spine first, then register clusters one at a time, each time confirming Config Sync reconciles green and the mesh sidecars come up before moving on. Roll out every Policy Controller constraint in dryrun first, read the violation report, remediate, then enforce — never enforce blind. Application workloads deploy through normal CI to Artifact Registry, with the promotion across clusters expressed as Git changes to the Config Sync overlays, so “ship to the next environment” is a reviewed pull request, not a kubectl against a member. The whole platform — fleet, memberships, features, VPC, Interconnect, HA VPN, NCC, DNS, IAM — is one Terraform state per environment, promoted dev → staging → prod with the same code.

Enterprise considerations

Security and Zero Trust. This architecture is a Zero Trust posture for a multi-cloud Kubernetes estate, expressed in three reinforcing layers. At the identity layer, every workload on every cluster has a SPIFFE identity from one shared CA, and STRICT mTLS in sensitive namespaces means service-to-service calls are authenticated and encrypted by identity. There is no implicit trust by network location, which is the whole point. Fleet Workload Identity (federated to AWS IAM) means no static keys anywhere, on any cloud, which eliminates the most common breach vector (a leaked key) by construction and closes the door that the estate’s history of leaked DB credentials left open. At the policy layer, Policy Controller enforces guardrails (no public LBs, approved registries only, mandatory NetworkPolicies, pod-security baselines) at admission and continuously audits for drift, while Config Sync guarantees the declared security config is actually present on every member and reverts tampering. At the network layer, traffic rides Interconnect/HA VPN over Google’s private backbone rather than the public internet, Private Google Access / PSC keep API calls off the internet, and the GKE clusters are private (no public node IPs). Identity, policy, and network controls compound rather than substitute.

Cost optimization. The headline lever is that fleet management replaces N operational stacks with one (one monitoring stack, one policy pipeline, one identity model, one access path), so the marginal operational cost of the thirty-first cluster is close to zero. Operations, not infrastructure, is the real multi-cloud expense. On the network, be honest about the trade-offs: Cloud Interconnect is a fixed monthly port charge plus low egress, justified only by the bandwidth and jitter the on-prem data path actually needs. Right-size the port (start at 10 Gbps, not 100) and remember HA VPN is far cheaper for links whose volume does not justify a circuit (the AWS path often starts as VPN-only). Cross-cloud egress is the silent budget killer: every GKE→AWS or GKE→on-prem call may incur egress, so the mesh’s locality-aware routing (prefer in-cloud backends, spill cross-cloud only on failure) is a cost control as much as a latency one. Keep chatty service pairs co-located and only span the boundary when the data demands it. GKE Autopilot for new clusters bills per pod resource request (no paying for idle node headroom), and GKE Enterprise’s per-vCPU fleet pricing means you pay for governed capacity, not per-cluster licenses. Tag/label every cluster and workload by owning team to attribute both compute and egress.

Scalability. A single fleet governs hundreds of clusters and thousands of nodes across clouds; onboarding scales linearly because registration + fleet-default features is one operation regardless of count. Config Sync reconciles every member independently and in parallel, so the governance loop does not slow as the fleet grows. The mesh scales horizontally per cluster, and fleet scopes keep the blast radius and the cognitive load bounded as teams multiply. On the network, Cloud Router and NCC carry large route tables and many spokes, and Interconnect scales to 100 Gbps (or multiple circuits with ECMP) when the data-center data path grows.

Reliability and DR (RTO/RPO). Resilience is layered. The network path is the clearest story: two Interconnect connections in two edge availability domains survive an edge failure (the 99.9% SLA topology; 99.99% needs four connections across two metros), and HA VPN advertising the same prefixes at a less-preferred BGP priority fails the on-prem path over automatically with no operator action. The effective network RTO is bounded by how quickly the failed BGP sessions are detected: under a minute with default timers, and faster if BFD is enabled on the Interconnect sessions. Regional GKE control planes and multi-zone node pools survive a zone loss. For workload DR, the fleet makes active-active or warm-standby across clouds natural: the same Config Sync repo deploys the pricing service to both the GKE and (if desired) the on-prem cluster, and the mesh’s locality routing fails traffic over to the surviving member, so a regional GCP outage degrades to serving from on-prem/AWS rather than going dark. RTO/RPO for stateful data is then governed by the data tier riding over this fabric (Db2/Cloud SQL replication, BigQuery), not by the platform itself. Model the cases explicitly (zone loss, region loss, full Interconnect loss, and a whole-cloud loss), each in a runbook.

Observability. GKE Enterprise gives one pane of glass across every member: the fleet dashboard shows cluster health, Config Sync status, and Policy Controller violations for on-prem, AWS, and GCP clusters together. Cloud Logging and Monitoring aggregate logs/metrics fleet-wide, Managed Service for Prometheus scrapes every member with no per-cluster Prometheus to operate, and Cloud Service Mesh emits golden-signal telemetry (latency, traffic, errors, saturation) and a service topology spanning clouds. The two governance vital signs to alert on are Config Sync drift/sync errors (a member that stopped reconciling is a member that is silently diverging) and Policy Controller violation count (anything in violation is a control gap to close) — plus the network’s BGP session state on every Interconnect/VPN link.
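The BGP vital sign can be turned into an alert along these lines. This is a sketch under assumptions: it relies on the Cloud Router metric `router.googleapis.com/bgp/session_up` on the `gce_router` resource type, and the project variable, notification channels, and the two-minute duration are illustrative choices rather than recommendations from the article.

```hcl
# Hypothetical inputs.
variable "network_project" { type = string }
variable "notification_channel_ids" { type = list(string) }

resource "google_monitoring_alert_policy" "bgp_session_down" {
  project      = var.network_project
  display_name = "BGP session down on a hybrid link"
  combiner     = "OR"

  conditions {
    display_name = "Cloud Router BGP session not up"
    condition_threshold {
      # Fires when any BGP session reports below 1 (up) for the whole duration.
      filter          = "resource.type = \"gce_router\" AND metric.type = \"router.googleapis.com/bgp/session_up\""
      comparison      = "COMPARISON_LT"
      threshold_value = 1
      duration        = "120s"
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MIN" # any dip within the minute counts as down
      }
    }
  }

  notification_channels = var.notification_channel_ids
}
```

Because the backup path takes over silently, this alert is often the only signal that the estate is running without redundancy.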

Governance. A clean resource hierarchy (org → folders for environments/business units → the fleet host project + service projects), with Organization Policy constraints enforced top-down (disable service-account key creation, restrict regions, domain-restricted sharing). The fleet is the governance object: scopes bound team access, Config Sync makes the Git repo the single auditable source of cluster state, and Policy Controller’s continuous audit is the live compliance report — so “show me the control across the estate” is a dashboard, and “prove it has been enforced over time” is the Git history plus the audit logs. Audit Logs (Admin Activity always on; Data Access on the sensitive services) flow to a logs bucket / BigQuery sink for retention and SIEM, and the Connect gateway means even on-prem and EKS kubectl access is logged through Google’s audit trail.
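Two of the governance controls named above, sketched in Terraform: the organization policy that disables service-account key creation, and an organization-level sink that routes audit logs to BigQuery. The variable names and sink name are hypothetical, and the sink's writer identity still needs write access on the destination dataset, which is not shown.

```hcl
# Hypothetical inputs.
variable "org_id" { type = string }
variable "audit_dataset" { type = string } # e.g. "bigquery.googleapis.com/projects/<project>/datasets/<dataset>"

# Top-down guardrail: nobody in the org can mint service-account keys.
resource "google_org_policy_policy" "no_sa_keys" {
  name   = "organizations/${var.org_id}/policies/iam.disableServiceAccountKeyCreation"
  parent = "organizations/${var.org_id}"

  spec {
    rules { enforce = "TRUE" }
  }
}

# Route audit logs from every project under the org to one BigQuery dataset.
resource "google_logging_organization_sink" "audit" {
  name             = "audit-to-bigquery"
  org_id           = var.org_id
  destination      = var.audit_dataset
  include_children = true
  filter           = "logName:\"cloudaudit.googleapis.com\""
}
```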

Reference enterprise example

Helvetia Risk ran the program over four quarters. Concrete decisions and numbers:

Estate at the start. 3 Kubernetes worlds, 22 clusters total: 6 on-prem (VMware, kubeadm), 9 on EKS (the acquired broker stack, eu-west-1), 7 on GKE (the new pricing/AI build-out, europe-west6). 3 RBAC models, 3 secret stores, 3 CI/CD pipelines, 3 monitoring stacks. The trigger was a regulator audit that asked Helvetia to demonstrate mTLS enforcement and consistent network policy for all PII-handling services across the estate — and the honest answer was “we can’t, in under two weeks, with confidence.”

What they built.

Decisions that paid off.

A failure they handled gracefully. A platform engineer opened a pull request that set a new “no :latest image tags” Policy Controller constraint straight to enforce, without running it in dryrun first; once merged, it would have blocked a legitimate deploy on the on-prem cluster. Because the constraint was delivered through Config Sync as a reviewed pull request, the pre-merge check (which lints constraints and runs them against a sample) flagged the impending violations, and the change was held and re-rolled in dryrun mode. The violation report showed the affected workloads, those were fixed, and only then was the constraint enforced. The guardrail process caught the guardrail mistake; nothing in production broke.

Outcome. Steady-state GCP spend ~₹19 lakh/month (fleet/GKE Enterprise + Interconnect + the GCP workloads), with the operational headcount to run the estate flat despite the cluster count, because there is now one stack to operate instead of three. The audit finding closed on the strength of the dashboard. Cross-cloud service calls are identity-secured over a private, redundant spine; on-prem failover is automatic and proven; there are no static keys to leak; and onboarding a new cluster — anywhere — is a day, not a quarter. Helvetia now treats “multi-cloud” as an operating model they govern, rather than an accident they survive.

When to use it

Use this architecture when:

Be honest about the trade-offs:

Anti-patterns to avoid (each is a real incident waiting to happen):

Alternatives worth weighing: for a single cloud, plain GKE + Config Sync + Policy Controller on one project gives you the GitOps/policy benefits without the multi-cloud machinery — graduate to the full fleet when a second governed environment appears. For multi-cloud governance without Google’s fleet, a self-managed Argo CD / Flux + OPA Gatekeeper + Istio (multi-primary) stack across clusters is the DIY equivalent — more control and portability, but you now operate the reconciliation, the policy engine, the mesh CA, and the multi-cluster wiring that GKE Enterprise manages for you (and you lose the single fleet dashboard and Connect gateway). For connectivity specifically, if you have no on-prem at all and only span clouds, you can drop Interconnect and run HA VPN + Network Connectivity Center alone; conversely, very high-bandwidth data-center estates may run Cross-Cloud Interconnect to land a private circuit directly between Google and AWS rather than over VPN. And for the on-prem distribution, Google Distributed Cloud (software) assumes you run the cluster; the GDC connected/air-gapped appliances suit edge or sovereignty cases where Google ships the hardware. The shape — one fleet for management, a private redundant spine for the data path, shared identity and policy across every member — stays the same across all of these. Only the connectivity primitive and the on-prem footprint change.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
