The hardest question in any multi-cloud strategy is not “how do I run a container on three clouds” — Kubernetes solved that years ago — it is “how do I operate, secure, and reason about thirty clusters spread across two data centers, AWS, and GCP as one thing, without thirty separate audit stories, thirty RBAC models, and thirty CI/CD snowflakes.” The moment you have more than a couple of clusters, the cluster stops being the unit of management and becomes a liability: every new one multiplies the policy drift, the patching surface, the “wait, which cluster is that service even on” confusion. The discipline that turns a sprawl of clusters into a governable platform is fleet management — treating a set of clusters as a single object with one identity model, one policy source, one service-mesh control plane, and one place to ask “is everything compliant right now.” On Google Cloud the product that does this is GKE Enterprise (the capability formerly and still widely branded Anthos): a fleet abstraction that spans GKE on Google Cloud, GKE on AWS/Azure, and Google Distributed Cloud (GDC) on-prem (VMware or bare metal), with Config Sync for GitOps policy, Policy Controller for guardrails, Cloud Service Mesh for identity-based traffic, and fleet Workload Identity so every workload on every cluster proves who it is the same way.
That fleet only means something if the clusters can actually reach each other and reach Google’s APIs over a private, reliable network — which is the other half of this article. A multi-cloud control plane riding over flaky public-internet tunnels is a demo, not a platform. So the connectivity spine here is Cloud Interconnect (Dedicated or Partner) for the high-bandwidth, low-jitter link to the data center, HA VPN as the encrypted day-one and permanent-backup path (and the primary link into AWS), Network Connectivity Center as the hub that ties the spokes together, and Cloud DNS for the hybrid name resolution that makes service discovery work across the boundary. The thesis of the article is that these two halves — the fleet (control) and the connectivity (data path) — are a single architecture: you cannot govern what you cannot reach, and reaching without governing is how you end up with the very sprawl you were trying to escape.
The business scenario
Helvetia Risk is a fictional but representative mid-market specialty insurer: roughly 3,400 employees, writing commercial property and marine cover, with around ₹4,100 crore (~USD 490M) in annual gross written premium. Their estate is the textbook “we ended up multi-cloud by accident, and now it’s strategy” shape that lands on a platform team three years into a cloud journey:
- A regulated core that cannot leave the building (yet). The policy-administration and claims systems run on VMware on-premises in a Zurich data center, next to a mainframe-adjacent actuarial system and a Db2 database the regulator’s audit cycle is built around. A full exit is a multi-year program; in the meantime new digital workloads must run next to this data, not 200 ms away across the public internet.
- A born-in-AWS acquisition. Eighteen months ago they bought a digital-broker startup whose entire stack — quoting APIs, a customer portal, a pricing engine — runs on EKS in AWS (eu-west-1). Re-platforming it onto GCP wholesale is not on the table; the team, the tooling, and the data-residency commitments to brokers are all anchored to AWS.
- A new GCP-native bet. The strategic build-out (a real-time pricing and fraud-scoring platform, plus a data/AI layer on BigQuery and Vertex AI) is going onto GKE on Google Cloud in europe-west6 (Zurich), chosen for the Swiss region and the data/AI services.
- Three Kubernetes worlds, zero shared anything. On-prem runs a hand-rolled kubeadm cluster; AWS runs EKS with IAM-based RBAC and AWS Load Balancer Controller; GCP runs GKE. Three RBAC models, three secret stores, three ingress patterns, three CI/CD pipelines, three monitoring stacks. A single auditor question (“show me that mTLS is enforced for all PII-handling services across the estate”) takes two weeks and three teams to answer, and the answer is “mostly.”
- Policy drift is a standing finding. Each cluster’s network policies, pod-security settings, and allowed registries diverged the moment they were created. The security team’s “standard” exists in a wiki, not in the clusters. Every audit surfaces a cluster that quietly fell out of compliance.
- Latency and integration pain. The new GCP pricing engine needs the policyholder and claims data that lives on-prem in Zurich; the AWS quoting API needs to call the GCP fraud-scoring service. Today both cross the public internet with one-off VPN tunnels, variable latency, and no consistent service identity — so the security model is “trust the IP range,” which the security team rightly hates.
The mandate is therefore not “migrate to GCP.” It is: operate all three Kubernetes environments as one fleet with one policy source, one identity model, and one compliance dashboard; connect Zurich on-prem to GCP over a private, high-bandwidth, redundant link; bring the AWS workloads under the same governance and onto the same private network as the GCP and on-prem ones; give services a consistent, identity-based way to call each other across all three clouds with mTLS; and make “prove this control is enforced everywhere” a dashboard query, not a project. Every one of those maps onto a specific GKE Enterprise or connectivity primitive — that is what makes this a hybrid/multi-cloud architecture and not just “we have clusters in three places.”
Crucially the shape scales both ways. A 15-person team with one on-prem cluster and one GKE cluster deploys the identical pattern — a single fleet, Config Sync pointing at one Git repo, HA VPN to the data center, Cloud Service Mesh across two members — for a modest monthly spend, and grows into the Interconnect-backed, multi-cloud, dozens-of-clusters version without redrawing the diagram or relearning the model. That is what makes it a reference architecture rather than a megacorp special.
Architecture overview
The architecture has two planes that are easy to conflate and must be kept distinct: the management plane (the fleet — how clusters are registered, governed, and observed as one) and the data plane (the network — how packets and service calls actually traverse on-prem, AWS, and GCP). The fleet is “Google Cloud’s view of all your clusters”; the network is “how those clusters and Google’s APIs are privately reachable.” You design them together but you reason about them separately.
The management plane — one fleet, many members. Everything hangs off a single fleet (a fleet is scoped to one GCP project, the fleet host project). Into that fleet you register each cluster as a member, regardless of where it runs:
- GKE on Google Cloud clusters in europe-west6 register natively.
- EKS clusters in AWS register as attached clusters through the Connect Agent, a small deployment in the cluster that establishes an outbound mTLS tunnel to Google and makes the cluster manageable from the fleet without opening any inbound port on the AWS side. (This is the attached-clusters registration path, formerly branded Anthos attached clusters.)
- On-prem clusters run as Google Distributed Cloud (software, on VMware), which is GKE’s on-prem distribution, and register the same way through Connect.
Once registered, every member gets the same fleet-wide capabilities, turned on as fleet features rather than per-cluster bolt-ons:
- Config Sync continuously reconciles each cluster against a Git repository (the GitOps source of truth) — namespaces, RBAC, network policies, quotas, the lot. The repo is the desired state; drift is auto-corrected. One repo (with per-cluster and per-scope overlays) governs all thirty clusters.
- Policy Controller (Google’s managed OPA Gatekeeper) enforces guardrails as admission policy — “no public LoadBalancer in prod,” “images only from our Artifact Registry,” “every namespace has a NetworkPolicy” — and continuously audits existing resources, surfacing violations on the fleet dashboard.
- Cloud Service Mesh (Google’s managed Istio) gives every member a service mesh with a shared root of trust, so a service on GKE and a service on EKS get SPIFFE identities from the same CA and can do mTLS to each other by service identity, not IP.
- Fleet Workload Identity establishes a common identity pool so a workload on any member authenticates to Google APIs (and, via federation, to AWS) as itself, with no static keys.
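As one illustration of the guardrails listed above, the "images only from our Artifact Registry" rule could be expressed as a constraint against the K8sAllowedRepos template from the Gatekeeper / Policy Controller library. This is a minimal sketch: the registry path and the env=prod namespace label are assumptions for illustration, not values taken from the Helvetia estate.

```yaml
# Sketch only: restrict Pods in prod namespaces to one Artifact Registry path.
# The repository prefix below is a hypothetical placeholder.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-registries-prod
spec:
  enforcementAction: dryrun   # audit first; switch to "deny" once the report is clean
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaceSelector:
      matchExpressions:
        - { key: env, operator: In, values: ["prod"] }
  parameters:
    repos:
      # The check is a string-prefix match, so keep the trailing slash.
      - "europe-west6-docker.pkg.dev/helvetia-apps/"
```

Because the object is plain Kubernetes YAML, it can live in the same Git repository as everything else Config Sync delivers.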
The data plane — a private spine, not the public internet. The fleet’s control traffic is outbound-only mTLS to Google, but the workloads need real network reachability across the three environments. That spine is built in layers:
- A Shared VPC in GCP (host project owns the network; service projects attach) is the GCP-side backbone. Workload subnets for europe-west6 live here, with Private Google Access so nodes reach Google APIs over internal IPs, and Private Service Connect endpoints for consuming Google and partner services privately.
- Cloud Interconnect (Dedicated, 2 × 10 Gbps in two edge availability domains) to Zurich on-prem. This is the high-bandwidth, low-jitter, private link the pricing engine uses to read policyholder/claims data. Two connections in two edge availability domains of the same metro qualify for the 99.9% SLA; the 99.99% SLA requires at least four connections, two in each of two metros. Cloud Router runs BGP to exchange routes dynamically between on-prem and the VPC.
- HA VPN as the always-on backup to on-prem and the primary path to AWS. HA VPN is a managed VPN with a 99.99% SLA when tunnels are configured on both gateway interfaces. To on-prem it advertises the same prefixes as the Interconnect but at a less-preferred BGP priority (a higher MED), so a total Interconnect failure fails over automatically with no operator action. To AWS, an HA VPN gateway paired with AWS Site-to-Site VPN (four tunnels across two AWS VPN connections, BGP via Cloud Router and an AWS Transit Gateway or virtual private gateway) is the private path between the GCP VPC and the AWS VPC where EKS runs.
- Network Connectivity Center (NCC) is the hub that makes this a network, not a set of point links. The on-prem (via VLAN attachments) and the AWS (via the HA VPN) connections become spokes on an NCC hub, so on-prem, AWS, and GCP can route to each other transitively through Google’s backbone — rather than you hand-stitching every pair.
- Cloud DNS with forwarding and peering resolves names across the boundary: GCP private zones, inbound/outbound DNS forwarding to the on-prem resolver, and Cloud DNS for the mesh / GKE so a service in AWS can resolve and reach a GKE service by name.
How a real cross-cloud request flows. Trace a quote: a broker hits the customer portal on EKS (AWS). The portal needs a price, so it calls the pricing service on GKE (GCP). Because both clusters are members of one fleet with Cloud Service Mesh and a shared CA, the call is an mTLS connection authenticated by SPIFFE identity (spiffe://.../ns/pricing/sa/pricing-svc), not a trust-the-IP hop. The packet leaves the EKS pod, traverses the HA VPN tunnel (or the NCC spine) onto the GCP Shared VPC, and Cloud DNS resolves the pricing service to its mesh endpoint. The pricing service, in turn, needs policyholder history that lives on-prem in Zurich: that call goes out over Cloud Interconnect (BGP-routed via Cloud Router), reaching the Db2-fronting service in the VMware cluster — which is also a fleet member, so even that hop is mesh-secured and policy-governed. The response retraces the path. At no point does the traffic touch the public internet, and at every hop the calling workload proves its identity. Meanwhile, entirely out-of-band, Config Sync has guaranteed every one of those clusters is running the network policies and pod-security settings the Git repo mandates, and Policy Controller is continuously asserting none of them drifted.
The diagram in words. Picture three columns — On-prem (Zurich VMware), AWS (eu-west-1, EKS), GCP (europe-west6, GKE) — each containing one or more Kubernetes clusters. Across the top, spanning all three columns, sits a single horizontal band labeled Fleet (GKE Enterprise): Config Sync ← one Git repo, Policy Controller, Cloud Service Mesh (one CA), Fleet Workload Identity, and the unified Cloud Logging/Monitoring + fleet dashboard. Dashed lines run from every cluster up into that band (the Connect Agent’s outbound mTLS). Across the bottom, spanning all three columns, sits the network spine: GCP Shared VPC in the middle, a Cloud Interconnect (solid, fat) line to the on-prem column, an HA VPN (solid) line to the AWS column and a parallel HA VPN backup line to on-prem, all converging on a Network Connectivity Center hub, with Cloud DNS forwarding threaded across. Workload-to-workload calls are horizontal arrows along the bottom spine (mesh mTLS); governance flows are vertical arrows into the top band. The two bands never cross — that separation is the architecture.
Component breakdown
| Component | GCP / service | What it does here | Key configuration choices |
|---|---|---|---|
| Fleet | GKE Enterprise fleet (fleet host project) | The single object that groups all clusters; the unit of identity, policy, and observability | One fleet per org platform; fleet scopes to segment teams/environments; members named by purpose, not location |
| GKE clusters | GKE on Google Cloud (Autopilot or Standard) | The GCP-native compute for new strategic workloads | Private clusters (no public node IPs); regional control plane for HA; Autopilot where you want zero node ops |
| AWS clusters | EKS, registered as attached members (Connect Agent) | Brings existing AWS Kubernetes under one fleet without re-platforming | Connect Agent outbound-only; no inbound port; fleet WI federated to AWS IAM via OIDC |
| On-prem clusters | Google Distributed Cloud (software / VMware) | GKE’s on-prem distribution next to the regulated data | Bundled load balancing (MetalLB/Seesaw) or F5; registered via Connect; admin + user cluster split |
| GitOps engine | Config Sync | Continuously reconciles every member against one Git repo; auto-corrects drift | Unstructured repo with per-cluster/per-scope overlays; RootSync for cluster-scoped, RepoSync for namespace teams |
| Policy guardrails | Policy Controller (managed Gatekeeper/OPA) | Admission-time enforcement + continuous audit of constraints fleet-wide | Start in dryrun/audit, then enforce; use the policy bundles (CIS, PCI, pod-security) as a baseline |
| Service mesh | Cloud Service Mesh (managed Istio) | Identity-based mTLS, traffic management, telemetry across all members | Shared CA / trust domain across clusters; mesh expansion for the VMware/EKS members; STRICT mTLS in PII namespaces |
| Identity | Fleet Workload Identity | One identity pool so any workload on any member authenticates as itself, no keys | Federate the fleet WI pool to AWS IAM (OIDC) so GKE/EKS workloads assume AWS roles keylessly |
| GCP backbone | Shared VPC (host + service projects) | The GCP-side network all workloads and links attach to | Host project owns subnets/firewall/routing; Private Google Access; PSC for private service consumption |
| On-prem link | Cloud Interconnect (Dedicated) + Cloud Router | Private, high-bandwidth, low-jitter path to Zurich | 2 connections in 2 edge availability domains of one metro for 99.9% (4 connections across 2 metros for 99.99%); BGP via Cloud Router; VLAN attachments |
| Encrypted / backup link | HA VPN + Cloud Router | Always-on encrypted backup to on-prem; primary private path to AWS | Two tunnels per gateway; same prefixes as Interconnect at lower BGP priority for auto-failover |
| Multi-cloud hub | Network Connectivity Center | Ties on-prem, AWS, and GCP into one transitive routing domain | On-prem + AWS connections as spokes; lets the three environments route through Google’s backbone |
| Hybrid DNS | Cloud DNS (private zones + forwarding) | Cross-boundary name resolution for service discovery | Inbound/outbound forwarding to the on-prem resolver; private zones peered to the Shared VPC |
| Observability | Cloud Logging + Monitoring (fleet) + Managed Prometheus | One pane of glass across all members; SLOs and dashboards | GKE Enterprise dashboards; Managed Service for Prometheus scrapes every member; alerting on policy/sync drift |
A handful of these choices carry the design and deserve the why, not just the what.
The fleet is the unit of management — not the cluster, not the project. The single most important mental shift is to stop operating clusters and start operating a fleet. Once a cluster is a member, you turn on Config Sync, Policy Controller, and the mesh as fleet-default configurations, so a newly registered cluster inherits the entire governance posture automatically — it joins the fleet already compliant rather than being configured into compliance afterward. Fleet scopes then let you carve the fleet into bounded sets (e.g. team-pricing, env-prod) so a team’s namespaces and the services they can see in the mesh are governed together across whatever clusters they span. The payoff is that “onboard the thirty-first cluster” and “onboard the third” are the same one-step operation, and the auditor’s “is X enforced everywhere” becomes a single dashboard, because everywhere is now a defined, enumerable thing.
Config Sync + Policy Controller is how “the standard” stops living in a wiki. The reason every cluster drifted is that the standard was documentation, not code — and documentation does not reconcile. Config Sync makes the Git repo the desired state: it continuously applies it and reverts anything that drifts, so a hand-edit on a cluster is undone within minutes. Policy Controller is the complementary half — it both blocks non-compliant resources at admission and continuously audits what already exists, reporting violations to the fleet dashboard. Together they convert governance from a periodic human review into a closed control loop: desired state declared once in Git, enforced and verified everywhere, with the dashboard as the live compliance report. This is the mechanism that makes the security team’s mandate (“prove the control is enforced”) a query rather than a quarter.
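On each member, the pointer from cluster to Git is a RootSync object. The following is a minimal sketch of one, assuming the unstructured repo layout and repository URL used elsewhere in this article; the Secret name git-creds and the fleet directory are assumptions. When Config Sync is configured through the fleet feature, an equivalent object is typically generated for you, so writing it by hand is mostly useful for understanding what is being reconciled.

```yaml
# Sketch only: the cluster-scoped sync source that Config Sync reconciles from.
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/helvetia/fleet-config
    branch: main
    dir: fleet            # directory in the repo holding fleet-wide policy
    auth: token
    secretRef:
      name: git-creds     # assumed Secret holding the Git token
```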
Cloud Service Mesh with a shared CA is what makes “trust the IP” go away. The reason cross-cloud calls today are secured by IP range is that there was no common identity across the three Kubernetes worlds. Cloud Service Mesh issues every workload a SPIFFE identity from one shared root of trust spanning all members, so a service on EKS and a service on GKE authenticate to each other by identity (spiffe://trust-domain/ns/.../sa/...) over mTLS, regardless of cloud. Setting STRICT mutual TLS in the PII-handling namespaces means a call without a valid workload certificate is refused — which is the actual, enforceable version of the control the auditor was asking about. The mesh also gives you uniform traffic management (retries, canaries, locality-aware routing) and golden-signal telemetry across the fleet, but the identity story is the load-bearing one for a regulated multi-cloud estate.
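STRICT mTLS proves who the caller is; an Istio AuthorizationPolicy is one way to then decide what that identity may do. The sketch below allows only the portal's service account to call the pricing workload. The trust domain helvetia.example, the portal namespace and service account, and the app label are all hypothetical; in a fleet the trust domain is commonly derived from the fleet host project, so check the value your mesh actually issues.

```yaml
# Sketch only: authorize by workload identity rather than by source IP.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-portal-to-pricing
  namespace: pricing
spec:
  selector:
    matchLabels:
      app: pricing-svc            # assumed label on the pricing pods
  action: ALLOW
  rules:
    - from:
        - source:
            # The principal is the SPIFFE ID without the spiffe:// prefix.
            principals: ["helvetia.example/ns/portal/sa/portal-svc"]
```

Once an ALLOW policy selects a workload, requests that match none of its rules are rejected, so the caller list becomes explicit and reviewable in Git.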
Interconnect for bandwidth and jitter; HA VPN for resilience and AWS reach. They are not interchangeable. The on-prem path carries chatty, latency-sensitive reads of policyholder and claims data, so it wants Dedicated Interconnect: private, predictable single-digit-millisecond latency, and 10–100 Gbps. But Interconnect has weeks of lead time, no inherent encryption, and a single circuit is not redundant. So HA VPN is the day-one bring-up, the permanent encrypted backup (advertising the same prefixes with a higher MED, that is, a numerically larger advertised route priority, so failover is automatic and sub-minute), and the primary private path to AWS, where you do not have a colo cross-connect at all. Letting Cloud Router run BGP on every link is what makes all of this dynamic: routes propagate, failover is a routing decision rather than a ticket, and Network Connectivity Center turns the collection of links into a single transitive hub so on-prem and AWS reach each other through Google’s backbone instead of via a separate, hand-built mesh.
Implementation guidance
Stand up the fleet and its host project first (the management plane before the workloads).
- Choose a dedicated fleet host project (e.g. helvetia-fleet-host). This is not where workloads run; it owns the fleet, the Config Sync/Policy Controller/mesh features, and the unified observability. Keep it small and tightly controlled.
- Register every cluster as a member. GKE-on-Google clusters register with one command/Terraform resource. EKS clusters register as attached members: deploy the Connect Agent, which dials out to Google over mTLS, and confirm no inbound rule is needed on the AWS security groups, which is the property that keeps the AWS network closed. On-prem, install Google Distributed Cloud (software) on the VMware cluster and register the same way.
- Turn on fleet features as defaults, not per cluster: enable Config Sync, Policy Controller, and Cloud Service Mesh at the fleet level with fleet-default member configuration, so any future member inherits them. This is the line between “governance you apply” and “governance clusters are born with.”
GitOps repository layout (Config Sync) — get this shape right or you will fight overlays forever.
- Use one repo with a clear separation between fleet-wide policy (applied to all clusters via RootSync: Policy Controller constraints, baseline RBAC, baseline NetworkPolicies, allowed-registry constraints) and per-scope/per-cluster overlays (Kustomize) for the legitimate differences (the on-prem cluster’s load-balancer class, AWS-specific annotations).
- Give application teams their own RepoSync in their namespaces so platform owns cluster-scoped config and teams own their app config, which is least privilege in the GitOps layer itself.
- Keep Policy Controller constraints in the repo too, starting every new constraint in dryrun/audit mode, reviewing the violation report on the dashboard, then flipping to enforce (enforcementAction: deny), so you never break running workloads by enforcing a policy blind.
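The team-owned half of that split can be sketched as a RepoSync in the team's namespace plus the RoleBinding that limits what its reconciler may change. The team repository URL, the deploy directory, and the Secret name are hypothetical. The reconciler service account name follows the ns-reconciler-NAMESPACE convention Config Sync uses for a RepoSync named repo-sync; verify it against your installed version.

```yaml
# Sketch only: a namespace-scoped sync source owned by the pricing team.
apiVersion: configsync.gke.io/v1beta1
kind: RepoSync
metadata:
  name: repo-sync
  namespace: pricing
spec:
  sourceFormat: unstructured
  git:
    repo: https://github.com/helvetia/pricing-config   # hypothetical team repo
    branch: main
    dir: deploy
    auth: token
    secretRef:
      name: git-creds        # assumed Secret; must exist in the pricing namespace
---
# Grants the namespace reconciler edit rights in this namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pricing-syncs
  namespace: pricing
subjects:
  - kind: ServiceAccount
    name: ns-reconciler-pricing
    namespace: config-management-system
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```

The platform team would typically ship both objects through the root repo, so a team cannot widen its own permissions from its own repo.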
Identity wiring — fleet Workload Identity end to end, no static keys (this is the single biggest incident-class you design out).
- Every workload uses fleet Workload Identity: a Kubernetes service account is bound to a Google service account, and the pod gets short-lived, auto-rotated credentials — no exported service-account JSON keys anywhere, on any cluster. (This codebase has prior history with leaked database credentials committed to git; the hard rule here is that secrets live in Secret Manager and are reached via workload identity, never embedded in a manifest, image, or pipeline.)
- For the cross-cloud case, federate the fleet WI pool to AWS IAM via OIDC so a GKE (or on-prem) workload can assume an AWS IAM role keylessly to read an S3 bucket or an SQS queue the AWS side owns — and symmetrically, EKS workloads authenticate to Google APIs through the same pool. No long-lived AWS keys cross the boundary.
- Centralize human access with Cloud Identity / IAM and group-based fleet RBAC. Connect gateway lets operators kubectl into any member through Google’s auth, so on-prem and EKS access flow through the same SSO and audit trail as GKE: one access story, not three.
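On a GKE member, the keyless binding described in the first bullet can be sketched as an annotated Kubernetes service account plus a workload that runs as it. The Google service account, project, and image path are hypothetical. Attached and on-prem members deliver the credential differently, so treat this as the GKE case only.

```yaml
# Sketch only: Workload Identity on a GKE member, no exported keys.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pricing-svc
  namespace: pricing
  annotations:
    # Links this KSA to a Google service account (hypothetical name).
    iam.gke.io/gcp-service-account: pricing-svc@helvetia-apps.iam.gserviceaccount.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pricing-svc
  namespace: pricing
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pricing-svc
  template:
    metadata:
      labels:
        app: pricing-svc
    spec:
      serviceAccountName: pricing-svc   # pods obtain short-lived credentials via this KSA
      containers:
        - name: app
          image: europe-west6-docker.pkg.dev/helvetia-apps/pricing/pricing-svc:1.8.2
```

The annotation alone is not sufficient: the Google service account must also grant roles/iam.workloadIdentityUser to the member for this namespace and service account in the fleet's workload identity pool.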
Networking and DNS wiring (the data plane).
- Build the Shared VPC in a host project; service projects (where GKE clusters live) attach. Enable Private Google Access on workload subnets and use Private Service Connect endpoints so clusters consume Google/partner APIs over internal IPs — nothing reaches Google over the public internet.
- Provision Dedicated Interconnect early (weeks of lead time): order two interconnects in two edge availability domains, create VLAN attachments, and attach a Cloud Router per region running BGP to the on-prem routers. Stand up HA VPN the same week as the day-one path and permanent backup, advertising the same on-prem prefixes at a lower BGP priority so the Interconnect is preferred and failover is automatic.
- For AWS, create an HA VPN gateway peered to an AWS virtual private gateway or Transit Gateway (four tunnels across two AWS Site-to-Site VPN connections, BGP both sides). Then create a Network Connectivity Center hub and attach the on-prem and AWS connections as spokes, so the three environments form one transitive routing domain.
- Wire Cloud DNS: private zones for GCP services peered to the Shared VPC, outbound forwarding to the on-prem resolver for on-prem names, and inbound forwarding (a Cloud DNS inbound endpoint) so on-prem and AWS can resolve GCP/mesh names. Plan non-overlapping CIDRs across on-prem, AWS, and GCP up front — overlapping RFC 1918 ranges are the single most common, most painful hybrid mistake.
Mesh expansion across clouds. Bring the EKS and VMware members into Cloud Service Mesh with a shared trust domain so identities are comparable everywhere. Set mutual TLS to STRICT in the PII-handling namespaces (policyholder, claims, pricing) so any non-mTLS call is rejected — the enforceable form of the audit control. Use the mesh’s locality-aware routing so a service prefers an in-region/in-cloud backend and only spills cross-cloud on failure, which keeps the expensive cross-cloud hops rare.
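The locality preference described above can be sketched with an Istio DestinationRule. This assumes the pricing service has backends in both europe-west6 and eu-west-1 and that node region labels carry those values; the host name and thresholds are illustrative. Older mesh versions may need the v1beta1 API version instead of v1.

```yaml
# Sketch only: prefer in-region backends, spill to AWS only when they are unhealthy.
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: pricing-locality
  namespace: pricing
spec:
  host: pricing-svc.pricing.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          - from: europe-west6
            to: eu-west-1
    # Locality failover only triggers when outlier detection is configured,
    # because that is how the proxy learns a local backend is unhealthy.
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
```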
A few IaC snippets (Terraform) capturing the wiring people most often get wrong. First, registering an EKS cluster as a fleet member and turning on fleet features as defaults — the part teams do per-cluster by hand and then drift:
# The fleet lives in the host project.
resource "google_gke_hub_fleet" "platform" {
  project      = var.fleet_host_project
  display_name = "helvetia-platform"
}

# Register the existing EKS cluster as an ATTACHED member (outbound-only agent).
# Registration is keyless: Google trusts the cluster's OIDC issuer. The bootstrap
# manifest that deploys the agent is applied to the cluster separately.
resource "google_container_attached_cluster" "eks_quoting" {
  name             = "eks-quoting-euw1"
  project          = var.fleet_host_project
  location         = var.attached_region           # assumed variable: a supported Google Cloud region
  distribution     = "eks"
  platform_version = var.attached_platform_version # assumed variable: a supported platform version

  oidc_config {
    issuer_url = "https://oidc.eks.eu-west-1.amazonaws.com/id/${var.eks_oidc_id}"
  }

  fleet {
    project = "projects/${var.fleet_host_project_number}" # assumed variable: host project number
  }
}

# Turn Config Sync ON for the whole fleet as a DEFAULT,
# so any future member inherits the governance posture automatically.
resource "google_gke_hub_feature" "configmanagement" {
  name     = "configmanagement"
  project  = var.fleet_host_project
  location = "global"

  fleet_default_member_config {
    configmanagement {
      config_sync {
        git {
          sync_repo   = "https://github.com/helvetia/fleet-config"
          sync_branch = "main"
          policy_dir  = "fleet"
          secret_type = "token" # or gcpserviceaccount; never a static key in the manifest
        }
      }
    }
  }
}

# Policy Controller is a separate fleet feature with its own fleet default.
resource "google_gke_hub_feature" "policycontroller" {
  name     = "policycontroller"
  project  = var.fleet_host_project
  location = "global"

  fleet_default_member_config {
    policycontroller {
      policy_controller_hub_config {
        install_spec           = "INSTALL_SPEC_ENABLED"
        audit_interval_seconds = 60
      }
    }
  }
}
Second, the redundant on-prem link — two Interconnect VLAN attachments plus an HA-VPN backup advertising the same routes at a lower priority, which is what makes failover automatic:
resource "google_compute_router" "onprem" {
  name    = "cr-onprem-euw6"
  region  = "europe-west6"
  network = google_compute_network.shared_vpc.id

  bgp {
    asn = 64512
  }
}

# Interconnect VLAN attachment (one of two; the second is in a different edge domain).
resource "google_compute_interconnect_attachment" "vlan_a" {
  name              = "vlan-zurich-a"
  region            = "europe-west6"
  router            = google_compute_router.onprem.id
  interconnect      = google_compute_interconnect.dedicated_a.id
  vlan_tag8021q     = 401
  candidate_subnets = ["169.254.10.0/29"]
}

# HA VPN tunnel to on-prem: same prefixes, but advertised at a WORSE priority,
# so Interconnect wins and VPN takes over only on failure.
resource "google_compute_router_peer" "vpn_backup_peer" {
  name                      = "onprem-vpn-backup"
  router                    = google_compute_router.onprem.name
  region                    = "europe-west6"
  interface                 = google_compute_router_interface.vpn_if.name
  peer_ip_address           = var.onprem_vpn_peer_ip
  peer_asn                  = 65010
  advertised_route_priority = 200 # higher number = less preferred than Interconnect (100)
}
Third, a Policy Controller constraint (delivered through Config Sync) that bans public LoadBalancers in prod and is rolled out safely in audit mode first:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sNoExternalServices   # Policy Controller library template; permits only internal LoadBalancers
metadata:
  name: no-public-lb-in-prod
spec:
  enforcementAction: dryrun   # start in audit; flip to "deny" after reviewing the dashboard
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Service"]
    namespaceSelector:
      matchExpressions:
        - { key: env, operator: In, values: ["prod"] }
And the mesh PeerAuthentication that turns “trust the IP” into “prove your identity” for PII namespaces:
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: strict-mtls
  namespace: pricing   # PII-handling namespace
spec:
  mtls:
    mode: STRICT       # any non-mTLS call is refused, on every member
Deployment and rollout. Stand up the fleet and the network spine first, then register clusters one at a time, each time confirming Config Sync reconciles green and the mesh sidecars come up before moving on. Roll out every Policy Controller constraint in dryrun first, read the violation report, remediate, then enforce — never enforce blind. Application workloads deploy through normal CI to Artifact Registry, with the promotion across clusters expressed as Git changes to the Config Sync overlays, so “ship to the next environment” is a reviewed pull request, not a kubectl against a member. The whole platform — fleet, memberships, features, VPC, Interconnect, HA VPN, NCC, DNS, IAM — is one Terraform state per environment, promoted dev → staging → prod with the same code.
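Promotion as a Git change can be sketched as a one-line edit to a Kustomize overlay in the Config Sync repo. This assumes Config Sync's Kustomize rendering is in use and that a base directory defines the pricing workload; the file path, image name, and tag are hypothetical.

```yaml
# overlays/prod/kustomization.yaml (hypothetical path in the Config Sync repo)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: pricing-svc   # image name as referenced in the base manifests
    newName: europe-west6-docker.pkg.dev/helvetia-apps/pricing/pricing-svc
    newTag: "1.8.3"     # promoting a release is a reviewed diff to this line
```

Under this layout the pull request that bumps newTag is the deployment record, and reverting the commit is the rollback.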
Enterprise considerations
Security and Zero Trust. This architecture is a Zero Trust posture for a multi-cloud Kubernetes estate, expressed in three reinforcing layers. At the identity layer, every workload on every cluster has a SPIFFE identity from one shared CA, and STRICT mTLS in sensitive namespaces means service-to-service calls are authenticated and encrypted by identity; there is no implicit trust by network location, which is the whole point. Fleet Workload Identity (federated to AWS IAM) means no static keys anywhere, on any cloud, eliminating the most common breach vector (a leaked key) by construction, and explicitly closing the door that this codebase’s history of leaked DB credentials left open. At the policy layer, Policy Controller enforces guardrails (no public LBs, approved registries only, mandatory NetworkPolicies, pod-security baselines) at admission and continuously audits for drift, while Config Sync guarantees the declared security config is actually present on every member and reverts tampering. At the network layer, traffic rides Interconnect/HA VPN over Google’s private backbone rather than the public internet, Private Google Access / PSC keep API calls off the internet, and the GKE clusters are private (no public node IPs). Identity, policy, and network controls compound rather than substitute.
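Two of the baseline controls named above, pod-security baselines and mandatory NetworkPolicies, can be sketched as plain Kubernetes objects delivered through Config Sync. The namespace name and the env label are assumptions. The default-deny policy is only a starting point: explicit allow rules for mesh and application traffic have to be layered on top of it.

```yaml
# Sketch only: a PII namespace that enforces the baseline Pod Security Standard.
apiVersion: v1
kind: Namespace
metadata:
  name: pricing
  labels:
    env: prod
    pod-security.kubernetes.io/enforce: baseline
---
# Deny all ingress to every pod in the namespace unless another policy allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: pricing
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
```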
Cost optimization. The headline lever is that fleet management replaces N operational stacks with one — one monitoring stack, one policy pipeline, one identity model, one access path — so the operational cost of the thirty-first cluster is near-zero marginal, which is the real multi-cloud expense. On the network, be honest about the trade-offs: Cloud Interconnect is a fixed monthly port charge plus low egress, justified only by the bandwidth and jitter the on-prem data path actually needs — right-size the port (start at 10 Gbps, not 100) and remember HA VPN is far cheaper for links whose volume does not justify a circuit (the AWS path often starts as VPN-only). Cross-cloud egress is the silent budget killer: every GKE→AWS or GKE→on-prem call may incur egress, so the mesh’s locality-aware routing (prefer in-cloud backends, spill cross-cloud only on failure) is a cost control as much as a latency one — keep chatty service pairs co-located and only span the boundary when the data demands it. GKE Autopilot for new clusters bills per pod resource request (no paying for idle node headroom), and GKE Enterprise’s per-vCPU fleet pricing means you pay for governed capacity, not per-cluster licenses. Tag/label every cluster and workload by owning team to attribute both compute and egress.
Scalability. A single fleet governs hundreds of clusters and thousands of nodes across clouds; onboarding scales linearly because registration + fleet-default features is one operation regardless of count. Config Sync reconciles every member independently and in parallel, so the governance loop does not slow as the fleet grows. The mesh scales horizontally per cluster, and fleet scopes keep the blast radius and the cognitive load bounded as teams multiply. On the network, Cloud Router and NCC carry large route tables and many spokes, and Interconnect scales to 100 Gbps (or multiple circuits with ECMP) when the data-center data path grows.
Reliability and DR (RTO/RPO). Resilience is layered. The network path is the clearest story: two Interconnects in two edge domains survive an edge failure (99.99% SLA), and HA VPN advertising the same prefixes at lower BGP priority fails the on-prem path over automatically in under a minute with no operator action — an effective network RTO measured in seconds. Regional GKE control planes and multi-zone node pools survive a zone loss. For workload DR, the fleet makes active-active or warm-standby across clouds natural: the same Config Sync repo deploys the pricing service to both the GKE and (if desired) the on-prem cluster, and the mesh’s locality routing fails traffic over to the surviving member — so a regional GCP outage degrades to serving from on-prem/AWS rather than going dark. RTO/RPO for stateful data is then governed by the data tier riding over this fabric (Db2/Cloud SQL replication, BigQuery), not by the platform itself — model the cases explicitly: zone loss, region loss, full Interconnect loss, and a whole-cloud loss, each in a runbook.
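The automatic failover described above falls out of ordinary BGP best-path selection. The following is a simplified sketch of that logic only: it assumes all other BGP attributes tie so that the lowest MED among live sessions wins, and the path names and MED values are hypothetical. Real routers would also spread traffic across equal-MED paths with ECMP, which this model ignores.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Path:
    name: str
    med: int          # lower MED = more preferred
    session_up: bool  # BGP session state for this link


def best_path(paths: Iterable[Path]) -> Optional[Path]:
    """Return the preferred live path, or None if every session is down."""
    live = [p for p in paths if p.session_up]
    if not live:
        return None
    # min() keeps the first of several equal-MED paths; real routers
    # would load-balance across them instead.
    return min(live, key=lambda p: p.med)


# Hypothetical on-prem connectivity: two circuits plus a VPN backup that
# advertises the same prefixes with a worse (higher) MED.
steady_state = [
    Path("interconnect-edge-a", med=100, session_up=True),
    Path("interconnect-edge-b", med=100, session_up=True),
    Path("ha-vpn-backup", med=200, session_up=True),
]
both_circuits_down = [
    Path("interconnect-edge-a", med=100, session_up=False),
    Path("interconnect-edge-b", med=100, session_up=False),
    Path("ha-vpn-backup", med=200, session_up=True),
]
```

In steady state the VPN route is present but never selected; once the Interconnect sessions drop, it is simply the best remaining route. No operator action is needed because nothing has to be reconfigured, only re-evaluated.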
Observability. GKE Enterprise gives one pane of glass across every member: the fleet dashboard shows cluster health, Config Sync status, and Policy Controller violations for on-prem, AWS, and GCP clusters together. Cloud Logging and Monitoring aggregate logs/metrics fleet-wide, Managed Service for Prometheus scrapes every member with no per-cluster Prometheus to operate, and Cloud Service Mesh emits golden-signal telemetry (latency, traffic, errors, saturation) and a service topology spanning clouds. The two governance vital signs to alert on are Config Sync drift/sync errors (a member that stopped reconciling is a member that is silently diverging) and Policy Controller violation count (anything in violation is a control gap to close) — plus the network’s BGP session state on every Interconnect/VPN link.
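The alerting rules named above can be sketched as a small evaluation function. This is illustrative only: the `MemberStatus` record and its field names are a hypothetical schema, not the shape of any Google API response, and the five-minute staleness threshold is an assumption to tune for your own reconciliation interval.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemberStatus:
    """Hypothetical per-cluster snapshot assembled from fleet telemetry."""
    name: str
    seconds_since_last_sync: int
    sync_errors: int
    policy_violations: int
    bgp_sessions: Dict[str, bool] = field(default_factory=dict)  # link -> up?


def evaluate_member(m: MemberStatus, max_sync_age_s: int = 300) -> List[str]:
    """Return one alert string per governance or network vital sign that is off."""
    alerts: List[str] = []
    if m.sync_errors > 0:
        alerts.append(f"{m.name}: Config Sync reports {m.sync_errors} sync error(s)")
    if m.seconds_since_last_sync > max_sync_age_s:
        # A member that stopped reconciling is silently diverging.
        alerts.append(f"{m.name}: no successful sync for {m.seconds_since_last_sync}s")
    if m.policy_violations > 0:
        alerts.append(f"{m.name}: {m.policy_violations} Policy Controller violation(s)")
    for link, up in sorted(m.bgp_sessions.items()):
        if not up:
            alerts.append(f"{m.name}: BGP session down on {link}")
    return alerts
```

The staleness check is worth keeping separate from the error check: a member whose agent has stopped entirely reports no errors at all, so only the age of the last successful sync reveals it.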
Governance. A clean resource hierarchy (org → folders for environments/business units → the fleet host project + service projects), with Organization Policy constraints enforced top-down (disable service-account key creation, restrict regions, domain-restricted sharing). The fleet is the governance object: scopes bound team access, Config Sync makes the Git repo the single auditable source of cluster state, and Policy Controller’s continuous audit is the live compliance report — so “show me the control across the estate” is a dashboard, and “prove it has been enforced over time” is the Git history plus the audit logs. Audit Logs (Admin Activity always on; Data Access on the sensitive services) flow to a logs bucket / BigQuery sink for retention and SIEM, and the Connect gateway means even on-prem and EKS kubectl access is logged through Google’s audit trail.
Reference enterprise example
Helvetia Risk ran the program over four quarters. Concrete decisions and numbers:
Estate at the start. 3 Kubernetes worlds, 22 clusters total: 6 on-prem (VMware, kubeadm), 9 on EKS (the acquired broker stack, eu-west-1), 7 on GKE (the new pricing/AI build-out, europe-west6). 3 RBAC models, 3 secret stores, 3 CI/CD pipelines, 3 monitoring stacks. The trigger was a regulator audit that asked Helvetia to demonstrate mTLS enforcement and consistent network policy for all PII-handling services across the estate — and the honest answer was “we can’t, in under two weeks, with confidence.”
What they built.
- A single GKE Enterprise fleet in a dedicated helvetia-fleet-host project. All 22 clusters registered as members: GKE natively, EKS as attached members via the Connect Agent (outbound-only, no inbound AWS rule), on-prem migrated from kubeadm to Google Distributed Cloud (software, VMware) and registered.
- Config Sync against one fleet-config Git repo (fleet-wide policy via RootSync, per-cluster overlays for the legitimate on-prem/AWS differences, team RepoSync per namespace).
- Policy Controller with the CIS + pod-security policy bundles, every constraint rolled out in dryrun first, then enforced: no public LoadBalancers in prod, Artifact-Registry-only images, mandatory NetworkPolicy per namespace.
- Cloud Service Mesh across all 22 members with a shared CA; STRICT mTLS in the policyholder, claims, and pricing namespaces.
- Fleet Workload Identity federated to AWS IAM via OIDC: the EKS quoting service assumes a GCP identity to call the GKE pricing service, and a GKE batch job assumes an AWS role to read the broker S3 bucket, both keyless.
- Network spine: Shared VPC in europe-west6; 2 × 10 Gbps Dedicated Interconnect to Zurich in two edge domains (BGP via Cloud Router); HA VPN as the on-prem backup and the primary AWS path; Network Connectivity Center hub with on-prem and AWS as spokes; Cloud DNS inbound/outbound forwarding across all three.
Decisions that paid off.
- The audit question became a dashboard. “Show me mTLS and NetworkPolicy enforced for PII services across the estate” went from a two-week, three-team scramble to a single fleet-dashboard view: Policy Controller showing zero violations on the relevant constraints across all 22 members, and the mesh showing STRICT mTLS in the PII namespaces. The auditor accepted the dashboard plus the Git history as evidence.
- Drift stopped being a standing finding. With Config Sync reconciling continuously, the recurring “a cluster fell out of compliance” finding disappeared; a hand-edit on a member was reverted within ~1 minute. The “standard” now lives in Git and is enforced, not described in a wiki.
- Cross-cloud calls got a real security model. The EKS quoting → GKE pricing → on-prem policyholder path became end-to-end mTLS by identity over the private spine, replacing “trust the IP range.” The security team signed off on the cross-cloud data flow for the first time.
- No keys to leak. Every workload moved to fleet Workload Identity; the 3 separate secret stores collapsed toward Secret Manager + WI. The class of incident that previously bit this organization (a credential committed to a repo) was designed out, not merely patched.
- On-prem failover proved boring. A planned Interconnect-maintenance test dropped the primary circuit; the on-prem data path failed over to HA VPN in ~35 seconds with no operator action and no visible impact to the pricing engine’s policyholder reads — exactly because the VPN advertised the same prefixes at a lower BGP priority.
- Onboarding collapsed. Registering the broker’s next EKS cluster and bringing it fully under policy + mesh took under a day (register → Config Sync goes green → mesh sidecars up), versus the multi-week, bespoke effort each cluster used to require.
A failure they handled gracefully. A platform engineer opened a change to enforce a new “no :latest image tags” Policy Controller constraint without running it in dryrun first; as written, it would have blocked a legitimate deploy on the on-prem cluster. Because the constraint was delivered through Config Sync as a reviewed pull request, the pre-merge check (which lints constraints and runs them against a sample) flagged the impending violations. The change was held and re-rolled in dryrun mode: the violation report showed the affected workloads, those were fixed, and only then was the constraint enforced. The guardrail process caught the guardrail mistake; nothing in production broke.
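A pre-merge check of the kind described above can be approximated in a few lines. This is a hypothetical sketch, not Helvetia's actual pipeline or Policy Controller's implementation: the function names and the sample workloads are invented, and treating an untagged image as `:latest` is an assumption based on how container runtimes resolve a missing tag.

```python
from typing import Dict, List, Tuple


def uses_latest_tag(image: str) -> bool:
    """True if the image reference resolves to the mutable 'latest' tag."""
    if "@" in image:
        return False  # pinned by digest, the tag is irrelevant
    # Only inspect the final path segment so a registry port such as
    # "registry.local:5000" is not mistaken for a tag separator.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return True  # no tag given, which defaults to latest
    return last_segment.rsplit(":", 1)[1] == "latest"


def dry_run_violations(workloads: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (workload, image) pairs the constraint would reject if enforced."""
    return [
        (name, image)
        for name, images in sorted(workloads.items())
        for image in images
        if uses_latest_tag(image)
    ]


# Hypothetical sample of deployed workloads used by the pre-merge check.
sample = {
    "pricing-api": ["europe-west6-docker.pkg.dev/proj/repo/pricing:2.3.1"],
    "legacy-batch": ["registry.local:5000/report:latest"],
}
violations = dry_run_violations(sample)
safe_to_enforce = not violations
```

The gate is the last line: the change merges in enforcing mode only when the dry run comes back empty, which mirrors the audit-then-enforce sequence the team fell back to.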
Outcome. Steady-state GCP spend ~₹19 lakh/month (fleet/GKE Enterprise + Interconnect + the GCP workloads), with the operational headcount to run the estate flat despite the cluster count, because there is now one stack to operate instead of three. The audit finding closed on the strength of the dashboard. Cross-cloud service calls are identity-secured over a private, redundant spine; on-prem failover is automatic and proven; there are no static keys to leak; and onboarding a new cluster — anywhere — is a day, not a quarter. Helvetia now treats “multi-cloud” as an operating model they govern, rather than an accident they survive.
When to use it
Use this architecture when:
- You have Kubernetes in more than one place — multiple clouds, or cloud plus on-prem — and the operational and governance cost of running them as separate islands is the real pain (this is the clearest signal you need a fleet).
- You face a “prove this control is enforced everywhere” mandate (regulated industry, audit, security baseline) across a heterogeneous estate — fleet-wide Config Sync + Policy Controller is purpose-built for exactly this.
- You have regulated or gravity-bound data on-prem that new cloud workloads must sit next to with private, high-bandwidth, low-jitter connectivity — Interconnect + HA VPN is the spine.
- You acquired or inherited a born-in-another-cloud stack (EKS/AKS) you cannot or should not re-platform, but must bring under one identity, policy, and network model.
- You want consistent, identity-based service-to-service security (mTLS) across clouds, replacing IP-range trust — Cloud Service Mesh with a shared CA delivers it.
Be honest about the trade-offs:
- It is a platform, not a weekend project. You are running a fleet, GitOps reconciliation, a multi-cluster mesh, federated identity, and a hybrid network. Without the GitOps and policy discipline, you have built a more complicated way to drift. Commit to Config Sync, dryrun-first policy rollout, and the observability above, or do not adopt it.
- Cross-cloud is not free, in money or latency. Every boundary-crossing call can incur egress and tens of milliseconds. If your services are chatty and co-located logically but spread physically, the bill and the p99 will tell on you — design for locality and span the boundary only where the data requires it.
- It is more than a small, single-cloud problem needs. One cloud, a few clusters, one team, no on-prem data gravity? Plain GKE with Config Sync on a single project (even without the full multi-cloud fleet) is the right answer; adopt the cross-cloud spine when a second environment that must be governed together actually appears.
- Google Distributed Cloud on-prem is real infrastructure. Running GKE on your own VMware/bare-metal is operationally heavier than consuming managed GKE — staff for it, or keep on-prem minimal and exit-oriented.
Anti-patterns to avoid (each is a real incident waiting to happen):
- Operating clusters instead of the fleet. Configuring Config Sync, policy, and mesh per cluster by hand is how you recreate the very drift you adopted the fleet to kill. Turn them on as fleet defaults so members are born compliant.
- Enforcing policy blind. Flipping a Policy Controller constraint straight to deny without a dryrun pass will eventually block a legitimate deploy and look like an outage. Audit, review the dashboard, then enforce, always.
- Overlapping CIDRs across environments. The most common, most painful hybrid mistake. Plan non-overlapping RFC 1918 across on-prem, AWS, and GCP before the first link, or you will be renumbering production later.
- A single Interconnect called “redundant.” One circuit is a single point of failure with weeks of repair lead time. The minimum is two Interconnects in two edge domains plus HA VPN backup; anything less is not a 99.99% path.
- mTLS without a shared trust domain. Turning on the mesh per cluster with different CAs gives you encryption but not cross-cloud identity — the auditor’s question stays unanswerable. One shared root of trust across all members is the whole point.
- Exported service-account keys (or any static cross-cloud keys). The most common breach vector. Use fleet Workload Identity federated to AWS IAM; set the org policy that disables key creation. No exceptions.
- GitOps drift you don’t watch. Config Sync silently failing to reconcile a member means that member is diverging unseen. Alert on sync errors and drift, or the closed loop is quietly open.
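Of these anti-patterns, overlapping CIDRs is the easiest to catch mechanically before the first link goes up. Here is a minimal sketch using Python's standard `ipaddress` module; the environment names and address ranges are hypothetical, and a real plan would also include pod and service ranges for every cluster.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(plan: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return every pair of environments whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in plan.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan with a mistake: the on-prem /14 covers
# 10.0.0.0 through 10.3.255.255 and therefore swallows the GCP /16.
draft_plan = {
    "on-prem-zurich": "10.0.0.0/14",
    "aws-eu-west-1": "10.8.0.0/16",
    "gcp-europe-west6": "10.2.0.0/16",
}

# The same plan with the GCP range moved clear of the on-prem block.
fixed_plan = dict(draft_plan, **{"gcp-europe-west6": "10.16.0.0/16"})
```

Running `find_overlaps` on the draft plan reports the on-prem and GCP pair, and the fixed plan returns an empty list. Wiring a check like this into the review of the address plan costs minutes; renumbering production costs quarters.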
Alternatives worth weighing: for a single cloud, plain GKE + Config Sync + Policy Controller on one project gives you the GitOps/policy benefits without the multi-cloud machinery — graduate to the full fleet when a second governed environment appears. For multi-cloud governance without Google’s fleet, a self-managed Argo CD / Flux + OPA Gatekeeper + Istio (multi-primary) stack across clusters is the DIY equivalent — more control and portability, but you now operate the reconciliation, the policy engine, the mesh CA, and the multi-cluster wiring that GKE Enterprise manages for you (and you lose the single fleet dashboard and Connect gateway). For connectivity specifically, if you have no on-prem at all and only span clouds, you can drop Interconnect and run HA VPN + Network Connectivity Center alone; conversely, very high-bandwidth data-center estates may run Cross-Cloud Interconnect to land a private circuit directly between Google and AWS rather than over VPN. And for the on-prem distribution, Google Distributed Cloud (software) assumes you run the cluster; the GDC connected/air-gapped appliances suit edge or sovereignty cases where Google ships the hardware. The shape — one fleet for management, a private redundant spine for the data path, shared identity and policy across every member — stays the same across all of these. Only the connectivity primitive and the on-prem footprint change.