A national logistics and parcel carrier moves 11 million packages a day through a single Java monolith that has accreted features since 2011 — rating, label generation, manifesting, track-and-trace, customs, billing, and the driver dispatch API all in one 2.3-million-line WAR deployed to a fleet of VMs. The CTO’s ultimatum is concrete and arrives the week after a peak-season outage: a rating-engine bug in November took the entire platform down for 47 minutes during the highest-volume hour of the year, because rating, tracking, and dispatch share one process and one heap. Every team waits on a six-week release train. A change to the customs module requires regression-testing checkout. The business wants three things that the monolith physically cannot give them: independent deployability (ship tracking without re-certifying billing), fault isolation (a rating bug must not take down dispatch), and elastic scale (track-and-trace gets 40× its baseline traffic on the morning after a holiday, while billing is flat). The instinct in the room is “rewrite it on Kubernetes.” That instinct, executed literally, is how you get a two-year project that ships nothing and a big-bang cutover that fails on a Saturday night. This article is the pragmatic alternative: a strangler-fig decomposition onto Google Kubernetes Engine (GKE) that keeps the monolith earning revenue every day while microservices grow up around it and eventually replace it.
Why not a big-bang rewrite
The naive plan fails predictably, and naming the failure modes matters because someone will champion the rewrite in the first meeting.
A big-bang rewrite asks you to freeze the monolith’s features for the duration (the business never agrees), reproduce a decade of undocumented edge cases from scratch (you will miss the ones that matter), and cut over everything on one date (the riskiest possible deployment). The reason it is a rewrite is the reason it fails: there is no incremental value and no incremental risk reduction — you carry maximum risk until the last day.
A lift-and-shift of the WAR into a single container on GKE is the opposite mistake. It feels like progress and delivers almost none of the goals: it is still one process, so a rating bug still shares a heap with dispatch (no fault isolation), it still deploys as one unit (no independent deployability), and it still scales as one unit (you scale billing to scale tracking). You have paid for Kubernetes and bought a more expensive monolith.
The strangler fig — named for the vine that grows around a host tree until the original rots away inside it — threads the needle. You put a facade (a routing layer) in front of the monolith, then extract one capability at a time into a microservice on GKE, flipping a route so the new service serves that slice of traffic while everything else still hits the monolith. Each extraction ships independently, delivers value the day it lands, and reduces risk because the blast radius of the next extraction is one capability, not the platform. The monolith shrinks until what remains is small enough to retire or leave as a legacy core. You are never more than one route-flip from rolling back.
Architecture overview
The architecture has three layers that you should hold separately in your head: the edge and facade (where traffic is routed between old and new), the GKE service mesh (where extracted microservices run), and the data and async fabric (how services own their data and talk without coupling). The defining property of the whole topology is the facade: nothing the client sees changes during the migration. The carrier’s driver apps, the public tracking page, and the EDI integrations all keep calling the same hostnames; the facade decides, per route, whether a request is served by the monolith or by a new GKE service.
Request path, following the control flow:
- A client — a driver’s handheld, the public track-and-trace web app, or a shipper’s EDI gateway — hits Akamai at the edge for TLS termination, global anycast, WAF, and bot mitigation. The morning-after-a-holiday tracking spike is mostly absorbed here by Akamai’s CDN caching of tracking pages that change only when a scan event lands.
- Identity is brokered before routing. The carrier’s workforce (drivers, dispatchers, ops) authenticates through Okta as the workforce IdP, federated to Microsoft Entra ID where the org already runs Microsoft 365 — so internal tools see a first-class Entra token, and Okta conditional-access policies (device posture, geo) gate the driver app. External shipper APIs authenticate with OAuth client credentials issued by Apigee. This split — humans through Okta/Entra, machines through the API gateway — is deliberate.
- Traffic reaches the facade, which is Apigee for the externally-published API surface (shipper-facing rating, label, and tracking APIs that need quotas, monetization, developer portal, and versioning) and the Kubernetes Gateway API (Envoy-based, via GKE Gateway) for internal east-west and the web front end. Apigee owns the product API contract; Gateway API owns routing into the mesh. Both consult the same routing rules: a path like `/v2/track/**` now points at the new tracking service, while `/v2/rate/**` still proxies to the monolith.
- For routes that have been extracted, the request lands in the GKE mesh on the relevant microservice (Tracking, Rating, Label, etc.), each a Deployment with its own HPA, its own release cadence, and its own on-call. Cloud Service Mesh (managed Istio) provides mTLS between services, retries, circuit breaking, and the traffic-splitting primitive that makes canaries and the strangler cutover itself possible.
- For routes not yet extracted, the facade proxies straight through to the monolith, which still runs — but now inside GKE as a single large Deployment (lifted in early so old and new share one platform, network, and observability plane), talking to the original schema.
- Each microservice owns its data in its own Cloud SQL instance (or Spanner for the globally-distributed tracking ledger). Synchronous calls between services are the exception; the default boundary is asynchronous via Pub/Sub, so a scan event published by the dispatch path fans out to tracking, billing, and notifications without any of them being on each other’s critical path.
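The per-route decision the facade makes can be pictured as a longest-prefix lookup. The sketch below is illustrative Python, not Apigee or Gateway API configuration: the backend names `tracking-svc` and `monolith` and the function `resolve_backend` are hypothetical, and only the two path prefixes come from the request path described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str   # path prefix the facade matches on
    backend: str  # hypothetical backend name


# Hypothetical route table: extracted capabilities point at new services,
# everything else still goes to the monolith.
ROUTES = [
    Route("/v2/track/", "tracking-svc"),
    Route("/v2/rate/", "monolith"),
]
DEFAULT_BACKEND = "monolith"


def resolve_backend(path: str, routes=ROUTES) -> str:
    """Return the backend for a path using longest-prefix match."""
    best = None
    for route in routes:
        if path.startswith(route.prefix):
            if best is None or len(route.prefix) > len(best.prefix):
                best = route
    # Paths with no matching route fall through to the monolith.
    return best.backend if best else DEFAULT_BACKEND
```

Under these assumptions, extracting a capability is a one-line change to the table, and rolling it back is the same line reverted; clients never see a different hostname.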
Control plane, independent of the request path: Terraform provisions the infrastructure, Ansible configures the few VM-based virtual appliances at the network edge, Argo CD delivers applications by pulling declarative manifests from Git, and Datadog traces the entire request across the old/new seam, so you can see a request hop from Apigee into a new service, out to Pub/Sub, and watch the monolith consume the event.
The seam: how old and new coexist
Most of the hard engineering in a strangler migration is not in the new services — it is in the seam between old and new, and three problems live there.
Routing and cutover. Each capability extraction follows the same ritual: deploy the new service alongside the monolith, replay or mirror a slice of production traffic to it to validate parity, then shift real traffic with a weighted route — 1%, 10%, 50%, 100% — using Cloud Service Mesh traffic splitting (or Apigee target weighting for the external surface). At every step the rollback is a single weight change. The new tracking service ran in shadow mode for two weeks, receiving mirrored production reads and having its responses diffed against the monolith’s, before it served a single real user.
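The weighted step of that ritual can be sketched in a few lines. In the architecture described here the mesh (or Apigee) performs the split, so this is only an illustration of the logic. Hashing a stable key, so that a given shipment stays on one side as the weight rises, is a design choice of this sketch rather than something the mesh is stated to do, and the names `bucket_for`, `choose_backend`, `new-service`, and `monolith` are hypothetical.

```python
import hashlib


def bucket_for(request_key: str) -> int:
    """Map a stable key (for example a shipment ID) to a bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_backend(request_key: str, new_service_weight: int) -> str:
    """Send roughly new_service_weight percent of keys to the new service."""
    if not 0 <= new_service_weight <= 100:
        raise ValueError("weight must be between 0 and 100")
    if bucket_for(request_key) < new_service_weight:
        return "new-service"
    return "monolith"
```

Because the bucket depends only on the key, raising the weight from 10 to 50 never moves a key back to the monolith, and setting the weight to 0 is the rollback.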
Data ownership and the dual-write trap. The monolith has one giant schema; the microservices must own their data, or you have not decoupled anything. You cannot move all the data at once, so during the transition the new service and the monolith both need a consistent view. The wrong answer is dual writes (the app writes to both stores): no transaction spans the two stores, so a failure between the two writes leaves them silently out of step. The right answer is Change Data Capture: the monolith keeps writing its schema, Datastream captures row-level changes from the monolith’s Cloud SQL, and those changes are relayed over Pub/Sub into the new tracking service’s store (Datastream itself delivers to destinations such as Cloud Storage or BigQuery rather than to Pub/Sub directly, so the Pub/Sub hop needs a small relay job downstream of it). The new service gets an eventually-consistent, read-correct copy of the data it is taking over, with no code change in the monolith. When tracking is fully extracted, writes flip to the new service and the CDC feed reverses or retires.
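On the receiving side, the new service has to apply change events so that duplicates and out-of-order deliveries do not corrupt its copy. The following is a minimal in-memory sketch of that idea, not Datastream’s event format: the `ChangeEvent` fields, the `seq` ordering position, and the `ReplicaStore` class are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    table: str
    key: str
    seq: int                    # assumed: increasing position in the source change log
    op: str                     # "upsert" or "delete"
    row: Optional[dict] = None  # new row contents for an upsert


class ReplicaStore:
    """In-memory stand-in for the new service's copy of the data."""

    def __init__(self):
        self.rows = {}         # (table, key) -> row
        self.applied_seq = {}  # (table, key) -> last sequence applied

    def apply(self, event: ChangeEvent) -> bool:
        """Apply an event unless it is a duplicate or older than what we hold."""
        ident = (event.table, event.key)
        if event.seq <= self.applied_seq.get(ident, -1):
            return False  # stale or redelivered: ignore
        if event.op == "delete":
            self.rows.pop(ident, None)
        elif event.op == "upsert":
            self.rows[ident] = dict(event.row or {})
        else:
            raise ValueError(f"unknown op: {event.op}")
        self.applied_seq[ident] = event.seq
        return True
```

Keeping the last applied sequence even after a delete means a late-arriving older upsert cannot resurrect a row the monolith has already removed.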
Distributed transactions become sagas. A single monolith transaction — “rate the shipment, reserve the label number, debit the shipper’s account” — spanned three modules and one ACID commit. Split across Rating, Label, and Billing services, that ACID transaction is gone; forcing a two-phase commit across services recreates the coupling you are trying to escape. Replace it with a saga: each service does its local transaction and publishes an event; a failure triggers a compensating action (release the reserved label number, reverse the debit). Pub/Sub carries the saga events; idempotency keys make retries safe.
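The compensation logic is easier to see in code than in prose. This sketch runs the steps in-process and orchestration-style purely for brevity; in the design described above each step would be a separate service reacting to Pub/Sub events. The `run_saga` function and the `(name, action, compensate)` step shape are hypothetical.

```python
from typing import Callable, List, Tuple

# A step is (name, action, compensating action); both callables take the shared context.
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: List[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action(ctx)
        except Exception as exc:
            ctx["failed_step"] = name
            ctx["error"] = str(exc)
            # Undo only what actually succeeded, newest first.
            for _, _, compensate in reversed(completed):
                compensate(ctx)
            return False
        completed.append(step)
    return True
```

With steps for rate, reserve label, and debit, a failure in the debit step releases the label number and then undoes the rating step, and the failed step itself is never compensated because it never committed.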
| Concern | Monolith (before) | Microservices on GKE (after) | How the seam handles it |
|---|---|---|---|
| Deployment unit | One WAR, six-week train | One service, deploy on its own cadence | Argo CD per-service apps; facade route-flip |
| Fault isolation | Shared heap/process | Pod + namespace boundaries, circuit breakers | Cloud Service Mesh; HPA per service |
| Data | One shared schema | Cloud SQL / Spanner per service | Datastream CDC during transition; saga for writes |
| Cross-service calls | In-process method call | mTLS gRPC/REST or async event | Pub/Sub default; sync only when latency demands |
| Transaction | One ACID commit | Local commit + events | Saga with compensating actions, idempotency keys |
| Scale | Whole app scales together | Per-service HPA | Tracking scales 40×, billing stays flat |
Component breakdown
| Component | Service / tool | Role in the migration | Key choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, WAF, CDN caching of tracking pages | Cache scan-event pages; absorb holiday tracking spike at edge |
| Workforce identity | Okta + Microsoft Entra ID | Driver/dispatcher/ops SSO; Okta federated to Entra for M365 | OIDC federation; conditional access on the driver app |
| External API facade | Apigee | Shipper-facing API product: quotas, versioning, dev portal, OAuth | Target weighting for cutover; spike-arrest per shipper |
| Internal facade / routing | Kubernetes Gateway API (GKE Gateway) | East-west routing, web front end into the mesh | HTTPRoute per capability; weighted backends |
| Compute | GKE (regional, Autopilot for new services) | Runs extracted microservices and the lifted monolith | Autopilot for new svc; Standard node pool for monolith |
| Service mesh | Cloud Service Mesh (managed Istio) | mTLS, retries, circuit breaking, traffic splitting | Strangler cutover via VirtualService weights |
| Per-service data | Cloud SQL (Postgres/MySQL), Spanner | One database per service; Spanner for global tracking ledger | No shared schema; private IP only |
| CDC during transition | Datastream + Pub/Sub | Stream monolith row changes into new service stores | Read-correct copy without monolith code change |
| Async backbone | Pub/Sub | Event boundaries, saga choreography, fan-out | At-least-once + idempotency keys; dead-letter topics |
| Secrets | HashiCorp Vault | DB creds, API keys, signing keys for services | Dynamic Cloud SQL creds; Workload Identity auth; sidecar injection |
| CSPM / posture | Wiz + Wiz Code | Cloud posture, attack-path analysis, IaC scanning pre-merge | Agentless scan of GKE/Cloud SQL; Wiz Code blocks risky Terraform in PR |
| Runtime security | CrowdStrike Falcon | Container runtime threat detection on GKE nodes | Sensor as DaemonSet; detections to the SOC |
| Observability | Datadog | Distributed tracing across the old/new seam, APM, logs | OTel traces Apigee → service → Pub/Sub → monolith |
| ITSM / approvals | ServiceNow | Change approvals for each cutover; incident records | Change gate before a weight increase; auto-ticket on SLO breach |
| CI / CD / IaC | GitHub Actions + Argo CD + Terraform + Ansible | Build/test, GitOps delivery, infra, appliance config | OIDC to GCP (no keys); Argo CD sync; Ansible for edge appliances |
| Internal enablement | Moodle | Engineer training on the mesh, sagas, on-call runbooks | Decomposition playbook course; required before owning a service |
A few of these choices deserve the why, because they are the ones teams get wrong.
Why lift the monolith into GKE early, before extracting anything. It is tempting to leave the monolith on its VMs and only put new services on GKE. Don’t — straddling two platforms means two networking models, two security baselines, two observability stacks, and a painful network hop across the seam on every proxied request. Lifting the monolith into GKE as one large Deployment on day one (no decomposition yet, just containerize and run) means old and new share one VPC, one mTLS mesh, one Datadog plane, and one Argo CD. The seam becomes an in-cluster route, not a cross-environment integration. This is the single highest-leverage early move.
Why Pub/Sub by default, sync by exception. Every synchronous call between two services couples their availability — if Billing is down, anything that calls it synchronously is also down, and you have re-created the monolith’s blast radius across the network. The default boundary is therefore an event on Pub/Sub: the dispatch path publishes a package.scanned event and is done; tracking, billing, and notifications each consume it on their own schedule. Synchronous gRPC/REST is reserved for the few cases where the caller genuinely needs the answer now (rating must return a price to the shipper in the same request). Async-first is what actually buys the fault isolation the CTO demanded.
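Async-first has one obligation attached: Pub/Sub delivery is at-least-once, so every consumer of an event like package.scanned must tolerate seeing it twice. A minimal sketch of an idempotent consumer follows. It is not the Pub/Sub client API; the `IdempotentConsumer` class and the `idempotency_key` field name are assumptions, and a real service would keep the processed keys in a durable store rather than in memory.

```python
from typing import Callable, Dict, Set


class IdempotentConsumer:
    """Wraps an event handler so a redelivered event is processed only once."""

    def __init__(self, handler: Callable[[Dict], None]):
        self._handler = handler
        self._processed: Set[str] = set()  # in production: a durable store

    def handle(self, event: Dict) -> bool:
        key = event["idempotency_key"]  # hypothetical field name
        if key in self._processed:
            return False                # duplicate delivery: acknowledge and skip
        self._handler(event)            # if this raises, the key is not recorded
        self._processed.add(key)
        return True
```

Recording the key only after the handler succeeds means a failed attempt can be retried, while a successful one is never repeated.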
Why one database per service, even though it is more work. A shared database is a shared schema, and a shared schema is a coupling that defeats independent deployability — change a column and you must coordinate every service that reads it. Each service owning its own Cloud SQL instance (with private IP, no public exposure) means a service’s schema is its private implementation detail. The transitional cost is real — CDC pipelines, eventual consistency, sagas instead of joins — and it is the price of the decoupling.
Implementation guidance
Sequence the extractions by value and decoupling, not by what is easy. Pick the first capability with three properties: high business value, relatively clean boundaries in the existing code, and a read-heavy or clearly-bounded data footprint. For this carrier that was track-and-trace: it is the most-hit, spikiest, most independently scalable surface, it is read-mostly (scan events in, status queries out), and its data boundary is clean. Extracting it first delivered the elastic-scale win immediately and proved the seam machinery before touching anything transactional like billing. Rating came second; billing (the hardest, most transactional, most regulated) came last, when the team had earned the experience.
Provision with Terraform; gate the IaC with Wiz Code. The network is the first deliverable: a regional VPC, regional GKE (Autopilot mode for new services so you are not managing nodes; a Standard node pool for the heavier monolith, which in practice means a separate Standard-mode cluster in the same VPC, because Autopilot is set per cluster and does not take user-managed node pools), Cloud SQL instances on private IP only, and Private Service Connect for Pub/Sub and other Google APIs so no data-plane traffic traverses public IPs. Every Terraform change runs through GitHub Actions authenticating to GCP via OIDC Workload Identity Federation (no stored service-account keys to leak), and Wiz Code scans the plan in the pull request: a Cloud SQL instance proposed with a public IP, or an over-broad IAM binding, fails the check before merge, not after deploy.
A minimal Terraform shape for the GKE Autopilot cluster and a per-service Cloud SQL instance communicates the intent — private, no public surface:
```hcl
# GKE Autopilot cluster for the new services: private nodes, Workload Identity.
resource "google_container_cluster" "primary" {
  name             = "carrier-prod-gke"
  location         = "asia-south1" # a region, so the cluster is regional
  enable_autopilot = true          # new services: no node ops

  private_cluster_config {
    enable_private_nodes    = true  # nodes have no public IPs
    enable_private_endpoint = false # control plane reachable from authorized nets
  }

  workload_identity_config {
    workload_pool = "carrier-prod.svc.id.goog"
  }
}

# Per-service Cloud SQL instance for tracking: private IP only.
resource "google_sql_database_instance" "tracking" {
  name             = "tracking-db-prod"
  database_version = "POSTGRES_16"
  region           = "asia-south1"

  settings {
    tier = "db-custom-4-16384"

    ip_configuration {
      ipv4_enabled    = false                         # NO public IP
      private_network = google_compute_network.vpc.id # VPC defined elsewhere
    }

    # Multiple arguments need a multi-line block in HCL.
    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
    }
  }
}
```
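The pre-merge gate on public IPs is the kind of rule that is simple to express against a Terraform plan. Wiz Code’s own rules are not shown in this article, so the following is only a stand-in that illustrates the idea. It assumes the plan has been exported with `terraform show -json` and loaded into a dictionary, and the function name `find_public_sql_instances` is hypothetical. Treating a missing `ipv4_enabled` flag as public is this sketch’s conservative assumption.

```python
def find_public_sql_instances(plan: dict) -> list:
    """Return addresses of Cloud SQL instances the plan would leave with a public IP."""
    offenders = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "google_sql_database_instance":
            continue
        # "after" is None when the resource is being destroyed.
        after = (change.get("change") or {}).get("after") or {}
        # Nested blocks appear as lists of objects in the plan JSON.
        for settings in after.get("settings") or []:
            ip_configs = settings.get("ip_configuration") or [{}]
            if any(cfg.get("ipv4_enabled", True) for cfg in ip_configs):
                offenders.append(change.get("address"))
                break
    return offenders
```

A CI step could fail the pull request whenever this list is non-empty, which mirrors the behaviour the text attributes to the scanner: the misconfiguration is stopped before merge rather than found after deploy.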
Delivery is GitOps with Argo CD. Each microservice is an Argo CD Application that syncs declarative manifests from Git; the desired state of the cluster is the repo, so a deploy is a merged PR and a rollback is a git revert. With automated sync and self-heal enabled, Argo CD detects drift and reconciles a hand-edited cluster back to what is in Git. The strangler cutover itself is GitOps too: the VirtualService weight that shifts traffic from monolith to new service is a manifest in Git, so a cutover and its rollback are auditable commits, which is exactly what the ServiceNow change gate references when approving each weight increase.
Identity: federate the humans, dynamic-lease the machines. Workforce SSO flows Okta → Entra: drivers and dispatchers authenticate with Okta (conditional access on device posture and geography), Okta federates to Entra over OIDC where the org’s M365 identities live, and tools consume the Entra token. Service-to-service auth inside the mesh is mTLS via Cloud Service Mesh — every pod gets a workload identity and certificates rotate automatically. The secrets that are not identities — Cloud SQL passwords, third-party carrier-integration API keys, JWT signing keys — live in HashiCorp Vault, which issues dynamic, short-lived Cloud SQL credentials per service (so a leaked credential expires in minutes and is scoped to one database) and injects them via the Vault Agent sidecar authenticated by GKE Workload Identity. No long-lived database password is ever written to a Kubernetes Secret.
Enterprise considerations
Security & Zero Trust. The mesh is Zero Trust by construction — mTLS everywhere, identity-based service-to-service authorization, no implicit trust between pods. Layer on top: Wiz runs continuous CSPM and attack-path analysis across GKE, Cloud SQL, and IAM, alerting the moment a resource drifts to public exposure or an IAM binding widens, while Wiz Code shifts that left into the pull request so misconfiguration is caught pre-merge. CrowdStrike Falcon sensors run as a DaemonSet on the GKE node pool for container runtime threat detection — anomalous process execution inside a tracking pod, a reverse shell, lateral movement — feeding the carrier’s SOC. A sustained SLO breach or a Falcon detection auto-raises a ServiceNow incident so security and ops have a ticket, not just a dashboard. Org Policy denies any Cloud SQL instance or load balancer created with a public IP, and Wiz independently verifies the policy is actually holding.
Cost optimization. Decomposition can increase cost if you are careless — more instances, more databases, mesh sidecar overhead — so engineer for it.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Autopilot for new services | Pay per pod resource request, not per node | No paying for idle node headroom |
| Per-service HPA | Scale tracking up 40× at peak, billing stays flat | Stop scaling the whole app to scale one capability |
| Spot/Preemptible for async workers | Pub/Sub consumers and batch on Spot node pool | ~60-80% off on interruptible work |
| Cloud SQL right-sizing | Size each service DB to its real load, not the monolith’s | Small services get small instances |
| CDN at the edge | Akamai caches tracking pages | Deflects the holiday read spike before it hits GKE |
| Committed-use discounts | CUDs on steady baseline GKE/Cloud SQL | Discount the predictable floor |
The honest framing for the CFO: the platform cost may rise modestly, but it is now attributable per service (each team’s GKE namespace and Cloud SQL instance is a cost line), peak scale is paid only where it is needed, and the cost of an outage — a 47-minute peak-hour platform failure — is the number the migration is really reducing.
Scalability. This is the headline win. Each service scales independently on its own signal: track-and-trace scales pods on request concurrency and CPU via HPA, absorbing its 40× holiday spike without anyone touching billing; Pub/Sub consumers scale on subscription backlog depth; billing stays at a flat baseline because its load is flat. The monolith, while it survives, scales as one unit — which is precisely the constraint each extraction relieves. Spanner under the global tracking ledger scales horizontally for the read-heavy status-query load that dominates traffic.
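The per-service scaling follows the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces only that core formula; stabilization windows and scaling policies are not modelled, and the replica bounds and example numbers are made up for illustration.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 1000, tolerance: float = 0.1) -> int:
    """Core HPA calculation: scale replicas by the metric-to-target ratio."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 10 tracking pods targeting 100 requests per second each, suddenly seeing 4,000 per pod (a 40× spike), gives `desired_replicas(10, 4000, 100)`, which returns 400. Billing, with a ratio near 1, stays where it is.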
Failure modes, and what each one looks like. Name them before they page you.
- A cutover that regresses parity — the new service returns subtly different results than the monolith for some edge case. Mitigation: shadow/mirror traffic and response-diffing before the first real weight, and keep every cutover a single revertible weight change.
- Pub/Sub redelivery causing double-processing — at-least-once delivery means a saga step can run twice (a shipper double-debited). Mitigation: idempotency keys on every event handler and dead-letter topics for poison messages.
- A saga that fails mid-flight — rating succeeded, label reserved, billing failed; without compensation the system is inconsistent. Mitigation: explicit compensating actions per step and an orchestrator that drives them, monitored in Datadog.
- CDC lag during transition — the new tracking service reads stale data because Datastream fell behind. Mitigation: alert on replication lag; keep the monolith authoritative until the new service is consistent and writes have flipped.
- A synchronous dependency outage cascading — a sync call to Billing takes everything calling it down. Mitigation: prefer async; where sync is unavoidable, circuit breakers and timeouts in the mesh so a slow dependency sheds rather than cascades.
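The first mitigation in the list, diffing the new service’s responses against the monolith’s, can be sketched as a field-by-field comparison. The function name `diff_responses`, the ignored fields `trace_id` and `served_by`, and the assumption that responses are flat dictionaries with string keys are all hypothetical; real responses would likely need nested comparison and tolerance rules.

```python
_MISSING = "<missing>"


def diff_responses(monolith: dict, candidate: dict,
                   ignore=("trace_id", "served_by")) -> dict:
    """Return {field: (monolith_value, candidate_value)} for fields that differ."""
    diffs = {}
    for key in sorted(set(monolith) | set(candidate)):
        if key in ignore:
            continue  # fields expected to differ between the two implementations
        old = monolith.get(key, _MISSING)
        new = candidate.get(key, _MISSING)
        if old != new:
            diffs[key] = (old, new)
    return diffs
```

An empty result for a mirrored request counts as parity. Tracking the share of non-empty results over the shadow period gives a concrete number to put in front of the change gate before the first real weight is applied.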
Reliability & DR (RTO/RPO). Decide the numbers per tier. The GKE cluster is regional (control plane and nodes across zones) so a zonal failure is transparent. Cloud SQL runs with a regional HA configuration and point-in-time recovery; the global tracking ledger on Spanner gives multi-region writes with near-zero RPO. Pub/Sub is regional and durable, buffering events through a consumer outage so nothing is lost. A pragmatic target for the carrier’s interactive surfaces: RTO 15 minutes, RPO 5 minutes, with the tracking ledger effectively zero-RPO on Spanner. Akamai health checks drive edge failover. Critically, during the migration the monolith remains the fallback for any not-yet-cutover route, so the strangler itself is a continuity strategy — you are never betting the business on the new path before it has earned trust.
Observability across the seam. This is non-negotiable for a strangler migration, because a single request can cross old and new. Instrument end-to-end distributed tracing in Datadog with OpenTelemetry: one trace covering Apigee → Gateway API → new GKE service → Pub/Sub publish → monolith consume → Cloud SQL, so when latency regresses you can see which side of the seam owns it. Emit the metrics that actually matter — per-route monolith-vs-service traffic share (so you can watch the fig strangle the tree), per-service p95 latency and error rate, Pub/Sub backlog and redelivery rate, saga completion vs compensation rate, and CDC replication lag. Datadog APM, log correlation, and SLO monitors drive the ServiceNow auto-ticketing on breach. New engineers learn the mesh, sagas, and on-call runbooks through a required Moodle course before they are handed the pager for a service — turning the architecture’s complexity into something teachable rather than tribal.
Governance. Pin everything: container images by digest (never latest), Helm/manifest versions in Git, and the cutover weights as auditable commits. Org Policy denies public IPs and requires private clusters; Wiz is the independent check the controls hold. Each capability extraction passes a ServiceNow change approval before its weight is raised — giving ops and risk a documented gate for each step of a migration that, in aggregate, reshapes a revenue-critical platform.
Explicit tradeoffs
Accept these or do not start. Microservices trade in-process simplicity for distributed-systems complexity, and the bill is real: network calls fail in ways method calls do not, eventual consistency replaces a clean ACID join, sagas replace transactions, and you now operate many services instead of one. Debugging a request means a distributed trace, not a single stack trace — which is exactly why Datadog tracing across the seam is load-bearing, not optional. The data decomposition (CDC pipelines, per-service stores, sagas) is the hardest part and the part teams underestimate. And the strangler approach itself has a cost: you run both the monolith and the new services simultaneously for the duration, paying for two worlds during the transition and carrying the seam’s complexity until the last capability is extracted.
The alternatives, and when they win. If your monolith is small, well-factored, and not under scaling or fault-isolation pressure, leave it alone — microservices are a tax you pay for independent deployability and isolation you do not yet need, and a tidy modular monolith is a perfectly good destination. If you genuinely can freeze features and the system is small, a rewrite is occasionally defensible — but almost never for a 2.3-million-line revenue platform. If you want service boundaries without the operational weight of many deployables, a modular monolith (enforced module boundaries in one process) gets you much of the maintainability with none of the distributed-systems cost, and is a legitimate stopping point. The strangler fig wins precisely the case the carrier is in: a large, business-critical monolith that cannot stop earning, under real and uneven scaling pressure, where the goals are independent deployability and fault isolation and the appetite for big-bang risk is correctly zero.
The shape of the win
For the carrier, the payoff is not “we use Kubernetes now.” It is that the next November, a rating-engine bug degrades rating for the shippers hitting that one service — and track-and-trace, dispatch, and billing keep running, because they are separate processes behind separate pods with circuit breakers between them. It is that the tracking team ships a fix on Tuesday afternoon without re-certifying billing. It is that the holiday-morning tracking spike scales tracking pods 40× while billing sits flat and the bill reflects exactly that. Every piece upstream — the Apigee and Gateway API facade that hid the migration from clients, the Cloud SQL-per-service stores fed by Datastream, the Pub/Sub boundaries carrying sagas, the Argo CD GitOps cutovers, the Datadog traces spanning old and new, the Vault-leased credentials, the Wiz and Falcon guardrails, the ServiceNow gates — exists to turn that 47-minute platform outage into a contained, single-service incident. The architecture here is the destination; the strangler fig is how you get there one revertible route-flip at a time, with the business running every single day in between.