A regional property and casualty insurer runs its entire policy-administration and claims engine on a 30-year-old IBM z/OS mainframe — eight million lines of COBOL, CICS transactions, and a DB2 database that is the legal system of record for every policy in force. The board’s pressure is concrete and arriving from three directions at once. The MIPS bill from IBM has crossed eight figures a year and rises with every renewal cycle. The last two COBOL programmers who understand the dividend-calculation modules are 61 and 63 years old, and there is no pipeline behind them. And the digital team cannot ship the real-time quoting experience a new direct-to-consumer brand needs, because every quote round-trips through a batch-shaped mainframe that was designed for green-screen agents, not a React app on a phone. The CIO has been told “modernize the mainframe” by a board that has also, in the same breath, made clear that a failed big-bang cutover — the kind that has bankrupted insurers before — is a fireable event.
This is the architecture for doing it the only way that survives that constraint: incrementally, with the mainframe still running and authoritative the whole time, offloading one capability at a time onto AWS until the mainframe is a hollow shell you can finally switch off. It is the strangler-fig pattern applied to a system that cannot have a maintenance window, built on AWS Mainframe Modernization with the DB2 system of record kept continuously in sync with cloud databases so that neither side is ever stale.
Why not the big-bang rewrite
Three obvious approaches each fail, and naming why matters because a vendor will pitch all three.
A big-bang rewrite-and-cutover — freeze the mainframe, rebuild everything in Java, flip over one weekend — is the approach with a graveyard of named failures behind it. Eight million lines encode four decades of regulatory edge cases, state-specific rating rules, and undocumented business logic that lives only in the code. Rewriting it all before shipping anything means a multi-year project with zero value delivered until the terrifying end, and a cutover where every policy, every claim, and every cent of reserves moves at once with no proven rollback. Insurers have died this way.
Pure emulation / lift-and-shift — recompile the COBOL to run on x86 in the cloud and call it done — gets you off the IBM hardware and the MIPS bill, which is real money, but you still have eight million lines of COBOL, the same two retiring engineers, and a batch-shaped architecture that cannot serve real-time quotes. You have changed the cost structure, not the problem.
Leaving it alone is the option the CFO secretly prefers right up until the day a 63-year-old hands in his notice or IBM end-of-lifes the hardware generation, at which point you are doing an emergency migration with a gun to your head instead of a planned one.
The strangler-fig threads the needle. You wrap the mainframe in an API facade, then peel off one bounded capability at a time — quoting first, then a claims sub-flow, then a reporting workload — reimplementing each on AWS while a routing layer sends that capability’s traffic to the new implementation and everything else still to the mainframe. The mainframe shrinks. Value ships every quarter. And because the system of record stays consistent on both sides throughout, you can roll any single capability back to the mainframe in minutes if it misbehaves. The vine grows around the tree until the tree is dead wood you remove.
Architecture overview
There are three planes to hold in your head, and confusing them is the most common way teams get this wrong. The facade plane is the routing layer that decides, per request, whether a capability is served by the mainframe or by AWS. The replatform plane is where offloaded COBOL workloads actually run on AWS. And the data-sync plane is the bidirectional change-data-capture that keeps the DB2 system of record and the AWS databases continuously consistent so the facade can route either way safely.
The defining property of the whole topology — the one that lets the CIO sleep — is this: DB2 on the mainframe remains the authoritative system of record until the very last capability is cut over. AWS databases are kept in lockstep with it by CDC, every offloaded workload writes back to DB2 (directly or via replicated streams) so the mainframe’s books are never wrong, and any capability can be routed back to the mainframe instantly because the mainframe never went stale. You are never betting the company on the new side being correct; you are running both and proving the new side before you trust it.
Control flow, following a quote request through the facade:
- A customer hits the new direct-to-consumer site. They land on Akamai at the edge — TLS termination, global anycast, WAF and bot mitigation — which protects the origin and absorbs the traffic spikes a consumer brand sees that an agent channel never did.
- The request reaches the API facade on Amazon API Gateway, the single front door that fronts every legacy capability. Identity is federated through Okta, the IdP for customers and for the insurer’s agents and staff, with Microsoft Entra ID covering the corporate side; both are brokered into AWS, and API Gateway validates the OIDC/JWT token before any backend is touched.
- Behind the gateway sits the strangler facade / router — application logic on Amazon ECS (Fargate) that owns one decision per capability: is “quoting” live on AWS yet? It consults a feature-flag/routing config. If quoting has been strangled, the request goes to the new AWS quoting service. If not, it is proxied to the mainframe.
- The mainframe path reaches CICS through AWS Mainframe Modernization’s CICS/transaction connectors (or an MQ bridge), invoking the existing COBOL transaction exactly as the green-screen always did. The mainframe reads and writes DB2, which remains the source of truth.
- The AWS path lands on the reimplemented capability. Here you choose per workload between two approaches: Replatform, the AWS Mainframe Modernization managed runtime (Micro Focus / Rocket engine) that recompiles the original COBOL to run on AWS, chosen when the logic is correct and you just want it off z/OS fast; or a refactored service in a modern language, chosen for the capabilities (like real-time quoting) where the batch-shaped COBOL has to become an event-driven API anyway.
- The AWS capability reads from Amazon RDS / Aurora (or DynamoDB for the high-volume quote-shopping path), which the data-sync plane keeps continuously hydrated from DB2. Writes are streamed back so DB2 stays authoritative.
- The priced quote returns through the facade to the customer. Crucially, the customer cannot tell which plane served them, and that is the whole point.
Data-sync plane, the part that makes routing either way safe: change-data-capture runs in both directions. Precisely Connect (formerly Syncsort) Change Data Capture reads the DB2 transaction log on z/OS (log-based, low-overhead, no expensive triggers on the production database) and publishes every committed change as an event into Confluent (Apache Kafka). Kafka Connect sinks those events into RDS/Aurora so that, under normal load, the AWS read models trail DB2 by seconds at most. For capabilities cut over to AWS, the reverse path streams AWS-side writes back into DB2 (via a write-back consumer that applies them through the same transactional connectors) so the mainframe’s system of record reflects business that now happens on the cloud. Kafka is the durable, replayable spine between the two worlds.
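Because the stream is replayable, the sink side has to tolerate seeing the same change twice. The following is a minimal sketch of one way to do that, assuming each event carries a monotonically increasing log position. The `ChangeEvent` shape, the `lsn` field, and the in-memory `ReadModel` standing in for Aurora are all hypothetical; this is not the Precisely or Kafka Connect wire format.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical shape of one committed DB2 change."""
    table: str
    key: str
    lsn: int                     # assumed log position; higher means newer
    op: str                      # "upsert" or "delete"
    row: Optional[dict] = None


class ReadModel:
    """In-memory stand-in for a cloud read model keyed by (table, key)."""

    def __init__(self) -> None:
        self.rows: Dict[Tuple[str, str], dict] = {}
        self.applied_lsn: Dict[Tuple[str, str], int] = {}

    def apply(self, ev: ChangeEvent) -> bool:
        """Apply an event once; return False if it is a duplicate or stale."""
        ident = (ev.table, ev.key)
        if ev.lsn <= self.applied_lsn.get(ident, -1):
            return False         # already applied: a replay or redelivery
        if ev.op == "delete":
            self.rows.pop(ident, None)
        elif ev.op == "upsert":
            self.rows[ident] = dict(ev.row or {})
        else:
            raise ValueError(f"unknown op: {ev.op}")
        self.applied_lsn[ident] = ev.lsn
        return True


model = ReadModel()
events = [
    ChangeEvent("POLICY", "P-1", 10, "upsert", {"premium": "812.40"}),
    ChangeEvent("POLICY", "P-1", 11, "upsert", {"premium": "830.00"}),
]
for ev in events + events:       # deliver the whole stream twice
    model.apply(ev)
```

Delivering the stream twice leaves the read model in the same state as delivering it once, which is the property that makes re-hydration from history safe.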
Component breakdown
| Component | Service / tool | Role in the modernization | Key configuration choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, WAF, bot mitigation for the new consumer channel | WAF rules for quote-scraping bots; origin shield to API Gateway |
| Identity / SSO | Okta + Microsoft Entra ID | Customer/agent SSO (Okta), corporate identity (Entra), brokered to AWS IAM | OIDC federation; JWT validated at the gateway; SCIM provisioning |
| API facade | Amazon API Gateway | Single front door fronting every legacy capability | JWT authorizer; usage plans; canary routing on stages |
| Strangler router | ECS Fargate | Per-capability routing: mainframe vs. AWS | Feature-flag config; sticky routing; instant rollback toggle |
| Mainframe runtime | AWS Mainframe Modernization (Replatform) | Recompiled COBOL/CICS running managed on AWS | Micro Focus/Rocket engine; lift correct logic off z/OS first |
| Refactored services | ECS / Lambda + Aurora | Reimplemented event-driven capabilities (real-time quoting) | For workloads that must become APIs, not batch |
| Legacy connectivity | M2 connectors / IBM MQ bridge | Invoke live CICS transactions from the cloud facade | Transactional CICS connector; MQ for async flows |
| System of record | DB2 on z/OS | Authoritative books until final cutover | Unchanged; log reader attached for CDC |
| Cloud data store | Amazon RDS / Aurora / DynamoDB | Read models + cutover-capability write store | Aurora for relational parity; DynamoDB for quote-shop scale |
| CDC capture | Precisely Connect CDC | Log-based capture of DB2 changes (no triggers) | Reads DB2 log on z/OS; minimal MIPS overhead |
| Streaming spine | Confluent (Apache Kafka) | Durable, replayable bidirectional change stream | Schema Registry; Connect sinks to RDS; exactly-once where it matters |
| Secrets | HashiCorp Vault | DB2 service creds, MQ creds, CDC and API keys | Dynamic DB leases; short-lived tokens; sidecar injection |
| CSPM / IaC scanning | Wiz + Wiz Code | Cloud posture, attack-path, IaC scanning of the Terraform | Agentless scan; PR gate on Wiz Code findings |
| Runtime security | CrowdStrike Falcon | Runtime protection on ECS tasks and connector hosts | Sensor on Fargate/EC2; detections to the SOC |
| Observability | Dynatrace / Datadog | End-to-end tracing across facade → mainframe and CDC lag | OneAgent/Datadog agent; CDC lag SLO; trace spans both planes |
| ITSM / change | ServiceNow | Cutover change approvals, CAB gates, incident records | Change gate before each capability flips; auto-ticket on CDC stall |
| CI / IaC | Jenkins + GitHub Actions + Argo CD; Terraform + Ansible | Build COBOL & services, provision AWS, GitOps deploy | OIDC to AWS; Argo CD app-of-apps; Ansible for connector hosts |
A few of these choices carry the architecture and deserve the why.
Why log-based CDC, not triggers or batch ETL. The DB2 system of record cannot take a performance hit — it is serving live policy and claims traffic — so you must not put triggers on its tables or run nightly bulk extracts that leave AWS hours stale and force a batch-shaped design forever. Precisely Connect reads the DB2 recovery log on z/OS, the same log DB2 already writes for its own durability, and turns committed transactions into a change stream with negligible added MIPS. That stream feeds Confluent, which gives you durability, replay (re-hydrate a new AWS read model from history), and a schema contract between the COBOL copybook world and the cloud. CDC lag becomes your single most-watched SLO, because if it grows, the AWS read models drift from the books.
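As a rough illustration of treating CDC lag as an SLO, the sketch below measures lag as the gap between a commit on DB2 and its arrival in the read model, then classifies it. The threshold values and function names are assumptions for the example, not figures from this architecture.

```python
from datetime import datetime, timedelta, timezone

# Assumed thresholds; real values come from the team's own SLO.
WARN_SECONDS = 5.0      # alert early, well below the danger line
BREACH_SECONDS = 30.0   # read models are treated as stale past this point


def cdc_lag_seconds(source_commit: datetime, applied_at: datetime) -> float:
    """Seconds between a commit on DB2 and its arrival in the read model."""
    return max(0.0, (applied_at - source_commit).total_seconds())


def lag_status(lag: float) -> str:
    """Classify a lag measurement as 'ok', 'warn', or 'breach'."""
    if lag >= BREACH_SECONDS:
        return "breach"
    if lag >= WARN_SECONDS:
        return "warn"
    return "ok"


commit = datetime(2024, 1, 15, 9, 0, 0, tzinfo=timezone.utc)
lag = cdc_lag_seconds(commit, commit + timedelta(seconds=12))
status = lag_status(lag)
```

Alerting at the warn level rather than at the breach level gives operations time to react before a customer ever sees stale data.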
Why a facade and a router, not point integrations. It is tempting to let each new service call the mainframe directly. Do not — you end up with a mesh of brittle CICS couplings and no single place to flip a capability or roll it back. The API Gateway facade plus the ECS router give you exactly one switch per capability. Routing is config, not a deploy: flip quoting to AWS at 9am, watch the dashboards, flip it back to the mainframe at 9:05 if the error rate moves — no code change, no panic.
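The "one switch per capability" idea can be sketched in a few lines. This is an illustrative model only: `CapabilityRouter`, `Plane`, and the in-memory route table are hypothetical names, and a real router would load its routes from whatever feature-flag or config store the team uses.

```python
from enum import Enum
from typing import Dict


class Plane(Enum):
    MAINFRAME = "mainframe"
    AWS = "aws"


class CapabilityRouter:
    """One switch per capability; anything unlisted stays on the mainframe."""

    def __init__(self, routes: Dict[str, Plane]) -> None:
        self._routes = dict(routes)

    def target(self, capability: str) -> Plane:
        """Return the plane that currently serves this capability."""
        return self._routes.get(capability, Plane.MAINFRAME)

    def flip(self, capability: str, plane: Plane) -> None:
        """Change routing for one capability: a config change, not a deploy."""
        self._routes[capability] = plane

    def rollback(self, capability: str) -> None:
        """Send the capability back to the still-authoritative mainframe."""
        self.flip(capability, Plane.MAINFRAME)


router = CapabilityRouter({"quoting": Plane.MAINFRAME})
router.flip("quoting", Plane.AWS)       # 9:00, cut quoting over
after_flip = router.target("quoting")
router.rollback("quoting")              # 9:05, the error rate moved
after_rollback = router.target("quoting")
```

Defaulting unknown capabilities to the mainframe keeps the safe path as the fallback, so a missing config entry never routes traffic to an unproven backend.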
Why keep DB2 authoritative the whole way. The alternative — making AWS authoritative early and syncing back to the mainframe — means that if the new side is wrong, your legal books are wrong, in an industry where the books are filed with state regulators. Keeping DB2 as the system of record until the final capability is proven means every intermediate state is recoverable, and “roll back to the mainframe” is always a safe sentence.
Implementation guidance
Provision with Terraform, configure connector hosts with Ansible, and treat the data-sync plane as the first deliverable — because nothing else can be proven correct until DB2 and RDS demonstrably agree.
- Stand up the VPC, private subnets, and AWS Direct Connect (or a redundant VPN) to the mainframe data center — the CDC stream and the live CICS calls both ride this, so it is on the critical path and needs redundancy from day one.
- Deploy Confluent (Confluent Cloud or self-managed) with Schema Registry, then the Precisely Connect CDC agent on z/OS reading the DB2 log, publishing to Kafka topics keyed by table.
- Provision Aurora/RDS, stand up Kafka Connect sinks, and run the data-sync plane in observe-only mode for weeks: CDC flows DB2 → Kafka → RDS, and a reconciliation job continuously diffs row counts and checksums between DB2 and RDS. You do not route a single user to AWS until this reconciliation is boringly green.
- Build the facade (API Gateway + ECS router) and put all traffic through it while still proxying 100% to the mainframe. This is a no-op for users but proves the front door under real load before it ever routes to a new backend.
- Only then strangle the first capability.
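The reconciliation job in the observe-only step can be sketched as a keyed diff over row checksums. This is a simplified, in-memory illustration: the `reconcile` and `row_checksum` helpers and the sample rows are hypothetical, and a production job would read both sides in batches rather than holding tables in dictionaries.

```python
import hashlib
import json
from typing import Dict, List


def row_checksum(row: dict) -> str:
    """Stable digest of a row, independent of column order."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: Dict[str, dict], replica: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compare two keyed row sets and report every kind of disagreement."""
    missing = sorted(k for k in source if k not in replica)
    extra = sorted(k for k in replica if k not in source)
    mismatched = sorted(
        k for k in source
        if k in replica and row_checksum(source[k]) != row_checksum(replica[k])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


# Sample data: P-1 agrees (column order differs), P-2 is off by one cent.
db2 = {"P-1": {"premium": "812.40", "state": "OH"},
       "P-2": {"premium": "455.00", "state": "PA"}}
rds = {"P-1": {"state": "OH", "premium": "812.40"},
       "P-2": {"premium": "455.01", "state": "PA"}}
report = reconcile(db2, rds)
is_green = not any(report.values())
```

Row counts alone would call these two tables equal; the checksum comparison is what surfaces the one-cent drift on `P-2`.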
A minimal Terraform shape for an AWS Mainframe Modernization replatform environment communicates the intent:
```hcl
resource "aws_m2_environment" "policy_admin" {
  name                = "m2-policyadmin-prod"
  engine_type         = "microfocus" # replatform runtime for lifted COBOL
  instance_type       = "M2.m5.large"
  publicly_accessible = false # private; reached only via the facade

  high_availability_config {
    desired_capacity = 2 # multi-AZ
  }
}

resource "aws_m2_application" "quoting" {
  name        = "quoting-svc"
  engine_type = "microfocus"

  definition {
    content = file("${path.module}/quoting-app-def.json")
  }
}
```
The pipelines that drive this are deliberately split by what they build. Jenkins owns the COBOL build — it already understands the copybooks, the compile-and-test of the legacy modules, and the artifacts the Replatform runtime consumes. GitHub Actions builds the new cloud-native services and runs unit/contract tests, authenticating to AWS via OIDC so no static credentials sit in CI. Argo CD then does GitOps deployment of the containerized facade, router, and refactored services into the cluster via an app-of-apps pattern, so what runs is always what is in Git. Ansible configures the long-lived connector hosts (the MQ bridge, the CDC agent’s cloud side) that do not fit a container lifecycle. Every secret these pipelines need — the DB2 service account, MQ credentials, CDC and third-party API keys — is leased from HashiCorp Vault with short-lived dynamic credentials and sidecar injection, never written into a pipeline variable or a task definition.
The strangler cutover, capability by capability. For each capability the loop is the same, and it is this loop that the whole architecture exists to make safe:
- Shadow. Route a copy of live traffic to the new AWS implementation while the mainframe still serves the real response. Compare outputs offline — for quoting, does the AWS price match the COBOL price to the cent across millions of real quotes? This catches the undocumented edge cases that the rewrite missed.
- Canary. Use API Gateway canary routing to send 1%, then 5%, then 25% of live traffic to AWS, watching error rate, latency, and — critically — CDC lag and reconciliation diffs.
- Cut over. Flip the capability to 100% AWS, with the router’s instant-rollback toggle armed and a ServiceNow change record approved by the CAB.
- Decommission. Once a capability has run clean on AWS through a full business cycle (a month-end close, a renewal batch), retire its COBOL module from the mainframe. The tree loses a branch.
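The offline comparison in the shadow step might look like the sketch below, which checks that two prices agree to the cent using exact decimal arithmetic rather than floats. The `ShadowResult` record and the sample quotes are hypothetical; how shadow responses are actually captured is left open.

```python
from decimal import Decimal
from typing import Iterable, List, NamedTuple


class ShadowResult(NamedTuple):
    quote_id: str
    mainframe_price: Decimal
    aws_price: Decimal


def to_cents(price: Decimal) -> Decimal:
    """Normalize a price to two decimal places for comparison."""
    return price.quantize(Decimal("0.01"))


def mismatches(results: Iterable[ShadowResult]) -> List[ShadowResult]:
    """Return every shadowed quote whose two prices differ at the cent level."""
    return [
        r for r in results
        if to_cents(r.mainframe_price) != to_cents(r.aws_price)
    ]


sample = [
    ShadowResult("Q-1", Decimal("812.40"), Decimal("812.40")),
    ShadowResult("Q-2", Decimal("455.00"), Decimal("455.03")),
    ShadowResult("Q-3", Decimal("99.10"), Decimal("99.1")),
]
bad = mismatches(sample)
mismatch_rate = len(bad) / len(sample)
```

Using `Decimal` avoids binary floating-point rounding, so a reported mismatch reflects a real pricing difference rather than a representation artifact.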
Quote first (high value, no legal-record risk, naturally an API). Reporting and analytics read-models next (they only consume the CDC stream, so they are pure upside). The DB2-writing, reserve-affecting capabilities — claims payment, policy issuance — last, when the data-sync plane and the team both have a year of proof behind them.
Enterprise considerations
Security & Zero Trust. Treat the cloud side as Zero Trust from the first commit: identity-based access only, least-privilege IAM per service, and no public surface on the mainframe runtime or databases, which are reachable only through the facade. Layer on: (a) Okta and Entra ID federation so every actor at the gateway carries a verifiable token and corporate access is governed by conditional access; (b) HashiCorp Vault issuing short-lived DB2/MQ credentials so a leaked static mainframe password (the kind that has burned this team before) simply does not exist to leak; (c) Wiz running continuous CSPM and attack-path analysis across the AWS estate, with Wiz Code gating the Terraform and service repos in the pull request so a misconfiguration (a public RDS, an over-broad security group) is caught before it merges, not after it ships; (d) CrowdStrike Falcon sensors on the ECS tasks and the connector/MQ hosts for runtime threat detection feeding the SOC; (e) encryption on the Direct Connect link, with the CDC stream, which carries regulated policyholder PII, protected by end-to-end TLS and field-level handling for SSNs. A CDC stall or a reconciliation breach auto-raises a ServiceNow incident, because a silent drift between DB2 and RDS is the one failure that could let a wrong number reach a customer.
Cost optimization. The business case is the cost story — escaping the MIPS bill — so engineer the new side not to recreate it.
| Lever | Mechanism | Typical effect |
|---|---|---|
| MIPS retirement | Each decommissioned COBOL module reduces the z/OS workload and renewal MIPS | The core ROI; track $/capability retired |
| Replatform vs. refactor | Replatform the correct-but-boring modules; only refactor what must become an API | Avoids paying to rewrite logic that already works |
| Fargate right-sizing | Scale the facade/router on real concurrency, not peak guesses | Pay for traffic, not headroom |
| Aurora vs. DynamoDB | Relational read-models on Aurora; quote-shopping fan-out on DynamoDB | Match store to access pattern, not habit |
| CDC tiering | Only stream the tables a cloud capability actually reads | Smaller Kafka footprint, lower lag |
Pipe per-capability cost and the shrinking MIPS curve to Datadog or Dynatrace so the CFO sees the mainframe bill bending down quarter over quarter — the chart that funds the next phase.
Scalability. Each plane scales independently. The facade and router scale on Fargate by concurrency; the consumer quote channel — which can spike 50× the old agent volume during a campaign — fans out to DynamoDB and Lambda so a marketing push does not melt the mainframe behind it. Confluent scales by partitions and brokers; size CDC topics by the change rate of their source tables, not uniformly. The AWS Mainframe Modernization replatform environment scales multi-AZ with desired capacity. The natural ceiling early on is the Direct Connect bandwidth and the mainframe’s own capacity for the not-yet-strangled traffic — which is exactly why you strangle the highest-volume, most spike-prone capability (quoting) first, moving it off the mainframe’s back.
Failure modes, and what each looks like. Name them before they page you.
- CDC lag blow-out — the DB2 log reader falls behind under a heavy batch window, and AWS read models go stale; a customer sees yesterday’s coverage. Mitigation: CDC lag as a hard SLO with alerting well below the danger threshold, and capabilities that demand strict freshness reading through to the mainframe until lag is proven tight.
- Reconciliation drift — DB2 and RDS silently disagree on some rows after a schema or mapping change. Mitigation: the continuous diff/checksum job runs forever, not just during onboarding, and a drift auto-tickets and can trip the router back to the mainframe.
- Direct Connect outage — the link to the mainframe DC drops, killing both live CICS calls and the CDC stream. Mitigation: redundant Direct Connect plus VPN failover; Kafka buffers CDC so no change is lost, only delayed; the facade fails capabilities that still need the mainframe to a clear degraded mode, not a hang.
- Write-back conflict — a capability cut over to AWS and the mainframe both touch the same record during a partial cutover. Mitigation: strict per-capability ownership of write paths — exactly one plane owns writes for a given capability at a given time, enforced by the router, never both.
- Shadow-mismatch in production — a refactored module prices a quote a few cents off the COBOL on a rare edge case. Mitigation: this is why shadow mode runs against millions of real requests before any canary; the mismatch is found in a report, not a customer complaint.
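The write-back conflict mitigation above, exactly one plane owning writes per capability, can be expressed as a small guard that every write path consults. This is a sketch under assumptions: the `WriteOwnership` class and its method names are hypothetical, and in practice the ownership table would live alongside the router's config.

```python
from typing import Dict

MAINFRAME, AWS = "mainframe", "aws"


class WriteOwnershipError(RuntimeError):
    """Raised when a plane tries to write a capability it does not own."""


class WriteOwnership:
    """Tracks the single plane allowed to write for each capability."""

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def owner(self, capability: str) -> str:
        """Unlisted capabilities are owned by the mainframe."""
        return self._owner.get(capability, MAINFRAME)

    def transfer(self, capability: str, plane: str) -> None:
        """Hand write ownership to one plane, replacing the previous owner."""
        if plane not in (MAINFRAME, AWS):
            raise ValueError(f"unknown plane: {plane}")
        self._owner[capability] = plane

    def assert_can_write(self, capability: str, plane: str) -> None:
        """Refuse a write from any plane other than the current owner."""
        current = self.owner(capability)
        if current != plane:
            raise WriteOwnershipError(
                f"{plane} may not write {capability}; owner is {current}"
            )


ownership = WriteOwnership()
ownership.transfer("quoting", AWS)
ownership.assert_can_write("quoting", AWS)   # allowed: AWS owns quoting
```

Because ownership is a single value per capability rather than a set, there is no state in which both planes are allowed to write at once.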
Reliability & DR (RTO/RPO). Decide the numbers per plane. The mainframe keeps its existing DR posture — it is still the system of record. On AWS, Aurora with a cross-region replica and Kafka with replicated topics give a recoverable read side; because DB2 remains authoritative and the CDC stream is replayable from history, a total loss of the AWS read models is recoverable by re-hydrating from DB2 through CDC rather than by restoring a backup — the mainframe is the ultimate recovery guarantee until the very end. A pragmatic target through the migration: RTO 30 minutes, RPO near-zero for any cut-over capability, with the safety net that any capability can route back to the still-authoritative mainframe in minutes. Akamai health checks drive edge failover for the consumer channel.
Observability. Instrument one trace across both planes in Dynatrace or Datadog: a request entering the facade, the router’s mainframe-vs-AWS decision, the CICS call or the AWS service call, and the database hop — so when a quote is slow you can see which plane and which hop owns the latency. Emit the metrics this migration actually lives or dies on: CDC lag (seconds behind DB2), reconciliation diff count (must be zero), per-capability error-rate delta between mainframe and AWS during canary, shadow-mode mismatch rate, and the MIPS-retired / cost-down curve the board watches. Each capability cutover is a ServiceNow change with the CAB as a documented gate, and a guardrail breach (CDC stall, reconciliation drift) auto-raises a ServiceNow incident so operations has a ticket, not just a graph.
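One way to put those metrics to work is a promotion gate evaluated before each canary step. The sketch below is illustrative only: the `CanaryMetrics` fields mirror the metrics named above, but the thresholds and the three-way decision are assumptions, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryMetrics:
    cdc_lag_seconds: float
    reconciliation_diffs: int
    shadow_mismatch_rate: float
    mainframe_error_rate: float
    aws_error_rate: float


# Assumed thresholds; real values come from the team's SLOs.
MAX_LAG_SECONDS = 5.0
MAX_ERROR_RATE_DELTA = 0.001


def canary_decision(m: CanaryMetrics) -> str:
    """Return 'rollback', 'hold', or 'promote' for the next canary step."""
    if m.reconciliation_diffs > 0 or m.shadow_mismatch_rate > 0:
        return "rollback"        # the two planes disagree on data or output
    if m.aws_error_rate - m.mainframe_error_rate > MAX_ERROR_RATE_DELTA:
        return "rollback"        # AWS is measurably worse than the mainframe
    if m.cdc_lag_seconds > MAX_LAG_SECONDS:
        return "hold"            # data is correct but not yet fresh enough
    return "promote"


healthy = CanaryMetrics(1.2, 0, 0.0, 0.002, 0.002)
drifting = CanaryMetrics(1.2, 3, 0.0, 0.002, 0.002)
lagging = CanaryMetrics(9.0, 0, 0.0, 0.002, 0.002)
decisions = [canary_decision(m) for m in (healthy, drifting, lagging)]
```

Comparing the AWS error rate against the mainframe's rather than against a fixed number keeps the gate honest when both planes are having a bad day for unrelated reasons.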
Skills and governance. The retiring-COBOL-engineer problem is a first-class risk, not a footnote: shadow mode doubles as executable documentation — comparing AWS output to COBOL output forces the undocumented business rules into testable code before the last expert leaves. Reskilling matters too; the insurer stood up COBOL-to-Java and AWS modernization courses on Moodle as the internal LMS so the existing team learns the new stack rather than being replaced by it — a governance and continuity control as much as a training one. Routing config and IaC live in version control, reviewable and instantly revertable; Argo CD guarantees the deployed state matches Git; and Wiz Code plus the CAB gate mean no capability flips, and no infrastructure changes, without a documented approval.
Explicit tradeoffs
Accept these or do not attempt it. The strangler-fig is slower to finish than a clean rewrite would be in the fantasy where the rewrite worked — you run two systems in parallel for the entire multi-year migration, paying for both the shrinking mainframe and the growing cloud estate at once, and bidirectional CDC plus a facade plus a router are real moving parts to build and operate. The data-sync plane is the hardest engineering in the whole design; CDC lag and reconciliation are a permanent operational concern, not a one-time setup. Keeping DB2 authoritative buys you safety at the cost of write-back complexity for cut-over capabilities. And the discipline the pattern demands — shadow, canary, reconcile, decommission, for every single capability — is organizationally exhausting in a way a single dramatic cutover is not.
The alternatives, and when they win. If your mainframe is small and well-understood — hundreds of thousands of lines, not millions, with current staff who know it — a planned big-bang replatform over a long weekend can genuinely be cheaper and faster, and you skip the dual-run cost entirely. If the goal is purely to escape the IBM hardware and the batch architecture is acceptable forever, a straight lift-and-shift recompile onto AWS Mainframe Modernization without the strangler ceremony gets you off z/OS in months. And if the application is commodity — a payroll, a generic ERP — the right modernization may be buy, not build: replace the COBOL with a SaaS package and use CDC only to migrate the data out. The strangler-fig earns its complexity precisely when the system is large, bespoke, business-critical, and cannot tolerate a failed cutover — which is exactly the regulated, eight-million-line insurance core this article is built around.
The shape of the win
For the insurer, the payoff is not “we moved to the cloud.” It is that the direct-to-consumer brand ships its real-time quoting experience this year on AWS while the mainframe still runs claims untouched; that the MIPS bill bends measurably downward every quarter as COBOL modules go dark one by one; that the day the last 63-year-old engineer retires, his knowledge is already encoded in the shadow-tested services that replaced his modules rather than walking out the door with him; and that at no point did the board have to bet the company on a single weekend. Everything upstream — the Precisely-to-Confluent CDC keeping DB2 and RDS in lockstep, the API facade over CICS, the per-capability router with its instant rollback, the Vault-held mainframe credentials, the Wiz Code gate, the ServiceNow CAB approvals, the cross-plane Datadog trace — exists to turn “modernize the mainframe” from the sentence that ends a CIO’s career into a controlled, reversible, quarter-by-quarter retreat from the tree. The architecture here is the destination. The tree comes down one branch at a time, and the business never notices the saw.