Zero-Downtime Multi-Cloud Landing Zone for a Universal Bank — Enterprise Reference Architecture

Executive summary

A multinational universal bank — present in 70-plus countries, anchored on four on-premises data centres in four sovereign jurisdictions, and already running a dual-cloud estate across Microsoft Azure and AWS — set its architects a single, unforgiving brief: build one foundation on which every line of business can operate, modernise, and grow without ever taking the bank offline. Behind that sentence sits an industrial amount of reality. More than 100 applications span the full breadth of a universal bank — retail banking, corporate and commercial banking, cards and payments, lending and mortgage, bancassurance, trading and capital markets, and wealth and private banking, all sitting on top of enterprise shared services for finance, risk, compliance, HR, and analytics, with SAP at the centre of the back office and a hybrid workforce signing in from branches, offices, and home. The headline mandate is the one that reshapes every decision downstream: a zero-downtime operating model in which critical services carry no maintenance windows, and releases and platform changes happen with no customer-visible downtime. A bank that promises a customer their salary will clear at 02:00 cannot ask that customer to wait while it patches a server.

Two convictions sit underneath the whole design and are worth stating before any topology is drawn. Identity is the perimeter — not the network, not the firewall, not the data centre wall — because in an institution where a single authenticated session can move money across borders, the question that gates every request is who, on what device, under what risk, not which subnet. And the ledger is the source of truth — the core banking general ledger is the immutable, strongly consistent record of money movement, and every channel, every analytic, every digital experience is a system of engagement that reads from and reacts to that record rather than competing to own it. One sentence of scope-setting is owed to the reader here: we treat the universal-bank framing as authoritative, and where the originating proposal carried a generic connected-asset and IoT requirement, we deliberately reinterpret it for banking — the connected estate that matters to this institution is its ATM and self-service fleet, its cash-in-transit and armoured-vehicle movements, and its branch-device and facility/security telemetry, not industrial sensors. Everything that follows is the target-state reference architecture for that foundation, and — because an enterprise of this consequence deserves proof rather than assertion — it is justified with roughly 35 architecture diagrams that show, layer by layer, how the bank stays open while it changes.

Target-state overview

The estate is best understood from a single vantage point before any one domain is opened up. Customers, employees, and partners enter from the top — retail customers on web and mobile, branch and contact-centre staff over workforce SSO, corporate treasurers over host-to-host APIs, and third-party providers under open-banking consent. Every public request first meets a third-party global edge — a content-delivery network, authoritative DNS with global traffic management and health-based failover, and a web application firewall — which normalises, inspects, and protects traffic before it ever touches a cloud. Beneath the edge, Azure and AWS each host enterprise landing zones built to financial-services patterns; the four on-premises data centres anchor the foundation and connect into both clouds over dual private circuits, carrying the systems that must stay on-premises for consistency or regulatory comfort. Running horizontally across the whole picture are the cross-cutting platforms that make a multi-cloud estate behave as one: Okta and Microsoft Entra ID for identity, Wiz for cloud security posture, CrowdStrike Falcon for workload and endpoint protection, Dynatrace for observability, and ServiceNow for operations and change. A dedicated lane carries connected-banking telemetry — ATM, cash-in-transit, and branch-facility signals — into the platform alongside everything else.

Target-State Overview — Zero-Downtime Universal Bank Estate

Two design decisions in this overview propagate through every section that follows, and both are deliberate departures from the path of least resistance. First, the global edge is a single, third-party, cross-cloud control point rather than a per-cloud feature. One CDN/DNS/WAF stack in front of both Azure and AWS gives the bank exactly one place to enforce OWASP protections, bot and API-abuse mitigation, FAPI schema validation, rate limiting, and geo/ASN policy; just as importantly, it lets customer traffic fail over between Azure and AWS origins, or between regions, without re-architecting each application or exposing origin addresses. For an institution whose Tier-1 availability budget is measured in tens of minutes per year (roughly 52 minutes at 99.99%), removing the edge as a single point of failure and centralising its policy is not a convenience; it is a prerequisite. Second, the system-of-record / system-of-engagement split is the load-bearing boundary of the whole estate. The core ledger is kept single-homed (on-premises or pinned to one cloud) for strong consistency and regulatory comfort, and is exposed to the rest of the bank through an event backbone in a canonical ISO 20022 model; everything customer-facing and analytic lives as an active-active, cloud-native system of engagement that consumes those events. Drawing that boundary explicitly, rather than letting it blur, is what makes it possible to release the engagement tier continuously while the record of money movement stays unimpeachably correct.
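The health-based failover between origins described above can be illustrated with a minimal sketch. It assumes the edge spreads traffic evenly over whichever origins currently pass their health checks; the origin names and the even split are illustrative assumptions, not the behaviour of any particular CDN or traffic-management product.

```python
def traffic_split(origin_health):
    """Share traffic evenly across healthy origins.

    origin_health maps a (hypothetical) origin name to True/False.
    Returns a dict of origin name -> fraction of traffic, or an empty
    dict when no origin is healthy.
    """
    healthy = [name for name, ok in origin_health.items() if ok]
    if not healthy:
        return {}
    share = 1 / len(healthy)
    return {name: share for name in healthy}


# Both clouds healthy: traffic is shared between the two origins.
normal = traffic_split({"azure-origin": True, "aws-origin": True})

# Azure origin fails its health check: all traffic shifts to AWS.
failover = traffic_split({"azure-origin": False, "aws-origin": True})
```

Because the decision is taken at the edge, neither application has to know that the other origin exists, which is the property the paragraph above relies on.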

Architecture principles and operating model

Seven principles govern the build, and the seventh is the one that distinguishes this estate from a merely good landing zone. Platform first — identity, networking, management, and security are established before any workload migrates. Segregation by design — platform and application boundaries never share a subscription or account, so a compromise or a mistake has a small blast radius. Zero trust — no user, device, workload, or path is trusted implicitly; access is evaluated continuously on identity, device, risk, and context, because identity, not the network, is where access is decided. Hybrid by default — every critical service accounts for the four data centres, SAP, and on-premises Active Directory rather than treating them as legacy to be tolerated. Multi-cloud consistency — one set of guardrails, naming, tagging, logging, and CI/CD renders identically into Azure and AWS, which is what stops a dual-cloud estate from drifting into two divergent ones. Resilience over convenience — regional redundancy, redundant circuits, and tested recovery beat the easy path every time they conflict. And the seventh, which sits above the others as the mandate the bank will accept or reject the platform against: zero-downtime by default — change is a normal, continuous operation, not an event that requires an outage, so every critical service is engineered for cell-based isolation, active-active data, and reversible, progressively delivered releases. The first six principles describe a well-governed cloud; the seventh insists that the well-governed cloud never has to close to be maintained.

These principles only hold if someone owns them, and on an estate this large ownership has to be structured rather than assumed. The operating model centres on a Cloud Center of Excellence (CCoE) that owns landing-zone standards, policy-as-code, reference architectures, and the onboarding patterns every team inherits; it is the single authority that keeps the two clouds converging rather than diverging. Around the CCoE sit five delivery teams with clear and non-overlapping remits: Cloud Platform Engineering (landing zones, subscription and account vending, shared tooling, the zero-downtime release tooling itself), Cloud Security and Identity (Okta and Entra federation, Conditional Access, PIM, key custody, Wiz and CrowdStrike, governance), Network and Connectivity (ExpressRoute, Direct Connect, transit, firewalls, DNS, the global edge, and remote-access VPN), Cloud Operations (monitoring, backup, incident response, and ServiceNow change), and Application Enablement (onboarding blueprints by workload type, so a payments service and a wealth workstation each land on a pattern built for it). A Security Operations Centre and a Service Desk operate alongside them as the always-on functions that the follow-the-sun, never-offline mandate demands. The organising intent throughout is that every pattern in this architecture is expressed as reusable code and documented runbooks rather than tribal knowledge, because a zero-downtime platform that only a handful of people can actually run to its targets has not, in any meaningful sense, met them.

Banking lines of business and domain segmentation

The institution is not a single application but a federation of businesses, and a landing zone that treats it as one homogeneous estate will either over-engineer the cheap workloads or under-protect the critical ones. The bank runs eight lines of business, each with a materially different non-functional centre of gravity: retail banking; corporate and commercial banking; cards and payments; lending and mortgage; insurance (bancassurance); trading and capital markets; wealth and private banking; and enterprise shared services (finance, risk, compliance, HR, analytics). The governing principle of this design is that the platform is segmented by domain, and each domain inherits its own workload tiering, resilience class, data classification and release controls rather than a one-size-fits-all baseline.

The differences are not cosmetic. Retail and payments are dominated by volume and availability — millions of small interactions that must never see a maintenance window, so they live in the highest customer-facing tier with cell-based partitioning and zero-downtime release patterns. Trading and capital markets are dominated by latency, where matching is measured in microseconds and tick-to-trade in nanoseconds; this domain trades geographic spread for co-location and deterministic, inline pre-trade risk. Corporate, commercial, lending and mortgage are dominated by throughput, integrity and audit — fewer but higher-value flows over host-to-host and ISO 20022 channels, where a lost or duplicated instruction is unacceptable and every change must be evidenced for ITGC and SOX purposes. Wealth, private banking and insurance are dominated by confidentiality — suitability data, portfolios and claims that are highly sensitive but tolerant of slightly lower availability. Enterprise shared services are the regulatory backbone — identity, finance, risk, compliance and analytics that every other domain depends on, and therefore the platform control plane engineered to the strictest tier of all. Because the non-functional profile drives the data classification (Public / Internal / Confidential / Restricted), the resilience tier and the controls that gate releases, segmentation by line of business is the organising spine of the whole landing zone.

Banking Lines of Business & Domain Segmentation

| Domain | Primary non-functional driver | Resilience tier | Data classification |
| --- | --- | --- | --- |
| Retail banking | Volume and availability (no maintenance window) | Tier-1 (99.99%) | Confidential (PII) |
| Cards and payments | Availability and sub-100 ms authorisation | Tier-1 (99.99%) | Restricted (cardholder data, CDE) |
| Corporate and commercial banking | Throughput, integrity and audit | Tier-1 (99.99%) | Confidential / Restricted (account, ledger) |
| Lending and mortgage | Throughput and data integrity | Tier-2 (99.9%) | Confidential (PII, credit) |
| Trading and capital markets | Latency (microseconds, co-located) | Tier-1 (99.99%) | Restricted (market-sensitive) |
| Wealth and private banking | Confidentiality and suitability | Tier-2 (99.9%) | Restricted (portfolio, PII) |
| Insurance (bancassurance) | Confidentiality of claims data | Tier-2 (99.9%) | Confidential (claims, PII) |
| Enterprise shared services | Regulatory backbone (identity, control plane) | Tier-0 (99.995%) | Restricted (audit, golden source) |

Requirements and non-functional targets

Non-functional targets are pinned to a small number of engineered tiers so that every workload can be placed unambiguously, and so that the cost of resilience is spent where the business value justifies it. Availability is budgeted, not aspirational. The identity and shared-platform control plane — the Tier-0 spine on which all domains depend — is engineered to 99.995% monthly (≈26 minutes per year); customer-facing digital banking and all Tier-1 services (payments, core APIs, channels) to 99.99% (≈52 minutes per year); internal and analytics Tier-2 services to 99.9%; and Tier-3 services to 99.5%. Two rules sit above the numbers and constrain the whole design: there must be no single point of failure in identity, connectivity, ingress or shared services for any Tier-1 service, and planned changes must carry no customer-visible downtime — releases and platform changes land through zero-downtime patterns, never through maintenance windows.
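The downtime budgets quoted above follow directly from the availability percentages. The short sketch below reproduces the arithmetic; the function name and the 30-day month are illustrative assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 (assumed measurement month)

# Availability targets per tier, as stated in the text.
TIER_AVAILABILITY_PCT = {
    "Tier-0": 99.995,
    "Tier-1": 99.99,
    "Tier-2": 99.9,
    "Tier-3": 99.5,
}


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Minutes of permitted unavailability over the given period."""
    return (1 - availability_pct / 100) * period_minutes


for tier, pct in TIER_AVAILABILITY_PCT.items():
    per_year = downtime_budget_minutes(pct)
    per_month = downtime_budget_minutes(pct, MINUTES_PER_30_DAYS)
    print(f"{tier}: {pct}% -> {per_year:.1f} min/year, {per_month:.2f} min per 30 days")
```

Because the targets are measured monthly, the budget that matters operationally is the monthly one: at 99.995% a 30-day month allows only a little over two minutes of unavailability.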

Latency is tiered the same way, because the domains do not share a clock. Trading matching is microsecond-class on co-located infrastructure; card authorisation, with inline fraud scoring and sanctions screening, completes end-to-end in under 100 ms (the fraud slice consuming only 10–50 ms of that envelope); interactive channels meet a p95 under 300 ms in-region; and instant-payment posting completes within the 5-second scheme SLA. Security posture is equally explicit: MFA on 100% of identities with phishing-resistant factors for privileged access, zero standing privilege enforced through just-in-time elevation, a Wiz cloud-security-posture score of at least 85, and a patch SLA of 7 days for critical and 30 days for high findings. Agility is held to the same discipline — a subscription or account is vended in under one business day, a new application reaches production readiness in under two weeks via blueprint, and every production release is reversible. Recovery targets map onto the DR tiers: Tier-0 (identity, connectivity, security control plane) at RTO 1h / RPO 15m; Tier-1 (customer platforms, payments, SAP production, partner and core APIs) at RTO 2h / RPO 15m, with the ledger held at RPO ≈ 0 through synchronous quorum; Tier-2 at RTO 8h / RPO 4h; and Tier-3 at RTO 24h / RPO 24h.
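One way to make the recovery targets testable is to encode them as data and compare each DR drill against them. The sketch below is an illustration only: the type and function names are hypothetical, and the ledger's RPO ≈ 0 requirement is noted in a comment rather than modelled.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta  # maximum time to restore service
    rpo: timedelta  # maximum tolerated data loss window


# Targets as stated in the text. The core ledger is additionally held at
# RPO of approximately zero through synchronous quorum (not modelled here).
RECOVERY_TARGETS = {
    "Tier-0": RecoveryTarget(timedelta(hours=1), timedelta(minutes=15)),
    "Tier-1": RecoveryTarget(timedelta(hours=2), timedelta(minutes=15)),
    "Tier-2": RecoveryTarget(timedelta(hours=8), timedelta(hours=4)),
    "Tier-3": RecoveryTarget(timedelta(hours=24), timedelta(hours=24)),
}


def drill_meets_target(tier, measured_rto, measured_rpo):
    """Return True when a DR drill result is within the tier's RTO and RPO."""
    target = RECOVERY_TARGETS[tier]
    return measured_rto <= target.rto and measured_rpo <= target.rpo


# Hypothetical drill results.
tier1_ok = drill_meets_target("Tier-1", timedelta(minutes=95), timedelta(minutes=10))
tier0_late = drill_meets_target("Tier-0", timedelta(minutes=75), timedelta(minutes=5))
```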

| Dimension | Target | How it’s met / measured |
| --- | --- | --- |
| Control-plane availability | 99.995% monthly (Tier-0, ≈26 min/yr) | No SPOF in identity/connectivity/ingress; GSLB health-based failover |
| Customer-facing / Tier-1 availability | 99.99% monthly (≈52 min/yr) | Cell-based partitioning, zero-downtime releases, active-active |
| Tier-2 availability | 99.9% monthly | Paired-DR region, automated failover |
| Tier-3 availability | 99.5% monthly | Standard managed redundancy |
| Planned-change downtime | Zero customer-visible downtime | Blue-green / canary / traffic-shifting, no maintenance windows |
| Trading latency | Microsecond-class matching | Co-location, FPGA tick-to-trade, inline deterministic risk |
| Card authorisation latency | < 100 ms end-to-end (fraud 10–50 ms) | Co-located models/feature stores, inline fraud + sanctions |
| Interactive-channel latency | p95 < 300 ms in-region | Regional ingress, in-region service placement |
| Instant-payment posting | < 5 s (scheme SLA) | Payments hub, idempotent ledger posting, stand-in |
| Security posture | MFA 100%, zero standing privilege, Wiz ≥ 85, patch 7/30 d | Entra/Okta CA + PIM JIT; Wiz posture; patch SLA tracking |
| Recovery (DR tiers, RTO/RPO) | T0 1h/15m · T1 2h/15m (ledger RPO≈0) · T2 8h/4h · T3 24h/24h | Paired-DR per geography in both clouds; sync quorum for ledger |
| Agility | Vend < 1 business day; app prod-ready < 2 weeks | Subscription/account vending; blueprint onboarding |

Requirements traceability

To make the design auditable rather than merely plausible, every headline requirement is traced to a concrete design response and to the evidence — a downstream section or diagram — that demonstrates it. The traceability matrix is the contract between the mandate and the architecture: a regulator, an internal auditor or an architecture review board can follow any single requirement from statement to realised control without taking the design on faith. The twelve requirements below are the load-bearing set; each is satisfied by a specific, named mechanism elsewhere in this document, and no requirement is left to be implied by general good practice.

| Req | Requirement | Design response | Evidence (section / diagram) |
| --- | --- | --- | --- |
| REQ-01 | Zero-downtime releases (no maintenance windows) | Blue-green, canary/progressive, cell-based deploy with expand-contract schema change | Zero-downtime release patterns → bank-zero-downtime-patterns |
| REQ-02 | Dual-cloud guardrails and tested exit | Subscription/account factories + policy-as-code across Azure and AWS | Landing zones / governance → bank-azure-landing-zone, bank-aws-landing-zone, bank-governance-hierarchy |
| REQ-03 | Zero-trust identity | Conditional Access + PIM JIT, one OIDC/FAPI broker, phishing-resistant privileged auth | Identity and zero-trust control plane → bank-identity-zero-trust, bank-conditional-access-pim |
| REQ-04 | Resilient hybrid connectivity | Dual ExpressRoute + dual Direct Connect, active-active BGP, dual carrier | Global hybrid connectivity → bank-global-connectivity |
| REQ-05 | Core ledger integrity | CP consensus store, synchronous quorum (RPO≈0), SoR single-homed vs SoE | Core banking platform → bank-core-banking-sor-soe |
| REQ-06 | Payments availability and exactly-once effect | Payments hub normalising all rails + idempotent posting + issuer stand-in (STIP) | Payments hub and ISO 20022 → bank-payments-hub |
| REQ-07 | Card authorisation latency | Inline fraud scoring within the < 100 ms authorisation envelope | Cards / fraud → bank-cards-authorisation, bank-fraud-financial-crime |
| REQ-08 | Multi-layer security | Seven-layer security model plus global edge (CDN, WAF, origin cloaking) | Security model → bank-multilayer-security, bank-global-edge-ingress |
| REQ-09 | Regulatory compliance | Control matrix mapping PCI-DSS, PSD2, DORA, Basel/BCBS 239, SOX and AML controls | Compliance and control mapping → bank-compliance-control-map |
| REQ-10 | Data residency and sovereignty | Governance data-classification band + sovereign/region-pinned placement, BYOK/HYOK | Data and integration platform → bank-data-platform |
| REQ-11 | Onboarding 100+ applications | Blueprints plus subscription/account vending and standardised landing zones | Application onboarding → bank-azure-app-onboarding, bank-aws-app-onboarding |
| REQ-12 | Operable after handover | RACI ownership model plus DR and operational runbooks | Operating model and RACI; DR runbooks |

Azure landing zone

Azure carries the larger share of the bank’s regulated digital estate, so its foundation is not a generic enterprise-scale landing zone but the Microsoft Cloud for Financial Services FSI landing zone: the enterprise-scale pattern with the financial-services control overlay layered on top, deployed through Azure Verified Modules so the hierarchy itself is reproducible code. A Tenant Root Group anchors a management-group tree that pushes Azure Policy and RBAC down by inheritance, the single mechanism that lets a control authored once apply everywhere beneath it; in a bank operating in 70-plus countries, that inheritance is what keeps a residency or encryption rule from silently lapsing in a corner of the estate nobody is watching. Directly under the root sit a Platform management group, a Landing Zones management group, and two lifecycle groups (Sandbox for guarded experimentation and Decommissioned for subscriptions being offboarded), each with its own deliberately different policy posture.

Azure Landing Zone — FSI Management Groups & Platform Subscriptions

The Platform management group holds four dedicated platform subscriptions — Identity, Connectivity, Management, and Security — and the separation earns its keep by keeping shared services out of any single workload’s blast radius and letting the platform teams run each on its own change cadence. Identity carries the domain-controller footprint, Entra Connect, and private DNS support fronting the authoritative on-premises AD DS. Connectivity holds the regional hub VNets, Azure Firewall Premium, Application Gateway WAF v2, the dual ExpressRoute gateways, DNS Private Resolver, Bastion, and DDoS Network Protection. Management centralises Log Analytics, Azure Monitor, Automation, Update Management, and the Backup and Recovery Services vaults. Security runs Microsoft Sentinel, Key Vault Managed HSM, the two break-glass identities, and the integration anchors for Wiz and CrowdStrike Falcon. The Landing Zones group then splits into Corp (internal) and Online (internet-facing), each separated again into production and non-production subscriptions, so a customer-facing payments workload and an internal analytics workload never share a guardrail set sized for the wrong risk.

What makes this an FSI landing zone rather than a generic one is the control plane wrapped around that hierarchy. Azure Policy regulatory-compliance initiatives are assigned at the management-group level (PCI-DSS, SOC, ISO 27001, and the bank’s own custom initiative for residency and customer-managed-key enforcement), so every subscription beneath inherits the same audited posture and reports compliance against named frameworks rather than ad-hoc rules. Microsoft Purview governs data classification and lineage across the estate, which is the practical answer to BCBS 239 risk-data-aggregation expectations and to proving where Restricted data lives. Azure Arc projects that same policy estate onto the four on-premises data centres and onto AWS, so a single Azure Policy author can reach non-Azure resources and the bank’s control surface does not stop at the cloud boundary. Application subscriptions are never hand-built: a subscription vending factory places each new subscription under the correct management group, inherits policy, assigns RBAC, applies cost and data-classification tags, wires it to the regional hub, and enrols it in logging and posture management, so a workload arrives already governed, comfortably inside the platform’s target of vending a subscription in under one business day.
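The vending sequence in the paragraph above can be sketched as a function that validates a request and returns the ordered steps a pipeline would carry out. This is an illustration under stated assumptions: the request fields, the step wording and the tag names are hypothetical, and a real factory would call Terraform modules rather than return strings.

```python
from dataclasses import dataclass

# Tag keys assumed from the tagging guardrail described elsewhere in this document.
REQUIRED_TAGS = ("owner", "cost_centre", "data_classification", "criticality")
ARCHETYPES = {"corp": "Corp", "online": "Online"}
ENVIRONMENTS = ("prod", "nonprod")


@dataclass(frozen=True)
class SubscriptionRequest:
    workload: str
    exposure: str     # "corp" (internal) or "online" (internet-facing)
    environment: str  # "prod" or "nonprod"
    region: str
    tags: dict


def vending_plan(req):
    """Validate a request and return the ordered vending steps."""
    if req.exposure not in ARCHETYPES:
        raise ValueError(f"unknown exposure: {req.exposure}")
    if req.environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {req.environment}")
    missing = [t for t in REQUIRED_TAGS if t not in req.tags]
    if missing:
        raise ValueError(f"missing mandatory tags: {missing}")

    group = f"Landing Zones/{ARCHETYPES[req.exposure]}"
    return [
        f"create {req.environment} subscription for {req.workload} under {group}",
        "inherit Azure Policy from the management group",
        "assign RBAC",
        "apply cost and data-classification tags",
        f"peer spoke to the regional hub in {req.region}",
        "enrol in logging and posture management",
    ]


request = SubscriptionRequest(
    workload="payments-api",  # hypothetical workload name
    exposure="online",
    environment="prod",
    region="region-1",        # placeholder region label
    tags={"owner": "payments", "cost_centre": "cc-001",
          "data_classification": "Restricted", "criticality": "Tier-1"},
)
plan = vending_plan(request)
```

Rejecting an incomplete request before anything is created is the design choice worth noting: an ungoverned subscription never exists, even briefly.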

AWS landing zone

AWS mirrors the same intent through AWS Organizations with a multi-account model delivered by Control Tower and Account Factory for Terraform (AFT), so the second cloud is governed by the same philosophy — accounts as the unit of isolation, guardrails inherited from above — expressed in that cloud’s native primitives. A Management account sits at the apex purely as the organisation root and billing anchor, deliberately empty of workloads. Beneath it the organisation is partitioned into purpose-built organisational units whose boundaries follow blast-radius and audit lines, not org-chart lines. The Security OU isolates the accounts that must survive a compromise of everything else: a Log Archive account whose immutable S3 with Object Lock holds the tamper-evident record of the estate, and a Security Tooling account running detection and the AWS-side anchors for Wiz and CrowdStrike. The Infrastructure OU carries a Shared Network account per region — home to the regional Transit Gateway and the inspection and ingress/egress VPCs — alongside a Shared Services account. Workloads then split into Prod and Non-Prod OUs, with application accounts separated by criticality so a Tier-1 payments account and a Tier-3 internal tool never share an SCP boundary.

AWS Landing Zone — Organizations, OUs & Account Factory (FSI Lens)

Guardrails on the AWS side are enforced as Service Control Policies that inherit down the OU tree, expressing exactly the controls the Azure policy hierarchy enforces on the other cloud: approved regions and jurisdictions only, no public exposure of sensitive services, mandatory tagging, customer-managed-key encryption, and an unbreakable path from every account to the immutable Log Archive. Control Tower’s preventive and detective controls sit on top of these, and the whole posture is evaluated against the AWS Well-Architected FSI Lens — the financial-services lens that frames the design review around the same regulatory, resilience, and data-protection concerns the Azure FSI landing zone is built for, so both clouds are assessed against a recognised banking bar rather than a generic one.

The decisive pattern here is account vending as code. AFT provisions and customises every account through a GitOps workflow on Bitbucket, so an application account is never a manual ticket — it is a reviewed pull request that yields an account already wired with centralised logging to the Log Archive, Security Tooling enrolment, a Transit Gateway attachment, IAM Identity Center federation brokered through Okta, approved-region settings, baseline KMS keys, and the criticality-appropriate SCPs. That symmetry between Azure subscription vending and AWS account vending is what makes multi-cloud consistency real rather than aspirational: a workload team requests an environment the same way and receives the same governed baseline regardless of which cloud answers, which also satisfies the regulator-facing requirement that the bank can stand up — or rebuild — a clean, compliant environment on demand.
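A simple post-vend check makes the "already wired" claim verifiable. The sketch below compares what a pipeline reports as applied against the baseline items listed above; the item identifiers and function names are hypothetical labels, not AFT outputs.

```python
# Baseline items taken from the list in the text; the identifiers are
# illustrative labels chosen for this sketch.
REQUIRED_BASELINE = frozenset({
    "log-archive-logging",
    "security-tooling-enrolment",
    "transit-gateway-attachment",
    "identity-center-federation",
    "approved-region-settings",
    "baseline-kms-keys",
    "criticality-scps",
})


def missing_baseline(applied):
    """Return the baseline items not yet applied, sorted for stable output."""
    return sorted(REQUIRED_BASELINE - set(applied))


def ready_for_handover(applied):
    """An account is handed to the workload team only with a full baseline."""
    return not missing_baseline(applied)


# Hypothetical pipeline report with one item still outstanding.
reported = REQUIRED_BASELINE - {"transit-gateway-attachment"}
gaps = missing_baseline(reported)
```

The same check could be run against an Azure subscription with an equivalent item list, which is one way to demonstrate the symmetry between the two vending paths.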

Governance hierarchy and policy inheritance

The two landing zones look different on the surface — Azure management groups on one side, AWS organisational units on the other — but they are deliberately governed by one guardrail set compiled into both, and that single-source-of-truth approach is the load-bearing decision of the entire foundation. The bank cannot afford two divergent control estates audited, reasoned about, and exit-tested separately; the EBA/ECB outsourcing expectations and DORA’s concentration-risk and tested-exit obligations effectively require that it can demonstrate equivalent control on either provider. So the controls are authored once, as policy-as-code in Terraform, and rendered into each cloud’s native enforcement: Azure Policy definitions and RBAC assignments that inherit down the management-group tree, and Service Control Policies that inherit down the AWS OU tree. The Terraform module library is the canonical artefact; the cloud-specific assignments are compiled outputs of it.
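The author-once, render-twice idea can be shown with a single guardrail. The source describes Terraform as the authoring tool; the sketch below uses Python purely to illustrate the compilation step, and the region names, function names and statement identifier are assumptions. The output shapes follow the general structure of an Azure Policy rule and an AWS service control policy, simplified: a production SCP would normally exempt global services.

```python
def render_azure_policy_rule(allowed_locations):
    """Render the region guardrail as a simplified Azure Policy rule."""
    return {
        "if": {"field": "location", "notIn": list(allowed_locations)},
        "then": {"effect": "deny"},
    }


def render_aws_scp(allowed_regions):
    """Render the same guardrail as a simplified service control policy."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnapprovedRegions",  # hypothetical statement id
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": list(allowed_regions)}
            },
        }],
    }


# One canonical definition; the region names are placeholders only.
GUARDRAIL = {
    "name": "approved-regions-only",
    "azure_locations": ["westeurope", "northeurope"],
    "aws_regions": ["eu-west-1", "eu-central-1"],
}

azure_rule = render_azure_policy_rule(GUARDRAIL["azure_locations"])
aws_scp = render_aws_scp(GUARDRAIL["aws_regions"])
```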

Governance Hierarchy & Policy Inheritance — Azure MGs + AWS OUs

Five shared guardrails travel down both hierarchies as the non-negotiable floor every account and subscription inherits. The first, approved regions and residency only, pins workloads to the four strategic geographies and their paired-DR regions, which is how GDPR, Schrems II, and country localisation rules are enforced structurally rather than by policy memo. The second, no public exposure of sensitive services, denies internet-facing endpoints on data, key, and ledger-adjacent services, forcing all access through Azure Private Link and AWS PrivateLink. The third, mandatory tagging, stamps owner, cost centre, data classification, and criticality on every resource, without which neither chargeback nor the BCBS 239 data lineage holds together. The fourth, encryption with customer-managed keys, mandates BYOK against Key Vault Managed HSM and AWS KMS or CloudHSM, with HYOK and External Key Store reserved for the most sensitive paths so plaintext keys never leave bank-controlled hardware. And the fifth, logging to an immutable archive, guarantees that every account streams its audit trail to a write-once, Object-Lock store the workload team itself cannot alter: the evidentiary spine for SOX ITGCs and every financial-crime investigation.
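The five guardrails can also be read as a checklist that any resource either passes or fails. The sketch below is a simplified, cloud-neutral evaluation; the resource fields, region labels and violation names are assumptions made for illustration and do not correspond to a specific Azure Policy or SCP.

```python
APPROVED_REGIONS = {"region-1", "region-2"}  # placeholder region labels
REQUIRED_TAGS = {"owner", "cost_centre", "data_classification", "criticality"}


def guardrail_violations(resource):
    """Return the names of the shared guardrails a resource breaks."""
    violations = []
    if resource.get("region") not in APPROVED_REGIONS:
        violations.append("approved-regions")
    if resource.get("sensitive") and resource.get("public_endpoint"):
        violations.append("no-public-exposure")
    if not REQUIRED_TAGS <= set(resource.get("tags", {})):
        violations.append("mandatory-tagging")
    if resource.get("encryption") != "customer-managed-key":
        violations.append("cmk-encryption")
    if not resource.get("logs_to_immutable_archive"):
        violations.append("immutable-logging")
    return violations


compliant = {
    "region": "region-1",
    "sensitive": True,
    "public_endpoint": False,
    "tags": {"owner": "a", "cost_centre": "b",
             "data_classification": "Restricted", "criticality": "Tier-1"},
    "encryption": "customer-managed-key",
    "logs_to_immutable_archive": True,
}
# Same resource, but exposed publicly and deployed in an unapproved region.
exposed = dict(compliant, public_endpoint=True, region="region-9")
```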

Inheritance is the property that lets this operating model scale without growing a control team in proportion to the estate. Platform engineers change a guardrail in one Terraform module, and on the next pipeline run every subscription beneath the Azure tree and every account beneath the AWS tree conforms — no fleet of environments to revisit by hand, and no window in which half the estate runs an old rule. Crucially, the relationship is one-directional: workload teams are free to build inside the guardrails, but inheritance means they cannot weaken a control handed down from above — a non-production team cannot grant itself an unapproved region, and an application owner cannot open a sensitive endpoint to the internet, because the denial is asserted higher in the tree than they have authority to edit. That is exactly the asymmetry a zero-downtime bank needs: maximum delivery autonomy for the teams building services, combined with a control floor that a single misconfigured workload can never sink the whole estate beneath.
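The one-directional nature of inheritance can be modelled as a walk up the tree that only ever accumulates denials. The sketch below is a conceptual model, not the evaluation engine of either cloud; the node names and control identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    denies: set = field(default_factory=set)         # controls asserted at this level
    exempt_requests: set = field(default_factory=set)  # local opt-out attempts


def effective_denies(node):
    """Union of every denial asserted at this node or any ancestor.

    Exemption requests are deliberately never consulted: a lower level
    can add denials but cannot remove one handed down from above.
    """
    result = set()
    current = node
    while current is not None:
        result |= current.denies
        current = current.parent
    return result


root = Node("root", denies={"unapproved-region", "public-sensitive-endpoint"})
non_prod = Node("non-prod", parent=root, denies={"production-data-copy"})
app_team = Node("app-team", parent=non_prod,
                exempt_requests={"unapproved-region"})

inherited = effective_denies(app_team)
```

Here the application team's request to opt out of the region control has no effect, which mirrors the asymmetry described above.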

Global hybrid connectivity

The connectivity fabric is the foundation on which every zero-downtime claim ultimately rests: if the path between the bank’s on-premises cores, its two public clouds and its customers is not survivable, no amount of application-tier resilience will save a Tier-1 service. The design therefore treats wide-area connectivity as a Tier-0 concern engineered to the same 99.995% control-plane availability as identity, with an explicit mandate that there is no single point of failure in connectivity for any Tier-1 service.

Connectivity is organised around four strategic regions, each aligned one-to-one with one of the four on-premises data centres in four countries. This alignment is deliberate: it keeps the system of record physically close to the cloud landing zones that front it, bounds cross-border data flows to satisfy data-residency obligations under GDPR and the EU Data Act, and gives each geography a self-contained primary plus a paired-DR region in both clouds. Each region lands a symmetrical private circuit bundle — two ExpressRoute circuits into Azure and two Direct Connect connections into AWS — running active-active with BGP, terminated on two physically diverse carriers so that a single carrier outage, fibre cut or provider maintenance event degrades but never severs a region. There are no standby circuits waiting to be promoted; all four private paths per region carry production traffic continuously, and BGP local-preference and AS-path policies steer flows while preserving headroom to absorb the full load of any one failed circuit.
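The headroom requirement has a simple arithmetic consequence. Assuming traffic is spread evenly and a failed circuit's load is shared equally by the survivors (both simplifying assumptions), the highest steady-state utilisation that still absorbs a failure is (n - f) / n. Note that an ExpressRoute circuit cannot carry traffic bound for AWS or vice versa, so the relevant n is the pair of circuits per cloud, not all four.

```python
def max_safe_utilisation(circuits, failures_tolerated=1):
    """Highest per-circuit utilisation that survives the given failures.

    Assumes equal-capacity circuits and an even redistribution of load.
    """
    if circuits < 1 or not 0 <= failures_tolerated < circuits:
        raise ValueError("need at least one surviving circuit")
    return (circuits - failures_tolerated) / circuits


def survives_failure(loads, capacity):
    """Check that losing any one circuit leaves the rest within capacity."""
    n = len(loads)
    if n < 2:
        return False
    total = sum(loads)
    # After a failure the total load is shared by the n - 1 survivors.
    return total / (n - 1) <= capacity


# Two ExpressRoute circuits into Azure: each should run at or below 50%.
per_cloud_limit = max_safe_utilisation(2)

# Hypothetical loads in Gbps on two 10 Gbps circuits.
ok = survives_failure([4.0, 4.5], capacity=10.0)
overloaded = survives_failure([6.0, 5.5], capacity=10.0)
```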

The two clouds are stitched together by topologies native to each provider. Azure follows a hub-and-spoke model: a regional hub VNet terminates the ExpressRoute gateways and centralises shared network services, with production and shared-service spokes peered into it. AWS is transit-centric, with a regional Transit Gateway acting as the hub for inspection, ingress, egress and application VPCs, and a Direct Connect Gateway binding the private circuits to the Transit Gateway. Crucially, the clouds are not casually meshed. Inter-cloud traffic is controlled, minimised and inspected — permitted only for a small set of governed flows (the ISO 20022 event backbone, cross-cloud DR replication for the few active-active critical functions, and shared identity/observability telemetry) and forced through firewalls on both sides rather than flowing over an open cloud-to-cloud peering. This keeps the multi-cloud estate a deliberate concentration-risk mitigation rather than an uncontrolled blast-radius amplifier.

Global Hybrid Connectivity — 4 Regions, Dual ExpressRoute + Dual Direct Connect

Azure regional network

Within each strategic region the Azure footprint is a classic enterprise-scale hub-and-spoke, sized and addressed for a regulated institution rather than a generic enterprise. The regional hub VNet occupies 10.10.0.0/16 in the primary region and consolidates every shared network function so that policy is enforced once, centrally, and inherited by all workloads.

The hub is carved into purpose-built subnets, each sized for its role and its growth curve. The GatewaySubnet (10.10.0.0/27) terminates the dual ExpressRoute gateways; the AzureFirewallSubnet (10.10.1.0/26) hosts Azure Firewall Premium as the central inspection and egress-control plane; the AppGatewaySubnet (10.10.2.0/26) carries the regional Application Gateway with WAF v2 for north-south ingress to web workloads; the AzureBastionSubnet (10.10.3.0/26) provides brokered, just-in-time administrative access with no public IPs on workload hosts; and a dedicated DNS subnet (10.10.4.0/27) runs the private resolver and conditional forwarders that knit Azure Private DNS zones together with on-premises and AWS name resolution.
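An address plan like this is easy to validate mechanically before it reaches a pipeline. The sketch below uses Python's standard ipaddress module to confirm that each subnet sits inside the hub range and that none overlap; the function names are illustrative, and the usable-address figure relies on Azure reserving five addresses in every subnet.

```python
import ipaddress

HUB_CIDR = "10.10.0.0/16"
HUB_SUBNETS = {
    "GatewaySubnet": "10.10.0.0/27",
    "AzureFirewallSubnet": "10.10.1.0/26",
    "AppGatewaySubnet": "10.10.2.0/26",
    "AzureBastionSubnet": "10.10.3.0/26",
    "DnsSubnet": "10.10.4.0/27",  # label assumed; the text calls it the DNS subnet
}
AZURE_RESERVED_PER_SUBNET = 5


def validate_plan(hub_cidr, subnets):
    """Return a list of problems: subnets outside the hub or overlapping."""
    hub = ipaddress.ip_network(hub_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(hub):
            problems.append(f"{name} is outside {hub}")
    names = list(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


def usable_addresses(cidr):
    """Addresses left for resources after Azure's per-subnet reservation."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET


problems = validate_plan(HUB_CIDR, HUB_SUBNETS)
firewall_usable = usable_addresses(HUB_SUBNETS["AzureFirewallSubnet"])
```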

Workloads never live in the hub. The production spoke (10.20.0.0/16) is peered to the hub and internally tiered (web at 10.20.1.0/24, application at 10.20.2.0/24, and data plus Private Endpoints at 10.20.3.0/24), while shared services occupy a separate spoke at 10.30.0.0/16. Two controls make this topology defensible to an auditor. First, user-defined routes force all spoke egress through the Azure Firewall Premium in the hub: a 0.0.0.0/0 UDR on every workload subnet points at the firewall’s private IP, so there is no path to the internet or to another spoke that bypasses TLS inspection, IDPS and URL filtering; spoke-to-spoke traffic is hair-pinned through the hub rather than peered directly. Second, all PaaS consumption is via Private Endpoints: Key Vault, Storage, SQL, Cosmos DB and the rest are reached over private IPs inside the data subnet, public service endpoints are disabled, and the corresponding private DNS zones resolve those names to in-VNet addresses. The result is a network where confidential and restricted data never traverses a public path, and where the firewall is a mandatory waypoint rather than an optional one.
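The forced-egress control lends itself to an automated audit. The sketch below assumes a hypothetical export of route tables keyed by workload subnet and an assumed firewall private IP of 10.10.1.4; neither comes from the design itself. `VirtualAppliance` is the Azure next-hop type used when a route points at a firewall or other appliance.

```python
FIREWALL_IP = "10.10.1.4"  # assumed address inside AzureFirewallSubnet, for illustration only

# Hypothetical export of route tables, keyed by workload subnet.
route_tables = {
    "prod-web": [
        {"prefix": "0.0.0.0/0", "next_hop_type": "VirtualAppliance", "next_hop_ip": "10.10.1.4"},
    ],
    "prod-app": [
        {"prefix": "0.0.0.0/0", "next_hop_type": "VirtualAppliance", "next_hop_ip": "10.10.1.4"},
    ],
    "prod-data": [
        {"prefix": "0.0.0.0/0", "next_hop_type": "Internet", "next_hop_ip": None},
    ],
}

def subnets_bypassing_firewall(tables, firewall_ip):
    """Return workload subnets whose default route does not point at the firewall."""
    offenders = []
    for subnet, routes in tables.items():
        compliant = any(
            r["prefix"] == "0.0.0.0/0"
            and r["next_hop_type"] == "VirtualAppliance"
            and r["next_hop_ip"] == firewall_ip
            for r in routes
        )
        if not compliant:
            offenders.append(subnet)
    return sorted(offenders)

offenders = subnets_bypassing_firewall(route_tables, FIREWALL_IP)
print(offenders)
```

In this sample data the `prod-data` subnet is deliberately misconfigured, so it is the one the audit reports.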

Azure Regional Hub-and-Spoke Network (Deep Dive)

AWS regional network

The AWS regional network mirrors the intent of the Azure hub-and-spoke but expresses it in the provider-native, transit-centric idiom built around a regional Transit Gateway. The Transit Gateway is the single routing core for the region; every VPC attaches to it, and route-table associations — not VPC peering — decide what may talk to what. A Direct Connect Gateway binds the two active-active Direct Connect connections to the Transit Gateway, so the private circuits reach all attached VPCs without a mesh of individual virtual interfaces.

Traffic is segmented across dedicated, single-purpose VPCs. The inspection VPC (10.100.0.0/16) centralises AWS Network Firewall; Transit Gateway route tables are configured so that east-west and north-south flows are steered through this VPC for stateful inspection, intrusion prevention and domain filtering before reaching their destination — the AWS analogue of the Azure Firewall hair-pin. A separate ingress VPC (10.101.0.0/16) terminates north-south entry on an internet-facing Application Load Balancer fronted by AWS WAF, deployed across two Availability Zones so that the loss of an AZ removes no ingress capacity. A dedicated egress VPC (10.102.0.0/16) provides controlled outbound internet through NAT gateways deployed per Availability Zone, giving deterministic, allow-listed egress for workloads that must reach external endpoints. Production application workloads run in their own app VPC (10.200.0.0/16), with shared services in 10.210.0.0/16, and both use /20 subnets spread across two AZs.
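As a worked illustration of the /20 layout, the sketch below carves the production app VPC into /20 subnets across two Availability Zones with the standard `ipaddress` module. The tier names and AZ labels are assumptions made for the example; the text specifies only the VPC range, the /20 size and the two-AZ spread.

```python
import ipaddress

APP_VPC = ipaddress.ip_network("10.200.0.0/16")
AZS = ["az-a", "az-b"]           # placeholder AZ labels
TIERS = ["web", "app", "data"]   # assumed tiers; the text does not list them for AWS

def plan_subnets(vpc, azs, tiers, prefix=20):
    """Assign consecutive blocks of the given prefix length to each (tier, AZ) pair."""
    blocks = vpc.subnets(new_prefix=prefix)  # generator of blocks in address order
    return {(tier, az): next(blocks) for tier in tiers for az in azs}

plan = plan_subnets(APP_VPC, AZS, TIERS)
for (tier, az), net in plan.items():
    print(f"{tier:<5}{az:<6}{net}")

total_blocks = 2 ** (20 - APP_VPC.prefixlen)
print("unallocated /20 blocks:", total_blocks - len(plan))
```

A /16 holds sixteen /20 blocks, so under these assumed tiers six are consumed and ten remain for growth.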

As on the Azure side, PaaS access is private by default: gateway and interface VPC endpoints keep S3, DynamoDB, KMS and similar service traffic on the AWS backbone, and PrivateLink exposes internal and partner services without crossing the public internet. The separation of inspection, ingress, egress and application concerns into discrete VPCs gives the network team independent blast radii and lets PCI-DSS cardholder-data segmentation be proven VPC-by-VPC rather than argued across a flat address space.

AWS Regional Transit-Centric Network (Deep Dive)

IP address management plan

Address space is treated as a first-class governed asset across the whole estate. The bank operates a single, centrally governed 10.0.0.0/8 supernet for cloud, with allocation enforced through Azure IPAM and AWS VPC IPAM so that no team can self-issue overlapping ranges and every block is attributable to an owner, a cloud, a region, an environment and a role. The unit of allocation is a /16 per (cloud, region, environment, role) — large enough to subnet generously for tiered workloads, small enough to keep the global map legible.

The Azure side follows a deterministic, human-readable 10.[role][region] scheme: the second octet encodes role in its first digit (1 = hub, 2 = production, 3 = shared services) and region in its second digit (0–3 for the primary regions, 4–7 for their paired-DR regions). So 10.10.x.x is the region-1 hub, 10.20.x.x is region-1 production, 10.14.x.x is the region-1 DR hub, and so on. The address itself tells an engineer exactly what it is and where it lives, which is invaluable during an incident. AWS uses a distinct, non-overlapping band (the 10.100.x.x / 10.200.x.x ranges) for its inspection, ingress, egress, application and shared VPCs.
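Because the scheme is deterministic, it can be decoded in a few lines. The sketch below reads the octet that follows the leading 10 and splits it into a role digit and a region digit, matching the examples given above (10.10 is the region-1 hub, 10.14 is the region-1 DR hub). The function name `describe_azure_block` and the returned field names are illustrative assumptions.

```python
import ipaddress

ROLES = {1: "hub", 2: "production", 3: "shared services"}

def describe_azure_block(cidr):
    """Decode role, region and DR status from a 10.[role][region].x.x block."""
    net = ipaddress.ip_network(cidr)
    octets = net.network_address.packed  # the four address bytes
    if octets[0] != 10:
        raise ValueError(f"{cidr} is not in the cloud supernet")
    role_digit, region_digit = divmod(octets[1], 10)
    if role_digit not in ROLES or region_digit > 7:
        raise ValueError(f"{cidr} does not follow the Azure scheme")
    return {
        "role": ROLES[role_digit],
        "region": region_digit % 4 + 1,   # digits 0-3 and 4-7 both map to regions 1-4
        "paired_dr": region_digit >= 4,   # digits 4-7 are the paired-DR regions
    }

for block in ("10.10.0.0/16", "10.20.0.0/16", "10.14.0.0/16"):
    print(block, describe_azure_block(block))
```

Blocks from the AWS band, such as 10.100.0.0/16, fail the role check and raise `ValueError`, which is the intended behaviour for a decoder scoped to the Azure scheme.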

Two further ranges are reserved and kept strictly out of the cloud supernet. The four on-premises data centres consume 172.16.0.0/12, partitioned as a /14 per DC — DC1 172.16/14, DC2 172.20/14, DC3 172.24/14, DC4 172.28/14 — advertised into the clouds over the private circuits. Finally, 192.168.0.0/16 is reserved and deliberately left unused at the enterprise level to avoid collisions with appliance defaults, lab kit and acquired-entity networks during integration. The master allocation below is the authoritative reference for the four-region, dual-cloud estate.

Global IPAM Allocation Map

Cloud | Region / scope | Role | Environment | CIDR
Azure | Region 1 (primary) | Hub | Platform | 10.10.0.0/16
Azure | Region 1 (primary) | Production | Prod | 10.20.0.0/16
Azure | Region 1 (primary) | Shared services | Platform | 10.30.0.0/16
Azure | Region 1 (paired-DR) | Hub / Prod / Shared | DR | 10.14.0.0/16 · 10.24.0.0/16 · 10.34.0.0/16
Azure | Regions 2–4 (primary) | Hub | Platform | 10.11.0.0/16 – 10.13.0.0/16
Azure | Regions 2–4 (primary) | Production | Prod | 10.21.0.0/16 – 10.23.0.0/16
Azure | Regions 2–4 (primary) | Shared services | Platform | 10.31.0.0/16 – 10.33.0.0/16
AWS | Region 1 | Network / inspection | Platform | 10.100.0.0/16
AWS | Region 1 | Ingress (ALB + WAF) | Platform | 10.101.0.0/16
AWS | Region 1 | Egress (NAT) | Platform | 10.102.0.0/16
AWS | Region 1 | Production application | Prod | 10.200.0.0/16
AWS | Region 1 | Shared services | Platform | 10.210.0.0/16
On-premises | DC1 / DC2 / DC3 / DC4 | Data centre | Core | 172.16/14 · 172.20/14 · 172.24/14 · 172.28/14
Reserved | Enterprise-wide | - | - | 192.168.0.0/16 (reserved, unused)
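The allocation map can be verified rather than trusted. The sketch below derives the four on-premises /14 blocks from the /12 and checks that the cloud /16 blocks listed in the map sit inside 10.0.0.0/8 without overlapping each other, the on-premises range, or the reserved range. It is an illustration with the standard `ipaddress` module, not a substitute for the IPAM tooling; `find_conflicts` is a hypothetical helper.

```python
import ipaddress
from itertools import combinations

CLOUD_SUPERNET = ipaddress.ip_network("10.0.0.0/8")
ONPREM = ipaddress.ip_network("172.16.0.0/12")
RESERVED = ipaddress.ip_network("192.168.0.0/16")

# One /14 per data centre, taken in address order from the on-premises /12.
dc_blocks = {
    f"DC{i}": net for i, net in enumerate(ONPREM.subnets(new_prefix=14), start=1)
}

# Cloud /16 blocks from the map: Azure 10.[role][region], then the AWS band.
azure = [f"10.{role}{region}.0.0/16" for role in (1, 2, 3) for region in (0, 1, 2, 3, 4)]
aws = [f"10.{n}.0.0/16" for n in (100, 101, 102, 200, 210)]
cloud_blocks = [ipaddress.ip_network(c) for c in azure + aws]

def find_conflicts(blocks, supernet, excluded):
    """List blocks that leave the supernet, hit an excluded range, or overlap each other."""
    conflicts = []
    for b in blocks:
        if not b.subnet_of(supernet):
            conflicts.append((str(b), "outside supernet"))
        for ex in excluded:
            if b.overlaps(ex):
                conflicts.append((str(b), f"overlaps {ex}"))
    for a, b in combinations(blocks, 2):
        if a.overlaps(b):
            conflicts.append((str(a), f"overlaps {b}"))
    return conflicts

conflicts = find_conflicts(cloud_blocks, CLOUD_SUPERNET, [ONPREM, RESERVED])
print({name: str(net) for name, net in dc_blocks.items()})
print(conflicts or "no conflicts in the cloud allocation")
```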

Capacity and sizing

Capacity is dimensioned to the bank’s strict NFRs — Tier-1 services at 99.99%, the shared platform and identity control plane at 99.995% — and, just as importantly, to the zero-downtime mandate, which means every tier is sized to absorb both a failure domain and an in-place release simultaneously without breaching its impact tolerance. The headline circuit standard is two 10 Gbps ExpressRoute plus two 10 Gbps Direct Connect connections per region, active-active across dual carriers; each path is provisioned so that any single circuit can carry the surviving load of its region, leaving roughly half the aggregate bandwidth as deliberate headroom rather than steady-state utilisation.
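The availability targets and the circuit headroom rule reduce to simple arithmetic. The sketch below converts an availability percentage into an annual downtime budget and computes the steady-state utilisation ceiling when the surviving circuits must carry the full load; the function names are illustrative, and the year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored

def downtime_budget_minutes(availability_pct):
    """Annual downtime allowed by an availability target, in minutes."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

def max_steady_state_utilisation(circuits, tolerated_failures=1):
    """Fraction of aggregate bandwidth usable if survivors must carry all load."""
    return (circuits - tolerated_failures) / circuits

for target in (99.99, 99.995):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/year")

# A pair of circuits into one cloud, tolerating the loss of either one.
print(f"utilisation ceiling: {max_steady_state_utilisation(2):.0%}")
```

On these figures 99.99% allows about 52.6 minutes of downtime a year and 99.995% about 26.3 minutes, and a dual-circuit pair that must survive the loss of one circuit cannot be planned above 50% of its aggregate bandwidth, which is where the stated headroom comes from.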

On the Azure side, Azure Firewall Premium provides the central inspection plane with TLS inspection and IDPS, and Application Gateway WAF v2 scales from 2 to 10 instances, zone-redundant, so ingress capacity tracks demand without a manual change and survives an availability-zone loss. Compute for stateless and bursty workloads runs on Virtual Machine Scale Sets using D4s_v5 and D8s_v5 SKUs with autoscale, while containerised platform and channel workloads run on AKS. On the AWS side, Network Firewall mirrors the central inspection role, application compute runs on m6i Auto Scaling Groups, and containers on EKS — giving symmetrical horizontal-scaling behaviour across both clouds so a region can be drained to its partner without capacity cliffs.

The data and domain tiers are sized to their consistency and latency obligations rather than to averages. The ledger runs on a CP consensus store with synchronous quorum (Raft/Paxos-class) so balances are never subject to last-write-wins, sized for RPO≈0 and 2PC cross-shard posting. The payments hub runs on EKS/AKS, scaled to sustain instant-payment posting within the sub-5-second scheme SLA and card authorisation within the sub-100 ms end-to-end envelope. SAP HANA is placed on M-series / High-Memory instances to hold its in-memory working set, with production sized for the Tier-1 RTO of two hours. The table below records the indicative sizing baseline by domain; each line is engineered to survive a failure domain and a concurrent zero-downtime release.

Domain | Platform / service | Sizing baseline | Scaling / resilience
Private circuits | ExpressRoute + Direct Connect | 2 × 10 Gbps ER + 2 × 10 Gbps DX per region | Active-active BGP, dual carrier, ~50% headroom
Azure ingress | App Gateway WAF v2 | 2–10 instances | Autoscale, zone-redundant
Azure inspection | Azure Firewall Premium | Central egress / TLS / IDPS | Scale-out, zone-redundant
Azure compute (VM) | VMSS D4s_v5 / D8s_v5 | Autoscale group per tier | Horizontal autoscale across AZs
Azure containers | AKS | Node pools per channel/domain | Cluster autoscaler, multi-AZ
AWS ingress | ALB + AWS WAF | 2 Availability Zones | Elastic, AZ-redundant
AWS inspection | Network Firewall | Central inspection VPC | Scale-out, multi-AZ
AWS compute | m6i Auto Scaling Group | ASG per tier | Horizontal autoscale across AZs
AWS containers | EKS | Managed node groups | Cluster autoscaler, multi-AZ
Core ledger | CP consensus store | Synchronous quorum, multi-node | RPO≈0, 2PC cross-shard, quorum survival
Payments hub | EKS / AKS | Autoscaled microservices | Sub-5 s instant post, sub-100 ms card auth
SAP | HANA on M-series / High-Memory | In-memory working set | Tier-1 RTO 2 h, paired-DR region

Identity and zero-trust control plane

Identity is the first and most consequential control plane in this estate. Across 70+ countries, a hybrid workforce, 100+ applications, and a dual-cloud footprint over Azure and AWS, identity — not the network perimeter — is the primary trust boundary, engineered to the Tier-0 availability target of 99.995% with no single point of failure. The institution does not start from a clean slate: four on-premises data centres anchor an established on-premises Active Directory Domain Services (AD DS) forest that remains the authoritative source for human and service identity. Rather than rip and replace, the design projects that authority outward through a layered topology, then governs every authentication decision through a single zero-trust policy engine.

Identity & Zero-Trust Control Plane

AD DS synchronises to Microsoft Entra ID via Entra Connect, and Entra ID becomes the cloud identity backbone for everything in the Microsoft estate — Microsoft 365, Intune device management, Conditional Access, Privileged Identity Management (PIM), and device-compliance signals. Alongside it, Okta is the strategic SaaS single-sign-on layer and the federation broker into AWS, giving the bank one consistent SSO experience across hundreds of SaaS applications and a clean, auditable federation path into AWS accounts without minting long-lived cloud credentials. Joiner-mover-leaver lifecycle is automated end to end: Workday is the system of record for the workforce and provisions identities into Okta over SCIM, so that a hire, a transfer, or a termination in HR propagates to entitlements within the same control plane rather than through manual tickets. This matters in a regulated bank because joiner-mover-leaver discipline is exactly what SOX ITGCs and the access-management expectations of NIST CSF 2.0 and the GLBA Safeguards Rule are auditing for — least privilege that actually tracks the employment relationship.

Every access request — workforce or workload — is evaluated by a Conditional Access engine that fuses four signal classes: user and sign-in risk, device state and compliance, network location and named geography, and the sensitivity of the application or operation being requested. The engine’s verdict is never a binary allow/deny: it grants, it forces step-up multi-factor authentication, it requires a compliant managed device, it constrains the session, or it blocks. MFA is mandatory for one hundred per cent of users, and privileged access is phishing-resistant — FIDO2 security keys and Windows Hello for Business — closing the door on the credential-replay and adversary-in-the-middle techniques that defeat OTP-based factors.

Privileged access is governed by zero standing privilege: no human carries a permanently active administrative role. Instead, PIM holds roles eligible-but-inactive; activation is just-in-time, gated by approval, time-boxed, and fully audited. Administrators operate exclusively from Privileged Access Workstations (PAWs) — hardened, Intune-managed devices isolated from email and general browsing — so that the workstation used to manage the core ledger or the payments hub is never the same one that opens an untrusted attachment. Exactly two break-glass accounts exist per tenant: excluded from Conditional Access, protected by long credentials in split custody, and alerted on at every use, they are the deliberate, monitored last resort when the policy engine itself is unavailable.

Two banking-specific extensions sit on top of this baseline. First, step-up authentication is enforced at the moment of consequence, not only at sign-in — releasing a high-value payment, changing a beneficiary, or entering an administrative plane re-challenges the operator regardless of how recently they authenticated, which aligns directly with SOX segregation-of-duties and the dual-control model used at the teller and operations desks. Second, for customer-facing channels the institution implements Strong Customer Authentication (SCA): multi-factor authentication with dynamic linking of the authentication to the specific payee and amount, satisfying PSD2 today and positioning cleanly for PSD3/PSR. Customer identity is brokered through an OIDC/FAPI identity layer so that retail, corporate, and open-banking third parties all authenticate against one standards-based broker with many step-up policies behind it. Workforce identity (Entra ID and Okta) and customer identity (the OIDC/FAPI broker) are deliberately kept as separate populations on a common zero-trust philosophy — never co-mingled.
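Dynamic linking means the authentication code is bound to the specific payee and amount, so a code issued for one payment cannot authorise another. The sketch below illustrates the idea with an HMAC over the payment details using Python's standard `hmac` module. It is a simplified assumption for illustration only: the message layout, key handling and function names are hypothetical, and a production SCA implementation would rely on device-bound keys and the formats its scheme prescribes.

```python
import hashlib
import hmac

def auth_code(key: bytes, payee: str, amount_minor: int, currency: str) -> str:
    """Derive a code that is only valid for this exact payee, amount and currency."""
    message = f"{payee}|{amount_minor}|{currency}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, code: str, payee: str, amount_minor: int, currency: str) -> bool:
    """Recompute the code from the submitted payment and compare in constant time."""
    expected = auth_code(key, payee, amount_minor, currency)
    return hmac.compare_digest(expected, code)

session_key = b"demo-key-not-for-production"  # placeholder key material
code = auth_code(session_key, "DE00EXAMPLE", 125000, "EUR")

print(verify(session_key, code, "DE00EXAMPLE", 125000, "EUR"))  # same payment
print(verify(session_key, code, "DE00EXAMPLE", 925000, "EUR"))  # amount tampered
print(verify(session_key, code, "GB00OTHER", 125000, "EUR"))    # payee tampered
```

Only the first check succeeds; changing either the amount or the payee after authentication invalidates the code.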

Employee digital workplace

The workplace design follows from the same premise: the user’s identity and posture, not their location, determine what they may reach. A 70-plus-country hybrid workforce connects through three deliberately distinct access channels, all converging on one identity fabric, one consistent DNS view, and one Conditional Access policy set.

Employee Digital Workplace — Office, VPN & SaaS Access

The first channel is the office, where staff on the corporate network take a local internet breakout to Microsoft 365 and approved SaaS rather than hair-pinning that traffic back through a data centre. Breaking out locally keeps latency low for the collaboration tools people use all day, while sensitive application traffic still routes over the private backbone to the cloud landing zones. The second channel is work-from-home over the GlobalProtect VPN, with every tunnel fronted by Okta or Entra MFA before any internal resource is reachable. The bank treats classic full-tunnel VPN as a transitional control and is modernising deliberately toward identity-aware private access: per-application, continuously evaluated connectivity that replaces the single broad network on-ramp and removes the implicit “inside the tunnel equals trusted” assumption. The third channel is direct SaaS access: Microsoft 365, ADP, Workday, ServiceNow, and Bitbucket are reached straight over the internet, but only ever behind Okta SSO and Conditional Access, so a managed laptop at home and an unmanaged device in a branch are evaluated against the same policy before either touches payroll or the code repository.

What unifies the three channels is that device trust is a precondition of access, not an afterthought. Corporate endpoints are enrolled in Microsoft Intune for compliance — disk encryption, patch level, configuration baselines — and protected at runtime by CrowdStrike Falcon endpoint detection and response. Compliance and EDR posture feed back into the Conditional Access engine as live signals: a device that drifts out of compliance or shows an active Falcon detection is automatically stepped up or blocked from sensitive applications, independent of which of the three channels it arrived through. The result is a workplace where a single identity, a consistent DNS resolution path, one Conditional Access policy set, and verified device health deliver the same security outcome whether an employee is at a desk, at home, or on a SaaS app from an airport — and where modernising the WFH path toward identity-aware private access proceeds without re-architecting the controls.

Conditional Access and PIM policy model

The Conditional Access engine is where the zero-trust principle becomes deterministic, testable policy. Every authentication is treated as untrusted until evaluated, and the same decision flow governs workforce, workload, and administrative access across both clouds.

Conditional Access & Privileged Identity Management Model

The decision flow takes a bundle of inputs — user and sign-in risk, device compliance state, network location and named country, the client application, and the sensitivity of the target resource — and passes them to the policy engine, which returns one of five outcomes: grant, grant only with step-up MFA, grant only from a compliant device, block, or grant under session controls. A consistent set of baseline policies expresses the institution’s risk appetite:

Baseline policy | Condition | Enforced control
Block legacy authentication | Legacy/basic-auth protocols | Block outright (no MFA bypass)
MFA for all users | Any interactive sign-in | Multi-factor required, 100% of users
Compliant device for sensitive apps | Admin consoles, finance, HR, code | Intune-compliant managed device required
Phishing-resistant MFA for privileged | Any privileged role activation | FIDO2 / Windows Hello only
Session controls for unmanaged devices | Sign-in from a non-compliant device | App-enforced restrictions, no download
Stronger controls for high-risk operations | SAP administration; payments and finance operations | Step-up + compliant device + PAW

The final row is the banking-specific reinforcement: SAP administrators and anyone performing payments or finance operations face the strictest combination — phishing-resistant step-up, a compliant device, and origination from a Privileged Access Workstation — because those planes touch the general ledger, the enterprise resource backbone, and money movement. This is also where the Conditional Access layer and the payment step-up controls described earlier reinforce one another: the same engine that admits an administrator to the SAP console is the one that re-challenges an operator releasing a high-value transfer.
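The baseline policies can be read as an ordered decision function that returns one of the five outcomes. The sketch below is one plausible encoding, written to make the policy testable; the field names, the three application classes and the evaluation order are assumptions made for the example and do not represent how the Conditional Access product evaluates policy internally.

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    GRANT = "grant"
    STEP_UP_MFA = "step-up MFA"
    REQUIRE_COMPLIANT_DEVICE = "compliant device required"
    SESSION_CONTROLS = "grant under session controls"
    BLOCK = "block"

@dataclass
class SignIn:
    app_class: str = "general"  # "general", "sensitive" or "high_risk" (assumed labels)
    legacy_auth: bool = False
    mfa_done: bool = False
    phishing_resistant_mfa: bool = False
    device_compliant: bool = False
    on_paw: bool = False
    privileged_activation: bool = False

def evaluate(s: SignIn) -> Outcome:
    """Apply the baseline policies in order, strictest conditions first."""
    if s.legacy_auth:
        return Outcome.BLOCK
    if s.app_class == "high_risk":
        # SAP administration, payments and finance operations.
        if not (s.device_compliant and s.on_paw):
            return Outcome.REQUIRE_COMPLIANT_DEVICE
        if not s.phishing_resistant_mfa:
            return Outcome.STEP_UP_MFA
        return Outcome.GRANT
    if s.privileged_activation and not s.phishing_resistant_mfa:
        return Outcome.STEP_UP_MFA
    if s.app_class == "sensitive" and not s.device_compliant:
        return Outcome.REQUIRE_COMPLIANT_DEVICE
    if not (s.mfa_done or s.phishing_resistant_mfa):
        return Outcome.STEP_UP_MFA
    if not s.device_compliant:
        return Outcome.SESSION_CONTROLS
    return Outcome.GRANT

print(evaluate(SignIn(legacy_auth=True)).value)
print(evaluate(SignIn(mfa_done=True)).value)
print(evaluate(SignIn(app_class="high_risk", mfa_done=True,
                      device_compliant=True, on_paw=True)).value)
```

Expressing the baselines this way lets each row of the policy table become a unit test, which is one way to keep policy changes reviewable.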

Privileged Identity Management implements the zero-standing-privilege mandate as a closed loop. Roles are held eligible, not active; an administrator who needs to act follows a request → approve → time-boxed activation → audit cycle. Activation requires justification and approver sign-off, the elevation expires automatically at the end of a bounded window, and every step is written to an immutable audit trail consumed by Microsoft Sentinel and surfaced to the SOC. Together, the Conditional Access baselines and the PIM lifecycle give the bank a control model that is continuously evaluated, least-privilege by default, and demonstrably reversible — satisfying the access-management, segregation-of-duties, and immutable-audit expectations of SOX, PCI-DSS, DORA, and NIST CSF 2.0 while never granting privilege that outlives the task that justified it.
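The request, approve, time-boxed activation and audit cycle is a small state machine. The sketch below models it in plain Python to show the shape of the lifecycle; the class `EligibleRole`, its method names and the no-self-approval rule are assumptions for illustration and are not the PIM API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class EligibleRole:
    user: str
    role: str
    state: str = "eligible"  # eligible -> pending -> active -> eligible
    expires_at: Optional[datetime] = None
    audit: list = field(default_factory=list)

    def _log(self, event, now, detail=""):
        self.audit.append((now.isoformat(), event, detail))

    def request(self, justification, now):
        if self.state != "eligible" or not justification.strip():
            raise ValueError("request needs an eligible role and a justification")
        self.state = "pending"
        self._log("requested", now, justification)

    def approve(self, approver, now, window):
        if self.state != "pending":
            raise ValueError("nothing to approve")
        if approver == self.user:
            raise ValueError("self-approval is not allowed")  # assumed rule
        self.state = "active"
        self.expires_at = now + window
        self._log("approved", now, approver)

    def is_active(self, now):
        # Elevation lapses on its own once the bounded window has passed.
        if self.state == "active" and now >= self.expires_at:
            self.state, self.expires_at = "eligible", None
            self._log("expired", now)
        return self.state == "active"

t0 = datetime(2025, 1, 1, 9, 0)
grant = EligibleRole("alice", "sap-admin")
grant.request("month-end patch", t0)
grant.approve("bob", t0, timedelta(hours=1))
print(grant.is_active(t0 + timedelta(minutes=30)))
print(grant.is_active(t0 + timedelta(hours=2)))
print([event for _, event, _ in grant.audit])
```

The role is active thirty minutes into the window and back to eligible after it lapses, with every transition recorded in the audit list.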

Global edge and per-channel ingress

Every public request to the bank — whether it originates from a retail customer’s phone, a corporate treasurer’s payment file, an ATM in a different time zone, or a third-party provider exercising open-banking consent — meets the same first line of defence before it is allowed anywhere near a cloud origin. That first line is a third-party global edge that sits in front of both Azure and AWS rather than belonging to either, and it is deliberately built as one cross-cloud control point so that protection, traffic management, and failover are decided once, consistently, and outside the trust boundary of any single provider. The edge has three jobs, and it does all three before a request reaches an origin. It is a content-delivery network that terminates TLS at the closest point of presence, absorbs volumetric load, and shields cache-eligible content. It is authoritative DNS with global server load-balancing, continuously running health checks against each cloud’s regional entry points and steering customers — by health, latency, and policy — across regions and, where the workload supports it, across clouds, so that the loss of a region or an entire provider becomes a routing decision rather than an outage. And it is a web application firewall that enforces OWASP-aligned prevention (the injection, traversal, and deserialisation classes), bot and credential-stuffing mitigation, API schema and FAPI validation so that open-banking and corporate API calls must conform to their published contracts, rate limiting per client and per token, and geo/ASN policy that can quietly drop or challenge traffic from networks the bank has no business serving.
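The steering behaviour of the global load-balancing layer can be illustrated with a simple selection rule: discard unhealthy origins, apply any policy restriction, then prefer the lowest latency. The sketch below uses hypothetical origin names and latency figures and is a deliberate simplification of what a real edge provider does.

```python
def pick_origin(origins, allowed_clouds=None):
    """Return the name of the healthy, policy-permitted origin with the lowest latency."""
    candidates = [
        o for o in origins
        if o["healthy"] and (allowed_clouds is None or o["cloud"] in allowed_clouds)
    ]
    if not candidates:
        return None  # nothing healthy: the caller must fail closed or serve a holding page
    return min(candidates, key=lambda o: o["latency_ms"])["name"]

# Hypothetical health-check snapshot; the first Azure entry point is marked down.
origins = [
    {"name": "azure-region1", "cloud": "azure", "healthy": False, "latency_ms": 18},
    {"name": "azure-region2", "cloud": "azure", "healthy": True, "latency_ms": 41},
    {"name": "aws-region1", "cloud": "aws", "healthy": True, "latency_ms": 25},
]

print(pick_origin(origins))                            # cross-cloud steering permitted
print(pick_origin(origins, allowed_clouds={"azure"}))  # workload pinned to one cloud
```

With the nearest entry point unhealthy, a workload that supports cross-cloud steering moves to the AWS origin, while one pinned to Azure falls back to the second Azure region.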

Global Edge & Per-Channel Ingress — CDN, DNS, WAF to Cloud Origins

Two decisions in this layer are load-bearing and worth stating plainly. The first is origin cloaking: the cloud origins, Azure and AWS alike, never publish their addresses, accept connections only from the edge’s authenticated ranges, and are unreachable directly from the public internet, so that the WAF and DNS layer cannot be bypassed and a successful attack on one origin tells an adversary nothing about the others. Behind the global edge, each cloud still runs its own regional WAF as a defence-in-depth secondary (Azure Application Gateway WAF v2 fronting Azure workloads and AWS ALB with AWS WAF fronting AWS workloads), so that policy is enforced a second time, close to the workload, in case the global tier is ever misconfigured or circumvented. The second decision is the one that reframes the whole layer: ingress is not a single web stack but a portfolio of channel-specific edges. A web request, a mobile-SDK call, an ISO 8583 message from an ATM switch, a SIP leg from the contact centre, and a host-to-host mTLS session from a corporate treasury system have almost nothing in common at the protocol level, and pretending they share one front door would force every channel down to the weakest assumptions of all of them. Instead the bank operates distinct ingress paths per channel (the customer-facing HTTPS channels behind the full CDN/WAF/bot stack, the financial-message channels behind dedicated mutually-authenticated gateways, the card channels behind a hardened payment-switch ingress) while keeping one thing common across all of them: every request, every block, every health-check flap, and every failover event is forwarded to both Microsoft Sentinel (SIEM) for correlation and case management and Dynatrace for latency and availability telemetry. Security and operations therefore see one coherent picture of the bank’s front door even though that front door is, by design, many doors.
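Origin cloaking comes down to an admission rule at each origin: the connection must arrive from the edge's published ranges and must also prove it came through the edge. The sketch below shows that rule with the standard `ipaddress` module. The prefixes are documentation-only ranges standing in for a real edge provider's list, and the boolean `edge_authenticated` is a placeholder for whatever proof is used in practice, such as mutual TLS.

```python
import ipaddress

# Documentation-only prefixes (RFC 5737) standing in for the edge provider's ranges.
EDGE_RANGES = [ipaddress.ip_network(c) for c in ("203.0.113.0/24", "198.51.100.0/24")]

def accept_at_origin(source_ip, edge_authenticated):
    """Admit a connection only if it comes from an edge range and is authenticated."""
    address = ipaddress.ip_address(source_ip)
    from_edge = any(address in net for net in EDGE_RANGES)
    return from_edge and edge_authenticated

print(accept_at_origin("203.0.113.10", True))   # edge range, authenticated
print(accept_at_origin("203.0.113.10", False))  # edge range, no proof of origin
print(accept_at_origin("192.0.2.55", True))     # direct hit from elsewhere
```

Requiring both conditions matters because source ranges alone can be shared with other customers of the same edge provider.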

Banking channels architecture

Banking Channels Architecture — Nine Channels, One Identity Broker

Behind the edge, the bank does not have a channel; it has nine, and each is treated as a first-class domain with its own ingress, its own identity assurance, and its own resilience posture rather than as a skin over a single web application. What holds the nine together is a single architectural commitment: one identity broker, many step-up policies. All channels federate to one OIDC/FAPI identity broker — Okta as the strategic SSO and federation layer, Microsoft Entra ID for the workforce estate — so that there is exactly one place where authentication is reasoned about, tokens are minted, and session risk is evaluated. What differs between channels is not the broker but the policy the broker applies: the strength of authentication demanded, the step-up triggered by transaction value or risk, and the binding between session and device. A retail customer initiating a high-value transfer, a teller exceeding a counter limit, and a corporate user submitting a payment batch are all, underneath, the same broker enforcing three different step-up rules. This is what lets the bank reason about identity once while still treating a µs-latency markets order and a branch cash withdrawal as the genuinely different things they are.

The customer-facing digital channels — retail web and mobile — sit behind the full CDN/WAF/bot stack and authenticate through OIDC/OAuth2 with Strong Customer Authentication, device binding so that a session is cryptographically tied to a registered handset, and push-MFA for step-up, with dynamic linking applied to payment-initiation flows as PSD2 requires. The branch teller and operations channel runs on workforce SSO with dual-control for sensitive actions — a second authorised colleague must approve — and hard transaction limits enforced server-side, because the risk here is the trusted insider as much as the outsider. The contact-centre, IVR, and telephone channel ingests over SIP and layers voice biometrics for caller verification, falling back to knowledge-based step-up only when the voiceprint is inconclusive. The ATM and cash channel is the one that least resembles a web stack: it speaks ISO 8583 into an ATM switch performing “ATM driving”, relies on HSM PIN translation so that a customer’s PIN is never in the clear between the keypad and the issuer, and — critically for a zero-downtime estate — supports STAND-IN authorisation when the core is unreachable, so that cash dispensing continues within risk-bounded limits even during a core-banking disruption. The cardholder and merchant servicing portals are HTTPS/OIDC web channels behind the edge, scoped tightly because they touch cardholder data and therefore the PCI-DSS cardholder-data environment.
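Stand-in authorisation for the ATM channel can be pictured as a fallback decision with its own, tighter limits. The sketch below is a simplified illustration: the limit values, field names and the rule itself are hypothetical, and real stand-in processing follows issuer and scheme parameters that are not described here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StandInLimits:
    per_txn_minor: int  # hypothetical cap per withdrawal, in minor units
    daily_minor: int    # hypothetical cumulative stand-in cap per card per day

def authorise_withdrawal(amount_minor, core_available, core_balance_minor,
                         limits, stood_in_today_minor):
    """Return (approved, mode); falls back to stand-in limits when the core is down."""
    if core_available:
        return (amount_minor <= core_balance_minor, "online")
    within_limits = (
        amount_minor <= limits.per_txn_minor
        and stood_in_today_minor + amount_minor <= limits.daily_minor
    )
    return (within_limits, "stand-in")

limits = StandInLimits(per_txn_minor=20000, daily_minor=30000)

print(authorise_withdrawal(15000, True, 50000, limits, 0))       # core reachable
print(authorise_withdrawal(15000, False, None, limits, 0))       # core down, within limits
print(authorise_withdrawal(15000, False, None, limits, 20000))   # core down, daily cap hit
```

Stand-in approvals would then need to be replayed to the ledger once the core is reachable again, which is outside the scope of this sketch.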

The remaining channels carry the bank’s institutional and specialist traffic. Corporate portals and host-to-host APIs exchange ISO 20022 pain and camt messages over mTLS, with the corporate organisation — not just the individual user — established as the trust anchor, so that a treasury system authenticates as itself before any human is involved. Wealth advisor and private-banking workstations combine workforce SSO with entitlements to market-data feeds and enforce suitability checks in the advice path, so that what an adviser can recommend is bounded by the client’s profile. Insurance and claims runs as a bancassurance domain with its own servicing and claims-intake ingress, and treasury and markets is the most latency-sensitive channel of all, fronting order entry and execution where the broker’s job is to authenticate and authorise fast enough not to intrude on a deterministic, sub-millisecond path. The table below sets the nine side by side along the three axes that define them — protocol, identity and step-up, and the resilience commitment each one makes.

Channel | Protocol / standard | Identity & step-up | Resilience note
Retail web & mobile | HTTPS/REST + mobile SDK | OIDC/OAuth2 + SCA; device binding; push-MFA; dynamic linking on payments | Behind CDN/WAF/bot; active-active across regions; Tier-1 99.99%
Branch teller & operations | Workforce SSO (web/desktop) | Entra/Okta SSO; dual-control; server-side txn limits | Degrades to limited offline counter functions; regional failover
Contact-centre / IVR / telephone | SIP / telephony | Voice biometrics; KBA fallback step-up | Multi-site SIP ingress; reroute on PoP loss; no single switch
ATM & cash | ISO 8583 → ATM switch (“ATM driving”) | Card + HSM PIN translation; issuer step-up | STAND-IN (STIP) authorises within limits when core is down
Cardholder & merchant portals | HTTPS/OIDC | OIDC/OAuth2 + step-up; PCI-scoped session | In CDE segment; edge-fronted; regional active-active
Corporate portals & host-to-host APIs | ISO 20022 (pain/camt), REST/SFTP | mTLS; org-level trust anchor; OAuth2/FAPI | Mutually-authenticated gateways; idempotent file intake; queue-buffered
Wealth advisor & private-banking workstations | HTTPS + market-data feeds | Workforce SSO; entitlements; suitability gating | Read-mostly resilience; cached market data on feed loss
Insurance & claims (bancassurance) | HTTPS/REST | OIDC/OAuth2 + step-up on payout/claim | Tier-2 availability; async claims intake survives back-office lag
Treasury & markets | FIX (+ binary iLink3/OUCH) | Workforce SSO; deterministic pre-trade authz | Co-located, µs-latency; deterministic fail-fast pre-trade checks

Read together, the edge and the channels express one coherent stance. The global edge gives the bank a single, third-party, cross-cloud place to inspect, rate-limit, and fail over every kind of traffic without exposing an origin, and the channel architecture refuses to flatten nine genuinely different domains into one web stack while still routing every one of them through a single identity broker. That combination — common protection and common identity at the perimeter, channel-specific ingress and resilience underneath — is precisely what lets the institution serve a markets desk and an ATM, a corporate treasury and a retail handset, from one foundation, and keep all of them open while the platform beneath them is continuously changed.

Core banking — system of record and system of engagement

Every other decision in this design ultimately defers to one fact: somewhere there is a ledger that records, to the penny, how much money exists and who owns it, and that ledger cannot be wrong, cannot be ambiguous and cannot be reconstructed by guessing. The core banking general ledger is the system of record — the immutable single source of truth for money movement — and it is the one component in the entire estate that is deliberately not made eventually-consistent, not made multi-master, and not made to resolve conflicts after the fact. Balances are posted under strong consistency with a synchronous quorum, engineered to an effective RPO of approximately zero, because the alternative — accepting a write on one node and reconciling later — means that for some window the bank does not know its own position. The cardinal rule of this section, and the one most often violated by teams who treat a ledger like any other database, is that balances are never resolved by last-write-wins: two concurrent debits against the same account must be serialised and one must lose, deterministically, not silently overwritten.
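The rule that concurrent debits must be serialised, with one losing deterministically, can be shown with a versioned compare-and-set. The sketch below is a single-process illustration using a lock and a version counter; it is an assumption chosen to make the behaviour visible and is not the consensus protocol a distributed ledger would use.

```python
import threading

class Account:
    def __init__(self, balance_minor):
        self.balance_minor = balance_minor
        self.version = 0
        self._lock = threading.Lock()

    def read(self):
        """Return the balance together with the version it was read at."""
        with self._lock:
            return self.balance_minor, self.version

    def debit_if_version(self, amount_minor, expected_version):
        """Post a debit only if nobody else has posted since the caller's read."""
        with self._lock:
            if self.version != expected_version:
                return "conflict"            # stale read: caller must re-read and retry
            if amount_minor > self.balance_minor:
                return "insufficient_funds"
            self.balance_minor -= amount_minor
            self.version += 1
            return "posted"

account = Account(balance_minor=10000)

# Two channels read the same state, then both try to debit 8000.
_, seen_by_a = account.read()
_, seen_by_b = account.read()
result_a = account.debit_if_version(8000, seen_by_a)
result_b = account.debit_if_version(8000, seen_by_b)

# The loser re-reads and retries against the true balance.
_, fresh = account.read()
retry_b = account.debit_if_version(8000, fresh)

print(result_a, result_b, retry_b, account.balance_minor)
```

Under last-write-wins both channels would have computed a new balance of 2000 from the same stale read and both debits would appear to succeed; here the second is rejected as a conflict and, on retry, declined for insufficient funds.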

That consistency requirement is why the system of record is kept single-homed — pinned either on-premises or to a single cloud region with synchronous replication to its paired-DR site — rather than spread active-active across providers. A consensus ledger (Raft- or Paxos-class, 2PC across shards for cross-account movements) pays a latency and availability tax to guarantee correctness, and stretching it across two clouds would multiply that tax without buying integrity the business actually needs. The institution runs the tier-one incumbent cores in this role — Temenos Transact, Infosys Finacle, Oracle FLEXCUBE and TCS BaNCS — alongside the cloud-native, event-driven, API-first generation, Thought Machine Vault and Mambu, where the product allows. A defining property of these fourth-generation distributed cores is that they never close: there is no end-of-day batch shutdown, no overnight window during which the bank is unavailable to itself. That architectural fact is precisely what makes the zero-downtime mandate achievable at the ledger tier, where it is hardest to honour.

A single-homed system of record would be a bottleneck if every channel queried it directly, so it is not exposed by direct coupling. Instead the ledger publishes through an event backbone — Kafka with Flink for stream processing — emitting a canonical ISO 20022 internal model so that every consumer reads the same semantically-typed event rather than a core-specific record format. Around that backbone sits the system of engagement: the channels, digital front-ends, analytics and AI services that are cloud-native, active-active and multi-region, and that consume real-time ledger events rather than reaching into the core. This is the structural divide that lets the bank modernise without risk — the SoR optimised for correctness and stability, the SoE optimised for change velocity and customer experience, joined only by an asynchronous, well-typed event stream. The migration discipline that follows from this is strangler-fig coexistence, never big-bang: new capabilities are stood up in the SoE and around the backbone, traffic is moved capability by capability, and the legacy core is retired only when nothing depends on it. A flag-day cut-over of a bank’s ledger is not a strategy; it is an outage waiting for a date.
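On the system-of-engagement side, a consumer of ledger events has to tolerate redelivery without double-counting and must notice missing events. The sketch below shows an idempotent balance projection driven by sequence numbers. The event fields are hypothetical placeholders rather than ISO 20022 element names, and no Kafka or Flink client is involved.

```python
class BalanceProjection:
    """Read model built from ledger events; never writes back to the ledger."""

    def __init__(self):
        self.balances = {}
        self.last_seq = 0

    def apply(self, event):
        """Apply an event once; ignore duplicates and refuse to skip over gaps."""
        seq = event["seq"]
        if seq <= self.last_seq:
            return False  # duplicate delivery, already applied
        if seq != self.last_seq + 1:
            raise ValueError(f"gap in stream: expected {self.last_seq + 1}, got {seq}")
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount_minor"]
        self.last_seq = seq
        return True

# Hypothetical event stream with one redelivered event (seq 2 appears twice).
events = [
    {"seq": 1, "account": "ACC-1", "amount_minor": 50000},
    {"seq": 2, "account": "ACC-1", "amount_minor": -12000},
    {"seq": 2, "account": "ACC-1", "amount_minor": -12000},
    {"seq": 3, "account": "ACC-2", "amount_minor": 7000},
]

projection = BalanceProjection()
applied = [projection.apply(e) for e in events]
print(applied)
print(projection.balances)
```

The projection can be discarded and rebuilt from the stream at any time, which is what keeps the engagement tier free to change without touching the system of record.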

Core Banking — System of Record vs System of Engagement + Event Backbone

The practical consequence for the landing zone is that the two halves obey different placement rules. The SoR sits in the most controlled tier — single-homed, synchronously replicated, CP over AP every time — while the SoE inherits the cell-based, active-active, zero-downtime treatment described for the customer-facing estate. The event backbone is the contract between them, and because it carries the canonical ISO 20022 model it is also the seam at which payments, cards, fraud and analytics all attach. Get that seam right and the rest of the banking architecture composes cleanly; get it wrong and every downstream domain inherits a leaky, core-specific coupling that no amount of channel engineering can hide.

Payments hub and ISO 20022

Payments are where the bank meets the outside world’s plumbing, and that plumbing is gloriously inconsistent: every rail has its own message format, settlement cadence, cut-off times and failure modes. The organising decision here is to refuse to let that heterogeneity leak inwards. A payments hub normalises every rail to a single canonical internal model — ISO 20022 — so that the core and the channels speak one payments language regardless of whether an instruction arrived over Fedwire, Bacs, SWIFT or UPI. Rails are translated at the edge of the hub and nowhere else; downstream, a payment is a payment. This is the same architectural move as the core event backbone, applied to ingress, and it is what keeps the cost of adding the next rail — or absorbing the next scheme mandate — linear rather than combinatorial.
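The edge-translation idea can be sketched as a set of per-rail adapters that all return one canonical record. This is a minimal illustration, not the hub's actual schema: the `CanonicalPayment` fields, the raw message shapes for Bacs and UPI, and the names `from_bacs`, `from_upi` and `normalise` are all hypothetical, and a real canonical model would follow the ISO 20022 message definitions rather than this simplified shape.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Dict


@dataclass(frozen=True)
class CanonicalPayment:
    """Hypothetical internal model; a simplified stand-in for an ISO 20022 credit transfer."""
    rail: str
    end_to_end_id: str
    debtor_account: str
    creditor_account: str
    amount: Decimal
    currency: str


def from_bacs(raw: dict) -> CanonicalPayment:
    # Assumed raw shape: amount carried as integer pence.
    return CanonicalPayment(
        rail="bacs",
        end_to_end_id=raw["reference"],
        debtor_account=raw["originator"],
        creditor_account=raw["destination"],
        amount=Decimal(raw["amount_pence"]) / 100,
        currency="GBP",
    )


def from_upi(raw: dict) -> CanonicalPayment:
    # Assumed raw shape: amount carried as a decimal string in rupees.
    return CanonicalPayment(
        rail="upi",
        end_to_end_id=raw["txn_id"],
        debtor_account=raw["payer_vpa"],
        creditor_account=raw["payee_vpa"],
        amount=Decimal(raw["amt"]),
        currency="INR",
    )


# Adding a rail means adding one adapter here; nothing downstream changes.
ADAPTERS: Dict[str, Callable[[dict], CanonicalPayment]] = {
    "bacs": from_bacs,
    "upi": from_upi,
}


def normalise(rail: str, raw: dict) -> CanonicalPayment:
    """Translate a rail-specific message at the hub edge; reject unknown rails."""
    try:
        adapter = ADAPTERS[rail]
    except KeyError:
        raise ValueError(f"no adapter registered for rail {rail!r}") from None
    return adapter(raw)
```

Because every consumer depends only on `CanonicalPayment`, the cost of a new rail in this sketch is one adapter function and one registry entry, which is the linear-growth property the hub is designed for.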

The hub spans four broad rail families, and the design treats each as a distinct ingress with its own resilience and timing profile. Real-time gross settlement — Fedwire, CHAPS and TARGET2/T2 — carries high-value, irrevocable, individually-settled instructions and has now fully migrated to ISO 20022. Batch rails — ACH, Bacs and SEPA Credit Transfer — move high volumes on scheduled cycles where throughput and integrity matter more than per-item latency. Cross-border runs over SWIFT correspondent banking using pacs.008 (customer credit transfer) and pacs.009 (financial-institution transfer), and its connectivity is confined to a hardened SWIFT Customer Security Programme secure zone attested annually against the CSCF. And instant 24/7 rails — FedNow and TCH RTP (each with a $10M transaction limit), the now-mandatory SEPA Instant, the UK Faster Payments Service, India’s UPI and Brazil’s Pix — demand sub-five-second posting at any hour, which is only survivable because the SoR never closes. A material milestone shapes the whole programme’s timeline: the ISO 20022 SWIFT MT/MX coexistence period ended on 22 November 2025, so cross-border traffic is MX-native, with structured-address enforcement following in November 2026.

The hardest correctness problem in payments is not format translation but exactly-once effect under retries. Networks time out, schemes resend, and operators reprocess; without protection, every one of those events risks a double payment. The hub solves this with idempotency keys persisted at ingress and idempotent ledger posting, so that a retried instruction reconciles to the original posting rather than creating a second one — the same key always yields the same financial outcome, exactly once. Layered on top is the issuer stand-in (STIP) capability: when the core is unreachable, stand-in auto-authorises within pre-agreed risk limits so that customers are not declined for an internal outage, with the resulting transactions reconciled to the ledger once it is reachable. The reference implementation follows a deliberately conventional, defence-in-depth pattern — API Gateway / APIM for authenticated ingress → a FIFO queue (SQS FIFO / Service Bus) to serialise and buffer → microservices on EKS/AKS → an idempotent state store → an immutable archive → with KMS-managed keys throughout — chosen precisely because each hop is independently scalable, observable and replaceable without a maintenance window.
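The exactly-once effect described above can be illustrated with a small in-memory sketch. The `Ledger` class, its `post` method and the `IdempotencyConflict` exception are hypothetical names for illustration; in a real hub the key lookup and the posting would have to commit atomically in a durable store, which this sketch does not attempt.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Posting:
    posting_id: int
    account: str
    amount: Decimal


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different instruction."""


class Ledger:
    def __init__(self) -> None:
        self._postings: List[Posting] = []
        # idempotency key -> (fingerprint of the instruction, resulting posting)
        self._by_key: Dict[str, Tuple[tuple, Posting]] = {}

    def post(self, idempotency_key: str, account: str, amount: Decimal) -> Posting:
        fingerprint = (account, amount)
        seen = self._by_key.get(idempotency_key)
        if seen is not None:
            previous_fingerprint, posting = seen
            if previous_fingerprint != fingerprint:
                # Same key, different payload: refuse rather than guess.
                raise IdempotencyConflict(idempotency_key)
            # Retry of the same instruction: return the original posting.
            return posting
        posting = Posting(len(self._postings) + 1, account, amount)
        self._postings.append(posting)
        self._by_key[idempotency_key] = (fingerprint, posting)
        return posting

    def count(self) -> int:
        return len(self._postings)
```

A retried instruction carrying the same key and payload returns the first posting and leaves the posting count unchanged; a reused key with a different payload is rejected, which is one reasonable way to surface a client bug instead of silently posting twice.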

Payments Hub & ISO 20022 — Multi-Rail Canonical Model

The rail taxonomy below is the contract the hub implements; it is worth reading as the definitive list of what “one canonical model” actually has to absorb.

| Rail | Type | Settlement | Standard |
|---|---|---|---|
| Fedwire / CHAPS / TARGET2 (T2) | RTGS (high-value, irrevocable) | Real-time gross, intraday | ISO 20022 (MX-native) |
| ACH / Bacs / SEPA CT | Batch (bulk credit transfer) | Scheduled cycles, multi-day | ISO 20022 / scheme formats |
| SWIFT correspondent | Cross-border (FI-to-FI) | Correspondent / nostro-vostro | pacs.008 / pacs.009 (CSP secure zone) |
| FedNow / TCH RTP | Instant 24/7 (≤ $10M) | Immediate, real-time | ISO 20022 |
| SEPA Instant / UK FPS | Instant 24/7 (mandatory SEPA Inst.) | Immediate, < 5 s post | ISO 20022 / scheme formats |
| UPI / Pix | Instant 24/7 (domestic schemes) | Immediate, real-time | Scheme-native (ISO 20022-aligned) |

Because the hub is the choke point through which money leaves the institution, it is also where the cross-cutting controls converge: idempotency for correctness, the CSP secure zone for SWIFT, inline sanctions screening before settlement, and the canonical model that lets fraud and AML attach once rather than per-rail. The payments hub is therefore not merely an integration component — it is the policy-enforcement boundary for outbound value, and it is engineered to that standard.

Cards and authorisation

Card payments run on a network topology that predates the cloud by decades and has its own fixed cast of participants. The institution sits inside the four-party model: a cardholder presents a card to a merchant, whose acquirer (or its processor) routes the transaction to the network switch — Visa or Mastercard — which delivers it to the issuer for a decision. The bank occupies two distinct economic roles in this flow that must not be conflated in the architecture: issuing (it owns the cardholder relationship and authorises spend against the account) and acquiring (it serves merchants and presents their transactions). They share scheme connectivity but have separate risk, settlement and data obligations, and the landing zone keeps them as separate trust domains rather than a single “cards” estate.

The flow itself decomposes into three operations on three different clocks, and the design’s central performance claim lives in the first of them. Authorisation is real-time and synchronous: a dual-message exchange that must complete in under 100 ms — Visa scores well over 150 billion transactions a year inside that envelope — and crucially, fraud scoring runs inline within that same 100 ms, consuming only 10–50 ms, not as a downstream batch job. Achieving that means the fraud models and their feature data are co-located with the authorisation path, because a network round-trip to a remote scoring service would blow the budget on its own. Clearing then runs as a batch process that exchanges the financial detail behind each authorisation, and settlement moves the actual net funds between institutions on a T+1 / T+2 cycle. Conflating these three — treating authorisation latency and settlement timing as one problem — is a common and expensive modelling error; the architecture keeps them on independent paths sized to their own SLAs.
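The budget argument can be made concrete with simple arithmetic. Only the 100 ms envelope and the 10–50 ms fraud slice come from the text; the other stage timings and the 60 ms cross-region round-trip below are illustrative assumptions, not measurements.

```python
AUTH_ENVELOPE_MS = 100.0  # from the text: authorisation must complete in under 100 ms


def remaining_budget_ms(stages: dict) -> float:
    """Envelope minus the sum of stage latencies; negative means the budget is blown."""
    return AUTH_ENVELOPE_MS - sum(stages.values())


# Assumed stage timings (ms) with fraud scoring co-located, at the top of its 10-50 ms slice.
co_located = {
    "network_and_switch": 15.0,
    "issuer_decision_logic": 20.0,
    "fraud_scoring": 50.0,
    "account_hold": 10.0,
}

# Same path, but the fraud model fetches features from a remote region (assumed 60 ms RTT).
remote_features = dict(co_located, remote_feature_fetch=60.0)

headroom_local = remaining_budget_ms(co_located)
headroom_remote = remaining_budget_ms(remote_features)
```

Under these assumed numbers the co-located path keeps a small positive headroom, while a single remote fetch pushes the path well past the envelope regardless of how fast the model itself is.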

Two protections do most of the work of shrinking the bank’s regulatory exposure. Tokenisation — Visa Token Service (VTS) and Mastercard Digital Enablement Service (MDES) — replaces the primary account number with a network token, so that wallets, merchants and downstream systems never hold the real PAN; this materially shrinks the PCI-DSS scope because systems handling only tokens fall outside the cardholder data environment. For card-not-present commerce, 3-D Secure (EMV 3DS v2.3.x) adds issuer-side authentication with risk-based, frictionless flows for low-risk transactions and a step-up challenge only when warranted — satisfying SCA without taxing every checkout. Underneath both sits the non-negotiable structural control: the PCI-DSS v4.0.1 cardholder data environment is a hard-segmented trust domain, network-isolated under Requirement 1.4, with its segmentation proven by penetration test every six months under Requirement 11.4. Everything that touches a real PAN lives inside that boundary; everything that can be kept out — through tokenisation — is kept out by design.

Cards & Authorisation — Four-Party Flow, Tokenisation & PCI CDE

The cards domain therefore inherits the strictest combination of constraints in the bank: a sub-100 ms latency envelope shared between authorisation and inline fraud, a settlement cadence on an entirely different timescale, and a Restricted-classified data environment that regulators expect to be demonstrably segmented and continuously attested. The discipline that makes it tractable is reduction of scope — tokenise aggressively so the CDE stays small, keep issuing and acquiring as distinct domains, and run fraud where the data already is — so that the hardest-to-protect surface in the estate is also the smallest.

Fraud and financial crime

Financial crime defences are the one place in the bank where a control sits directly in the path of money and is given only milliseconds to decide. The design draws a sharp line between two disciplines that the industry too often conflates. Fraud scoring is inline in the authorisation path — it runs inside the sub-100 ms card-authorisation envelope described in the cards section, not as an after-the-fact review — and is therefore engineered to a sub-100 ms p99 of its own, consuming only the 10–50 ms slice the latency budget allows it. Anti-money-laundering and sanctions screening is inline in the payment path — it runs before a payment is allowed to settle, to a sub-300 ms budget — so that a hit stops the instruction pre-settlement rather than generating a regret afterwards. The two share an architecture but answer different questions: fraud asks is this genuinely the customer, behaving normally, while financial-crime screening asks is this counterparty or payment something the bank is legally forbidden to process.

For the fraud question, the signal that makes inline scoring accurate is identity-of-behaviour rather than identity-of-credential. The platform combines device fingerprinting and behavioural biometrics — Feedzai for transaction-risk scoring and BioCatch for behavioural-biometric signals — so the model reasons not only about what is being done but about how the human is doing it: cadence of typing, the arc of a swipe, how a device is held. Because these models must answer in single-digit-to-tens of milliseconds, the feature stores they read are replicated and co-located with the scoring path, never fetched across a region boundary at decision time; the moment a fraud model reaches out to a remote store it has already blown its budget. This co-location of model and feature data is the same engineering discipline the card-authorisation envelope demands, and it is non-negotiable for a Tier-1 latency target.

Fraud & Financial Crime — Inline Fraud Scoring + AML/Sanctions

The most consequential design decision in this domain is the failure mode, because an inline control is also an inline single point of failure unless it is built not to be. The fraud engine is engineered with fail-safe defaults and the high availability of the Tier-1 service it protects: a fraud-scoring outage can neither block legitimate authorisations en masse nor silently wave everything through. Both extremes are unacceptable — one denies customers their own money, the other opens the bank to loss — so the path degrades to a deterministic, conservative policy (tighter limits, mandatory step-up on higher-risk patterns) that holds the line without depending on the scoring tier being live. In practice this means the scoring service is replicated active-active alongside the authorisation path it serves, and the absence of a score is itself a modelled, governed outcome rather than an unhandled exception. A fraud engine that takes the authorisation channel down with it has converted a risk control into an availability liability, and on a zero-downtime estate that trade is forbidden.
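One way to read the degraded path is as a wrapper that treats a missing score as a first-class outcome. The sketch below is an assumption-laden illustration: the limits, the score threshold and the names `authorise` and `degraded_policy` are hypothetical, and the specific three-way split (approve small, step up mid-range, decline large) is just one possible conservative policy.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    APPROVE = "approve"
    STEP_UP = "step_up"
    DECLINE = "decline"


# Hypothetical limits for the degraded path; deliberately tighter than normal operation.
DEGRADED_APPROVE_LIMIT = 100.0
DEGRADED_DECLINE_LIMIT = 2_000.0
SCORE_DECLINE_THRESHOLD = 0.9  # hypothetical model threshold


def degraded_policy(amount: float) -> Decision:
    """Deterministic rules used when no score is available."""
    if amount <= DEGRADED_APPROVE_LIMIT:
        return Decision.APPROVE
    if amount <= DEGRADED_DECLINE_LIMIT:
        return Decision.STEP_UP
    return Decision.DECLINE


def authorise(amount: float, score_fn: Callable[[float], Optional[float]]) -> Decision:
    """Score inline; if scoring fails or returns nothing, fall back to the degraded policy."""
    try:
        score = score_fn(amount)
    except Exception:
        score = None  # absence of a score is a governed outcome, not a crash
    if score is None:
        return degraded_policy(amount)
    return Decision.DECLINE if score >= SCORE_DECLINE_THRESHOLD else Decision.APPROVE
```

The property worth noticing is that a scoring outage neither approves everything nor declines everything: small transactions still clear, mid-range ones are challenged, and only the largest are refused until the scoring tier returns.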

The financial-crime side runs to the same latency philosophy but a different cadence. Sanctions, PEP and adverse-media screening is performed real-time and inline, to a sub-300 ms budget, expressly to stop a payment before it settles — screening that arrives after settlement is an investigation, not a control. Sitting behind that real-time gate is transaction monitoring built as a deliberate hybrid: a streaming layer evaluates behaviour live as instructions flow, while a batch layer runs the heavier, look-back typologies — structuring, layering, mule-network patterns — that need a wider window than any inline budget can afford. Alerts from both layers converge into case management on NICE Actimize and Napier, where investigators trace, disposition and, where warranted, escalate to a Suspicious Activity Report or Suspicious Transaction Report (SAR/STR). The whole programme is anchored to the FATF 40 Recommendations and retains its audit data accordingly, so that customer due diligence, screening, monitoring and reporting form one evidenced chain rather than disconnected tools. The organising idea is that financial-crime controls are tiered by the time the question allows — milliseconds for the sanctions gate that must precede settlement, seconds-to-live for streaming monitoring, hours for batch typologies — and each layer is placed where its latency budget can actually be met.
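To show why look-back typologies need a wider window than an inline budget allows, here is a minimal batch sketch of a structuring check: several deposits, each below a reporting threshold, that together reach it within a few days. The threshold, the three-day window and the function name `find_structuring` are assumptions for illustration; production typologies are considerably richer.

```python
from collections import defaultdict
from datetime import datetime, timedelta

REPORTING_THRESHOLD = 10_000      # hypothetical threshold, in currency units
WINDOW = timedelta(days=3)        # hypothetical look-back window


def find_structuring(deposits):
    """deposits: iterable of (account, timestamp, amount). Returns the set of flagged accounts."""
    by_account = defaultdict(list)
    for account, ts, amount in deposits:
        if amount < REPORTING_THRESHOLD:  # only sub-threshold deposits are of interest
            by_account[account].append((ts, amount))

    flagged = set()
    for account, items in by_account.items():
        items.sort()
        start, running = 0, 0
        for end, (ts, amount) in enumerate(items):
            running += amount
            # Slide the window start forward until it spans at most WINDOW.
            while ts - items[start][0] > WINDOW:
                running -= items[start][1]
                start += 1
            # Flag two or more sub-threshold deposits that together reach the threshold.
            if end > start and running >= REPORTING_THRESHOLD:
                flagged.add(account)
                break
    return flagged
```

The check needs every deposit for an account across several days before it can answer, which is why it belongs in the batch layer rather than in front of settlement.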

Trading and capital markets

Trading is the domain that deliberately breaks the bank’s own rules, and saying so plainly is the point of this section. Everywhere else in this estate the organising goal is geographic spread — active-active across regions, paired-DR, no single point of failure — because the business value is availability. In trading the organising goal is latency, and latency is pinned to physics, so the same instinct that makes the rest of the bank multi-region makes a matching venue single-region and co-located. Spreading an order-matching path across regions to chase availability would add the one thing this domain cannot pay — distance, and therefore time. The architecture therefore treats trading as a justified exception to the zero-downtime topology rather than a violation of it, and the justification is measured in nanoseconds.

The numbers set the scale of the problem. Order entry and execution run over FIX, with binary protocols — iLink3, OUCH, ITCH and ETI — where every microsecond counts, because the venues themselves match in microseconds: CME Globex under 150 µs, Eurex T7 around 60 µs, and Nasdaq under 40 µs. To compete into matching engines that fast, the bank’s own tick-to-trade path is built on FPGA hardware delivering 100–500 ns — sub-microsecond reaction from market signal to order — and is physically co-located in the exchange data centre, because at these tolerances the speed of light over a cross-town fibre is itself a material disadvantage. This is why the domain is placed where it is: co-location is not a deployment preference, it is the determinant of whether an order is competitive, and no amount of cloud elasticity substitutes for sitting in the same rack hall as the matching engine.
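A rough propagation estimate shows why distance dominates at these tolerances. The refractive index of 1.468 is a typical figure for single-mode fibre and is an assumption here, as are the 10 km and 100 m example distances; the 500 ns figure is the upper end of the tick-to-trade range quoted above.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458   # exact, by definition of the metre
FIBRE_REFRACTIVE_INDEX = 1.468         # assumed typical value for single-mode fibre
TICK_TO_TRADE_NS = 500.0               # upper end of the 100-500 ns range in the text


def one_way_fibre_delay_ns(distance_m: float) -> float:
    """One-way propagation delay through fibre, ignoring switching and serialisation."""
    return distance_m * FIBRE_REFRACTIVE_INDEX / SPEED_OF_LIGHT_M_PER_S * 1e9


cross_town_ns = one_way_fibre_delay_ns(10_000)   # assumed 10 km cross-town run
in_hall_ns = one_way_fibre_delay_ns(100)         # assumed 100 m run inside the data centre
ratio = cross_town_ns / TICK_TO_TRADE_NS
```

Under these assumptions the 10 km run costs roughly 49 µs one way, close to a hundred times the entire 500 ns tick-to-trade budget, while the 100 m in-hall run costs about half a microsecond. Propagation alone, before any switching, decides whether the FPGA's reaction time matters.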

Trading & Capital Markets — Low-Latency Execution + T+1 Settlement

Speed of this order does not relax control; it constrains how control must be built. Pre-trade risk checks are inline and deterministic — fixed-cost, bounded-time validations of limits, fat-finger thresholds and exposure that execute in the order path without introducing variable latency, because a risk check whose timing is unpredictable is, at these tolerances, indistinguishable from an outage. Determinism is the requirement: the check must cost the same nanoseconds every time, so the trading path’s latency is a known quantity rather than a distribution with a dangerous tail. This is the inverse of the fraud-scoring model elsewhere in the bank, which tolerates a probabilistic answer within a millisecond budget; on the matching path, predictability of latency outranks richness of analysis.
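The shape of a fixed-cost check can be sketched in software, with the caveat that Python cannot demonstrate nanosecond determinism; in practice this logic would sit in FPGA or similarly constrained code. What the sketch does show is the structure: a fixed number of integer comparisons, no I/O, no data-dependent loops and no early exit. The limit names, breach flags and integer-tick price representation are hypothetical.

```python
from dataclasses import dataclass

# Breach flags, combined into a bitmask so every check always runs.
BREACH_QTY = 1
BREACH_NOTIONAL = 2
BREACH_PRICE_BAND = 4
BREACH_POSITION = 8


@dataclass(frozen=True)
class Limits:
    max_order_qty: int
    max_notional: int        # qty * price, with price in integer ticks
    price_band_ticks: int    # fat-finger band around the reference price
    max_abs_position: int


def pre_trade_check(side: int, qty: int, price_ticks: int,
                    ref_price_ticks: int, net_position: int, limits: Limits) -> int:
    """Return 0 if the order passes, else a bitmask of breaches. side is +1 (buy) or -1 (sell)."""
    breaches = 0
    # No short-circuiting: each comparison executes on every call.
    breaches |= BREACH_QTY * (qty > limits.max_order_qty)
    breaches |= BREACH_NOTIONAL * (qty * price_ticks > limits.max_notional)
    breaches |= BREACH_PRICE_BAND * (abs(price_ticks - ref_price_ticks) > limits.price_band_ticks)
    breaches |= BREACH_POSITION * (abs(net_position + side * qty) > limits.max_abs_position)
    return breaches
```

Evaluating every check unconditionally costs a few extra comparisons on a rejected order, but it keeps the work per order constant, which is the property the matching path cares about.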

Where the front office is pinned to a single co-located region, the back office returns to the bank’s mainstream pattern. Post-trade settlement has compressed to T+1, and the design treats this as a hard, dated obligation rather than an aspiration: the United States moved to T+1 in May 2024, and the EU and UK are targeting 11 October 2027. Meeting a one-day cycle leaves no slack for manual repair, so clearing and settlement run as straight-through processing into DTCC, where trade capture, affirmation, allocation and settlement flow without re-keying. The shape of the domain is therefore deliberately asymmetric, and that asymmetry is the design: execution is latency-pinned, single-region and co-located; settlement is throughput-and-integrity-bound and rejoins the resilient, automated, dual-cloud estate that governs the rest of the bank. Recognising that a single architectural posture cannot serve both halves — that the microsecond world of matching and the T+1 world of settlement are genuinely different problems — is what lets the bank be fast where speed wins business and resilient where integrity protects it.

Multi-layer security model

A bank that has committed to a zero-downtime operating model cannot lean on the perimeter as its principal defence, because the perimeter is the one thing a zero-downtime estate deliberately keeps open and changing. The security model is therefore built on a single uncomfortable assumption: every layer must hold even when the layer outside it has already failed. There is no trusted interior. Each control is designed as if the attacker is already one step closer than the diagram suggests — inside the network, on a managed device, holding a valid token — and each layer answers the question what still stops them here. Seven layers compose the model, and the discipline that makes it auditable rather than aspirational is that every layer names an owner, a tool and a measurable control, so a gap is a missing metric rather than a matter of opinion.

Multi-Layer Zero-Trust Security Model

The identity layer is the new perimeter and is owned by Cloud Security & Identity. Assuming the network is hostile, every request is authenticated and authorised at the identity plane through Microsoft Entra ID and Okta, with Conditional Access evaluating user, device, location and risk on each call; the measurable control is MFA on 100% of identities, phishing-resistant factors for all privileged access, and zero standing privilege enforced through PIM just-in-time elevation. The device layer assumes a stolen or compromised credential and refuses to trust the credential alone: Intune compliance and CrowdStrike Falcon posture gate access so that only a healthy, managed, attested endpoint — or a Privileged Access Workstation for administrative paths — passes, measured as the proportion of sessions originating from compliant devices. The network layer assumes a foothold has been established somewhere inside and works to deny lateral movement: hub-and-spoke segmentation with default-deny, Azure Firewall Premium and AWS Network Firewall for central inspection, and the standing rule that there is no unrestricted east-west traffic, measured by segmentation coverage and the absence of any-any flows.

The application layer assumes the network controls were bypassed and protects the workload at its own front door: Application Gateway WAF v2 and AWS ALB with AWS WAF in OWASP-prevention mode, FAPI 2.0 schema validation and mutual TLS on partner APIs, owned by Application Enablement and measured by WAF coverage and blocked-attack rates against the OWASP categories. The workload layer assumes a service or container is already running malicious code and constrains what it can do at runtime: CrowdStrike Falcon runtime protection and Wiz admission and image controls enforce least-privilege execution, with the measurable control being critical findings remediated inside the 7-day critical / 30-day high patch SLA and zero criticals reaching production. The data layer is the last line and assumes everything above it has fallen, so it makes the data itself useless to the holder: encryption everywhere with customer-managed keys in Key Vault Managed HSM and AWS CloudHSM, HYOK / External Key Store for the most sensitive classes, and confidential computing for data in use, measured as the percentage of Restricted data under customer-controlled keys, which is held at 100%. Sitting across the other six is the seventh, the monitoring layer, owned by the SOC: Microsoft Sentinel correlates signal from every layer into an immutable, Object-Lock log archive, and Wiz spans posture and exposure across both clouds end-to-end, continuously evaluating misconfiguration and attack paths against a posture score held at 85 or above. The point of the seventh layer is that detection assumes prevention will eventually be defeated, and an attack that defeats six controls silently is a far worse outcome than one that is seen.

Multi-layer network security

Network security inherits the same assumption-of-breach discipline and expresses it as defence in depth through six independent controls, no one of which is trusted to be sufficient. The first is the third-party global edge — CDN, authoritative DNS and WAF — which absorbs volumetric and bot traffic, enforces OWASP and FAPI schema rules, and cloaks the true origins so the regional ingress is never addressable directly. The second is the regional web-application firewall at the cloud edge: Application Gateway WAF v2 on Azure and AWS ALB with AWS WAF, a second enforcement point that assumes the global edge can be bypassed and re-checks every request closer to the workload. The third is central inspection — Azure Firewall Premium and AWS Network Firewall — through which all north-south and inter-spoke traffic is forced for IDPS and TLS inspection, so nothing crosses a trust boundary unseen. The fourth is micro-segmentation: NSGs and security groups with default-deny baselines and application-tier segmentation, so a compromise is contained to one tier rather than free to roam. The fifth is private connectivity — Private Endpoints and PrivateLink — which removes data, key and ledger-adjacent services from the public internet entirely. The sixth is the host and workload control itself, where CrowdStrike Falcon and host firewalls hold the line if every network layer above has been crossed. Two rules bind the whole design: there is no unrestricted east-west traffic anywhere in the estate, and all egress is policy-controlled and logged through the central firewalls — a workload cannot reach the internet, or another spoke, except by an explicit, inspected, recorded allow.

Azure network security (deep dive)

On Azure the controls compose around the regional hub. Inbound customer traffic terminates at Application Gateway WAF v2 in the dedicated AppGatewaySubnet, running OWASP managed rules in prevention mode with bot protection and per-URI rate limiting, so an application is never the first thing to see a raw request. Everything beyond it — north-south to the internet, and east-west between spokes — is routed by User-Defined Routes through Azure Firewall Premium, whose IDPS and TLS inspection mean encrypted lateral traffic is decrypted, inspected and re-encrypted rather than waved through; egress is governed by FQDN and application rules so a workload reaches only the named destinations its policy permits, and every flow is logged. Within and between spokes, NSGs and Application Security Groups enforce a default-deny posture: tiers are expressed as ASGs (web, app, data) and only the explicitly required tier-to-tier flows are opened, which is precisely the CDE segmentation PCI-DSS Requirement 1.4 expects and the segmentation a six-monthly penetration test under Requirement 11.4 must prove. Data, key and ledger services are reached only through Private Endpoints with public network access disabled on the resource itself, so even a stolen connection string resolves to a private address inside the bank’s network and a public route simply does not exist. Azure Bastion in its own subnet provides administrative access without exposing RDP or SSH to any network the workload teams can route to.
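The default-deny tier model reduces to an explicit allow-list in which anything not listed is refused. The sketch below models that evaluation only; the tier names follow the web/app/data example above, while the port numbers and the `is_allowed` function are hypothetical and do not represent actual NSG or ASG rule syntax.

```python
# Explicit tier-to-tier allow-list: (source tier, destination tier, destination port).
# Ports are illustrative assumptions.
ALLOWED_FLOWS = frozenset({
    ("web", "app", 8443),
    ("app", "data", 5432),
})


def is_allowed(src_tier: str, dst_tier: str, port: int) -> bool:
    """Default-deny: a flow passes only if it is explicitly listed."""
    return (src_tier, dst_tier, port) in ALLOWED_FLOWS
```

Note that the web tier cannot reach the data tier directly, and that allowed flows are directional: listing web-to-app does not open app-to-web.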

Azure Network Security Layers (Deep Dive)

AWS network security (deep dive)

AWS realises the identical intent in its own primitives. Customer traffic enters through an Application Load Balancer fronted by AWS WAF in the ingress VPC, running the same OWASP-prevention rule set as the Azure edge so both clouds present an equivalent bar. All traffic between VPCs and to the internet is steered by the regional Transit Gateway into a dedicated inspection VPC, where AWS Network Firewall applies Suricata-compatible IPS rules and domain-based egress filtering. Appliance mode is enabled on the inspection VPC’s Transit Gateway attachment; that setting is what guarantees symmetric, stateful inspection of cross-VPC and egress flows, because return traffic is examined on the same engine that saw the request. Inside each VPC, security groups and network ACLs enforce default-deny: security groups express the stateful tier-to-tier allow-list and NACLs add a stateless subnet-level backstop, again sized to keep the cardholder data environment and the SWIFT secure zone segmented and provable. Data, key and ledger services are exposed only through VPC endpoints (PrivateLink), never public endpoints, with endpoint policies restricting which principals and accounts may use them. Administrative access uses AWS Systems Manager Session Manager rather than bastion hosts or open SSH: no inbound management ports, every session brokered, recorded and shipped to the immutable Log Archive. As on Azure, the two binding rules hold: no unrestricted east-west traffic between accounts or tiers, and all egress policy-controlled and logged through the central Network Firewall.

AWS Network Security Layers (Deep Dive)

Compliance, data residency and control mapping

A multinational universal bank does not answer to one regulator but to an overlapping mesh of them, and the architecture has to satisfy all of them at once without forking into a dozen incompatible estates. The control plane is therefore designed so that a single set of controls maps to many regulatory drivers simultaneously — one segmentation boundary serving PCI-DSS and SWIFT, one key-custody model serving GLBA, GDPR and the EBA cloud-outsourcing guidance — because a bank cannot maintain a separate control estate per framework and still move at zero-downtime speed. The regulatory reality this design is built against is explicit: PCI-DSS v4.0.1 (mandatory since 31 March 2025) governing the cardholder data environment; PSD2 with Strong Customer Authentication and its PSD3/PSR successor for European payment services; DORA (in force 17 January 2025) imposing ICT risk management, third-party and concentration-risk controls, resilience testing including threat-led penetration testing, major-incident reporting and tested exit plans; UK Operational Resilience under the PRA and FCA (31 March 2025) with its Important Business Services, impact tolerances and Critical Third Party regime; Basel III/IV with BCBS 239 for risk-data aggregation and reporting; SOX sections 302 and 404 for IT general controls over listed entities; AML/KYC/CDD and sanctions screening under the FATF 40 Recommendations; GLBA Safeguards for US customer non-public information; GDPR with Schrems II and the EU Data Act driving data residency and transfer-impact; FFIEC mapped to NIST CSF 2.0 for US-supervised institutions; and the EBA/ECB cloud-outsourcing expectations for tested exit, concentration and approved jurisdictions that underpin the multi-cloud rationale in the first place.

Underpinning the mapping is a four-band data classification (Public, Internal, Confidential and Restricted) where Restricted captures cardholder data, personal data, the account and general ledger, and market-sensitive information, and the band an asset carries determines its encryption, residency, key custody and access controls. Key custody is the sharpest expression of the model and is deliberately graduated by sensitivity. The default for regulated data is BYOK against a customer-controlled HSM (Azure Key Vault Managed HSM and AWS CloudHSM, both FIPS 140-validated) so the bank, not the provider, holds the root of trust. For the most sensitive classes the design escalates to HYOK / External Key Store (XKS), where plaintext keys never leave bank-controlled hardware at all and the cloud must call back to the bank to perform a decrypt, structurally neutralising foreign-government access demands under instruments such as the CLOUD Act. And for data that must be processed rather than merely stored under those keys, confidential computing (Azure confidential VMs and AWS Nitro Enclaves) keeps it encrypted in use, closing the last gap where plaintext would otherwise be exposed in memory. Together these mean the bank can give an honest, evidenced answer to the hardest sovereignty question a regulator asks: who can read this data, and the answer is only the bank.

Compliance & Control Mapping — Regulation → Control → Tooling

The control matrix below is the operational heart of the compliance posture: it maps each control domain to the bank’s concrete, named control, to the security frameworks that shape it, to the regulation that drives it, and to the team accountable for it. The value of expressing it this way is that no control is orphaned and no regulation is unmapped — every row can be defended to an auditor with a real tool and a named owner rather than a policy aspiration.

| Control domain | This bank’s control | Framework refs (PCI / ISO / NIST / CIS) | Regulation driver | Owner |
|---|---|---|---|---|
| Identity & access | Entra ID + Okta Conditional Access, PIM JIT, phishing-resistant MFA, zero standing privilege | PCI Req 7–8; ISO A.5/A.8; NIST CSF PR.AA; CIS 5–6 | PSD2/SCA, SOX, GLBA, FFIEC | Cloud Security & Identity |
| Privileged access | PIM time-boxed approval, PAWs, two break-glass, Session Manager / Bastion brokered access | PCI Req 7; ISO A.8.2; NIST PR.AA-05; CIS 5 | SOX ITGC, FFIEC, DORA | Cloud Security & Identity |
| Network segmentation | Azure Firewall Premium / AWS Network Firewall, NSG/SG default-deny, segmented PCI CDE + SWIFT secure zone | PCI Req 1.4 / 11.4; ISO A.8.20-22; NIST PR.IR; CIS 12-13 | PCI-DSS v4.0.1, SWIFT CSP | Network & Connectivity |
| Edge & app protection | Third-party CDN/DNS/WAF + App Gateway WAF v2 / AWS WAF, FAPI 2.0, mTLS, origin cloaking | PCI Req 6.4; ISO A.8.26; NIST PR.PS; CIS 13 | PCI-DSS, Open Banking, PSD2 | Application Enablement |
| Data-at-rest & key custody | BYOK in Managed HSM / CloudHSM; HYOK / XKS for most sensitive; confidential computing in use | PCI Req 3; ISO A.8.24; NIST PR.DS; CIS 3 | GLBA, GDPR/Schrems II, EBA/ECB | Cloud Security & Identity |
| Data residency & sovereignty | Approved-region/jurisdiction guardrails, region-pinned placement, transfer-impact assessment | ISO A.5.34; NIST GV.SC; CIS 3 | GDPR + EU Data Act, EBA/ECB | Cloud Centre of Excellence |
| Threat detection & SIEM | Microsoft Sentinel correlation, SOC playbooks, 24×7 monitoring across both clouds | PCI Req 10–11; ISO A.8.15-16; NIST DE.AE/DE.CM; CIS 8 | DORA incident reporting, FFIEC, SOX | SOC |
| Audit logging & retention | Immutable Log Archive (S3 Object Lock / Azure immutable storage), centralised, tamper-evident | PCI Req 10.5; ISO A.8.15; NIST PR.PS-04; CIS 8 | SOX 404, AML record-keeping, DORA | Cloud Operations |
| Posture & exposure mgmt | Wiz CSPM/CNAPP posture ≥ 85 + Wiz Code, attack-path and misconfiguration scanning | ISO A.8.8/A.8.9; NIST ID.RA / PR.PS; CIS 4-7 | DORA ICT risk mgmt, FFIEC | Cloud Security & Identity |
| Endpoint & workload runtime | CrowdStrike Falcon EDR + runtime protection, patch SLA critical 7 d / high 30 d | PCI Req 5–6; ISO A.8.7/A.8.8; NIST PR.PS / DE.CM; CIS 4-10 | GLBA Safeguards, DORA, FFIEC | Cloud Operations |
| Risk-data aggregation | BCBS 239 data lineage via Microsoft Purview, golden-source governance, mandatory classification tags | ISO A.5.12; NIST GV.SC / ID.AM | Basel III/IV + BCBS 239 (RDARR) | Cloud Platform Engineering |
| Operational resilience | DORA resilience testing incl. TLPT (≥ every 3 yrs), chaos/game-days, tested multi-cloud exit plans | ISO A.5.29-30; NIST GV.SC / RC.RP | DORA, UK Operational Resilience, EBA/ECB | Cloud Centre of Excellence |
| Financial-crime screening | Real-time AML sanctions/PEP/adverse-media screening (sub-300 ms inline), case management, SAR/STR | ISO A.5.34; NIST ID.RA | AML/KYC + FATF 40 Recommendations | Cloud Operations |

The residency table closes the loop between classification and enforcement. Data residency is enforced structurally, not by policy memo — the approved-region guardrails inherited down both the Azure management-group tree and the AWS organisational-unit tree mean a workload simply cannot place Restricted data outside its permitted geography, and the key-custody escalation means that even where data transits a provider’s infrastructure, the plaintext is reachable only through bank-controlled keys. Each class below carries an explicit residency rule and the mechanism that makes it real.

Data class | Examples | Residency rule | Enforcement
Restricted (cardholder) | PAN, track data, CVV within the CDE | In-region CDE only; minimise via tokenisation (Visa VTS / Mastercard MDES) | PCI-segmented network, BYOK in Managed HSM/CloudHSM, region-pinned guardrails, public access disabled
Restricted (account & ledger) | General ledger, balances, postings, SoR records | Pinned to system-of-record home geography; no cross-border replication of balances | SoR single-homed, HYOK/XKS keys, Private Endpoint/PrivateLink-only access, immutable audit
Restricted (market-sensitive) | Orders, positions, trade and execution data | In-region, segregated; access on need-to-know with information barriers | Segmented trading domain, confidential computing, least-privilege RBAC, full audit logging
Confidential (PII) | Customer personal data, KYC records, claims | Resident in the customer’s jurisdiction; transfer-impact assessed before any movement | Approved-region guardrails, CMK encryption, Purview classification, consent and access controls
Internal / Public | Reference data, marketing content, public rates | No residency constraint; protected against tampering and unauthorised change | Standard encryption, integrity controls, WAF protection at the edge
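The residency rules can be read as a placement predicate evaluated before anything is created. The sketch below is a minimal illustration under stated assumptions: the class labels mirror the table, but the `PINNED_CLASSES` set, the geography labels and the `placement_allowed` function are hypothetical and are not Azure Policy or Service Control Policy syntax. The transfer-impact-assessment exception for PII is deliberately left out.

```python
from dataclasses import dataclass

# Hypothetical labels for the classes the table pins to a home geography.
PINNED_CLASSES = {
    "restricted-cardholder",
    "restricted-ledger",
    "restricted-market",
    "confidential-pii",
}


@dataclass(frozen=True)
class Placement:
    data_class: str        # classification tag stamped at vend time
    home_geography: str    # jurisdiction the data belongs to, e.g. "EU"
    target_geography: str  # geography of the region being requested


def placement_allowed(p: Placement) -> bool:
    """Return True when the requested placement respects the residency rule."""
    if p.data_class in PINNED_CLASSES:
        # Pinned classes may only land inside their home geography.
        return p.target_geography == p.home_geography
    # Internal / Public data carries no residency constraint.
    return True
```

In the architecture itself this decision is made by inherited guardrails rather than application code; the function only shows the shape of the rule those guardrails encode.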

Application onboarding for 100+ applications

An estate of more than a hundred applications across both clouds cannot be onboarded by hand, and a bank that promises no maintenance windows cannot afford a hundred subtly different security postures to patrol. The onboarding model therefore treats a new application environment as a product the platform vends on demand, and the load-bearing decision is that governance is applied at the moment of creation, never retrofitted afterwards. The journey begins where every other piece of bank IT governance begins — a ServiceNow request — which captures the application’s line of business, data classification, criticality tier, and target environments, routes it for the approvals the change policy requires, and only then triggers the automated vend. Nothing is provisioned by a console click: the request becomes the authoritative record, and the same workflow that opens the environment later carries its configuration item, its runbook, and its change history.

Azure Application Onboarding & Subscription Vending (Deep Dive)

On Azure the approved request drives the subscription vending factory, which places the new subscription under the correct management group (Corp or Online, production or non-production) so that the FSI policy hierarchy reaches it by inheritance before a single resource exists inside it. It then assigns RBAC against the requesting team’s Entra groups, stamps the mandatory owner, cost-centre, data-classification and criticality tags, peers the subscription into the regional hub VNet, and registers it for IPAM allocation so its address space is carved from the global plan rather than guessed. By the time the team receives credentials the subscription is already a governed citizen of the estate: policy-bound, network-attached, tagged, and visible to the platform, comfortably inside the target of vending in under one business day.
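Two of the vending decisions described here, mandatory tagging and management-group placement, are simple enough to sketch. The tag names come from the text; the `missing_tags` and `management_group` helpers and the path naming are hypothetical, chosen only for illustration.

```python
# Mandatory tags named in the text.
MANDATORY_TAGS = ("owner", "cost-centre", "data-classification", "criticality")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """List mandatory tags that are absent or blank in a vend request."""
    return [t for t in MANDATORY_TAGS if not tags.get(t, "").strip()]


def management_group(exposure: str, production: bool) -> str:
    """Pick the management-group path (hypothetical naming) for a new subscription."""
    if exposure not in ("corp", "online"):
        raise ValueError(f"unknown exposure: {exposure!r}")
    env = "prod" if production else "nonprod"
    return f"landing-zones/{exposure}/{env}"
```

A vend request with any missing tag would be rejected before a subscription is created, which is what keeps governance at the moment of creation rather than retrofitted.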

AWS Account Factory Onboarding (Deep Dive)

On AWS the symmetrical path runs through Account Factory for Terraform. AFT turns the approved request into a reviewed pull request on Bitbucket, and the GitOps pipeline provisions an account that lands under the right organisational unit with its Service Control Policies already inherited, federated through Okta into IAM Identity Center, attached to the regional Transit Gateway, pinned to approved regions, and wired to stream its audit trail to the immutable Log Archive. That an Azure subscription and an AWS account arrive carrying the same baseline, requested the same way, is what makes multi-cloud consistency real rather than aspirational — and it is the evidence the bank offers regulators that it can stand up, or rebuild, a clean compliant environment on either provider on demand.

Whichever cloud answers, the environment is baselined at creation with a fixed set of controls the workload team neither chooses nor can opt out of. Centralised logging is wired to the immutable archive; monitoring is enrolled into Dynatrace and the native platform telemetry; Wiz and Wiz Code begin continuous posture and IaC scanning while CrowdStrike Falcon is registered to protect every workload that will run there; backup vaults and policies are attached to the declared tier; private DNS zones and Private Link/PrivateLink paths are established so the environment is reachable internally and exposed to nothing externally by default; and a CI/CD bootstrap lays down the Bitbucket repository, the Terraform and Ansible pipeline skeleton, and the deployment identities. The team inherits a secured, observed, recoverable foundation on day zero rather than assembling one over weeks.

On top of that common floor the factory applies a blueprint per application type, because a public web front end, a microservice API, an integration adapter, an SAP-adjacent component, an analytics dataset, and an overnight batch job have genuinely different network exposure, scaling, data-handling and resilience needs, and forcing them through one template would over-provision the simple ones and under-protect the sensitive ones. Every blueprint, regardless of type, instantiates production and non-production as separated environments — distinct subscriptions or accounts with their own guardrails, so that a non-production change can never reach into a production payments path.

Blueprint | Typical landing-zone shape | Resilience default
Web | Edge-fronted front end behind regional WAF, blue-green slots, CDN-cacheable | Tier-1, zero-downtime release
API | Private-Link service behind the API gateway, FAPI/mTLS where partner-facing, canary rollout | Tier-1, progressive delivery
Integration | Event-backbone producers/consumers and adapters, ISO 20022 canonical mapping | Tier-1/Tier-2 by flow
SAP-adjacent | High-memory-certified compute peered to the SAP landing zone over private connectivity | Tier-1, paired-region DR
Data | Lakehouse workspace with classification and lineage enrolled from creation | Tier-2, governed band
Batch | Scheduled/queue-driven compute with idempotent, restartable jobs | Tier-2/Tier-3 by criticality

An environment is provisioned long before it is allowed to carry customer traffic, and the gate between the two is a defined set of go-live criteria that ServiceNow enforces rather than leaves to goodwill. The application cannot enter production until it has a completed threat model, a passed security baseline, a Wiz posture score at or above the bank’s floor of 85, an established Dynatrace performance baseline against which future regressions will be judged, a confirmed DR classification with its RTO/RPO targets, and, most insisted upon here, a backup whose restore has actually been tested, not merely scheduled. Finally it must exist as a ServiceNow configuration item with an attached operational runbook, so that the moment it goes live the Service Desk and SOC know it exists, who owns it, and how to operate and recover it. These gates are the practical expression of the target that a new application is production-ready in under two weeks via a blueprint: fast because the path is paved, safe because it is gated.
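The go-live criteria amount to a checklist that either passes in full or names what is missing. A minimal sketch, assuming a hypothetical `Readiness` record assembled from the tools named above (how each field is populated is outside the scope of this example):

```python
from dataclasses import dataclass

WIZ_POSTURE_FLOOR = 85  # floor stated in the text


@dataclass(frozen=True)
class Readiness:
    threat_model_done: bool
    security_baseline_passed: bool
    wiz_posture_score: int
    performance_baseline_set: bool
    dr_tier_confirmed: bool
    restore_tested: bool   # a scheduled backup alone does not count
    ci_with_runbook: bool


def failed_gates(r: Readiness) -> list[str]:
    """Return the go-live criteria that are not yet met, in checklist order."""
    checks = {
        "threat model": r.threat_model_done,
        "security baseline": r.security_baseline_passed,
        "wiz posture": r.wiz_posture_score >= WIZ_POSTURE_FLOOR,
        "performance baseline": r.performance_baseline_set,
        "dr classification": r.dr_tier_confirmed,
        "tested restore": r.restore_tested,
        "configuration item and runbook": r.ci_with_runbook,
    }
    return [name for name, ok in checks.items() if not ok]
```

An empty list means the production gate can open; anything else is returned to the requesting team as the list of outstanding items.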

SAP landing zone

SAP is treated as a business-critical shared domain in its own right rather than as one more application, because so many enterprise shared services — finance, controlling, procurement, parts of HR and the general-ledger feeds that reconcile against the core — depend on it, and an SAP outage propagates into month-end, regulatory reporting, and payment runs. The landing zone is engineered to Tier-1 expectations, and the load-bearing principle is that SAP availability is built into the topology rather than bolted on through maintenance windows.

SAP Landing Zone — HA/DR & Integration

Within the primary region, production SAP is stretched across two availability zones so that the loss of a single data hall is survivable without data loss. The central-services layer runs as a clustered ASCS/ERS pair with the enqueue replication server in the second zone, so the enqueue lock table — SAP’s single most fragile point — survives a zone failure intact. The database tier runs HANA System Replication in synchronous mode between the two zones, which holds the in-region recovery point at effectively zero, on high-memory certified compute sized to the bank’s HANA footprint. Around production, the QA, wider non-production and sandbox systems are kept in deliberately separated environments with their own guardrails and their own, lower, resilience expectations, so that a transport being tested in QA or an experiment in sandbox can never consume capacity from, or reach into, the production landscape.

Regional resilience extends into the paired DR region, where a warm-standby HANA is maintained through asynchronous replication. Asynchronous is the deliberate choice across the longer inter-region distance — synchronous at that range would tax production latency for little benefit — and it aligns SAP production to the Tier-1 objectives of a two-hour RTO and a fifteen-minute RPO, with the warm standby making recovery a controlled promotion rather than a cold rebuild. All this connectivity — between zones, to the DR region, and outward to the bank — runs over private connectivity on ExpressRoute and Direct Connect, never the public internet, keeping SAP traffic inside the trust boundary end to end.

SAP earns the “shared domain” label through its integrations, each made an explicit, private, governed path. It integrates with identity so access is brokered through the same Entra and Okta control plane and dual-control and segregation-of-duties are enforced consistently; with the banking platforms and APIs so postings, payments and reconciliation flow against the core through the canonical event and API layer rather than point-to-point spaghetti; with the data platform, into which SAP feeds the finance and master-data records that downstream risk and regulatory aggregation depend upon; and with ServiceNow for change, incident and configuration management, so the SAP landscape lives under the same operational discipline as the rest of the estate.

Data and integration platform

If the channels are the bank’s face and the core its ledger, the data and integration platform is its nervous system — built on a single decisive premise: that data moves as a stream of events first and comes to rest in a governed lakehouse second, so operational integration and analytical insight are served from the same well-governed flow rather than from divergent extracts nobody can reconcile. The backbone is an event-driven layer — Event Hubs and Kafka on the Azure side, Kinesis and Kafka on the AWS side — onto which the core, the payment hub, the channels and SAP publish business events in the ISO 20022 canonical model. Producers and consumers are decoupled, which is exactly what lets the bank deploy and release components independently without coordinated downtime: a consumer can be upgraded, replayed, or replaced while events keep flowing.
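The decoupling claim rests on one property of a log-style backbone: events are retained and each consumer tracks its own position, so a consumer can be replaced or rewound without the producer noticing. The toy `EventLog` below is an in-memory stand-in written for illustration; it is not the Kafka, Event Hubs or Kinesis client API.

```python
class EventLog:
    """Append-only log with per-consumer offsets (toy model of a retained topic)."""

    def __init__(self):
        self._events: list[dict] = []
        self._offsets: dict[str, int] = {}

    def publish(self, event: dict) -> int:
        """Append an event and return its offset; producers never wait on consumers."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer: str) -> list[dict]:
        """Return events the consumer has not yet seen and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch

    def rewind(self, consumer: str, offset: int = 0) -> None:
        """Reset a consumer's position so it can replay retained events."""
        self._offsets[consumer] = offset
```

An upgraded consumer can call `rewind` and rebuild its state from the retained events while producers keep publishing, which is the behaviour that removes the need for coordinated downtime.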

Data & Integration Platform — Medallion Lakehouse + BCBS 239 Governance

Data that lands from the backbone settles into a medallion lakehouse whose bands encode increasing trust. The raw band captures events exactly as received, immutable and replayable. The curated band cleanses, conforms and joins that data into validated, query-ready models. The governed band exposes certified, access-controlled products fit for regulatory and executive consumption. Alongside these as a first-class concern is a master-data band holding the bank’s golden records — customers, accounts, products and counterparties — the single authoritative versions against which every downstream calculation resolves, so a customer or a counterparty means exactly one thing across risk, finance, and the channels.

Band | Role | Trust property
Raw | Events as received, immutable | Replayable system-of-truth for arrival
Curated | Cleansed, conformed, joined | Validated and query-ready
Governed | Certified, access-controlled products | Fit for regulatory and BI consumption
Master data | Golden customers, accounts, products, counterparties | Single authoritative reference

Wrapping the lakehouse is a governance band that is not optional polish but the platform’s reason for being trustworthy: a catalogue so every data product is discoverable and owned, lineage tracing each field back through every transformation to its originating event, classification that tags Public, Internal, Confidential and Restricted data so residency and access controls follow the data automatically, and residency enforcement that keeps regulated data inside its approved geography. This band exists explicitly to satisfy BCBS 239 risk-data aggregation: the expectation that a G-SIB can prove the lineage of its risk figures, identify the golden source of every material data element, and demonstrate the accuracy, completeness and timeliness of aggregated risk data is met structurally — lineage from the catalogue, golden source from the master-data band, and accuracy/completeness/timeliness from the validation and freshness controls applied as data is promoted through the bands.

From the governed and master-data bands the platform serves two distinct consumer worlds. Internally it feeds analytics and business-intelligence, giving risk, finance, compliance and the lines of business certified, lineage-backed data for reporting and modelling. Externally it underpins secure partner and open-banking API exposure, where governed data products are published through the FAPI 2.0 gateway under OAuth2/OIDC, mTLS and consent management, so partners and open-banking participants receive exactly the data they are entitled to, governed by the same classification and residency rules that apply inside the bank. The same platform thus satisfies the regulator looking for provable risk data and the partner ecosystem looking for safe, contractual access — from one governed source of truth rather than two competing ones.

Zero-downtime release patterns

The headline mandate of this programme is unambiguous: critical banking services carry no maintenance windows, and platform changes — application releases, schema migrations, infrastructure upgrades — happen with no customer-visible downtime. A bank that asks its customers to tolerate a Sunday-night outage for a payments upgrade has already conceded the design argument. The zero-downtime operating model is therefore not a deployment convenience bolted on at the end; it is a property the entire estate is engineered to preserve, and it rests on two load-bearing ideas that recur through every pattern below.

The first is that deployment is not release. New code reaches production servers long before any customer traffic exercises its new behaviour. We decouple the two with feature flags: code ships dark, behind a flag defaulted off, and is activated by a runtime configuration change rather than a redeployment. This lets us deploy continuously and at low risk during the day, then turn behaviour on for a controlled cohort — a single cell, an internal staff segment, one percent of customers — and observe before widening. The second idea is that every production release is reversible. No release reaches the customer-facing tiers unless it can be withdrawn cheaply and quickly: a flag flipped off, traffic shifted back to the previous version, a canary aborted. Reversibility is an entry condition for change, not a contingency we improvise once an incident is underway.
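A percentage rollout behind a flag is usually implemented by hashing the customer into a stable bucket, so the same customer stays in the cohort as the percentage widens. The sketch below assumes that approach; the function names and the 10,000-bucket granularity are illustrative choices, not a description of any particular flag service.

```python
import hashlib


def bucket(flag: str, customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 10000) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{customer_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(flag: str, customer_id: str, rollout_percent: float) -> bool:
    """The flag is on when the customer's bucket falls below the rollout threshold."""
    return bucket(flag, customer_id) < int(rollout_percent * 100)
```

Setting `rollout_percent` to 0 is the kill switch: behaviour is withdrawn by a configuration change, with nothing redeployed. Because the bucket is deterministic, anyone enabled at one percent remains enabled at fifty.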

These principles are realised through a small, deliberately constrained set of patterns, each matched to a workload class. Blue-green deployment governs internet- and customer-facing services: a fully provisioned green environment runs the new version alongside the live blue environment, is smoke-tested in isolation, and is then cut over by traffic shifting at the global edge, with blue retained warm as an instant rollback target. Canary / progressive delivery, orchestrated through Argo Rollouts or Flagger with Istio handling traffic management, exposes the new version to a rising fraction of traffic in steps, with each step gated on KPIs (error rate, p95/p99 latency, saturation, business signals such as authorisation success). A breach of any guardrail triggers KPI-based automatic rollback with no human in the critical path; the operator is informed, not consulted. Rolling updates serve stateless and managed workloads where instances can be replaced incrementally behind a load balancer without a parallel environment. For the most critical channels we use active-active cutover: the new version is brought up in one region of an active-active pair and validated under live load while its partner continues serving, so the cutover is a traffic-weight change rather than a switch with a gap.
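The canary logic reduces to a loop: raise the traffic weight, observe the KPIs, and retreat on the first breach. The sketch below models only that control flow. The threshold values are placeholders invented for the example, and `observe` stands in for whatever metrics source the rollout controller queries; none of this is Argo Rollouts or Flagger configuration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Kpis:
    error_rate: float      # fraction of failed requests
    p99_latency_ms: float
    auth_success: float    # business signal: authorisation success ratio


@dataclass(frozen=True)
class Guardrails:
    # Placeholder thresholds, assumed for illustration only.
    max_error_rate: float = 0.01
    max_p99_latency_ms: float = 500.0
    min_auth_success: float = 0.995


def healthy(k: Kpis, g: Guardrails) -> bool:
    """True when every KPI is inside its guardrail."""
    return (k.error_rate <= g.max_error_rate
            and k.p99_latency_ms <= g.max_p99_latency_ms
            and k.auth_success >= g.min_auth_success)


def run_canary(steps: Iterable[int], observe: Callable[[int], Kpis],
               g: Guardrails = Guardrails()) -> tuple[str, int]:
    """Advance the canary weight step by step; roll back on the first breach."""
    for weight in steps:
        if not healthy(observe(weight), g):
            return ("rolled_back", 0)  # all traffic returns to the stable version
    return ("promoted", 100)
```

The rollback branch involves no approval step, matching the rule that the operator is informed, not consulted.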

Underpinning all of these for stateful services is the discipline that most often defeats zero-downtime ambitions: schema change. We never run a blocking migration that holds long locks against a live ledger or account store. Instead we apply database expand/contract (the parallel-change pattern): expand the schema by adding the new structure additively; dual-write to old and new shapes while backfilling historical data; run both shapes in parallel until the new path is proven; then contract by removing the old structure once nothing reads it. Each schema step is a backwards-compatible, near-instant metadata operation (sub-50 ms, using tooling such as pgroll) rather than a long table rewrite, and the backfill proceeds in the background while both shapes remain valid, so migrations never block traffic and are invisible to the customer. Application code and schema deploy independently, each tolerant of the other’s previous version, which is precisely what lets deploy and release stay decoupled.
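The expand, dual-write and backfill phases can be walked through on a toy table. This sketch uses Python's built-in `sqlite3` purely so it runs anywhere; the `customer` table, its columns and the name-splitting rule are invented for the example, and a production migration would go through the bank's migration tooling rather than hand-written statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
conn.executemany("INSERT INTO customer (id, full_name) VALUES (?, ?)",
                 [(1, "Ada Lovelace"), (2, "Alan Turing")])

# 1. Expand: add the new columns additively; existing readers are unaffected.
conn.execute("ALTER TABLE customer ADD COLUMN given_name TEXT")
conn.execute("ALTER TABLE customer ADD COLUMN family_name TEXT")


def split(full_name):
    """Split on the first space into (given, family)."""
    given, _, family = full_name.partition(" ")
    return given, family


def write_customer(cid, full_name):
    """2. Dual-write: every new write populates both the old and the new shape."""
    conn.execute(
        "INSERT INTO customer (id, full_name, given_name, family_name) VALUES (?, ?, ?, ?)",
        (cid, full_name, *split(full_name)))


def backfill(batch_size=1):
    """3. Backfill historical rows in small batches; return the number updated."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, full_name FROM customer WHERE given_name IS NULL LIMIT ?",
            (batch_size,)).fetchall()
        if not rows:
            return total
        for cid, full_name in rows:
            conn.execute(
                "UPDATE customer SET given_name = ?, family_name = ? WHERE id = ?",
                (*split(full_name), cid))
        total += len(rows)


write_customer(3, "Grace Hopper")
backfilled = backfill()
# 4. Contract: full_name is dropped only once nothing reads it (not shown).
```

Small batches keep each write transaction short, which is what stops the backfill from competing with live traffic.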

The common thread is traffic shifting plus health-based rollback for every major release. Whether the unit of change is a blue-green environment, a canary weight, or an active-active region, the release advances by moving a controlled proportion of traffic and retreats automatically the moment health signals degrade — and because deployment is separated from release by flags, even an already-deployed change can be neutralised in seconds without redeploying anything.

Zero-Downtime Release Patterns — Blue-Green, Canary, Cell-by-Cell, Expand/Contract

Active-active multi-region data topology

Zero-downtime release depends on a topology that contains failure as readily as it contains change. The organising principle is cell-based architecture with strict bulkheading: the estate is composed of independent, full-stack cells, each a complete vertical slice of the platform. Customers are partitioned across cells by a customer-ID shard, and — this is the decisive constraint — there is no cross-cell replication. A cell is a blast-radius boundary; nothing it does can corrupt or saturate a sibling. In front of the cells sits a trivially simple, highly available cell router whose only job is to map a customer identifier to its owning cell. The router is deliberately thin because it is the one shared component on the critical path, and its simplicity is what makes it dependable. Cell size is bounded and tested, never allowed to grow past its proven envelope.
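A cell router can stay thin by doing nothing more than a hash and a table lookup. The sketch below assumes one common design, a fixed number of virtual buckets mapped to cells, so that adding a cell reassigns whole buckets instead of rehashing every customer. The bucket count, cell names and `CellRouter` class are hypothetical, and moving the customer data that belongs to a reassigned bucket is out of scope.

```python
import hashlib

NUM_BUCKETS = 1024  # fixed virtual-bucket count (assumption)


def bucket_for(customer_id: str) -> int:
    """Hash a customer ID to a stable virtual bucket."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS


class CellRouter:
    """Maps a customer ID to its owning cell through a bucket-to-cell table."""

    def __init__(self, cells: list[str]):
        if not cells:
            raise ValueError("at least one cell is required")
        # Spread buckets round-robin across the initial cells.
        self._owner = {b: cells[b % len(cells)] for b in range(NUM_BUCKETS)}

    def route(self, customer_id: str) -> str:
        """Return the cell that owns this customer."""
        return self._owner[bucket_for(customer_id)]

    def move_bucket(self, bucket: int, cell: str) -> None:
        """Reassign one bucket, for example when a cell nears its proven envelope."""
        self._owner[bucket] = cell
```

Routing is a pure function of the identifier and a small table, so the router holds no customer state of its own and can be replicated freely.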

This topology pays its largest dividend at release time. Because cells are independent, we deploy cell by cell, so a bad release is contained to the fraction of customers in the first cell it touches rather than the whole population — the cell boundary that limits a fault’s blast radius is the same boundary that limits a release’s. Cells map onto availability zones and regions, and the entire model is replicated across both Azure and AWS, so provider diversity is expressed as additional cells rather than as a bolted-on failover scheme.

Within a cell, the data tier is split by consistency need rather than treated as one homogeneous store, because the two halves of a bank have irreconcilable requirements. The ledger and account balances are CP: strongly consistent, backed by a consensus store using synchronous quorum (Raft/Paxos-class replication, e.g. CockroachDB-, YugabyteDB- or Spanner-class engines) with RPO≈0, and two-phase commit for the rare cross-shard transaction. Money movement demands that a committed posting is durable and globally agreed before acknowledgement; we accept the latency cost of quorum to obtain it. The categorical rule here is that balances are NEVER reconciled by last-write-wins (LWW): a topology that silently drops a conflicting write is acceptable for a UI preference and catastrophic for a ledger. By contrast, non-monetary state (sessions, preferences, counters and similar) runs active-active on CRDT / version-vector stores, which converge mathematically without coordination and so tolerate multi-region writes at low latency. Matching the consistency model to the data class is what lets engagement workloads be genuinely active-active without compromising the system of record.
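To make "converge mathematically without coordination" concrete, here is a grow-only counter (G-Counter), one of the simplest CRDTs. Each replica increments only its own slot and merges by taking the element-wise maximum, so concurrent writes in two regions are both preserved. The replica names are hypothetical, and this is a teaching sketch, not the store the bank would run.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        """Record n local increments in this replica's own slot."""
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        """Fold in another replica's state; safe to repeat and order-independent."""
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


# Two regions accept writes concurrently, then exchange state.
azure, aws = GCounter("azure-region"), GCounter("aws-region")
azure.increment(3)
aws.increment(2)
azure.merge(aws)
aws.merge(azure)
```

Both replicas end at the same total and neither write is lost. A last-write-wins register would have kept one region's value and discarded the other, which is exactly why that approach is barred from balances.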

Above the cells, GSLB with health-based failover steers customers to healthy regions and away from degraded ones, and the design’s overriding obligation is to remove single points of failure in identity, ingress and DNS — the three shared services whose failure would defeat every cell beneath them. Authentication, edge ingress and name resolution are each made redundant across regions and clouds so that no one component can take the institution offline.

Active-Active Multi-Region Data Topology — Cells, CP Ledger, CRDT State

Disaster recovery and resiliency

Resiliency is engineered, not assumed, and the engineering begins by classifying every service into a recovery tier and designing each tier to a stated objective. A flat DR posture either over-invests in services that can tolerate a day’s recovery or, far worse, under-protects the payments path. Each of the four strategic regions is paired with a dedicated DR region in the same geography, in both clouds, so a regional loss in Azure or AWS has a pre-built, in-jurisdiction destination that respects data-residency boundaries.

The recovery strategy is differentiated by tier. Edge and APIs run active-active, recovering by traffic shift rather than by restore. Stateful services where active-active is impractical run active-passive / warm standby, with capacity provisioned and data continuously replicated to the paired region so promotion is fast and predictable. The ledger is the special case: its RPO≈0 is delivered by the same synchronous quorum that gives it strong consistency, so a region loss costs no committed transactions rather than relying on asynchronous replication that could lose the last interval of postings.

Tier | Scope | RTO | RPO
T0 | Identity, connectivity, security control plane | 1 hour | 15 minutes
T1 | Customer platforms, payments, SAP prod, partner/core APIs (ledger RPO≈0 via sync quorum) | 2 hours | 15 minutes
T2 | Internal / analytics | 8 hours | 4 hours
T3 | Everything else | 24 hours | 24 hours
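The tier objectives are easiest to use when they are data that a test harness can compare results against. A minimal sketch, with the figures taken from the tier table and a hypothetical `meets_objectives` helper:

```python
from datetime import timedelta

# (RTO, RPO) per tier, as listed in the tier table.
TIER_OBJECTIVES = {
    "T0": (timedelta(hours=1), timedelta(minutes=15)),
    "T1": (timedelta(hours=2), timedelta(minutes=15)),
    "T2": (timedelta(hours=8), timedelta(hours=4)),
    "T3": (timedelta(hours=24), timedelta(hours=24)),
}


def meets_objectives(tier: str, recovery_time: timedelta, data_loss: timedelta) -> bool:
    """Compare a measured recovery against the tier's RTO and RPO."""
    rto, rpo = TIER_OBJECTIVES[tier]
    return recovery_time <= rto and data_loss <= rpo
```

A DR exercise that measures a 90-minute recovery with five minutes of data loss passes for T1 and fails for T0.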

A recovery objective is only as credible as the last test that proved it. The programme therefore treats DR testing as a standing obligation, not an annual ceremony: regulator-mandated threat-led penetration testing (DORA TLPT) exercises the institution’s resilience against realistic adversaries, and chaos engineering game-days — built on AWS Fault Injection Service (FIS) — deliberately inject regional, zonal and dependency failures to confirm that the cell boundaries, health-based failover and automated rollbacks behave as designed under genuine fault conditions. Resiliency that has not been exercised is a hypothesis; these tests convert it into evidence.

Disaster Recovery & Resiliency Tiers

Disaster recovery runbooks

A recovery target nobody has rehearsed is a guess dressed as a commitment. Impact tolerances and RTO/RPO numbers are meaningful only when a named owner has executed the procedure that meets them, against a realistic scenario, recently enough that the runbook reflects the current estate. The programme’s testing cadence is accordingly explicit: quarterly per-tier tests validate individual recovery procedures, an annual full-region game-day rehearses the loss of an entire region end-to-end, and DORA threat-led penetration testing stresses the institution against adversarial scenarios on the regulator’s mandated cycle. Each runbook below names its trigger, its procedure, its single accountable owner, the tier target it must meet, and — the step most often omitted — how recovery is validated before the incident is declared closed.

Scenario | Trigger | Procedure steps | Owner | Target | Validation
Identity / control-plane recovery | Identity provider or security control plane degraded/unreachable in primary region | Confirm break-glass path; fail identity broker (Okta / Entra ID) and CA policy plane to paired region; restore PIM/privileged access; re-establish connectivity control plane | Cloud Security & Identity | T0: RTO 1h / RPO 15m | Privileged + customer auth succeed via DR region; CA + PIM enforced; break-glass accounts re-sealed
Connectivity / circuit failover | Loss of a carrier or cloud on-ramp (ExpressRoute / Direct Connect) | Verify active-active BGP withdrew failed path; confirm second carrier carrying load; shift inter-region traffic; validate hub routing and DNS resolution | Network & Connectivity | T0: RTO 1h / RPO 15m | End-to-end reachability on surviving circuits; no SPOF remaining; latency within tier
Core-ledger / payments region failover | Loss of the region hosting payments / core APIs | Engage issuer stand-in (STIP) so card auth continues; confirm ledger synchronous-quorum survivors hold committed state (RPO≈0); promote payments hub + core APIs in paired region; reconcile in-flight via idempotency keys | Cloud Operations + Application Enablement | T1: RTO 2h / RPO 15m (ledger RPO≈0) | Authorisations flowing under STIP; no lost postings; idempotent replay shows zero double-posting; rails reconciled
Edge / app cloud-to-cloud failover | Loss of a cloud provider for customer-facing channels | Shift GSLB / global edge weight to the other cloud’s active-active cells; scale surviving cells; confirm WAF + per-channel ingress healthy; widen cell capacity | Cloud Platform Engineering | T1: RTO 2h / RPO 15m | Channels served from surviving cloud; p95 within tolerance; cell router healthy; error budget intact
Data-platform recovery | Loss of analytics / integration platform (event backbone, lakehouse) | Fail event backbone to paired region; replay from retained streams; restore lakehouse from replicated/object storage; re-validate BCBS 239 lineage and golden-source integrity | Application Enablement | T2: RTO 8h / RPO 4h | Streams flowing; risk-data lineage complete and accurate; downstream reports reconcile
Ransomware / region loss | Confirmed data-integrity compromise or total region loss | Isolate affected scope; restore from immutable Object-Lock backups (write-once S3 / Log Archive); rebuild landing zone from Terraform; validate integrity before reconnecting; resume per tier | Cloud Operations + SOC | T1–T3 per service tier | Restored data passes integrity checks; no reintroduced compromise; clean-state attestation before customer traffic resumes
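The ledger failover runbook relies on idempotency keys to reconcile in-flight payments without double-posting. The mechanism can be shown with a toy in-memory ledger; the `Ledger` class and the `replay` helper are illustrative only, and a real implementation would persist the applied keys in the same transaction as the posting.

```python
class Ledger:
    """Toy posting store that applies each idempotency key at most once."""

    def __init__(self):
        self.balances: dict[str, int] = {}
        self._applied: set[str] = set()

    def post(self, idempotency_key: str, account: str, amount_minor: int) -> bool:
        """Apply a posting; return False when the key was already applied."""
        if idempotency_key in self._applied:
            return False
        self.balances[account] = self.balances.get(account, 0) + amount_minor
        self._applied.add(idempotency_key)
        return True


def replay(ledger: Ledger, in_flight: list[tuple[str, str, int]]) -> int:
    """Re-submit in-flight postings after failover; return how many were duplicates."""
    return sum(0 if ledger.post(*p) else 1 for p in in_flight)
```

After failover, every request whose outcome is unknown is simply submitted again. Postings that had already committed are recognised by their key and skipped, which is what the "zero double-posting" validation step verifies.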

Terraform and Ansible multi-stage CI/CD

In a zero-downtime estate the delivery pipeline is not a convenience — it is the mechanism by which change is made safe. Every platform mutation, from a network rule to a payments-microservice release, flows through one governed path, because the only way to guarantee no customer-visible downtime is to make change itself deterministic, reviewable, and reversible. The bank standardises on Terraform for infrastructure-as-code and Ansible for configuration-as-code, with Bitbucket as the single source of truth for both, so that what is running in any region of either cloud is exactly what is described, reviewed, and approved in version control — never a console click, never a snowflake.

Terraform + Ansible Multi-Stage CI/CD

A change begins as a pull request in Bitbucket and is met by a battery of automated gates before a human ever sees it. The Terraform stage runs fmt and validate for correctness, linting for convention, and policy-as-code security scanning through tfsec and Checkov so that an insecure construct — a public storage bucket, an over-broad security-group rule, an unencrypted volume — fails the build rather than reaching an environment. A secrets scan runs in the same stage, because the institution’s hard rule is that credentials never enter the repository and a leaked key fails the pipeline outright. The Ansible stage applies the same discipline to configuration: idempotent roles that converge a host to its declared state no matter how many times they run, driven by dynamic inventory sourced from each cloud’s APIs so that the set of targets is discovered at run time rather than maintained by hand. Both stages produce immutable, versioned artefacts — a planned Terraform change set and a tested role bundle — and it is those artefacts, not the source, that are promoted forward.

Promotion proceeds through four environments (Dev → UAT → Staging → Production) on a strict artefact-based promotion model: the exact artefact validated in one environment is the one advanced to the next, so a release is tested as the thing that will run, not as a re-build that might drift. Promotion into Dev and UAT is automated; the two promotions that matter most are not. Manual approval gates sit before Staging and before Production, and each is tied to a ServiceNow change record: the gate cannot be cleared without an approved change, giving the bank the auditable separation of duties that SOX ITGCs and DORA change-management expectations demand. State is partitioned to match the blast radius the design is protecting: a separate, locked, encrypted remote state file per environment, per region, and per scope, so that an operation against one cell’s network in one region can never contend with, or corrupt, another’s. State locking serialises concurrent applies; encryption at rest protects what the state reveals.
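Two rules in this paragraph lend themselves to a short sketch: only the already-validated artefact may advance, and the manual gates require an approved change record. The example below assumes the artefact is identified by its SHA-256 digest. The `can_promote` function, the state-key layout and the change-number format are hypothetical; the real checks live in the pipeline and in ServiceNow.

```python
import hashlib
from typing import Optional

STAGES = ["dev", "uat", "staging", "production"]
MANUAL_GATES = {"staging", "production"}  # gates named in the text


def digest(artefact: bytes) -> str:
    """Content hash used to prove the promoted artefact is the tested one."""
    return hashlib.sha256(artefact).hexdigest()


def state_key(env: str, region: str, scope: str) -> str:
    """One remote state object per environment, region and scope (hypothetical layout)."""
    return f"{env}/{region}/{scope}/terraform.tfstate"


def can_promote(target: str, artefact: bytes, validated_digest: str,
                approved_change: Optional[str] = None) -> bool:
    """Allow promotion only for the validated artefact, with a change record at manual gates."""
    if target not in STAGES:
        raise ValueError(f"unknown stage: {target!r}")
    if digest(artefact) != validated_digest:
        return False  # a rebuilt artefact is not the one that was tested
    if target in MANUAL_GATES and not approved_change:
        return False  # no approved change record, no promotion
    return True
```

Keying state by environment, region and scope means that a lock taken for one cell's network never serialises work on another's.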

What ties this pipeline to the zero-downtime mandate is what it actually executes at the Production gate. The pipeline does not push big-bang releases; it drives the progressive-delivery patterns described earlier in this document — blue-green for customer-facing channels, canary with KPI-based automatic rollback for everything that supports it — and it orchestrates expand-contract database migrations so that schema change is a lock-free, parallel-change sequence (add the new, dual-write and backfill, run both, drop the old) rather than a maintenance window. Because every release is shipped as a reversible, traffic-shiftable artefact behind a gate, a bad change is contained to a fraction of cells and rolled back automatically on a breached health signal. The pipeline, in other words, is where the institution’s promise of releases without downtime stops being an aspiration and becomes an enforced property of the delivery system.

DevSecOps software supply chain

The pipeline secures how change is delivered; the software supply chain secures what is delivered and where every part of it came from. For an institution running 100+ applications across two clouds, the governing principle is end-to-end traceability from a line of source to the artefact running in production — every component identified, scanned, signed, and accountable, with no unverified provenance anywhere in the path. The bank organises its repositories to make that traceability structural rather than aspirational.

DevSecOps Software Supply Chain

Source is segregated by purpose in Bitbucket: reusable Terraform modules, YAML pipeline templates, application repositories, governed artefact repositories, policy-as-code, Ansible roles, and test assets each live in their own governed space, so that a security control written once as policy-as-code is inherited by every consumer rather than re-implemented per team. Security testing is pushed as far left as it will go: SAST, secrets detection, software-composition analysis (SCA) for third-party and open-source dependencies, and infrastructure-as-code scanning all run early in the build, with Wiz Code providing the code-to-cloud security plane that connects a finding in a repository to the running resource it would become. A vulnerability is therefore found close to the moment it is introduced, which is when it is cheapest to fix.

Dependencies are never pulled blindly from the public internet. The bank operates governed internal artefact repositories — npm, Maven, PyPI, NuGet, and container registries — as the only sanctioned source of packages, proxying and curating upstream so that what a build consumes is vetted, version-pinned, and auditable. Container images are scanned for vulnerabilities and cryptographically signed before they are admitted to a registry, and admission control downstream refuses anything unsigned, so provenance is enforced rather than trusted. Dynamic application security testing (DAST) exercises running services for the runtime-only classes that static analysis cannot see. Database change is treated as first-class code: schema-as-code following the expand/contract discipline is versioned, reviewed, and promoted on the same path as application code, which is precisely what makes the zero-downtime migration pattern repeatable rather than heroic. For the retail estate, the mobile build, scan, and distribution pipeline compiles the customer apps, subjects them to mobile-specific SAST and SCA, and distributes managed builds through unified endpoint management (UEM) to the workforce, with the same signing and provenance guarantees applied to a mobile binary as to a server image. Across all of it, the bank can answer the supply-chain question that regulators increasingly ask — what is in this artefact and where did each part come from — for any component in production, which is the foundation on which DORA’s third-party and concentration-risk obligations and SWIFT CSP supply-chain expectations rest.
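The sign-then-admit rule can be illustrated with a small sketch. It uses an HMAC from the Python standard library purely to stay self-contained; this is a stand-in, since production image signing would normally rely on asymmetric signatures with keys held in an HSM. The key, the function names, and the digest format are all hypothetical.

```python
import hashlib
import hmac
from typing import Optional

SIGNING_KEY = b"demo-key-not-for-real-use"  # hypothetical; real keys live in an HSM


def digest(image_bytes: bytes) -> str:
    """Content digest of an image, used as the thing that gets signed."""
    return "sha256:" + hashlib.sha256(image_bytes).hexdigest()


def sign(image_digest: str, key: bytes = SIGNING_KEY) -> str:
    return hmac.new(key, image_digest.encode(), hashlib.sha256).hexdigest()


def admit(image_digest: str, signature: Optional[str],
          key: bytes = SIGNING_KEY) -> bool:
    """Admission check: refuse unsigned images and signatures that do not verify."""
    if signature is None:
        return False
    # Constant-time comparison avoids leaking how much of the signature matched.
    return hmac.compare_digest(sign(image_digest, key), signature)


scanned = digest(b"application image layers")
signature = sign(scanned)
```

Because the signature covers the content digest, an image altered after scanning no longer matches and is refused at admission.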

Connected banking and IoT

The original proposal framed connected assets around a logistics fleet of trucks, containers, and pallets; for a universal bank that requirement is reinterpreted here as the telemetry of the bank’s own physical estate — its ATM and self-service fleet, its cash-in-transit movements, and its branch and facility sensors. What survives the reinterpretation is the architectural shape: thousands of distributed, physically exposed devices that must report continuously, prove their identity cryptographically, and keep operating when the link to the centre is lost.

Connected Banking IoT — ATM Fleet, Cash-in-Transit & Branch Telemetry

The estate falls into three device domains. The first is the ATM and self-service kiosk fleet, which streams health, cash-cassette levels, uptime, and tamper telemetry so that the bank predicts a cash-out before it happens, dispatches replenishment on evidence rather than schedule, and detects physical attack in real time — directly reinforcing the channel design’s promise that the ATM estate stays serviceable. The second is cash-in-transit and armoured-vehicle tracking, where GPS position, route adherence, and chain-of-custody events make the movement of physical currency observable end to end and exceptions — a deviation, an unexpected stop, a custody break — alertable immediately. The third is the branch-device and facility/security domain: physical access control, environmental sensors, and surveillance edge processing that keeps premises monitored without backhauling raw video to the centre.

The ingestion design uses each cloud for what it does best. AWS IoT Core handles fleet-scale device ingestion — the high-cardinality, high-volume stream of telemetry from tens of thousands of endpoints — while Azure IoT Hub serves the operational and control-tower plane, integrating device state into ServiceNow, SAP, and the wider banking operations estate so that a tamper alarm becomes a ServiceNow incident and a cash-low signal becomes a logistics work order without human transcription. Three security properties are non-negotiable, because these devices sit in lobbies, vehicles, and unattended vestibules. Every device carries a certificate-based identity and is admitted only on a valid, attestable credential. Each supports store-and-forward so that a kiosk or a vehicle that loses connectivity buffers locally and reconciles on reconnection — losing no telemetry and, crucially, no chain-of-custody record. And the entire fleet sits behind zero-trust device segmentation, isolated from the banking application networks so that a compromised ATM controller cannot become a path to the ledger. Telemetry is then tiered hot, warm, and cold — hot for the live operational view that drives dispatch and tamper response, warm for recent trend and capacity analysis, cold for the long-retention archive that audit, dispute, and forensic investigation require. The result is that the bank’s physical estate becomes as observable and as governed as its digital one, on the same zero-trust and integration foundations.
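Store-and-forward is the property that is easiest to show in code. The class below is a minimal in-memory sketch under stated assumptions: the `StoreAndForward` name, the `send` callable, and the per-device sequence number are illustrative, and a real device would persist its buffer to durable local storage so that it survives a power loss.

```python
from collections import deque


class StoreAndForward:
    """Buffer events locally while offline; flush in order on reconnect."""

    def __init__(self, send):
        self._send = send        # callable that delivers one event upstream
        self._buffer = deque()
        self._seq = 0            # per-device sequence number (assumption)
        self.online = True

    def record(self, payload: dict) -> None:
        self._seq += 1
        self._buffer.append({"seq": self._seq, **payload})
        if self.online:
            self.flush()

    def flush(self) -> None:
        # Remove an event only after it has been handed to `send`, so a
        # failure part-way through leaves the remainder buffered.
        while self._buffer:
            self._send(self._buffer[0])
            self._buffer.popleft()

    def reconnect(self) -> None:
        self.online = True
        self.flush()


received = []
vehicle = StoreAndForward(received.append)
vehicle.record({"event": "door_open"})
vehicle.online = False                      # link to the centre is lost
vehicle.record({"event": "custody_handover"})
vehicle.record({"event": "door_closed"})
delivered_while_offline = len(received)
vehicle.reconnect()
```

The monotonically increasing sequence number lets the receiving side confirm that a chain-of-custody stream has no gaps after reconciliation.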

Observability and SOC integration

A zero-downtime operating model is only credible if the institution can see, at all times, whether it is meeting it — and can prove that a degradation was detected and acted upon before customers felt it. Observability and security operations are therefore designed as one closed loop rather than two adjacent functions, with a single strategic plane that fuses health, performance, and security signal across both clouds into one operational truth.

Observability & SOC Integration

Dynatrace is the strategic observability plane. It carries real-user monitoring (RUM) so that the bank measures the experience customers actually have, synthetic monitoring against the customer portals so that a broken login or a slow payment journey is caught by a robot before a human reports it, infrastructure and Kubernetes observability across the cell-based platform, and cloud-cost visibility so that spend is an engineered signal rather than a monthly surprise. Dynatrace does not collect in isolation: Azure Monitor with Log Analytics and AWS CloudWatch are the native telemetry sources that feed it, so that each cloud’s first-party metrics, logs, and traces are unified into one cross-cloud picture rather than examined in two separate consoles. This is what lets operations reason about a Tier-1 service that spans Azure and AWS as one service against one set of SLOs.

Security signal converges on its own correlation plane and then joins the same loop. Microsoft Sentinel is the SIEM, ingesting and correlating security events across the estate; Wiz contributes cloud and code posture, CrowdStrike Falcon contributes endpoint detection and response, and the WAF contributes edge attack telemetry, all feeding the Security Operations Centre (SOC). The loop closes through automation: a confirmed incident raises a ServiceNow incident automatically, enriched against the configuration-management database (CMDB) so that responders know immediately which business service, which application, and which owner are affected — and so that the same change-and-incident system that gated the release now governs the response to it. Detection, correlation, ticketing, response, and the feedback into engineering form one closed loop, which is exactly what DORA’s major-incident-reporting and UK Operational Resilience’s impact-tolerance regimes require an institution to be able to demonstrate. Observability proves the bank is meeting its availability and latency targets; the SOC integration proves it can detect, attribute, and remediate the events that threaten them — and together they give the zero-downtime mandate the evidence base without which it would be only a claim.
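The CMDB enrichment step amounts to a keyed lookup that attaches ownership context to an alert before a ticket is raised. The sketch below assumes a hypothetical CMDB extract keyed by configuration item; the field names and the sample record are invented for illustration and do not reflect the ServiceNow schema.

```python
# Hypothetical CMDB extract keyed by configuration item (CI).
CMDB = {
    "aks-pay-01": {
        "business_service": "Payments",
        "application": "payments-hub",
        "owner": "payments-platform-team",
    },
}


def enrich(alert: dict, cmdb: dict = CMDB) -> dict:
    """Return an incident payload: the alert plus CMDB context for its CI."""
    ci = cmdb.get(alert["ci"])
    if ci is None:
        # Unknown CI: still raise the incident, but flag it for manual triage.
        return {**alert, "business_service": None, "application": None,
                "owner": None, "needs_triage": True}
    return {**alert, **ci, "needs_triage": False}


incident = enrich({"ci": "aks-pay-01", "severity": "high"})
orphan = enrich({"ci": "unknown-host", "severity": "low"})
```

Flagging an unmatched CI rather than dropping the alert keeps the loop closed even when the CMDB is incomplete.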

Cost model and TCO

The number that follows is a planning-grade estimate, and it is worth being precise about what that means. This is a steady-state run-rate model for the production estate, not a vendor quote — it sizes the recurring monthly cost of operating the target architecture once built, across four strategic regions in both clouds, so the architecture review board can reason about cost as a first-class design property rather than discover it after the fact. The figures are derived from the assumptions stated below and will move as commitments are negotiated and the estate fills; what they are engineered to be is directionally correct and internally consistent.

The assumptions are deliberately conservative and explicit. The model sizes a fleet of roughly 100 production applications across the dual-cloud landing zones, with non-production environments scaled down rather than mirrored — they run on a fraction of production capacity and shut down outside working hours. Compute is assumed to sit behind reserved instances and savings-plan / committed-use commitments covering the predictable baseline (around 70 per cent of steady demand), with on-demand and autoscale absorbing the peaks; uncommitted on-demand pricing is explicitly not the assumed posture. Third-party tooling — Dynatrace, Wiz, CrowdStrike Falcon, Okta, ServiceNow — is priced at enterprise list less an enterprise discount. The result is a run-rate; the one-time build and migration spend is governed and funded separately as project cost, not folded into this steady-state view.

Cost area | Drivers | Approx. USD/month
Hybrid connectivity | 8× ExpressRoute + 8× Direct Connect circuits, dual-carrier, regional gateways and egress | $45,000
Azure platform | 4 hub VNets, Azure Firewall Premium, App Gateway WAF v2, Sentinel, Log Analytics, management and security subscriptions | $60,000
AWS platform | 4 Transit Gateways, Network Firewall, Control Tower, Log Archive, Security Tooling, shared-network accounts | $45,000
Application compute and workloads | 100+ production apps across both clouds (AKS/EKS, Functions/Lambda, managed data), non-prod scaled down | $230,000
Core banking and payments platform | CP consensus ledger, payments hub, cell-based topology, active-active across regions | $140,000
SAP landing zone | HANA across 2 AZs plus DR, application tiers, certified infrastructure | $120,000
Data and integration | Lakehouse, Kafka/Flink streaming, BCBS 239 risk-data aggregation and lineage | $90,000
Fraud, AML and trading platforms | Inline fraud scoring, sanctions/AML screening, case management, market-data and execution support | $70,000
Observability and security tooling | Dynatrace, Wiz + Wiz Code, CrowdStrike Falcon, Okta (enterprise list less discount) | $100,000
Backup, DR and immutable storage | Cross-region backup, DR capacity, immutable Object-Lock / write-once archives | $40,000
Third-party edge | Global CDN, authoritative DNS with GSLB, managed WAF | $25,000
HSM and key management | Key Vault Managed HSM, CloudHSM, External Key Store (XKS) | $15,000
TOTAL | Steady-state production estate, 4 regions, both clouds | ≈ $980,000/month (~$11.8M/year)
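The total in the table above can be reproduced directly from its line items, which is a useful check whenever a line is revised. The snippet below simply transcribes the monthly figures from the table; the variable names are illustrative.

```python
# Monthly run-rate per cost area, transcribed from the table (USD).
MONTHLY_USD = {
    "Hybrid connectivity": 45_000,
    "Azure platform": 60_000,
    "AWS platform": 45_000,
    "Application compute and workloads": 230_000,
    "Core banking and payments platform": 140_000,
    "SAP landing zone": 120_000,
    "Data and integration": 90_000,
    "Fraud, AML and trading platforms": 70_000,
    "Observability and security tooling": 100_000,
    "Backup, DR and immutable storage": 40_000,
    "Third-party edge": 25_000,
    "HSM and key management": 15_000,
}

total_monthly = sum(MONTHLY_USD.values())   # 980,000
total_annual = total_monthly * 12           # 11,760,000, i.e. about $11.8M

# Which line dominates, and by how much.
largest_area = max(MONTHLY_USD, key=MONTHLY_USD.get)
largest_share = MONTHLY_USD[largest_area] / total_monthly
```

Application compute and workloads is the largest single line at a little under a quarter of the run-rate, which is why commitment coverage and right-sizing carry so much weight in the optimisation levers.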

That run-rate is a starting posture, not a ceiling, and the design assumes it is actively driven down through a standing FinOps practice. The single largest lever is commitment coverage: moving predictable baseline compute onto reserved instances and savings/committed-use plans delivers on the order of a 30 per cent reduction against on-demand for that portion, which is why the compute lines above are already modelled on the committed posture. Beyond commitment, autoscaling combined with disciplined non-production shutdown removes capacity the estate is not using, while right-sizing is made continuous and evidence-led through Dynatrace, whose utilisation telemetry drives a regular cadence of resizing over-provisioned workloads down to their observed working set. Storage lifecycle policies tier ageing data from hot to cool to archive and let immutable retention expire on schedule. Underpinning all of it, mandatory tagging — owner, cost centre, data classification, criticality — makes showback and chargeback real, so every cost above is attributable to a named line of business and the one-time build and migration programme remains governed under its own funding line throughout.
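The effect of commitment coverage can be worked through with the two figures stated in this section: roughly 70 per cent of steady demand under commitment, and a reduction of about 30 per cent against on-demand for that portion. Treating both figures as applying uniformly to steady compute demand is a simplifying assumption, and the function name is illustrative.

```python
def blended_cost_factor(coverage: float, discount: float) -> float:
    """Cost relative to all-on-demand when `coverage` of demand gets `discount`."""
    if not (0.0 <= coverage <= 1.0 and 0.0 <= discount <= 1.0):
        raise ValueError("coverage and discount must be fractions between 0 and 1")
    committed_part = coverage * (1.0 - discount)
    on_demand_part = 1.0 - coverage
    return committed_part + on_demand_part


# 70% of demand at a 30% discount, 30% of demand at full on-demand price.
factor = blended_cost_factor(0.70, 0.30)   # about 0.79
blended_saving = 1.0 - factor              # about 0.21
```

Under these assumptions the blended saving on steady compute is about 21 per cent rather than the headline 30 per cent, because the uncommitted share still pays on-demand rates.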

Bill of materials

Where the cost model answers what it costs to run, the bill of materials answers what is actually built, enumerating the major landing zones and platform domains with indicative resource counts so the estate can be provisioned and audited as a defined inventory. The organising principle is that everything here is vended from code rather than hand-built — the Azure subscription-vending factory and AWS Account Factory for Terraform produce the workload estate, and the platform footprint is rendered from the Terraform module library — so the counts below describe a reproducible target inventory. The symmetry between the two clouds is intentional and load-bearing: four regions, paired-DR per geography, and an equivalent control surface on each provider are what make the regulator-facing tested-exit and concentration-risk obligations demonstrable rather than asserted.

Landing zone / domain | Key resources (count)
Azure platform | 4 hub VNets; 4 Azure Firewall Premium; 4 App Gateway WAF v2; 8 ExpressRoute gateways; Microsoft Sentinel; Key Vault Managed HSM; Log Analytics workspace
Azure identity and management | Identity subscription (Entra Connect, domain controllers, private DNS); Management subscription (Azure Monitor, Automation, Backup vaults); 2 break-glass identities
Azure workloads | ≈100 vended application subscriptions across Corp/Online × prod/non-prod, each policy-inherited and hub-attached
AWS platform | 4 Transit Gateways; 4 Network Firewall; Control Tower; Log Archive (immutable S3 + Object Lock); Security Tooling account; CloudHSM
AWS network and shared services | Shared Network account per region (inspection, ingress, egress VPCs); Shared Services account; IAM Identity Center federation via Okta
AWS workloads | ≈100 vended application accounts across Prod/Non-Prod OUs, each SCP-inherited and Transit-Gateway-attached
Core banking | System of record (core ledger, single-homed) plus the Kafka/Flink ISO 20022 event backbone exposing it
Payments and cards | Payments hub (rail normalisation); card authorisation switch; inline fraud and AML/sanctions screening platforms
SAP landing zone | SAP HANA across 2 AZs plus paired-DR region; application tiers on certified infrastructure
Data platform | Lakehouse, streaming, and BCBS 239 risk-data aggregation, lineage and golden-source governance
Zero-downtime topology | Independent full-stack cells with cell router; CP consensus ledger replicated across regions, synchronous quorum
Third-party edge | Global CDN; authoritative DNS with GSLB and health-based failover; managed WAF with origin cloaking
Shared SaaS | Okta; Wiz + Wiz Code; CrowdStrike Falcon; Dynatrace; ServiceNow

Operating model and RACI

A zero-downtime estate is only as resilient as the operating model that runs it, so support, change, and patching are designed with the same rigour as the architecture itself. Support follows the sun on a 24×7 basis across the bank’s operating geographies, structured as a clear escalation chain: Level 1 is the ServiceNow-fronted service desk that triages and owns the ticket; Level 2 is Cloud Operations, which holds operational ownership of the running platform; and Level 3 is Platform Engineering or the relevant vendor, engaged for deep platform faults and product defects. The boundary between tiers is deliberate — escalation to engineering or vendor happens only when the fault is genuinely in the platform or product, so the deep-skill teams are protected for the problems only they can solve.

Change management is the discipline that lets a bank with no maintenance windows still alter its estate safely, and it is anchored end-to-end in ServiceNow. Every change is classified — standard, normal, or emergency — and the classification determines the gate. Standard changes are pre-approved, low-risk and templated; normal changes route through the Change Advisory Board for assessment and scheduling; emergency changes follow an expedited path with retrospective CAB review. Critically, even on an estate built for automated, zero-downtime release, manual approval gates are retained before promotion into Staging and Production — the pipeline can deploy without downtime, but a human still authorises the move into the environments that carry customers and money, which is how segregation-of-duties and SOX change-management expectations are satisfied without surrendering deployment safety to automation alone. Patching runs to a defined rhythm layered over this: a monthly maintenance cadence for routine updates plus critical out-of-band patching when a severe vulnerability demands it, governed by the patch SLA of seven days for critical and thirty days for high-severity findings, applied through the same zero-downtime release mechanics so that remediation never becomes its own outage.
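The patch SLA lends itself to a simple deadline calculation. The sketch below uses the seven-day and thirty-day limits stated above; counting calendar days rather than business days, and the function names themselves, are assumptions.

```python
from datetime import date, timedelta

# Days allowed from discovery to remediation, per the patch SLA above.
PATCH_SLA_DAYS = {"critical": 7, "high": 30}


def patch_due(found_on: date, severity: str) -> date:
    """Latest date by which a finding of this severity must be remediated."""
    if severity not in PATCH_SLA_DAYS:
        raise ValueError(f"no SLA defined for severity {severity!r}")
    return found_on + timedelta(days=PATCH_SLA_DAYS[severity])


def is_breached(found_on: date, severity: str, today: date) -> bool:
    """True once the remediation deadline has passed (calendar days assumed)."""
    return today > patch_due(found_on, severity)


critical_due = patch_due(date(2025, 1, 1), "critical")   # 2025-01-08
high_due = patch_due(date(2025, 1, 1), "high")           # 2025-01-31
```

A check of this kind can run against the vulnerability backlog daily, so an approaching breach is raised as a change request before the deadline rather than reported after it.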

Underpinning support, change and patch is an explicit allocation of accountability, because a control without a named owner is a control that fails silently. The matrix below assigns each major platform activity across the six teams using R (responsible), A (accountable), C (consulted), and I (informed). The load-bearing rule is one accountable owner per activity — there is never ambiguity about who answers for an outcome — while responsibility, consultation and information are shared as the work demands.

Activity | CCoE | Platform Eng | Security & Identity | Network | Operations | App Enablement
Landing-zone standards and blueprints | A | R | C | C | I | I
Account / subscription vending | I | A | C | R | I | C
Policy / guardrail change | A | R | C | C | I | I
Identity and PIM approval | C | I | A | I | I | C
Network and firewall change | I | C | C | A | R | I
Application onboarding | C | C | C | C | I | A
Zero-downtime release governance | C | A | C | I | R | R
Incident response | I | C | C | C | A | R
DR and TLPT resilience testing | C | R | C | R | A | C
Cost and FinOps governance | A | R | I | I | C | C
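The one-accountable-owner rule is mechanical enough to verify automatically whenever the matrix changes. The sketch below transcribes the matrix above, one letter per team in column order, and checks that every activity has exactly one A; the helper names are illustrative.

```python
TEAMS = ["CCoE", "Platform Eng", "Security & Identity",
         "Network", "Operations", "App Enablement"]

# One letter per team, in the column order of the matrix above.
RACI = {
    "Landing-zone standards and blueprints": "ARCCII",
    "Account / subscription vending": "IACRIC",
    "Policy / guardrail change": "ARCCII",
    "Identity and PIM approval": "CIAIIC",
    "Network and firewall change": "ICCARI",
    "Application onboarding": "CCCCIA",
    "Zero-downtime release governance": "CACIRR",
    "Incident response": "ICCCAR",
    "DR and TLPT resilience testing": "CRCRAC",
    "Cost and FinOps governance": "ARIICC",
}


def violations(raci: dict) -> list[str]:
    """Activities that break the rule of exactly one accountable team."""
    return [
        activity for activity, row in raci.items()
        if len(row) != len(TEAMS) or row.count("A") != 1
    ]


def accountable(activity: str) -> str:
    """Name of the single accountable team for an activity."""
    return TEAMS[RACI[activity].index("A")]


problems = violations(RACI)
```

Run against the matrix as published, the check finds no violations: every activity has one, and only one, accountable team.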

Migration and onboarding waves

Moving a universal bank present in seventy-plus countries onto a dual-cloud foundation is not a project that can be improvised application by application, and the design treats migration as an industrial capability rather than a sequence of one-off lifts. The estate is moved through a migration factory — a standing capability that pairs the Cloud Adoption Framework workstreams with reusable landing-zone blueprints, subscription and account vending, and the zero-downtime release tooling — so that the hundredth application onboards the same way the third did, on a pattern built for its workload type. The factory exists because the alternative does not scale: more than 100 applications across eight lines of business, four data centres and two clouds cannot each negotiate their own foundation, identity integration and network placement without the programme collapsing under bespoke effort and divergent risk. The factory standardises the how so the waves can argue only about the what and the when.

Every application that enters the factory is first assessed against the 6R disposition model, because the cheapest migration is frequently the one you do not do. The decision is deliberately conservative for a bank: a workload is retired if no longer needed, retained on-premises where consistency or regulation demands it, rehosted (lift-and-shift) when speed matters more than optimisation, replatformed (lift-and-reshape) to take managed resilience without a rewrite, repurchased where a SaaS or modern core supersedes the incumbent, and refactored only where the business case for cloud-native rearchitecture is real. The system of record is the clearest case of retain — the core ledger stays single-homed for consistency and regulatory comfort — while the systems of engagement around it are the candidates for replatform and refactor. No application crosses a wave boundary without passing an explicit entry and exit gate, so that a wave is a controlled increment of risk, not an open-ended phase, and the programme can always state precisely what is proven before the next, larger tranche begins.
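The disposition logic can be read as an ordered set of questions. The sketch below is one plausible encoding, not the bank's assessment tool: the `Workload` attributes are hypothetical, and the order in which the questions are asked is an assumption chosen to reflect the conservative stance described above, with rehost as the default when nothing stronger applies.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    still_needed: bool = True
    must_stay_on_prem: bool = False         # consistency or regulation demands it
    saas_supersedes: bool = False           # a SaaS or modern core replaces it
    cloud_native_case: bool = False         # business case for rearchitecture is real
    wants_managed_resilience: bool = False  # managed services without a rewrite


def disposition(w: Workload) -> str:
    """Return one of the six dispositions for a workload (ordering is assumed)."""
    if not w.still_needed:
        return "retire"
    if w.must_stay_on_prem:
        return "retain"
    if w.saas_supersedes:
        return "repurchase"
    if w.cloud_native_case:
        return "refactor"
    if w.wants_managed_resilience:
        return "replatform"
    return "rehost"


core_ledger = disposition(Workload("core-ledger", must_stay_on_prem=True))
old_report = disposition(Workload("legacy-report", still_needed=False))
intranet = disposition(Workload("intranet"))
```

Asking the retire and retain questions first reflects the point that the cheapest migration is the one not performed.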

Delivery Roadmap — CAF Workstreams & Six Migration Waves over 24 Months

The estate moves in six waves over roughly 24 months, sequenced so that risk rises only after the foundation that contains it has been proven. The shape is deliberate: the platform and its hardest guarantees come first; low-risk internal workloads prove the factory; customer channels and payments follow once the edge and identity are trustworthy; data, the core and the markets front office come last, when the surrounding estate is already operating to its targets.

Wave | Months | Scope (representative) | Entry criteria | Exit criteria
W1: Foundation | 0–3 | Platform landing zones, identity control plane, dual private connectivity, policy-as-code guardrails, and 2–3 pilot applications | Mandate and target-state signed off; tenancy, Organizations and subscriptions provisioned; carrier circuits ordered | Platform baseline live; DR Tier-0 (identity, connectivity, security control plane) proven; first zero-downtime release demonstrated end-to-end on a pilot
W2: Shared services | 3–6 | Shared services (logging, secrets, DNS, observability, ServiceNow integration) and low-risk internal applications | W1 exit met; blueprints published for the relevant workload types; vending operating to SLA | Shared-services baseline operating; internal apps live to Tier-2; onboarding factory throughput proven on real workloads
W3: Customer channels | 6–10 | Customer web and mobile, payments channels, and edge cutover to the third-party global CDN/DNS/WAF | W2 exit met; global edge and per-channel ingress in place; Conditional Access and PIM enforced | Customer channels active-active behind the global edge; no SPOF in ingress/identity for Tier-1; blue-green and canary releases proven on customer-facing apps
W4: Data and integration | 9–12 | Data and integration platform, event backbone, analytics, and BCBS 239 risk-data aggregation lineage and golden source | W3 exit met; canonical ISO 20022 model agreed; data classification bands enforced in governance | Integration backbone live; BCBS 239 lineage, golden-source and accuracy/completeness/timeliness evidenced; analytics consuming real-time ledger events
W5: Core coexistence | 12–18 | Core-banking strangler-fig coexistence, cards, SAP landing zone, and remaining Tier-1 workloads | W4 exit met; core sizing confirmed; vendor support in place; SAP Basis engaged; cards CDE segmented | Core operating in coexistence with ledger RPO ≈ 0 via synchronous quorum; cards authorising < 100 ms; SAP production live to Tier-1; demonstrated zone and region failover
W6: Markets and transition | 18–24 | Trading and capital markets, platform optimisation, and operating-model transition to run-state ownership | W5 exit met; co-location and low-latency path validated; RACI and runbooks accepted | Markets front office live (latency-pinned, co-located) with T+1 settlement on the resilient estate; cost and posture optimised; operations handed over to the six run-state teams

The 6R disposition is not applied once and forgotten; it is the lens that decides which wave a workload belongs to and what release pattern governs its cutover. Retained core systems never leave the foundation and are reached only through the event backbone, so they appear in the waves as integration points rather than migrations. Rehosted workloads dominate the early, low-risk waves because they prove the factory cheaply and quickly. Replatformed and refactored workloads cluster in the later waves, where the business value of managed resilience or cloud-native cell-based isolation justifies the additional effort and where the surrounding platform is mature enough to support them. Repurchased capabilities — a modern cloud-native core, a SaaS fraud or case-management tool — enter wherever their dependency graph allows, and retired applications are the quiet win that shrinks the estate before a single byte is moved. The disposition model is therefore the thread that ties the factory to the roadmap: it converts an inventory of 100-plus applications into a sequenced, gated programme in which every move is the right kind of move, made in the right wave, behind the right release pattern.

Architecture decision records

A design of this consequence earns trust by showing its working, and the architecture decision records are where the load-bearing choices are stated alongside the alternative that was rejected and the reason. Each ADR names a real trade-off rather than asserting a preference — a reviewer, an architecture board or a future maintainer can see not only what was decided but what was given up to decide it, which is the only honest basis on which a multi-year banking programme can be governed. The twelve records below are the decisions from which the rest of the architecture descends; everything else in this document is, in effect, their consequence.

ID | Decision | Alternative considered | Rationale / trade-off
ADR-01 | Dual-cloud estate on Azure + AWS | Single-cloud (one hyperscaler) | Regulator-mandated tested exit and cloud-concentration risk (DORA, EBA/ECB) outweigh the added cost and operating complexity of running two clouds; resilience and provider diversity for the few critical functions justify the overhead
ADR-02 | Cell-based (bulkhead) architecture | Monolithic regional deployment | Blast-radius containment: customers partitioned by ID shard into independent full-stack cells so a bad release or fault hits a fraction first; the price is a more complex topology and a cell router that must be trivially highly available
ADR-03 | CP consensus ledger (synchronous quorum) | Eventually-consistent / last-write-wins store | Money cannot diverge: balances demand strong consistency and RPO ≈ 0; the trade is higher write latency and the cost of 2PC across shards, accepted because an incorrect balance is never acceptable
ADR-04 | System of record single-homed + event backbone | Multi-homed active-active core | Consistency and regulatory comfort: the core ledger stays pinned to one location and is exposed through a canonical ISO 20022 event stream; the trade is that the SoR is not itself multi-region, mitigated by Tier-1 DR and sync quorum
ADR-05 | ISO 20022 canonical payment hub | Per-rail point-to-point integrations | Rail-agnostic normalisation: every rail maps to one internal model, so channels and the ledger are insulated from scheme specifics; the trade is the hub becoming a critical component that must itself be highly available
ADR-06 | BYOK + HYOK with customer-controlled HSM | Cloud-managed keys | Sovereignty and CLOUD-Act neutralisation: plaintext keys never leave bank-controlled hardware (Managed HSM / CloudHSM / XKS) for the most sensitive data; the trade is added key-management operational burden and careful availability design for the key path
ADR-07 | Blue-green + canary + expand/contract releases | Scheduled maintenance windows | Zero-downtime mandate: releases land progressively with health-based rollback and lock-free schema change; the trade is engineering and tooling investment in place of the simplicity of an outage window
ADR-08 | Okta identity broker + Entra control plane | Entra-only identity | A large SaaS estate plus AWS federation is brokered through Okta while Entra governs M365, device compliance, Conditional Access and PIM; the trade is two identity planes to integrate, justified by SaaS SSO and cross-cloud federation reach
ADR-09 | Third-party global edge (CDN/DNS/WAF) | Per-cloud Azure Front Door + AWS CloudFront | A single cross-cloud control point for WAF, bot/API protection and health-based failover between Azure and AWS origins; the trade is a third-party dependency, accepted to remove the edge as a SPOF and to fail over without re-architecting apps
ADR-10 | Terraform + Ansible (cross-cloud IaC) | Native Bicep + CloudFormation | Cross-cloud consistency: one toolchain and one set of modules render identically into both clouds, preventing two divergent estates; the trade is forgoing some cloud-native tooling depth in exchange for a single operating model
ADR-11 | Centralised egress inspection | Per-VNet / per-VPC egress | One policy point for outbound inspection, logging and filtering across the estate; the trade is a hub dependency on the egress path, mitigated by redundant, regionally paired inspection
ADR-12 | Azure hub-and-spoke + AWS Transit Gateway | Flat / peering mesh | Central inspection and predictable routing per cloud rather than an unmanageable mesh; the trade is hub components to operate, accepted because a flat mesh cannot enforce consistent inspection or scale to the estate’s size
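ADR-02's partitioning of customers by ID shard can be illustrated with a stable hash. This is a minimal sketch under stated assumptions: the `cell_for` function, the choice of SHA-256, and the fixed cell count are illustrative. A plain modulo mapping reassigns many customers when the cell count changes, so a production router would more likely hold an explicit assignment table or use consistent hashing.

```python
import hashlib


def cell_for(customer_id: str, cell_count: int) -> int:
    """Map a customer to a cell using a stable hash of the customer ID."""
    if cell_count < 1:
        raise ValueError("cell_count must be at least 1")
    # hashlib is used rather than the built-in hash(), which is salted per
    # process and would route the same customer differently after a restart.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


CELLS = 8  # hypothetical cell count
assignments = {cid: cell_for(cid, CELLS)
               for cid in (f"customer-{n}" for n in range(1000))}
```

Because the mapping is deterministic, every request for a given customer lands in the same cell, which is what lets a faulty release be contained to the customers of the cells that received it.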

Risks, assumptions, issues and dependencies

No architecture survives contact with delivery unless the things that could derail it are named, owned and mitigated in the open, and the register below is the programme’s honest account of what it is carrying. A risk that is written down with an owner and a mitigation is a managed risk; one that is left implicit is a future incident. The register is maintained as a living artefact through every wave — items close as gates are met and new ones are raised as the estate reveals them — but the load-bearing set at design time is recorded here so that the architecture review board accepts the design with its eyes open.

| Type | Item | Impact | Mitigation | Owner |
|---|---|---|---|---|
| Risk | Cross-cloud skills gap | Two clouds plus IaC, identity and payments expertise are scarce; delivery and operations may under-perform to target | CCoE-led enablement, blueprints and pairing; standardise on one toolchain (Terraform/Ansible) | CCoE / Cloud Platform Engineering |
| Risk | Cloud-concentration risk (DORA) | Over-reliance on one provider breaches DORA/EBA concentration and exit expectations | Dual-cloud estate, tested exit plans, Register of Information, resilience testing | Cloud Security & Identity / Risk |
| Risk | Core-banking migration complexity | Coexistence with the SoR ledger is the hardest cutover; error risks money movement | Strangler-fig coexistence, never big-bang; ledger RPO ≈ 0; W5 gating and rollback tests | Application Enablement / core vendor |
| Risk | Payments-scheme SLA breach | Missing an instant-payment or RTGS scheme SLA carries regulatory and reputational cost | Canonical payments hub, idempotent posting, issuer stand-in (STIP); inline monitoring | Payments / Cloud Operations |
| Risk | Cost overrun | Dual-cloud and Tier-1 resilience can exceed budget without active control | FinOps tagging, BoM and TCO tracking, W6 optimisation; vend governance | CCoE / Finance |
| Risk | Data-residency breach | A workload or key placed outside an approved jurisdiction breaches GDPR/Schrems II/local law | Governance classification bands, region-pinned placement, BYOK/HYOK | Cloud Security & Identity / Compliance |
| Risk | Fraud-engine latency / availability | An inline fraud control becomes an inline SPOF or blows its 10–50 ms budget | Replicated co-located feature stores, fail-safe defaults, active-active scoring | Fraud / Cloud Platform Engineering |
| Assumption | Four data centres persist | The four on-premises DCs remain the anchor for retained and single-homed systems | Hybrid-by-default design; revalidated at each wave gate | Network & Connectivity |
| Assumption | Dual-cloud mandate fixed | The Azure + AWS mandate is a fixed constraint, not a reversible preference | Treated as an architecture invariant; ADR-01 records the rationale | CCoE |
| Assumption | Workday authoritative for HR | Workday remains the authoritative source for joiner-mover-leaver identity events | Workday → Okta SCIM JML provisioning; reviewed if HR system changes | Cloud Security & Identity |
| Assumption | Core-banking vendor support | The incumbent core vendor supports the coexistence and target topology | Confirmed at W5 entry gate; contingency in 6R repurchase option | Application Enablement / vendor |
| Assumption | Regulator approval of cloud workloads | Supervisors approve the migration of the in-scope workloads to cloud | Early engagement, exit plans and control evidence ahead of go-live | Compliance / CCoE |
| Issue | Core sizing to be confirmed | Final core-banking capacity is not yet sized, blocking W5 firm planning | Resolve at W5 entry; capacity and sizing section pending vendor inputs | Application Enablement |
| Issue | Country residency list incomplete | The per-country data-residency list is not finalised across 70-plus countries | Compliance to issue the authoritative list before W4 data placement | Compliance |
| Issue | SWIFT CSP attestation timing | Annual CSP attestation (CSCF v2025) timing must align with go-live | Schedule secure-zone hardening and attestation ahead of payments cutover | Cloud Security & Identity |
| Dependency | Carrier circuits | Dual ExpressRoute / Direct Connect circuits gate connectivity in every wave | Order at W1; dual-carrier, active-active BGP | Network & Connectivity / carriers |
| Dependency | Core-banking vendor | Coexistence and cutover depend on vendor deliverables and support | Tracked against W5 gate; contractual milestones | Application Enablement / vendor |
| Dependency | SAP Basis | SAP landing zone and migration depend on Basis team availability | Engage ahead of W5; SAP landing-zone pattern prepared | SAP Basis / Application Enablement |
| Dependency | Security-tooling licences | Wiz, CrowdStrike, Sentinel and related licences gate the security baseline | Procure before W1/W2 baselines; entitlement tracked | Cloud Security & Identity / Procurement |
| Dependency | Regulator sign-off | Go-live for in-scope regulated workloads depends on supervisory sign-off | Sequenced ahead of each affected wave; evidence packs prepared | Compliance |
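
The rule that an entry counts as managed only when it carries both an owner and a mitigation can be checked mechanically. The following is a minimal Python sketch, not the programme's actual tooling: `RegisterEntry` and `unmanaged_entries` are hypothetical names introduced for illustration, the first two sample rows are copied from the register above, and the third is an invented entry that shows what the check catches.

```python
from dataclasses import dataclass

# The four entry types used in the register.
VALID_TYPES = {"Risk", "Assumption", "Issue", "Dependency"}


@dataclass(frozen=True)
class RegisterEntry:
    kind: str
    item: str
    impact: str
    mitigation: str
    owner: str


def unmanaged_entries(entries):
    """Return entries with an unknown type, no mitigation, or no owner."""
    return [
        e for e in entries
        if e.kind not in VALID_TYPES
        or not e.mitigation.strip()
        or not e.owner.strip()
    ]


register = [
    RegisterEntry(
        "Risk", "Cost overrun",
        "Dual-cloud and Tier-1 resilience can exceed budget without active control",
        "FinOps tagging, BoM and TCO tracking, W6 optimisation; vend governance",
        "CCoE / Finance",
    ),
    RegisterEntry(
        "Dependency", "Carrier circuits",
        "Dual ExpressRoute / Direct Connect circuits gate connectivity in every wave",
        "Order at W1; dual-carrier, active-active BGP",
        "Network & Connectivity / carriers",
    ),
    # Invented entry (not in the register) with no mitigation and no owner.
    RegisterEntry("Risk", "Example implicit risk", "Illustrative only", "", ""),
]

for entry in unmanaged_entries(register):
    print(f"UNMANAGED: {entry.kind} / {entry.item}")
```

Run at each wave gate, a check of this kind would flag any entry that has slipped into the register without an owner or a mitigation before the architecture review board sees it.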

Delivery roadmap and acceptance

The delivery model is anchored to the Cloud Adoption Framework workstreams, because a programme this large needs a shared vocabulary for what kind of work is happening at any moment, not just which wave it sits in. Strategy and Plan sets the mandate, the business case and the target state, and produces the disposition decisions that feed every wave. Ready stands up the platform — landing zones, identity, connectivity and guardrails — so that workloads land on a foundation rather than bare cloud. Adopt is the migration factory in motion: rehost, replatform, repurchase and refactor, wave by wave, behind the release patterns each workload warrants. Govern and Manage runs in parallel from the first day to the last, enforcing policy-as-code, cost control, security posture and operational ownership, and ultimately receiving the estate at the W6 operating-model transition. These workstreams are not sequential phases; they overlap continuously, which is precisely how the six waves and the 24-month timeline above are delivered without the foundation work and the run-state governance ever being treated as someone else’s problem. The waves are the increments; the CAF workstreams are the disciplines applied across all of them.

Acceptance is deliberately concrete, because a zero-downtime mandate is either demonstrated or it is merely claimed, and a bank cannot operate on a claim. The platform is accepted against a fixed set of criteria, each of which is a test that must pass rather than a statement that must be believed. There is no single point of failure in Tier-1 identity, connectivity or ingress — proven by removing each in turn and observing continuity. A tested zero-downtime release is demonstrated for every customer-facing banking application, through blue-green, canary and expand/contract patterns, with no maintenance window. Zone and region failover is demonstrated without interruption, not asserted from a runbook. Rollback is tested at every layer — infrastructure, application and database — so that every production change is genuinely reversible. Disaster-recovery restoration is exercised to the tier targets, Tier-0 through Tier-3, including the ledger at RPO ≈ 0 via synchronous quorum. Security baselines and immutable audit logging are in place before go-live, never retrofitted after it. And the service-level objectives are met as engineered — 99.995% for the Tier-0 identity and control plane, and 99.99% for customer-facing and Tier-1 services — measured monthly against the availability budget, not averaged into comfort.
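
The service-level objectives translate into a concrete monthly downtime budget, and the arithmetic shows why measuring each month separately matters. The sketch below assumes a 30-day month (43,200 minutes); the function names are hypothetical and the observed downtime figures are invented purely for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(slo_percent, days_in_month=30):
    """Permitted downtime in minutes for one month at the given SLO."""
    return days_in_month * MINUTES_PER_DAY * (100.0 - slo_percent) / 100.0


def breached_months(monthly_downtime_minutes, slo_percent, days_in_month=30):
    """Indices of months whose own downtime exceeds the monthly budget."""
    budget = downtime_budget_minutes(slo_percent, days_in_month)
    return [i for i, d in enumerate(monthly_downtime_minutes) if d > budget]


tier0_budget = downtime_budget_minutes(99.995)  # about 2.16 minutes per month
tier1_budget = downtime_budget_minutes(99.99)   # about 4.32 minutes per month

# Invented downtime (minutes) for three months of a 99.99% service.
observed = [0.0, 0.0, 6.0]
average = sum(observed) / len(observed)  # 2.0 minutes

print(f"Tier-0 budget: {tier0_budget:.2f} min/month")
print(f"Tier-1 budget: {tier1_budget:.2f} min/month")
print("Three-month average within budget:", average <= tier1_budget)
print("Months in breach:", breached_months(observed, 99.99))
```

In this example the three-month average of 2.0 minutes sits comfortably inside the 4.32-minute budget, yet the third month on its own breaches it. That gap is what monthly measurement is meant to expose.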

The through-line of the entire document arrives here. Every wave has an entry and an exit gate; every load-bearing choice has an ADR with its rejected alternative; every risk has an owner; every acceptance criterion is a test rather than a promise. A design that can be drawn this precisely — decision by decision, gate by gate, control by control — is one a partner can build and a bank can operate, which is the only standard that matters for an institution that has promised never to close. The roughly 35 architecture diagrams that accompany these sections are not illustration; they are the evidence that each layer of that promise has been worked out to the point where it can be implemented, tested and run.
