A global express, freight, and supply-chain carrier asked a deceptively simple question: “Give us one secure foundation we can build the next decade on.” The carrier moves parcels, time-critical freight, and temperature-controlled healthcare shipments through a worldwide air-and-ground network of sorting hubs, distribution centres, and a last-mile delivery fleet. Behind its question sits a brutal amount of reality: four on-premises data centres in four countries, a dual-cloud mandate across Azure and AWS, more than 100 applications, SAP at the centre of the business, a hybrid workforce signing in from offices and from home, and a fleet of trucks and containers that needs to report its location and condition in near real time. This article is the target-state reference architecture for that foundation: a secure, policy-driven, multi-cloud landing zone built on the Cloud Adoption Framework, enterprise-scale landing zones, and Well-Architected principles, and justified (not just asserted) with twenty-seven architecture diagrams.
The design brief is uncompromising. Identity is the perimeter, not the network. Platform services come before applications. Platform subscriptions and accounts stay separate from workload subscriptions and accounts so that a compromise or a mistake has a small blast radius. Guardrails, naming, tagging, logging, and CI/CD behave identically whether a team lands in Azure or AWS. And every critical service is engineered for regional and circuit redundancy rather than convenience. The result is not a flat trust zone stretched into the cloud; it is the same or better user and operator experience delivered through integrated identity, DNS, routing, and policy while zero-trust enforcement runs underneath every request.
Target-state overview
The estate is best understood from a single vantage point before drilling into any one domain. Users and partners enter from the top — office staff, work-from-home staff over GlobalProtect VPN, and external partners. Every public request first meets a third-party global edge (CDN, authoritative DNS with global traffic management, and a WAF) that normalises and protects traffic before it touches a cloud. Azure and AWS each host enterprise landing zones; the four on-premises data centres anchor the bottom and connect to both clouds over dual private circuits. Cross-cutting platforms — Okta, Microsoft Entra ID, Wiz, CrowdStrike Falcon, Dynatrace, and ServiceNow — operate horizontally across both clouds so security, observability, and operations are consistent. A dedicated IoT lane carries telemetry from trucks and containers into the platform.
Two design decisions in this overview drive everything that follows. First, the edge is a third-party global tier, not a per-cloud feature. A single CDN/DNS/WAF stack in front of both clouds gives one place to enforce OWASP rules, bot mitigation, API protection, and geo policy, and it lets traffic fail over between Azure and AWS origins without re-architecting each application. Second, the four on-premises data centres are first-class regions, not legacy to be tolerated. SAP dependencies, AD DS, Kerberos-bound systems, and country-level business continuity all assume the on-premises footprint persists, so the connectivity and identity designs treat hybrid as the default rather than a migration afterthought.
Architecture principles and operating model
Six principles govern the build. Platform first — identity, networking, management, and security are established before any workload migrates. Segregation by design — platform and application boundaries never share a subscription or account. Zero trust — no user, device, workload, or path is trusted implicitly; access is evaluated continuously on identity, device, risk, and context. Hybrid by default — every critical service accounts for on-premises integration, SAP, and AD DS. Multi-cloud consistency — one set of guardrails, naming, logging, and CI/CD across both clouds. Resilience over convenience — regional redundancy, redundant circuits, and tested recovery beat the easy path.
These principles only hold if someone owns them. The operating model centres on a Cloud Center of Excellence that owns landing-zone standards, policy-as-code, reference architectures, and onboarding patterns. Around it sit five teams with clear remits: Cloud Platform Engineering (landing zones, automation, shared tooling), Cloud Security and Identity (Okta/Entra federation, PIM, Conditional Access, Wiz, CrowdStrike, secrets, governance), Network and Connectivity (ExpressRoute, Direct Connect, transit, firewalls, DNS, edge, GlobalProtect), Cloud Operations (monitoring, backup, incident response, ServiceNow), and Application Enablement (onboarding blueprints by workload type). The consulting partner’s job is to make those teams self-sufficient after handover, which is why every pattern in this architecture is expressed as reusable code and documented runbooks rather than tribal knowledge.
Requirements and non-functional targets
A design is only as good as the targets it can be tested against, so before any topology is drawn the proposal pins down what the estate must do and how well it must do it. The functional scope is broad but deliberately bounded: a dual-cloud landing zone across Azure and AWS, a zero-trust identity fabric spanning on-premises Active Directory, Entra ID, and Okta, resilient hybrid connectivity from the four data centres, a factory to onboard 100-plus applications, a self-managed SAP landing zone, an event-driven data and integration platform, a fleet-scale IoT backbone, a location-transparent digital workplace, an end-to-end DevSecOps pipeline, and a unified observability and SOC capability. Each of those is a deliverable in its own right; what turns them from a wish list into an engineerable system is the set of non-functional targets that follow, because those are the numbers the platform will be accepted — or rejected — against.
The headline non-functional is availability, and it is engineered to the recovery tier rather than promised as a single blanket figure. The platform and Tier-0 control plane — identity, connectivity, DNS, and security tooling — carries a monthly SLO of 99.99%, because nothing else recovers until it does. Tier-1 services that the business sells against — customer logistics platforms, SAP production, and partner APIs — target 99.95%; Tier-2 line-of-business and analytics workloads target 99.9%; and Tier-3 development, test, and non-critical reporting relax to 99.5%. Latency is treated with the same specificity: edge-cached time-to-first-byte stays under 100 ms globally, the control-tower and tracking APIs hold a p95 under 300 ms in-region, and the real-time IoT alert path — geofence and threshold breaches that a dispatcher acts on — completes end-to-end in under 5 seconds. Scale targets are sized to the fleet the carrier actually runs: 500,000-plus connected devices, 50,000 messages per second at peak ingest, 10,000 tracking events per second sustained, and headroom to onboard 100-plus applications without re-architecting the platform.
Security and operational agility close out the envelope, and both are expressed as measurable commitments rather than intentions, because a posture that cannot be measured cannot be defended in an audit. Every user authenticates with MFA (100% coverage), privileged roles use phishing-resistant factors, and the estate runs with zero standing privilege — all administrative access is brokered just-in-time through PIM with approval and a time box. Posture is held to a Wiz score of at least 85 across both clouds, and the patch SLA is seven days for critical and thirty days for high findings, tracked to closure rather than to detection. On agility, a new subscription or account is vended in under one business day and a new application reaches production-ready in under two weeks through the blueprint — the deliberate counterweight to the governance, so that control does not become a synonym for slow. Recovery objectives are governed by the DR-tier table reproduced in the resiliency section, and the compliance posture — ISO 27001, SOC 2 Type II, PCI-DSS on payment paths, and GDPR with country data-residency — is detailed in the compliance section, with this document carrying only the obligation that the targets exist and are owned. The table below consolidates the non-functionals into the form the programme tracks them in: a dimension, a hard target, and the mechanism by which it is met and measured.
| Dimension | Target | How it’s met / measured |
|---|---|---|
| Platform / Tier-0 availability | 99.99% monthly | Active-active control plane, paired-DR region per geography, RTO 1h / RPO 15m |
| Tier-1 availability | 99.95% monthly | Customer platforms, SAP prod, partner APIs across 2 AZs + warm DR |
| Tier-2 / Tier-3 availability | 99.9% / 99.5% monthly | Single-region resilient with backup-driven recovery to tier RTO/RPO |
| Edge latency | TTFB < 100 ms globally | Third-party CDN anycast caching + health-checked DNS failover |
| API latency | Control-tower & tracking p95 < 300 ms in-region | Regional active-active, private back ends, Dynatrace RUM/synthetic |
| IoT alert latency | End-to-end < 5 s | Rules engine + stream processing on the hot path, geofence/threshold |
| Device & ingest scale | 500,000+ devices, 50,000 msg/sec peak, 10,000 events/sec sustained | AWS IoT Core + Azure IoT Hub, sharded streaming, load-tested |
| Security posture | MFA 100%, zero standing privilege, Wiz ≥ 85 | Conditional Access + PIM JIT, continuous CSPM scoring |
| Patch SLA | Critical 7 days / High 30 days | Monthly cadence + critical out-of-band, tracked in ServiceNow |
| Operational agility | Vend < 1 business day, new app < 2 weeks | Subscription/account factories + application blueprint |
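The availability tiers are easier to reason about as a downtime budget. The sketch below converts each monthly SLO into the minutes of unavailability it tolerates, assuming a 30-day month (the article does not define the measurement window, so that figure is an assumption). The function and constant names are illustrative.

```python
# Monthly downtime budget implied by each availability SLO.
# Assumption: a 30-day month (43,200 minutes); the window is not defined in the article.

MINUTES_PER_MONTH = 30 * 24 * 60

SLO_BY_TIER = {
    "Tier-0": 99.99,
    "Tier-1": 99.95,
    "Tier-2": 99.9,
    "Tier-3": 99.5,
}


def downtime_budget_minutes(slo_percent: float,
                            window_minutes: int = MINUTES_PER_MONTH) -> float:
    """Return the minutes of unavailability the SLO tolerates in one window."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in (0, 100]")
    return round(window_minutes * (100.0 - slo_percent) / 100.0, 2)


if __name__ == "__main__":
    for tier, slo in SLO_BY_TIER.items():
        print(f"{tier}: {slo}% -> {downtime_budget_minutes(slo)} min/month")
```

Under that assumption the budgets are roughly 4.3, 21.6, 43.2, and 216 minutes per month for Tier-0 through Tier-3, which is why the Tier-0 control plane has to be active-active rather than recovered on demand.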
Requirements traceability
Stating targets is necessary but not sufficient; the proof that the architecture answers them is traceability — every requirement mapped to a concrete design element and to the diagram that evidences it, so that nothing is asserted in prose and then quietly orphaned. The discipline matters most on a dual-cloud estate, where it is easy for a control to be real in Azure and merely assumed in AWS, or for a resilience claim to lack a rehearsal behind it. The register below is the spine that connects the two halves of the document: read left to right, each row begins with a business or platform requirement, moves to the design response that satisfies it, and ends with the section or diagram a reviewer can open to confirm the response is actually built rather than promised.
The mapping is deliberately complete across the estate’s load-bearing concerns, and it is bidirectional in spirit: every requirement has a design response, and — just as important — no major design element exists without a requirement that justifies it, which is how gold-plating is kept out of the architecture. Guardrail consistency is satisfied by the subscription and account factories that vend governed landing zones identically in both clouds; zero-trust identity by the Conditional Access and PIM model over Entra and Okta; resilient connectivity by the dual ExpressRoute and Direct Connect design; and scale, SAP resilience, multi-layer security, IoT throughput, observability, data residency, and operability each trace to their own design response and evidence. Where a requirement spans several diagrams — multi-layer security and application onboarding both do — the traceability points at the set, because the control only holds when every layer is present. The final row is the one that is easiest to omit and most expensive to skip: the estate must remain operable after handover, which traces to the RACI and the runbooks in the operating-model section, because a platform nobody can run to its targets has not actually met them.
| Req ID | Requirement | Design response | Evidence (section / diagram) |
|---|---|---|---|
| REQ-01 | Consistent dual-cloud guardrails | Subscription factory + Account Factory for Terraform, one policy-as-code set | Azure & AWS landing zone sections / governance hierarchy diagram |
| REQ-02 | Zero-trust identity, no standing privilege | Entra + Okta, Conditional Access engine, PIM JIT with approval and time-box | Identity & Conditional Access sections / identity & CA-PIM diagrams |
| REQ-03 | Resilient hybrid connectivity from 4 DCs | 2× ExpressRoute + 2× Direct Connect per region, active-active BGP | Global hybrid connectivity section / connectivity diagram |
| REQ-04 | Onboard 100+ applications | Application blueprints + factory vending, secure-by-default landing zones | Application onboarding section / onboarding diagrams |
| REQ-05 | SAP high availability and DR | HANA System Replication across 2 AZs + asynchronous warm-standby DR | SAP landing zone section / SAP diagram |
| REQ-06 | Multi-layer, defence-in-depth security | Seven-layer model + layered edge ingress, centralised inspection | Multi-layer security sections / security & network diagrams |
| REQ-07 | Fleet-scale IoT ingest | Split AWS IoT Core (vehicle telemetry) + Azure IoT Hub (control tower) | IoT connected logistics section / IoT diagrams |
| REQ-08 | Unified observability and SOC | Dynatrace strategic plane, Sentinel SIEM, automated ServiceNow incidents | Observability & SOC section / SOC diagram |
| REQ-09 | Data residency and localisation | Medallion lakehouse governance band enforcing in-region restricted data | Data & integration section / data platform diagram |
| REQ-10 | Operable to target after handover | 6-team RACI, tiered DR runbooks, follow-the-sun support model | Operating model & DR runbooks section / governance material |
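The bidirectional check described above can be automated. The following is a minimal sketch, not the programme's actual tooling: it flags requirements that lack a design response or evidence, and design elements that no requirement justifies. The `Requirement` class, the `REQ-99` row, and the "unrequested service mesh" element are hypothetical and exist only to exercise the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    req_id: str
    design_responses: tuple  # design elements that answer the requirement
    evidence: tuple          # sections or diagrams a reviewer can open


def audit_traceability(requirements, design_elements):
    """Return (requirement IDs missing a response or evidence,
    design elements that no requirement justifies)."""
    incomplete = sorted(
        r.req_id for r in requirements
        if not r.design_responses or not r.evidence
    )
    justified = {d for r in requirements for d in r.design_responses}
    unjustified = sorted(set(design_elements) - justified)
    return incomplete, unjustified


# Two rows paraphrased from the register, plus one deliberately broken row.
REGISTER = [
    Requirement("REQ-03", ("dual ExpressRoute + Direct Connect",), ("connectivity diagram",)),
    Requirement("REQ-05", ("HANA System Replication",), ("SAP diagram",)),
    Requirement("REQ-99", ("hypothetical response",), ()),  # no evidence: flagged
]
DESIGN_ELEMENTS = [
    "dual ExpressRoute + Direct Connect",
    "HANA System Replication",
    "hypothetical response",
    "unrequested service mesh",  # hypothetical gold-plating example
]

if __name__ == "__main__":
    print(audit_traceability(REGISTER, DESIGN_ELEMENTS))
```

Run in a pipeline, a non-empty result on either side would fail the review: the first list catches orphaned requirements, the second catches gold-plating.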
Azure landing zone
Azure follows the enterprise-scale design. A Tenant Root Group anchors a management-group hierarchy that pushes policy and RBAC down by inheritance, so a control written once applies everywhere beneath it. The Platform management group holds four dedicated platform subscriptions — Identity, Connectivity, Management, and Security — which keeps shared services out of any single workload’s blast radius and lets the platform teams operate them on their own change cadence. Application workloads live under a Landing Zones management group, split into Corp (internal) and Online (internet-facing) and further separated into production and non-production subscriptions. Sandbox and Decommissioned management groups give experimentation and offboarding their own guarded lifecycles.
The platform subscriptions earn their separation. Identity carries domain controllers, Entra Connect, and private DNS support. Connectivity holds the regional hub VNets, Azure Firewall Premium, Application Gateway WAF, the ExpressRoute gateway, DNS Private Resolver, Bastion, and DDoS Network Protection. Management centralises Log Analytics, Azure Monitor, Automation, Update Management, and Backup vaults. Security runs Microsoft Sentinel, Key Vault HSM, break-glass identities, and the integration anchors for Wiz and CrowdStrike. Application subscriptions are vended, not hand-built: a subscription factory places each new subscription under the right management group, inherits policy, assigns RBAC, applies cost tags, and attaches it to the hub — so a workload arrives already governed.
AWS landing zone
AWS mirrors the same intent through AWS Organizations and a multi-account model delivered with Control Tower and Account Factory for Terraform. The Management account sits at the apex; Service Control Policies enforce the same guardrails the Azure policy hierarchy enforces — approved regions only, no public exposure of sensitive services, mandatory tagging. Security-relevant accounts are isolated in a Security OU (a Log Archive account with immutable S3 and Object Lock, and a Security Tooling account). Infrastructure lives in its own OU (a Shared Network account per region carrying the Transit Gateway, and a Shared Services account). Workloads split into Prod and Non-Prod OUs, with application accounts separated by criticality.
The decisive pattern here is account vending as code. Account Factory for Terraform provisions and customises accounts through a GitOps workflow, so an application account is never a manual ticket — it is a pull request that yields an account already wired with centralised logging, security-tooling enrolment, a Transit Gateway attachment, IAM Identity Center federation, approved-region settings, and KMS governance. That symmetry between Azure subscription vending and AWS account vending is what makes “multi-cloud consistency” real rather than aspirational: a workload team requests an environment the same way and receives the same baseline regardless of cloud.
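To make the symmetry concrete, here is a minimal sketch of a cloud-agnostic vending request. It is not the Account Factory for Terraform or the Azure subscription factory interface; `VendingRequest`, `vend`, and the field names are hypothetical. The baseline items are the ones the text lists for each cloud.

```python
from dataclasses import dataclass

# Baseline items as listed in the landing zone descriptions.
AZURE_BASELINE = (
    "management-group placement",
    "inherited policy",
    "RBAC assignment",
    "cost tags",
    "hub attachment",
)
AWS_BASELINE = (
    "centralised logging",
    "security-tooling enrolment",
    "Transit Gateway attachment",
    "IAM Identity Center federation",
    "approved-region settings",
    "KMS governance",
)
BASELINES = {"azure": AZURE_BASELINE, "aws": AWS_BASELINE}
ENVIRONMENTS = ("prod", "nonprod")  # hypothetical environment labels


@dataclass(frozen=True)
class VendingRequest:
    app: str
    cloud: str
    environment: str
    owner: str


def vend(request: VendingRequest) -> dict:
    """Validate a request and return the governed environment it would yield."""
    if request.cloud not in BASELINES:
        raise ValueError(f"unsupported cloud: {request.cloud}")
    if request.environment not in ENVIRONMENTS:
        raise ValueError(f"unsupported environment: {request.environment}")
    return {
        "name": f"{request.app}-{request.environment}",
        "cloud": request.cloud,
        "tags": {
            "app": request.app,
            "environment": request.environment,
            "owner": request.owner,
        },
        "baseline": list(BASELINES[request.cloud]),
    }
```

The point of the sketch is the shape: the request is identical for both clouds, and only the baseline it expands into differs.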
Governance hierarchy and policy inheritance
The two landing zones look different on the surface — Azure management groups on one side, AWS organisational units on the other — but they are governed by one set of guardrails compiled into both. Policy-as-code in Terraform defines each control once; it is rendered into Azure Policy and RBAC assignments that inherit down the management-group tree, and into Service Control Policies that inherit down the AWS OU tree. A control authored once — approved regions only, no public exposure of sensitive services, mandatory tagging, encryption everywhere, logging to an immutable archive — therefore applies identically in both clouds, which is exactly what stops a multi-cloud estate from drifting into two divergent ones.
Inheritance is the mechanism that lets the operating model scale: the platform teams change a guardrail in one place and every subscription and account beneath it conforms on the next pipeline run, while workload teams stay free to build inside the guardrails without being able to weaken them.
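As an illustration of "authored once, rendered twice", the sketch below takes one approved-regions control and emits an Azure Policy rule body and an AWS Service Control Policy document. The article says the real rendering is done in Terraform; this Python version only shows the idea. The logical region names and the `REGION_MAP` pairing are hypothetical, and a production SCP would normally also exempt global services, which is omitted here.

```python
# One logical control rendered into both clouds' policy formats.
# REGION_MAP is a hypothetical pairing of logical regions to provider region names.
REGION_MAP = {
    "frankfurt": {"azure": "germanywestcentral", "aws": "eu-central-1"},
    "singapore": {"azure": "southeastasia", "aws": "ap-southeast-1"},
}


def provider_regions(approved, cloud):
    """Translate logical region names into one provider's region names."""
    return sorted(REGION_MAP[name][cloud] for name in approved)


def render_azure_policy_rule(approved):
    """Azure Policy rule: deny any resource whose location is not approved."""
    return {
        "if": {"not": {"field": "location", "in": provider_regions(approved, "azure")}},
        "then": {"effect": "deny"},
    }


def render_aws_scp(approved):
    """AWS SCP: deny any request made to a region that is not approved.
    Simplification: global-service exemptions are left out of this sketch."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyUnapprovedRegions",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {
                    "aws:RequestedRegion": provider_regions(approved, "aws")
                }
            },
        }],
    }
```

Both outputs derive from the same `approved` list, so changing the control in one place changes both clouds on the next pipeline run.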
Global hybrid connectivity
Connectivity is where the four-data-centre reality becomes architecture. Four strategic regions align to the four on-premises countries. Each region terminates two ExpressRoute circuits to Azure and two Direct Connect circuits to AWS, ideally across diverse providers, with active-active BGP so a single circuit or carrier failure fails over without human intervention. Azure regional hub VNets front the ExpressRoute gateways and route through Azure Firewall to their spokes; AWS regional Transit Gateways attach behind Direct Connect Gateways and fan out to VPCs. Controlled inter-cloud traffic between Azure and AWS routes either through the on-premises core or a dedicated interconnect, chosen per workload on latency, compliance, and cost.
Redundancy here is not gold-plating; it is the explicit requirement. A logistics control tower that loses connectivity to SAP or to its tracking back end stops the business, so every region carries two of everything and the routing is engineered to fail back cleanly once a circuit recovers. The hub-and-spoke pattern in Azure and the transit-centric pattern in AWS both centralise inspection and egress control, which means the network security policy is enforced in one place per region instead of being re-implemented per workload.
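The case for two circuits per cloud per region can be put in numbers. The sketch below computes the availability of a set of parallel circuits under the strong assumption that their failures are independent, which is what provider diversity is meant to approximate. The 99.9% per-circuit figure used in the example is hypothetical, not a carrier commitment from the article.

```python
def combined_availability(per_circuit: float, circuits: int = 2) -> float:
    """Availability of N parallel circuits, assuming independent failures.

    The path is down only when every circuit is down at the same time,
    so the combined unavailability is the product of the individual ones.
    """
    if not 0.0 <= per_circuit <= 1.0:
        raise ValueError("per_circuit must be a fraction between 0 and 1")
    if circuits < 1:
        raise ValueError("circuits must be at least 1")
    return 1.0 - (1.0 - per_circuit) ** circuits


if __name__ == "__main__":
    # Hypothetical per-circuit availability of 99.9%.
    print(f"{combined_availability(0.999, 1):.6f}")
    print(f"{combined_availability(0.999, 2):.6f}")
```

Shared conduits, shared providers, or a shared meet-me room break the independence assumption, which is why the design asks for diverse providers where possible.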
That WAN view shows where the circuits land; the regional internals differ enough per cloud to deserve their own deep dives. In Azure, each region is a hub-and-spoke VNet: the hub carries the ExpressRoute and backup VPN gateways, Azure Firewall Premium, Application Gateway WAF, Bastion, and DNS Private Resolver in their own dedicated subnets, and spoke VNets for prod and shared services peer in — forced through the firewall by user-defined routes and reaching PaaS only over Private Endpoints.
AWS uses a transit-centric design: a regional Transit Gateway with separate route tables stitches together a dedicated inspection VPC (AWS Network Firewall, reached through a Transit Gateway attachment with appliance mode enabled so that both directions of a flow traverse the same firewall endpoint), an ingress VPC (ALB + WAF across two Availability Zones), an egress VPC (a NAT gateway per AZ), and the workload VPCs. Each VPC reaches AWS services privately through Gateway and Interface VPC endpoints, and the Direct Connect Gateway terminates the dual circuits.
IP address management plan
Addressing on a multi-cloud estate is a governance problem long before it is a routing problem. The CIDR ranges that appear throughout the network diagrams are not chosen per project; they are issued from a single, centrally governed plan held in Azure IPAM and AWS VPC IPAM, so that every block is globally non-overlapping and an address can be read like a coordinate. The whole cloud estate lives inside the 10.0.0.0/8 supernet, allocated as a predictable /16 per (cloud, region, environment, role), while the on-premises data centres and the reserved lab range sit outside it in their own RFC 1918 blocks. A /16 is large enough that no workload ever has to renumber, and the scheme is structured enough that a packet’s source address tells an operator which cloud, which region, and which tier it came from. That discipline is what makes the firewall rules, the route tables, and the on-premises BGP advertisements tractable across two clouds and four data centres.
The Azure plan encodes meaning into the two decimal digits of the second octet with the scheme 10.[role][region].0.0/16: the role digit is 1 for hub/platform, 2 for production, 3 for shared/non-prod, and the region digit runs 0–3 for the four primary regions and 4–7 for their paired-DR partners. So Region-1’s primary hub is 10.10.0.0/16, its production spoke is 10.20.0.0/16, and its shared/non-prod spoke is 10.30.0.0/16 (exactly the blocks the regional network diagrams carry), while the same region’s DR partner mirrors them at 10.14/10.24/10.34. Inside the Region-1 hub the subnets are themselves reserved by purpose (GatewaySubnet, AzureFirewallSubnet, AppGatewaySubnet, AzureBastionSubnet, and the DNS Private Resolver), and the production spoke is tiered into web, app, and data/Private-Endpoint subnets. AWS follows the same philosophy with its own ranges: a network/inspection VPC at 10.100.0.0/16, ingress and egress VPCs alongside it, a production application VPC at 10.200.0.0/16, and a shared-services VPC at 10.210.0.0/16, with two Availability Zones and /20 subnets per AZ per tier so capacity is never the constraint. On-premises occupies 172.16.0.0/12, partitioned cleanly into four /14 blocks, one per data centre, and 192.168.0.0/16 is held back entirely for lab, edge, and OT use so it can never collide with a routed range.
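The Azure scheme is mechanical enough to encode and decode in a few lines. The sketch below builds a block from a role digit and a region digit, and reads the role and region back out of any address. The function names are illustrative, and the mapping of DR digits 5 to 7 onto Regions 2 to 4 is an inference from the Region-1 example (digit 4 pairs with digit 0), not something the plan states outright.

```python
import ipaddress

ROLES = {1: "hub/platform", 2: "production", 3: "shared/non-prod"}


def azure_block(role: int, region_digit: int) -> ipaddress.IPv4Network:
    """Build the /16 for a role digit (1-3) and region digit (0-7)."""
    if role not in ROLES:
        raise ValueError(f"unknown role digit: {role}")
    if not 0 <= region_digit <= 7:
        raise ValueError(f"region digit out of range: {region_digit}")
    return ipaddress.IPv4Network(f"10.{role}{region_digit}.0.0/16")


def describe(address: str) -> dict:
    """Read role, region, and DR status back out of an Azure address."""
    octets = ipaddress.IPv4Address(address).packed
    # The second octet carries both digits: tens = role, units = region.
    role, region_digit = divmod(octets[1], 10)
    if octets[0] != 10 or role not in ROLES or region_digit > 7:
        raise ValueError(f"{address} is outside the Azure scheme")
    return {
        "role": ROLES[role],
        # Assumption: DR digit d pairs with primary digit d - 4.
        "region": f"Region-{region_digit % 4 + 1}",
        "paired_dr": region_digit >= 4,
    }


if __name__ == "__main__":
    print(azure_block(2, 0))
    print(describe("10.24.3.7"))
```

An address in the AWS ranges, such as 10.100.0.1, is rejected by `describe` because its second octet does not decode to a valid role digit.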
The master allocation below is the authoritative slice of that plan. Every row is a block the factories may draw from; everything outside it is, by policy, unallocated and therefore safe to grow into. The key invariant is that the address is the identity — because the role and region are baked into the prefix, segmentation rules and residency boundaries can be expressed against CIDR ranges directly, and a misrouted or unexpectedly sourced packet is visible the moment it crosses a firewall.
| Scope | Region / Env | CIDR | Purpose |
|---|---|---|---|
| Global supernet | All | 10.0.0.0/8 | Centrally governed private space (Azure IPAM + AWS VPC IPAM) |
| Azure hub | Region-1 primary | 10.10.0.0/16 | Connectivity hub — gateways, firewall, App Gateway, Bastion, DNS |
| Azure prod | Region-1 primary | 10.20.0.0/16 | Production workload spokes (web 10.20.1.0/24, app 10.20.2.0/24, data/PE 10.20.3.0/24) |
| Azure shared | Region-1 primary | 10.30.0.0/16 | Shared services / non-production spokes |
| Azure hub | Region-1 DR | 10.14.0.0/16 | Paired-DR connectivity hub (prod 10.24.0.0/16, shared 10.34.0.0/16) |
| Azure region-2 | Primary (hub/prod/shared) | 10.11.0.0/16 / 10.21.0.0/16 / 10.31.0.0/16 | Region-2 strategic geography |
| Azure region-3 | Primary (hub/prod/shared) | 10.12.0.0/16 / 10.22.0.0/16 / 10.32.0.0/16 | Region-3 strategic geography |
| Azure region-4 | Primary (hub/prod/shared) | 10.13.0.0/16 / 10.23.0.0/16 / 10.33.0.0/16 | Region-4 strategic geography |
| AWS network | Region-1 primary | 10.100.0.0/16 | Network / inspection VPC (Network Firewall, Transit Gateway) |
| AWS ingress/egress | Region-1 primary | 10.101.0.0/16 / 10.102.0.0/16 | Ingress VPC (ALB + WAF) and egress VPC (NAT per AZ) |
| AWS prod app | Region-1 primary | 10.200.0.0/16 | Production application VPC (two AZs, /20 subnets per tier) |
| AWS shared services | Region-1 primary | 10.210.0.0/16 | Shared-services VPC |
| On-premises | 4 data centres | 172.16.0.0/12 | DC1 172.16/14, DC2 172.20/14, DC3 172.24/14, DC4 172.28/14 |
| Reserved | Lab / edge / OT | 192.168.0.0/16 | Held back — never routed into the cloud estate |
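The non-overlap invariant is cheap to verify automatically. The sketch below loads the cloud blocks from the table, checks every pair for overlap, confirms each sits inside the 10.0.0.0/8 supernet, and derives the four on-premises /14 blocks from 172.16.0.0/12. It uses only Python's standard `ipaddress` module; the dictionary keys are illustrative labels, not names from the plan.

```python
import ipaddress
from itertools import combinations

SUPERNET = ipaddress.ip_network("10.0.0.0/8")

# Cloud blocks transcribed from the master allocation table.
CLOUD_BLOCKS = {
    "azure-r1-hub": "10.10.0.0/16", "azure-r1-prod": "10.20.0.0/16",
    "azure-r1-shared": "10.30.0.0/16", "azure-r1-dr-hub": "10.14.0.0/16",
    "azure-r1-dr-prod": "10.24.0.0/16", "azure-r1-dr-shared": "10.34.0.0/16",
    "azure-r2-hub": "10.11.0.0/16", "azure-r2-prod": "10.21.0.0/16",
    "azure-r2-shared": "10.31.0.0/16", "azure-r3-hub": "10.12.0.0/16",
    "azure-r3-prod": "10.22.0.0/16", "azure-r3-shared": "10.32.0.0/16",
    "azure-r4-hub": "10.13.0.0/16", "azure-r4-prod": "10.23.0.0/16",
    "azure-r4-shared": "10.33.0.0/16", "aws-r1-network": "10.100.0.0/16",
    "aws-r1-ingress": "10.101.0.0/16", "aws-r1-egress": "10.102.0.0/16",
    "aws-r1-prod": "10.200.0.0/16", "aws-r1-shared": "10.210.0.0/16",
}


def find_overlaps(blocks):
    """Return every pair of block names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in blocks.items()}
    return [(a, b) for (a, na), (b, nb) in combinations(nets.items(), 2)
            if na.overlaps(nb)]


def outside_supernet(blocks, supernet=SUPERNET):
    """Return the names of blocks that are not contained in the supernet."""
    return sorted(name for name, cidr in blocks.items()
                  if not ipaddress.ip_network(cidr).subnet_of(supernet))


def on_prem_blocks():
    """Split the on-premises /12 into its four /14 data-centre blocks."""
    return [str(n) for n in
            ipaddress.ip_network("172.16.0.0/12").subnets(new_prefix=14)]
```

A check like this fits naturally as a gate in the factories, so a proposed block that collides with an existing one is rejected before anything is deployed.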
Capacity and sizing
Capacity on this estate is engineered against the non-functional targets, not guessed at from a rule of thumb. Every tier is sized to the service-level objective and throughput it must sustain, and almost everything autoscales with a floor — a minimum instance count or zone-redundant footprint that holds availability up even at idle, scaling out under load rather than being provisioned for peak the whole month. The connectivity layer is the foundation: each region terminates two ExpressRoute circuits at 10 Gbps (Premium) and two Direct Connect circuits at 10 Gbps (dedicated) in active-active BGP, fronted by an Azure ErGw3AZ gateway with a VpnGw2AZ backup on the Azure side and a Direct Connect Gateway plus Transit Gateway on the AWS side. That is deliberate over-provisioning — a control tower that loses its path to SAP or to tracking stops the business, so the circuits carry headroom and fail over cleanly rather than running hot.
Above the network, compute is sized per cloud and per domain. The Azure edge and security plane runs Azure Firewall Premium (auto-scaling) and a zone-redundant Application Gateway WAF v2 that scales between 2 and 10 instances; web and application tiers run on VM Scale Sets and AKS node pools that scale from a resilient floor toward generous ceilings, backed by zone-redundant Azure SQL Business Critical and PostgreSQL Flexible Server. AWS mirrors the pattern with ALB and Network Firewall, m6i-class Auto Scaling groups and EKS node groups, and Aurora Multi-AZ or RDS for stateful data. SAP is the outlier that earns bespoke sizing: HANA runs on certified high-memory compute — Azure M-series (M128s, ~2 TB) or AWS High Memory / X2iedn — with clustered ASCS/ERS on D-series, HANA System Replication synchronous between the two production zones and asynchronous to the DR region. The IoT lane is sized to the hardest numbers in the brief — 500,000+ connected devices and a 50,000 msg/sec peak — across AWS IoT Core with throughput-sharded Kinesis into Timestream, and Azure IoT Hub with an Azure Data Explorer cluster, while the data backbone runs Event Hubs premium / Kinesis feeding a lakehouse on ADLS Gen2 and S3.
The table below maps each domain to the service and scale unit it is built on and the autoscale or high-availability posture that protects it. The discipline running through all of it is that resilience is a floor, not a feature toggled on under load — minimum instance counts, zone redundancy, and rehearsed replication are sized in from the start, and elasticity handles the variance above that line rather than the availability of the service itself.
| Domain | Service / SKU | Scale unit | Autoscale / HA |
|---|---|---|---|
| Hybrid connectivity | 2× ExpressRoute (Premium) + 2× Direct Connect (dedicated) per region | 10 Gbps per circuit | Active-active BGP; ErGw3AZ + VpnGw2AZ backup; DX Gateway + TGW |
| Azure edge security | Azure Firewall Premium | Per-region instance | Auto-scale, zone-redundant |
| Azure ingress | Application Gateway WAF v2 | 2–10 instances | Autoscale, zone-redundant |
| Azure web tier | VMSS Standard_D4s_v5 | 3–20 instances | Autoscale across zones |
| Azure app tier | VMSS Standard_D8s_v5 | 3–20 instances | Autoscale across zones |
| Azure containers | AKS node pool D8s_v5 | 3–30 nodes | Cluster autoscaler, zone-spread |
| Azure data | Azure SQL Business Critical / PostgreSQL Flexible | Per-database | Zone-redundant replicas |
| AWS workloads | m6i.xlarge / 2xlarge Auto Scaling; EKS m6i.2xlarge | Auto Scaling group / node group | Multi-AZ autoscale; ALB + Network Firewall |
| AWS data | Aurora (Multi-AZ) / RDS | Per-cluster | Multi-AZ replicas, automated failover |
| SAP HANA | Azure M128s (~2 TB) / AWS High Memory u-* / X2iedn | High-memory node per zone | HSR sync primary↔secondary AZ; async to DR |
| IoT ingestion | AWS IoT Core + Kinesis + Timestream; Azure IoT Hub + ADX | 500k devices, 50k msg/sec peak | Sharded to throughput; scaled units, ADX cluster |
| Data & integration | Event Hubs premium / Kinesis; lakehouse on ADLS Gen2 + S3 | Throughput units / shards | Elastic streaming; Databricks/Synapse + Glue/EMR |
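"Sharded to throughput" can be illustrated with a small sizing calculation for the Kinesis side of the IoT lane. The sketch assumes the commonly documented provisioned-mode write limits of 1,000 records per second and 1 MB per second per shard; both are exposed as parameters and should be checked against current service quotas. The average message size is not given in the article, so the 512-byte figure in the example is purely hypothetical.

```python
def shards_needed(msgs_per_sec: int,
                  avg_bytes: int,
                  headroom_percent: int = 0,
                  records_per_shard: int = 1_000,
                  bytes_per_shard: int = 1_000_000) -> int:
    """Shards required to absorb a write rate, limited by records or by bytes.

    Per-shard limits are assumptions passed in as parameters; integer ceiling
    division is used so the result never rounds down.
    """
    if min(msgs_per_sec, avg_bytes, records_per_shard, bytes_per_shard) <= 0:
        raise ValueError("rates, sizes, and limits must be positive")
    if headroom_percent < 0:
        raise ValueError("headroom_percent must not be negative")
    scaled = msgs_per_sec * (100 + headroom_percent)  # rate x 100, with headroom
    by_records = -(-scaled // (100 * records_per_shard))
    by_bytes = -(-(scaled * avg_bytes) // (100 * bytes_per_shard))
    return max(by_records, by_bytes)


if __name__ == "__main__":
    # 50,000 msg/sec peak from the brief; 512-byte payload is hypothetical.
    print(shards_needed(50_000, 512))
    print(shards_needed(50_000, 512, headroom_percent=25))
```

With small payloads the record-count limit dominates; once messages grow past roughly 1 KB under these assumed limits, the byte-rate limit takes over and the shard count rises with payload size.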
Identity and zero-trust control plane
Identity is the primary security perimeter, and the design is deliberately identity-first. On-premises AD DS remains authoritative for domain-joined workloads and Group Policy-bound systems; Entra Connect synchronises it into Microsoft Entra ID, which becomes the control plane for Microsoft 365, Intune, Conditional Access, PIM, guest access, and device compliance. Okta is the strategic SSO provider for SaaS and the federation broker into AWS. Workday drives joiner-mover-leaver automation into Okta over SCIM, so access is provisioned and — critically — deprovisioned from the HR source of truth. AWS access flows through Okta-federated roles; Azure administrative access flows through Entra with RBAC and PIM.
The enforcement point is a Conditional Access engine that evaluates every authentication against user risk, device compliance (Intune plus CrowdStrike posture), location, and application sensitivity, then grants, demands step-up MFA, or blocks. MFA is mandatory for everyone; phishing-resistant MFA is mandatory for privileged roles and high-risk apps. Standing privilege is eliminated: administrators activate roles just-in-time through PIM with approval and a time box, the most sensitive operations run only from Privileged Access Workstations, and a pair of monitored, cloud-only break-glass accounts exists for genuine emergencies. This is the zero-trust trinity in practice: verify explicitly, grant least privilege, assume breach.
Employee digital workplace
The hardest part of “as if you’re on the local network” is delivering the experience without extending flat network trust. The design uses three access channels. Office users get local internet breakout to Microsoft 365 and approved SaaS — protected by identity and endpoint policy rather than by backhauling everything to a data centre — plus direct private-app access. Work-from-home users reach private applications over GlobalProtect VPN with Okta/Entra MFA, and selected apps are modernised toward Microsoft Entra Private Access for per-app, identity-aware connectivity that does not rely on legacy VPN. The third channel is direct SaaS — Microsoft 365, ADP, Workday, ServiceNow, Bitbucket — all behind Okta SSO.
What ties the three channels into one experience is unified identity, consistent DNS resolution across office and VPN, and Conditional Access that is invisible to a compliant user but decisive when risk changes. ADP and every enterprise SaaS app follow the same model: SSO through Okta, mandatory MFA, role-based provisioning from HR attributes, and step-up authentication for payroll and admin actions. Device trust — Intune compliance and CrowdStrike posture — gates access to sensitive systems, and unmanaged devices are restricted to web-only sessions for Microsoft 365 or blocked outright for sensitive workloads. The user never needs to know whether a service runs on-premises, in Azure, or in AWS; application publishing hides the location.
Conditional Access and PIM policy model
Because identity does the heavy lifting, the policy model deserves its own view. Conditional Access is expressed as a decision flow: inputs (user and group, device state, location, sign-in risk, application sensitivity) feed a policy engine that outputs grant, require-MFA, require-compliant-device, block, or session control. The baseline policies are non-negotiable — block legacy authentication, require MFA for all cloud apps, require a compliant or hybrid-joined device for admin portals and sensitive data, enforce phishing-resistant MFA for privileged roles, apply session controls to unmanaged-device Microsoft 365 web access, and demand stronger controls for SAP administration and finance or operations data.
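The decision flow can be sketched as a small pure function. This is an illustration of the baseline policies above, not the Microsoft Entra Conditional Access API or its evaluation semantics: the real engine combines every applicable policy, whereas this sketch uses a simple first-match order. The `SignIn` fields, the outcome strings, and the choice to block on high sign-in risk are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignIn:
    legacy_auth: bool
    privileged_role: bool
    sensitive_app: bool
    device_compliant: bool
    sign_in_risk: str  # "low", "medium" or "high"
    mfa: str           # "none", "standard" or "phishing_resistant"


def evaluate(sign_in: SignIn) -> str:
    """Return one outcome for a sign-in, checking the strictest rules first."""
    if sign_in.legacy_auth:
        return "block"                      # baseline: block legacy authentication
    if sign_in.sign_in_risk == "high":
        return "block"                      # assumption: high risk is blocked outright
    if sign_in.privileged_role and sign_in.mfa != "phishing_resistant":
        return "require_phishing_resistant_mfa"
    if sign_in.mfa == "none":
        return "require_mfa"                # baseline: MFA for all cloud apps
    if sign_in.sensitive_app and not sign_in.device_compliant:
        return "require_compliant_device"
    if not sign_in.device_compliant:
        return "session_control"            # e.g. web-only session on unmanaged device
    return "grant"
```

For example, a compliant device with standard MFA and low risk is granted, while the same user on an unmanaged device gets a session-controlled experience for a non-sensitive app and a compliant-device demand for a sensitive one.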
PIM complements Conditional Access on the privileged path. A role is eligible rather than active; an administrator requests activation, an approver grants it, the elevation is time-bound and auto-expires, and every step is written to an immutable audit log. The combination matters: Conditional Access decides whether you may authenticate and from what device, and PIM decides whether — and for how long — you may wield privilege once you are in. Together they remove the two biggest enterprise weaknesses at once: weak authentication and standing administrative access.
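The eligible, requested, approved, time-bound lifecycle can be modelled in a few lines. The sketch below is a hypothetical in-memory model, not the PIM API: the class name, the rule that a requester cannot approve their own elevation, and the plain list standing in for the immutable audit log are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Elevation:
    """Hypothetical model of one eligible-role activation."""
    user: str
    role: str
    duration: timedelta
    approver: Optional[str] = None
    expires_at: Optional[datetime] = None
    audit: List[str] = field(default_factory=list)  # stand-in for the audit log

    def request(self, now: datetime) -> None:
        self.audit.append(f"{now.isoformat()} {self.user} requested {self.role}")

    def approve(self, approver: str, now: datetime) -> None:
        if approver == self.user:
            # Assumption: self-approval is not allowed.
            raise ValueError("requester cannot approve their own elevation")
        self.approver = approver
        self.expires_at = now + self.duration  # time-bound from approval
        self.audit.append(f"{now.isoformat()} {approver} approved {self.role}")

    def is_active(self, now: datetime) -> bool:
        # Active only after approval and before expiry: no standing access.
        return self.expires_at is not None and now < self.expires_at
```

Because expiry is derived from the approval time rather than stored as a flag, the elevation lapses on its own without a revocation step.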
Global edge and ingress
Public ingress is a layered funnel, and the proposal mandates that its outer tiers be third-party and span both clouds. A request resolves through third-party authoritative DNS with global traffic management, health checks, and automatic failover. It lands on a third-party CDN for global anycast ingress, caching, and TLS. It passes through a third-party WAF that applies OWASP rules in prevention mode, bot management, API security with schema validation, rate limiting, country and ASN allow-deny lists, and reputation filtering. Only the CDN/WAF egress addresses are permitted to reach the origins: this origin cloaking removes the option of bypassing the edge. Regional Azure Application Gateway WAF and AWS ALB-plus-WAF act as a secondary enforcement and app-delivery tier before traffic reaches the application and its private data services.
The win is defence in depth with a single global control point. A new bot-mitigation rule or a country block is applied once at the edge and protects every public application in both clouds, while the regional WAF tier catches anything cloud-specific and keeps origins private. Every transaction, log, and security signal from the edge is forwarded to the SIEM and to Dynatrace, so the same request can be followed from the user’s browser all the way to the database.
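Origin cloaking reduces to one check at the origin: is the caller inside a published edge egress range? A minimal sketch using Python's standard `ipaddress` module follows. The two ranges are RFC 5737 documentation placeholders, not real CDN addresses; in practice the list would come from the edge provider.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks). Real CDN/WAF egress
# ranges would be taken from the edge provider's published list.
EDGE_EGRESS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def from_edge(source_ip: str) -> bool:
    """True if the caller's address is inside an allowed edge egress range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in EDGE_EGRESS)
```

The same allow list would normally be rendered into NSG or security-group rules rather than checked in application code; the function only shows the logic those rules encode.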
Multi-layer security model
Zero trust is not a single product; it is seven layers that each assume the layer outside them may already be compromised. Identity (Okta, Entra, MFA, PIM, Conditional Access, SCIM) is layer one. Device (Intune, CrowdStrike Falcon, encryption, EDR) is layer two. Network (segmented hubs, firewalls, private endpoints, restricted peering, inspection) is layer three. Application (secure SDLC, Wiz Code, SAST, dependency scanning, secrets management, WAF, API protection) is layer four. Workload (hardened images, vulnerability management, image signing, runtime detection) is layer five. Data (encryption, classification, tokenisation, DLP) is layer six. Monitoring (Dynatrace, cloud logs, SIEM, ServiceNow) is layer seven. Spanning all of them is Wiz for cloud security posture and exposure management.
The reason to draw it as layers is operational: each layer has an owner, a tool, and a measurable control, and a gap in one is caught by the next. If a credential is phished (layer one), device compliance and runtime detection (layers two and five) still stand in the way; if a workload is exploited (layer five), network segmentation and data encryption (layers three and six) limit what the attacker can reach and read. Posture management ties the picture together by continuously checking that the controls are actually present and configured, not merely documented.
Multi-layer network security
The network expresses the same defence-in-depth as a path through six controls. Layer one is the third-party DNS/CDN/WAF for public ingress. Layer two is the Azure Application Gateway WAF and AWS regional ingress that protect origins and route at the app level. Layer three is Azure Firewall Premium and AWS Network Firewall behind a centralised inspection VPC/VNet, enforcing policy-controlled egress, optional TLS inspection, threat intelligence, and east-west rules. Layer four is the workhorse — NSGs, security groups, route tables, and subnet segmentation implementing a default-deny posture between application environments. Layer five is private connectivity to PaaS through Private Endpoints, Private Link, and VPC endpoints with private DNS. Layer six is the host and workload itself, hardened and running with least-privilege identity.
The same six layers look different inside each cloud, so each earns its own deep dive. In Azure, ingress lands on Application Gateway WAF v2, egress and east-west are forced through Azure Firewall Premium (with IDPS and TLS inspection) by user-defined routes, NSGs and Application Security Groups segment the spoke subnets default-deny, and PaaS is reachable only over Private Endpoints with public access disabled.
In AWS, ingress is an ALB behind AWS WAF, all east-west and egress traffic is steered through a centralised AWS Network Firewall in an appliance-mode inspection VPC, security groups and network ACLs enforce default-deny, and services are reached privately through Gateway and Interface VPC endpoints. Administration runs through Session Manager rather than open SSH.
Two rules make this design defensible rather than merely layered. First, no unrestricted east-west routing — application environments cannot talk to each other unless a rule explicitly allows it, which contains lateral movement. Second, internet egress is policy-controlled and logged, so an exfiltration attempt has to pass a firewall that is recording it. Administrative access traverses only approved secure paths (Bastion and jump services), and flow logs from every layer feed central monitoring so the network is observable as well as segmented.
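The first rule, no east-west traffic without an explicit allow, is easy to state as code. The sketch below is a toy evaluator with hypothetical segment names and ports; it is not NSG or security-group syntax.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    src: str   # source segment
    dst: str   # destination segment
    port: int  # destination port


# Hypothetical allow list; any flow not listed here is denied.
ALLOW = {
    Rule("web-prod", "api-prod", 443),
    Rule("api-prod", "db-prod", 1433),
}


def is_allowed(src: str, dst: str, port: int) -> bool:
    """Default-deny: a flow passes only if an explicit rule matches it."""
    return Rule(src, dst, port) in ALLOW
```

Note that the web tier cannot reach the database directly even though both sit in production: only the hops that were deliberately listed exist.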
Compliance, data residency and control mapping
A global carrier that handles card payments at the point of booking, personal data for consignees across the European Union, and temperature-controlled pharmaceutical shipments under cold-chain custody does not get to assert that it is secure — it has to prove it, repeatedly, to auditors who do not take diagrams on trust. Four frameworks are therefore in scope, each treated as a distinct obligation rather than one undifferentiated “compliance” bucket. ISO/IEC 27001:2022 is the umbrella information-security management system the whole estate is certified against; SOC 2 Type II evidences that the security, availability, and confidentiality controls operated effectively over a period, not merely on audit day; PCI-DSS v4.0 applies only to the payment paths, which are deliberately tokenised and network-segmented so the cardholder-data environment stays small and the rest of the estate falls out of scope; and GDPR, with country-specific data-residency and localisation rules, governs how EU personal data and other regulated records are stored, moved, and erased. The design philosophy throughout is that one control set, expressed as policy-as-code and inherited by every landing zone, satisfies several frameworks at once — so the carrier maintains a single estate of controls and maps it outward to ISO, NIST, and CIS rather than running a separate stack per certification.
The pivot that makes residency tractable is a four-level data classification baked into the platform: Public, Internal, Confidential, and Restricted, where Restricted covers exactly the data the regulators care about most — personally identifiable information, payment data, and pharma cold-chain custody records. Classification is not a spreadsheet exercise; it is enforced where data lives. Restricted data stays in-region by default, the medallion lakehouse governance band carries the catalog, lineage, and classification tags that pin each dataset to an allowed geography, and regulated data crosses a border only over a governed, consented path the governance band itself authorises. That single rule — residency enforced at the data layer rather than asked of every application — is what lets a business operating in 70-plus countries share data globally without quietly breaching a localisation law somewhere in the network.
The control matrix below is the spine of the evidence pack: each row is a single auditable line per control domain, restating the control this estate actually runs in the language each assessor speaks — ISO 27001:2022 Annex A, NIST CSF plus the 800-53 family, and CIS Controls v8 — alongside the team that owns it. Crucially, every control names a real tool from this estate, because an auditor asks “show me” and the answer has to be Entra, Sentinel, Wiz, or an immutable Log Archive, not an aspiration.
| Control domain | This estate’s control | ISO 27001:2022 | NIST CSF / 800-53 | CIS v8 | Owner |
|---|---|---|---|---|---|
| Identity & access | Entra ID + Okta SSO/federation, Conditional Access, PIM JIT (approval + time-box), MFA for all and phishing-resistant for privileged, SCIM leaver flow from Workday | A.5.15, A.5.16, A.5.17, A.5.18 | PR.AA / IA-2, IA-5, AC-2, AC-6 | CIS 5, 6 | Cloud Security & Identity |
| Network security | Azure Firewall Premium + AWS Network Firewall behind centralised inspection, default-deny segmentation, Private Endpoints, no unrestricted east-west | A.8.20, A.8.21, A.8.22 | PR.IR / SC-7, AC-4 | CIS 4, 12 | Network & Connectivity |
| Data protection | Encryption in transit and at rest (Key Vault HSM / KMS), four-level classification, payment-path tokenisation, DLP | A.8.10, A.8.11, A.8.12, A.8.24 | PR.DS / SC-28, SC-13 | CIS 3 | Cloud Security & Identity |
| Logging & monitoring | Microsoft Sentinel SIEM, Azure Monitor / Log Analytics + CloudTrail / CloudWatch, immutable Log Archive (S3 Object Lock) | A.8.15, A.8.16 | DE.CM, DE.AE / AU-2, AU-6, AU-9 | CIS 8 | Cloud Operations |
| Vulnerability management | Wiz exposure scanning, Defender for Cloud + AWS GuardDuty, patch SLA (critical 7 days / high 30 days) | A.8.8 | ID.RA / RA-5, SI-2 | CIS 7 | Cloud Platform Engineering |
| Backup & disaster recovery | Geo-redundant and cross-cloud backup, recovery-tier RTO/RPO model, immutable restore points, rehearsed failover | A.8.13, A.8.14 | RC.RP / CP-9, CP-10 | CIS 11 | Cloud Operations |
| Change management | Mandatory PR review, policy-as-code gates (tfsec / Checkov), CAB approval tied to ServiceNow change | A.8.32 | PR.PS / CM-3, CM-4 | CIS 4 | Cloud Platform Engineering |
| Endpoint security | Microsoft Intune compliance, CrowdStrike Falcon EDR, full-disk encryption, device trust gating Conditional Access | A.8.1, A.8.7 | PR.PS / SI-3, SI-4 | CIS 1, 2 | Cloud Security & Identity |
| Cloud posture | Wiz CSPM with posture score ≥ 85 enforced as an onboarding gate, drift and exposure detection across both clouds | A.5.36, A.8.9 | ID.IM, PR.PS / CA-2, CM-6 | CIS 7 | Cloud Security & Identity |
| Software supply chain | Wiz Code (SAST / SCA / IaC scanning), governed internal artifact repositories, container scanning and image signing | A.8.28, A.8.30, A.8.31 | PR.PS / SA-11, SA-12, SR-3 | CIS 16 | Cloud Platform Engineering |
| Application security | Secure SDLC, App Gateway WAF v2 + AWS WAF, third-party edge WAF (OWASP, bot, API schema), runtime-injected secrets | A.8.25, A.8.26, A.8.27 | PR.PS / SC-7, SA-15 | CIS 16 | Application Enablement |
| Compliance & audit | ISMS governance, SOC 2 Type II evidence collection, PCI-DSS scope management on segmented payment paths, ServiceNow CMDB system of record | A.5.31, A.5.34, A.5.35 | GV.OC, ID.GV / CA-7, PM-9 | CIS 17, 18 | Cloud Center of Excellence |
Two patterns in that matrix do the heavy lifting. First, identity and posture are the domains where a single control answers to every framework at once — Conditional Access plus PIM is simultaneously ISO A.5.15–A.5.18, NIST PR.AA, and CIS 5/6, and a Wiz posture score above threshold is at once an ISO A.5.36 expectation, a NIST ID.IM signal, and a CIS 7 measurement — so the carrier earns disproportionate audit coverage from getting those two right. Second, the immutable Log Archive with Object Lock is the keystone of the evidence chain: it satisfies ISO A.8.15, NIST AU-9, and the SOC 2 criteria for tamper-evident audit trails with the same artefact, and it is the reason a ransomware actor cannot erase the very logs that would prove what they did.
Residency then gets its own table, because the rule differs by data class and so does the enforcement point: Public material can sit on the global edge, while Restricted records cannot leave their geography without an explicit, governed crossing. The table makes the boundary precise.
| Data class | Examples | Residency rule | Enforcement |
|---|---|---|---|
| Public | Marketing pages, public tracking status, published rate cards | No restriction; served globally from the edge CDN | Third-party CDN/WAF; no regulated content permitted at this tier |
| Internal | Operational dashboards, internal knowledge, non-sensitive telemetry | Regional preference; cross-region replication allowed for resilience | Lakehouse governance band tags; storage account / S3 region policy |
| Confidential | Commercial contracts, partner agreements, pricing models, business analytics | Stored in approved business regions; access on least-privilege and need-to-know | RBAC + Conditional Access; encryption with Key Vault HSM / KMS; DLP |
| Restricted — personal data (GDPR) | Consignee PII, employee records, contact and address data | Stays in-region (EU data in EU); cross-border only over governed, consented paths | Governance-band classification + residency rules; Private Endpoints; lineage and consent tracking |
| Restricted — payment & cold-chain | Cardholder data, payment tokens, pharma temperature and chain-of-custody records | Payment data tokenised and confined to the segmented PCI scope; cold-chain records pinned in-region | Tokenisation + network segmentation (PCI-DSS v4.0); Object Lock retention; in-region storage only |
The thread tying the section together is that compliance here is a property of the platform, not a quarterly scramble. Because classification, residency, encryption, logging, and posture are all enforced by inherited policy and named tooling — and because every one maps cleanly to ISO, NIST, and CIS through the matrix above — the carrier can hand an auditor a single coherent control estate and demonstrate, with evidence rather than assertion, that a parcel’s payment, a consignee’s personal data, and a pharma shipment’s cold-chain record are each governed exactly as the regulator demands.
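The residency table can be read as a decision function over data class, home region, and target region. The following is a sketch under stated assumptions: the region-to-geography map is illustrative, "in-region" is interpreted as "same geography", and the class labels are shortened versions of the table rows.

```python
# Illustrative region-to-geography map; the real list of approved regions
# is an assumption here.
GEO = {
    "westeurope": "EU",
    "eu-central-1": "EU",
    "eastus": "US",
    "us-east-1": "US",
}


def may_replicate(data_class: str, home: str, target: str,
                  governed_path: bool = False) -> bool:
    """Decide whether a dataset may be copied from `home` to `target`."""
    same_geo = home in GEO and GEO[home] == GEO.get(target)
    if data_class in ("public", "internal"):
        return True                       # global edge / replication for resilience
    if data_class == "confidential":
        return target in GEO              # approved business regions only
    if data_class == "restricted-personal":
        return same_geo or governed_path  # cross-border only when governed and consented
    if data_class == "restricted-payment-coldchain":
        return same_geo                   # in-region storage only
    raise ValueError(f"unknown data class: {data_class}")
```

Unknown classes raise rather than default to allow, so a dataset that has not been classified cannot be moved by accident.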
Application onboarding for 100+ applications
A foundation is only as good as the rate at which workloads can land on it safely. The onboarding model turns 100-plus applications into a repeatable pipeline rather than 100 bespoke projects. A request is raised in ServiceNow; an automated vend (Azure subscription factory or AWS Account Factory) creates the environment under the correct management group or OU with policy, RBAC, tags, and hub or Transit Gateway attachment; a baseline is applied at creation — logging, monitoring, Wiz and CrowdStrike enrolment, backup, private DNS linking, and CI/CD bootstrap. A blueprint per application type (web, API, integration, SAP-adjacent, data, batch) supplies the right network and platform pattern, always with production and non-production separation.
The vend itself is cloud-specific but symmetrical. In Azure, a subscription is created under the right management group so it inherits Azure Policy and RBAC, then wired with diagnostic settings, Defender for Cloud, Wiz and CrowdStrike enrolment, a Recovery Services backup, hub peering, and Private DNS links.
In AWS, Account Factory for Terraform provisions the account into the right OU under its SCP guardrails, then applies CloudTrail and Config logging, GuardDuty, Security Hub, Wiz and CrowdStrike enrolment, AWS Backup, a Transit Gateway attachment, and IAM Identity Center permission sets.
Crucially, nothing reaches production until it passes a fixed set of gates: a completed threat model, a passed security baseline, a Wiz posture score at or above the onboarding threshold (85 in the control matrix), an active Dynatrace observability baseline, a DR classification, a tested backup and restore, and a registered ServiceNow configuration item with a runbook. Because the blueprint, the gates, and the pipeline templates are all reusable, the platform team improves the standard once and propagates it to every team through controlled version upgrades: agility for application teams, governance for the platform.
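The production gates lend themselves to a single check that reports exactly which gate is blocking a release. The sketch below uses hypothetical gate keys; the threshold of 85 is the posture score stated in the control matrix.

```python
# Hypothetical gate keys, one per gate named in the onboarding model.
BOOLEAN_GATES = (
    "threat_model_complete",
    "security_baseline_passed",
    "dynatrace_baseline_active",
    "dr_classification_set",
    "backup_restore_tested",
    "servicenow_ci_registered",
)
WIZ_THRESHOLD = 85  # posture score gate from the control matrix


def failing_gates(checks: dict, wiz_score: float) -> list:
    """Return the names of every gate that is not yet satisfied."""
    failed = [g for g in BOOLEAN_GATES if not checks.get(g, False)]
    if wiz_score < WIZ_THRESHOLD:
        failed.append("wiz_posture_score")
    return failed


def ready_for_production(checks: dict, wiz_score: float) -> bool:
    return not failing_gates(checks, wiz_score)
```

A missing key counts as a failed gate, so an application that never reported a result cannot pass by omission.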
SAP landing zone
SAP is treated as a business-critical shared domain with its own landing zone, not as just another application. Production runs across two availability zones: clustered ASCS/ERS application servers and a SAP HANA database replicated synchronously between zones with HANA System Replication, on high-memory certified compute. QA, non-production, and sandbox environments are separated from production with their own change windows. A paired DR region holds a warm-standby HANA replicated asynchronously, with a documented and rehearsed failover path. Connectivity to on-premises systems and business partners runs privately over the same ExpressRoute and Direct Connect circuits the rest of the estate uses.
The integration surface is as important as the runtime. SAP connects to on-premises identity and legacy services, to the customer-facing logistics platforms and their APIs, to the data platform for reporting and analytics, and to ServiceNow for CMDB and change. Each of those is a private, governed path with its own privileged-access controls, because SAP sits on the critical-data tier and earns the strongest authentication and segmentation in the estate. Dedicated backup, monitoring through Dynatrace, and a separate change cadence keep SAP’s stringent availability and performance requirements from being diluted by general-purpose platform operations.
Data and integration platform
End-to-end visibility, shipment and fleet tracking, partner integration, and sustainability reporting all depend on treating data as a platform capability rather than something each application hoards locally. Sources — SAP, the 100-plus applications, IoT telemetry, and partner feeds — flow into an event-driven integration backbone (Azure Event Hubs, Kafka, or AWS Kinesis) carrying booking, tracking, telemetry, and exception events. A medallion lakehouse organises the data into raw, curated, and governed zones separated by domain, with a master-data band for customers, assets, routes, depots, and partners. Serving spans analytics and BI, sustainability and fleet-efficiency datasets, and secure partner API exposure through the edge WAF and API management.
The governance band — catalog, lineage, classification, and regional data-residency controls — is what lets a global business share data without breaking country rules. Event-driven integration decouples producers from consumers, so a new analytics use case or a new partner feed can subscribe to existing event streams without touching the source systems. That decoupling is the difference between a data platform that scales with the business and a tangle of point-to-point integrations that calcifies it.
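The decoupling argument is easiest to see in miniature. The class below is an in-memory stand-in for the event backbone, purely for illustration; it is not the Event Hubs, Kafka, or Kinesis client API, and the topic name is hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Toy publish/subscribe bus: producers never reference consumers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(event)
        return len(handlers)
```

Adding a second consumer to `"tracking"` is one more `subscribe` call; the code that publishes tracking events does not change, which is the property the paragraph above depends on.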
Disaster recovery and resiliency
Resilience is designed by classifying workloads into recovery tiers and engineering each tier to its target. Tier 0 (identity, the network control plane, DNS, VPN, and security tooling) carries an RTO of one hour and an RPO of fifteen minutes because nothing else recovers until it does. Tier 1 (customer logistics platforms, SAP production, and partner APIs) matches that fifteen-minute RPO with a two-hour RTO. Tier 2 (line-of-business apps and analytics) relaxes to an eight-hour RTO and a four-hour RPO, and Tier 3 (dev, test, and non-critical reporting) to twenty-four hours for both. Each strategic geography runs a primary and a paired DR region in both Azure and AWS; edge services and selected APIs run active-active behind global DNS failover, while stateful back ends use active-passive or warm standby where active-active is impractical.
| Tier | Example workloads | RTO | RPO |
|---|---|---|---|
| Tier 0 | Identity, connectivity, security control plane | 1 hour | 15 minutes |
| Tier 1 | Customer platforms, SAP prod, partner APIs | 2 hours | 15 minutes |
| Tier 2 | Internal apps, analytics, collaboration | 8 hours | 4 hours |
| Tier 3 | Dev, test, non-critical reporting | 24 hours | 24 hours |
The redundant ExpressRoute and Direct Connect circuits, cross-cloud and cross-region backup, and regular DR testing — identity recovery, DNS failover, circuit failover, and SAP recovery — turn these numbers from aspiration into something the business can actually rely on. A recovery target that has never been rehearsed is a guess; the operations model bakes the rehearsals in.
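A DR rehearsal only means something if its measured timings are compared with the tier targets. A minimal sketch of that comparison, with the targets copied from the tier table and the function name chosen for illustration:

```python
from datetime import timedelta

# (RTO, RPO) per tier, taken from the recovery-tier table.
TARGETS = {
    0: (timedelta(hours=1), timedelta(minutes=15)),
    1: (timedelta(hours=2), timedelta(minutes=15)),
    2: (timedelta(hours=8), timedelta(hours=4)),
    3: (timedelta(hours=24), timedelta(hours=24)),
}


def within_target(tier: int, recovery_time: timedelta,
                  data_loss: timedelta) -> bool:
    """True if a rehearsal met both the RTO and the RPO for its tier."""
    rto, rpo = TARGETS[tier]
    return recovery_time <= rto and data_loss <= rpo
```

The same 90-minute recovery passes for a Tier 1 workload and fails for a Tier 0 one, which is why the tier has to be recorded with every test result.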
Disaster recovery runbooks
The tier table sets targets; runbooks are what make them executable. The DR tiers earlier in this document set the contractual envelope (Tier 0 at RTO 1 hour / RPO 15 minutes, Tier 1 at RTO 2 hours / RPO 15 minutes, Tier 2 at RTO 8 hours / RPO 4 hours, and Tier 3 at RTO 24 hours / RPO 24 hours), but the envelope is only credible if every scenario behind it has a written procedure with a named trigger, ordered steps, a single accountable owner, and an explicit validation that proves the workload is actually serving traffic again. The runbooks below are deliberately operational rather than aspirational: each one is the script an on-call engineer follows under pressure at 03:00, not a paragraph of intent. They presume the dual-cloud, four-region estate already in place (paired DR regions in both Azure and AWS, asynchronous HANA System Replication to the DR region, immutable backup with Object Lock in the AWS Log Archive, and third-party DNS health-checking in front of both clouds), so recovery is a matter of promotion and re-pointing, not rebuilding.
Procedures decay unless they are exercised, so the operating model treats DR as a calendar event rather than an emergency-only capability. The carrier runs quarterly per-tier component tests — a rotating schedule in which one tier’s recovery path (identity failover, a SAP DR promotion, a data-replay, an edge origin swap) is rehearsed in isolation against the paired region, with results, timings, and any deviation from RTO/RPO logged as a ServiceNow change and fed back into the runbook. Once a year the programme runs a full-region failover game-day: an entire strategic region is declared lost and the dependent workloads are failed over in tier order — Tier 0 first to re-establish the identity and connectivity control plane, then Tier 1, and so on — with business stakeholders signing off recovery before a controlled failback. Game-day findings are the single most reliable source of runbook corrections, because they surface the dependencies and sequencing assumptions that component tests in isolation never expose.
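Failing over "in tier order" is a sort, and writing it down removes ambiguity on game-day. A minimal sketch, assuming a hypothetical mapping of workload name to tier:

```python
def failover_order(workloads: dict) -> list:
    """Order workload names by tier (0 first), then by name for a stable plan."""
    return sorted(workloads, key=lambda name: (workloads[name], name))


# Hypothetical inventory used only to show the call.
plan = failover_order({"sap-prod": 1, "entra": 0, "analytics": 2, "dns": 0})
```

Sorting on the name as a tiebreaker keeps the plan identical between runs, so a deviation seen during a rehearsal reflects the estate and not the ordering.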
| Scenario | Trigger | Procedure steps | Owner | Target | Validation |
|---|---|---|---|---|---|
| Identity & control-plane recovery (Tier 0) | Loss of Microsoft Entra ID / AD DS authentication or security control plane in primary region | Detect via Sentinel/Dynatrace alert → activate a monitored break-glass account → fail Entra ID and on-prem AD DS services over to the paired region → trigger third-party DNS failover for identity endpoints → confirm Conditional Access and PIM are serving → controlled failback once primary is healthy | Cloud Security & Identity | RTO 1 hour / RPO 15 minutes | Test interactive sign-in, MFA challenge and a PIM JIT elevation against the recovered control plane |
| Connectivity / circuit failover | Loss of an ExpressRoute or Direct Connect circuit in a region | Active-active BGP withdraws the failed path automatically → on-call confirms the surviving circuit is carrying the load and is not saturated → raise carrier ticket → restore and re-balance BGP on circuit return | Network & Connectivity | Sub-minute (automatic) | Synthetic reachability probe across the surviving path; confirm no asymmetric routing |
| SAP DR (Tier 1) | Loss of the SAP primary region or unrecoverable HANA primary | Declare DR → promote the asynchronously replicated HANA secondary in the paired DR region → start ASCS/ERS in DR → re-point application servers and integration to DR endpoints → run smoke tests → obtain business sign-off → controlled failback when primary is restored | Cloud Platform Engineering + SAP Basis | RTO 2 hours / RPO 15 minutes | Execute the agreed SAP transaction set (order-to-cash / tracking posting) and reconcile against last committed state |
| Edge / application failover | DNS health checks mark a region’s origin unhealthy | Third-party DNS + health checks automatically fail the affected service over from Azure origin to AWS origin (or vice versa) → on-call confirms the healthy origin is scaled and serving → monitor error budget → revert when the failed origin recovers | Network & Connectivity + Application Enablement | Automatic | Synthetic end-to-end user journey (login → track shipment) against the failed-over origin |
| Data platform recovery (Tier 2) | Loss or corruption of streaming / lakehouse storage in a region | Replay event streams from retained offsets (Event Hubs / Kafka / Kinesis) → restore the lakehouse from geo-redundant storage in the paired region → re-run curation to rebuild governed and master-data bands → resume downstream consumers | Cloud Operations | RTO 8 hours / RPO 4 hours | Record-count and checksum reconciliation between source offsets and restored curated datasets |
| Ransomware / full region loss | Confirmed destructive compromise or total region outage | Isolate the affected accounts/subscriptions → stand up (or vend) a clean landing zone → restore from immutable, Object-Lock backup into the clean estate → recover workloads in strict tier order (0 → 1 → 2 → 3) → forensics before any re-connection | Cloud Security & Identity + Cloud Operations | Per affected tier | Integrity verification of restored data against backup hashes; clean-state attestation before traffic is admitted |
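Several validations in the runbook table come down to record-count and checksum reconciliation between a source and a restored copy. One way to sketch that, assuming records are available as byte strings and that ordering may differ after a restore:

```python
import hashlib


def dataset_digest(records) -> tuple:
    """Return (record count, order-independent SHA-256 digest)."""
    digests = sorted(hashlib.sha256(r).hexdigest() for r in records)
    combined = hashlib.sha256("".join(digests).encode()).hexdigest()
    return len(digests), combined


def reconciled(source, restored) -> bool:
    """True when both datasets hold the same records, in any order."""
    return dataset_digest(source) == dataset_digest(restored)
```

Hashing each record before sorting keeps the comparison insensitive to replay order while still catching a dropped, duplicated, or altered record. For very large datasets the per-record digests would be computed per partition rather than held in one list.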
Terraform and Ansible multi-stage CI/CD
Every layer above is delivered as code through a promotion pipeline that moves the same reviewed change from Dev to UAT to Staging to Production. Source lives in Bitbucket — Terraform modules, environment definitions, Ansible roles, and deployment manifests — behind mandatory pull-request review. Static validation runs first: Terraform fmt, validate, and lint, policy-as-code with tfsec or Checkov, and secrets scanning, all before a plan is allowed. Terraform applies against separate, locked, encrypted remote state per environment, region, and scope so a change in one environment can never corrupt another. Ansible then enforces OS, middleware, and SAP-host configuration from a dynamic Azure/AWS inventory using idempotent roles, with secrets injected at runtime rather than stored in code.
Two controls make this production-grade. Artifact-based promotion means the exact change validated in Dev is what reaches Production — environments stop drifting because they are no longer rebuilt differently each time. And manual approval gates before Staging and Production, with security and platform sign-off, give change governance a real enforcement point that ties into ServiceNow change records. The subscription and account factories feed the same pipeline, so even the creation of new landing zones is a reviewed, auditable code change.
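Artifact-based promotion and approval gates can be combined in one guard: promote only the bytes that were validated, and only with the required sign-offs. The sketch below is hypothetical pipeline logic, not Bitbucket Pipelines or Terraform syntax; the stage and approver names follow the text above.

```python
import hashlib

STAGES = ("dev", "uat", "staging", "production")
NEEDS_APPROVAL = {"staging", "production"}
REQUIRED_SIGNOFF = {"security", "platform"}


def promote(artifact: bytes, validated_digest: str, target: str,
            approvals: set) -> str:
    """Promote `artifact` to `target` only if it is the validated build."""
    if target not in STAGES:
        raise ValueError(f"unknown stage: {target}")
    digest = hashlib.sha256(artifact).hexdigest()
    if digest != validated_digest:
        # A rebuilt artifact is a different artifact, even from the same source.
        raise ValueError("artifact differs from the one validated in Dev")
    if target in NEEDS_APPROVAL and not REQUIRED_SIGNOFF <= approvals:
        raise PermissionError("security and platform sign-off required")
    return f"{digest[:12]} -> {target}"
```

Comparing digests rather than version labels is what enforces "built once, promoted everywhere": a label can be reused, a hash cannot.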
DevSecOps software supply chain
The pipeline above is one slice of a larger software supply chain that the proposal asks to be engineered end to end. Reuse is enforced through a deliberate repository strategy: dedicated projects for Terraform modules, for YAML pipeline templates, for application code, for artifacts (npm, Maven, PyPI, NuGet, containers, generic), for policy-as-code, for Ansible roles, and for shared test assets. An application repository composes the shared modules and templates rather than duplicating them, so the platform team raises the standard once and every application inherits it on a controlled upgrade. Security runs across the whole path — SAST, secrets, SCA, and IaC/policy scanning (Wiz Code) early; build and dependency resolution through governed internal artifact repositories; container build, scan, and signing; Terraform plan and Ansible validation; functional, API, smoke, DAST, UI, integration, load, and resilience testing across the environments; database schema-as-code migrations that are versioned, reviewed, and approved like any other change; and mobile build, scan, and distribution through UEM test rings.
The principle running through it is traceability from source to production: every release path includes layered scanning and quality gates, every artifact is built once and promoted rather than rebuilt, and every production change — including database schema and mobile builds — carries provenance, approval, and a rollback plan. Treating endpoints and mobile delivery as pipeline-integrated release domains, and database changes as code, closes the two gaps that most enterprise pipelines leave open.
IoT connected logistics
The fleet is where the business meets the physical world, and IoT is a core platform capability rather than an add-on. Device and edge classes range from truck telematics and GPS to container low-power trackers, BLE/RFID asset tags, environmental sensors for temperature-sensitive cargo, and depot edge gateways that aggregate where direct cloud connectivity is unreliable. Every device carries a certificate-based identity and supports store-and-forward for intermittent links. Connectivity spans cellular, LPWAN, and Wi-Fi, with cross-border roaming across 70-plus countries. Ingestion is deliberately split across both clouds: AWS IoT Core handles fleet-scale vehicle telemetry with its device registry, mutual TLS, and rules engine, while Azure IoT Hub carries the operational and control-tower telemetry that integrates with business processes.
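Store-and-forward on an intermittent link can be sketched as a bounded queue that drains oldest-first and stops at the first failed send. The class below is an illustration only: the capacity, the drop-oldest policy when the buffer is full, and the `send` callback are assumptions, not a device SDK.

```python
from collections import deque


class StoreAndForward:
    """Buffer readings while offline and forward them when the link returns."""

    def __init__(self, capacity: int):
        # Assumption: when full, the oldest reading is dropped first.
        self._buffer = deque(maxlen=capacity)

    def record(self, reading) -> None:
        self._buffer.append(reading)

    def flush(self, send) -> int:
        """Send buffered readings oldest-first; stop at the first failure."""
        sent = 0
        while self._buffer:
            if not send(self._buffer[0]):
                break               # link dropped again: keep the rest
            self._buffer.popleft()  # remove only after a confirmed send
            sent += 1
        return sent
```

Removing a reading only after `send` confirms it means a link drop mid-flush can cause a duplicate upstream but not a loss, so the ingestion side should tolerate replays.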
Telemetry then fans out across three processing paths: a real-time path (Stream Analytics or Kinesis) for geofence, route-deviation, threshold-breach, and anomaly alerts; a near-real-time path feeding control-tower and customer visibility; and a historical path landing in the lakehouse for trend analysis, maintenance planning, and sustainability reporting. The platform integrates with SAP for shipment and asset context, with customer tracking portals, with ServiceNow for asset and incident workflows, and with notification services. The same zero-trust principles apply down to the device: unique identity per device, mutual authentication, certificate lifecycle management, a quarantine path for anomalous devices, and segmentation between device ingestion, management, analytics, and business networks. Hot, warm, and cold data zones keep live dashboards fast while preserving years of history for claims and emissions analysis.
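The real-time path's geofence and threshold rules are simple to express per reading. The sketch below uses the standard haversine great-circle formula; the reading's field names, the circular fence, and the temperature band passed in are illustrative assumptions, not the Stream Analytics or Kinesis rule syntax.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres (mean Earth radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def alerts(reading: dict, fence_center: tuple, fence_radius_km: float,
           temp_range_c: tuple) -> list:
    """Return the alert types raised by one telemetry reading."""
    found = []
    distance = haversine_km(reading["lat"], reading["lon"], *fence_center)
    if distance > fence_radius_km:
        found.append("geofence")    # outside the permitted circle
    low, high = temp_range_c
    if not low <= reading["temp_c"] <= high:
        found.append("threshold")   # temperature outside the allowed band
    return found
```

A reading can raise both alerts at once, so the function returns a list rather than a single status.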
Each cloud’s ingestion path earns its own deep dive. AWS is the primary for fleet-scale vehicle data: devices connect through AWS IoT Core with its rules engine, IoT FleetWise collects vehicle signals, Device Defender watches for anomalies, and Kinesis with Managed Service for Apache Flink drives the real-time path into Timestream and the lakehouse.
Azure is the primary for operational and control-tower integration: IoT Hub and the Device Provisioning Service handle device identity and twins, Stream Analytics runs the geofence and threshold rules, Logic Apps push events into SAP and ServiceNow, and Azure Data Explorer holds the queryable time-series history.
Observability and SOC integration
A platform this large is only operable if it is observable end to end. Dynatrace is the strategic observability plane, tracing a single request from the user through the CDN and WAF, into cloud ingress, the application, the database, and downstream integrations — the same path the edge and security diagrams describe, now instrumented. Around that topology sit Real User Monitoring for web apps, synthetic monitoring for customer portals and control towers, infrastructure and Kubernetes observability, and cost and capacity dashboards. Cloud-native telemetry from Azure Monitor, Log Analytics, and CloudWatch feeds the same plane. On the security side, cloud audit logs plus Wiz, CrowdStrike, and WAF signals flow into a SIEM (Microsoft Sentinel) and on into SOC workflows, with automated incident creation and enrichment in ServiceNow against the CMDB.
The outcome the business cares about is a closed loop: a degradation or an attack is detected, correlated, and turned into an enriched ServiceNow incident with the affected configuration item already attached, so the right team acts on context instead of a raw alert. Observability and security stop being separate consoles and become two views of the same telemetry, which is exactly what an estate spanning two clouds, four data centres, a global workforce, and a connected fleet needs.
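The closed loop can be sketched as a correlation-and-enrichment step. The example below is a minimal illustration and does not call any ServiceNow, Dynatrace, or Sentinel API. The `Alert`, `ConfigurationItem`, and `Incident` types, the in-memory `cmdb` dictionary, and the fallback team are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    source: str       # e.g. "Dynatrace" or "Wiz"
    resource_id: str  # identifier shared by the alerting tools (assumed)
    summary: str


@dataclass(frozen=True)
class ConfigurationItem:
    ci_id: str
    name: str
    owner_team: str


@dataclass(frozen=True)
class Incident:
    summary: str
    sources: tuple
    ci: Optional[ConfigurationItem]
    assigned_team: str


def correlate_and_enrich(alerts, cmdb, fallback_team="Cloud Operations"):
    """Group alerts by resource, then attach the matching CI and its owner."""
    grouped = {}
    for alert in alerts:
        grouped.setdefault(alert.resource_id, []).append(alert)

    incidents = []
    for resource_id, group in grouped.items():
        ci = cmdb.get(resource_id)  # None when the resource is not in the CMDB
        incidents.append(Incident(
            summary="; ".join(a.summary for a in group),
            sources=tuple(sorted({a.source for a in group})),
            ci=ci,
            assigned_team=ci.owner_team if ci else fallback_team,
        ))
    return incidents


# Hypothetical CMDB content and alerts, for illustration only.
cmdb = {"vm-tracking-api-01": ConfigurationItem("CI0001", "tracking-api", "App Enablement")}
alerts = [
    Alert("Dynatrace", "vm-tracking-api-01", "Response time degraded"),
    Alert("Wiz", "vm-tracking-api-01", "Publicly exposed port detected"),
    Alert("CrowdStrike", "vm-unknown-77", "Suspicious process"),
]

for incident in correlate_and_enrich(alerts, cmdb):
    print(incident.assigned_team, incident.sources, incident.ci is not None)
```

Two alerts on the same resource collapse into one incident carrying both sources and the configuration item, while an alert on an unknown resource falls back to a default queue instead of being dropped.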
Cost model and TCO
A reference architecture that cannot be costed is an academic exercise, so the design carries a budget envelope as a first-class artefact. The figures below are planning-grade estimates for a steady-state production estate running across four strategic regions in both Azure and AWS — expressed in USD per month, built bottom-up from the services each landing zone provisions, and meant to size a budget and frame FinOps conversations rather than to stand in for a vendor quote. They assume roughly one hundred applications live across the two clouds, reserved-instance and savings-plan commitments applied to steady-state compute, non-production environments scaled down outside business hours, and security and observability tooling priced at enterprise list less the discount a carrier of this size negotiates. Actual spend will move with traffic, device counts, data volumes, and the commercial terms struck with each provider; the value here is the shape of the spend — where the money concentrates and which levers move it — not a figure to the dollar.
Two characteristics of the model are worth calling out. First, application compute and the SAP landing zone together account for the largest share of the bill, roughly 44 per cent ($340,000 of $770,000 a month), which is exactly where it should sit for a business whose value is in its applications and its SAP core rather than in undifferentiated platform plumbing. Second, the platform, edge, connectivity, and tooling lines are the price of running consistently and securely across two clouds and four data centres: dual circuits in every region, central inspection and posture management, and a single global edge are deliberate recurring costs the resilience and security requirements make non-negotiable rather than discretionary.
| Cost area | Drivers | ~USD/month |
|---|---|---|
| Hybrid connectivity | 8 ExpressRoute + 8 Direct Connect circuits (10 Gbps each) + gateways across 4 regions | $45,000 |
| Azure platform | Firewalls, gateways, Bastion, DNS, Sentinel, Log Analytics, Key Vault HSM, Backup | $60,000 |
| AWS platform | Network Firewall, Transit Gateway, NAT, GuardDuty, Security Hub, Control Tower, CloudWatch | $45,000 |
| Application compute & workloads | 100+ applications across both clouds (web, app, API, batch, container tiers) | $220,000 |
| SAP landing zone | HANA high-memory certified compute, HA across 2 AZs + asynchronous DR region | $120,000 |
| Data & integration platform | Lakehouse, event streaming, analytics and BI compute and storage | $85,000 |
| IoT platform | AWS IoT Core / Azure IoT Hub, streaming, time-series stores | $40,000 |
| Observability & security tooling | Dynatrace, Wiz, CrowdStrike Falcon, Okta licences | $95,000 |
| Backup, DR & storage | Geo-redundant backup, immutable archive, cross-region replication, object storage | $35,000 |
| Third-party edge | Global CDN, authoritative DNS, WAF | $25,000 |
| TOTAL | Steady-state production estate, 4 regions, dual-cloud | ≈ $770,000/month (~$9.2M/year) |
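The arithmetic behind the table can be checked with a few lines of Python. The figures are copied from the rows above; only the dictionary and variable names are introduced for this sketch.

```python
# Monthly planning estimates in USD, copied from the cost table.
MONTHLY_USD = {
    "Hybrid connectivity": 45_000,
    "Azure platform": 60_000,
    "AWS platform": 45_000,
    "Application compute & workloads": 220_000,
    "SAP landing zone": 120_000,
    "Data & integration platform": 85_000,
    "IoT platform": 40_000,
    "Observability & security tooling": 95_000,
    "Backup, DR & storage": 35_000,
    "Third-party edge": 25_000,
}

total = sum(MONTHLY_USD.values())
annual = total * 12
core_share = (
    MONTHLY_USD["Application compute & workloads"] + MONTHLY_USD["SAP landing zone"]
) / total

print(f"Monthly total: ${total:,}")              # Monthly total: $770,000
print(f"Annual run-rate: ${annual:,}")           # Annual run-rate: $9,240,000
print(f"Compute + SAP share: {core_share:.1%}")  # Compute + SAP share: 44.2%
```

The ten line items sum to the stated $770,000 a month, and the annualised figure of $9.24M matches the rounded ~$9.2M in the total row.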
The model is not a fixed cost — it is an envelope the FinOps practice is expected to work against. The largest structural saving is commitment-based pricing: reserved instances and savings plans on the steady-state compute trim that base by roughly thirty per cent, which is why the compute lines assume committed rather than on-demand rates. On top of that sit operational levers the platform makes routine — autoscaling with a resilience floor so the estate never pays for idle headroom, scheduled shutdown of non-production outside business hours, continuous right-sizing driven by Dynatrace capacity signals, and storage lifecycle tiering that moves cold data down the cost curve automatically. Spend is made visible and accountable through tagging-driven showback and chargeback on the cost dashboards, so each team owns its consumption rather than the bill arriving as an undifferentiated lump, and the enterprise tooling agreements carry their own committed-volume discounts. One-time build and migration cost — discovery, landing-zone construction, the migration factory, and SAP cutover — is deliberately not folded into this run-rate; it is governed separately through the delivery roadmap and its waves so steady-state and programme economics never blur.
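Tagging-driven showback reduces to grouping billed cost by an ownership tag. The sketch below assumes a hypothetical record shape and a hypothetical `cost-centre` tag key; neither comes from the design, and real billing exports from either cloud would need mapping into this form first.

```python
from collections import defaultdict

UNALLOCATED = "unallocated"  # bucket for resources missing the ownership tag


def showback(cost_records, tag_key="cost-centre"):
    """Sum cost per tag value; untagged spend is surfaced rather than hidden."""
    totals = defaultdict(float)
    for record in cost_records:
        owner = record.get("tags", {}).get(tag_key) or UNALLOCATED
        totals[owner] += record["usd"]
    return dict(totals)


# Illustrative records only; the amounts are not taken from the cost model.
records = [
    {"resource": "aks-tracking", "usd": 1200.0, "tags": {"cost-centre": "customer-portal"}},
    {"resource": "rds-tracking", "usd": 800.0, "tags": {"cost-centre": "customer-portal"}},
    {"resource": "vm-batch-01", "usd": 500.0, "tags": {"cost-centre": "finance-batch"}},
    {"resource": "disk-orphan", "usd": 50.0, "tags": {}},
]

print(showback(records))
# {'customer-portal': 2000.0, 'finance-batch': 500.0, 'unallocated': 50.0}
```

Keeping an explicit unallocated bucket is a design choice: it makes tagging gaps visible on the same dashboard as the spend they hide.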
Bill of materials
Where the cost model says what the estate spends, the bill of materials says what the estate is — the resource inventory the subscription and account factories actually produce, and the concrete definition of what “done” looks like at the end of the build. It is deliberately expressed by landing zone and domain rather than as a flat resource list, because the value of the platform is that these resources arrive governed, repeatable, and identical across regions through code, not hand-built one ticket at a time. The counts below are the target-state footprint for the four-region dual-cloud design; per-application resources scale with the roughly one hundred workloads the onboarding factory vends, while the platform and shared-service rows are fixed regardless of how many applications land on top of them.
| Landing zone / domain | Key resources (count) |
|---|---|
| Azure platform | 4 regional hub VNets, 4 Azure Firewall Premium, 4 Application Gateway WAF v2, 8 ExpressRoute gateways (+ backup VPN gateways), Microsoft Sentinel, Key Vault HSM, Log Analytics, Recovery Services backup vaults |
| Azure identity & management | Identity, Connectivity, Management, and Security platform subscriptions; Entra Connect; Private DNS Resolver; Azure Monitor / Automation / Update Management |
| Azure workloads | Separate production and non-production subscriptions vended per application (≈100 applications), each hub-peered, policy-governed, and baseline-enrolled |
| AWS platform | 4 regional Transit Gateways, 4 AWS Network Firewall, Control Tower, Log Archive account (immutable S3 + Object Lock), Security Tooling account, GuardDuty + Security Hub org-wide |
| AWS network & shared services | Inspection / ingress / egress VPCs per region, Direct Connect Gateways, Shared Network and Shared Services accounts, IAM Identity Center |
| AWS workloads | Vended application accounts across Prod and Non-Prod OUs (≈100 applications), each TGW-attached, SCP-guarded, and baseline-enrolled |
| SAP landing zone | HANA production across 2 AZs (HSR synchronous) + warm-standby HANA in the paired DR region (asynchronous), clustered ASCS/ERS, dedicated backup and monitoring |
| Data & integration platform | Event Hubs / Kafka / Kinesis backbone, medallion lakehouse (raw / curated / governed + master-data band), data catalog, lineage and classification, analytics and BI |
| IoT platform | AWS IoT Core (fleet telemetry) + Azure IoT Hub (control-tower telemetry), Kinesis / Stream Analytics streaming, Timestream and Azure Data Explorer time-series stores |
| Connectivity | 8 ExpressRoute circuits + 8 Direct Connect circuits (10 Gbps, active-active BGP) across 4 regions, diverse carriers per region |
| Third-party edge | Global CDN, authoritative DNS with global traffic management and failover, WAF with OWASP / bot / API / rate-limit policy |
| Shared SaaS & tooling | Okta, Microsoft Entra ID, Wiz, CrowdStrike Falcon, Dynatrace, ServiceNow, Bitbucket, Microsoft Intune — operating horizontally across both clouds |
Read together, the two tables close the loop between intent and economics. The bill of materials enumerates the building blocks the factories emit on demand; the cost model prices them at steady state and points at the levers that keep the run-rate honest. Both are living artefacts — the inventory grows as applications onboard through the waves, and the spend is reforecast against actuals each cycle — but holding the design to an explicit, planning-grade number from the outset is what turns “a secure multi-cloud foundation” from an aspiration into something the business can budget, govern, and operate with its eyes open.
Operating model and RACI
The six teams, the Cloud Center of Excellence among them, only deliver a stable estate if their responsibilities are unambiguous; the most common cause of a missed RTO is not a technology failure but two teams each assuming the other owned the decision. The operating model therefore pins three things explicitly. Support runs as follow-the-sun, 24×7, tiered L1 → L2 → L3: the ServiceNow service desk (L1) triages and runs standard runbooks, Cloud Operations (L2) owns operational restoration and incident command, and Cloud Platform Engineering or the relevant vendor (L3) handles deep platform and product engineering. Change runs in three lanes: standard (pre-approved, automated), normal (CAB-reviewed within published change windows), and emergency (post-implementation review). Every change, whatever its lane, is raised and recorded against the CMDB in ServiceNow, with manual approval gates before Staging and Production tied to the change record. Patching follows a monthly cadence with critical out-of-band releases, honouring the patch SLA of 7 days for critical and 30 days for high-severity vulnerabilities surfaced by Wiz and the cloud-native scanners.
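The patch SLA can be expressed as a small due-date calculation. The sketch assumes the clock starts on the detection date and that severities outside the stated SLA carry no deadline here; both are assumptions of the example, not rules stated by the operating model.

```python
from datetime import date, timedelta

# SLA windows from the operating model: 7 days critical, 30 days high.
PATCH_SLA_DAYS = {"critical": 7, "high": 30}


def remediation_due(severity: str, detected_on: date):
    """Return the remediation deadline, or None for severities without an SLA."""
    days = PATCH_SLA_DAYS.get(severity.lower())
    return None if days is None else detected_on + timedelta(days=days)


def is_breached(severity: str, detected_on: date, today: date) -> bool:
    """A finding breaches its SLA once today is past the deadline."""
    due = remediation_due(severity, detected_on)
    return due is not None and today > due


detected = date(2025, 1, 1)
print(remediation_due("critical", detected))                # 2025-01-08
print(remediation_due("high", detected))                    # 2025-01-31
print(is_breached("critical", detected, date(2025, 1, 9)))  # True
```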
The RACI below allocates the load-bearing activities across the six teams. The recurring pattern is deliberate: the CCoE is accountable for standards and guardrails but delegates execution, security holds identity and incident authority, the platform team owns the factories and the build path, and Cloud Operations carries day-two run. One accountable (A) owner per row keeps decisions unambiguous; R is shared where genuine joint delivery is required.
| Activity | CCoE | Platform Eng | Security & Identity | Network | Operations | App Enablement |
|---|---|---|---|---|---|---|
| Landing-zone standards & reference architecture | A | R | C | C | I | C |
| Subscription / account vending (factory) | A | R | C | C | I | I |
| Policy / guardrail change (Policy + SCP) | A | R | R | C | I | I |
| Identity & PIM access approval | C | I | A/R | I | I | I |
| Network & firewall change | I | C | C | A/R | I | I |
| Application onboarding to the estate | C | C | C | I | I | A/R |
| Incident response & major-incident command | I | C | A/R | C | R | C |
| DR test execution & game-day | A | R | R | R | R | C |
| Cost management & FinOps | A | R | I | I | R | C |
| Patching & vulnerability remediation | I | R | A | I | R | C |
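The rule of one accountable owner per row is easy to check mechanically. The sketch below transcribes the matrix above and reports any activity that has zero or several accountable teams; the function and variable names are introduced only for this example.

```python
TEAMS = ["CCoE", "Platform Eng", "Security & Identity", "Network", "Operations", "App Enablement"]

# Cells transcribed row by row from the RACI table, in TEAMS order.
RACI = {
    "Landing-zone standards & reference architecture": ["A", "R", "C", "C", "I", "C"],
    "Subscription / account vending (factory)": ["A", "R", "C", "C", "I", "I"],
    "Policy / guardrail change (Policy + SCP)": ["A", "R", "R", "C", "I", "I"],
    "Identity & PIM access approval": ["C", "I", "A/R", "I", "I", "I"],
    "Network & firewall change": ["I", "C", "C", "A/R", "I", "I"],
    "Application onboarding to the estate": ["C", "C", "C", "I", "I", "A/R"],
    "Incident response & major-incident command": ["I", "C", "A/R", "C", "R", "C"],
    "DR test execution & game-day": ["A", "R", "R", "R", "R", "C"],
    "Cost management & FinOps": ["A", "R", "I", "I", "R", "C"],
    "Patching & vulnerability remediation": ["I", "R", "A", "I", "R", "C"],
}


def accountable_owners(raci, teams):
    """Map each activity to the list of teams holding the A role."""
    return {
        activity: [team for team, cell in zip(teams, cells) if "A" in cell.split("/")]
        for activity, cells in raci.items()
    }


def violations(raci, teams):
    """Activities that do not have exactly one accountable team."""
    return {a: owners for a, owners in accountable_owners(raci, teams).items() if len(owners) != 1}


print(violations(RACI, TEAMS))  # {}
print(accountable_owners(RACI, TEAMS)["Patching & vulnerability remediation"])
# ['Security & Identity']
```

Run as a check in the pipeline that publishes the operating model, a test like this would keep later edits to the matrix from quietly introducing a row with two accountable teams or none.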
Migration and onboarding waves
A landing zone is only worth building if the estate can actually be moved into it without stalling, and that movement has to be sequenced rather than improvised. The estate is populated through a migration factory: a repeatable discovery-to-cutover pipeline in which every application is assessed, dispositioned, scheduled into a wave, and admitted only once the wave ahead of it has proven its foundations. The work is organised into six waves over roughly twenty-four months, each with explicit entry and exit gates so that no wave inherits half-finished plumbing from the one before it. The early waves deliberately deliver platform and low-risk workloads first, building operational muscle and confidence before the customer-facing, data, and SAP estates — the genuinely business-critical tier — are touched. The roadmap diagram sequences this progression; the table below pins the scope and gating of each wave.
| Wave | Months | Scope (representative) | Entry criteria | Exit criteria |
|---|---|---|---|---|
| W1 — Foundation and pilots | 0–3 | Landing zones, identity, hybrid connectivity, guardrails, and 2–3 pilot applications | Design signed off | Platform baseline live; DR Tier-0 tested |
| W2 — Shared services and low-risk internal | 3–6 | ~20 shared-services and low-risk internal applications | Wave 1 exit achieved | Operations runbooks live |
| W3 — Customer web and API | 6–10 | ~25 customer web and API workloads; third-party edge cutover | Security gates green | Edge live across both clouds |
| W4 — Data and integration | 9–12 | Lakehouse, streaming backbone, partner feeds | Data governance ready | Golden datasets served |
| W5 — SAP and high-criticality | 12–18 | SAP landing zone and remaining Tier-1 workloads | HANA sizing confirmed | SAP DR rehearsed |
| W6 — Optimisation and operating-model transition | 18–24 | Right-sizing, FinOps, handover to internal teams | Estate stable | Teams self-sufficient |
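The month ranges in the table can be checked for gaps and overlaps between consecutive waves. The sketch below uses only the wave identifiers and month boundaries from the table. It reports that W3 (months 6–10) and W4 (months 9–12) share one month; whether that is intended parallel running is not stated, so the check flags the overlap rather than judging it.

```python
# (wave, start_month, end_month) transcribed from the wave table.
WAVES = [
    ("W1", 0, 3),
    ("W2", 3, 6),
    ("W3", 6, 10),
    ("W4", 9, 12),
    ("W5", 12, 18),
    ("W6", 18, 24),
]


def overlaps(waves):
    """Consecutive waves where the next one starts before the previous ends."""
    found = []
    for (name_a, _, end_a), (name_b, start_b, _) in zip(waves, waves[1:]):
        if start_b < end_a:
            found.append((name_a, name_b, end_a - start_b))
    return found


def gaps(waves):
    """Consecutive waves separated by unscheduled months."""
    found = []
    for (name_a, _, end_a), (name_b, start_b, _) in zip(waves, waves[1:]):
        if start_b > end_a:
            found.append((name_a, name_b, start_b - end_a))
    return found


programme_months = max(end for _, _, end in WAVES) - min(start for _, start, _ in WAVES)

print(programme_months)  # 24
print(overlaps(WAVES))   # [('W3', 'W4', 1)]
print(gaps(WAVES))       # []
```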
Disposition is decided per application at discovery using the standard 6R model — rehost, replatform, refactor, repurchase, retain, or retire — so that each workload earns its place in a wave rather than being lifted wholesale. In practice most internal applications fall to rehost or replatform, taking the path of least friction onto the new platform; the customer-facing and data workloads are where refactoring concentrates, because cloud-native rebuild is what unlocks their scale and resilience targets; and a meaningful tail of legacy systems is retired outright rather than carried forward, shrinking the estate that the operating model ultimately has to run.
Architecture decision records
Reviewers should be able to see the trade-offs behind the design, not just its conclusions. Each load-bearing decision is therefore captured as an architecture decision record that names the alternative that was rejected and the reasoning that settled it — so the design can be interrogated, and so future teams understand which doors were deliberately closed and why. The records below are the ten decisions that shape the estate most; several of them, notably the dual-cloud posture and the split IoT platform, are direct consequences of the business mandate rather than free engineering choices, and the records make that provenance explicit.
| ID | Decision | Alternative considered | Rationale / trade-off |
|---|---|---|---|
| ADR-01 | Dual-cloud landing zone across Azure and AWS | Single-cloud consolidation | Business mandate plus resilience and best-of-breed services; accepts higher operational complexity, mitigated by one common guardrail set across both clouds |
| ADR-02 | Azure hub-and-spoke topology | Azure Virtual WAN | Greater control, lower cost, and team maturity at current scale; Virtual WAN to be revisited as the estate grows |
| ADR-03 | AWS Transit Gateway with a central inspection VPC | VPC peering mesh | Central policy enforcement and clean scaling, avoiding the unmanageable sprawl of a full peering mesh |
| ADR-04 | Okta as federation broker with Microsoft Entra ID as control plane | Entra ID only | Preserves the existing SaaS estate already standardised on Okta and brokers AWS federation cleanly |
| ADR-05 | Third-party global edge for CDN, DNS, and WAF | Per-cloud Azure Front Door plus CloudFront | A single cross-cloud control point with health-checked failover between clouds, rather than two disjoint edges |
| ADR-06 | SAP self-managed on certified IaaS | RISE with SAP | Retains control, leverages existing operations skills, and keeps integration close; RISE deferred, not ruled out |
| ADR-07 | Split IoT — AWS IoT Core for fleet, Azure IoT Hub for control tower | Single unified IoT platform | Best-fit service per workload and consistent with the dual-cloud mandate |
| ADR-08 | Terraform with Ansible as the IaC and configuration standard | Native Bicep plus CloudFormation | Cross-cloud consistency from one toolchain instead of two divergent native stacks |
| ADR-09 | Factory-based vending — AFT plus an Azure subscription factory | Manual provisioning | Governed, fast, and repeatable account and subscription delivery within a business day |
| ADR-10 | Centralised egress inspection per region | Per-VNet and per-VPC firewalls | One policy enforcement point per region, avoiding inconsistent rule sets scattered across every network |
Risks, assumptions, issues and dependencies
A programme of this scope is governed against a live RAID register, reviewed continuously rather than written once and shelved. Risks are tracked with owners and mitigations; assumptions are stated so that, if they break, the impact is visible immediately; open issues are worked to closure; and external dependencies are surfaced early, because several of them — circuit delivery and SAP Basis capacity in particular — sit on the critical path and can move whole waves if they slip. The register below is the working view the programme governs against.
| Type | Item | Impact | Mitigation / owner |
|---|---|---|---|
| Risk | Cross-cloud skills gap across delivery teams | Slower delivery; operational errors | Structured enablement and handover plan; CCoE |
| Risk | Circuit-provider concentration in a region | Correlated connectivity loss | Diverse carriers per region; Network and Connectivity |
| Risk | SAP migration complexity | Tier-1 cutover delay or instability | Early sizing and rehearsed failover; Platform plus SAP Basis |
| Risk | Cost overrun against the budget envelope | Programme funding pressure | FinOps guardrails and budgets; CCoE |
| Risk | IoT ingest spikes beyond plan | Telemetry loss or back-pressure | Autoscale and load test to 50,000 msg/sec; Platform |
| Risk | Data-residency breach | Regulatory and contractual exposure | Lakehouse governance band enforcement; Security and Identity |
| Risk | Tooling lock-in | Reduced future flexibility | Infrastructure-as-code and portability discipline; Platform |
| Assumption | The four on-prem data centres persist as first-class regions | Network and DR design depend on it | Validated with infrastructure owners |
| Assumption | The dual-cloud mandate is fixed | Whole topology assumes it | Confirmed with business sponsors |
| Assumption | Workday remains the authoritative HR source | Joiner-mover-leaver identity flow depends on it | Confirmed with HR and Identity teams |
| Assumption | Circuit lead times are met | Wave 1 foundation timing depends on it | Tracked against procurement milestones |
| Issue (open) | Circuit procurement pending | Blocks connectivity baseline | Expedited ordering; Network and Connectivity |
| Issue (open) | SAP sizing to be confirmed | Gates Wave 5 entry | Sizing exercise in progress; Platform plus SAP Basis |
| Issue (open) | Country data-residency list to finalise | Governance rules incomplete | Legal and compliance review under way; Security and Identity |
| Dependency | Carrier circuit delivery | Foundation and connectivity | External telecom carriers |
| Dependency | SAP Basis team availability | Wave 5 SAP migration | Internal SAP Basis function |
| Dependency | Okta and Workday integration | Identity lifecycle automation | Identity team plus vendors |
| Dependency | Security-tooling licences | Posture, EDR, and SIEM coverage | Procurement and vendors |
Delivery roadmap and acceptance
The work is sequenced as Cloud Adoption Framework workstreams so the engagement has clear governance: Strategy and Plan (portfolio discovery, dependency mapping, country and data-residency constraints, SAP assessment), Ready (landing-zone build, identity foundation, connectivity, guardrails, operations baseline), Adopt (migration in waves), and Govern and Manage (policy, cost, monitoring, backup, DR, incident response). Migration itself runs as the six waves detailed above — each gated and sequenced on the delivery-roadmap timeline — so the Adopt workstream proceeds through explicit checkpoints rather than as an open-ended lift.
The architecture is complete only when it demonstrates the acceptance criteria the business set: a fully defined Azure and AWS landing zone with clean platform/application separation; a zero-trust identity model covering office and home users with SSO, MFA, Conditional Access, and PIM; resilient hybrid connectivity from all four data centres with dual ExpressRoute and dual Direct Connect; multi-layer security spanning a third-party edge, cloud posture, EDR, observability, and policy-as-code; a tested HA and DR design for critical services including SAP; a scalable onboarding model for 100-plus applications; and a governance and operations model the internal teams can run after handover. Twenty-seven diagrams do not make a design correct on their own — but a design that can be drawn this precisely, decision by decision, is one a consulting partner can actually build, and one the logistics business can actually operate.