SAP is the system of record you are least allowed to break. When the ERP is down, the warehouse stops shipping, invoices stop posting, payroll misses its run, and the CFO finds out before you do. That is why SAP migrations to the cloud are the most conservative projects most enterprises ever run — and why “lift SAP to Azure” is never really an infrastructure task. It is a landing-zone, high-availability, and disaster-recovery contract that has to survive a real datacenter failure during month-end close.
This article is a reusable reference architecture for SAP S/4HANA on Azure built the way the SAP-on-Azure community and the Azure Architecture Center actually deploy it: HANA on certified M-series VMs, high availability across Availability Zones, Azure NetApp Files (ANF) for the HANA persistence and shared filesystems, ExpressRoute for private connectivity back to the on-premises SAP landscape and end users, and an application-consistent AzAcSnap + Recovery Services backup strategy. It scales from a single-SID S/4HANA system at a mid-market manufacturer to a multi-SID landscape (S/4, BW/4HANA, Solution Manager, Fiori) at a large enterprise using the same building blocks.
This is certified IaaS SAP, not RISE with SAP (where SAP runs the managed estate for you on hyperscaler infrastructure) and not a greenfield rewrite. The patterns below are what you own and operate yourself, and what an auditor will ask you to prove you have tested.
The business scenario
Picture a manufacturer — mid-market today, growing through acquisition into a genuinely large enterprise — running SAP ECC on aging on-premises hardware in a single corporate datacenter. SAP’s mainstream maintenance for ECC ends in 2027 (extended to 2030 for customers who pay for it), so the board has approved an S/4HANA conversion. The infrastructure team has been told three things at once: modernize to S/4HANA, get out of the owned datacenter, and come back with a credible HA/DR story because the last unplanned ERP outage cost a full day of shipments at a plant.
The SAP landscape that has to land on Azure:
- S/4HANA production (the prize), plus QA, Dev, and a sandbox.
- SAP HANA as the in-memory database under S/4 — the component that drives almost every sizing and HA decision because it holds the entire dataset in RAM.
- The ABAP application servers (the “PAS” primary and several “AAS” additional dialog instances) and the ASCS/ERS central services that hold the enqueue lock table and message server.
- Supporting systems: SAP BW/4HANA for analytics, Solution Manager, SAP Web Dispatcher and Fiori front-end for browser/mobile access, and interfaces to MES, EDI, and a tax engine.
The business has signed off on numbers that drive every decision below:
- RTO (Recovery Time Objective): 30 minutes for a single-zone failure (automatic), and 2 hours for a full region loss (declared DR).
- RPO (Recovery Point Objective): zero for committed HANA transactions within the region (synchronous replication), and ≤ 15 minutes cross-region for DR.
- SAP-certified only. Production HANA must run on a VM/storage combination that appears on the SAP HANA Hardware Directory and Azure’s certified list, because SAP support is otherwise void. This is the single hardest constraint and it eliminates “just use a big general-purpose VM.”
The cost constraint is real but different from a web app: leadership accepts that production HANA is expensive (it is reserved, always-on RAM measured in hundreds of GB to multiple TB), but expects everything around it — non-prod, DR compute, app-tier scale — to be optimized aggressively. That tension is the whole game.
Architecture overview
The architecture is a single-region, zone-redundant SAP landing zone with a cross-region DR tier, deployed inside an enterprise hub-and-spoke (or Virtual WAN) network governed by Azure landing zones. The end-to-end picture, described as a flow:
Connectivity and the request path. End users and on-premises systems reach SAP over ExpressRoute — a private, dedicated circuit from the corporate WAN into Azure, terminated on an ExpressRoute Gateway in the connectivity hub. For resilience the circuit uses ExpressRoute with two peering connections (the circuit is dual-homed across two Microsoft edge routers by design) and, for the highest tier, a second circuit in a different peering location so a metro fiber cut cannot isolate the ERP. SAP GUI traffic, RFC, and HTTPS for Fiori all ride this private path; there is no public ingress to the SAP application tier. Browser and mobile users hit SAP Web Dispatcher (the SAP-aware reverse proxy / load balancer) behind a Standard Internal Load Balancer, which fans out across the ABAP dialog instances.
The SAP tiers. A request flows Web Dispatcher → ABAP application server (PAS/AAS) → HANA database. The application servers are stateless dialog work processes and scale horizontally across zones. The ASCS/ERS central-services pair is the stateful heart of the app tier: ASCS runs the message server and enqueue server (the global lock table), and ERS holds a shadow copy of the lock table so a failover does not lose in-flight locks. ASCS and ERS run on separate VMs in separate Availability Zones, clustered with Pacemaker behind an Azure Load Balancer carrying the virtual IP.
HANA and its storage — the core of the design. Production HANA runs on a certified M-series VM (for example an M-series / Mv3 SKU sized to the dataset) in Availability Zone 1, with a second identical HANA VM in Availability Zone 2. The two are kept in lockstep by HANA System Replication (HSR) in synchronous mode with SYNC and logreplay operation, so a committed transaction in zone 1 is acknowledged only after it is durable in zone 2 — that is what delivers RPO = 0 within the region. A Pacemaker cluster with the SAP HANA resource agents monitors both nodes and performs automatic failover (promoting the zone-2 secondary to primary and re-pointing the virtual IP) when zone 1 is lost. The HANA persistence — /hana/data, /hana/log, and /hana/shared — lives on Azure NetApp Files, an Azure-native NFS service whose throughput and sub-millisecond latency meet the strict HANA storage KPIs (the certified alternative to managed Premium/Ultra disks, and the easier path to hitting the log-write latency requirement).
Fencing — the detail that makes or breaks HA. A clustered HANA or ASCS pair must guarantee that a hung-but-not-dead node cannot keep writing (a “split brain”). On Azure this is solved with SBD (STONITH Block Device) using Azure Shared Disk, or with the Azure Fence Agent (fence_azure_arm) that calls the Azure API to power-off a failed node. The cluster is configured so that fencing succeeds before failover completes — no fencing, no promotion.
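The "no fencing, no promotion" rule is easier to reason about as a decision table. The sketch below is a toy model of that rule only, not Pacemaker's actual state machine; `ClusterView`, `Action`, and `decide_takeover` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"        # primary is healthy, nothing to do
    FENCE = "fence"      # power off the suspect node first
    PROMOTE = "promote"  # fencing confirmed, safe to promote the secondary
    BLOCK = "block"      # fencing not confirmed, refuse to promote


@dataclass(frozen=True)
class ClusterView:
    primary_responding: bool  # does the primary answer cluster heartbeats?
    fence_attempted: bool     # has a STONITH request been issued?
    fence_confirmed: bool     # did the fence device confirm the power-off?


def decide_takeover(view: ClusterView) -> Action:
    """Toy model of 'no fencing, no promotion' (assumption, not Pacemaker code)."""
    if view.primary_responding:
        return Action.NONE
    if not view.fence_attempted:
        return Action.FENCE
    if view.fence_confirmed:
        return Action.PROMOTE
    # The old primary may still be writing, so promotion would risk split brain.
    return Action.BLOCK
```

The important branch is the last one: an unconfirmed fence leaves the cluster blocked rather than promoted, which is the behavior you want to see when you test a failed fence device.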
Backup. HANA is backed up two ways. Storage snapshots via AzAcSnap (the Azure Application-Consistent Snapshot tool) quiesce HANA and take application-consistent ANF snapshots in seconds, with near-zero impact, retained as a fast local restore ladder. In parallel, Azure Backup for SAP HANA streams Backint-based database backups into a Recovery Services vault for long-term, off-volume, policy-governed retention. The VM operating systems are protected by standard Azure VM backup.
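A snapshot "restore ladder" comes down to two numbers: how often you snapshot and how far back you keep them. The sketch below works out the ladder size and which snapshots fall off the end. It assumes a 4-hour cadence and a 24-hour window (the values used in the reference example later in this article); `ladder_size` and `snapshots_to_prune` are hypothetical helpers, not part of AzAcSnap.

```python
from datetime import datetime, timedelta


def ladder_size(cadence: timedelta, window: timedelta) -> int:
    """Number of snapshots needed to cover `window` at one per `cadence`."""
    if cadence <= timedelta(0):
        raise ValueError("cadence must be positive")
    return -(-window // cadence)  # ceiling division on timedeltas


def snapshots_to_prune(taken_at: list[datetime], keep: int) -> list[datetime]:
    """Return the oldest snapshot timestamps beyond the newest `keep`."""
    ordered = sorted(taken_at)
    return ordered[:-keep] if keep > 0 else ordered


# Assumed schedule: one snapshot every 4 hours, 24-hour ladder.
keep = ladder_size(timedelta(hours=4), timedelta(hours=24))

# Eight snapshots exist; only the newest `keep` stay on the ladder.
start = datetime(2025, 1, 1, 0, 0)
existing = [start + timedelta(hours=4 * i) for i in range(8)]
expired = snapshots_to_prune(existing, keep)
```

Anything older than the ladder is the job of the Backint backups in the Recovery Services vault, not the snapshots.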
DR. A second region holds a DR HANA replica fed by a third HSR tier in asynchronous mode (chained from the in-region secondary, or multi-target), giving ≤ 15-minute cross-region RPO without paying the latency tax of synchronous replication over distance. The app tier in DR is a pilot light — minimal or stopped VMs that are scaled/started on declaration — protected by IaC and, where appropriate, Azure Site Recovery for the non-HANA VMs.
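The replication rule (synchronous inside the region, asynchronous across regions) is simple enough to lint before anything is deployed. The following is a minimal sketch under stated assumptions: `ReplicationLink` and `check_links` are hypothetical names, the region strings are placeholders, and the mode strings mirror the HSR replication modes `sync`, `syncmem`, and `async`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationLink:
    source_region: str
    target_region: str
    mode: str  # HSR replication mode: "sync", "syncmem" or "async"


def check_links(links: list[ReplicationLink]) -> list[str]:
    """Return a list of rule violations; an empty list means the topology is sane."""
    problems = []
    for link in links:
        cross_region = link.source_region != link.target_region
        if cross_region and link.mode != "async":
            problems.append(
                f"{link.source_region}->{link.target_region}: "
                f"{link.mode} across regions adds commit latency, use async"
            )
        if not cross_region and link.mode != "sync":
            problems.append(
                f"{link.source_region}->{link.target_region}: "
                f"{link.mode} in-region does not give RPO 0, use sync"
            )
    return problems


# Assumed topology: zone-to-zone HA link plus a cross-region DR link.
topology = [
    ReplicationLink("westeurope", "westeurope", "sync"),
    ReplicationLink("westeurope", "northeurope", "async"),
]
violations = check_links(topology)
```

A check like this belongs in the IaC pipeline, where a wrong mode is a failed build instead of a production latency incident.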
The mental model: Availability Zones are the in-region blast-radius boundary, HSR + Pacemaker turn a zone loss into an automatic failover that fits inside the 30-minute RTO, ANF makes HANA storage fast enough to be certified, ExpressRoute keeps the whole thing private, and AzAcSnap + Backint give you both a fast restore and a compliant archive.
Component breakdown
| Component | Role in the architecture | Key configuration choices |
|---|---|---|
| HANA VMs (certified M-series / Mv3) | In-memory database engine for S/4HANA. The most important and most expensive component. | SKU chosen from the SAP-certified list to match sized RAM (e.g. M128 ≈ 2 TB, scaling to multi-TB Mv3); Write Accelerator on the log disk if using managed disks; pinned with a 3-year Reserved Instance; one VM per zone. |
| HANA System Replication (HSR) | Keeps the secondary HANA in lockstep for HA (and a third for DR). | Sync + logreplay between zones (RPO 0); async to the DR region; preload tables on secondary for fast takeover; multi-target replication so HA and DR replicas are fed simultaneously. |
| Pacemaker cluster (SLES/RHEL HA) | Detects node/zone failure and automates HANA and ASCS/ERS failover. | SAPHanaSR / SAPHanaSR-angi (or RHEL SAPHana) resource agents; SBD via Azure Shared Disk or fence_azure_arm for STONITH; AUTOMATED_REGISTER=true to re-register the old primary. |
| Azure NetApp Files (ANF) | NFS storage for /hana/data, /hana/log, /hana/shared, plus sapmnt. | Ultra service level for log to meet latency KPIs; volumes pinned to the same zone as the HANA VM via application volume groups; cross-zone / cross-region replication for the persistence; snapshots as restore points. |
| ASCS / ERS | SAP central services: message server + enqueue (lock table) and its replicated shadow. | Two VMs across two zones; Pacemaker-clustered with a load-balancer VIP; ENSA2 (standalone enqueue server 2) so ERS can run on any node; sapmnt on ANF (NFS). |
| ABAP application servers (PAS/AAS) | Stateless dialog/batch work processes that run the business logic. | Horizontally scaled across zones; no clustering needed (stateless); behind SAP Web Dispatcher; logon/RFC load balancing via SAP message server groups. |
| SAP Web Dispatcher | SAP-aware reverse proxy / HTTP(S) load balancer and the Fiori entry point. | Redundant pair behind a Standard Internal Load Balancer; TLS termination; routes to the ABAP web instances; the only thing that needs to be reachable by browsers. |
| ExpressRoute + Gateway | Private, dedicated connectivity to on-prem and corporate WAN. | ErGw3AZ zone-redundant gateway; dual peerings (and ideally a second circuit in another peering location); FastPath for low-latency SAP GUI/RFC; routed through the connectivity hub. |
| AzAcSnap + Azure Backup (Backint) | Application-consistent snapshots and policy-governed database backups. | AzAcSnap for fast ANF snapshot restore ladder; Azure Backup for SAP HANA (Backint) into an immutable Recovery Services vault; tested restore + point-in-time recovery. |
| Azure landing zone (hub/spoke or vWAN) | The governed network, identity, and policy substrate the SAP estate lands in. | Dedicated SAP spoke(s); NSGs/ASGs segmenting DB/app/SCS tiers; Azure Policy for SKU/region guardrails; proximity placement group to co-locate app + DB for latency. |
A few of these deserve more than a table row.
Why ANF and not just disks. HANA imposes hard storage KPIs (notably log-write latency under ~1 ms and specific throughput-per-TB targets). You can meet them with Premium SSD v2 / Ultra Disk + Write Accelerator, and many landscapes do. ANF wins when you want NFS-based deployment, simpler storage-snapshot backups, scale-out HANA shared storage, and built-in cross-zone/cross-region replication of the persistence layer — at the cost of a separate service to size and a capacity-pool minimum. This architecture uses ANF because the snapshot/replication story tightens both the backup and DR contracts.
Why ASCS/ERS is clustered but app servers are not. The enqueue server holds the global lock table that serializes business transactions; lose it without ERS and you lose in-flight locks and risk inconsistency, so it must fail over cleanly under Pacemaker. Dialog application servers, by contrast, are stateless — if one dies, users reconnect and the message server load-balances them elsewhere. Clustering them would add complexity for no benefit. Recognizing which SAP component is stateful is the key to a clean HA design.
Proximity placement groups (PPGs). SAP is latency-sensitive between the app tier and HANA. A PPG asks Azure to place the app VMs and the HANA VM physically close. The tension: a PPG pulls instances together, while Availability Zones deliberately spread them apart. The standard resolution is to pin the SID to specific zones and use PPGs within the constraints of the chosen SKUs and zones, validating that the certified HANA SKU is actually available in the target zone before committing.
Implementation guidance
Sizing first, always. Nothing is provisioned until SAP sizing is done. For a greenfield S/4 you run the SAP Quick Sizer; for an ECC→S/4 conversion you run report /SDF/HDB_SIZING against the source to get the required HANA memory. That number selects the certified M-series SKU and the ANF capacity pool. Getting this wrong is the most common and most expensive mistake — undersize and you cannot import the database; oversize and you reserve a fortune in idle RAM.
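Once the sizing report gives you a memory figure, SKU selection is "smallest certified size that covers the projected requirement." The sketch below shows that rule. The catalog is an illustrative assumption (three M-series sizes with their nominal memory in GiB), and `pick_sku` is a hypothetical helper; the authoritative list is the SAP HANA Hardware Directory, and availability still has to be checked per region and zone.

```python
# Illustrative catalog only: SKU name -> memory in GiB.
# Verify every entry against the certified list before relying on it.
CATALOG_GIB = {
    "M64s": 1024,
    "M128s": 2048,
    "M128ms": 3892,
}


def pick_sku(projected_gib: float, catalog: dict[str, int] = CATALOG_GIB) -> str:
    """Return the smallest SKU whose memory covers the projected requirement."""
    candidates = [(mem, name) for name, mem in catalog.items() if mem >= projected_gib]
    if not candidates:
        raise ValueError("no SKU in the catalog is large enough for this dataset")
    return min(candidates)[1]


# Assumed input: sizing says 1600 GiB today, growth projection says 2000 GiB.
chosen = pick_sku(2000)
```

Sizing against the projection rather than today's figure is the point: the function is given 2000, not 1600, so the choice already includes the growth runway.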
Reference VM landscape (illustrative production single-SID S/4HANA):
| Tier | VM role | Example SKU | Zone(s) | Notes |
|---|---|---|---|---|
| DB | HANA primary | M-series sized to RAM (e.g. ~2 TB) | Zone 1 | Reserved 3-yr; ANF persistence |
| DB | HANA secondary (HA) | identical to primary | Zone 2 | HSR sync; Pacemaker |
| DB | HANA DR replica | identical | DR region | HSR async; pilot-light app tier |
| SCS | ASCS / ERS | E-series (small) | Zones 1 & 2 | Pacemaker + LB VIP, ENSA2 |
| App | PAS | E-series | Zone 1 | stateless |
| App | AAS (xN) | E-series | Zones 1 & 2 | scale-out dialog/batch |
| Web | Web Dispatcher (pair) | D-series (small) | Zones 1 & 2 | behind internal LB |
Infrastructure as Code. Build the landing zone and SAP infra declaratively. The community-standard path is SAP Deployment Automation Framework (SDAF) — Microsoft’s open-source Terraform + Ansible framework purpose-built for SAP on Azure: Terraform stands up the network, VMs, ANF, and load balancers; Ansible installs SUSE/RHEL clustering, the SAP software (via SWPM/SWConfigure), and configures HSR and Pacemaker. If you prefer Bicep, the Azure Verified Modules include SAP building blocks, but you will still drive OS/cluster/SAP configuration with Ansible. Either way: the cluster and fencing config is the part you must template and test, because it is the part humans get wrong under pressure.
Networking and identity wiring:
- Place SAP in a dedicated spoke peered to the connectivity hub; HANA, app, and SCS get separate subnets with NSGs and Application Security Groups so only the ports SAP needs cross tiers (e.g. 3200–3299 for SAP GUI, 3300–3399 for RFC, 3NN13/3NN15 for HANA SQL, 5NN13/5NN14 for sapstartsrv, and 1128/1129 for SAP Host Agent, where NN is the instance number).
- ANF lives in its own delegated subnet in the spoke; ensure the HANA VMs and ANF volumes are in the same zone (use ANF application volume groups for HANA to enforce placement and the right volume layout).
- ExpressRoute Gateway is zone-redundant (ErGw3AZ) in the hub; enable FastPath to bypass the gateway data path for the lowest SAP GUI/RFC latency.
- Identity: join the SAP OS hosts to the corporate directory as policy requires, but front administrative access with Microsoft Entra ID + Azure Bastion + Just-In-Time VM access; store the HANA system / `<sid>adm` and DBACOCKPIT secrets and SBD/cluster credentials in Azure Key Vault, retrieved by VMs via managed identity. Single sign-on for Fiori uses Entra ID + SAML/OAuth through the Web Dispatcher.
- No public IPs anywhere on the SAP tiers; all management is private.
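Most SAP ports are derived from the two-digit instance number, so the NSG rule set for a tier can be generated rather than typed. The sketch below uses the conventional default port patterns (32NN dispatcher, 33NN gateway, 3NN13/3NN15 HANA SQL, 5NN13/5NN14 sapstartsrv, 1128/1129 SAP Host Agent). Treat it as an assumption to verify against your own installation; `sap_ports` is a hypothetical helper, and each component (ASCS, app servers, HANA) has its own instance number.

```python
def sap_ports(instance_nr: int) -> dict[str, list[int]]:
    """Derive the conventional default ports for one SAP instance number."""
    if not 0 <= instance_nr <= 99:
        raise ValueError("SAP instance numbers are two digits: 00-99")
    nn = f"{instance_nr:02d}"
    return {
        "sapgui_dispatcher": [int(f"32{nn}")],            # SAP GUI
        "rfc_gateway": [int(f"33{nn}")],                  # RFC
        "hana_sql": [int(f"3{nn}13"), int(f"3{nn}15")],   # system DB, first tenant
        "sapstartsrv": [int(f"5{nn}13"), int(f"5{nn}14")],  # HTTP, HTTPS
        "host_agent": [1128, 1129],                       # fixed, not per instance
    }


# Assumed example: HANA installed with instance number 03.
hana_rules = sap_ports(3)["hana_sql"]
```

Feeding the output into your Terraform or Bicep variables keeps the NSG rules tied to the instance numbers actually in use, instead of opening the whole 3200–3399 range.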
Cluster bring-up order (the part Ansible automates and you rehearse): install the HA pattern OS (certified SLES for SAP or RHEL for SAP), configure corosync/Pacemaker, set up STONITH (SBD on Azure Shared Disk or fence_azure_arm), install HANA on both nodes, establish HSR, then layer the SAPHanaSR(-angi) resource agents and the virtual IP via Azure Load Balancer (with floating IP / Direct Server Return and an HA-ports or per-port rule, plus a health probe on the cluster). Validate by killing the primary and watching automatic takeover before you ever put data on it.
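The bring-up sequence is strict enough that it is worth encoding as an ordered checklist that refuses out-of-order completion. This is a minimal sketch, not how SDAF or Ansible track state; the step identifiers and `next_step` are hypothetical names that mirror the order described above.

```python
from typing import Optional

# Ordered steps, mirroring the bring-up sequence in the text.
BRING_UP_ORDER = [
    "os_ha_pattern",
    "corosync_pacemaker",
    "stonith",
    "hana_install_both_nodes",
    "hsr",
    "resource_agents_and_vip",
    "failover_test",
]


def next_step(done: set[str]) -> Optional[str]:
    """Return the next step to run, or None when the bring-up is complete.

    Raises ValueError if a later step is marked done before an earlier one.
    """
    pending = [step for step in BRING_UP_ORDER if step not in done]
    if not pending:
        return None
    first = pending[0]
    later = BRING_UP_ORDER[BRING_UP_ORDER.index(first) + 1:]
    skipped_ahead = [step for step in later if step in done]
    if skipped_ahead:
        raise ValueError(f"'{skipped_ahead[0]}' was completed before '{first}'")
    return first
```

For example, a state in which `hsr` is marked done but `stonith` is not raises an error, which is exactly the "we'll add STONITH later" situation the checklist exists to prevent.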
Enterprise considerations
Security and Zero Trust. SAP holds your most sensitive financial and personal data, so the perimeter is identity, not network alone. Enforce private-only access (ExpressRoute + private endpoints, zero public ingress to app/DB), micro-segment the tiers with NSGs/ASGs, and require JIT + Bastion + Entra Conditional Access for any human touching a host. Encrypt everything: HANA native data-and-log encryption on top of ANF encryption at rest, SNC for SAP GUI and RFC, HTTPS (TLS) for Fiori, secrets in Key Vault. Turn on Microsoft Defender for Cloud and the SAP threat-monitoring integration (Defender for SAP / Microsoft Sentinel for SAP) to watch for suspicious transactions (SE16 exports, debug privilege grants, mass user changes). These are application-layer threats that infrastructure controls cannot see. Keep an immutable, isolated backup copy (vault soft-delete + immutability) so ransomware cannot destroy your recovery path.
Cost optimization. Production HANA RAM is the dominant line item, so optimize it deliberately rather than avoid it: 3-year Reserved Instances (or a savings plan) on the always-on HANA and SCS VMs typically cut 40–60% versus pay-as-you-go. Then attack everything else: auto-shutdown non-prod (Dev/QA/sandbox) on a schedule — an SAP sandbox does not need to run nights and weekends — and down-size non-prod HANA (it does not need production RAM). Run the DR app tier as a pilot light (stopped/minimal, IaC to scale on declaration) so you pay for DR HANA replication and storage but not idle DR dialog instances. Right-size ANF capacity pools to actual throughput need (over-provisioning capacity to buy throughput is a known ANF cost trap — use manual QoS to decouple them). Track it all with Cost Management budgets tagged per SID/environment.
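The ANF cost trap is easiest to see with numbers. With auto QoS, a volume's throughput follows its quota, so a small, hot volume such as /hana/log has to be over-sized to get its bandwidth. With manual QoS, throughput is a pool-level budget you hand out independently of volume size. The sketch below compares the two under stated assumptions: per-TiB rates of 16/64/128 MiB/s for Standard/Premium/Ultra (check the current ANF documentation), hypothetical volume requirements, and no modelling of pool minimums or provisioning increments.

```python
# Assumed throughput per provisioned TiB, in MiB/s, by service level.
RATE_MIBPS_PER_TIB = {"Standard": 16, "Premium": 64, "Ultra": 128}

# Hypothetical requirements: volume -> (capacity TiB, throughput MiB/s).
VOLUMES = {
    "hana_data": (2.0, 400),
    "hana_log": (0.5, 250),
    "hana_shared": (1.0, 64),
}


def pool_tib_auto_qos(volumes: dict[str, tuple[float, float]], level: str) -> float:
    """Auto QoS: each volume's quota must be big enough to earn its throughput."""
    rate = RATE_MIBPS_PER_TIB[level]
    return sum(max(cap, tput / rate) for cap, tput in volumes.values())


def pool_tib_manual_qos(volumes: dict[str, tuple[float, float]], level: str) -> float:
    """Manual QoS: the pool must cover total capacity and total throughput."""
    rate = RATE_MIBPS_PER_TIB[level]
    total_cap = sum(cap for cap, _ in volumes.values())
    total_tput = sum(tput for _, tput in volumes.values())
    return max(total_cap, total_tput / rate)


auto_tib = pool_tib_auto_qos(VOLUMES, "Ultra")
manual_tib = pool_tib_manual_qos(VOLUMES, "Ultra")
```

Because the manual-QoS pool is sized on the totals rather than per volume, it can never come out larger than the auto-QoS figure for the same requirements. How much it saves depends on how unevenly throughput is spread across your volumes.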
Scalability. The app tier scales horizontally — add AAS dialog instances across zones as user/batch load grows; the message server load-balances automatically. HANA scales up first (a bigger certified SKU, up to multi-TB Mv3), and only for very large datasets (BW-style) does it scale out across multiple nodes with ANF shared storage. Plan SKU headroom: an S/4 dataset grows, and bumping the HANA SKU is a resize with downtime, so size with 2–3 years of growth and a clear runway to the next certified tier.
Reliability and DR (the RTO/RPO contract). In-region, a zone failure is handled automatically: Pacemaker fences the dead node and promotes the zone-2 HANA secondary; with preload tables and logreplay the takeover is typically minutes — comfortably inside the 30-minute RTO with RPO 0. A region loss is a declared DR event: promote the async DR HANA replica (RPO ≤ 15 min), scale up the pilot-light app tier from IaC, re-point Web Dispatcher / DNS, and reconnect interfaces — targeted at the 2-hour RTO. The contract is only real if it is tested: schedule DR drills (at least annually, ideally semi-annually), run AzAcSnap restore tests to prove a HANA snapshot actually recovers, and rehearse the cluster failover in QA after every patch. Untested DR is a slide, not a capability.
Observability. Install Azure Monitor for SAP solutions (AMS) — the first-party monitor that pulls HANA, OS, Pacemaker cluster, and load-balancer metrics into a single Azure-native workspace, with alerts on HSR replication lag, cluster resource state, HANA memory/CPU, and ANF latency. Pair it with SAP Solution Manager for the application/business-process view (job monitoring, transaction response times) and ship logs to Log Analytics. The signals that page someone: HSR out of sync, cluster in a failed/maintenance state, HANA out-of-memory or savepoint stalls, ANF log latency breaching KPI, and ExpressRoute peering down.
Governance. This estate lives inside Azure landing zones. Use Azure Policy to enforce region, certified-SKU, tagging, and “no public IP on SAP subnets” guardrails; Management Groups to separate prod from non-prod; RBAC + PIM for least-privilege operator access; and a change-control gate around the production cluster (SAP teams are rightly conservative — production HANA patching follows SAP/OS support matrices and maintenance windows, not a continuous-deployment cadence).
Reference enterprise example
Helvar Components AG is a (fictional) European manufacturer of industrial sensors: ~4,500 employees, eight plants, €1.1B revenue. They run SAP ECC on end-of-life on-prem hardware and have approved an S/4HANA conversion with a parallel exit from their owned datacenter. Their last ERP outage — a SAN failure — froze shipping at two plants for nine hours and triggered the project.
Sizing. Running /SDF/HDB_SIZING against ECC returns a required HANA memory of ~1.6 TB for S/4 production, with 2-year growth projected to ~2 TB. They select a certified M-series SKU at ~2 TB for production HANA, confirm it is available in Zones 1 and 2 of their chosen region (West Europe), and reserve it for 3 years.
The build. Using SDAF (Terraform + Ansible) they stand up:
- HANA primary in Zone 1, HA secondary in Zone 2, HSR sync + logreplay between them, Pacemaker with `fence_azure_arm` STONITH, `AUTOMATED_REGISTER=true`.
- ANF capacity pool (Ultra service level for `/hana/log`), volumes placed via application volume groups in the matching zones, cross-region replication of the persistence enabled to North Europe.
- ASCS/ERS across both zones (ENSA2, Pacemaker, LB VIP), PAS in Zone 1, three AAS split across zones, a Web Dispatcher pair behind an internal load balancer.
- ExpressRoute ErGw3AZ in the hub with two peerings and FastPath; SAP in a dedicated spoke with tier-segmented NSGs; no public IPs.
- AzAcSnap snapshots every 4 hours (24h ladder) plus Azure Backup Backint to an immutable vault with 35-day operational + 7-year compliance retention.
- A North Europe DR tier: an async HSR HANA replica and a pilot-light app tier defined entirely in Terraform.
Non-prod economics. Dev, QA, and sandbox run smaller HANA SKUs and auto-shut-down nights/weekends via Automation, cutting their compute bill by roughly 65%. Production HANA + SCS on 3-year reservations save ~52% versus list.
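The scheduling saving is plain arithmetic on hours per week. The sketch below assumes a schedule of 12 hours a day on weekdays only; that schedule is an illustrative assumption, since the example states only the resulting saving. `compute_saving` is a hypothetical helper, and it covers VM compute only: disks and ANF capacity keep billing while the VMs are deallocated.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def compute_saving(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of pay-as-you-go compute cost avoided by a start/stop schedule."""
    running = hours_per_day * days_per_week
    if not 0 <= running <= HOURS_PER_WEEK:
        raise ValueError("schedule exceeds the hours in a week")
    return 1 - running / HOURS_PER_WEEK


# Assumed schedule: 12 hours on each of 5 weekdays = 60 of 168 hours.
saving = compute_saving(12, 5)
print(f"compute saving: {saving:.1%}")  # prints "compute saving: 64.3%"
```

A weekday-only, 12-hour schedule lands close to the roughly 65% quoted above. A longer working day or weekend batch windows reduce it quickly.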
The test that mattered. Six weeks before go-live, during the mock cutover, they killed the Zone-1 HANA VM at 14:00 on a business day in QA-prod-clone. Pacemaker fenced the node, promoted the Zone-2 secondary, and re-pointed the VIP; with preloaded tables the database was serving queries again in under 4 minutes, RPO 0. They then ran a full DR exercise to North Europe: promoted the async replica (RPO measured at 6 minutes), scaled the pilot-light app tier from IaC, and had Fiori + SAP GUI transacting in 94 minutes — inside the 2-hour target. The auditors got their evidence pack.
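An evidence pack is more convincing when the drill results are checked against the contract mechanically. The sketch below encodes the RTO/RPO targets from the business scenario and evaluates the measurements from this drill; `Objective`, `CONTRACT`, and `meets_contract` are hypothetical names, and "under 4 minutes" is entered as an upper bound of 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    rto_min: float  # maximum tolerated recovery time, in minutes
    rpo_min: float  # maximum tolerated data loss, in minutes


# Targets from the business scenario.
CONTRACT = {
    "zone_failure": Objective(rto_min=30, rpo_min=0),
    "region_loss": Objective(rto_min=120, rpo_min=15),
}


def meets_contract(scenario: str, measured_rto_min: float, measured_rpo_min: float) -> bool:
    """True if a drill's measured recovery time and data loss are within target."""
    target = CONTRACT[scenario]
    return measured_rto_min <= target.rto_min and measured_rpo_min <= target.rpo_min


# Measurements from the drill described above.
zone_ok = meets_contract("zone_failure", measured_rto_min=4, measured_rpo_min=0)
region_ok = meets_contract("region_loss", measured_rto_min=94, measured_rpo_min=6)
```

Both checks pass for this drill. Running the same function against every subsequent exercise turns the contract into a regression test rather than a one-off claim.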
Outcome. Go-live was uneventful. Three months later a real platform incident degraded one Availability Zone; the HANA cluster failed over automatically and shipping never stopped — the plant operators did not notice. Total Azure run-rate landed about 18% under the original budget, almost entirely from non-prod scheduling and reservations, while production HA/DR posture improved dramatically over the old single-datacenter SAN.
When to use it
Use this architecture when you run (or are converting to) SAP S/4HANA and need certified, self-managed IaaS on Azure with a real, tested HA/DR contract — zone-redundant HANA, near-zero in-region RPO, and a documented cross-region DR. It fits any enterprise from a single production SID up to a multi-SID landscape, and it is the right shape when you must own the OS, the database, and the SAP basis layer for compliance, customization, or licensing reasons.
Trade-offs and anti-patterns:
- The biggest alternative is RISE with SAP. If you would rather not operate the HANA OS, clustering, and patching yourself, RISE (SAP-managed on hyperscaler infra, including Azure) shifts that burden to SAP. Choose this self-managed architecture when you need maximum control and have the SAP Basis capability; choose RISE when you want a managed ERP and are comfortable with SAP owning the stack. Do not half-do both.
- Anti-pattern: uncertified HANA infrastructure. Running production HANA on a non-certified VM SKU or storage layout to save money voids SAP support. Never do it for production. (Non-prod has more flexibility but should still respect the support matrix.)
- Anti-pattern: skipping fencing (“we’ll add STONITH later”). A Pacemaker cluster without working fencing will eventually split-brain and corrupt data. Fencing is not optional and must be validated before any data lands.
- Anti-pattern: synchronous replication across regions. HSR `SYNC` over hundreds of kilometers will cripple transaction latency. Use sync within the region (zones) and async to the DR region, never the reverse.
- Anti-pattern: untested DR. A DR region that has never been failed over to is a liability, not an asset. If you cannot run the drill, you do not have DR.
- Anti-pattern: over-provisioning ANF capacity to buy throughput. Use manual QoS capacity pools so you size capacity and throughput independently, or the storage bill balloons.
Lighter-weight alternatives exist for smaller footprints: a single-zone HANA with backups only (no HSR) is cheaper and may suit a small, downtime-tolerant subsidiary — but it cannot meet a 30-minute RTO or RPO 0, so it is a conscious risk decision, not a default. And Azure Center for SAP solutions (ACSS) can deploy and manage much of this scaffolding for you if you want a more guided, first-party experience than raw SDAF. Whatever you choose, let the sizing number and the RTO/RPO contract drive the design — for SAP, those two facts decide everything else.