A global automotive supplier — the kind that ships sequenced brake assemblies into three OEM plants on a just-in-sequence contract — runs its entire order-to-cash, plant maintenance, and finance close on a single SAP S/4HANA system. The CFO’s framing is blunt: every hour S/4HANA is down, two assembly lines on the customer side stop, and the line-down penalties are written into the supply contract at five figures per hour, per plant. The board has just approved a lift-and-shift off an aging on-premises HANA appliance to Azure, and the mandate that comes with it is uncomfortably specific: business-hours unplanned downtime must stay under roughly 15 minutes a year for the database tier, a regional disaster must be survivable with under 15 minutes of lost data, and every backup must be immutable and encrypted with keys the customer controls — because their largest OEM customer’s security questionnaire demands customer-managed encryption and a tested DR runbook before they will renew. This article is the reference architecture for landing that workload on Azure properly: a clustered, replicated, backed-up S/4HANA that an SAP basis team, a CISO, and an auditor will each sign.
The pressures here are different from a stateless web app and worth naming, because they dictate every design choice. HANA is a single, large, stateful in-memory database — you cannot just run three copies behind a load balancer. The RTO and RPO are contractual, not aspirational. The certified hardware list is narrow — SAP only supports specific Azure VM SKUs for production HANA, so “just pick a bigger VM” is not on the table. And the data is the company’s financial system of record, which means encryption, immutability, and retention are audit findings waiting to happen if you get them wrong. The architecture below is built to satisfy all four at once.
Why the obvious approaches fail
Three shortcuts get proposed on every SAP-to-Azure project, and each fails in a way worth understanding before someone burns a quarter on it.
“Just take a VM snapshot every night.” HANA holds its dataset in memory and persists to disk asynchronously; a crash-consistent disk snapshot of a running HANA is not guaranteed to be a recoverable database. Worse, a nightly snapshot means up to 24 hours of lost financial postings on a recovery — orders of magnitude past the 15-minute RPO. Snapshots are part of the answer, but only when they are HANA-aware and storage-coordinated.
“Run it Active/Active across two VMs.” HANA’s primary is a single writer. You cannot have two nodes both accepting writes to the same database; that is not how an ACID in-memory column store works. High availability for HANA means a hot standby replica kept in sync, with automated failover — not load-balanced active/active.
“Lift the on-prem DR plan as-is.” The on-prem plan was a tape restore with a 24-hour RTO that the business quietly tolerated. On a contract with line-down penalties, that tolerance is gone. DR on Azure has to be a warm, replicated, tested capability with a documented runbook — the OEM’s questionnaire explicitly asks when it was last exercised.
The architecture threads all three: synchronous in-region replication for HA, asynchronous cross-region replication for DR, and HANA-aware immutable backups for the long-tail recovery and audit needs — each layer doing exactly one job.
Architecture overview
The system spans two Azure regions and is best understood as three independent protection layers stacked on the same landscape: an HA layer inside the primary region for routine failures, a DR layer to a paired region for a regional loss, and a backup layer for point-in-time recovery, ransomware, and audit retention. The classic SAP three-tier split still applies — database, central services, and application servers — but it is the database tier where the hard engineering lives.
The defining property of the topology is the one the basis team cares about most: the HANA primary and its synchronous secondary sit in different Availability Zones of the same region, with the primary shipping its redo log to the secondary as part of every commit, fronted by a virtual IP that follows the active node automatically. No transaction is acknowledged to the application until the secondary has received its log entry, which is what drives the RPO for in-region failures to effectively zero, well inside the 15-minute mandate.
The landscape, tier by tier:
- HANA database tier (the crown jewels). Two Azure M-series VMs, one per Availability Zone. These are the SKUs SAP certifies for production HANA, chosen so the VM’s memory matches the HANA in-memory footprint with headroom. They run HANA System Replication (HSR) in synchronous mode (SYNC with logreplay), so the secondary continuously replays the primary’s redo log and stays warm and ready to take over. A Pacemaker cluster (SUSE or RHEL HA extension) monitors both nodes with the SAP HANA cluster resource agents and a STONITH fencing device (an Azure fence agent that can power off a hung node via the Azure API), so a failed primary is fenced and the secondary is promoted automatically, with no human in the loop at 3 a.m.
- Storage for HANA. The /hana/data, /hana/log, and /hana/shared volumes live on Azure NetApp Files (ANF), an NFS service backed by NetApp hardware. It is chosen for the low, predictable latency HANA’s log writes demand and for its application-consistent snapshot capability via AzAcSnap (the Azure Application Consistent Snapshot tool), which quiesces HANA, takes a storage snapshot in seconds, and releases it. Those snapshots are near-instant and form the fast-recovery tier.
- SAP central services (ASCS/ERS). The single point of failure in any SAP system is the ASCS instance (it owns the enqueue lock table and message server). It runs in its own Pacemaker cluster with the Enqueue Replication Server (ERS) on a second node, so the lock table survives an ASCS failover with no in-flight transactions lost. A virtual IP on an Azure Standard Load Balancer (with HA-ports and floating IP) follows the active ASCS.
- Application server tier (SAP NetWeaver / ABAP). Multiple stateless PAS/AAS application server VMs spread across both zones, registered with SAP logon groups so users and RFC traffic are balanced and a lost app server just sheds its sessions. This tier scales horizontally; it is the only tier that does.
- DR region. A third HANA node in the paired Azure region receives asynchronous HSR (multi-target replication: the primary feeds both the in-region synchronous secondary and the cross-region asynchronous one). Azure Site Recovery (ASR) replicates the application-tier and ASCS VMs to the DR region for orchestrated, runbook-driven failover, while the database tier fails over via HSR (faster, and with less data loss, than protecting a stateful database through ASR).
- Backup. Azure Backup for SAP HANA runs the native Backint integration: Azure Backup registers as HANA’s backup agent, so HANA itself streams consistent full, differential, and 15-minute log backups into an immutable Recovery Services vault. This is the layer that gives true point-in-time recovery and ransomware protection.
The edge and identity wrap around all of it: Akamai terminates TLS and provides WAF/anycast for the Fiori launchpad and any internet-facing SAP Web Dispatcher, and Microsoft Entra ID — federated upstream from Okta as the corporate workforce IdP — provides SSO into Fiori and SAML/OIDC into the SAP landscape so basis admins and business users authenticate once with corporate credentials and conditional access.
Component breakdown
| Layer | Service / tool | Role in the architecture | Key configuration choices |
|---|---|---|---|
| Compute (DB) | Azure M-series VMs | Certified host for production HANA primary + secondary | Memory sized to HANA footprint; one VM per Availability Zone; accelerated networking |
| In-region HA | HANA System Replication (sync) | Hot standby replica, zero-data-loss in region | SYNC mode + logreplay; pre-loaded column tables on secondary |
| Cluster / fencing | Pacemaker + STONITH | Automated promote on primary failure; split-brain prevention | SAPHanaSR resource agents; Azure fence agent for power-off fencing |
| HANA storage | Azure NetApp Files | Low-latency NFS for data/log/shared volumes | Ultra/Premium service level; AzAcSnap app-consistent snapshots |
| Central services | ASCS/ERS on Pacemaker | Protect the enqueue lock table (the SAP SPOF) | ERS on second node; Standard LB HA-ports + floating IP |
| App tier | SAP NetWeaver app servers | Stateless transaction processing, horizontally scaled | SAP logon groups; spread across both zones |
| Cross-region DR (DB) | HANA System Replication (async) | Warm replica in paired region, multi-target from primary | ASYNC tier; chained/multi-target replication topology |
| Cross-region DR (app) | Azure Site Recovery | Orchestrated failover of app + ASCS VMs | Recovery plan with boot order; recurring DR test failovers |
| Backup | Azure Backup for HANA (Backint) | Immutable full/differential/log backups, PITR | 15-min log backups; immutable vault; soft delete on |
| Edge | Akamai | TLS, WAF, anycast for Fiori / Web Dispatcher | Bot mitigation; origin shield to SAP Web Dispatcher |
| Identity / SSO | Okta + Microsoft Entra ID | Workforce SSO federated to Entra; SAML/OIDC into SAP | OIDC federation Okta→Entra; conditional access; SAML to Fiori |
| Secrets / keys | HashiCorp Vault + Key Vault | App/DB credentials, key custody for CMK | Dynamic DB creds; HSM-backed key for backup/disk encryption |
| CSPM / posture | Wiz + Wiz Code | Cloud posture, sensitive-data exposure, IaC scanning | Agentless scan of VMs/ANF/vault; Wiz Code on Terraform PRs |
| Runtime security | CrowdStrike Falcon | Endpoint/runtime protection on every SAP VM | Sensor on HANA + app + ASCS hosts; detections to SOC |
| Observability | Dynatrace / Datadog | OS, VM, HANA, and SAP-layer telemetry + alerting | OneAgent/agent on all hosts; HSR-lag and cluster-state alerts |
| ITSM / change | ServiceNow | Change approvals, DR-test records, incident tickets | Change gate on cluster/IaC changes; auto-ticket on failover |
| CI / IaC | Jenkins / GitHub Actions + Terraform/Ansible | Build the landscape as code; configure hosts repeatably | OIDC to Azure; Ansible for OS/HANA/cluster config |
A few of these choices are the ones teams get wrong, so they deserve the why.
Why synchronous replication in-region but asynchronous across regions. Synchronous HSR waits for the secondary to acknowledge every log write before the commit returns. Inside a region, Availability Zones are close enough (a round trip on the order of a millisecond or two) that the added commit latency is small for most workloads and buys you zero data loss. Across regions the network round trip is tens of milliseconds; synchronous replication there would add that delay to every commit, with the speed of light between cities as the floor. So DR replication is asynchronous: it accepts a few seconds of potential data loss in a true regional disaster in exchange for not crippling production performance every second of every normal day. Multi-target replication lets the primary feed both at once.
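To make that trade-off concrete, here is a rough back-of-the-envelope model. It assumes commit latency is approximately the local log write plus one network round trip when replication is synchronous. The function names and every number (0.5 ms local write, 1.5 ms inter-zone and 30 ms cross-region round trips) are illustrative assumptions, not measurements of any real landscape.

```python
def commit_latency_ms(local_log_write_ms: float, rtt_ms: float, synchronous: bool) -> float:
    """Approximate per-commit latency seen by the application."""
    # Synchronous replication waits for the secondary's acknowledgement,
    # so one network round trip is added to every commit.
    return local_log_write_ms + (rtt_ms if synchronous else 0.0)


def max_serial_commits_per_second(latency_ms: float) -> float:
    """Upper bound on back-to-back commits issued by a single session."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    local = 0.5  # assumed local log write latency in ms
    scenarios = [
        ("in-region sync", 1.5, True),
        ("cross-region sync", 30.0, True),
        ("cross-region async", 30.0, False),
    ]
    for label, rtt, sync in scenarios:
        latency = commit_latency_ms(local, rtt, sync)
        rate = max_serial_commits_per_second(latency)
        print(f"{label:20s} {latency:6.1f} ms per commit, {rate:7.0f} serial commits/s")
```

Under these assumed figures the cross-region synchronous case is more than an order of magnitude slower per commit than the in-region one, which is the whole argument for keeping the DR link asynchronous.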
Why Azure NetApp Files and not just managed disks. HANA’s log writes are latency-critical and SAP enforces a strict storage KPI (the certified throughput/latency requirements for /hana/log). ANF delivers consistently low NFS latency and, crucially, storage-level snapshots that AzAcSnap can make application-consistent — quiescing HANA for the split second of the snapshot. That snapshot capability is the fast-recovery tier: restoring a multi-terabyte HANA from a storage snapshot is minutes, where a streaming Backint restore of the same data is hours.
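The minutes-versus-hours gap follows from simple arithmetic: a streaming restore is bounded by sustained throughput, while a snapshot revert does not copy the data at all. The sketch below only estimates the streaming side; the 4 TiB database size and 400 MiB/s throughput are hypothetical inputs, so substitute measured values from your own restore tests.

```python
def streaming_restore_hours(data_size_gib: float, throughput_mib_per_s: float) -> float:
    """Lower-bound duration of a restore that must stream every byte back."""
    if throughput_mib_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = data_size_gib * 1024 / throughput_mib_per_s  # GiB -> MiB, then divide
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 4 TiB HANA data volume restored at an assumed 400 MiB/s.
    hours = streaming_restore_hours(4096, 400)
    print(f"Streaming restore takes at least {hours:.1f} hours, before log replay")
```

The estimate ignores log replay and HANA startup time, both of which add to the real figure.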
Why ERS is non-negotiable. The single most common way an SAP HA design fails review is forgetting the ASCS is a single point of failure and that its in-memory enqueue lock table must survive a failover. Without the Enqueue Replication Server, an ASCS failover drops every in-flight lock and rolls back open transactions — a self-inflicted outage during what should be a graceful failover. ERS keeps a synchronous replica of the lock table on a second node so the failover is transparent to running work.
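A toy model can illustrate the difference. The classes below are a deliberately simplified stand-in, not SAP's enqueue implementation: a lock is copied to a replica table before it is granted, and a successor rebuilds its table from that replica. All names here are hypothetical.

```python
class EnqueueServer:
    """Toy stand-in for the ASCS enqueue server (not SAP's implementation)."""

    def __init__(self, replica=None):
        self.locks = {}         # lock table held in memory on the ASCS node
        self.replica = replica  # dict living on the ERS node, or None without ERS

    def acquire(self, obj, owner):
        if obj in self.locks:
            return False        # another owner already holds the lock
        if self.replica is not None:
            self.replica[obj] = owner  # copy to the replica before granting
        self.locks[obj] = owner
        return True


def fail_over(failed):
    """Start a new enqueue server after the old ASCS node is lost."""
    successor = EnqueueServer(replica=failed.replica)
    if failed.replica is not None:
        successor.locks = dict(failed.replica)  # rebuild the table from the ERS copy
    return successor


if __name__ == "__main__":
    with_ers = EnqueueServer(replica={})
    with_ers.acquire("sales-order-4711", "user1")
    print("with ERS:", fail_over(with_ers).locks)

    without_ers = EnqueueServer()
    without_ers.acquire("sales-order-4711", "user1")
    print("without ERS:", fail_over(without_ers).locks)
```

In the run without a replica the successor starts with an empty table, which is the situation in which open transactions lose their locks.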
Implementation guidance
Build the landscape as code, in the right order. The dependency chain matters: network and storage before VMs, VMs before cluster, cluster before HANA install.
- A hub/spoke network with subnets for the DB tier, app tier, ANF (its own delegated subnet), and management, plus a VPN/ExpressRoute back to the corporate network for basis administration.
- Azure NetApp Files capacity pools and volumes for /hana/data, /hana/log, /hana/shared, sized to the HANA memory footprint and the certified throughput tier.
- The M-series VMs across two Availability Zones in the primary region and the DR VM in the paired region, with accelerated networking and proximity placement where latency-sensitive.
- OS, HANA, and cluster configuration applied with Ansible — idempotent roles for SUSE/RHEL tuning, HANA install, HSR setup, and the Pacemaker resource configuration — so the whole landscape is reproducible and the DR region is provably identical to production.
- The Standard Load Balancer virtual IPs for HANA and ASCS, with HA-ports and floating IP.
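The dependency chain stated above (network and storage before VMs, VMs before cluster, cluster before HANA install) can be written down as a graph and ordered mechanically. This is a minimal sketch using Python's standard graphlib; the stage names are hypothetical labels, and a real pipeline would express the same ordering through Terraform dependencies and Ansible play order.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish before it starts.
DEPENDENCIES = {
    "network": set(),
    "anf_storage": set(),
    "vms": {"network", "anf_storage"},
    "pacemaker_cluster": {"vms"},
    "hana_install": {"pacemaker_cluster"},
}


def build_order(dependencies):
    """Return one valid build order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, stage in enumerate(build_order(DEPENDENCIES), start=1):
        print(step, stage)
```

Keeping the graph explicit means a circular dependency introduced by a later change fails loudly instead of producing a half-built landscape.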
The infrastructure is defined in Terraform and the host configuration in Ansible, applied by a pipeline running in Jenkins (or GitHub Actions) that authenticates to Azure via OIDC federation, so there is no stored service-principal secret to leak, a mistake this team intends never to repeat. Wiz Code scans every Terraform pull request for misconfigurations (a public NSG rule, an unencrypted disk, a vault without purge protection) before merge, so posture is enforced left of deploy rather than caught in production by Wiz’s agentless runtime scanning.
A minimal Pacemaker resource shape communicates the HA intent — the cluster owns the HANA topology and the virtual IP, and fences a hung node:
```
# SAPHanaSR-managed primary/secondary with a floating virtual IP
primitive rsc_SAPHana_HDB ocf:suse:SAPHana \
  params SID=HDB InstanceNumber=00 \
    PREFER_SITE_TAKEOVER=true AUTOMATED_REGISTER=true \
  op start timeout=3600 op stop timeout=3600 op monitor interval=60

# Virtual IP follows the primary
primitive rsc_ip_HDB ocf:heartbeat:IPaddr2 \
  params ip=10.20.1.10

# STONITH: fence a hung node by powering it off via the Azure API
primitive rsc_stonith_azure stonith:fence_azure_arm \
  params subscriptionId="..." resourceGroup="rg-sap-prod" \
    msi=true pcmk_reboot_action="off"
```
AUTOMATED_REGISTER=true is the choice that lets the old primary rejoin as the new secondary automatically after it recovers, instead of waiting for a basis admin to re-register it by hand.
Identity: federate the humans, lease the secrets. Business users and basis admins authenticate Okta → Entra: corporate Okta credentials and conditional-access policies federate to Microsoft Entra ID over OIDC, and Entra issues the SAML assertion that logs them into the Fiori launchpad and SAP NetWeaver — one login, MFA enforced, no SAP-local passwords for humans. The credentials the systems need — the HANA backup user, the <sid>adm and SYSTEM operational secrets, RFC service accounts — live in HashiCorp Vault, leased dynamically and short-lived, so nothing sensitive sits in a profile parameter or a script on a jump host. Vault also brokers custody of the customer-managed key: the key itself is HSM-backed in Azure Key Vault, and it encrypts both the HANA data volumes and the Azure Backup vault, satisfying the OEM’s customer-controlled-encryption requirement.
Enterprise considerations
Backup, immutability, and retention. The backup layer is what the auditor reads first. Azure Backup for HANA registers via Backint as HANA’s own backup destination, so HANA orchestrates application-consistent fulls and differentials plus transaction-log backups every 15 minutes — that 15-minute log cadence is what delivers genuine point-in-time recovery within the RPO, independent of the replication layer. The Recovery Services vault is configured with immutability locked and soft delete on, so a compromised admin account (or ransomware) cannot delete or shorten the retention of existing recovery points — the backups are write-once for their retention window. Encryption uses the customer-managed key from Key Vault. A practical layering: ANF/AzAcSnap snapshots for fast local restore (minutes), Backint backups for point-in-time and immutability, and long-term retention tiered to cheaper storage for the multi-year financial-records mandate.
| Recovery scenario | Mechanism | Typical RTO | Typical RPO |
|---|---|---|---|
| Single VM / AZ failure | Pacemaker promotes sync secondary | 1–5 min | ~0 (synchronous) |
| ASCS host failure | Pacemaker + ERS failover | 1–3 min | 0 (lock table replicated) |
| Logical corruption / bad transport | Restore from ANF snapshot (AzAcSnap) | Minutes | Last snapshot |
| Point-in-time / ransomware | Azure Backup Backint restore + log replay | Hours | ≤15 min (log backups) |
| Full regional disaster | HSR async takeover + ASR app failover | 15–60 min | Seconds–minutes (async) |
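The point-in-time row depends on having an unbroken chain: one full backup plus every log backup up to the target moment. The sketch below plans such a restore under simplifying assumptions: differentials are ignored, and each log backup is represented only by the timestamp up to which it holds redo. The function name and data shapes are hypothetical and do not reflect the Azure Backup API.

```python
from datetime import datetime, timedelta


def plan_restore(full_backups, log_backups, target):
    """Return (base full backup, log backups to replay, unrecoverable gap)."""
    usable_fulls = [t for t in full_backups if t <= target]
    if not usable_fulls:
        raise ValueError("no full backup at or before the target time")
    base = max(usable_fulls)

    needed = []
    for finished_at in sorted(t for t in log_backups if t > base):
        needed.append(finished_at)
        if finished_at >= target:
            break  # this log backup already contains the target moment

    if needed and needed[-1] >= target:
        reachable = target
    else:
        reachable = needed[-1] if needed else base
    return base, needed, target - reachable


if __name__ == "__main__":
    full = [datetime(2024, 1, 1, 0, 0)]
    logs = [full[0] + timedelta(minutes=15 * i) for i in range(1, 41)]
    base, needed, gap = plan_restore(full, logs, datetime(2024, 1, 1, 10, 7))
    print(f"base={base}, logs to replay={len(needed)}, gap={gap}")
```

When the target falls after the newest log backup, the gap is what cannot be recovered from backups alone, and with a 15-minute cadence it stays below 15 minutes as long as no log backup is missed.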
Security & Zero Trust. The architecture is identity-gated and least-privilege by construction: no SAP-local human passwords, customer-managed keys, no public data-plane on the SAP PaaS surfaces. Layer on top: (a) CrowdStrike Falcon sensors on every HANA, ASCS, and application host for runtime threat detection, feeding the supplier’s SOC (the SAP VMs are long-lived and patched on SAP’s cadence, which makes runtime EDR essential rather than optional); (b) Wiz running continuous CSPM and sensitive-data-exposure scanning across the VMs, ANF volumes, and the vault, alerting the moment a resource drifts to public exposure or an NSG widens; (c) Wiz Code as the pre-merge gate on the Terraform/Ansible repo; (d) network segmentation with NSGs so the DB tier accepts traffic only from the app tier and management subnet, never the internet; (e) Akamai WAF in front of any internet-facing Web Dispatcher/Fiori path. A cluster failover or a guardrail breach auto-raises a ServiceNow incident so operations and security have a ticket, not just a log line.
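The segmentation rule in (d) is easy to state and easy to break with one widened rule, so it is worth checking mechanically. The sketch below flags inbound allow rules on the DB tier whose source is not inside the app or management subnet. The CIDR ranges and the rule dictionaries are hypothetical simplifications and not the Azure NSG schema.

```python
import ipaddress

# Hypothetical address plan; replace with the landscape's real subnets.
APP_SUBNET = ipaddress.ip_network("10.20.2.0/24")
MGMT_SUBNET = ipaddress.ip_network("10.20.9.0/24")
ALLOWED_SOURCES = (APP_SUBNET, MGMT_SUBNET)


def offending_rules(inbound_rules):
    """Names of allow rules whose source range lies outside the allowed subnets."""
    bad = []
    for rule in inbound_rules:
        if rule["access"] != "Allow":
            continue  # deny rules cannot widen exposure
        source = ipaddress.ip_network(rule["source"])
        if not any(source.subnet_of(allowed) for allowed in ALLOWED_SOURCES):
            bad.append(rule["name"])
    return bad


if __name__ == "__main__":
    rules = [
        {"name": "app-to-hana", "access": "Allow", "source": "10.20.2.0/24"},
        {"name": "jump-host", "access": "Allow", "source": "10.20.9.16/28"},
        {"name": "allow-any", "access": "Allow", "source": "0.0.0.0/0"},
        {"name": "deny-rest", "access": "Deny", "source": "0.0.0.0/0"},
    ]
    print(offending_rules(rules))
```

A check of this shape could run against the planned rule set in the pull-request pipeline, complementing rather than replacing the posture scanning described above.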
Cost optimization. The M-series HANA VMs dominate the bill — they are large and reserved, so engineer the spend deliberately.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Reserved / savings plan | 1–3-year commit on the steady production HANA VMs | ~40–60% off pay-as-you-go on the biggest line item |
| Right-size each tier | Sync secondary matches primary; app tier scales to demand | Avoids over-provisioning the stateless tier |
| DR sizing | Run DR HANA VM smaller until a real failover, scale up on invoke | Pay full size only during an actual disaster |
| ANF tiering | Premium for log, lower service level for /hana/shared | Buys latency only where HANA’s KPI demands it |
| Backup tiering | Hot for recent recovery points; archive for the multi-year mandate | Cuts long-term retention cost sharply |
| Non-prod auto-stop | Snooze sandbox/QA HANA outside business hours | Removes idle spend on the non-prod copies |
A subtle but real lever: the DR HANA VM can run a smaller SKU while it is only replaying logs, and be resized up at failover time — you pay for full DR capacity only during an actual disaster, not 8,760 hours a year, while still keeping a warm, consistent replica.
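The arithmetic behind that lever is worth writing out. The hourly rates and failover duration below are purely hypothetical placeholders, not Azure prices; the point is the shape of the calculation, not the figures.

```python
HOURS_PER_YEAR = 8760


def annual_dr_cost(small_rate, full_rate, failover_hours):
    """Yearly cost when the DR VM runs small except while a failover is active."""
    if not 0 <= failover_hours <= HOURS_PER_YEAR:
        raise ValueError("failover_hours must fit within one year")
    return small_rate * (HOURS_PER_YEAR - failover_hours) + full_rate * failover_hours


def annual_saving(small_rate, full_rate, failover_hours):
    """Saving compared with running the full-size SKU all year."""
    return full_rate * HOURS_PER_YEAR - annual_dr_cost(small_rate, full_rate, failover_hours)


if __name__ == "__main__":
    # Assumed: 8 per hour for the small SKU, 20 per hour for the full SKU,
    # and 72 hours per year spent failed over (tests plus one real event).
    print("resize-on-failover cost:", annual_dr_cost(8, 20, 72))
    print("saving versus always full size:", annual_saving(8, 20, 72))
```

The model leaves out the time a resize adds to the failover itself, which belongs in the RTO budget rather than the cost sheet.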
Scalability. Only the application tier scales horizontally — add NetWeaver app servers to SAP logon groups and the load balancer spreads the sessions. The HANA tier scales vertically: a bigger M-series SKU when the in-memory dataset grows (HANA scale-out across nodes exists but is a major step reserved for the very largest systems, and most S/4HANA stays single-node-scale-up). Plan the growth runway by tracking HANA memory utilization in Dynatrace (or Datadog) and sizing the next SKU before you hit the ceiling — a scale-up means a VM resize and a brief failover, so it is scheduled, not reactive.
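One simple way to turn memory tracking into a runway estimate is a straight-line fit over monthly peaks. The sketch below is such a fit in plain Python; the 85% planning ceiling and the sample figures are assumptions, and real growth is rarely perfectly linear, so treat the result as a planning prompt rather than a forecast.

```python
def months_until_ceiling(monthly_used_gib, vm_memory_gib, ceiling_fraction=0.85):
    """Months until usage reaches the planning ceiling, or None if not growing."""
    n = len(monthly_used_gib)
    if n < 2:
        raise ValueError("need at least two monthly samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_used_gib) / n
    # Ordinary least-squares slope: GiB of growth per month.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_used_gib)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    ceiling = vm_memory_gib * ceiling_fraction
    return max(0.0, (ceiling - monthly_used_gib[-1]) / slope)


if __name__ == "__main__":
    # Hypothetical monthly peak HANA memory use on a VM with 2000 GiB of memory.
    print(months_until_ceiling([1000, 1100, 1200, 1300], vm_memory_gib=2000))
```

If the runway is shorter than the lead time for a scheduled resize and failover window, the next SKU decision is already late.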
Observability: instrument the layers that actually fail. Put the Dynatrace OneAgent (or the Datadog Agent) on every host and emit the SAP-specific signals, because generic CPU graphs miss the failure modes that matter here:
- HSR replication lag on both the synchronous and asynchronous secondaries: a growing async lag silently erodes your DR RPO, and you want an alert long before a disaster reveals it.
- Pacemaker cluster state: a node in UNCLEAN or a failed STONITH means your next failover will not work; this must page immediately.
- ANF latency against the HANA storage KPI: drift here degrades HANA before users notice.
- Backint backup success and log-backup cadence: a missed log backup is a silent RPO gap; a failed full breaks the chain.
- Enqueue lock-table replication between ASCS and ERS.
Pipe these to alerting, and route a failover event or a backup failure into ServiceNow automatically. The metric the business actually cares about — can we recover, and how much would we lose — is the combination of HSR lag and last-good-backup age, so surface those two on the operations dashboard the CIO sees.
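The "how much would we lose" question can be reduced to a small calculation over those two signals. The sketch below maps each recovery path to its current data-loss exposure and lists the paths that exceed the 15-minute RPO from the mandate. The field names are hypothetical, and a real implementation would read these values from the monitoring platform rather than take them as arguments.

```python
from dataclasses import dataclass

RPO_LIMIT_MIN = 15.0  # the contractual RPO stated in the mandate


@dataclass
class RecoverabilitySignals:
    sync_hsr_lag_s: float            # replication lag to the in-region secondary
    async_hsr_lag_s: float           # replication lag to the DR-region secondary
    last_log_backup_age_min: float   # age of the newest successful log backup


def exposure_minutes(signals):
    """Data that would be lost right now, per recovery path, in minutes."""
    return {
        "in_region_failover": signals.sync_hsr_lag_s / 60,
        "regional_disaster": signals.async_hsr_lag_s / 60,
        "backup_restore": signals.last_log_backup_age_min,
    }


def rpo_breaches(signals, limit_min=RPO_LIMIT_MIN):
    """Recovery paths whose current exposure exceeds the RPO limit."""
    return sorted(name for name, minutes in exposure_minutes(signals).items() if minutes > limit_min)


if __name__ == "__main__":
    healthy = RecoverabilitySignals(0, 45, 12)
    degraded = RecoverabilitySignals(0, 1200, 31)
    print("healthy:", rpo_breaches(healthy))
    print("degraded:", rpo_breaches(degraded))
```

A non-empty result is the condition worth routing to a ticket, because it means at least one recovery path would currently miss the contract.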
Reliability & DR testing — the part the questionnaire grades. A DR plan that has never been exercised is a liability, and the OEM’s security questionnaire asks for the date of the last test. Azure Site Recovery recovery plans let you run a non-disruptive test failover of the application and ASCS tier into an isolated network in the DR region on a schedule, while an HSR takeover drill validates the database side — all without touching production. Run it quarterly, capture the result as a ServiceNow change record, and you have the auditable evidence the contract requires. The Pacemaker PREFER_SITE_TAKEOVER=true and AUTOMATED_REGISTER=true settings, plus a documented runbook for the cross-region promotion, are what turn DR from a fire drill into a procedure.
Governance. Define and document the RTO/RPO per recovery scenario (the table above is the contract translated into engineering). Keep the entire landscape — Terraform, Ansible, Pacemaker configuration — in version control so the DR region is provably identical to production and any change is reviewable and revertable. Gate cluster and infrastructure changes through ServiceNow so basis changes have an approval trail. Pin OS and HANA patch levels deliberately and promote them through a QA copy of the landscape before production, because an untested kernel or HANA revision is exactly the kind of change that breaks a Pacemaker resource agent at the worst possible moment.
Explicit tradeoffs
Accept these or do not build it. This architecture is genuinely complex — a clustered, fenced, replicated, multi-region stateful database is more moving parts than any stateless app, and it demands SAP basis and Linux HA expertise to operate. Synchronous in-region replication costs you a second full-sized M-series VM doing nothing but staying warm — that is the price of zero-data-loss HA, and there is no cheaper way to get it for HANA. The cross-region DR adds a third replica, ASR replication egress, and a runbook you must keep tested. Customer-managed keys add a key-custody responsibility — lose or revoke the key and you lose access to the data, so the Vault/Key Vault key lifecycle becomes itself a thing to protect and back up. And the Pacemaker/STONITH layer, done wrong, can cause the very outages it exists to prevent (a misconfigured fence agent that power-cycles a healthy node) — which is exactly why it is built as code, tested, and not hand-edited in production.
The alternatives, and when they win. If your SAP system is not under contractual line-down penalties (an internal reporting system, say, or a non-prod landscape), a single-VM HANA with Azure Backup and a documented rebuild-from-backup DR is dramatically simpler and may be the right economic call; you accept hours of RTO for a fraction of the cost. If you want to step entirely out of the OS/cluster/storage business, RISE with SAP (SAP’s managed private-cloud offering, available on Azure) hands the HA/DR/backup operation to SAP and shifts this from an architecture you run to a service you consume, at the cost of control and a different commercial model. And for the database HA layer specifically, some shops accept a slightly higher RTO and skip Pacemaker in favor of a simpler manual-failover HSR setup, which is viable only where a few extra minutes of recovery time do not carry five-figure hourly penalties. For this supplier, with assembly lines stopping on the other end of the contract, the full clustered-replicated-backed-up architecture is not gold-plating; it is the minimum that lets the basis team, the CISO, and the OEM’s auditor each say yes.
The shape of the win
For the supplier, the payoff is not “SAP in the cloud.” It is that a HANA primary VM can die at 3 a.m. and Pacemaker promotes the synchronous secondary in another Availability Zone before the night-shift dispatcher notices, with zero lost postings; that a regional disaster fails over to the paired region in under an hour with seconds of data loss instead of a day; and that when the OEM’s auditor asks for an immutable-backup proof and a dated DR-test record, the answer is a ServiceNow change ticket from last quarter, not a promise. That last point is what renews the contract. Everything upstream — the M-series VMs sized to HANA, the synchronous and asynchronous HSR tiers, the Pacemaker fencing, the ANF snapshots, the immutable Backint backups under a customer-managed key, the Wiz posture scanning and CrowdStrike runtime defense, the HSR-lag alert in Dynatrace — exists so that a CFO counting line-down penalties, a CISO reading a security questionnaire, and an SAP basis lead on call all reach the same conclusion: the mission-critical ERP is finally on infrastructure that will not lose the company a customer. Start narrower if the penalties do not justify it, but for a system of record under a just-in-sequence contract, this is where it has to land.