Most SQL Managed Instance “HA” incidents I have reviewed were not engine failures. They were the day the primary region went dark and the team discovered their connection strings hardcoded a regional FQDN, their logins did not exist on the secondary, and nobody had ever actually run a failover. Managed Instance gives you two distinct continuity primitives – auto-failover groups for cross-region geo-replication and the Managed Instance link for hybrid replication to and from SQL Server. They solve different problems, they cannot be combined on the same instance, and both are useless until you have rehearsed the runbook. This is how to wire both correctly and prove they work.
1. Service tiers, zone redundancy, and the built-in HA model
Before you reach for cross-region anything, understand what you already get inside one region. SQL Managed Instance has two operational models depending on service tier:
- General Purpose separates compute from storage. Your database files live on Azure remote storage (premium/standard managed disks) with built-in 3-copy durability. A node failure means a stateless compute node is reattached to the same storage – fast to provision, but recovery is a remount, not an instant replica handoff.
- Business Critical runs an Always On availability group of four replicas locally, with local SSD. One replica is a free read-only replica you reach with
ApplicationIntent=ReadOnly. Failover between local replicas is seconds.
Zone redundancy is the cheapest resilience upgrade most teams skip. Enabling it spreads the replicas (Business Critical) or the storage (General Purpose, where supported) across availability zones in the region, so a single-zone outage does not take the instance down. It is a property of the instance, not a separate resource:
# Enable zone redundancy on an existing instance (Business Critical)
az sql mi update \
--resource-group rg-data-eastus2 \
--name mi-prod-eastus2 \
--zone-redundant true
Zone redundancy protects against a zone failure. It does not protect against a region failure, a bad deployment, or accidental data deletion. That is the line where failover groups and the link begin. Keep the built-in HA and add cross-region DR – they are layers, not alternatives.
2. Provisioning auto-failover groups across two regions
A failover group geo-replicates all user databases on an instance to a partner instance in a different region, with a managed listener that survives failover. Two hard prerequisites trip people up before they ever run the command:
- The two instances must share a DNS zone. A Managed Instance’s host certificate is a SAN cert scoped to a generated DNS zone ID (e.g.
a1b2c3d4e5f6). For a client to reconnect to the secondary using the same cert after failover, the secondary must be in the same DNS zone. You set this at creation time of the secondary by pointing it at the primary as its DNS zone partner – you cannot retrofit it. - The subnets must have connectivity. Global VNet peering is the recommended path (private, backbone, no gateway). VPN/ExpressRoute also work but are slower for the seeding phase.
Create the secondary in the partner’s DNS zone (the --dns-zone-partner takes the primary instance resource ID):
# Secondary instance, placed into the primary's DNS zone
az sql mi create \
--resource-group rg-data-westus2 \
--name mi-prod-westus2 \
--location westus2 \
--subnet "/subscriptions/<sub>/resourceGroups/rg-net-westus2/providers/Microsoft.Network/virtualNetworks/vnet-westus2/subnets/snet-mi" \
--capacity 8 --storage 256GB \
--edition BusinessCritical --family Gen5 \
--dns-zone-partner "/subscriptions/<sub>/resourceGroups/rg-data-eastus2/providers/Microsoft.Sql/managedInstances/mi-prod-eastus2"
With both instances live and peered, create the failover group on the secondary’s resource group, referencing the primary as --source-mi. Use --failover-policy Manual (customer-managed) so you decide when to fail over – this is the strongly recommended posture:
az sql instance-failover-group create \
--name fog-prod-sqlmi \
--resource-group rg-data-westus2 \
--location westus2 \
--source-mi mi-prod-eastus2 \
--partner-mi mi-prod-westus2 \
--partner-resource-group rg-data-westus2 \
--failover-policy Manual \
--grace-period 1
Note the directionality: --source-mi is the current primary (East US 2), --partner-mi is the secondary you just built (West US 2), and --location/--resource-group describe the secondary. Set --secondary-type Standby instead of the default Geo if the secondary is DR-only and you want the license-free passive replica benefit – you stop paying SQL licensing on its vCores as long as you never run read workloads on it.
--grace-period only matters under a Microsoft-managed (Automatic) policy; it is the minimum wait (floor of 1 hour) before Microsoft forces a region-wide failover. Under Manual it is inert.
3. Read-write and read-only listener endpoints for connection routing
This is the entire payoff. The failover group publishes two CNAMEs in the shared DNS zone, and they do not change when the primary role moves:
| Listener | FQDN format | Points to | Connection string intent |
|---|---|---|---|
| Read-write | <fog-name>.<zone_id>.database.windows.net |
current primary | default (read-write) |
| Read-only | <fog-name>.secondary.<zone_id>.database.windows.net |
current secondary | ApplicationIntent=ReadOnly |
So your applications connect to the listener, never to mi-prod-eastus2... directly:
# Read-write (OLTP) -- always lands on whichever instance is primary
Server=tcp:fog-prod-sqlmi.a1b2c3d4e5f6.database.windows.net;Database=orders;Encrypt=True;
# Read-only (reporting) -- lands on the geo-secondary
Server=tcp:fog-prod-sqlmi.secondary.a1b2c3d4e5f6.database.windows.net;Database=orders;ApplicationIntent=ReadOnly;Encrypt=True;
Two operational truths to internalize:
- On failover, the DNS record updates immediately, but existing TCP connections are not migrated. Clients must reconnect, and they will only pick up the new primary after their DNS cache expires. Build aggressive retry-with-reconnect into your data layer; do not assume a live connection rides through.
- These listeners are not reachable over the public endpoint. They resolve over the VNet-local endpoint only – from inside the VNet or peered/VPN/ExpressRoute networks. Plan your app placement accordingly.
4. The Managed Instance link: near-real-time replication from SQL Server
Failover groups are MI-to-MI. The Managed Instance link is the hybrid story: it builds a distributed availability group between a SQL Server instance (on-prem or on an Azure VM) and a SQL Managed Instance, replicating a single database in near-real-time. Because it is a DAG, the underlying tech is the same Always On machinery your DBAs already know.
Key facts that shape the design:
- One database per link. To replicate ten databases you create ten links. A single MI hosts up to 100 links (500 on the Next-gen General Purpose tier).
- Minimum versions matter. One-way replication (SQL Server -> MI) is GA from SQL Server 2016 SP3 / 2017 CU31 / 2019 CU20. Two-way DR with failback requires SQL Server 2022 (16.x) or SQL Server 2025 (17.x).
- Authentication is certificate-based. SQL Server and MI exchange the public keys of their respective certificates. Windows auth cannot establish the link.
- It uses the VNet-local endpoint only – not the public endpoint, not a private endpoint.
- User databases only. Logins, agent jobs, server-level objects, and system databases are not replicated. This is the single most common post-failover outage cause; handle it explicitly (see the runbook).
The trust handshake on the SQL Server side, abbreviated:
-- On SQL Server: certificate to authenticate the database mirroring endpoint
USE master;
CREATE CERTIFICATE SqlServerCert
WITH SUBJECT = 'SQL Server certificate for MI link';
-- Database mirroring endpoint that the DAG rides over (default instance uses 5022)
CREATE ENDPOINT Endpoint_Mirroring
STATE = STARTED
AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
FOR DATABASE_MIRRORING (
ROLE = ALL,
AUTHENTICATION = CERTIFICATE SqlServerCert,
ENCRYPTION = REQUIRED ALGORITHM AES);
If multiple SQL Server instances share a host, each needs its own mirroring endpoint on a dedicated port. Only the default instance should use 5022. SSMS automates the entire certificate exchange and DAG creation through its “Replicate database to Managed Instance” wizard – use it for the first link, then script the rest with the latest
Az.Sqlmodule or REST.
5. One-way vs two-way link scenarios for migration and DR
The link is the same feature with two very different operating modes, gated by SQL Server version and the MI update policy:
| Scenario | Versions | Direction | Use it for |
|---|---|---|---|
| Online migration / read offload | SQL Server 2016-2019 | SQL Server -> MI only | Minimal-downtime migration; offload read-only/reporting/analytics to Azure |
| Two-way DR with failback | SQL Server 2022 / 2025 | bidirectional | Hybrid DR where Azure or on-prem can be primary, and you can fail back |
The hard constraint on two-way: replicating from MI back to SQL Server (and failing over in that direction) is not supported on the “Always-up-to-date” update policy. The instance must run the matching SQL Server 2022 or SQL Server 2025 update policy. If you stand up an MI on Always-up-to-date and later need to fail back to on-prem, you are stuck – you can fail over to MI but never back over the link. Decide the update policy when you provision, based on whether failback to SQL Server is a requirement.
For migration, the link is the only option that delivers a true online cutover to the Business Critical tier. The pattern: establish the link, let it seed and stay synchronized while production keeps running on SQL Server, then fail over to MI during a short, planned cutover window. For SQL Server 2016-2019 that failover is one-directional and breaks the link – which is exactly what you want for a migration, but means you must be confident before you cut over.
6. Planned vs forced failover and the data-loss tradeoffs
Geo-replication is asynchronous over the WAN. That single fact drives the entire failover-vs-forced-failover decision and your achievable RPO. There are two distinct operations:
Planned (no data loss). set-primary without --allow-data-loss. It fully synchronizes the secondary, then switches roles. It requires the current primary to be reachable. This is your tool for DR drills, region relocations, and failback after an outage clears.
# Planned failover: run AGAINST the secondary you want to promote.
# --location/-g identify the secondary. No data loss.
az sql instance-failover-group set-primary \
--name fog-prod-sqlmi \
--resource-group rg-data-westus2 \
--location westus2
Forced (potential data loss). Same command with --allow-data-loss. It promotes the secondary immediately without waiting for in-flight transactions to replicate. This is your only option when the primary region is gone. Anything not yet replicated is lost – that gap is your RPO.
# Forced failover during a real regional outage. Data loss is possible.
az sql instance-failover-group set-primary \
--name fog-prod-sqlmi \
--resource-group rg-data-westus2 \
--location westus2 \
--allow-data-loss
The DNS update is immediate either way, but the role switch can take up to ~5 minutes; until it completes some databases on the new primary remain read-only. After a forced failover, when the old primary comes back it automatically rejoins as the new secondary – do not delete it in a panic. Once the region is healthy, use a planned set-primary (no data loss) to fail back.
There is a way to get a data-lossless forced failover for a drill: stop the workload, drain long-running transactions, disconnect all clients, and confirm the failover group status is
Synchronizingon both instances before forcing. But for production failback, always prefer planned over forced. Forced failover has no guardrails against data loss by design.
7. Networking prerequisites: subnet delegation, NSG, and route tables
Managed Instance is the most networking-opinionated PaaS database Azure ships. Get the subnet wrong and provisioning simply fails. The non-negotiables:
- A dedicated subnet delegated to
Microsoft.Sql/managedInstances. Nothing else lives in it. No other resource types, no other instances from a different DNS zone. - Subnet sizing: a
/27is the documented minimum, but plan/26or larger so you have room to scale vCores (scaling can temporarily need extra addresses). - An NSG and a route table must be associated with the subnet – they are required, not optional. The platform manages a set of mandatory rules/routes; do not strip them.
For a failover group, the two instance subnets across regions need connectivity. Global VNet peering is the recommended choice:
# Global VNet peering between the two MI VNets (run both directions).
az network vnet peering create \
--name peer-eastus2-to-westus2 \
--resource-group rg-net-eastus2 \
--vnet-name vnet-eastus2 \
--remote-vnet "/subscriptions/<sub>/resourceGroups/rg-net-westus2/providers/Microsoft.Network/virtualNetworks/vnet-westus2" \
--allow-vnet-access true --allow-forwarded-traffic true
# ...then the symmetric peer-westus2-to-eastus2 with the remote-vnet reversed.
Subnet delegation and the required associations, expressed declaratively in Bicep so it is reproducible:
resource miSubnet 'Microsoft.Network/virtualNetworks/subnets@2024-05-01' = {
parent: vnet
name: 'snet-mi'
properties: {
addressPrefix: '10.20.0.0/26'
delegations: [
{
name: 'miDelegation'
properties: { serviceName: 'Microsoft.Sql/managedInstances' }
}
]
networkSecurityGroup: { id: nsgMi.id }
routeTable: { id: routeMi.id }
}
}
For the link, connectivity is between SQL Server and the MI subnet: VNet peering if SQL Server is on an Azure VM, or VPN/ExpressRoute from on-prem. The data path uses the database mirroring port (default 5022) plus the AG communication ports; ensure your NSGs and on-prem firewalls allow that traffic over the private link. The link cannot use the public or private endpoint – only the VNet-local endpoint.
8. Building and rehearsing a failover runbook with RPO/RTO targets
A failover capability you have never executed is a liability, not a control. Write the runbook, attach numbers, and drill it on a schedule.
Set the targets first. For Managed Instance geo-failover, design around an RPO of ~5 seconds under healthy async replication (the realistic data-loss window in a forced failover) and an RTO of ~1 hour for the failover operation plus application recovery – the role switch itself is minutes, but end-to-end recovery is gated by your slowest dependency. Document the actual numbers you measure in drills; do not ship the marketing ones.
The runbook, in order:
-
Pre-stage the secondary. Sync logins with matching SIDs, agent jobs, linked servers, credentials, and any
master/msdbobjects. These are NOT replicated. This is the step that turns a “successful” failover into a broken application.-- On the SECONDARY instance: recreate the login with the SAME SID as primary. CREATE LOGIN app_svc WITH PASSWORD = '<strong-secret>', SID = 0x2A...; -- SID from primary -
Decide planned vs forced based on whether the primary is reachable. Drills and failback use planned (
set-primary). A real region-down event uses forced (--allow-data-loss). -
Verify replication health before any planned operation – status must be
Synchronizingon both instances. -
Execute the appropriate
set-primarycommand from section 6. -
Fail over the app tier too. Repoint or restart app components in the secondary region so cross-region latency to the database does not silently degrade you. The DB moving alone is half a failover.
-
Validate (next section), then communicate completion.
-
Fail back with a planned failover once the primary region is healthy – never leave production running on a DR-only-sized secondary longer than necessary.
Automate the health check so the drill is repeatable rather than artisanal:
# Pre-flight gate: refuse to proceed unless the group is Synchronizing.
state=$(az sql instance-failover-group show \
--name fog-prod-sqlmi -g rg-data-westus2 -l westus2 \
--query "replicationState" -o tsv)
echo "Failover group replication state: $state"
[ "$state" = "CATCH_UP" ] || { echo "Not in sync -- aborting drill."; exit 1; }
Drill cadence: quarterly planned failover in production (it is data-lossless and reversible), plus a tabletop of the forced-failover path. The first time you discover the service account login does not exist on the secondary should be a Tuesday-afternoon drill, not a 3 a.m. region outage.
Verify
After configuring a failover group or running a drill, confirm the system is actually in the state you think it is:
# 1. Failover group exists, knows both partners, and reports replication role/state.
az sql instance-failover-group show \
--name fog-prod-sqlmi -g rg-data-westus2 -l westus2 \
--query "{role:replicationRole, state:replicationState, partner:partnerRegion, policy:readWriteEndpoint.failoverPolicy}" -o table
# 2. Both listeners resolve and the right instance answers.
# Read-write must hit the current primary; read-only must hit the secondary.
nslookup fog-prod-sqlmi.a1b2c3d4e5f6.database.windows.net
nslookup fog-prod-sqlmi.secondary.a1b2c3d4e5f6.database.windows.net
-- 3. From inside the VNet, confirm which physical instance you landed on
-- via each listener (run after connecting through the read-write FQDN).
SELECT @@SERVERNAME AS landed_on,
DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS rw_or_ro;
-- 4. For the link: confirm the distributed AG is healthy and synchronizing.
SELECT ag.name, drs.synchronization_state_desc, drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_groups ag ON ag.group_id = drs.group_id;
Expected: the failover group state is CATCH_UP/Synchronizing, both CNAMEs resolve, the read-write listener lands on a writable primary, the read-only listener lands on a read-only secondary, and the link’s DAG reports SYNCHRONIZED/HEALTHY.
Enterprise scenario
A payments platform team ran their core ledger on SQL Server 2019 on-prem and needed to move to Azure SQL Managed Instance with a regulator-mandated cap of 15 minutes of cutover downtime on a database just under 2 TB. Their first instinct was the Log Replay Service, but they could not hit the window during rehearsals – the final log restore plus validation blew past 15 minutes every time.
They switched to the Managed Instance link. Because the link rides a distributed availability group and keeps the MI continuously seeded and synchronized while production stays live on SQL Server, the actual cutover collapsed to: stop the app, confirm the DAG was SYNCHRONIZED, fail over to MI, repoint the app at the failover-group read-write listener. Measured cutover: under 4 minutes.
The trap they hit – and the reason this is worth writing down – was the post-cutover outage that had nothing to do with data. The link replicates the database, not the instance. Their batch settlement agent jobs and the service-account logins simply were not on the MI, so the first overnight settlement run failed even though every row of customer data was intact and correct. They fixed it by scripting logins with matching SIDs and recreating the agent jobs as an explicit pre-cutover task – and then encoded it as a gate in the runbook:
-- Pre-cutover gate on the target MI: fail loudly if a required login is missing,
-- BEFORE anyone declares the migration "done".
DECLARE @required TABLE (name sysname);
INSERT INTO @required VALUES (N'app_settlement_svc'), (N'app_ledger_svc');
IF EXISTS (SELECT 1 FROM @required r
WHERE NOT EXISTS (SELECT 1 FROM sys.server_principals p WHERE p.name = r.name))
THROW 50001, 'Required login(s) missing on MI -- do not cut over.', 1;
The lesson generalizes to failover groups too: cross-region geo-replication and the link both move user databases and nothing else. Server-level state is your job, and it belongs in the runbook as a hard gate, not a hopeful afterthought.