Single Server is gone – it retired in March 2025 – and every PostgreSQL workload on Azure that still mattered had to land on Flexible Server. That migration is the right moment to fix the things teams got wrong the first time: HA that was never failed over, a connection storm that took down a perfectly healthy database, a read replica nobody had promoted, and a major-version upgrade everyone was too scared to run. Flexible Server gives you real building blocks for all four – zone-redundant HA, an in-engine PgBouncer, asynchronous read replicas, and in-place major version upgrades. None of them are safe until you have driven them in anger. Here is how to wire each one correctly and prove it.
1. Flexible Server architecture versus retired Single Server
The mental model is different and it matters. Single Server was a gateway-fronted PaaS where you never saw the compute. Flexible Server runs the Postgres engine on a dedicated Linux VM that lives in your topology – you choose the zone, you choose whether it sits in a delegated subnet, and storage is a separate managed disk underneath. That single architectural change is why HA, replicas, and PgBouncer all behave the way they do.
Storage and compute scale independently. Compute is a burstable, general-purpose, or memory-optimized SKU you can resize with a restart. Storage is a Premium SSD (v1) or Premium SSD v2 disk where IOPS and throughput are either a function of disk size or – on SSD v2 – provisioned independently. Provision a baseline so you are not surprised at scale-out:
az postgres flexible-server create \
--resource-group rg-data-eastus2 \
--name pg-prod-eastus2 \
--location eastus2 \
--tier MemoryOptimized --sku-name Standard_E4ds_v5 \
--version 16 \
--storage-size 256 \
--high-availability ZoneRedundant \
--vnet vnet-data-eastus2 --subnet snet-postgres \
--private-dns-zone pg-prod.private.postgres.database.azure.com \
--admin-user pgadmin --admin-password "<from-key-vault>"
Burstable (B-series) SKUs do not support high availability and are not eligible for the read replica feature in production-grade configurations. If a workload needs HA or replicas, start on General Purpose (Ds_v5) or Memory Optimized (Es_v5). You cannot retrofit HA onto a Burstable instance without a tier change.
2. Zone-redundant and same-zone HA failover behavior
Flexible Server HA is a hot standby: a second VM running the same engine, kept in sync by synchronous streaming replication, with its own copy of the data on separate storage. This is not a read replica – you cannot read from the standby, and you do not pay for it as a separate server, you pay for the doubled compute and storage. There are two flavors:
- Zone-redundant HA places primary and standby in different availability zones in the same region. It survives a zone outage. This is the default you want for anything customer-facing.
- Same-zone HA places the standby in the same zone (different fault domain). It protects against a node failure but not a zone outage. Use it only where the region has no zones, or where cross-zone latency on synchronous commit is unacceptable for the workload.
Because replication is synchronous, every committed write is on both nodes before the client gets its acknowledgement – so a failover is zero data loss (RPO = 0). Failover is automatic on primary failure and typically completes in tens of seconds; the client sees a dropped connection and the same FQDN now points at the promoted standby.
You can and should rehearse it. A planned (forced) failover is a single command and is the only honest way to measure your application’s reconnect behavior:
# Drill the failover. Measure how long your app takes to recover.
az postgres flexible-server restart \
--resource-group rg-data-eastus2 \
--name pg-prod-eastus2 \
--failover Forced
Synchronous HA has a cost: commit latency now includes the cross-zone round trip. For chatty, small-transaction OLTP this is real. Measure
commitlatency before and after enabling zone-redundant HA; if it hurts, same-zone HA trades zone resilience for lower latency, and a cross-region read replica (Section 5) can carry the DR requirement instead.
Enable or change HA on an existing server without recreating it:
az postgres flexible-server update \
--resource-group rg-data-eastus2 \
--name pg-prod-eastus2 \
--high-availability ZoneRedundant \
--standby-zone 3
3. VNet integration versus private endpoint connectivity
Flexible Server offers two mutually exclusive networking models, and you choose at creation time – you cannot convert one to the other later. Get this decision right up front.
VNet integration (private access) injects the server into a delegated subnet. The subnet must be delegated to Microsoft.DBforPostgreSQL/flexibleServers and used by nothing else. The server has no public endpoint at all; it is reachable only from peered or connected networks, and name resolution flows through a linked private DNS zone.
# The delegated subnet -- nothing else may live here.
az network vnet subnet create \
--resource-group rg-net-eastus2 \
--vnet-name vnet-data-eastus2 \
--name snet-postgres \
--address-prefixes 10.40.2.0/27 \
--delegations Microsoft.DBforPostgreSQL/flexibleServers
Private Link (private endpoint) keeps the server logically separate and projects it into your VNet through a NIC with a private IP. This is the model that composes cleanly with hub-and-spoke, centralized private DNS, and resources that already standardize on private endpoints. Public network access can be disabled entirely so the only path in is the endpoint.
The practical decision rule:
| Need | Choose |
|---|---|
| Simplest private model, dedicated subnet acceptable | VNet integration |
| Hub-and-spoke with centralized private DNS / endpoint policy | Private endpoint |
| Connect from many spokes / on-prem over a standard PE pattern | Private endpoint |
| Convert later between the two | Not possible – pick deliberately |
Either way, your private DNS zone (privatelink.postgres.database.azure.com for PE, or the linked zone for VNet integration) must resolve the server FQDN to the private IP, or clients fall back to the public name and fail when public access is off. This is the single most common “it worked in dev” outage.
4. Built-in PgBouncer and connection management
PostgreSQL forks a backend process per connection. Every idle connection from an over-eager application pool consumes a few MB of server RAM and a slot against max_connections. Serverless front ends and high-fan-out microservices routinely open thousands of short-lived connections and exhaust the server long before they exhaust CPU. The fix is a transaction-mode pooler, and Flexible Server ships PgBouncer built in – no sidecar, no separate VM to patch.
Turn it on with server parameters and connect through port 6432 instead of 5432:
az postgres flexible-server parameter set \
--resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
--name pgbouncer.enabled --value true
az postgres flexible-server parameter set \
--resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
--name pgbouncer.default_pool_size --value 50
az postgres flexible-server parameter set \
--resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
--name pgbouncer.pool_mode --value transaction
The connection string just changes the port:
psql "host=pg-prod-eastus2.postgres.database.azure.com port=6432 \
dbname=appdb user=app_svc sslmode=require"
Two engineering caveats that cause real bugs in transaction mode, because a connection is handed back to the pool at the end of every transaction:
- Session-scoped state does not survive. Prepared statements,
SEToutside a transaction, advisory session locks,LISTEN/NOTIFY, and temp tables are unsafe. Many drivers default to server-side prepared statements – disable them (e.g. JDBCprepareThreshold=0, or libpq-level handling) or you will get cryptic “prepared statement does not exist” errors. - PgBouncer survives failover with the FQDN. Because it runs on the server VM, the same hostname:6432 works after an HA failover. Good. But the pool is local to the primary – it is not a separate tier you scale independently.
One more subtlety on a server with HA: PgBouncer runs on the primary. After a failover the promoted standby starts its own PgBouncer, so the parameter values must already be set (they are server-wide). Do not bake PgBouncer config into application start-up assuming it is external – it is part of the server.
5. Read replicas, promotion, and cross-region DR
Read replicas are asynchronous physical replicas. They serve read-only traffic and double as a disaster-recovery target in another region. Asynchronous means non-zero RPO – replication lag – so a replica is a DR option, not an HA substitute. Create one in the same or a different region:
az postgres flexible-server replica create \
--resource-group rg-data-westus2 \
--replica-name pg-prod-westus2-ro \
--source-server "/subscriptions/<sub>/resourceGroups/rg-data-eastus2/providers/Microsoft.DBforPostgreSQL/flexibleServers/pg-prod-eastus2" \
--location westus2
Point reporting and read-heavy queries at the replica’s own FQDN. Watch lag with the platform metric read_replica_lag_in_seconds (or query pg_stat_replication on the primary) and alert on it – a replica silently falling hours behind is a DR target that will lose hours of data.
Promotion is the DR action. Flexible Server gives you two promote semantics, and choosing the wrong one is a classic mistake:
- Promote to independent server (switchover/standalone): the replica becomes a full read-write server and detaches from the primary. This is the real DR failover.
- Planned switchover: under controlled conditions Azure can sync the last changes before promoting, minimizing data loss for a planned event.
# DR: promote the replica to a standalone read-write server.
az postgres flexible-server replica promote \
--resource-group rg-data-westus2 \
--name pg-prod-westus2-ro \
--promote-mode standalone \
--promote-option forced
A replica is only a credible DR target if you have promoted one in a drill and repointed an app at it. The most expensive surprise is discovering at 3am that promotion is one-way: once you promote
forced, that replica is now an independent primary and the original replication relationship is gone. Rehearse on a disposable copy first.
6. Server parameters, extensions, and storage/IOPS tuning
Parameters. Flexible Server exposes Postgres GUCs as server parameters. Some apply dynamically; others are flagged static and require a restart. Treat them as code – set them through CLI/Bicep/Terraform, never click-ops:
# Static -> requires a restart. Set then restart in a maintenance window.
az postgres flexible-server parameter set \
--resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
--name max_connections --value 600
Extensions are not on until you allowlist them. Add the extension to azure.extensions (the server parameter that gates what CREATE EXTENSION is allowed to load), then create it in the database:
az postgres flexible-server parameter set \
--resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
--name azure.extensions --value "pg_stat_statements,pgcrypto,uuid-ossp"
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
Some extensions –
pg_cron,pg_stat_statements,pg_prewarm,auto_explain– also need to be inshared_preload_libraries, which is a static parameter. Set the preload list, restart once, then runCREATE EXTENSION. Skipping the preload step is whypg_cron“isn’t there” after enabling it.
Storage and IOPS. On Premium SSD (v1), IOPS scale with disk size and storage auto-grow bumps the disk when it nears full – enable it so you never take an out-of-space outage. On Premium SSD v2, you provision capacity, IOPS, and throughput independently, which decouples performance from size and avoids over-provisioning a 4 TB disk just to buy IOPS:
az postgres flexible-server update \
--resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
--storage-auto-grow Enabled
7. In-place major version upgrades and pre-upgrade validation
Flexible Server supports in-place major version upgrades – e.g. 14 -> 16 – without dump/restore or a new server. It is offline (a downtime window while pg_upgrade runs), one-way, and irreversible, so the order of operations is non-negotiable:
- Take a known-good backup or on-demand snapshot first. PITR is your only way back.
- Validate compatibility. Run the upgrade in validate-only mode so the platform performs the pre-upgrade checks (
pg_upgrade --check) without committing. - Rehearse on a PITR-restored clone of production to time the window and catch extension/deprecation issues.
- Execute in a window, then verify and re-
ANALYZE.
# 1. On-demand backup before anything.
az postgres flexible-server backup create \
--resource-group rg-data-eastus2 --name pg-prod-eastus2 \
--backup-name pre-upgrade-v16
# 2. Dry run: validate compatibility WITHOUT upgrading.
az postgres flexible-server upgrade \
--resource-group rg-data-eastus2 --name pg-prod-eastus2 \
--version 16 --perform-validation-only
# 3. Execute (offline) once validation passes.
az postgres flexible-server upgrade \
--resource-group rg-data-eastus2 --name pg-prod-eastus2 \
--version 16
pg_upgradedoes not carry over optimizer statistics. The database is up but query plans are bad until you runANALYZE(orvacuumdb --all --analyze-in-stages) post-upgrade. Teams that skip this report a “slow database after the upgrade” – it is missing stats, not the engine. MakeANALYZEthe last step of the runbook, before you reopen traffic. Also note: a read replica cannot be major-upgraded independently of its primary – plan replicas around the upgrade.
8. Backup, PITR, geo-redundancy, and restore drills
Backups are automatic and free up to your storage size, with a configurable retention of 7 to 35 days. The choices that matter are redundancy and proving restore:
- Locally/zone-redundant backup keeps copies in-region. Geo-redundant backup must be enabled at creation time and replicates backups to the paired region, enabling geo-restore if the whole region is lost. You cannot turn geo-redundancy on after the fact.
- PITR restores to a new server at any second within the retention window. That is the muscle you exercise to clone production for upgrade rehearsals and to recover from a bad migration or accidental
DELETE.
# Point-in-time restore to a NEW server (clone for drills, or recover).
az postgres flexible-server restore \
--resource-group rg-data-eastus2 \
--name pg-prod-eastus2-pitr \
--source-server pg-prod-eastus2 \
--restore-time "2026-06-08T02:15:00Z"
A backup you have never restored is a hypothesis. Schedule a quarterly PITR drill: restore to a new server, connect, run a row-count and checksum sanity check, time it, then delete the clone. The number you get is your real RTO – not the one in the SLA deck.
Verify
Prove each layer independently, from the outside in:
# 1. HA is healthy and the standby is in the expected (different) zone.
az postgres flexible-server show \
--resource-group rg-data-eastus2 --name pg-prod-eastus2 \
--query "{ha:highAvailability.state, mode:highAvailability.mode, \
primaryZone:availabilityZone, standbyZone:highAvailability.standbyAvailabilityZone}"
# 2. Private DNS resolves the FQDN to a PRIVATE IP (not a public address).
nslookup pg-prod-eastus2.postgres.database.azure.com
# 3. PgBouncer is actually listening on 6432 and pooling.
psql "host=pg-prod-eastus2.postgres.database.azure.com port=6432 \
dbname=pgbouncer user=app_svc sslmode=require" -c "SHOW POOLS;"
-- 4. Replica lag is bounded (run on the PRIMARY).
SELECT client_addr, state,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- 5. After a major upgrade, confirm version and that stats exist.
SHOW server_version;
SELECT relname, last_analyze
FROM pg_stat_user_tables
ORDER BY last_analyze NULLS FIRST
LIMIT 10; -- NULLs here mean ANALYZE has not run yet
Expected: HA state is Healthy with primary and standby in different zones, the FQDN resolves to a private IP, SHOW POOLS returns rows on 6432, pg_stat_replication shows a connected replica with low replay_lag, and post-upgrade every hot table has a recent last_analyze.
Enterprise scenario
A retail analytics platform team migrated their order-processing database off the retiring Single Server onto Flexible Server with zone-redundant HA, and the cutover went clean. Two weeks later, during a flash sale, the database fell over – not on CPU, which sat at 40%, but with FATAL: sorry, too many clients already. Their checkout service ran on Azure Functions on the Flex Consumption plan, and under load it scaled to hundreds of instances, each opening its own connection pool. Several thousand connections hit a max_connections of 200 and the engine refused new ones. The HA standby was perfectly healthy and completely irrelevant – failing over to an identically-configured node would have hit the same wall.
The fix was not a bigger SKU. They enabled the built-in PgBouncer in transaction mode, repointed the function app at port 6432, and let a default_pool_size of 50 multiplex thousands of short-lived client connections onto a few dozen real backends. The one code change that mattered: their PostgreSQL driver was using server-side prepared statements, which break in transaction pooling, so they disabled prepared statements at the driver. They encoded the pooler as the default path, not an afterthought:
# The fix that held: pool in transaction mode, app connects on 6432.
az postgres flexible-server parameter set \
--resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
--name pgbouncer.enabled --value true
az postgres flexible-server parameter set \
--resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
--name pgbouncer.pool_mode --value transaction
The lesson that generalizes: HA protects against a node or zone dying, not against your own application’s connection behavior. A serverless or high-fan-out front end in front of PostgreSQL is a connection-management problem first and a database-sizing problem a distant second. Put a transaction-mode pooler in the path before the launch, not after the incident review.