Azure Databases

Azure Database for PostgreSQL Flexible Server: Zone-Redundant HA, Read Replicas, PgBouncer, and In-Place Upgrades

Single Server is gone – it retired in March 2025 – and every PostgreSQL workload on Azure that still mattered had to land on Flexible Server. That migration is the right moment to fix the things teams got wrong the first time: HA that was never failed over, a connection storm that took down a perfectly healthy database, a read replica nobody had promoted, and a major-version upgrade everyone was too scared to run. Flexible Server gives you real building blocks for all four – zone-redundant HA, an in-engine PgBouncer, asynchronous read replicas, and in-place major version upgrades. None of them are safe until you have driven them in anger. Here is how to wire each one correctly and prove it.

1. Flexible Server architecture versus retired Single Server

The mental model is different and it matters. Single Server was a gateway-fronted PaaS where you never saw the compute. Flexible Server runs the Postgres engine on a dedicated Linux VM that lives in your topology – you choose the zone, you choose whether it sits in a delegated subnet, and storage is a separate managed disk underneath. That single architectural change is why HA, replicas, and PgBouncer all behave the way they do.

Storage and compute scale independently. Compute is a burstable, general-purpose, or memory-optimized SKU you can resize with a restart. Storage is a Premium SSD (v1) or Premium SSD v2 disk where IOPS and throughput are either a function of disk size or – on SSD v2 – provisioned independently. Provision a baseline so you are not surprised at scale-out:

az postgres flexible-server create \
  --resource-group rg-data-eastus2 \
  --name pg-prod-eastus2 \
  --location eastus2 \
  --tier MemoryOptimized --sku-name Standard_E4ds_v5 \
  --version 16 \
  --storage-size 256 \
  --high-availability ZoneRedundant \
  --vnet vnet-data-eastus2 --subnet snet-postgres \
  --private-dns-zone pg-prod.private.postgres.database.azure.com \
  --admin-user pgadmin --admin-password "<from-key-vault>"

Burstable (B-series) SKUs do not support high availability and are not eligible for the read replica feature in production-grade configurations. If a workload needs HA or replicas, start on General Purpose (Ds_v5) or Memory Optimized (Es_v5). You cannot retrofit HA onto a Burstable instance without a tier change.

2. Zone-redundant and same-zone HA failover behavior

Flexible Server HA is a hot standby: a second VM running the same engine, kept in sync by synchronous streaming replication, with its own copy of the data on separate storage. This is not a read replica – you cannot read from the standby, and you do not pay for it as a separate server, you pay for the doubled compute and storage. There are two flavors:

Because replication is synchronous, every committed write is on both nodes before the client gets its acknowledgement – so a failover is zero data loss (RPO = 0). Failover is automatic on primary failure and typically completes in tens of seconds; the client sees a dropped connection and the same FQDN now points at the promoted standby.

You can and should rehearse it. A planned (forced) failover is a single command and is the only honest way to measure your application’s reconnect behavior:

# Drill the failover. Measure how long your app takes to recover.
az postgres flexible-server restart \
  --resource-group rg-data-eastus2 \
  --name pg-prod-eastus2 \
  --failover Forced

Synchronous HA has a cost: commit latency now includes the cross-zone round trip. For chatty, small-transaction OLTP this is real. Measure commit latency before and after enabling zone-redundant HA; if it hurts, same-zone HA trades zone resilience for lower latency, and a cross-region read replica (Section 5) can carry the DR requirement instead.

Enable or change HA on an existing server without recreating it:

az postgres flexible-server update \
  --resource-group rg-data-eastus2 \
  --name pg-prod-eastus2 \
  --high-availability ZoneRedundant \
  --standby-zone 3

3. VNet integration versus private endpoint connectivity

Flexible Server offers two mutually exclusive networking models, and you choose at creation time – you cannot convert one to the other later. Get this decision right up front.

VNet integration (private access) injects the server into a delegated subnet. The subnet must be delegated to Microsoft.DBforPostgreSQL/flexibleServers and used by nothing else. The server has no public endpoint at all; it is reachable only from peered or connected networks, and name resolution flows through a linked private DNS zone.

# The delegated subnet -- nothing else may live here.
az network vnet subnet create \
  --resource-group rg-net-eastus2 \
  --vnet-name vnet-data-eastus2 \
  --name snet-postgres \
  --address-prefixes 10.40.2.0/27 \
  --delegations Microsoft.DBforPostgreSQL/flexibleServers

Private Link (private endpoint) keeps the server logically separate and projects it into your VNet through a NIC with a private IP. This is the model that composes cleanly with hub-and-spoke, centralized private DNS, and resources that already standardize on private endpoints. Public network access can be disabled entirely so the only path in is the endpoint.

The practical decision rule:

Need Choose
Simplest private model, dedicated subnet acceptable VNet integration
Hub-and-spoke with centralized private DNS / endpoint policy Private endpoint
Connect from many spokes / on-prem over a standard PE pattern Private endpoint
Convert later between the two Not possible – pick deliberately

Either way, your private DNS zone (privatelink.postgres.database.azure.com for PE, or the linked zone for VNet integration) must resolve the server FQDN to the private IP, or clients fall back to the public name and fail when public access is off. This is the single most common “it worked in dev” outage.

4. Built-in PgBouncer and connection management

PostgreSQL forks a backend process per connection. Every idle connection from an over-eager application pool consumes a few MB of server RAM and a slot against max_connections. Serverless front ends and high-fan-out microservices routinely open thousands of short-lived connections and exhaust the server long before they exhaust CPU. The fix is a transaction-mode pooler, and Flexible Server ships PgBouncer built in – no sidecar, no separate VM to patch.

Turn it on with server parameters and connect through port 6432 instead of 5432:

az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name pgbouncer.enabled --value true

az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name pgbouncer.default_pool_size --value 50

az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name pgbouncer.pool_mode --value transaction

The connection string just changes the port:

psql "host=pg-prod-eastus2.postgres.database.azure.com port=6432 \
  dbname=appdb user=app_svc sslmode=require"

Two engineering caveats that cause real bugs in transaction mode, because a connection is handed back to the pool at the end of every transaction:

One more subtlety on a server with HA: PgBouncer runs on the primary. After a failover the promoted standby starts its own PgBouncer, so the parameter values must already be set (they are server-wide). Do not bake PgBouncer config into application start-up assuming it is external – it is part of the server.

5. Read replicas, promotion, and cross-region DR

Read replicas are asynchronous physical replicas. They serve read-only traffic and double as a disaster-recovery target in another region. Asynchronous means non-zero RPO – replication lag – so a replica is a DR option, not an HA substitute. Create one in the same or a different region:

az postgres flexible-server replica create \
  --resource-group rg-data-westus2 \
  --replica-name pg-prod-westus2-ro \
  --source-server "/subscriptions/<sub>/resourceGroups/rg-data-eastus2/providers/Microsoft.DBforPostgreSQL/flexibleServers/pg-prod-eastus2" \
  --location westus2

Point reporting and read-heavy queries at the replica’s own FQDN. Watch lag with the platform metric read_replica_lag_in_seconds (or query pg_stat_replication on the primary) and alert on it – a replica silently falling hours behind is a DR target that will lose hours of data.

Promotion is the DR action. Flexible Server gives you two promote semantics, and choosing the wrong one is a classic mistake:

# DR: promote the replica to a standalone read-write server.
az postgres flexible-server replica promote \
  --resource-group rg-data-westus2 \
  --name pg-prod-westus2-ro \
  --promote-mode standalone \
  --promote-option forced

A replica is only a credible DR target if you have promoted one in a drill and repointed an app at it. The most expensive surprise is discovering at 3am that promotion is one-way: once you promote forced, that replica is now an independent primary and the original replication relationship is gone. Rehearse on a disposable copy first.

6. Server parameters, extensions, and storage/IOPS tuning

Parameters. Flexible Server exposes Postgres GUCs as server parameters. Some apply dynamically; others are flagged static and require a restart. Treat them as code – set them through CLI/Bicep/Terraform, never click-ops:

# Static -> requires a restart. Set then restart in a maintenance window.
az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name max_connections --value 600

Extensions are not on until you allowlist them. Add the extension to azure.extensions (the server parameter that gates what CREATE EXTENSION is allowed to load), then create it in the database:

az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name azure.extensions --value "pg_stat_statements,pgcrypto,uuid-ossp"
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Some extensions – pg_cron, pg_stat_statements, pg_prewarm, auto_explain – also need to be in shared_preload_libraries, which is a static parameter. Set the preload list, restart once, then run CREATE EXTENSION. Skipping the preload step is why pg_cron “isn’t there” after enabling it.

Storage and IOPS. On Premium SSD (v1), IOPS scale with disk size and storage auto-grow bumps the disk when it nears full – enable it so you never take an out-of-space outage. On Premium SSD v2, you provision capacity, IOPS, and throughput independently, which decouples performance from size and avoids over-provisioning a 4 TB disk just to buy IOPS:

az postgres flexible-server update \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --storage-auto-grow Enabled

7. In-place major version upgrades and pre-upgrade validation

Flexible Server supports in-place major version upgrades – e.g. 14 -> 16 – without dump/restore or a new server. It is offline (a downtime window while pg_upgrade runs), one-way, and irreversible, so the order of operations is non-negotiable:

  1. Take a known-good backup or on-demand snapshot first. PITR is your only way back.
  2. Validate compatibility. Run the upgrade in validate-only mode so the platform performs the pre-upgrade checks (pg_upgrade --check) without committing.
  3. Rehearse on a PITR-restored clone of production to time the window and catch extension/deprecation issues.
  4. Execute in a window, then verify and re-ANALYZE.
# 1. On-demand backup before anything.
az postgres flexible-server backup create \
  --resource-group rg-data-eastus2 --name pg-prod-eastus2 \
  --backup-name pre-upgrade-v16

# 2. Dry run: validate compatibility WITHOUT upgrading.
az postgres flexible-server upgrade \
  --resource-group rg-data-eastus2 --name pg-prod-eastus2 \
  --version 16 --perform-validation-only

# 3. Execute (offline) once validation passes.
az postgres flexible-server upgrade \
  --resource-group rg-data-eastus2 --name pg-prod-eastus2 \
  --version 16

pg_upgrade does not carry over optimizer statistics. The database is up but query plans are bad until you run ANALYZE (or vacuumdb --all --analyze-in-stages) post-upgrade. Teams that skip this report a “slow database after the upgrade” – it is missing stats, not the engine. Make ANALYZE the last step of the runbook, before you reopen traffic. Also note: a read replica cannot be major-upgraded independently of its primary – plan replicas around the upgrade.

8. Backup, PITR, geo-redundancy, and restore drills

Backups are automatic and free up to your storage size, with a configurable retention of 7 to 35 days. The choices that matter are redundancy and proving restore:

# Point-in-time restore to a NEW server (clone for drills, or recover).
az postgres flexible-server restore \
  --resource-group rg-data-eastus2 \
  --name pg-prod-eastus2-pitr \
  --source-server pg-prod-eastus2 \
  --restore-time "2026-06-08T02:15:00Z"

A backup you have never restored is a hypothesis. Schedule a quarterly PITR drill: restore to a new server, connect, run a row-count and checksum sanity check, time it, then delete the clone. The number you get is your real RTO – not the one in the SLA deck.

Verify

Prove each layer independently, from the outside in:

# 1. HA is healthy and the standby is in the expected (different) zone.
az postgres flexible-server show \
  --resource-group rg-data-eastus2 --name pg-prod-eastus2 \
  --query "{ha:highAvailability.state, mode:highAvailability.mode, \
            primaryZone:availabilityZone, standbyZone:highAvailability.standbyAvailabilityZone}"
# 2. Private DNS resolves the FQDN to a PRIVATE IP (not a public address).
nslookup pg-prod-eastus2.postgres.database.azure.com
# 3. PgBouncer is actually listening on 6432 and pooling.
psql "host=pg-prod-eastus2.postgres.database.azure.com port=6432 \
  dbname=pgbouncer user=app_svc sslmode=require" -c "SHOW POOLS;"
-- 4. Replica lag is bounded (run on the PRIMARY).
SELECT client_addr, state,
       write_lag, flush_lag, replay_lag
FROM   pg_stat_replication;
-- 5. After a major upgrade, confirm version and that stats exist.
SHOW server_version;
SELECT relname, last_analyze
FROM   pg_stat_user_tables
ORDER  BY last_analyze NULLS FIRST
LIMIT  10;  -- NULLs here mean ANALYZE has not run yet

Expected: HA state is Healthy with primary and standby in different zones, the FQDN resolves to a private IP, SHOW POOLS returns rows on 6432, pg_stat_replication shows a connected replica with low replay_lag, and post-upgrade every hot table has a recent last_analyze.

Enterprise scenario

A retail analytics platform team migrated their order-processing database off the retiring Single Server onto Flexible Server with zone-redundant HA, and the cutover went clean. Two weeks later, during a flash sale, the database fell over – not on CPU, which sat at 40%, but with FATAL: sorry, too many clients already. Their checkout service ran on Azure Functions on the Flex Consumption plan, and under load it scaled to hundreds of instances, each opening its own connection pool. Several thousand connections hit a max_connections of 200 and the engine refused new ones. The HA standby was perfectly healthy and completely irrelevant – failing over to an identically-configured node would have hit the same wall.

The fix was not a bigger SKU. They enabled the built-in PgBouncer in transaction mode, repointed the function app at port 6432, and let a default_pool_size of 50 multiplex thousands of short-lived client connections onto a few dozen real backends. The one code change that mattered: their PostgreSQL driver was using server-side prepared statements, which break in transaction pooling, so they disabled prepared statements at the driver. They encoded the pooler as the default path, not an afterthought:

# The fix that held: pool in transaction mode, app connects on 6432.
az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name pgbouncer.enabled --value true
az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name pgbouncer.pool_mode --value transaction

The lesson that generalizes: HA protects against a node or zone dying, not against your own application’s connection behavior. A serverless or high-fan-out front end in front of PostgreSQL is a connection-management problem first and a database-sizing problem a distant second. Put a transaction-mode pooler in the path before the launch, not after the incident review.

Checklist

AzurePostgreSQLDatabaseHigh AvailabilityNetworking

Comments

Keep Reading