Azure Database for PostgreSQL Flexible Server: Zone-Redundant HA, Read Replicas, PgBouncer, and In-Place Upgrades

Single Server is gone – it retired in March 2025 – and every PostgreSQL workload on Azure that still mattered had to land on Azure Database for PostgreSQL Flexible Server, the managed Postgres service that runs the engine on a dedicated Linux VM inside your own network topology. That migration is the right moment to fix the things teams got wrong the first time: HA that was never failed over, a connection storm that took down a perfectly healthy database, a read replica nobody had promoted, and a major-version upgrade everyone was too scared to run. Flexible Server gives you real building blocks for all four – zone-redundant HA, an in-engine PgBouncer, asynchronous read replicas, and in-place major version upgrades. None of them are safe until you have driven them in anger.

This is the operations guide. We treat HA, pooling, replicas, networking, and upgrades not as feature bullets but as production mechanisms each with a failure mode, a confirm command, and a fix. You will learn to choose the right SKU and storage tier so HA and replicas are even possible, wire the connection path through port 6432 so a serverless front end cannot exhaust the engine, place the server in a delegated subnet or behind a private endpoint and make its FQDN resolve to a private IP, and run a 14 -> 16 major upgrade with a rehearsed runbook that ends in ANALYZE rather than a “slow database” incident. Every operation gets both an az CLI snippet and a Bicep snippet, and because this is a reference you will return to mid-incident, the SKUs, parameters, networking models, replica semantics, error codes, and the symptom playbook are all laid out as scannable tables – read the prose once, then keep the tables open at 02:14.

By the end you will stop guessing. When the pager goes off you will know whether you face a Burstable SKU that can never have HA, a connection storm against max_connections, a private DNS zone that resolves to a public IP, a replica silently hours behind, a static parameter that needs a restart, or a major upgrade that dropped the optimizer statistics. Knowing which within ninety seconds is what separates a five-minute incident from a two-hour one.

What problem this solves

Flexible Server hides a lot of machinery – a managed Linux VM, a separate managed disk, synchronous replication to a standby, an embedded connection pooler – so you can run Postgres without operating a cluster. That abstraction is a gift until it breaks, then the failure modes are subtle: the database refuses connections while CPU sits at 40%, the standby is “healthy” and completely irrelevant, the FQDN resolves but to the wrong address, the replica lag metric is climbing and nobody is alerted, the upgrade “worked” but every query is slow.

What breaks without this knowledge: a team enables HA, never drills a failover, and discovers at the worst possible moment that their application takes four minutes to reconnect; a checkout service on Azure Functions scales to hundreds of instances and exhausts max_connections, and the on-call response is to scale up the SKU (which does nothing, because CPU was never the problem); a region-loss DR plan rests on a read replica nobody has ever promoted, and promotion turns out to be one-way; a major version upgrade runs in a panic with no validation pass and no ANALYZE, and the “engine regression” is really missing statistics.

Who hits this: every team running production PostgreSQL on Azure – migrated off Single Server, or greenfield on Flexible Server. It bites hardest on serverless or high-fan-out front ends (the connection-storm problem is near-universal), customer-facing OLTP that needs zone resilience without paying unacceptable commit latency, regulated workloads that must be private-only with correct DNS, reporting-heavy systems that want read replicas, and anyone running an old major version who has been deferring the upgrade out of fear. The fix is almost never “a bigger SKU” – it’s choosing the right tier, putting a transaction pooler in the path, drilling the failover and the promotion, and rehearsing the upgrade on a PITR clone.

To frame the whole field before the deep dive, here is every production concern this article covers, the question it forces, and the one place to look first:

Concern	The question it forces	Where to look first	Most common single mistake
High availability	Zone-redundant or same-zone – and have you drilled it?	`az postgres flexible-server show` -> `highAvailability.state`	Never failing over; assuming HA covers connection storms
Connection management	Are you pooling, or opening thousands of backends?	`SHOW POOLS;` on port 6432	App connects on 5432; prepared statements break transaction pooling
Networking	Private access or private endpoint – and does DNS resolve private?	`nslookup <fqdn>` returns a private IP?	DNS resolves to the public name; model can’t be changed later
Read replicas / DR	Is the replica current, and have you promoted one?	`read_replica_lag_in_seconds`; `pg_stat_replication`	No lag alert; discovering promotion is one-way at 3am
Parameters & extensions	Dynamic or static – did you restart? Is it preloaded?	`az postgres flexible-server parameter show`	Setting a static param and expecting it live; skipping preload
Storage & IOPS	Auto-grow on (v1) or IOPS provisioned (v2)?	`az postgres flexible-server show` -> storage block	Out-of-space outage; over-provisioning a disk to buy IOPS
Major upgrade	Validated, rehearsed, and did you `ANALYZE`?	`--perform-validation-only` output	Skipping `ANALYZE`; not rehearsing on a clone
Backup & restore	Geo enabled at creation, and have you restored?	`az postgres flexible-server backup list`	Geo not enabled (can’t add later); never test-restoring

Learning objectives

By the end of this article you can:

Choose the right tier and SKU (Burstable vs General Purpose vs Memory Optimized) so that HA and read replicas are even possible, and explain why Burstable disqualifies both.
Configure zone-redundant or same-zone HA, explain the synchronous-commit latency trade-off, and drill a forced failover to measure real application reconnect time.
Decide between VNet integration (private access) and Private Link (private endpoint) at creation time – knowing the choice is irreversible – and make the server FQDN resolve to a private IP.
Stand up the built-in PgBouncer in transaction mode on port 6432, size the pool, and avoid the session-state bugs (prepared statements, LISTEN/NOTIFY, temp tables) that pooling breaks.
Create asynchronous read replicas in the same or another region, alert on read_replica_lag_in_seconds, and promote one correctly for DR – understanding that forced promotion is one-way.
Manage server parameters (dynamic vs static), allowlist extensions via azure.extensions and shared_preload_libraries, and tune storage/IOPS on Premium SSD v1 versus v2.
Run a tested in-place major version upgrade with a backup, a validation pass, a PITR-clone rehearsal, and ANALYZE as the final step.
Configure backups, PITR, and geo-redundancy, and run a restore drill that yields your real RTO rather than the one on the SLA slide.

Prerequisites & where this fits

You should already understand core relational concepts (transactions, connections, replication, VACUUM/ANALYZE) and basic PostgreSQL administration. You should be able to run az in Cloud Shell, read JSON output, and connect with psql. Familiarity with Azure networking primitives – VNets, subnets, NSGs, private DNS zones, and private endpoints – is assumed; if those are shaky, read Azure Virtual Network: Subnets, NSGs & Peering and Azure Private Link & Private Endpoints for PaaS first, because the single most common Flexible Server outage is a DNS misconfiguration.

This sits in the Data platform track. It assumes the HA/DR mental model from High Availability vs Disaster Recovery: RTO & RPO and pairs with Azure SQL Database: Hyperscale, Elastic Pools & Ledger as the other managed-relational option. The identity that connects without passwords comes from Entra Managed Identities Deep Dive, the secrets from Azure Key Vault: Secrets, Keys & Certificates, and the lag/restart alerting from Azure Monitor & Application Insights for Observability. For multi-region failover beyond a single replica, see Azure Multi-Region Active-Active & Disaster Recovery.

A quick map of who owns what during an incident, so you call the right person fast:

Layer	What lives here	Who usually owns it	Failure classes it can cause
Application / driver	Connection pool, prepared statements	App / dev team	Connection storm; “prepared statement does not exist” in transaction pooling
PgBouncer (on the VM)	Transaction pooling on 6432	Platform + DBA	Pool exhaustion; session-state breakage
Postgres engine (primary)	Queries, `max_connections`, WAL	DBA	`too many clients`; bloat; bad plans post-upgrade
HA standby	Synchronous replica, hot standby	Platform (Azure)	Failover behaviour; commit latency from cross-zone sync
Read replica(s)	Async physical replicas, DR target	DBA + platform	Lag; promotion semantics
Storage (managed disk)	Premium SSD v1/v2, IOPS	Platform	Out-of-space; IOPS throttling
Networking / DNS	Delegated subnet or PE, private DNS	Network team	FQDN resolves public; “worked in dev” outage
Backup vault	Automated backups, geo, PITR	DBA	Can’t geo-restore (not enabled); untested RTO

Core concepts

Six mental models make every later decision obvious.

The engine runs on a VM in your topology. Single Server was a gateway-fronted PaaS where you never saw the compute. Flexible Server runs the Postgres engine on a dedicated Linux VM that lives in your topology – you choose the zone, you choose whether it sits in a delegated subnet, and storage is a separate managed disk underneath. That single architectural change is why HA, replicas, and PgBouncer all behave the way they do: the standby is a second VM, PgBouncer runs on the VM, and the network model is a property of your VNet, not a hidden gateway.

Compute and storage scale independently. Compute is a Burstable, General Purpose, or Memory Optimized SKU you resize with a restart. Storage is a Premium SSD (v1) or Premium SSD v2 disk where IOPS and throughput are either a function of disk size (v1) or provisioned independently (v2). This separation is why you can buy IOPS without buying a 4 TB disk on v2, and why Burstable – a shared-core SKU – cannot carry HA or replicas.

HA is a hot standby, not a read replica. Flexible Server HA is a second VM running the same engine, kept in sync by synchronous streaming replication, with its own data on separate storage. You cannot read from the standby. Because replication is synchronous, every committed write is on both nodes before the client is acknowledged – so failover is zero data loss (RPO = 0). The cost is commit latency: it now includes a cross-zone round trip.

Read replicas are asynchronous and serve reads. A read replica is a separate physical replica, fed asynchronously, that serves read-only traffic and doubles as a DR target in another region. Asynchronous means non-zero RPO (replication lag), so a replica is a DR option, not an HA substitute. Promotion makes it an independent read-write server – and forced promotion is one-way.

Connections are expensive; pool them. PostgreSQL forks a backend process per connection, each consuming a few MB and a slot against max_connections. Serverless and high-fan-out front ends open thousands of short-lived connections and exhaust the engine long before CPU. The fix is a transaction-mode pooler; Flexible Server ships PgBouncer built in on port 6432. In transaction mode a connection returns to the pool at the end of every transaction, which is why session-scoped state (prepared statements, LISTEN/NOTIFY, temp tables) is unsafe.

The network model is chosen once, at creation. VNet integration (private access) injects the server into a delegated subnet; Private Link (private endpoint) projects it into your VNet via a NIC. They are mutually exclusive and cannot be converted later. Either way, name resolution must point the FQDN at the private IP via a private DNS zone, or clients fall back to the public name and fail when public access is off.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Term	One-line definition	Where it lives	Why it matters
Flexible Server	Postgres engine on a dedicated Linux VM	Your VNet (or PE)	The whole unit you operate
Tier / SKU	Burstable / General Purpose / Memory Optimized	Compute	Gates HA and replicas
Zone-redundant HA	Sync standby in a different zone	Same region, other AZ	Zone outage survival, RPO 0
Same-zone HA	Sync standby in the same zone	Same AZ, other fault domain	Node failure only; lower latency
Standby	The synchronous hot-standby VM	Paired with primary	Promoted on failover; not readable
Read replica	Async physical replica	Same/other region	Reads + DR; non-zero RPO
Promotion	Make a replica an independent server	Replica	DR action; `forced` is one-way
PgBouncer	Built-in transaction pooler	On the server VM, port 6432	Defeats connection storms
`max_connections`	Backend slot ceiling	Server parameter (static)	Exhaustion -> `too many clients`
Delegated subnet	Subnet delegated to the PG provider	Your VNet	VNet-integration home; nothing else lives here
Private endpoint	NIC projecting the server into a VNet	Your VNet	Hub-spoke-friendly private model
Private DNS zone	Resolves FQDN to private IP	Linked to VNet	Wrong -> “worked in dev” outage
`azure.extensions`	Allowlist for `CREATE EXTENSION`	Server parameter	Extensions off until listed
`shared_preload_libraries`	Libraries loaded at startup	Server parameter (static)	Needed for `pg_cron`, `pg_stat_statements`
PITR	Point-in-time restore to a new server	Backup feature	Clone for drills; recover from mistakes
Geo-redundant backup	Backups replicated to paired region	Enabled at creation only	Geo-restore on region loss

1. Flexible Server architecture versus retired Single Server, and choosing a SKU

The mental model is different and it matters. Single Server was a gateway-fronted PaaS where you never saw the compute. Flexible Server runs the Postgres engine on a dedicated Linux VM that lives in your topology – you choose the zone, you choose whether it sits in a delegated subnet, and storage is a separate managed disk underneath. That single architectural change is why HA, replicas, and PgBouncer all behave the way they do.

Storage and compute scale independently. Provision a baseline so you are not surprised at scale-out:

az postgres flexible-server create \
  --resource-group rg-data-eastus2 \
  --name pg-prod-eastus2 \
  --location eastus2 \
  --tier MemoryOptimized --sku-name Standard_E4ds_v5 \
  --version 16 \
  --storage-size 256 \
  --high-availability ZoneRedundant \
  --vnet vnet-data-eastus2 --subnet snet-postgres \
  --private-dns-zone pg-prod.private.postgres.database.azure.com \
  --admin-user pgadmin --admin-password "<from-key-vault>"

resource pg 'Microsoft.DBforPostgreSQL/flexibleServers@2024-08-01' = {
  name: 'pg-prod-eastus2'
  location: 'eastus2'
  sku: { name: 'Standard_E4ds_v5', tier: 'MemoryOptimized' }
  properties: {
    version: '16'
    storage: { storageSizeGB: 256, autoGrow: 'Enabled' }
    highAvailability: { mode: 'ZoneRedundant' }
    network: {
      delegatedSubnetResourceId: snet.id
      privateDnsZoneArmResourceId: dnsZone.id
    }
    administratorLogin: 'pgadmin'
    administratorLoginPassword: kvSecret  // never literal
  }
}

The first decision – before HA, before replicas – is the tier. Get it wrong and HA is simply unavailable. Here is the full tier comparison:

Tier	Series	Cores	Use case	HA supported	Read replicas	Notes
Burstable	B1ms–B20ms	Shared, 1–20 vCPU	Dev/test, tiny workloads	No	No	Banks CPU credits; cannot retrofit HA without a tier change
General Purpose	D2ds_v5–D96ds_v5	2–96 vCPU, 4 GB/vCPU	Most OLTP	Yes	Yes	The production default for balanced workloads
Memory Optimized	E2ds_v5–E96ds_v5	2–96 vCPU, 8 GB/vCPU	Memory-hungry, large caches	Yes	Yes	More RAM per core; for big working sets

The SKU resize and storage choices, option by option:

Operation	How	Requires restart	Online?	Gotcha
Scale compute up/down	`az ... update --sku-name`	Yes	Brief downtime	Plan a window; HA shortens it (failover absorbs)
Scale tier (e.g. B -> D)	`az ... update --tier --sku-name`	Yes	Brief downtime	Required before you can enable HA on a Burstable
Grow storage (v1)	`az ... update --storage-size`	No	Yes (online)	One-way – you cannot shrink a disk
Enable storage auto-grow (v1)	`--storage-auto-grow Enabled`	No	Yes	Prevents out-of-space; bumps disk near full
Switch storage type	At creation only	n/a	n/a	v1 <-> v2 is not an in-place change
Provision IOPS (v2)	`--iops` / `--throughput`	No	Yes	Decouples performance from disk size

Burstable (B-series) SKUs do not support high availability and are not eligible for read replicas in production-grade configurations. If a workload needs HA or replicas, start on General Purpose (Ds_v5) or Memory Optimized (Es_v5). You cannot retrofit HA onto a Burstable instance without a tier change – which is itself a sized, restart-bearing operation.

A short capability grid, because the tier choice cascades into everything downstream:

Capability	Burstable	General Purpose	Memory Optimized
Zone-redundant / same-zone HA	No	Yes	Yes
Read replicas	No	Yes	Yes
Built-in PgBouncer	Yes	Yes	Yes
Premium SSD v2	Yes	Yes	Yes
Geo-redundant backup	Yes	Yes	Yes
In-place major upgrade	Yes	Yes	Yes
RAM per vCPU	~2–4 GB	4 GB	8 GB

2. Zone-redundant and same-zone HA failover behavior

Flexible Server HA is a hot standby: a second VM running the same engine, kept in sync by synchronous streaming replication, with its own copy of the data on separate storage. This is not a read replica – you cannot read from the standby, and you do not pay for it as a separate server; you pay for the doubled compute and storage. There are two flavors:

Zone-redundant HA places primary and standby in different availability zones in the same region. It survives a zone outage. This is the default you want for anything customer-facing.
Same-zone HA places the standby in the same zone (different fault domain). It protects against a node failure but not a zone outage. Use it only where the region has no zones, or where cross-zone latency on synchronous commit is unacceptable for the workload.

Because replication is synchronous, every committed write is on both nodes before the client gets its acknowledgement – so a failover is zero data loss (RPO = 0). Failover is automatic on primary failure and typically completes in tens of seconds; the client sees a dropped connection and the same FQDN now points at the promoted standby.

The two HA modes side by side, with everything that differs:

Attribute	Zone-redundant HA	Same-zone HA	No HA (single)
Standby placement	Different availability zone	Same zone, other fault domain	None
Survives a node failure	Yes	Yes	No
Survives a zone outage	Yes	No	No
Replication	Synchronous	Synchronous	n/a
RPO	0	0	up to last backup/WAL
Commit latency impact	Cross-zone round trip	Intra-zone (lower)	None
Cost	~2x compute + storage	~2x compute + storage	1x
Requires zones in region	Yes	No	No
Typical RTO (failover)	Tens of seconds	Tens of seconds	Restart/restore time

You can and should rehearse it. A planned (forced) failover is a single command and is the only honest way to measure your application’s reconnect behavior:

# Drill the failover. Measure how long your app takes to recover.
az postgres flexible-server restart \
  --resource-group rg-data-eastus2 \
  --name pg-prod-eastus2 \
  --failover Forced

The failover types and when each fires:

Failover type	Trigger	Data loss	Use when
Automatic (unplanned)	Primary node/zone failure detected	0 (synchronous)	The platform does this for you
Forced (planned drill)	`--failover Forced`	0	Rehearsing reconnect; testing client retry
Planned (maintenance)	Patch/maintenance window	0	Azure-initiated; standby takes over first

Synchronous HA has a cost: commit latency now includes the cross-zone round trip. For chatty, small-transaction OLTP this is real. Measure commit latency before and after enabling zone-redundant HA; if it hurts, same-zone HA trades zone resilience for lower latency, and a cross-region read replica (Section 5) can carry the DR requirement instead.

Enable or change HA on an existing server without recreating it:

az postgres flexible-server update \
  --resource-group rg-data-eastus2 \
  --name pg-prod-eastus2 \
  --high-availability ZoneRedundant \
  --standby-zone 3

properties: {
  highAvailability: {
    mode: 'ZoneRedundant'
    standbyAvailabilityZone: '3'
  }
}

What HA does and does not protect against – the distinction that the enterprise scenario below is built on:

Failure	HA protects?	What actually saves you
Standby/primary node crash	Yes	Automatic failover
Single availability-zone outage	Zone-redundant: yes	Zone-redundant HA
Whole-region outage	No	Cross-region read replica + promotion; geo-restore
Connection storm (`too many clients`)	No	PgBouncer in transaction mode
Accidental `DELETE` / bad migration	No	PITR to a new server
Storage full	No	Storage auto-grow (v1) / right-sized IOPS (v2)
Bad query plans after upgrade	No	`ANALYZE` post-upgrade

3. VNet integration versus private endpoint connectivity

Flexible Server offers two mutually exclusive networking models, and you choose at creation time – you cannot convert one to the other later. Get this decision right up front.

VNet integration (private access) injects the server into a delegated subnet. The subnet must be delegated to Microsoft.DBforPostgreSQL/flexibleServers and used by nothing else. The server has no public endpoint at all; it is reachable only from peered or connected networks, and name resolution flows through a linked private DNS zone.

# The delegated subnet -- nothing else may live here.
az network vnet subnet create \
  --resource-group rg-net-eastus2 \
  --vnet-name vnet-data-eastus2 \
  --name snet-postgres \
  --address-prefixes 10.40.2.0/27 \
  --delegations Microsoft.DBforPostgreSQL/flexibleServers

Private Link (private endpoint) keeps the server logically separate and projects it into your VNet through a NIC with a private IP. This is the model that composes cleanly with hub-and-spoke, centralized private DNS, and resources that already standardize on private endpoints. Public network access can be disabled entirely so the only path in is the endpoint.

The three networking models, every property that differs:

Property	Public access	VNet integration (private access)	Private endpoint (Private Link)
Public IP	Yes (firewall-gated)	None	None (PE only)
Where the server lives	Azure-managed	Your delegated subnet	NIC in your subnet
Subnet requirement	n/a	Delegated, dedicated	Any subnet with PE
Private DNS zone	n/a	`<name>.private.postgres...`	`privatelink.postgres.database.azure.com`
Hub-spoke friendly	n/a	Workable	Best fit
Reach from on-prem	Over public + firewall	Via peering/VPN/ER	Via peering/VPN/ER
Change to another model later	No	No	No
Disable public access	`--public-access Disabled`	Already none	Already none

The practical decision rule:

Need	Choose
Simplest private model, dedicated subnet acceptable	VNet integration
Hub-and-spoke with centralized private DNS / endpoint policy	Private endpoint
Connect from many spokes / on-prem over a standard PE pattern	Private endpoint
Public access acceptable behind a strict firewall (dev)	Public access
Convert later between any of the above	Not possible – pick deliberately

The firewall and access controls when public access is in play:

Control	What it does	CLI
Allow an IP range	Opens a CIDR to the public endpoint	`az postgres flexible-server firewall-rule create --start-ip-address --end-ip-address`
Allow Azure services	Lets Azure-internal traffic in	A `0.0.0.0` rule (use sparingly)
Disable public access	Closes the public endpoint entirely	`az postgres flexible-server update --public-access Disabled`
Require SSL	Enforces TLS on connections	`require_secure_transport` parameter (on by default)
Minimum TLS version	Floors the TLS version	`ssl_min_protocol_version` parameter

Either way, your private DNS zone (privatelink.postgres.database.azure.com for PE, or the linked zone for VNet integration) must resolve the server FQDN to the private IP, or clients fall back to the public name and fail when public access is off. This is the single most common “it worked in dev” outage. If your DNS is centralized in a hub, route Flexible Server through it the same way you do every other private endpoint – see Private Endpoints & Private DNS at Scale.

4. Built-in PgBouncer and connection management

PostgreSQL forks a backend process per connection. Every idle connection from an over-eager application pool consumes a few MB of server RAM and a slot against max_connections. Serverless front ends and high-fan-out microservices routinely open thousands of short-lived connections and exhaust the server long before they exhaust CPU. The fix is a transaction-mode pooler, and Flexible Server ships PgBouncer built in – no sidecar, no separate VM to patch.

Turn it on with server parameters and connect through port 6432 instead of 5432:

az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name pgbouncer.enabled --value true

az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name pgbouncer.default_pool_size --value 50

az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name pgbouncer.pool_mode --value transaction

The connection string just changes the port:

psql "host=pg-prod-eastus2.postgres.database.azure.com port=6432 \
  dbname=appdb user=app_svc sslmode=require"

The PgBouncer parameters you actually tune, end to end:

Parameter	What it controls	Default	When to change	Gotcha
`pgbouncer.enabled`	Turns the pooler on	`false`	Always, for serverless/high-fan-out	App must connect on 6432
`pgbouncer.pool_mode`	`session` / `transaction` / `statement`	`transaction`	Keep transaction for pooling wins	`session` defeats the purpose
`pgbouncer.default_pool_size`	Server-side conns per user/db pair	50	Size to backends you can afford	Too high re-creates the storm
`pgbouncer.min_pool_size`	Warm connections kept open	0	Raise to cut cold-connect latency	Holds backends even when idle
`pgbouncer.max_client_conn`	Client conns PgBouncer accepts	high	Cap to protect memory	Hitting it rejects clients
`pgbouncer.server_idle_timeout`	Close idle server conns after N s	600	Lower to free backends faster	Too low churns connections
`pgbouncer.query_wait_timeout`	Max wait for a pooled conn	120	Lower to fail fast under saturation	Too low errors legit queries
`pgbouncer.stats_users`	Users allowed to query `pgbouncer` admin db	—	Add a monitoring user	Needed for `SHOW POOLS`

The three pool modes, and what each one breaks:

Pool mode	Connection returned to pool…	Pooling benefit	Safe for session state	Use when
`session`	When the client disconnects	Minimal	Yes (all features)	Legacy apps needing full session semantics
`transaction`	At the end of each transaction	High	No (no cross-txn state)	Serverless / high-fan-out (the default)
`statement`	After each statement	Highest	No (no multi-statement txns)	Rare; autocommit-only workloads

Two engineering caveats that cause real bugs in transaction mode, because a connection is handed back to the pool at the end of every transaction:

Session-scoped state does not survive. Prepared statements, SET outside a transaction, advisory session locks, LISTEN/NOTIFY, and temp tables are unsafe. Many drivers default to server-side prepared statements – disable them (e.g. JDBC prepareThreshold=0, or libpq-level handling) or you will get cryptic “prepared statement does not exist” errors.
PgBouncer survives failover with the FQDN. Because it runs on the server VM, the same hostname:6432 works after an HA failover. Good. But the pool is local to the primary – it is not a separate tier you scale independently.

The session features transaction pooling breaks, and the driver-level fix:

Feature	Why it breaks in transaction mode	Driver / app fix
Server-side prepared statements	Statement prepared on a connection you won’t get back	JDBC `prepareThreshold=0`; Npgsql `Max Auto Prepare=0`; PgBouncer `max_prepared_statements>0` (newer)
`SET` / `SET LOCAL` outside a txn	GUC change lost when connection is recycled	Use `SET LOCAL` inside the transaction
`LISTEN` / `NOTIFY`	Listener bound to a connection you lose	Use a dedicated session-mode connection or a different mechanism
Advisory session locks	Lock tied to the session, not the txn	Use transaction-level advisory locks
Temp tables across statements	Temp table dropped when connection recycles	Keep temp-table usage within one transaction
`WITH HOLD` cursors	Cursor needs the session after commit	Avoid; materialize results

One more subtlety on a server with HA: PgBouncer runs on the primary. After a failover the promoted standby starts its own PgBouncer, so the parameter values must already be set (they are server-wide). Do not bake PgBouncer config into application start-up assuming it is external – it is part of the server.

Confirm pooling is live – connect to the special pgbouncer database on 6432 and read the pools:

psql "host=pg-prod-eastus2.postgres.database.azure.com port=6432 \
  dbname=pgbouncer user=app_svc sslmode=require" -c "SHOW POOLS;"

5. Read replicas, promotion, and cross-region DR

Read replicas are asynchronous physical replicas. They serve read-only traffic and double as a disaster-recovery target in another region. Asynchronous means non-zero RPO – replication lag – so a replica is a DR option, not an HA substitute. Create one in the same or a different region:

az postgres flexible-server replica create \
  --resource-group rg-data-westus2 \
  --replica-name pg-prod-westus2-ro \
  --source-server "/subscriptions/<sub>/resourceGroups/rg-data-eastus2/providers/Microsoft.DBforPostgreSQL/flexibleServers/pg-prod-eastus2" \
  --location westus2

HA standby versus read replica – the distinction people conflate, every property laid out:

Attribute	HA standby	Read replica
Replication	Synchronous	Asynchronous
RPO	0	Non-zero (lag)
Readable	No	Yes (read-only)
Primary purpose	Availability (zone/node)	Read scale-out + DR
Cross-region	No (same region)	Yes
Count	One standby	Multiple replicas
Billed as	Doubled compute/storage	A separate server
Promotion	Automatic on failover	Manual (DR action)
Survives region loss	No	Yes (in another region)

Point reporting and read-heavy queries at the replica’s own FQDN. Watch lag and alert on it – a replica silently falling hours behind is a DR target that will lose hours of data:

Signal	Where	Healthy	Alert when	What it means
`read_replica_lag_in_seconds`	Azure Monitor metric	Near 0–few s	> 60 s sustained	Replica falling behind primary
`pg_stat_replication.replay_lag`	Query on the primary	Low	Growing	WAL applied slower than produced
`write_lag` / `flush_lag`	`pg_stat_replication`	Low	Growing	Network/standby write pressure
`pg_stat_wal_receiver`	Query on the replica	`streaming`	Not streaming	Replication link interrupted
Storage on replica	Metric	Below cap	Near full	Replica can stall replication

Promotion is the DR action. Flexible Server gives you two promote semantics, and choosing the wrong one is a classic mistake:

Promote to independent server (standalone): the replica becomes a full read-write server and detaches from the primary. This is the real DR failover.
Planned switchover: under controlled conditions Azure can sync the last changes before promoting, minimizing data loss for a planned event.

# DR: promote the replica to a standalone read-write server.
az postgres flexible-server replica promote \
  --resource-group rg-data-westus2 \
  --name pg-prod-westus2-ro \
  --promote-mode standalone \
  --promote-option forced

The promotion matrix – mode, option, data-loss, and reversibility:

Promote mode	Promote option	Syncs last changes?	Data loss	Reversible?	Use for
`standalone`	`forced`	No	Up to current lag	No (relationship gone)	Real DR; primary/region is down
`standalone`	`planned`	Yes (waits to catch up)	Minimal/none	No	Planned cutover with primary reachable
`switchover`	`planned`	Yes	Minimal/none	Role swap	Planned region switch (where supported)

A replica is only a credible DR target if you have promoted one in a drill and repointed an app at it. The most expensive surprise is discovering at 3am that promotion is one-way: once you promote forced, that replica is now an independent primary and the original replication relationship is gone. Rehearse on a disposable copy first.

Constraints to plan around – replicas are not free of rules:

Constraint	Implication
Replica inherits the primary’s major version	You cannot upgrade a replica independently (Section 7)
Replica is read-only until promoted	No writes until DR/promotion
HA on a replica	Configure after promotion, not before
Compute can differ from primary	Size the replica for its read load, but watch lag
Promotion is per replica	Promote the right one; others still chase the old primary

6. Server parameters, extensions, and storage/IOPS tuning

Parameters. Flexible Server exposes Postgres GUCs as server parameters. Some apply dynamically; others are flagged static and require a restart. Treat them as code – set them through CLI/Bicep/Terraform, never click-ops:

# Static -> requires a restart. Set then restart in a maintenance window.
az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name max_connections --value 600

The parameters you will actually touch, with type and restart requirement:

Parameter	What it controls	Type	Restart?	Typical setting	Gotcha
`max_connections`	Backend slot ceiling	Static	Yes	Sized to RAM, lower if pooling	Pooling lets you keep this modest
`shared_buffers`	Engine cache (managed)	Static	Yes	Platform-tuned to SKU	Mostly leave to the platform
`work_mem`	Per-sort/hash memory	Dynamic	No	4–64 MB	Multiplies by concurrent ops -> OOM risk
`maintenance_work_mem`	`VACUUM`/index build memory	Dynamic	No	256 MB–1 GB	Speeds maintenance; per operation
`effective_cache_size`	Planner’s cache estimate	Dynamic	No	~70% of RAM	Wrong value -> bad plans
`azure.extensions`	Allowlist for `CREATE EXTENSION`	Dynamic	No	Your needed list	Extension load fails if not listed
`shared_preload_libraries`	Libraries loaded at startup	Static	Yes	`pg_stat_statements,pg_cron`	Required before those extensions work
`log_min_duration_statement`	Log slow queries over N ms	Dynamic	No	1000	Too low floods logs
`idle_in_transaction_session_timeout`	Kill idle-in-txn sessions	Dynamic	No	60000	Frees locks held by stuck txns
`require_secure_transport`	Enforce TLS	Dynamic	No	`on`	Off only for legacy clients

The dynamic-versus-static rule, made operational:

Parameter class	When it applies	What you must do	How to tell
Dynamic	Immediately on set	Nothing extra	`az ... parameter show` -> `isConfigPendingRestart: false`
Static	After a restart	Set, then restart in a window	`isConfigPendingRestart: true` after set

Extensions are not on until you allowlist them. Add the extension to azure.extensions (the server parameter that gates what CREATE EXTENSION is allowed to load), then create it in the database:

az postgres flexible-server parameter set \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --name azure.extensions --value "pg_stat_statements,pgcrypto,uuid-ossp"

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Common extensions and how to enable each – note which need a preload-and-restart:

Extension	Purpose	In `azure.extensions`?	Needs `shared_preload_libraries`?	Restart?
`pg_stat_statements`	Query performance stats	Yes	Yes	Yes
`pg_cron`	In-database cron jobs	Yes	Yes	Yes
`pgaudit`	Audit logging	Yes	Yes	Yes
`auto_explain`	Auto-log slow plans	Yes (config)	Yes	Yes
`pgcrypto`	Cryptographic functions	Yes	No	No
`uuid-ossp`	UUID generation	Yes	No	No
`postgis`	Geospatial types	Yes	No	No
`pgvector`	Vector similarity (AI)	Yes	No	No
`hypopg`	Hypothetical indexes	Yes	No	No

Some extensions – pg_cron, pg_stat_statements, pg_prewarm, auto_explain – also need to be in shared_preload_libraries, which is a static parameter. Set the preload list, restart once, then run CREATE EXTENSION. Skipping the preload step is why pg_cron “isn’t there” after enabling it.

Storage and IOPS. On Premium SSD (v1), IOPS scale with disk size and storage auto-grow bumps the disk when it nears full – enable it so you never take an out-of-space outage. On Premium SSD v2, you provision capacity, IOPS, and throughput independently, which decouples performance from size and avoids over-provisioning a 4 TB disk just to buy IOPS:

az postgres flexible-server update \
  --resource-group rg-data-eastus2 --server-name pg-prod-eastus2 \
  --storage-auto-grow Enabled

Premium SSD v1 versus v2, the properties that drive the decision:

Property	Premium SSD v1	Premium SSD v2
IOPS model	Scales with disk size	Provisioned independently
Throughput model	Tied to size	Provisioned independently
Auto-grow	Supported (enable it)	Manage capacity directly
Buy IOPS without size	No (must grow disk)	Yes
Best for	Predictable, size-correlated I/O	High-I/O on a small footprint
Resize	Grow only, online	Adjust capacity/IOPS/throughput online
Shrink	No	No

7. In-place major version upgrades and pre-upgrade validation

Flexible Server supports in-place major version upgrades – e.g. 14 -> 16 – without dump/restore or a new server. It is offline (a downtime window while pg_upgrade runs), one-way, and irreversible, so the order of operations is non-negotiable:

Take a known-good backup or on-demand snapshot first. PITR is your only way back.
Validate compatibility. Run the upgrade in validate-only mode so the platform performs the pre-upgrade checks (pg_upgrade --check) without committing.
Rehearse on a PITR-restored clone of production to time the window and catch extension/deprecation issues.
Execute in a window, then verify and re-ANALYZE.

# 1. On-demand backup before anything.
az postgres flexible-server backup create \
  --resource-group rg-data-eastus2 --name pg-prod-eastus2 \
  --backup-name pre-upgrade-v16

# 2. Dry run: validate compatibility WITHOUT upgrading.
az postgres flexible-server upgrade \
  --resource-group rg-data-eastus2 --name pg-prod-eastus2 \
  --version 16 --perform-validation-only

# 3. Execute (offline) once validation passes.
az postgres flexible-server upgrade \
  --resource-group rg-data-eastus2 --name pg-prod-eastus2 \
  --version 16

The upgrade runbook as a table – step, purpose, command/path, and the failure it prevents:

#	Step	Purpose	Command / path	Prevents
1	On-demand backup	A guaranteed restore point	`az ... backup create`	No way back from a bad upgrade
2	Validate-only	Catch incompatibilities pre-flight	`... upgrade --perform-validation-only`	Failed upgrade mid-window
3	PITR clone + rehearse	Time the window; catch extension issues	`az ... restore` then upgrade the clone	Surprise duration; extension breakage
4	Check extensions	Deprecated/incompatible ones removed	Review on the clone	Upgrade abort on bad extension
5	Quiesce traffic	Clean, predictable cutover	App maintenance mode	Mid-flight writes
6	Execute upgrade	The actual `pg_upgrade`	`az ... upgrade --version 16`	—
7	`ANALYZE` everything	Rebuild optimizer statistics	`vacuumdb --all --analyze-in-stages`	“Slow database” from missing stats
8	Smoke test + reopen	Confirm before traffic	App health checks	Reopening onto a broken state

Pre-upgrade gotchas, ranked by how often they bite:

Issue	Why it matters	How to confirm	Fix before upgrade
Missing `ANALYZE` after upgrade	`pg_upgrade` drops optimizer stats -> bad plans	`last_analyze` is NULL post-upgrade	Run `ANALYZE` as the final step
Deprecated/incompatible extension	Upgrade may refuse or break	Validate-only output; test on clone	Drop/replace the extension first
Replica present	Replica cannot be upgraded independently	You have a replica attached	Plan replicas around the upgrade
Long-running transactions	Block the offline switch	`pg_stat_activity`	Quiesce/kill before the window
Untested window length	Surprise downtime	Time it on the PITR clone	Rehearse; size the maintenance window
No fresh backup	No rollback path	`az ... backup list`	Take an on-demand backup first

pg_upgrade does not carry over optimizer statistics. The database is up but query plans are bad until you run ANALYZE (or vacuumdb --all --analyze-in-stages) post-upgrade. Teams that skip this report a “slow database after the upgrade” – it is missing stats, not the engine. Make ANALYZE the last step of the runbook, before you reopen traffic. Also note: a read replica cannot be major-upgraded independently of its primary – plan replicas around the upgrade.

8. Backup, PITR, geo-redundancy, and restore drills

Backups are automatic and free up to your storage size, with a configurable retention of 7 to 35 days. The choices that matter are redundancy and proving restore:

Locally/zone-redundant backup keeps copies in-region. Geo-redundant backup must be enabled at creation time and replicates backups to the paired region, enabling geo-restore if the whole region is lost. You cannot turn geo-redundancy on after the fact.
PITR restores to a new server at any second within the retention window. That is the muscle you exercise to clone production for upgrade rehearsals and to recover from a bad migration or accidental DELETE.

# Point-in-time restore to a NEW server (clone for drills, or recover).
az postgres flexible-server restore \
  --resource-group rg-data-eastus2 \
  --name pg-prod-eastus2-pitr \
  --source-server pg-prod-eastus2 \
  --restore-time "2026-06-08T02:15:00Z"

The backup-redundancy options, every property that differs:

Redundancy	Copies kept	Survives zone loss	Survives region loss	Set at creation?	Cost
Locally redundant (LRS)	In one zone	No	No	Yes (changeable within limits)	Lowest
Zone redundant (ZRS)	Across zones in region	Yes	No	Yes	Medium
Geo redundant	Replicated to paired region	Yes	Yes	Yes – only at creation	Highest

The restore types and when each applies:

Restore type	Restores to	Use for	Constraint
PITR (earliest)	New server, oldest point	Recover from long-undetected corruption	Within retention
PITR (custom time)	New server, any second in window	Bad migration / accidental `DELETE`; clone for drills	Within retention
Fast restore (latest)	New server, most recent	Quick clone of current state	Within retention
Geo-restore	New server in paired region	Region loss	Requires geo-redundant backup

A backup you have never restored is a hypothesis. Schedule a quarterly PITR drill: restore to a new server, connect, run a row-count and checksum sanity check, time it, then delete the clone. The number you get is your real RTO – not the one in the SLA deck.

Architecture at a glance

The diagram traces the data path as it actually flows in a production Flexible Server deployment, then maps each failure class onto the exact hop where it bites. Read it left to right. An application – often Azure Functions scaling to hundreds of instances – connects not on 5432 but on port 6432, the built-in PgBouncer that multiplexes thousands of short-lived client connections onto a few dozen real backends. PgBouncer runs on the primary VM, which sits in a delegated subnet (Microsoft.DBforPostgreSQL/flexibleServers, e.g. 10.40.2.0/27) and resolves through a private DNS zone to a private IP – get that DNS wrong and the client falls back to the public name and fails. The primary engine in availability zone 1 commits synchronously to a hot standby in zone 2: every write is on both nodes before the client is acknowledged, giving RPO 0 and a tens-of-seconds automatic failover that keeps the same FQDN.

Off to the right, an asynchronous read replica in a second region (westus2) serves reporting reads and stands ready as the DR target – promotion (standalone --forced) makes it an independent primary, but that is one-way. Underneath, geo-redundant backups (enabled only at creation) feed PITR and geo-restore. The numbered badges call out the five failures that actually page you: a connection storm hitting max_connections on the engine (1), a private DNS zone resolving to the public IP (2), commit latency from the cross-zone synchronous round trip (3), replica lag turning the DR target stale (4), and a major upgrade that dropped optimizer statistics so every plan is bad (5). The legend narrates each as symptom, how to confirm it, and the fix – the whole method in one picture: localise the symptom to a hop, confirm with the named command, apply the fix.

Real-world scenario

Aarav Retail – a fictional but realistic mid-market e-commerce platform – ran its order-processing database on the retiring Single Server and migrated to Flexible Server with zone-redundant HA in Central India. The cutover went clean: a Memory Optimized Standard_E4ds_v5, 256 GB Premium SSD v1, HA across zones 1 and 2, monthly spend about ₹46,000 for the HA pair. The platform team was five engineers. For two weeks everything was fine.

Then a flash sale. At 18:04 the checkout service started throwing FATAL: sorry, too many clients already. The on-call engineer’s reflex was to check CPU – which sat at 40%. The HA standby was perfectly healthy and completely irrelevant; failing over to an identically configured node would have hit the same wall in seconds. The checkout service ran on Azure Functions on the Flex Consumption plan, and under flash-sale load it had scaled to several hundred instances, each opening its own connection pool. Thousands of connections hit a max_connections of 200 and the engine refused new ones. The second reflex – scale the SKU up to E8ds_v5 – bought ninety seconds (more RAM raised the realistic connection ceiling slightly) and then the storm caught up. Revenue was visibly dropping; the incident bridge filled.

The breakthrough came from asking the right first question: is this CPU, or is this connections? It was connections. The fix was not a bigger SKU. They enabled the built-in PgBouncer in transaction mode, repointed the function app at port 6432, and let a default_pool_size of 50 multiplex thousands of short-lived client connections onto a few dozen real backends:

# The fix that held: pool in transaction mode, app connects on 6432.
az postgres flexible-server parameter set \
  --resource-group rg-data-centralindia --server-name pg-aarav-prod \
  --name pgbouncer.enabled --value true
az postgres flexible-server parameter set \
  --resource-group rg-data-centralindia --server-name pg-aarav-prod \
  --name pgbouncer.pool_mode --value transaction
az postgres flexible-server parameter set \
  --resource-group rg-data-centralindia --server-name pg-aarav-prod \
  --name pgbouncer.default_pool_size --value 50

The one code change that mattered: their Npgsql driver was using server-side prepared statements, which break in transaction pooling, so they set Max Auto Prepare=0 at the driver. After repointing to 6432 the too many clients errors stopped immediately – SHOW POOLS; on 6432 showed a few dozen active server connections serving thousands of clients.

The incident review produced two more changes. First, a cross-region read replica in southindia with an alert on read_replica_lag_in_seconds > 60 – because the HA standby protected against a zone outage but not a region one, and the business had a regional-DR requirement nobody had implemented. They drilled a standalone --forced promotion on a throwaway replica to confirm it was one-way and to time the repoint. Second, they scaled the production SKU back down to E4ds_v5 (the storm, not the size, had been the problem) and lowered max_connections to 300 now that PgBouncer fronted the engine, landing at ₹44,000/month – lower than during the incident.

The next flash sale ran at the same peak load with zero connection errors, checkout p95 held at 180 ms, and the standby and replica sat ready and unused, exactly as intended. The lesson on the wall: “HA protects against a node or zone dying, not against your own application’s connection behavior. A serverless front end in front of PostgreSQL is a connection-management problem first and a database-sizing problem a distant second. Put a transaction-mode pooler in the path before the launch, not after the incident review.”

The incident as a timeline, because the order of moves is the lesson:

Time	Symptom	Action taken	Effect	What it should have been
18:04	`too many clients`, climbing	(alert fires)	—	Ask: CPU or connections?
18:07	Errors at 20%	Check CPU (40%)	Rules out compute	Correct diagnostic move
18:11	Errors at 35%	Scale up E4 -> E8	+90 s relief, then recurs	Don’t scale up to mask
18:22	Still failing	Realise it’s `max_connections`	Root cause found	This was the breakthrough
18:30	Mitigated	Enable PgBouncer transaction; app -> 6432	Errors clear	Correct night-of fix
18:34	Stable	`Max Auto Prepare=0` on Npgsql	Prepared-stmt errors gone	The required code change
+1 week	Hardened	Cross-region replica + lag alert; scale back to E4	DR exists; cost down	The actual long-term fix

Advantages and disadvantages

The managed-VM-with-built-in-mechanisms model both enables production-grade Postgres and hides sharp edges. Weigh it honestly:

Advantages (why this model helps you)	Disadvantages (why it bites)
HA, replicas, PgBouncer, PITR, and upgrades are all built in – no sidecars or self-managed clusters	The mechanisms have subtle rules (one-way promotion, irreversible network model) you must know cold
Zone-redundant HA gives RPO 0 with automatic failover and a stable FQDN	Synchronous commit adds cross-zone latency that real OLTP feels
Built-in PgBouncer defeats connection storms without a separate tier to patch	Transaction pooling silently breaks prepared statements, `LISTEN/NOTIFY`, temp tables
Cross-region read replicas give read scale-out and a DR target	Async replicas have non-zero RPO; promotion is irreversible and easy to misuse
In-place major upgrades avoid dump/restore	Upgrade is offline, one-way, and drops optimizer stats -> “slow DB” if you skip `ANALYZE`
Compute and storage scale independently (esp. SSD v2)	Storage grows only (never shrinks); Burstable can’t do HA/replicas at all
Geo-redundant backup + PITR make recovery and cloning routine	Geo-redundancy is creation-time only; a never-tested backup is a hypothesis
Networking lives in your VNet (delegated subnet or PE)	The model is irreversible, and a wrong private DNS zone is the most common outage

The model is right for almost every production PostgreSQL workload on Azure where you want managed HA/DR/pooling without operating a cluster. It bites hardest on serverless/high-fan-out front ends that skip pooling, latency-sensitive OLTP that hasn’t measured the synchronous-commit cost, DR plans resting on an un-drilled replica, and teams that run the major upgrade in a panic. Every disadvantage is manageable – but only if you know it exists before the incident, which is the point of this article.

Hands-on lab

Stand up a small Flexible Server, prove PgBouncer fixes a connection storm, drill an HA failover, and tear it down – all on modest SKUs (delete at the end). Run in Cloud Shell (Bash).

Step 1 – Variables and resource group.

RG=rg-pgflex-lab
LOC=centralindia
PG=pgflex-lab-$RANDOM   # globally-unique
ADMIN=pgadmin
PWD=$(openssl rand -base64 18)   # ephemeral; not for prod
az group create -n $RG -l $LOC -o table

Step 2 – Create a General Purpose server with zone-redundant HA. (HA needs GP/MO – Burstable cannot.)

az postgres flexible-server create -g $RG -n $PG -l $LOC \
  --tier GeneralPurpose --sku-name Standard_D2ds_v5 --version 16 \
  --storage-size 32 --high-availability ZoneRedundant \
  --admin-user $ADMIN --admin-password "$PWD" \
  --public-access 0.0.0.0 -o table   # lab only: open then we'll add our IP

Expected: a server resource; highAvailability.state will reach Healthy after a few minutes.

Step 3 – Confirm HA placement (primary and standby in different zones).

az postgres flexible-server show -g $RG -n $PG \
  --query "{ha:highAvailability.state, mode:highAvailability.mode, \
            primaryZone:availabilityZone, standbyZone:highAvailability.standbyAvailabilityZone}" -o jsonc

Expected: state: Healthy, mode: ZoneRedundant, and two different zone numbers.

Step 4 – Enable PgBouncer in transaction mode.

az postgres flexible-server parameter set -g $RG -s $PG --name pgbouncer.enabled --value true
az postgres flexible-server parameter set -g $RG -s $PG --name pgbouncer.pool_mode --value transaction
az postgres flexible-server parameter set -g $RG -s $PG --name pgbouncer.default_pool_size --value 25

Step 5 – Add your client IP and connect through 6432.

MYIP=$(curl -s ifconfig.me)
az postgres flexible-server firewall-rule create -g $RG -n $PG \
  --rule-name myip --start-ip-address $MYIP --end-ip-address $MYIP
FQDN=$(az postgres flexible-server show -g $RG -n $PG --query fullyQualifiedDomainName -o tsv)

# Pooler is on 6432; the special 'pgbouncer' db exposes pool stats
psql "host=$FQDN port=6432 dbname=pgbouncer user=$ADMIN password=$PWD sslmode=require" -c "SHOW POOLS;"

Expected: SHOW POOLS; returns rows – proof the pooler is live on 6432.

Step 6 – Drill a forced failover and time reconnect.

time az postgres flexible-server restart -g $RG -n $PG --failover Forced
# Then reconnect on 5432 and confirm the server is serving again:
psql "host=$FQDN port=5432 dbname=postgres user=$ADMIN password=$PWD sslmode=require" -c "SELECT now();"

Expected: the failover completes in tens of seconds; the same FQDN now points at the promoted standby and SELECT now() succeeds.

Validation checklist. You created a server where HA is possible (GP, not Burstable), confirmed the standby is in a different zone, turned on transaction pooling and proved it with SHOW POOLS; on 6432, and drilled a forced failover with a measured reconnect. That measured number – not the SLA – is your real failover RTO. The lab steps mapped to what each proves:

Step	What you did	What it proves	Real-world analogue
2	Create GP server with HA	HA requires GP/MO, not Burstable	Tier choice gates everything
3	Check zone placement	Standby is genuinely cross-zone	Verifying zone-redundant HA
4–5	PgBouncer + `SHOW POOLS;`	Pooling is live on 6432	The connection-storm fix
6	Forced failover, timed	Real reconnect time, not assumed	The failover drill you owe production

Cleanup (avoid lingering charges – an HA pair on D2ds is not free).

az group delete -n $RG --yes --no-wait

Cost note. A Standard_D2ds_v5 HA pair with 32 GB runs roughly ₹120–160/hour; an hour of this lab is well under ₹200, and deleting the resource group stops everything. There is no free tier for Flexible Server, but D2ds is among the cheapest SKUs on which HA is supported.

Common mistakes & troubleshooting

This is the playbook – the part you bookmark. First as a scannable table you read at 02:14, then the same entries with full confirm-command detail underneath.

#	Symptom	Root cause	Confirm (exact cmd / portal path)	Fix
1	`FATAL: too many clients already`, CPU low	Connection storm vs `max_connections`; no pooling	`SELECT count(*) FROM pg_stat_activity;` vs `SHOW max_connections;`	Enable PgBouncer transaction mode; app -> 6432
2	“prepared statement does not exist” after pooling	Server-side prepared statements in transaction pooling	App logs; driver config	`prepareThreshold=0` (JDBC) / `Max Auto Prepare=0` (Npgsql)
3	Connects in dev, fails after going private	Private DNS zone resolves FQDN to public IP	`nslookup <fqdn>` returns a public address	Link private DNS zone to the VNet; A record -> private IP
4	Cannot enable HA on the server	Server is on a Burstable SKU	`az ... show --query sku` shows `Burstable`	Scale tier to GP/MO, then enable HA
5	Commit latency jumped after enabling HA	Cross-zone synchronous commit round trip	Compare commit latency pre/post; `pg_stat_statements`	Same-zone HA, or accept and offload reads to a replica
6	DR replica is hours behind	No lag alert; replica under-provisioned or link broken	`read_replica_lag_in_seconds`; `pg_stat_replication.replay_lag`	Alert on lag; size replica; check `pg_stat_wal_receiver`
7	Promoted the wrong way; can’t go back	`forced` promotion is one-way	You ran `--promote-option forced`	Rebuild replication from the new primary; rehearse next time
8	`pg_cron`/`pg_stat_statements` “not there”	Not in `shared_preload_libraries` (static)	`SHOW shared_preload_libraries;`	Add to preload, restart, then `CREATE EXTENSION`
9	`CREATE EXTENSION` fails (not allow-listed)	Extension missing from `azure.extensions`	`az ... parameter show --name azure.extensions`	Add it to the allowlist, then create
10	Static parameter change “did nothing”	Static param needs a restart	`az ... parameter show -> isConfigPendingRestart`	Restart in a window
11	Out-of-space outage on Premium SSD v1	Auto-grow off; disk filled	Storage % metric; `--storage-auto-grow` state	Enable auto-grow (v1) or right-size IOPS (v2)
12	Slow database right after a major upgrade	`pg_upgrade` dropped optimizer statistics	`last_analyze` NULL in `pg_stat_user_tables`	`vacuumdb --all --analyze-in-stages`
13	Cannot geo-restore after region loss	Geo-redundant backup not enabled at creation	Backup redundancy in server properties	Recreate with geo-redundant backup; you can’t add it later
14	App holds locks; others block	Idle-in-transaction sessions	`pg_stat_activity` `state='idle in transaction'`	Set `idle_in_transaction_session_timeout`
15	Connections rejected at the pooler	`pgbouncer.max_client_conn` reached	`SHOW POOLS;` / `SHOW CLIENTS;`	Raise pool sizing carefully; cap client fan-out

The expanded form, with full reasoning for the entries that bite hardest:

1. FATAL: sorry, too many clients already while CPU sits low. Root cause: a connection storm – a serverless or high-fan-out front end opening thousands of short-lived backends – exhausts max_connections long before CPU. Confirm: SELECT count(*) FROM pg_stat_activity; approaches SHOW max_connections;; CPU metric is modest. Fix: enable PgBouncer (pgbouncer.enabled=true, pool_mode=transaction) and repoint the app to 6432. Do not just raise the SKU.

2. “prepared statement does not exist” right after enabling transaction pooling. Root cause: the driver uses server-side prepared statements, which are bound to a connection that transaction pooling hands back after each transaction. Confirm: errors appear only after switching to 6432/transaction mode; driver defaults to prepared statements. Fix: disable them – JDBC prepareThreshold=0, Npgsql Max Auto Prepare=0 – or use a PgBouncer build/setting that supports prepared statements in transaction mode.

3. Works in dev, fails the moment you disable public access / go private. Root cause: the private DNS zone isn’t resolving the FQDN to the private IP, so clients fall back to the public name and fail when public access is off. Confirm: nslookup <fqdn> returns a public address (or the wrong IP) from inside the VNet. Fix: link the private DNS zone (*.private.postgres... for VNet integration, privatelink.postgres.database.azure.com for PE) to the client VNet and ensure the A record points at the private IP. This is the single most common Flexible Server outage.

4. You cannot enable HA at all. Root cause: the server is on a Burstable SKU, which does not support HA or replicas. Confirm: az postgres flexible-server show --query sku shows tier Burstable. Fix: scale the tier to General Purpose or Memory Optimized (a sized, restart-bearing change), then enable HA.

5. Commit latency jumped after turning on zone-redundant HA. Root cause: synchronous replication now includes a cross-zone round trip on every commit. Confirm: compare commit latency (or pg_stat_statements for write-heavy statements) before and after; it tracks cross-zone RTT. Fix: if unacceptable, switch to same-zone HA (lower latency, no zone-outage protection) and carry the DR requirement with a cross-region read replica instead.

6. The DR replica is hours behind. Root cause: no lag alert, an under-provisioned replica, or a broken replication link. Confirm: read_replica_lag_in_seconds is high; pg_stat_replication.replay_lag (on the primary) is growing; pg_stat_wal_receiver (on the replica) isn’t streaming. Fix: alert on lag (> 60 s), size the replica for its read load, and investigate the link. A silent replica is a DR target that loses data.

8. pg_cron or pg_stat_statements “isn’t there” after you enabled it. Root cause: the extension needs to be in shared_preload_libraries, a static parameter, and you skipped the restart. Confirm: SHOW shared_preload_libraries; doesn’t list it. Fix: add it to the preload list, restart once, then run CREATE EXTENSION.

12. The database is slow immediately after a major upgrade. Root cause: pg_upgrade does not carry over optimizer statistics, so plans are bad until you rebuild them. Confirm: SELECT relname, last_analyze FROM pg_stat_user_tables ORDER BY last_analyze NULLS FIRST; shows NULLs. Fix: run vacuumdb --all --analyze-in-stages (or ANALYZE) as the last runbook step before reopening traffic.

13. You cannot geo-restore after a region outage. Root cause: geo-redundant backup must be enabled at creation and wasn’t. Confirm: the server’s backup redundancy isn’t geo-redundant. Fix: there is no retrofit – recreate the server with geo-redundant backup (and migrate). Decide this up front for anything with a region-loss requirement.

Verify

Prove each layer independently, from the outside in:

# 1. HA is healthy and the standby is in the expected (different) zone.
az postgres flexible-server show \
  --resource-group rg-data-eastus2 --name pg-prod-eastus2 \
  --query "{ha:highAvailability.state, mode:highAvailability.mode, \
            primaryZone:availabilityZone, standbyZone:highAvailability.standbyAvailabilityZone}"

# 2. Private DNS resolves the FQDN to a PRIVATE IP (not a public address).
nslookup pg-prod-eastus2.postgres.database.azure.com

# 3. PgBouncer is actually listening on 6432 and pooling.
psql "host=pg-prod-eastus2.postgres.database.azure.com port=6432 \
  dbname=pgbouncer user=app_svc sslmode=require" -c "SHOW POOLS;"

-- 4. Replica lag is bounded (run on the PRIMARY).
SELECT client_addr, state,
       write_lag, flush_lag, replay_lag
FROM   pg_stat_replication;

-- 5. After a major upgrade, confirm version and that stats exist.
SHOW server_version;
SELECT relname, last_analyze
FROM   pg_stat_user_tables
ORDER  BY last_analyze NULLS FIRST
LIMIT  10;  -- NULLs here mean ANALYZE has not run yet

Expected: HA state is Healthy with primary and standby in different zones, the FQDN resolves to a private IP, SHOW POOLS returns rows on 6432, pg_stat_replication shows a connected replica with low replay_lag, and post-upgrade every hot table has a recent last_analyze.

Best practices

Pick the tier before anything else. HA and replicas require General Purpose or Memory Optimized – never Burstable for production. Choosing Burstable first forces a sized tier-change later.
Put a transaction-mode pooler in the path before launch. Enable PgBouncer, set pool_mode=transaction, connect on 6432, and disable server-side prepared statements at the driver. This single discipline prevents the most common production outage.
Choose the network model deliberately at creation. VNet integration vs private endpoint cannot be changed later. Make the private DNS zone resolve the FQDN to the private IP, and disable public access where policy requires.
Default to zone-redundant HA for customer-facing workloads. Use same-zone only where the region lacks zones or synchronous-commit latency is unacceptable – and measure that latency before and after.
Drill the failover. Run a forced failover in a window and measure real application reconnect time. An HA pair you have never failed over is an untested claim.
Create a cross-region read replica and alert on lag. read_replica_lag_in_seconds > 60 s is a starting threshold. A replica nobody monitors is a DR target that silently loses data.
Rehearse promotion on a disposable replica. forced promotion is one-way; know exactly what it does and how long the repoint takes before you need it at 3am.
Treat parameters and extensions as code. Set them via CLI/Bicep/Terraform; remember static parameters and preload-requiring extensions need a restart.
Enable storage auto-grow (v1) or right-size IOPS (v2). An out-of-space outage is entirely preventable; over-provisioning a huge disk just to buy IOPS is wasteful when SSD v2 decouples them.
Validate, rehearse, and ANALYZE every major upgrade. Back up first, run --perform-validation-only, rehearse on a PITR clone, and make ANALYZE the final step before reopening traffic.
Enable geo-redundant backup at creation for anything with a region-loss requirement – it cannot be added later.
Run a quarterly PITR restore drill. Restore, connect, checksum, time it, delete. The measured time is your real RTO.

The alerts worth wiring before the next incident – leading indicators, not the lagging “database down”:

Alert on	Signal	Threshold (starting point)	Why it’s leading
Connection pressure	`active_connections` / `max_connections`	> 80%	Predicts `too many clients` before it fires
Replica lag	`read_replica_lag_in_seconds`	> 60 s sustained	DR target going stale
Storage usage	`storage_percent`	> 85%	Predicts out-of-space outage
CPU saturation	`cpu_percent`	> 80% for 10 min	Real compute pressure (vs connection storm)
Memory	`memory_percent`	> 90%	Predicts OOM / cache thrash
HA health	`is_db_alive` / HA state	Not Healthy	Standby unavailable -> no failover safety
IOPS throttling	`read_iops`/`write_iops` vs limit	Near provisioned cap	I/O bound; resize storage/IOPS
Deadlocks	`deadlocks`	Rising	Application locking problem

Security notes

Connect with managed identity, not passwords where possible. Flexible Server supports Microsoft Entra authentication; have the application authenticate as its managed identity and grant database roles to that identity, so no password sits in config. See Entra Managed Identities Deep Dive.
Keep the admin password in Key Vault. If you must use password auth, store it in Key Vault and inject via reference or pipeline – never literal in Bicep/Terraform. See Azure Key Vault: Secrets, Keys & Certificates.
Make it private. Use VNet integration or a private endpoint, disable public network access, and ensure the private DNS zone resolves to the private IP. A public endpoint behind a firewall is a weaker posture than no public endpoint at all.
Enforce TLS. Keep require_secure_transport=on and floor ssl_min_protocol_version to TLS 1.2+. A troubleshooting session is no excuse to disable TLS “temporarily.”
Least-privilege database roles. Grant the application role only the rights it needs; keep the admin login for administration, not the app. Use separate roles for read replicas’ reporting consumers.
Encrypt at rest with CMK where required. Default encryption uses platform-managed keys; for regulated data use a customer-managed key in Key Vault, with the identity that has wrap/unwrap rights.
Audit with pgaudit. Allowlist and preload pgaudit to log privileged operations; ship logs to Log Analytics for retention and alerting.

The security controls that also prevent incidents – secure and resilient pull the same direction:

Control	Setting / mechanism	Secures against	Also prevents
Entra authentication	Managed identity + DB role	Passwords in config	Password-rotation breakage
Private access / PE	Delegated subnet or private endpoint	Public exposure	“Worked in dev” DNS outages (when wired right)
Disable public access	`--public-access Disabled`	Internet-reachable database	Firewall-rule sprawl
Enforce TLS	`require_secure_transport`, `ssl_min_protocol_version`	Cleartext / downgrade	“Temporary” TLS-off mistakes
CMK encryption	Key Vault key + identity	Unmanaged key custody	—
`pgaudit`	Preload + allowlist	Untracked privileged ops	Blind incident forensics
Least-privilege roles	Per-app DB roles	Over-broad access	One compromised app touching all data

Cost & sizing

The bill drivers and how they interact with the fixes:

Compute SKU dominates, billed per vCPU-hour by tier. HA roughly doubles compute and storage (you pay for the standby). A read replica is a separate server you pay for in full. Storage (and, on SSD v2, provisioned IOPS/throughput) is the second driver; backups are free up to your storage size, with geo-redundant backup adding cross-region storage cost.
HA is insurance, not performance. It doesn’t make the database faster – it doubles spend to survive a zone/node failure. Enable it where availability requires it; for pure read scale-out, a replica is the right spend.
Don’t scale up to mask a connection storm. A bigger SKU buys a marginally higher connection ceiling and a much higher bill. PgBouncer fixes the storm for free (it’s built in) and often lets you scale down.
SSD v2 can cut storage cost by letting you provision IOPS independently instead of over-buying disk size to reach an IOPS target.
Geo-redundant backup and a cross-region replica are the cost of DR. Both are real line items; size them to the recovery requirement rather than reflexively enabling everything.

A rough monthly picture for a small production OLTP database in Central India: a single Standard_D2ds_v5 (~₹22,000) versus the same with zone-redundant HA (~₹44,000 for the pair), plus storage and – if DR is required – a cross-region read replica (another full server). Aarav Retail landed at ₹44,000 after fixing the storm and scaling back down, proving the fix is usually pooling, not a bigger SKU. The cost drivers and what each buys:

Cost driver	What you pay for	Rough INR / month	What it buys	Watch-out
1× D2ds_v5 (GP, no HA)	One General Purpose server	~₹22,000	Baseline OLTP	No zone resilience
Zone-redundant HA	The standby (≈2× compute+storage)	+~₹22,000	Zone-outage survival, RPO 0	Pure cost; not faster
1× E4ds_v5 (Memory Optimized)	More RAM per core	~₹40,000	Large working sets	Right-size; don’t over-buy RAM
Cross-region read replica	A separate full server	Full server cost	Read scale-out + DR	Watch lag; another bill
Premium SSD v1 storage	Disk size (IOPS scale with it)	Per-GB	Capacity + correlated IOPS	Over-buy size to get IOPS
Premium SSD v2 storage	Capacity + IOPS + throughput separately	Per-unit	IOPS without buying disk	Tune all three
Geo-redundant backup	Cross-region backup storage	Small per-GB	Geo-restore on region loss	Creation-time only
PgBouncer	Built in – free	₹0	Defeats connection storms	The cheapest fix on this list

There is no free tier for Flexible Server; the honest floor for production is a General Purpose SKU. The single most cost-effective move on this entire list is enabling the built-in PgBouncer – it costs nothing and frequently lets you run a smaller SKU.

Interview & exam questions

1. A PostgreSQL Flexible Server throws too many clients already while CPU is at 40%. What’s happening and how do you fix it? A connection storm: a serverless or high-fan-out front end opened thousands of backends and exhausted max_connections, which has nothing to do with CPU. Confirm with pg_stat_activity count against max_connections. Fix by enabling the built-in PgBouncer in transaction mode and repointing the app to port 6432, not by scaling the SKU. HA doesn’t help – the standby would hit the same wall.

2. Difference between Flexible Server HA and a read replica? HA is a synchronous hot standby in the same region (RPO 0, not readable, automatic failover) for availability against node/zone failure. A read replica is an asynchronous physical replica (non-zero RPO, readable, manual promotion) for read scale-out and cross-region DR. HA is “stay up”; a replica is “scale reads and survive a region.”

3. Why does enabling zone-redundant HA sometimes increase commit latency, and what are your options? Synchronous replication means every commit waits for the write to reach the standby in another zone – a cross-zone round trip added to commit. For chatty small-transaction OLTP this is measurable. Options: accept it, switch to same-zone HA (lower latency, no zone-outage protection), and/or offload reads to a replica to reduce primary load.

4. You enabled PgBouncer in transaction mode and now get “prepared statement does not exist.” Why? The driver uses server-side prepared statements, which are bound to a specific backend connection. In transaction pooling the connection returns to the pool after each transaction, so the prepared statement is gone on the next one. Fix by disabling prepared statements at the driver (prepareThreshold=0 / Max Auto Prepare=0).

5. An app works in dev but fails to connect once you disable public access. Most likely cause? The private DNS zone isn’t resolving the server FQDN to its private IP, so the client falls back to the public name and fails with public access off. Confirm with nslookup from inside the VNet. Fix by linking the correct private DNS zone to the client VNet with an A record to the private IP – the most common Flexible Server outage.

6. Can you change a Flexible Server from VNet integration to a private endpoint later? No. The networking model (public access, VNet integration, or private endpoint) is chosen at creation and is irreversible. Migrating models means creating a new server and moving the data. Decide deliberately up front based on whether you need a dedicated delegated subnet or a hub-and-spoke private-endpoint pattern.

7. What’s the correct procedure for a major version upgrade, and the single most-skipped step? Take an on-demand backup, run --perform-validation-only, rehearse on a PITR-restored clone, then execute in a window. The most-skipped step is ANALYZE – pg_upgrade drops optimizer statistics, so the database comes up with bad query plans until you run vacuumdb --all --analyze-in-stages. The “slow database after upgrade” is missing stats, not the engine.

8. Your DR plan rests on a cross-region read replica. What two things must be true for it to actually work? First, lag must be bounded and alerted (read_replica_lag_in_seconds) – an unmonitored replica silently falls behind and loses data on promotion. Second, you must have rehearsed promotion (standalone --forced) on a disposable copy, knowing it’s one-way and how long the application repoint takes. An un-drilled replica is a hope, not a plan.

9. You set shared_preload_libraries to include pg_cron but CREATE EXTENSION pg_cron says it isn’t available. Why? shared_preload_libraries is a static parameter – it only takes effect after a restart. You set it but didn’t restart, so the library isn’t loaded yet. Restart the server, then run CREATE EXTENSION. (The extension must also be allow-listed in azure.extensions.)

10. Why can’t you upgrade a read replica’s major version on its own? A read replica is a physical replica that inherits the primary’s major version via streaming replication; the WAL stream is version-specific. You upgrade the primary, which the replica follows, and you plan replicas around the upgrade window – you cannot independently upgrade a replica while it’s attached.

11. How do you make Flexible Server private and passwordless? Use VNet integration or a private endpoint with public access disabled and a private DNS zone resolving to the private IP, and authenticate the application via Microsoft Entra as its managed identity with database roles granted to that identity – so there’s no public endpoint and no password in config.

12. Geo-redundant backup – when must you enable it, and what does it give you? It must be enabled at server creation (it cannot be added later) and replicates backups to the paired region, enabling geo-restore to recover the database if the entire primary region is lost. It’s the backup-side complement to a cross-region replica for region-loss DR.

These map primarily to AZ-305 (Designing Azure Infrastructure Solutions) – design for high availability, business continuity, and data platforms – and DP-300 (Azure Database Administrator Associate) – deploy, configure, and optimize Azure database offerings, HA/DR, and security. The networking angle (delegated subnets, private endpoints, private DNS) touches AZ-700. A compact cert-mapping for revision:

Question theme	Primary cert	Exam objective area
HA vs replica, zone-redundant design	AZ-305 / DP-300	Design HA/BC; configure HA
PgBouncer, connection management	DP-300	Optimize and troubleshoot databases
Private access / PE / DNS	AZ-700 / AZ-305	Network connectivity; private access
Major upgrades, parameters, extensions	DP-300	Configure and maintain database offerings
Backup, PITR, geo-restore	AZ-305 / DP-300	Business continuity; recovery
Entra auth, CMK, least privilege	AZ-500 / DP-300	Secure data platforms

Quick check

Your database refuses new connections with too many clients already but CPU sits at 40%. What is the cause, and is scaling the SKU the right fix?
True or false: a zone-redundant HA standby can serve read-only reporting queries to take load off the primary.
You connect fine in dev, but after disabling public network access the app can’t reach the database. Name the single most likely cause and how you confirm it.
After a 14 -> 16 major upgrade the database is up but every query is slow. What did the upgrade not carry over, and what’s the one command that fixes it?
You want to change a server from VNet integration to a private endpoint. Can you, and if not, what must you do?

Answers

A connection storm exhausting max_connections – a serverless/high-fan-out front end opening thousands of backends, which is unrelated to CPU. Scaling the SKU is not the right fix (it buys only a marginally higher ceiling at much higher cost). The fix is the built-in PgBouncer in transaction mode, with the app connecting on port 6432.
False. The HA standby is a synchronous hot standby that is not readable. To offload reads you need an asynchronous read replica, which is a separate readable server (with non-zero RPO).
The private DNS zone isn’t resolving the server FQDN to its private IP, so the client falls back to the public name and fails with public access off. Confirm with nslookup <fqdn> from inside the VNet – it returns a public/wrong address. Fix by linking the correct private DNS zone with an A record to the private IP.
pg_upgrade does not carry over optimizer statistics, so query plans are bad until rebuilt. Fix by running vacuumdb --all --analyze-in-stages (or ANALYZE) as the last step before reopening traffic.
No – the networking model is fixed at creation and is irreversible. You must create a new server with the desired model (private endpoint) and migrate the data to it.

Glossary

Flexible Server – Azure Database for PostgreSQL deployment running the engine on a dedicated Linux VM inside your network topology; successor to the retired Single Server.
Tier (Burstable / General Purpose / Memory Optimized) – the compute family; Burstable is shared-core and cannot do HA or replicas, GP and MO can.
Zone-redundant HA – a synchronous hot standby placed in a different availability zone; survives a zone outage with RPO 0.
Same-zone HA – a synchronous hot standby in the same zone (different fault domain); survives node failure only, with lower commit latency.
Standby – the synchronous hot-standby VM in an HA configuration; not readable, promoted automatically on failover.
Read replica – an asynchronous physical replica that serves read-only traffic and acts as a cross-region DR target (non-zero RPO).
Promotion – making a replica an independent read-write server; forced promotion is one-way and severs the replication relationship.
PgBouncer – the connection pooler built into Flexible Server, listening on port 6432; transaction mode multiplexes many client connections onto few backends.
max_connections – the static server parameter capping concurrent backend processes; exhaustion yields too many clients already.
Delegated subnet – a subnet delegated to Microsoft.DBforPostgreSQL/flexibleServers, used exclusively by VNet-integrated servers.
Private endpoint – a NIC that projects the server into your VNet with a private IP, composing cleanly with hub-and-spoke.
Private DNS zone – the zone that resolves the server FQDN to its private IP; a misconfiguration here is the most common “worked in dev” outage.
azure.extensions – the server parameter that allow-lists which extensions CREATE EXTENSION may load.
shared_preload_libraries – a static parameter listing libraries loaded at startup; required (with a restart) for pg_cron, pg_stat_statements, pgaudit, and others.
Static vs dynamic parameter – dynamic parameters apply immediately; static ones require a server restart.
Premium SSD v1 / v2 – managed-disk storage; v1 ties IOPS to disk size with auto-grow, v2 provisions capacity, IOPS, and throughput independently.
PITR (point-in-time restore) – restoring to a new server at any second within the retention window; used to clone for drills and recover from mistakes.
Geo-redundant backup – backups replicated to the paired region, enabling geo-restore; can only be enabled at server creation.
pg_upgrade – the in-place major-version upgrade mechanism; offline, one-way, and does not carry over optimizer statistics.

Next steps

You can now run Flexible Server in production – HA drilled, pooling in the path, replicas monitored, upgrades rehearsed. Build outward:

Next: High Availability vs Disaster Recovery: RTO & RPO – the framework behind every HA/replica/backup decision in this article.
Related: Azure SQL Database: Hyperscale, Elastic Pools, Ledger & Always Encrypted – the other managed-relational option, with a different HA model.
Related: Azure Private Link & Private Endpoints for PaaS and Private Endpoints & Private DNS at Scale – get the networking and DNS right so the database is private and resolvable.
Related: Entra Managed Identities Deep Dive – connect the application to Postgres without a password.
Related: Azure Multi-Region Active-Active & Disaster Recovery – when one read replica isn’t enough and you need full regional failover.