Cloud SQL is the default managed relational database on GCP, and the gap between a demo instance and a production one is enormous. A demo instance has a public IP, no replicas, a default maintenance window, and a backup nobody has ever restored. A production instance survives a zone outage in under a minute, serves analytics off replicas without touching the primary, is reachable only over private networking from multiple VPCs, takes patches inside a window you control, and has a restore objective you have actually measured. This walkthrough builds that production posture for Cloud SQL (PostgreSQL and MySQL), end to end, with the failure modes that bite teams in practice.
Commands are written against gcloud sql with parallel Terraform google_sql_database_instance where it clarifies intent. Everything here assumes Cloud SQL Enterprise or Enterprise Plus edition; I call out where the edition matters.
1. HA architecture: regional disk and failover mechanics
A standard Cloud SQL instance is a single VM in a single zone backed by a zonal persistent disk. High availability changes the topology: HA provisions a primary in one zone and a standby in another zone of the same region, with data on a regional persistent disk that synchronously replicates every write across both zones. This is replication at the storage layer, not log shipping. The standby has no separate copy of the data it serves; it attaches the same regional disk on failover.
| Property | Single-zone | Regional HA |
|---|---|---|
| Failure domain | One zone | Two zones in a region |
| Data replication | None (single disk) | Synchronous, regional persistent disk |
| Failover target | None | Standby in second zone |
| Typical failover time | N/A | Tens of seconds to ~1-2 min |
| Cost | 1x compute + disk | ~2x compute, regional disk |
The mechanics matter because they shape your expectations. On failover, Cloud SQL detects the primary is unhealthy (the health check fails for roughly 60 seconds), the standby takes over the regional disk, and the instance connection name and private IP stay the same. Clients reconnect to the same endpoint. There is no DNS change and no promotion of a replica. Because replication is synchronous, failover is non-lossy: a committed transaction was already on the regional disk in both zones.
Create an HA instance with availability-type=REGIONAL:
gcloud sql instances create pg-prod \
--project=app-data-prj \
--database-version=POSTGRES_16 \
--edition=ENTERPRISE_PLUS \
--tier=db-perf-optimized-N-8 \
--region=us-central1 \
--availability-type=REGIONAL \
--storage-type=SSD \
--storage-size=100 \
--storage-auto-increase \
--enable-point-in-time-recovery \
--backup-start-time=03:00 \
--retained-backups-count=14 \
--retained-transaction-log-days=7
The same in Terraform, where HA is a single nested field:
resource "google_sql_database_instance" "pg_prod" {
  name             = "pg-prod"
  database_version = "POSTGRES_16"
  region           = "us-central1"
  settings {
    edition           = "ENTERPRISE_PLUS"
    tier              = "db-perf-optimized-N-8"
    availability_type = "REGIONAL" # this one line is HA
    disk_type         = "PD_SSD"
    disk_size         = 100
    disk_autoresize   = true
    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "03:00"
      transaction_log_retention_days = 7
      backup_retention_settings {
        retained_backups = 14
      }
    }
  }
  deletion_protection = true
}
HA protects against zone failure, not region failure and not your own bad migration. A DROP TABLE replicates to the regional disk instantly. HA is an availability control, not a recovery control. You still need backups, PITR, and ideally a cross-region replica. Treat them as separate problems.
2. Read replicas: in-region, cross-region, and cascading
A read replica is a separate instance with its own data copy, kept current by asynchronous replication from a source. Unlike the HA standby, a replica is independently addressable, serves read traffic, and can live in another region. Use replicas to offload read-heavy analytics, to put data near a remote read audience, and to seed disaster recovery in a second region.
Create an in-region replica to absorb read load:
gcloud sql instances create pg-prod-replica-1 \
--project=app-data-prj \
--master-instance-name=pg-prod \
--tier=db-perf-optimized-N-8 \
--region=us-central1 \
--availability-type=ZONAL
Create a cross-region replica in another region for DR and geo-local reads. Only the instance name and --region differ from the in-region command; the data itself comes from the source instance:
gcloud sql instances create pg-prod-dr-east \
--project=app-data-prj \
--master-instance-name=pg-prod \
--tier=db-perf-optimized-N-8 \
--region=us-east4 \
--availability-type=ZONAL
Three topology rules worth internalizing:
- A replica can itself be HA. Set --availability-type=REGIONAL on the replica so your DR region survives a zone outage after you promote it. A zonal DR replica is a single point of failure the day you need it most.
- Cascading replicas are supported: a replica can be the source for another replica. Use this to fan out reads in a remote region without adding asynchronous load to the cross-region link from the primary. Replicate primary -> regional replica -> N local replicas.
- Replicas do not inherit backups. Replicas are not backed up by default. Your backup and PITR strategy lives on the primary.
Replication is asynchronous, so a replica is always slightly behind. The lag metric you watch is database/replication/replica_lag (seconds). Read-your-own-writes consistency is not guaranteed against a replica; route writes and read-after-write to the primary, and route eventually-consistent reads to replicas.
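The routing rule above can be sketched in application code. This is a minimal illustration, not a driver feature: the endpoint names, the `Query` type, and the 5-second lag threshold are all assumptions, and falling back to the primary when lag is high or unknown is one possible policy (it trades staleness for extra primary load).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical endpoint names; in practice these would be DNS names or
# connection strings for the primary and a read replica.
PRIMARY = "pg-prod"
REPLICA = "pg-prod-replica-1"


@dataclass
class Query:
    is_write: bool
    needs_read_after_write: bool = False


def choose_endpoint(query: Query,
                    replica_lag_seconds: Optional[float],
                    max_lag_seconds: float = 5.0) -> str:
    """Return the instance a query should be sent to."""
    # Writes and read-after-write always go to the primary.
    if query.is_write or query.needs_read_after_write:
        return PRIMARY
    # Assumed policy: if lag is unknown or above the threshold, prefer the
    # primary over serving data that is too stale.
    if replica_lag_seconds is None or replica_lag_seconds > max_lag_seconds:
        return PRIMARY
    # Eventually-consistent reads go to the replica.
    return REPLICA
```

The lag value would come from the replica_lag metric; how often to refresh it, and whether the primary can absorb the fallback traffic, are decisions this sketch leaves open.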
3. Connectivity: private IP, PSC, and the Auth Proxy compared
Cloud SQL exposes three connectivity models, and choosing wrong creates either a security gap or a networking dead-end you cannot undo without recreating the instance.
| Model | How clients reach it | Best for | Key constraint |
|---|---|---|---|
| Private IP (PSA) | Internal IP via VPC peering (private services access) | Single owning VPC, simple setup | Consumes a peered range; one VPC peering, transitivity limits |
| Private Service Connect (PSC) | PSC endpoint(s) in consumer VPCs | Multi-VPC, centralized, cross-org reach | Must be enabled at create time; you manage endpoints/DNS |
| Cloud SQL Auth Proxy / Connectors | Local proxy brokers an mTLS tunnel | IAM-authenticated, app-side, dev and CI | A sidecar/binary per client; not a network path by itself |
Private IP via private services access (PSA) is the common default: GCP allocates a peered range, and the instance gets an internal IP reachable from the peered VPC. It is simple but it bakes in a VPC peering relationship and the non-transitivity that comes with peering. If you need the database reachable from many VPCs that do not all peer with one host VPC, PSA fights you.
The Cloud SQL Auth Proxy (and the language-native Connectors) is orthogonal to the network model. It does not give you a network route; it brokers an encrypted, IAM-authorized tunnel over whichever path already exists, and it handles automatic certificate rotation and (optionally) IAM database authentication. Run it as a sidecar next to your app:
# v2 Auth Proxy: connect over the instance's private IP, IAM auth on
./cloud-sql-proxy \
--private-ip \
--auto-iam-authn \
--port 5432 \
app-data-prj:us-central1:pg-prod
Use the proxy for application pods and CI even when you have private IP, because it removes long-lived passwords and static client certs from the equation. Use PSC (next section) when the network reachability itself has to span many VPCs or organizational boundaries.
4. PSC connectivity for centralized, multi-VPC access
Private Service Connect lets a single Cloud SQL instance be reached by many consumer VPCs through per-VPC PSC endpoints, with no shared address space and no VPC peering transitivity. This is the right model for a centralized data platform serving tenant or app VPCs. The catch you must plan for: PSC must be enabled when the instance is created. You cannot convert a PSA instance to PSC in place.
Enable PSC and the allowed consumer projects at create time:
gcloud sql instances create pg-platform \
--project=data-platform-prj \
--database-version=POSTGRES_16 \
--edition=ENTERPRISE_PLUS \
--tier=db-perf-optimized-N-8 \
--region=us-central1 \
--availability-type=REGIONAL \
--enable-private-service-connect \
--allowed-psc-projects=tenant-a-prj,tenant-b-prj \
--no-assign-ip \
--enable-point-in-time-recovery
Cloud SQL publishes a service attachment for the instance. Retrieve it, then each consumer creates a PSC endpoint (a forwarding rule) pointing at that attachment in its own VPC:
# Platform team: get the service attachment URI for the instance
gcloud sql instances describe pg-platform \
--project=data-platform-prj \
--format='value(pscServiceAttachmentLink)'
# Consumer (tenant-a) creates a PSC endpoint to that attachment
gcloud compute addresses create psc-pg-platform-ip \
--project=tenant-a-prj \
--region=us-central1 \
--subnet=tenant-a-subnet \
--addresses=10.20.0.40
gcloud compute forwarding-rules create psc-pg-platform-fr \
--project=tenant-a-prj \
--region=us-central1 \
--network=tenant-a-vpc \
--address=psc-pg-platform-ip \
--target-service-attachment=projects/data-platform-prj/regions/us-central1/serviceAttachments/<attachment-id>
Now 10.20.0.40 in tenant-a’s VPC reaches the instance, and tenant-b reaches the same instance through its own endpoint and its own IP. DNS hygiene is on you: create a private DNS zone (for example pg-platform.psc.internal) mapping a stable name to each consumer’s endpoint IP, so applications never hardcode the PSC IP. Treat the endpoint IP as a per-VPC implementation detail behind a name.
PSC also fixes the HA failover story for multi-VPC access. The service attachment is stable across failover, so consumer endpoints keep working when the primary fails over to its standby zone. You publish once and every consumer inherits the HA behavior.
5. Promoting replicas for DR and managing replication lag
A cross-region replica is your regional disaster recovery. When the primary’s region is impaired, you promote the replica, which severs replication and turns it into a standalone, writable primary. Promotion is one command and is not reversible: the old replication relationship is gone, and re-establishing it means rebuilding a fresh replica from the new primary.
gcloud sql instances promote-replica pg-prod-dr-east \
--project=app-data-prj
Two things to get right before you ever promote:
- Drain or fence the application first. Promotion does not coordinate with your app. If the old primary is actually alive (a network partition, not a true regional loss), promoting creates a split brain where both instances accept writes. Have a runbook that stops writes to the old endpoint before promoting.
- Quantify lag, not just liveness. Because replication is asynchronous, the replica may be seconds behind at the moment of failure, and those unreplicated transactions are lost on promotion. Watch database/replication/replica_lag continuously and alert on it; this number is effectively your cross-region RPO.
Set an alert on replica lag so DR readiness is measured, not assumed:
gcloud monitoring policies create --policy-from-file=- <<'EOF'
displayName: "Cloud SQL cross-region replica lag > 30s"
combiner: OR
conditions:
  - displayName: "replica_lag high"
    conditionThreshold:
      filter: >
        resource.type="cloudsql_database"
        AND resource.labels.database_id="app-data-prj:pg-prod-dr-east"
        AND metric.type="cloudsql.googleapis.com/database/replication/replica_lag"
      comparison: COMPARISON_GT
      thresholdValue: 30
      duration: 120s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_MEAN
EOF
For planned regional migrations (not a disaster), the same mechanism works in reverse: stand up a cross-region replica, let it catch up to near-zero lag, stop writes, promote, and cut traffic over. That is a controlled, near-lossless region move.
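The two pre-promotion checks in this section (fence writes, quantify lag) can be encoded as a go/no-go helper for a runbook. This is a sketch under assumptions: the function name, the `PromotionDecision` type, and the idea of passing in recent lag samples and a fencing flag are illustrative, and the result is advisory. In a true regional loss you may choose to promote even when lag exceeds the RPO.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromotionDecision:
    proceed: bool
    worst_case_loss_seconds: float
    reasons: List[str]


def promotion_preflight(lag_samples_seconds: List[float],
                        writes_fenced: bool,
                        rpo_seconds: float) -> PromotionDecision:
    """Check fencing and recent replica lag before promoting a replica."""
    if not lag_samples_seconds:
        # Without lag data the potential data loss cannot be bounded.
        return PromotionDecision(False, float("inf"),
                                 ["no replica lag samples available"])
    reasons: List[str] = []
    worst = max(lag_samples_seconds)
    if not writes_fenced:
        reasons.append("writes to the old primary are not fenced (split-brain risk)")
    if worst > rpo_seconds:
        reasons.append(f"worst recent lag {worst:.0f}s exceeds RPO {rpo_seconds:.0f}s")
    # Proceed only when no blocking reason was recorded.
    return PromotionDecision(not reasons, worst, reasons)
```

Using the worst recent sample rather than the mean is a deliberately pessimistic choice: the transactions lost on promotion are bounded by how far behind the replica was at the moment of failure, not by its average.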
6. Maintenance windows, deny periods, and near-zero-downtime updates
Cloud SQL applies engine patches and infrastructure updates during a maintenance event, which on a standard instance involves a short restart. You control when this happens, and on Enterprise Plus you can largely eliminate the downtime.
Pin the maintenance window to your low-traffic hour and set the update timing preference. --maintenance-release-channel=production takes updates later than preview, after they have soaked:
gcloud sql instances patch pg-prod \
--project=app-data-prj \
--maintenance-window-day=SUN \
--maintenance-window-hour=4 \
--maintenance-release-channel=production
Layer deny maintenance periods on top to freeze updates entirely during a business-critical window (peak season, an audit, a launch). A deny period can span up to 90 days:
gcloud sql instances patch pg-prod \
--project=app-data-prj \
--deny-maintenance-period-start-date=2026-11-20 \
--deny-maintenance-period-end-date=2026-12-31 \
--deny-maintenance-period-time=00:00:00
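A quick date check before patching keeps a deny period inside the 90-day limit. The sketch below is plain date arithmetic; counting both the start and end dates as part of the period is an assumption made to stay on the conservative side, not a statement of how Cloud SQL counts.

```python
from datetime import date

MAX_DENY_DAYS = 90  # limit stated in the text above


def deny_period_days(start: date, end: date) -> int:
    """Length of the deny period in days, counting both start and end dates."""
    if end < start:
        raise ValueError("end date is before start date")
    return (end - start).days + 1


def is_valid_deny_period(start: date, end: date) -> bool:
    """True if the period fits within the 90-day limit."""
    return deny_period_days(start, end) <= MAX_DENY_DAYS
```

For the dates used in the command above, `deny_period_days(date(2026, 11, 20), date(2026, 12, 31))` gives 42, comfortably inside the limit.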
The big lever is edition. Cloud SQL Enterprise Plus performs near-zero-downtime maintenance: the update is applied to a standby and the instance fails over to it, so the connection interruption is on the order of a second rather than the longer restart you get on Enterprise edition. If you have a hard availability SLO, Enterprise Plus plus connection pooling (section 8) is how you hit it through routine patching. Always run self-service maintenance on a staging instance first so you observe the actual blip your drivers experience.
7. Backups, PITR, and validating restore objectives
Automated backups plus point-in-time recovery (PITR) are what actually protect you from logical corruption and human error, the failures HA cannot touch. We enabled both in section 1: on PostgreSQL, --enable-point-in-time-recovery turns on write-ahead log archiving (MySQL uses --enable-bin-log for binary logs), and --retained-transaction-log-days sets how far back you can replay. PITR lets you restore to any second within the retained log window by cloning to a new instance.
Restore to a point in time by cloning to a new instance (the source is never overwritten, which is the behavior you want during an incident):
gcloud sql instances clone pg-prod pg-prod-restore-20260608 \
--project=app-data-prj \
--point-in-time='2026-06-08T02:45:00Z'
The number that matters is not whether backups are enabled; it is whether you have restored from them recently and measured how long it took. An untested backup is a hypothesis. Run a scheduled restore drill that restores the latest backup into a throwaway instance, asserts a known row, and records the wall-clock duration as your real RTO:
# DR drill: find the most recent backup, restore it into a scratch instance
# (pg-prod-drill must already exist), and time it
BACKUP_ID=$(gcloud sql backups list --instance=pg-prod \
--project=app-data-prj --sort-by=~windowStartTime \
--limit=1 --format='value(id)')
time gcloud sql backups restore "$BACKUP_ID" \
--restore-instance=pg-prod-drill \
--backup-instance=pg-prod \
--project=app-data-prj
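To turn the drill into a recorded number rather than a terminal timing, the restore and the known-row check can be wrapped in a small harness. This is a sketch: `restore` and `verify` are placeholders you would supply (for example, a subprocess call to the restore command and a query against the scratch instance), and the injectable `clock` exists only to make the harness testable.

```python
import time
from typing import Callable


def measure_rto(restore: Callable[[], None],
                verify: Callable[[], bool],
                rto_target_seconds: float,
                clock: Callable[[], float] = time.monotonic) -> dict:
    """Time a restore plus verification and compare it to an RTO target."""
    start = clock()
    restore()            # placeholder: run the restore and wait for it to finish
    data_ok = verify()   # placeholder: assert a known row exists in the restored data
    elapsed = clock() - start
    return {
        "elapsed_seconds": elapsed,
        "data_verified": data_ok,
        # A fast restore of wrong data does not count as meeting the target.
        "within_target": data_ok and elapsed <= rto_target_seconds,
    }
```

Including the verification step inside the timed region is a design choice: the RTO that matters ends when you know the data is usable, not when the restore command returns.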
Cross-region DR has two independent layers and you want both: a cross-region read replica for fast failover (low RTO), and cross-region backup copies for the case where corruption silently replicated to the replica before you noticed. The replica gives you speed; the backups give you a clean point before the corruption. Neither substitutes for the other.
8. Connection pooling and avoiding failover storms
The failure mode that turns a 60-second failover into a 10-minute outage is the thundering herd: every application instance loses its connections at once, then every one of them tries to reconnect simultaneously the moment the standby comes up, and the fresh primary drowns in connection establishment before it can serve a single query. Cloud SQL has a hard max_connections ceiling, and PostgreSQL connections are expensive (each is a backend process).
Two defenses, used together:
- Pool connections in front of the database. Enterprise Plus instances offer Managed Connection Pooling built in; otherwise run PgBouncer (PostgreSQL) or ProxySQL (MySQL) in transaction-pooling mode. A pool multiplexes thousands of client sessions onto a small, bounded set of server connections, so a reconnect storm hits the pool, not the database. Bound the server-side pool well under max_connections.
- Reconnect with jitter and a cap. Set a sane connection-acquisition timeout, exponential backoff with randomized jitter, and a maximum in-flight reconnect rate in the client driver, so recovery is staggered instead of synchronized.
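The second defense, backoff with jitter and a cap, can be sketched as follows. This uses the "full jitter" variant (a uniform delay between zero and a capped exponential ceiling); the function names, the 0.5-second base, the 30-second cap, and the attempt limit are illustrative assumptions, and `connect` stands in for whatever your driver exposes. It does not implement a fleet-wide reconnect rate limit.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def connect_with_retry(connect: Callable[[], T],
                       max_attempts: int = 8,
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Optional[random.Random] = None) -> T:
    """Call connect() until it succeeds, sleeping a jittered delay between tries."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # No point sleeping after the final failed attempt.
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt, rng=rng))
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

Because each client draws its own random delay, 1,000 clients that lose their connections in the same instant spread their retries across the backoff window instead of arriving together.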
A minimal PgBouncer transaction-pool config that caps server-side connections regardless of client count:
[databases]
appdb = host=10.20.0.40 port=5432 dbname=appdb
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 50
reserve_pool_size = 10
server_idle_timeout = 300
Transaction pooling fans up to 5000 client connections onto 50 server connections per user/database pair; raise default_pool_size only after measuring, and keep total server connections comfortably below the instance max_connections. Combine this with the Auth Proxy between the pool and the instance, and a failover becomes a brief stall, not an outage.
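The sizing rule is simple arithmetic, and it is worth writing down because PgBouncer sizes pools per user/database pair and each PgBouncer instance opens its own server connections. The sketch below assumes the worst case is every pool filled to default_pool_size plus reserve_pool_size; the function name and the 20% headroom figure are illustrative, not a recommendation from Cloud SQL.

```python
def server_connection_budget(pools: int,
                             default_pool_size: int,
                             reserve_pool_size: int,
                             pgbouncer_instances: int,
                             max_connections: int,
                             headroom_percent: int = 20) -> dict:
    """Compare worst-case pooled server connections against max_connections."""
    # Each user/database pair can use its pool plus the reserve,
    # and every PgBouncer instance does so independently.
    worst_case = pools * (default_pool_size + reserve_pool_size) * pgbouncer_instances
    # Keep some connections free for admin sessions, replication, and monitoring.
    usable = max_connections * (100 - headroom_percent) // 100
    return {
        "worst_case_server_connections": worst_case,
        "usable_connections": usable,
        "fits": worst_case <= usable,
    }
```

With the config above (50 + 10 per pool), one pool across three PgBouncer instances needs at most 180 server connections, which fits under an assumed max_connections of 500; four pools across the same three instances need 720 and do not.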
Verify
Confirm each control is actually in effect, not just configured.
# 1. HA is regional and the standby is in a different zone
gcloud sql instances describe pg-prod --project=app-data-prj \
--format='value(settings.availabilityType, gceZone, secondaryGceZone)'
# Expect: REGIONAL <zone-a> <zone-b>
# 2. Replicas exist and one is cross-region
gcloud sql instances describe pg-prod --project=app-data-prj \
--format='value(replicaNames)'
gcloud sql instances describe pg-prod-dr-east --project=app-data-prj \
--format='value(region, settings.availabilityType)'
# 3. PSC is on and the service attachment is published
gcloud sql instances describe pg-platform --project=data-platform-prj \
--format='value(settings.ipConfiguration.pscConfig.pscEnabled, pscServiceAttachmentLink)'
# 4. Public IP is OFF (private/PSC only)
gcloud sql instances describe pg-prod --project=app-data-prj \
--format='value(settings.ipConfiguration.ipv4Enabled)'
# Expect: False
# 5. Backups + PITR retention are set
gcloud sql instances describe pg-prod --project=app-data-prj \
--format='value(settings.backupConfiguration.pointInTimeRecoveryEnabled,
settings.backupConfiguration.transactionLogRetentionDays,
settings.backupConfiguration.backupRetentionSettings.retainedBackups)'
# 6. Maintenance window and deny period are pinned
gcloud sql instances describe pg-prod --project=app-data-prj \
--format='value(settings.maintenanceWindow.day, settings.maintenanceWindow.hour,
settings.denyMaintenancePeriods)'
Then exercise the real behaviors: run a manual failover (gcloud sql instances failover pg-prod) during a maintenance window and time the client interruption with pooling in place; watch database/replication/replica_lag on the cross-region replica under load; and run one full restore drill end to end, recording the wall-clock RTO.
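Timing the client interruption is easier if a probe loop records (timestamp, succeeded) pairs during the failover and the gap is computed afterwards. A minimal sketch, assuming probes arrive in chronological order; the function name is hypothetical, and a failure run that is not bracketed by a success on both sides is ignored because its length cannot be bounded from the data.

```python
from typing import Iterable, Optional, Tuple


def longest_outage_seconds(probes: Iterable[Tuple[float, bool]]) -> float:
    """Longest gap between the last success before a failure run and the
    first success after it. Probes are (timestamp_seconds, succeeded)."""
    longest = 0.0
    last_ok: Optional[float] = None  # timestamp of the most recent success
    in_outage = False
    for ts, ok in probes:
        if ok:
            # A success that ends a failure run closes the outage window.
            if in_outage and last_ok is not None:
                longest = max(longest, ts - last_ok)
            last_ok = ts
            in_outage = False
        else:
            in_outage = True
    return longest
```

Measuring from the last success rather than the first failure gives an upper bound on the interruption, which is the safer number to compare against an availability SLO. Probe through the same pool and endpoint your application uses, so the result reflects what clients see.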
Enterprise scenario
A fintech platform team ran a central PostgreSQL instance on Cloud SQL, reached over private services access from a single host VPC. As they onboarded business units, each new unit got its own VPC, and the peering-and-routing topology to reach the shared database became unmanageable: PSA’s one-VPC-peering model and peering non-transitivity meant they were stitching together routes per business unit. Then a routine HA failover one evening surfaced a second problem. Roughly 1,200 application pods lost their connections at failover and reconnected in unison; the fresh primary hit max_connections and rejected new connections for several minutes. A 70-second zone failover had turned into a 9-minute partial outage, and the postmortem flagged both the connectivity sprawl and the reconnect storm.
The constraint: they could not convert the existing PSA instance to PSC in place (PSC is create-time only), and they could not take extended downtime to re-platform. The fix was a controlled migration plus a pooling layer. They created a new Enterprise Plus instance with PSC enabled and the business-unit projects on the allowed list, used a cross-region-style cutover (replicate, catch up to near-zero lag, stop writes, promote, repoint) to move with a sub-minute write freeze, and put Managed Connection Pooling in front so reconnects hit the pool, not the database. Each business unit now reaches the database through its own PSC endpoint behind a stable private DNS name, and Enterprise Plus near-zero-downtime maintenance plus the pool meant the next failover was a sub-second stall in application logs.
# The decisive create-time choice: PSC + allowed consumer projects + HA
gcloud sql instances create pg-platform-v2 \
--project=data-platform-prj \
--database-version=POSTGRES_16 \
--edition=ENTERPRISE_PLUS \
--tier=db-perf-optimized-N-16 \
--region=us-central1 \
--availability-type=REGIONAL \
--enable-private-service-connect \
--allowed-psc-projects=bu-payments-prj,bu-lending-prj,bu-cards-prj \
--no-assign-ip \
--enable-point-in-time-recovery
The lesson the team took forward: connectivity model and edition are decisions you make at create time, and getting them wrong is a migration, not an edit. Default new production instances to PSC, Enterprise Plus, regional HA, and a pool from day one.