
Amazon RDS & Aurora, In Depth: Engines, Multi-AZ, Read Replicas, Backups & Every Option

Almost every serious outage, surprise bill, and “the database is slow” ticket I have chased on AWS eventually lands on a managed-database decision someone made without quite understanding it. A team turns on a read replica expecting high availability and is shocked when a writer failure takes the application down anyway. A finance review uncovers a db.r6g.4xlarge Multi-AZ instance serving a workload a quarter that size. A “harmless” engine upgrade locks the writer for twenty minutes because nobody knew about Blue/Green. A restored snapshot comes up with the wrong parameter group and the application silently misbehaves for a week. Relational Database Service (RDS) and Aurora hide an enormous amount of operational machinery behind a friendly Create database button — and the gap between clicking that button and actually understanding it is exactly where these failures live.

This is the deep dive that closes that gap. Amazon RDS is AWS’s managed relational database service: you choose an engine (MySQL, PostgreSQL, MariaDB, Oracle, or SQL Server), an instance class, and storage, and AWS runs the undifferentiated heavy lifting — provisioning, patching, backups, failover, and monitoring — while you keep full SQL-level control of your data. Amazon Aurora is AWS’s cloud-native reimplementation of MySQL and PostgreSQL that keeps the same wire protocol and tools but replaces the storage and replication engine with a distributed, self-healing, six-way-replicated cluster volume that delivers much higher performance and availability. By the end of this lesson you will know every engine and instance family, exactly how storage and storage autoscaling work, the difference between Multi-AZ for high availability and read replicas for read scaling (the single most-misunderstood pair in the service), how automated backups, point-in-time recovery and snapshots actually behave, what parameter and option groups do, the full security surface, and the Aurora-specific concepts — cluster volume, the writer/reader/custom endpoints, Aurora replicas, Serverless v2, and Global Database. Every concept comes with the real aws CLI to drive it.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites & where this fits

You should be comfortable with core networking — a VPC, subnets, and security groups — because every production database lives in private subnets behind a DB subnet group and a security group; if those terms are fuzzy, read the VPC deep dive first. Basic SQL familiarity and a sense of what KMS and IAM do will help, but every term is defined as we go. This lesson sits in the Databases module of the AWS Zero-to-Hero course, immediately after the storage deep dives — RDS and Aurora are, ultimately, a very opinionated way to put storage and compute together for relational data. It is the foundation for the two advanced database lessons it links at the end: zero-downtime upgrades with Blue/Green Deployments and Aurora high availability and Global Database.

Core concepts

Managed vs self-managed (what RDS actually buys you). You could run MySQL on an EC2 instance yourself: you would patch the OS and the engine, script your own backups, build your own replication and failover, and wear the pager when any of it breaks. RDS takes all of that operational surface and makes it AWS’s problem. You still own the data tier — your schema, queries, indexes, users, and the engine-level configuration — but AWS owns the operational tier: the host, the storage, automated backups, patching, monitoring hooks, and the failover machinery. The mental model to carry: RDS is a managed engine on managed infrastructure; you keep SQL-level control and give up host-level control. That trade is almost always worth it, and the few cases where it is not (a feature RDS does not expose, an OS-level extension, a need for superuser) are the cases that push people to EC2-hosted databases or to specialised services.

Instance + storage + network = a DB instance. An RDS DB instance is the unit you create and pay for. It is the bundle of a compute instance class (vCPU and memory), an attached storage volume (type, size, and provisioned performance), and a network placement (a VPC, a DB subnet group spanning at least two Availability Zones, and a security group). On RDS the instance and its storage are coupled — the instance reads and writes its own EBS-style volume. (Aurora breaks this coupling, as we will see — that is the whole point of Aurora.)

Availability Zones, and why two is the magic number. A Region (e.g. ap-south-1, Mumbai) is a geographic area; an Availability Zone (AZ) is one or more physically separate data centres within that region, with independent power, cooling, and networking. AZs are close enough for synchronous replication but far enough that one failing does not take its neighbours down. Almost everything in RDS high availability is built on placing copies of your data in different AZs, which is why a DB subnet group must contain subnets in at least two AZs before you can even create a Multi-AZ deployment.

The two axes everyone confuses: availability vs scale. This is the load-bearing idea of the whole lesson, so internalise it now. High availability answers “if the primary dies, how fast do I get a working writer back, with no data loss?” — and the RDS answer is Multi-AZ. Read scaling answers “I have more read traffic than one instance can serve; how do I add read capacity?” — and the answer is read replicas. They are different features for different problems, they can be combined, and confusing them is the classic interview trap and the classic production mistake. A read replica is not an HA solution (its promotion is manual and lossy); a Multi-AZ standby is not a read-scaling solution (on RDS the standby serves no traffic). We will hammer this distinction in its own section.

RTO and RPO (the vocabulary of recovery). RTO (Recovery Time Objective) is how long you can tolerate being down; RPO (Recovery Point Objective) is how much data you can tolerate losing, measured in time. Every backup and HA choice in this lesson is really a point on the RTO/RPO plane: Multi-AZ gives a low-RTO, zero-RPO failover; PITR gives a higher RTO but lets you rewind to any second; a cross-region snapshot copy gives disaster-recovery coverage at a higher RTO/RPO. Keep these two letters in mind whenever a feature claims to protect you.

RDS engines: the five (well, six) you can run

RDS runs five database engines. Aurora is offered as a sixth choice in the same console flow but is architecturally its own thing (covered in depth later).

| Engine | What it is | Licensing | Notable on RDS | Typical use |
| --- | --- | --- | --- | --- |
| MySQL | Open-source RDBMS | Open source (free) | Versions 8.0+; broad ecosystem | General-purpose web/OLTP, LAMP apps |
| PostgreSQL | Open-source RDBMS | Open source (free) | Rich extensions (PostGIS, pg_stat_statements); strong SQL standard | OLTP, geospatial, analytics-adjacent, anything needing advanced SQL |
| MariaDB | MySQL fork | Open source (free) | MySQL-compatible; some extra storage engines | MySQL workloads preferring the MariaDB lineage |
| Oracle | Commercial RDBMS | BYOL or License Included | SE2 and EE; option groups for OEM, TDE, etc. | Lift-and-shift of existing Oracle estates |
| SQL Server | Commercial RDBMS (Microsoft) | License Included (or BYOL via some paths) | Express/Web/Standard/Enterprise editions | .NET shops, lift-and-shift of SQL Server |
| Aurora (MySQL / PostgreSQL-compatible) | AWS cloud-native engine | Open source-compatible pricing | Distributed cluster volume; covered below | New cloud-native workloads wanting more performance/HA |

A few decisions the table implies:

Engine versions and the version lifecycle. Each engine offers multiple major and minor versions. AWS marks versions as available, then eventually deprecated, then forces an upgrade at end of standard support (you can pay for Extended Support to stay on an old major version longer). You pick the version at creation; you change it later via a maintenance operation (minor versions can auto-apply if you allow it; major versions are deliberate and are best done with Blue/Green — linked at the end).

Instance classes: sizing the compute

The instance class sets vCPU, memory, network, and EBS bandwidth. RDS classes mirror EC2 families, prefixed db.:

| Family | Class prefix | Optimised for | Examples | When to use |
| --- | --- | --- | --- | --- |
| Standard (general purpose) | db.m (m5, m6i, m6g/m7g) | Balanced CPU:memory | db.m6i.large, db.m7g.xlarge | Most general OLTP workloads |
| Memory-optimised | db.r (r5, r6i, r6g/r7g), db.x (x2g) | High memory per vCPU | db.r6g.2xlarge, db.x2g.large | Large working sets, in-memory-heavy, big buffer pools, analytics |
| Burstable | db.t (t3, t4g) | Low/spiky baseline with CPU credits | db.t4g.micro, db.t3.medium | Dev/test, small/intermittent prod, the Free Tier |

The key sizing levers and gotchas:

Storage: types, autoscaling, and the I/O model

RDS storage is the volume your DB instance reads and writes. You choose a type, an allocated size, and (for provisioned types) a performance level, and you can let RDS grow it automatically.

| Storage type | Media | Performance model | Max size (engine-dependent) | Best for | Gotcha |
| --- | --- | --- | --- | --- | --- |
| General Purpose gp3 | SSD | Baseline IOPS + throughput, scalable independently of size | 64 TiB (most engines) | The new default for most workloads | Provision extra IOPS/throughput separately above the included baseline |
| General Purpose gp2 | SSD | IOPS scale with size (3 IOPS/GiB), burst credits | 64 TiB | Legacy default; smaller dbs | To get more IOPS you had to grow the disk; gp3 fixes this |
| Provisioned IOPS io1 | SSD | You provision IOPS explicitly (older) | 64 TiB | Latency-sensitive, consistent high IOPS | Largely superseded by io2 |
| Provisioned IOPS io2 (Block Express) | SSD | High provisioned IOPS, higher durability, sub-ms latency | 64 TiB | Mission-critical, highest and most consistent IOPS | Highest cost; for the demanding tail |
| Magnetic (standard) | HDD | Legacy, low/variable | Small | Backwards compatibility only | Deprecated; do not pick for new databases |
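
The gp2 sizing rule in the table is easy to check with a little arithmetic. The sketch below assumes the 3 IOPS/GiB figure from the gp2 row plus a 100 IOPS floor for small volumes (an assumption taken from the published gp2 model, not from this lesson). Upper caps are not modelled, and both function names are hypothetical.

```shell
# gp2 baseline IOPS: 3 IOPS per GiB of allocated storage,
# with an assumed floor of 100 IOPS for small volumes. Upper caps are ignored.
gp2_baseline_iops() {
  size_gib=$1
  iops=$(( size_gib * 3 ))
  if [ "$iops" -lt 100 ]; then
    iops=100
  fi
  echo "$iops"
}

# GiB of gp2 you would have to allocate just to reach a target baseline
# (ceiling division by 3).
gp2_size_for_iops() {
  target_iops=$1
  echo $(( (target_iops + 2) / 3 ))
}

gp2_baseline_iops 20     # prints 100  (floor applies)
gp2_baseline_iops 500    # prints 1500
gp2_size_for_iops 6000   # prints 2000 (GiB needed for 6000 baseline IOPS)
```

The last call shows the gp2 gotcha in numbers: a database that needs 6000 IOPS but holds only 200 GiB of data would still have to allocate about 2000 GiB on gp2, whereas gp3 lets you provision the IOPS without growing the disk.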

The storage facts that matter in practice:

The big one: Multi-AZ vs read replicas (the difference)

This is the section interviewers wait for and the one production teams get wrong. Multi-AZ is for availability. Read replicas are for read scaling. They are different mechanisms solving different problems. Read this section slowly.

Multi-AZ deployments — high availability, not scaling

A Multi-AZ deployment keeps a standby copy of your database in a different Availability Zone, kept in sync with the primary, ready to take over automatically if the primary fails. There are two flavours:

Multi-AZ instance deployment (one standby). RDS provisions a primary in one AZ and a single standby in another AZ, and replicates synchronously (the primary does not acknowledge a commit until the standby has it too — hence zero data loss / RPO ≈ 0). The standby is passive: it serves no read or write traffic; it exists purely to be promoted. If the primary fails (host failure, AZ impairment, storage failure) or you do certain maintenance (patching, instance resize), RDS performs an automatic failover: it flips the database’s DNS endpoint to point at the standby, which becomes the new primary, typically in 60–120 seconds. Your application keeps using the same endpoint name and reconnects. You pay for two instances but only ever use one for traffic.

Multi-AZ cluster deployment (two readable standbys). A newer option (MySQL and PostgreSQL) that provisions a writer plus two reader instances across three AZs. Replication is semi-synchronous (the writer waits for at least one reader to acknowledge), failover is typically faster (~35 seconds), and crucially the two standbys are readable — so you get HA and a little read offload from the same deployment. It costs more (three instances) and has engine/version constraints, but it narrows the gap between “HA” and “read scaling”.

| Property | Multi-AZ instance | Multi-AZ cluster |
| --- | --- | --- |
| Topology | 1 primary + 1 standby, 2 AZs | 1 writer + 2 readers, 3 AZs |
| Replication | Synchronous | Semi-synchronous (≥1 reader) |
| Standby serves reads? | No | Yes (both readers) |
| Typical failover time | ~60–120 s | ~35 s |
| Data loss on failover | None (RPO ≈ 0) | None (RPO ≈ 0) |
| Cost | 2× instance | 3× instance |
| Engines | All RDS engines | MySQL, PostgreSQL (specific versions) |

The unifying point: Multi-AZ failover is automatic and lossless, and it does not add read capacity (the instance variant adds none; the cluster variant adds some via its readable standbys, but that is a side benefit, not its purpose).
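
One way to see the automatic failover for yourself is to trigger it on purpose against a non-production Multi-AZ instance. This is a minimal sketch, assuming a Multi-AZ instance deployment; the function names and the pg-staging identifier are hypothetical, while `reboot-db-instance --force-failover` and the `db-instance-available` waiter are standard aws CLI operations.

```shell
# Hypothetical helper: force a Multi-AZ failover, then block until the
# instance reports "available" again.
rehearse_failover() {
  instance_id=$1
  aws rds reboot-db-instance \
    --db-instance-identifier "$instance_id" \
    --force-failover &&
  aws rds wait db-instance-available \
    --db-instance-identifier "$instance_id"
}

# Print the AZ hosting the primary and the AZ hosting the standby.
show_azs() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].[AvailabilityZone,SecondaryAvailabilityZone]' \
    --output text
}

# Usage (non-production only):
#   show_azs pg-staging; rehearse_failover pg-staging; show_azs pg-staging
```

If the failover worked, the two AZs printed by `show_azs` should have swapped, while the endpoint name your application uses stays the same.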

Read replicas — read scaling, not (automatic) availability

A read replica is a separate instance that receives an asynchronous copy of the primary’s changes and serves read-only queries. You create one or several (the per-source limit is engine-dependent, from 5 to 15) to offload read traffic from the primary: reporting queries, search, read-heavy API endpoints. Each replica has its own endpoint; your application sends writes to the primary endpoint and spreads reads across replica endpoints.

The defining characteristics — and the traps:

The comparison that settles it

| Question | Multi-AZ | Read replica |
| --- | --- | --- |
| Primary purpose | High availability (survive an AZ/instance failure) | Read scaling (offload read traffic) |
| Replication | Synchronous (instance) / semi-sync (cluster) | Asynchronous |
| Failover on primary loss | Automatic, ~60–120 s (instance) / ~35 s (cluster) | Manual promotion, minutes |
| Data loss | None (RPO ≈ 0) | Possible (RPO > 0, lossy promotion) |
| Serves read traffic? | No (instance) / yes (cluster readers) | Yes (that is the point) |
| Cross-Region? | No (single-region HA) | Yes (DR + local reads) |
| You’d reach for it when… | “I cannot afford downtime if an AZ fails.” | “My reads exceed one instance’s capacity.” |

And the answer that wins the interview: you can and often should use both together — a Multi-AZ primary for availability plus one or more read replicas for scale. They are orthogonal. (Aurora, as we will see, partly collapses this distinction: an Aurora replica is both a fast failover target and a read endpoint, because they all share one cluster volume.)
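
To make the read-replica side of the comparison concrete, the sketch below reads a replica's lag from CloudWatch and shows the manual promotion step. It assumes the standard `ReplicaLag` metric in the `AWS/RDS` namespace; the function names are hypothetical, and the timestamps are passed in by the caller as ISO-8601 UTC strings.

```shell
# Hypothetical helper: worst-case ReplicaLag (in seconds) for one replica
# between two timestamps, as reported by CloudWatch.
replica_lag_max() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --start-time "$2" --end-time "$3" \
    --period 60 --statistics Maximum \
    --query 'max(Datapoints[].Maximum)' --output text
}

# Manual promotion: the replica becomes a standalone, writable instance and
# replication from the old primary stops. Changes that had not yet been
# replicated are lost, which is why this is not an HA mechanism.
promote_replica() {
  aws rds promote-read-replica --db-instance-identifier "$1"
}

# Usage:
#   replica_lag_max pg-prod-replica-1 2026-06-14T03:00:00Z 2026-06-14T03:30:00Z
#   promote_replica pg-prod-replica-1
```

Whatever lag the first call reports is roughly the data you would stand to lose by promoting at that moment, which is the RPO > 0 row of the table in practice.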

Backups: automated backups, PITR, and snapshots

RDS gives you two backup mechanisms that are easy to conflate: automated backups (which power point-in-time recovery) and manual snapshots (point-in-time images you control).

Automated backups + point-in-time recovery (PITR). When enabled (retention period 1–35 days; setting retention to 0 disables them, which you should not do in production), RDS takes a daily full backup during your backup window and continuously ships the database’s transaction logs to S3, roughly every five minutes. Together these let you restore to any second within the retention window, up to the latest restorable time, which usually trails the present by about five minutes; that is point-in-time recovery. A PITR does not overwrite your running database; it creates a brand-new instance restored to the chosen timestamp (you then repoint your application). The two parameters you control are the retention period (how far back you can rewind) and the backup window (a daily time slot, ideally during low traffic; it can cause a brief I/O dip, and on single-AZ a momentary pause). Automated backups are deleted when you delete the instance (unless you retain them explicitly), a fact that surprises people, which is why long-term retention uses snapshots or AWS Backup.
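
Before a restore it helps to ask RDS how recent a point it can actually reach. A minimal sketch, assuming the `LatestRestorableTime` field returned by `describe-db-instances` and the `--use-latest-restorable-time` flag of the restore command; the function names are hypothetical.

```shell
# Hypothetical helper: the most recent timestamp PITR can restore to.
latest_restorable_time() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].LatestRestorableTime' \
    --output text
}

# Restore to that most recent point as a NEW instance ($2), leaving the
# source instance ($1) untouched.
restore_to_latest() {
  aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier "$1" \
    --target-db-instance-identifier "$2" \
    --use-latest-restorable-time
}

# Usage:
#   latest_restorable_time pg-prod
#   restore_to_latest pg-prod pg-prod-latest
```

In a real restore you would also pass your own subnet group, security group, and parameter group, as the restore notes later in this section explain.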

Manual (DB) snapshots. A snapshot is a user-initiated, full, point-in-time image of the database that persists until you explicitly delete it — it does not expire with the backup-retention window and is not deleted when you delete the instance. Snapshots are incremental under the hood (only changed blocks are stored after the first), so they are cheaper than the name implies, but each restore reconstructs a full new instance. You take them before risky changes, for long-term archival, and to clone environments. You can copy a snapshot (including cross-Region and cross-account, optionally re-encrypting with a different KMS key) and share it with other accounts — the standard pattern for DR and for handing a copy of production to a sandbox account.

| Mechanism | Created by | Retention | Granularity | Survives instance delete? | Cross-Region / account |
| --- | --- | --- | --- | --- | --- |
| Automated backup (PITR) | RDS (daily + tx logs) | 1–35 days (your retention) | Any second in window | No (unless explicitly retained) | Via copy of the automated backup / snapshots |
| Manual snapshot | You | Until you delete it | The instant you took it | Yes | Yes (copy & share, re-encrypt) |
| AWS Backup | AWS Backup service (policy) | Per backup plan (years) | Per plan schedule | Yes (independent vault) | Yes (vaults, vault lock) |

Two restore facts to keep straight: restoring (PITR or from a snapshot) always produces a new instance — there is no in-place rewind — and the restored instance comes up with default parameter and option groups and security settings unless you specify yours, which is the source of the classic “restored DB behaves subtly wrong” incident. Always restore with your intended parameter group, option group, security group, and subnet group.

DB parameter groups and option groups

These two “groups” are how you configure the engine, and they trip people up because they look similar and do different things.

DB parameter groups: engine configuration knobs. A parameter group is a named set of engine configuration parameters, the equivalent of my.cnf/postgresql.conf settings (max_connections, work_mem, innodb_buffer_pool_size, timeouts, logging flags, and so on). The default parameter group is read-only (you cannot edit it), so to change anything you create a custom parameter group, edit it, and attach it to the instance. Parameters come in two kinds: dynamic parameters, which take effect without a reboot, and static parameters, which take effect only after the instance is rebooted.

That reboot rule is a frequent gotcha: you change max_connections (static on some engines), nothing happens, and the value only takes effect after you reboot. Cluster engines (Aurora) additionally split these into a cluster parameter group (cluster-wide settings, e.g. logical replication flags) and a DB parameter group (per-instance settings) — get the level right or your change lands in the wrong place.
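
The reboot gotcha is easy to check before you change anything. The sketch below asks RDS whether a parameter is static or dynamic and whether the instance is still waiting for a reboot. It assumes the `ApplyType` and `ParameterApplyStatus` fields returned by the aws CLI; the function names are hypothetical.

```shell
# Hypothetical helper: show whether a parameter is static or dynamic.
# $1 = parameter group name, $2 = parameter name
param_apply_type() {
  aws rds describe-db-parameters \
    --db-parameter-group-name "$1" \
    --query "Parameters[?ParameterName=='$2'].[ParameterName,ApplyType,ApplyMethod]" \
    --output text
}

# Show each attached parameter group and its apply status
# (for example in-sync or pending-reboot).
param_group_status() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].DBParameterGroups[].[DBParameterGroupName,ParameterApplyStatus]' \
    --output text
}

# Reboot so that pending static changes take effect.
reboot_instance() {
  aws rds reboot-db-instance --db-instance-identifier "$1"
}

# Usage:
#   param_apply_type pg16-custom max_connections
#   param_group_status pg-prod
```

If the status reads pending-reboot, the new value is not live yet, however long ago you saved it.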

Option groups — engine add-on features. An option group enables and configures optional features of an engine that are not plain configuration parameters — bundled add-ons. The clearest examples are Oracle (options like Oracle Enterprise Manager, native network encryption, Time Zone, TDE) and SQL Server (SQLServer Audit, Transparent Data Encryption, SSAS/SSRS integration). MySQL historically used an option group for MariaDB Audit Plugin and MEMCACHED. PostgreSQL largely uses extensions (CREATE EXTENSION) rather than option-group options. Like parameter groups, the default option group is fixed; you create a custom one to add options. The interview-friendly distinction: parameter group = tune the engine’s settings; option group = switch on extra engine features/plugins.

Security: encryption, IAM auth, secrets, and the network

RDS security is layered; a production database should have every layer on.

Encryption at rest (KMS). Turn on storage encryption at creation and RDS encrypts the underlying storage, automated backups, snapshots, and read replicas, transparently, using a KMS key (AWS-managed aws/rds or, better, a customer-managed key for control and auditability). The critical limitation: encryption must be enabled at creation — you cannot encrypt an existing unencrypted instance in place. To encrypt an unencrypted database you take a snapshot, copy the snapshot with encryption enabled, and restore from the encrypted copy. Also, an encrypted snapshot can only be shared cross-account if it uses a customer-managed key you then grant the other account.
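
The snapshot, encrypted copy, restore sequence described above can be sketched as one function. This is an illustration under assumptions, not a production runbook: the function name, snapshot naming scheme, and example identifiers are hypothetical, and a real restore would also pass your own subnet group, security group, and parameter group.

```shell
# Hypothetical helper: produce an encrypted copy of an unencrypted instance.
# $1 = existing unencrypted instance, $2 = name for the new encrypted instance,
# $3 = KMS key ID, ARN, or alias
encrypt_existing_instance() {
  src_instance=$1
  new_instance=$2
  kms_key=$3
  snap="${src_instance}-plain"
  enc_snap="${src_instance}-encrypted"

  # 1. Snapshot the unencrypted instance and wait for it to complete.
  aws rds create-db-snapshot \
    --db-instance-identifier "$src_instance" --db-snapshot-identifier "$snap" &&
  aws rds wait db-snapshot-available --db-snapshot-identifier "$snap" &&
  # 2. Copy the snapshot with a KMS key; the copy is encrypted.
  aws rds copy-db-snapshot \
    --source-db-snapshot-identifier "$snap" \
    --target-db-snapshot-identifier "$enc_snap" \
    --kms-key-id "$kms_key" &&
  aws rds wait db-snapshot-available --db-snapshot-identifier "$enc_snap" &&
  # 3. Restore a NEW, encrypted instance from the encrypted copy.
  aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "$new_instance" --db-snapshot-identifier "$enc_snap"
}

# Usage:
#   encrypt_existing_instance legacy-db legacy-db-enc alias/my-rds-key
```

The original instance keeps running unencrypted throughout, so writes made after the first snapshot are not in the new instance; plan a cutover window or a replication step to cover that gap.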

Encryption in transit (TLS/SSL). RDS provides a CA bundle and certificates so clients can connect over TLS. You can require TLS via a parameter (e.g. rds.force_ssl=1 on Postgres, require_secure_transport on MySQL) so plaintext connections are refused. Always require TLS for production and bundle the RDS root CA into your application’s trust store; rotate the CA before it expires (RDS notifies you).
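
Both halves of the TLS setup can be sketched briefly: the server-side parameter that refuses plaintext, and a client connection string that verifies the server certificate. The example assumes PostgreSQL with a custom parameter group and a CA bundle already downloaded to a local path; the function names, host, and file path are hypothetical. `ApplyMethod=pending-reboot` is used here as the conservative choice, so the change lands at the next reboot.

```shell
# Hypothetical helper: require TLS on a PostgreSQL parameter group.
force_tls_postgres() {
  aws rds modify-db-parameter-group \
    --db-parameter-group-name "$1" \
    --parameters "ParameterName=rds.force_ssl,ParameterValue=1,ApplyMethod=pending-reboot"
}

# Build a libpq connection string that encrypts the session AND verifies the
# server certificate against the CA bundle.
# $1 = host, $2 = database, $3 = user, $4 = path to the CA bundle
pg_tls_conninfo() {
  echo "host=$1 port=5432 dbname=$2 user=$3 sslmode=verify-full sslrootcert=$4"
}

# Usage:
#   force_tls_postgres pg16-custom
#   psql "$(pg_tls_conninfo db.example.internal appdb appadmin ./global-bundle.pem)"
```

`sslmode=verify-full` matters on the client side: a weaker mode such as `require` encrypts the connection but does not check that you are talking to the real server.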

IAM database authentication. Instead of a database password, applications and users can authenticate with an IAM-generated, short-lived auth token (15-minute lifetime), mapped to a database user. This removes long-lived DB passwords from your app config, centralises access in IAM, and is auditable — at the cost of a token-generation step and a connection-rate ceiling. Available for MySQL, PostgreSQL, and MariaDB.
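
The token-generation step looks like this in practice. A minimal sketch, assuming a PostgreSQL instance with IAM authentication enabled and a database user already set up for it; the function name, host, and user are hypothetical, while `generate-db-auth-token` is the standard aws CLI call.

```shell
# Hypothetical helper: mint a short-lived IAM auth token to use in place of a
# database password.
# $1 = DB endpoint hostname, $2 = port, $3 = DB username, $4 = Region
iam_db_token() {
  aws rds generate-db-auth-token \
    --hostname "$1" --port "$2" --username "$3" --region "$4"
}

# Usage (TLS is required for IAM-authenticated connections):
#   HOST=pg-prod.example.ap-south-1.rds.amazonaws.com
#   PGPASSWORD="$(iam_db_token "$HOST" 5432 app_user ap-south-1)" \
#     psql "host=$HOST port=5432 dbname=appdb user=app_user sslmode=verify-full sslrootcert=./global-bundle.pem"
```

Because the token expires after 15 minutes, generate it immediately before connecting rather than caching it in configuration.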

Secrets Manager integration. RDS can integrate with AWS Secrets Manager to store the master credentials and rotate them automatically on a schedule, with the rotation Lambda managed for you (you can even have RDS manage the master password in Secrets Manager from the create flow). This is the recommended way to handle the master password and any application DB users — no plaintext passwords in code or config. (See the dedicated lesson on Secrets Manager rotation for RDS.)

Network: security groups and DB subnet groups. A database lives in a VPC, in private subnets, fronted by a VPC security group whose inbound rule should allow only the database port (3306/5432/1521/1433) from the application’s security group — reference the app SG, do not open a CIDR, and never expose the database publicly. A DB subnet group is the named set of subnets (spanning ≥2 AZs) in which RDS may place the primary, standby, and replicas — it is what makes Multi-AZ possible and is required for any VPC deployment. Set Public access = No for every real database. For least-privilege, combine this with IAM auth and Secrets Manager so that network reachability, authentication, and authorisation are all locked down independently.
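
The "reference the app SG, do not open a CIDR" rule translates to one ingress rule, and the "never public" rule to one audit query. This is a sketch with hypothetical function names and security group IDs; the aws CLI calls themselves are standard.

```shell
# Hypothetical helper: allow the DB port only from the application's
# security group.
# $1 = database security group ID, $2 = application security group ID, $3 = port
allow_app_to_db() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp --port "$3" \
    --source-group "$2"
}

# Audit: list any instances in the current Region that are publicly accessible.
list_public_instances() {
  aws rds describe-db-instances \
    --query 'DBInstances[?PubliclyAccessible==`true`].DBInstanceIdentifier' \
    --output text
}

# Usage:
#   allow_app_to_db sg-0123456789abcdef0 sg-0fedcba9876543210 5432
#   list_public_instances
```

Referencing the source security group means new application hosts gain access by joining that group, with no database rule changes and no IP ranges to maintain.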

Maintenance and upgrades

Aurora: the cloud-native engine

Everything above describes classic RDS, where an instance owns its volume. Aurora keeps the MySQL/PostgreSQL wire protocol and tooling but replaces the storage and replication layer entirely with a purpose-built, distributed system. This is the part interviewers love and the reason teams pay the premium.

The cluster volume — Aurora’s defining idea

In Aurora, the database instances do not each own a disk. Instead, all instances in a cluster share a single cluster volume: a distributed, auto-expanding storage layer that replicates your data six ways across three Availability Zones (two copies per AZ). The consequences are large:

Aurora replicas and failover

An Aurora cluster has one writer (primary) and up to 15 Aurora replicas, all reading the same cluster volume. Because there is no per-instance copy to fall behind, replica lag is typically well under 100 milliseconds, and, crucially, an Aurora replica is simultaneously a read endpoint and a failover target. If the writer fails, Aurora promotes a replica in ~10–30 seconds (you set a failover priority/tier per replica to control which is chosen). This is where Aurora collapses the RDS distinction: the same replicas that scale your reads also provide your high availability. (For the full HA and Global Database treatment, see the linked Aurora high availability and Global Database lesson.)
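
Failover priority and a rehearsal failover can both be driven from the CLI. A minimal sketch with hypothetical function names, using the `--promotion-tier` option of `modify-db-instance` (lower tiers are preferred) and the `failover-db-cluster` operation; the cluster and instance names follow the Aurora examples used elsewhere in this lesson.

```shell
# Hypothetical helper: set a replica's failover priority tier.
# $1 = instance identifier, $2 = tier (0 is the highest priority)
set_failover_tier() {
  aws rds modify-db-instance \
    --db-instance-identifier "$1" \
    --promotion-tier "$2"
}

# Trigger a failover on purpose, promoting a specific replica to writer.
# $1 = cluster identifier, $2 = replica to promote
failover_to() {
  aws rds failover-db-cluster \
    --db-cluster-identifier "$1" \
    --target-db-instance-identifier "$2"
}

# Usage (rehearse outside production first):
#   set_failover_tier aurora-pg-2 0
#   failover_to aurora-pg aurora-pg-2
```

Timing a rehearsal like this against your own application is the honest way to learn what the failover window means for your connection pool and retry settings.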

Aurora endpoints — connect to the right thing

Because a cluster has many instances with changing roles, Aurora gives you managed DNS endpoints rather than asking you to track instances:

| Endpoint | Points to | Use for | Notes |
| --- | --- | --- | --- |
| Cluster (writer) endpoint | The current writer | All writes (and reads needing latest data) | Automatically follows failover to the new writer; your app’s write connection string never changes |
| Reader endpoint | The set of Aurora replicas (load-balanced) | Read-only traffic | Round-robins connections across replicas; add replicas and reads spread automatically |
| Custom endpoint | A chosen subset of instances | Routing (e.g. reporting replicas vs OLTP readers) | You define membership; great for isolating analytics on bigger replicas |
| Instance endpoint | One specific instance | Targeted diagnostics/admin | Avoid hard-coding in apps; it does not follow failover |

The mental model: write to the cluster endpoint, read from the reader endpoint, and never hard-code instance endpoints — the cluster endpoint chases the writer through failovers so your application stays connected without a config change.
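
To find the two endpoints your application should use, and to carve out a custom endpoint for reporting, something like the following works. The function names and the reporting endpoint name are hypothetical; `describe-db-clusters` and `create-db-cluster-endpoint` are standard aws CLI operations.

```shell
# Hypothetical helper: print the writer (cluster) endpoint and the reader
# endpoint for a cluster.
cluster_endpoints() {
  aws rds describe-db-clusters \
    --db-cluster-identifier "$1" \
    --query 'DBClusters[0].[Endpoint,ReaderEndpoint]' \
    --output text
}

# Create a custom read-only endpoint that routes to one chosen instance.
# $1 = cluster identifier, $2 = endpoint name, $3 = member instance identifier
create_reporting_endpoint() {
  aws rds create-db-cluster-endpoint \
    --db-cluster-identifier "$1" \
    --db-cluster-endpoint-identifier "$2" \
    --endpoint-type READER \
    --static-members "$3"
}

# Usage:
#   cluster_endpoints aurora-pg
#   create_reporting_endpoint aurora-pg reporting aurora-pg-2
```

Put the two values printed by `cluster_endpoints` in your application configuration as the write and read connection hosts, and keep instance endpoints out of it.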

Aurora Serverless v2 — autoscaling capacity

Aurora Serverless v2 lets the database scale compute up and down automatically in fine-grained increments, measured in Aurora Capacity Units (ACUs) (each ACU ≈ 2 GiB RAM with proportional CPU). You set a minimum and maximum ACU range, and Aurora adjusts capacity in seconds in response to load — without dropping connections. It mixes with provisioned instances in the same cluster (e.g. a provisioned writer and Serverless v2 readers, or fully serverless), supports Multi-AZ and Global Database, and is ideal for variable, spiky, or unpredictable workloads and dev/test environments where a fixed instance would be over- or under-sized. (v2 replaced the older v1, which paused to zero but scaled coarsely and had more limitations.) The trade-off: at sustained high utilisation a right-sized provisioned instance can be cheaper, so use Serverless v2 where the variability is the point.
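
The ACU range is set on the cluster, and serverless capacity is added as instances of class `db.serverless`. The sketch below assumes the approximate 2 GiB-per-ACU figure quoted above for the memory arithmetic; the function names and the 0.5 to 8 ACU range are hypothetical examples.

```shell
# Approximate memory for a given ACU count (assumes ~2 GiB per ACU).
acu_to_gib() {
  awk -v acu="$1" 'BEGIN { print acu * 2 }'
}

# Hypothetical helper: set the min/max ACU range on a cluster.
# $1 = cluster identifier, $2 = minimum ACUs, $3 = maximum ACUs
set_acu_range() {
  aws rds modify-db-cluster \
    --db-cluster-identifier "$1" \
    --serverless-v2-scaling-configuration "MinCapacity=$2,MaxCapacity=$3"
}

# Add a Serverless v2 reader to an existing Aurora PostgreSQL cluster.
# $1 = cluster identifier, $2 = new instance identifier
add_serverless_reader() {
  aws rds create-db-instance \
    --db-cluster-identifier "$1" --db-instance-identifier "$2" \
    --engine aurora-postgresql --db-instance-class db.serverless
}

acu_to_gib 0.5   # prints 1  (about 1 GiB at the minimum)
acu_to_gib 8     # prints 16 (about 16 GiB at the maximum)

# Usage:
#   set_acu_range aurora-pg 0.5 8
#   add_serverless_reader aurora-pg aurora-pg-sv2-1
```

Translating the ACU range into memory like this is a quick sanity check that the maximum can hold your working set and that the minimum is not so low that the cache is cold after every quiet period.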

Aurora Global Database — cross-Region at scale

Aurora Global Database spans one primary Region (read/write) and up to five secondary Regions (read-only), with replication performed by the storage layer at typically sub-second lag and minimal impact on the primary. It provides low-latency local reads worldwide and disaster recovery: you can promote a secondary Region to be the new primary, with an RTO typically under a minute and an RPO usually measured in seconds (and managed planned failover for zero-data-loss region switches). This is the heavyweight answer to “what if an entire AWS Region is unavailable” — far beyond a single cross-Region read replica. The linked Aurora HA lesson covers the failover mechanics in depth.

RDS vs Aurora — the decision

| Dimension | Classic RDS | Aurora |
| --- | --- | --- |
| Storage | One attached volume per instance | Shared cluster volume, 6× across 3 AZs, auto-grows to 128 TiB |
| Replicas | Async read replicas (lag seconds); separate from HA | Up to 15 replicas, ~ms lag, also failover targets |
| Failover | Multi-AZ standby, ~60–120 s | Replica promotion, ~10–30 s |
| Read scaling | Read replicas (own endpoints) | Reader endpoint load-balances replicas |
| Cross-Region | Cross-Region read replica | Global Database (sub-second, ≤5 regions) |
| Autoscaling compute | No (fixed class) | Serverless v2 (ACUs) |
| Cost | Lower hourly/I-O | Higher hourly + per-I/O, but more capability |
| Choose when | Need Oracle/SQL Server; modest scale; lowest cost on MySQL/Postgres | MySQL/Postgres needing high throughput, fast/many replicas, fast failover, or global reach |

The RDS & Aurora landscape at a glance

The diagram below ties the pieces together: a DB instance with its engine, class, and storage; the Multi-AZ standby in a second AZ versus read replicas for scale; where automated backups, PITR and snapshots attach; and, on the Aurora side, the shared cluster volume with the writer/reader endpoints and Global Database.

[Figure: Amazon RDS & Aurora deep dive]

Use it as the map for the rest of this lesson — every box (instance class, storage type, Multi-AZ standby, read replica, snapshot, Aurora cluster volume, endpoints) corresponds to a section above explaining its choices, defaults, and trade-offs.

Creating a DB instance: every setting

When you click Create database (or run aws rds create-db-instance), these are the fields, with the what/choices/default/when/trade-off/gotcha treatment:

| Setting | Choices | Default | When / trade-off / gotcha |
| --- | --- | --- | --- |
| Creation method | Standard create / Easy create | Standard | Easy create hides options behind best-practice defaults; use Standard to control everything. |
| Engine | MySQL / PostgreSQL / MariaDB / Oracle / SQL Server / Aurora | | Drives every later option; Aurora switches to the cluster model. |
| Edition / version | Engine-specific; major + minor | Latest recommended | Avoid deprecated versions; pick a current major you can support. |
| Templates | Production / Dev-Test / Free Tier | | “Production” pre-selects Multi-AZ + provisioned storage; “Free Tier” caps you to db.t3/t4g.micro, 20 GiB, single-AZ. |
| DB instance identifier | Name (unique per Region/account) | | Cannot be changed without rename + endpoint change; choose deliberately. |
| Master username / password | String / password or Secrets Manager | admin/postgres | Prefer manage in Secrets Manager so no plaintext password; rotate later. |
| Instance class | db.t/m/r/x families | db.m/db.t | Graviton (g) for price/perf; t only for small/dev (see classes section). |
| Storage type | gp3 / gp2 / io1 / io2 / magnetic | gp3 | gp3 default; io2 for demanding IOPS; never magnetic for new dbs. |
| Allocated storage | GiB (engine min/max) | 20–100 GiB | Can grow, never shrink; cooldown between resizes. |
| Storage autoscaling | On (+ max threshold) / Off | On | Always on for prod with a sane max; only grows. |
| Provisioned IOPS / throughput | Number (gp3 above baseline, io1/io2) | Baseline | Provision performance independently of size on gp3/io. |
| Multi-AZ deployment | Single / Multi-AZ instance / Multi-AZ cluster | Single (Prod template: Multi-AZ) | HA, not scaling; cluster variant adds readable standbys; costs 2×/3×. |
| VPC / DB subnet group | A VPC + subnet group (≥2 AZs) | Default VPC | Required; must span ≥2 AZs for Multi-AZ; pick private subnets. |
| Public access | Yes / No | No | Keep No for every real database. |
| VPC security group(s) | New / existing | New | Allow only the DB port from the app’s SG; never a wide CIDR. |
| Availability Zone | Pick / no preference | No preference | Usually leave to RDS unless co-locating with an app tier. |
| Database port | Number | 3306/5432/1521/1433 | Change only if you must; security-group rule must match. |
| DB parameter group | Default / custom | Default | Attach a custom group to tune the engine (default is read-only). |
| Option group | Default / custom | Default | For engine add-ons (Oracle/SQL Server features). |
| Authentication | Password / IAM / Kerberos | Password | Add IAM auth to drop long-lived DB passwords (MySQL/Postgres/MariaDB). |
| Encryption at rest | On (KMS key) / Off | On | Must be set at creation; you cannot encrypt in place later. |
| Backup retention | 0–35 days | 7 (1 for some) | Never 0 in prod; sets the PITR window. |
| Backup window | Pick / no preference | No preference | Low-traffic slot; brief I/O dip. |
| Maintenance window | Pick / no preference | No preference | Low-traffic slot for patching. |
| Auto minor version upgrade | On / Off | On | Convenient; turn off if you gate upgrades manually. |
| Deletion protection | On / Off | Off (Prod template: On) | Turn on for prod; it blocks accidental deletes. |
| Monitoring | Enhanced Monitoring / Performance Insights | Off | Turn on Performance Insights (free tier of it) to diagnose load. |
| Log exports | Per-engine logs to CloudWatch | Off | Export error/slow/audit logs to CloudWatch for retention and alerting. |

aws CLI — the core operations

# Variables
REGION=ap-south-1
SUBNET_GROUP=db-subnet-group-lab
SG_ID=sg-0123456789abcdef0

# Create a DB subnet group spanning two AZs (private subnets)
aws rds create-db-subnet-group \
  --db-subnet-group-name $SUBNET_GROUP \
  --db-subnet-group-description "Lab private subnets" \
  --subnet-ids subnet-aaa111 subnet-bbb222 \
  --region $REGION

# Create a Multi-AZ PostgreSQL instance, encrypted, gp3, autoscaling, master password in Secrets Manager
aws rds create-db-instance \
  --db-instance-identifier pg-prod \
  --engine postgres --engine-version 16 \
  --db-instance-class db.m6g.large \
  --storage-type gp3 --allocated-storage 100 --max-allocated-storage 500 \
  --multi-az \
  --master-username appadmin --manage-master-user-password \
  --db-subnet-group-name $SUBNET_GROUP \
  --vpc-security-group-ids $SG_ID \
  --no-publicly-accessible \
  --storage-encrypted \
  --backup-retention-period 7 \
  --deletion-protection \
  --enable-performance-insights \
  --region $REGION

# Add a read replica (read scaling) — its own endpoint, async replication
aws rds create-db-instance-read-replica \
  --db-instance-identifier pg-prod-replica-1 \
  --source-db-instance-identifier pg-prod \
  --db-instance-class db.m6g.large \
  --region $REGION

# Create a custom parameter group and set a dynamic parameter
aws rds create-db-parameter-group \
  --db-parameter-group-name pg16-custom \
  --db-parameter-group-family postgres16 \
  --description "Custom PG16 params"
aws rds modify-db-parameter-group \
  --db-parameter-group-name pg16-custom \
  --parameters "ParameterName=log_min_duration_statement,ParameterValue=500,ApplyMethod=immediate"
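# Attach the group. The association only takes effect after one reboot; once it is
# in sync, later changes to dynamic parameters in this group apply without a reboot.
aws rds modify-db-instance \
  --db-instance-identifier pg-prod \
  --db-parameter-group-name pg16-custom --apply-immediately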

# Take a manual snapshot, then copy it to another Region (DR), re-encrypting
aws rds create-db-snapshot \
  --db-instance-identifier pg-prod --db-snapshot-identifier pg-prod-pre-change
aws rds copy-db-snapshot \
  --source-db-snapshot-identifier arn:aws:rds:ap-south-1:111122223333:snapshot:pg-prod-pre-change \
  --target-db-snapshot-identifier pg-prod-dr-copy \
  --kms-key-id alias/dr-key --source-region ap-south-1 --region ap-southeast-1

# Point-in-time recovery — creates a NEW instance restored to a timestamp
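# Name the security group and parameter group explicitly; a restore otherwise falls back to defaults
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier pg-prod \
  --target-db-instance-identifier pg-prod-pitr \
  --restore-time 2026-06-14T03:30:00Z \
  --db-subnet-group-name $SUBNET_GROUP \
  --vpc-security-group-ids $SG_ID \
  --db-parameter-group-name pg16-custom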
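
Before running a restore like the one above, it can help to confirm that the target timestamp falls inside the instance's restorable window. The following is a minimal sketch under stated assumptions, not part of the RDS CLI: in_restore_window is a hypothetical helper name, the sample window values are made up, and all three timestamps are assumed to be UTC ISO-8601 strings so that comparing their first 19 characters matches time order.

```shell
# Hypothetical helper: succeeds when $1 lies between $2 and $3 (inclusive).
# Assumes all three are UTC ISO-8601 timestamps; only the first 19 characters
# (YYYY-MM-DDTHH:MM:SS) are compared, so fractional seconds and the zone
# suffix are ignored.
in_restore_window() (
  target=$(printf '%s' "$1" | cut -c1-19)
  earliest=$(printf '%s' "$2" | cut -c1-19)
  latest=$(printf '%s' "$3" | cut -c1-19)
  # sort -c exits 0 only when the lines are already in ascending order
  printf '%s\n%s\n%s\n' "$earliest" "$target" "$latest" | LC_ALL=C sort -c 2>/dev/null
)

# Real lookup of the window (needs AWS credentials):
# aws rds describe-db-instance-automated-backups --db-instance-identifier pg-prod \
#   --query "DBInstanceAutomatedBackups[0].RestoreWindow" --output json

# Sample values so the sketch runs offline
if in_restore_window "2026-06-14T03:30:00Z" \
     "2026-06-08T02:10:00+00:00" "2026-06-14T09:55:00.123000+00:00"; then
  echo "timestamp is restorable"
else
  echo "timestamp is outside the restore window"
fi
```

If the check fails, the usual causes are a retention period shorter than expected or a target time newer than the latest restorable time.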

aws CLI — an Aurora cluster

# Create the Aurora PostgreSQL CLUSTER (the cluster owns the storage volume)
aws rds create-db-cluster \
  --db-cluster-identifier aurora-pg \
  --engine aurora-postgresql --engine-version 16 \
  --master-username appadmin --manage-master-user-password \
  --db-subnet-group-name $SUBNET_GROUP \
  --vpc-security-group-ids $SG_ID \
  --storage-encrypted --backup-retention-period 7 \
  --region $REGION

# Add the writer instance, then a reader (both attach to the same cluster volume)
aws rds create-db-instance \
  --db-cluster-identifier aurora-pg --db-instance-identifier aurora-pg-1 \
  --engine aurora-postgresql --db-instance-class db.r6g.large
aws rds create-db-instance \
  --db-cluster-identifier aurora-pg --db-instance-identifier aurora-pg-2 \
  --engine aurora-postgresql --db-instance-class db.r6g.large

# A Serverless v2 reader in the same cluster (set the ACU range on the cluster)
aws rds modify-db-cluster --db-cluster-identifier aurora-pg \
  --serverless-v2-scaling-configuration MinCapacity=0.5,MaxCapacity=16
aws rds create-db-instance \
  --db-cluster-identifier aurora-pg --db-instance-identifier aurora-pg-sv2 \
  --engine aurora-postgresql --db-instance-class db.serverless

# Inspect the cluster endpoints (writer + reader)
aws rds describe-db-clusters --db-cluster-identifier aurora-pg \
  --query "DBClusters[0].{writer:Endpoint, reader:ReaderEndpoint, status:Status}" --output table
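
The describe call above returns only the writer and reader endpoints. As a hedged sketch of the third kind, a custom endpoint that pins reporting traffic to one reader could be created roughly as follows. The endpoint name reporting and the DRY_RUN wrapper are illustrative assumptions; aurora-pg and aurora-pg-2 come from the commands above.

```shell
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to execute it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Create a READER-type custom endpoint whose only static member is aurora-pg-2.
# "reporting" is a made-up endpoint identifier.
create_reporting_endpoint() {
  run aws rds create-db-cluster-endpoint \
    --db-cluster-identifier aurora-pg \
    --db-cluster-endpoint-identifier reporting \
    --endpoint-type READER \
    --static-members aurora-pg-2
}

create_reporting_endpoint

# Afterwards, list every endpoint of the cluster, custom ones included:
# aws rds describe-db-cluster-endpoints --db-cluster-identifier aurora-pg \
#   --query "DBClusterEndpoints[].{type:EndpointType, host:Endpoint, status:Status}" \
#   --output table
```

A static member list keeps the reporting pool fixed; new readers added later do not join it unless the endpoint is modified.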

After creation: what you can (and can’t) change

Operation | Possible after creation? | Notes
Resize instance class | Yes | Needs a reboot (on Multi-AZ it is applied via a brief failover). Changing to or from Graviton is supported.
Grow storage | Yes | Never shrink; cooldown between resizes; online for gp3 and io1/io2.
Shrink storage | No | Dump/reload into a smaller instance.
Change storage type (e.g. gp2→gp3) | Yes | Online conversion; do it for cost/flexibility.
Enable/disable Multi-AZ | Yes | Add a standby (sync seeding) or remove one, with no data loss.
Convert single → Multi-AZ cluster | Limited | Often needs restore/Blue-Green, not a simple toggle.
Add/remove read replicas | Yes | Replicas are independent instances; create/delete freely.
Promote a read replica | Yes | Makes it standalone read/write; lossy for unreplicated changes.
Change backup retention / window | Yes | Increasing retention is immediate; decreasing drops old recovery points.
Attach a different parameter group | Yes | The new group takes effect after a reboot, as do static-parameter changes; watch pending-reboot.
Attach a different option group | Yes | Some options need a reboot; some can’t be removed once data uses them.
Encrypt an unencrypted DB | No (in place) | Snapshot → copy with encryption → restore from the encrypted copy.
Change KMS key | No (in place) | Copy snapshot with the new key → restore.
Rename the instance | Yes | Changes the endpoint DNS name, so update your app.
Major version upgrade | Yes | Disruptive in place; prefer Blue/Green.
Enable IAM auth / Performance Insights | Yes | Modify operation; IAM auth needs the DB user mapping too.

The hard "no"s to memorise: you cannot shrink storage in place (dump and reload into a smaller instance instead), and you cannot encrypt or re-key an existing database in place (that takes the snapshot-copy-restore dance).
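
The snapshot, copy, restore sequence for encrypting or re-keying can be sketched as below. This is an illustration under assumptions, not a turnkey script: encrypt_by_copy is a hypothetical helper, the instance and key names are placeholders, and the DRY_RUN wrapper only prints the commands unless you set DRY_RUN=0. A real restore would also name the subnet group, security group, and parameter group.

```shell
# DRY_RUN=1 (default) prints each command instead of running it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Hypothetical helper: encrypt_by_copy SOURCE_INSTANCE KMS_KEY NEW_INSTANCE
encrypt_by_copy() (
  src=$1; kms=$2; new=$3
  # 1. Snapshot the unencrypted (or wrongly keyed) instance
  run aws rds create-db-snapshot --db-instance-identifier "$src" \
    --db-snapshot-identifier "$src-plain"
  run aws rds wait db-snapshot-available --db-snapshot-identifier "$src-plain"
  # 2. Copy the snapshot; supplying a KMS key is what encrypts (or re-keys) the copy
  run aws rds copy-db-snapshot --source-db-snapshot-identifier "$src-plain" \
    --target-db-snapshot-identifier "$src-enc" --kms-key-id "$kms"
  run aws rds wait db-snapshot-available --db-snapshot-identifier "$src-enc"
  # 3. Restore a NEW instance from the encrypted copy, then repoint the application
  run aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier "$new" --db-snapshot-identifier "$src-enc"
)

# Placeholder names for illustration
encrypt_by_copy legacy-db alias/app-key legacy-db-encrypted
```

Writes that land on the source after step 1 are not in the new instance, so plan a write freeze or a short cutover window.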

Hands-on lab

In this lab you create a tiny Free Tier PostgreSQL instance, connect, take a snapshot, observe a parameter change, then clean everything up. Uses the aws CLI (CloudShell or local). Stay within Free Tier limits and delete promptly to keep the cost at essentially zero.

0. Variables and a DB subnet group. (Assumes you have a VPC with two private subnets in different AZs and a security group allowing port 5432 from your client/app.)

REGION=ap-south-1
SG_ID=sg-0123456789abcdef0   # allows 5432 inbound from your app/client
aws rds create-db-subnet-group \
  --db-subnet-group-name lab-subnets \
  --db-subnet-group-description "lab" \
  --subnet-ids subnet-aaa111 subnet-bbb222 --region $REGION

1. Create a Free-Tier PostgreSQL instance (single-AZ, db.t3.micro, 20 GiB, encrypted).

aws rds create-db-instance \
  --db-instance-identifier lab-pg \
  --engine postgres --engine-version 16 \
  --db-instance-class db.t3.micro \
  --storage-type gp3 --allocated-storage 20 --max-allocated-storage 50 \
  --master-username labadmin --manage-master-user-password \
  --db-subnet-group-name lab-subnets \
  --vpc-security-group-ids $SG_ID \
  --no-publicly-accessible --storage-encrypted \
  --backup-retention-period 1 \
  --region $REGION
aws rds wait db-instance-available --db-instance-identifier lab-pg --region $REGION

Expected: after a few minutes the wait returns and the instance is available.

2. Inspect the instance and find its endpoint.

aws rds describe-db-instances --db-instance-identifier lab-pg \
  --query "DBInstances[0].{status:DBInstanceStatus, az:AvailabilityZone, multiAZ:MultiAZ, endpoint:Endpoint.Address, encrypted:StorageEncrypted}" \
  --output table --region $REGION

Expected output (abridged):

-----------------------------------------------------------
|                  DescribeDBInstances                    |
+-----------+-------------+----------+----------+----------+
| status    | az          | multiAZ  | encrypted| endpoint |
+-----------+-------------+----------+----------+----------+
| available | ap-south-1a | False    | True     | lab-pg…  |
+-----------+-------------+----------+----------+----------+

Note multiAZ: False (Free Tier is single-AZ) and encrypted: True.

3. Retrieve the managed master password from Secrets Manager and connect.

SECRET_ARN=$(aws rds describe-db-instances --db-instance-identifier lab-pg \
  --query "DBInstances[0].MasterUserSecret.SecretArn" --output text --region $REGION)
aws secretsmanager get-secret-value --secret-id "$SECRET_ARN" \
  --query SecretString --output text --region $REGION
# Then (from a host in the VPC with psql installed):
# psql "host=<endpoint> port=5432 dbname=postgres user=labadmin sslmode=require"

Expected: a JSON blob containing the password. Because sslmode=require makes psql refuse an unencrypted session, a successful connection confirms TLS is working.

4. Take a manual snapshot (persists independently of the instance).

aws rds create-db-snapshot --db-instance-identifier lab-pg \
  --db-snapshot-identifier lab-pg-snap --region $REGION
aws rds wait db-snapshot-available --db-snapshot-identifier lab-pg-snap --region $REGION
aws rds describe-db-snapshots --db-snapshot-identifier lab-pg-snap \
  --query "DBSnapshots[0].{id:DBSnapshotIdentifier, type:SnapshotType, status:Status}" \
  --output table --region $REGION

Expected: type: manual, status: available.

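5. Change a dynamic parameter via a custom parameter group (one reboot to attach the group).

aws rds create-db-parameter-group --db-parameter-group-name lab-pg16 \
  --db-parameter-group-family postgres16 --description "lab" --region $REGION
aws rds modify-db-parameter-group --db-parameter-group-name lab-pg16 \
  --parameters "ParameterName=log_min_duration_statement,ParameterValue=200,ApplyMethod=immediate" \
  --region $REGION
aws rds modify-db-instance --db-instance-identifier lab-pg \
  --db-parameter-group-name lab-pg16 --apply-immediately --region $REGION
# Switching to a different parameter group only takes effect after a reboot
aws rds wait db-instance-available --db-instance-identifier lab-pg --region $REGION
aws rds reboot-db-instance --db-instance-identifier lab-pg --region $REGION
aws rds wait db-instance-available --db-instance-identifier lab-pg --region $REGION

Expected: the modify is accepted and the instance shows the new group as pending-reboot until the reboot completes. Attaching a different parameter group always needs that one reboot, even when every changed parameter is dynamic. From then on, edits to dynamic parameters such as log_min_duration_statement in lab-pg16 apply without a reboot, while a static parameter would put the instance back in pending-reboot.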
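
To see whether the instance is actually using the group yet, you can read the ParameterApplyStatus field from describe-db-instances. The sketch below maps that status to a next action; next_step_for_status is a hypothetical helper, and the sample STATUS value is hard-coded so the snippet runs without AWS access.

```shell
# Hypothetical helper: map a ParameterApplyStatus value to the next action.
next_step_for_status() {
  case "$1" in
    in-sync)        echo "nothing to do" ;;
    pending-reboot) echo "reboot required" ;;
    applying)       echo "wait and re-check" ;;
    *)              echo "unknown status: $1" ;;
  esac
}

# Real lookup (needs AWS credentials); prints e.g. in-sync or pending-reboot:
# STATUS=$(aws rds describe-db-instances --db-instance-identifier lab-pg \
#   --query "DBInstances[0].DBParameterGroups[0].ParameterApplyStatus" \
#   --output text --region "$REGION")
STATUS=pending-reboot   # sample value so the sketch runs offline
next_step_for_status "$STATUS"

# If a reboot is required:
# aws rds reboot-db-instance --db-instance-identifier lab-pg --region "$REGION"
```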

6. Cleanup — delete the instance (skip the final snapshot) and the artefacts.

aws rds delete-db-instance --db-instance-identifier lab-pg \
  --skip-final-snapshot --delete-automated-backups --region $REGION
aws rds wait db-instance-deleted --db-instance-identifier lab-pg --region $REGION
aws rds delete-db-snapshot --db-snapshot-identifier lab-pg-snap --region $REGION
aws rds delete-db-parameter-group --db-parameter-group-name lab-pg16 --region $REGION
aws rds delete-db-subnet-group --db-subnet-group-name lab-subnets --region $REGION

Validation: aws rds describe-db-instances --db-instance-identifier lab-pg eventually returns DBInstanceNotFound. (If deletion is blocked, you left deletion protection on — modify-db-instance --no-deletion-protection first.)

Cost note (INR-aware): a db.t3.micro with 20 GiB gp3 is covered by the RDS Free Tier for the first 12 months (750 instance-hours/month, 20 GiB storage, 20 GiB backups). Outside Free Tier it is only a few hundred rupees per month if left running. The things that quietly cost money: the manual snapshot (cheap — only changed blocks — but it persists until you delete it, which is why step 6 deletes it explicitly), a forgotten read replica (a full extra instance’s worth of cost), and Multi-AZ (doubles instance cost). Deleting the instance with --skip-final-snapshot --delete-automated-backups and then removing the manual snapshot leaves nothing billing.

Common mistakes & troubleshooting

Symptom | Likely cause | Fix
Added a read replica but a writer failure still caused an outage | Confused read replica (scaling) with Multi-AZ (HA); replicas don’t auto-fail-over | Enable Multi-AZ for HA; keep replicas for read scale; they are orthogonal.
Reads return stale data right after a write | Reading from an async read replica (replica lag) | Read-after-write from the primary/cluster endpoint; only send lag-tolerant reads to replicas.
Restored database “behaves wrong” / wrong settings | Restore came up with default parameter/option/security groups | Always restore specifying your parameter group, option group, SG, and subnet group.
Changed a parameter, nothing happened | It’s a static parameter, or the parameter group was only just attached (pending-reboot) | Reboot the instance to apply it; check the parameter’s “Apply type”.
“Can’t encrypt my existing database” | Encryption must be set at creation | Snapshot → copy-db-snapshot with --kms-key-id → restore from the encrypted copy.
db.t instance throttling under steady load | Burstable class exhausted CPU credits | Move to db.m/db.r; reserve t for dev/spiky workloads.
Storage filled up and the DB went into storage-full | Storage autoscaling disabled or max threshold hit | Enable storage autoscaling with a sane max; grow storage (cannot shrink later).
Can’t connect to the database from the app | Public access on/wrong SG; DB in a subnet the app can’t reach | Keep public access off; allow the DB port from the app’s SG; check the DB subnet group and routing.
Snapshot won’t share to another account | Snapshot encrypted with the default aws/rds key | Re-encrypt the snapshot copy with a customer-managed KMS key, then share the key + snapshot.
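
For the stale-read row, one simple pattern is to decide per query whether the current replica lag is acceptable. The sketch below is only an illustration: pick_read_target is a hypothetical helper, lag and tolerance are assumed to be whole seconds, and the commented CloudWatch call uses placeholder timestamps.

```shell
# Hypothetical helper: choose where to send a read.
# $1 = current replica lag in whole seconds
# $2 = staleness the query can tolerate, in whole seconds
pick_read_target() {
  if [ "$1" -le "$2" ]; then echo replica; else echo primary; fi
}

# RDS publishes ReplicaLag to CloudWatch (namespace AWS/RDS), for example:
# aws cloudwatch get-metric-statistics --namespace AWS/RDS --metric-name ReplicaLag \
#   --dimensions Name=DBInstanceIdentifier,Value=pg-prod-replica-1 \
#   --start-time 2026-06-14T03:00:00Z --end-time 2026-06-14T03:05:00Z \
#   --period 60 --statistics Average

pick_read_target 2 5    # dashboard query that tolerates 5 s of staleness
pick_read_target 12 0   # read-after-write: tolerates no staleness
```

Reads that must see the caller's own write belong on the primary regardless of measured lag, which is why the second call uses a tolerance of 0.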

Best practices

Security notes

Interview & exam questions

  1. What is the difference between Multi-AZ and a read replica? Multi-AZ is for high availability: a synchronous standby in another AZ that fails over automatically (~60–120 s, no data loss) and on classic RDS serves no traffic. A read replica is for read scaling: an asynchronous, read-only copy with its own endpoint that does not fail over automatically (promotion is manual and lossy). They solve different problems and are often used together.

  2. Does a Multi-AZ standby serve read traffic? For a Multi-AZ instance deployment, no — the standby is passive. For a Multi-AZ cluster deployment, yes — its two standbys are readable. That distinction is a favourite exam trap.

  3. How does point-in-time recovery work, and does it overwrite my database? RDS keeps daily backups plus continuous transaction logs (within the retention period, 1–35 days), letting you restore to any second. It does not overwrite the running instance — it creates a new instance at the chosen timestamp; you then repoint your app.

  4. Can you encrypt an existing unencrypted RDS instance? Not in place. You take a snapshot, copy it with encryption enabled (choosing a KMS key), and restore from the encrypted copy. Encryption (and the KMS key) is fixed at creation.

  5. What’s the difference between a DB parameter group and an option group? A parameter group tunes engine configuration (max_connections, work_mem, etc.); static parameters need a reboot, dynamic ones don’t. An option group enables engine add-on features (Oracle TDE/OEM, SQL Server Audit/TDE, MariaDB audit plugin). One configures, the other switches features on.

  6. What is the Aurora cluster volume and why does it matter? A distributed storage layer shared by all instances in the cluster, replicated six ways across three AZs, auto-growing to 128 TiB and self-healing. It decouples storage from compute, so adding a replica copies no data (fast, cheap replicas), failover is fast (~10–30 s), and durability/availability are very high (survives an AZ + one copy).

  7. What are the Aurora endpoints and when do you use each? The cluster (writer) endpoint for writes (it follows failover to the new writer); the reader endpoint to load-balance reads across replicas; custom endpoints to target a chosen subset (e.g. reporting); instance endpoints only for diagnostics (they don’t follow failover).

  8. When would you choose Aurora over classic RDS? For MySQL/PostgreSQL workloads needing high throughput, many/fast replicas, sub-30-second failover, autoscaling compute (Serverless v2), or global reach (Global Database). Stay on classic RDS for Oracle/SQL Server, for the lowest cost on modest MySQL/Postgres, or where Aurora’s per-I/O pricing is unfavourable.

  9. What does Aurora Serverless v2 scale, and how is it measured? It scales compute (CPU/memory) automatically and in fine increments, measured in Aurora Capacity Units (ACUs) between a min and max you set, without dropping connections — ideal for variable/spiky workloads.

  10. How is Aurora Global Database different from a cross-Region read replica? Global Database replicates at the storage layer with typically sub-second lag across up to five secondary Regions, with fast (sub-minute) managed region failover for DR — far more capable and lower-impact than a single async cross-Region read replica.

  11. Your reporting query load is crushing the primary’s CPU but availability is fine — what do you do? Add one or more read replicas (or, on Aurora, point reporting at the reader endpoint or a custom endpoint), and route read-only reporting traffic there. Do not reach for Multi-AZ — that doesn’t add read capacity (on the instance variant).

  12. What happens to automated backups when you delete an RDS instance? They are deleted with the instance (unless you explicitly retain them), whereas manual snapshots persist until you delete them. For long-term retention, use manual snapshots or AWS Backup.
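
Question 12 is why a production teardown usually differs from the lab cleanup. A hedged sketch of a delete that keeps recovery points follows; retire_instance is a hypothetical helper, the identifiers are placeholders, and the DRY_RUN wrapper prints the commands unless you set DRY_RUN=0.

```shell
# DRY_RUN=1 (default) prints each command instead of running it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Hypothetical helper: retire_instance INSTANCE_ID DATE_STAMP
retire_instance() (
  id=$1; stamp=$2
  # Deletion protection blocks the delete, so lift it first
  run aws rds modify-db-instance --db-instance-identifier "$id" \
    --no-deletion-protection --apply-immediately
  # Take a final manual snapshot and retain the automated backups
  run aws rds delete-db-instance --db-instance-identifier "$id" \
    --final-db-snapshot-identifier "$id-final-$stamp" \
    --no-delete-automated-backups
)

# Placeholder names for illustration
retire_instance pg-prod 20260614
```

The final snapshot persists until you delete it, whereas retained automated backups still age out according to the retention period that was in force.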

Quick check

  1. Which RDS feature provides automatic failover with no data loss, and does the standby serve traffic on the instance variant?
  2. True or false: a read replica automatically becomes the new writer if the primary fails.
  3. You need to restore your database to 03:30 this morning. Which feature, and does it overwrite the current instance?
  4. You changed max_connections (a static parameter) but it hasn’t taken effect — why?
  5. Which Aurora endpoint should an application use for its write connection, and why is hard-coding an instance endpoint a bad idea?

Answers

  1. Multi-AZ. On the instance variant the standby serves no traffic; on the cluster variant the two standbys are readable.
  2. False. Read replicas do not fail over automatically; you must manually promote one (and unreplicated changes are lost).
  3. Point-in-time recovery (PITR). It does not overwrite the current instance — it creates a new instance restored to that timestamp.
  4. It’s a static parameter, so it only takes effect after a reboot; the instance is in pending-reboot until then.
  5. The cluster (writer) endpoint — it always points at the current writer and follows failover, so the app’s write connection survives a failover; an instance endpoint is pinned to one instance and won’t move after a failover.

Exercise

Design and (optionally) build the database tier for a moderately busy web application that must (a) survive the loss of an Availability Zone with no data loss, (b) offload a heavy nightly reporting job from the primary, (c) keep a disaster-recovery copy in a second AWS Region, and (d) hold no plaintext database credentials anywhere. Specify: the engine and instance class (justify Graviton vs Intel and the family), storage type and whether autoscaling is on, the Multi-AZ choice (instance vs cluster) for requirement (a), how you satisfy (b) without compromising (a), how you satisfy (c) (cross-Region read replica vs Aurora Global Database, and why), and how you satisfy (d). Then write the aws rds commands to create the primary, the read replica, and a cross-Region snapshot copy. Finally, state your RTO and RPO for an AZ failure and for a full-Region failure, and explain which feature delivers each.

Certification mapping

Glossary

Next steps

You now know RDS and Aurora end to end — every engine and class, the storage model, the all-important Multi-AZ-vs-read-replica distinction, backups and PITR, parameter and option groups, security, and the Aurora cluster volume, endpoints, Serverless v2, and Global Database. From here:

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
