
AWS KMS in Depth: Multi-Region Keys, Envelope Encryption, Key Policies, and Grants

Most teams treat KMS as a checkbox: tick “encryption at rest,” pick the AWS-managed key, move on. That works until you need cross-Region DR, cross-account data sharing, a key you can prove only one role can use, or 100k decrypts a second on a hot path. At that point KMS stops being a checkbox and becomes an authorization system with a latency budget and a request quota. This guide treats it that way — the key types, how envelope encryption actually moves bytes, multi-Region keys, the three-layer authorization model, and the operational edges (rotation, quotas, audit) that bite at scale.

The single most important fact, the one every later decision falls out of: your plaintext almost never goes to KMS, and KMS key material never leaves KMS. A KMS key (formerly called a customer master key, or CMK, the shorthand this guide keeps using) is a logical reference to material that lives inside FIPS 140-3 validated HSMs. You cannot export it. What you can do is ask KMS to wrap and unwrap small blobs, and the standard pattern is to have KMS wrap a data key that you then use locally to encrypt the actual payload. That is envelope encryption.

By the end you will stop guessing about encryption design. When someone asks “can the DR Region read this?”, “who can actually call Decrypt on that key?”, or “why is the KMS bill suddenly four figures?”, you will know the mechanism, the exact CLI to confirm it, and the fix. Because this is a reference you will return to mid-incident, the key types, the conditions, the limits, the errors and the cost levers are all laid out as scannable tables — read the prose once, then keep the tables open when the pager goes off.

What problem this solves

Encryption-at-rest is easy to enable and hard to get right the moment your data or your blast radius crosses a boundary. The defaults — AWS-managed keys, a Region-locked CMK, no encryption context, a wide-open key policy delegated to IAM and never tightened — work in a single account, in a single Region, with one team, until exactly one of those assumptions breaks. Then you discover that an AWS-managed key cannot be shared cross-account, that a Region-scoped ciphertext is unreadable in your DR Region even though the bytes replicated fine, or that a hot path calling GenerateDataKey per object is throttling and running up a four-figure monthly bill.

What breaks without this knowledge: a team enables a DynamoDB global table, replicates client-side-encrypted PAN data to a second Region, runs a failover game-day, and cannot decrypt a single replicated record — the worst kind of DR failure, because it looks healthy until you cut over. Or an auditor asks “prove only the payments role can decrypt cardholder data,” and the answer is a key policy delegated to account root with no kms:ViaService or EncryptionContext constraint, so the honest answer is “anyone in the account with the right IAM can.” Or a launch melts under ThrottlingException because nobody enabled S3 Bucket Keys or raised the request-rate quota ahead of time.

Who hits this: any team with cross-Region DR of encrypted data, cross-account data sharing, a compliance regime that demands provable key isolation, or a high-throughput encryption path. It bites hardest on active-active applications with global tables (multi-Region key portability), regulated workloads (encryption context and least-privilege key policies), and anything that calls KMS per object (quota and cost). The fix is almost never “turn on a bigger key” — it is “make the key portable where ciphertext travels, scope authorization to the exact caller and context, and architect the call volume down.”

To frame the whole field before the deep dive, here is every problem class this article covers, the question it forces, and where to look first:

| Problem class | What is actually wrong | First question to ask | Where to confirm | Most common single cause |
|---|---|---|---|---|
| Replicated-but-unreadable | DR Region can’t decrypt replicated ciphertext | Did the key travel, or only the bytes? | describe-key (MultiRegion field) | Single-Region CMK behind a global table |
| Locked-out key (AccessDenied) | Decrypt denied despite IAM allowing it | Does the key policy delegate to IAM? | get-key-policy (EnableIAMRoot statement) | Key policy with no IAM delegation |
| Throttling / surprise bill | ThrottlingException, four-figure KMS spend | Is it one KMS call per object? | CloudTrail event count; Service Quotas | Per-object GenerateDataKey, no Bucket Keys |
| Cross-account read fails closed | Foreign principal can’t decrypt shared data | Are both sides (key policy + IAM) aligned? | get-key-policy + consumer IAM | Only one side of the handshake granted |
| Context / audit gap | Unexpected Decrypt, weak provable scope | Is the key pinned to service + context? | CloudTrail Decrypt events | No kms:ViaService / EncryptionContext |
| Rotation that didn’t re-wrap | Compliance expects fresh material on old data | Did rotation re-encrypt, or just rotate? | get-key-rotation-status + design review | Assuming rotation re-wraps stored data |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should be comfortable with IAM policy evaluation (identity vs resource policies, explicit deny wins, condition keys), basic AWS CLI (--query, JSON output, fileb://), and the idea of at-rest vs in-transit encryption. You should know what an ARN is and how cross-account access works in principle. Familiarity with AES-GCM and the words “symmetric” and “authenticated encryption” helps but isn’t required — the article defines what it needs.

This sits in the Security & Cryptography track. It assumes the identity fundamentals from AWS IAM Fundamentals: Users, Roles, Policies & the Evaluation Engine and the least-privilege patterns in IAM Least Privilege: Permission Boundaries & Inescapable Ceilings, because a key policy is just a resource policy and the three-layer model is IAM evaluation with the key policy as the root of trust. It is upstream of every storage deep-dive: S3 storage classes, versioning, lifecycle & encryption, EBS, EFS & FSx, and RDS & Aurora all consume KMS for SSE. It pairs with Secrets Manager & Parameter Store (both wrap their secrets under a CMK), CloudTrail, Config & audit (where every Decrypt lands), and at org scale with Organizations: SCPs, guardrails & delegated admin and Resource Control Policies & the data perimeter.

A quick map of who owns which layer during an encryption design or incident, so you call the right person fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
|---|---|---|---|
| Application crypto | Data keys, AAD, the Encryption SDK | App / dev team | Wrong encryption context, unbounded cache, plaintext leak |
| Key policy | Resource policy on the CMK (root of trust) | Security / platform | Lockout, over-broad access, missing cross-account grant |
| IAM | Identity policies referencing KMS actions | Platform + app | Decrypt denied (key policy didn’t delegate), wrong key ARN |
| Grants | Programmatic, temporary delegations | Platform + AWS services | Orphaned grants, service can’t mint data keys |
| Multi-Region / DR | Primary + replicas, ReEncrypt backfill | Platform / DR owner | Replicated-but-unreadable, divergent replica policy |
| Quota & cost | Request-rate quota, Bucket Keys, caching | Platform / FinOps | Throttling, surprise bill, throttled launch |

Core concepts

Six mental models make every later decision obvious.

KMS is a wrapping and authorization service, not a bulk cipher. Two API verbs anchor the whole service. Encrypt/Decrypt send up to 4 KB of plaintext/ciphertext and KMS does the crypto — fine for small secrets, wrong for large objects. GenerateDataKey mints a fresh symmetric data key and returns it to you both in plaintext and wrapped under your KMS key; you encrypt your gigabytes locally with the plaintext copy, throw it away, and store only the ciphertext payload plus the wrapped blob. Every design decision — quotas, caching, multi-Region portability — falls out of “KMS protects the key that protects the data.”

The key policy is the root of trust, not IAM. Unlike S3, where an IAM policy alone can grant access, a KMS key policy that does not delegate to IAM means IAM policies are ignored. The key policy is authoritative. This single fact is responsible for most KMS lockouts and most “why is my Decrypt denied when IAM clearly allows it” tickets.

KeySpec and KeyUsage are immutable. You choose symmetric vs asymmetric vs HMAC, and encrypt-vs-sign, at creation, and you can never change them. Picking wrong means creating a new key and re-encrypting. Treat key creation as a one-way door.

A normal key is Region-locked; ciphertext is portable only with a multi-Region key. Ciphertext produced in eu-west-1 can only be decrypted by calling KMS in eu-west-1. If that Region is down, the data is unreadable even though the bytes replicated elsewhere. Multi-Region keys share the same key material across Regions so an envelope decrypts in any of them — a deliberate availability/isolation trade-off you opt into, not a default.

Encryption context is authenticated, logged, additional data — your cheapest authorization and audit tool. It is not secret and not encrypted, but it is bound to the ciphertext and required, byte-for-byte, at decrypt time, appears in CloudTrail, and can be constrained in policy with kms:EncryptionContext: conditions. It is the difference between “this role can decrypt anything” and “this role can decrypt only tenant=acme invoices.”

Throughput is bounded by a shared request-rate quota. Symmetric Decrypt/GenerateDataKey/Encrypt draw on a single quota per account, per Region (tens of thousands of requests/second depending on Region). A hot path that calls KMS per object hits ThrottlingException long before you expect. The architecture answer is to call KMS less (Bucket Keys, data-key caching), not just to retry harder.
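To make "call KMS less" concrete, here is a rough back-of-the-envelope sketch of request charges for a per-object hot path versus a cached one. The price constant is an assumption (the commonly quoted list price for symmetric requests) and the 1,000 writes/second workload is hypothetical; check current AWS pricing before relying on the numbers.

```python
# Assumption: list price for symmetric KMS API requests; check current AWS pricing.
ASSUMED_USD_PER_10K_REQUESTS = 0.03
SECONDS_PER_30_DAY_MONTH = 30 * 24 * 3600


def monthly_requests(kms_calls_per_second: float) -> float:
    """KMS requests issued over a 30-day month at a steady rate."""
    return kms_calls_per_second * SECONDS_PER_30_DAY_MONTH


def monthly_request_cost_usd(kms_calls_per_second: float) -> float:
    """Request charges only; excludes the per-key monthly fee."""
    return monthly_requests(kms_calls_per_second) / 10_000 * ASSUMED_USD_PER_10K_REQUESTS


# Hypothetical hot path: 1,000 object writes per second.
per_object = monthly_request_cost_usd(1_000)        # one KMS call per object
cached = monthly_request_cost_usd(1_000 / 1_000)    # one KMS call per 1,000 objects

print(f"per-object calls: ${per_object:,.2f}/month")
print(f"1 call per 1,000 objects: ${cached:,.2f}/month")
```

Under these assumptions the per-object design lands in four-figure territory while the cached design costs a few dollars, which is why the fix is architectural rather than a bigger retry budget.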

The vocabulary in one table

Before the deep sections, pin every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| KMS key (CMK) | Logical reference to HSM-resident key material | KMS, in-Region | The thing you authorize; never exported |
| Data key | Symmetric key minted by GenerateDataKey | In your app (plaintext) + storage (wrapped) | Does the actual bulk encryption |
| Envelope encryption | Encrypt data with a data key, wrap the data key with the CMK | App + storage | The pattern behind almost everything |
| Key policy | Resource policy on the CMK | On the key | Root of trust; authoritative over IAM |
| Grant | Programmatic, temporary delegation | On the key | How AWS services and short-lived jobs get access |
| Encryption context | Authenticated additional data (AAD) | Bound to ciphertext, logged | Cheap authorization + audit binding |
| Multi-Region key (MRK) | Primary + replicas sharing key material | Multiple Regions (mrk-...) | Cross-Region ciphertext portability for DR |
| Alias | Mutable, Region-scoped friendly pointer | In-Region | Human-friendly key reference; repointable |
| kms:ViaService | Condition pinning a key to one AWS service | Key/IAM policy condition | “Only through S3,” not a human running decrypt |
| Request-rate quota | Shared per-Region cryptographic-ops cap | Region | The throttling ceiling you architect around |
| S3 Bucket Keys | Bucket-level data key caching for SSE-KMS | S3 bucket setting | Collapses per-object KMS calls dramatically |
| ReEncrypt | Swap the wrapping key without exposing plaintext | KMS API | Re-wraps stored ciphertext during migrations |
ReEncrypt Swap the wrapping key without exposing plaintext KMS API Re-wraps stored ciphertext during migrations

1. Key types: pick the right primitive

KMS keys are not interchangeable. The KeySpec and KeyUsage are immutable at creation, so this is a one-way door. Choose the primitive for the job, then never look back.

| Type | KeySpec | KeyUsage | Use it for | API verbs | Notes / limit |
|---|---|---|---|---|---|
| Symmetric | SYMMETRIC_DEFAULT (AES-256-GCM) | ENCRYPT_DECRYPT | Envelope encryption; default for S3/EBS/RDS/Secrets Manager | Encrypt, Decrypt, GenerateDataKey* | Never leaves KMS; 4 KB direct limit; the workhorse |
| Asymmetric (encrypt) | RSA_2048/3072/4096 | ENCRYPT_DECRYPT | Encrypt where the encryptor has no AWS creds | Encrypt, Decrypt (+ public key) | Public key downloadable; no GenerateDataKey; small payloads |
| Asymmetric (sign) | ECC_NIST_P256/384/521, ECC_SECG_P256K1, RSA_* | SIGN_VERIFY | Code/document signing, external verification | Sign, Verify (+ public key) | Verifier may be outside AWS; pick curve per standard |
| Key agreement | ECC_NIST_*, SM2 (China) | KEY_AGREEMENT | Derive a shared secret (ECDH) | DeriveSharedSecret | Niche; for negotiated session keys |
| HMAC | HMAC_224/256/384/512 | GENERATE_VERIFY_MAC | MACs, signed tokens, deterministic integrity | GenerateMac, VerifyMac | Symmetric secret; never exported; no encrypt |
| Multi-Region | any above + MultiRegion: true | per spec | DR, global tables, cross-Region ciphertext portability | per spec + ReplicateKey | Shares material across Regions; mrk- id prefix |

Orthogonal to spec is who manages the key. This choice is not immutable, but migrating between models means re-encryption, so decide deliberately:

| Management model | Who controls policy | Rotation | Cross-account? | Visible in your account? | Cost | When it’s acceptable |
|---|---|---|---|---|---|---|
| AWS-owned | AWS (invisible) | AWS-managed | No | No | Free | Zero audit/access-control requirement |
| AWS-managed (aws/s3, aws/ebs…) | AWS (you can’t edit) | Auto, yearly | No | Yes | Free key; per-request charges | Single-account, single-Region, no policy edits |
| Customer-managed (CMK) | You (full policy) | Optional, configurable | Yes | Yes | ~$1/key/month + requests | Anything that needs policy, sharing, or proof |

The takeaway: the moment you need an editable policy, cross-account sharing, custom rotation, grants, or independent deletion, you are forced to a customer-managed key. Everything below assumes CMKs. The decision rule in one table:

| If you need… | AWS-owned | AWS-managed | Customer-managed |
|---|---|---|---|
| Edit the key policy | No | No | Yes |
| Share ciphertext cross-account | No | No | Yes |
| Cross-Region DR portability (MRK) | No | No | Yes |
| Custom rotation cadence | No | No (yearly only) | Yes |
| Grants for services/short-lived jobs | No | Limited | Yes |
| See it / audit its policy | No | Yes | Yes |
| Pay nothing for the key | Yes | Yes | No (~$1/mo) |

# A customer-managed symmetric key
aws kms create-key \
  --description "app-prod data-at-rest" \
  --key-spec SYMMETRIC_DEFAULT \
  --key-usage ENCRYPT_DECRYPT \
  --tags TagKey=env,TagValue=prod TagKey=app,TagValue=payments

# Turn rotation on from day one (create-key alone does not enable it)
aws kms enable-key-rotation \
  --key-id <key-id> \
  --rotation-period-in-days 180

# Give it a human-friendly alias (aliases are Region-scoped, mutable pointers)
aws kms create-alias \
  --alias-name alias/payments-prod \
  --target-key-id <key-id>

# Terraform equivalent: key + alias + rotation in one place, reviewed in a PR
resource "aws_kms_key" "payments" {
  description              = "app-prod data-at-rest"
  key_usage                = "ENCRYPT_DECRYPT"
  customer_master_key_spec = "SYMMETRIC_DEFAULT"
  enable_key_rotation      = true
  rotation_period_in_days  = 180
  deletion_window_in_days  = 30
  tags = { env = "prod", app = "payments" }
}

resource "aws_kms_alias" "payments" {
  name          = "alias/payments-prod"
  target_key_id = aws_kms_key.payments.key_id
}

Aliases deserve their own note, because they are the moving part teams misuse. An alias is a Region-scoped, mutable pointer to a key — alias/payments-prod resolves to whatever key it currently targets. That makes manual rotation (repoint the alias to a new key) trivial, but it also means an alias is not a stable identity for audit. The alias quirks worth knowing:

| Alias behaviour | Detail | Gotcha |
|---|---|---|
| Scope | One Region only | The same name in another Region is a different pointer |
| Mutability | update-alias repoints it instantly | A typo’d repoint silently sends encrypt to the wrong key |
| aws/ prefix | Reserved for AWS-managed keys | You cannot create alias/aws/... |
| In CloudTrail | Calls log the key ARN, not the alias | Audit on key ID, never on alias name |
| Multi-Region | Use the same alias in each Region pointing at the local MRK | Convention, not enforced; keep it disciplined |

2. Envelope encryption: data keys and the SDK

For anything larger than 4 KB, you encrypt locally with a data key. The raw flow, before any SDK:

# 1. Mint a data key: plaintext + ciphertext (wrapped) come back together
aws kms generate-data-key \
  --key-id alias/payments-prod \
  --key-spec AES_256 \
  --query '{plaintext:Plaintext, wrapped:CiphertextBlob}' \
  --output json
# 2. Encrypt the payload locally with `plaintext` (AES-256-GCM in your app)
# 3. Persist the ciphertext payload + `wrapped` blob; ZERO the plaintext key in memory
# 4. To read: Decrypt(wrapped) -> plaintext key -> decrypt payload locally
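
The four steps above can be modelled end to end without touching AWS. The sketch below is only a reasoning aid: `FakeKms`, `seal`, and `unseal` are hypothetical names, and the cipher is a toy stand-in for AES-GCM built from the standard library, not something to use on real data. What it shows is the shape of the flow: the payload never goes to the "KMS", the master key never leaves it, and the encryption context must match at unwrap time.

```python
import hashlib
import hmac
import json
import os


def _xor_stream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy keystream (SHA-256 in counter mode). Stand-in for AES-GCM only."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + (i // 32).to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


def seal(key: bytes, plaintext: bytes, aad: bytes = b"") -> bytes:
    """Encrypt and authenticate; the AAD is bound to the result but not stored in it."""
    nonce = os.urandom(16)
    body = _xor_stream(key, nonce, plaintext)
    tag = hmac.new(key, aad + nonce + body, hashlib.sha256).digest()
    return nonce + tag + body


def unseal(key: bytes, blob: bytes, aad: bytes = b"") -> bytes:
    """Verify the tag first, then decrypt; fails on a wrong key or wrong AAD."""
    nonce, tag, body = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(key, aad + nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed: wrong key or wrong context")
    return _xor_stream(key, nonce, body)


class FakeKms:
    """Hypothetical in-memory stand-in for KMS; the master key never leaves it."""

    def __init__(self) -> None:
        self._master = os.urandom(32)
        self.calls = 0

    @staticmethod
    def _aad(context: dict) -> bytes:
        return json.dumps(context, sort_keys=True).encode()

    def generate_data_key(self, context: dict) -> tuple[bytes, bytes]:
        self.calls += 1
        data_key = os.urandom(32)
        return data_key, seal(self._master, data_key, self._aad(context))

    def decrypt(self, wrapped: bytes, context: dict) -> bytes:
        self.calls += 1
        return unseal(self._master, wrapped, self._aad(context))


kms = FakeKms()
context = {"tenant": "acme", "purpose": "invoice"}
payload = b"cardholder record;" * 1000   # ~18 KB, well over the 4 KB direct limit

# Write path: one KMS call, payload encrypted locally, plaintext key discarded.
data_key, wrapped = kms.generate_data_key(context)
ciphertext = seal(data_key, payload)
del data_key
stored = {"wrapped_key": wrapped, "ciphertext": ciphertext}

# Read path: KMS unwraps the small blob, the payload is decrypted locally.
recovered = unseal(kms.decrypt(stored["wrapped_key"], context), stored["ciphertext"])
```

Note that the whole round trip costs two KMS calls regardless of payload size, which is the property every later quota and caching decision builds on.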

The data-key API has variants, and picking the wrong one is a common slip — GenerateDataKeyWithoutPlaintext exists precisely for the case where the minting service shouldn’t see the key (it will be decrypted later, elsewhere):

| API | Returns | Use it when | Counts against quota |
|---|---|---|---|
| GenerateDataKey | Plaintext + wrapped key | You encrypt now, in this process | Yes (1 op) |
| GenerateDataKeyWithoutPlaintext | Wrapped key only | A different component will decrypt later | Yes (1 op) |
| GenerateDataKeyPair | Plaintext + wrapped private key + public key | Asymmetric envelope (sign/encrypt offline) | Yes (heavier op) |
| GenerateDataKeyPairWithoutPlaintext | Wrapped private + public key | Mint for later asymmetric use | Yes |
| Encrypt (direct) | Ciphertext (≤4 KB plaintext) | Tiny secret, no envelope needed | Yes |
| Decrypt | Plaintext (≤4 KB) | Unwrap a data key or tiny secret | Yes |
| ReEncrypt | Ciphertext under a new key | Migrate wrapping key without plaintext | Yes (decrypt + encrypt) |

Rolling your own framing (IV, AAD, key blob, algorithm tags) is where teams introduce vulnerabilities. Use the AWS Encryption SDK — it produces a portable, self-describing message format that bundles the wrapped data key with the ciphertext, handles authenticated encryption, and supports multiple wrapping keys:

import aws_encryption_sdk
from aws_encryption_sdk import CommitmentPolicy

client = aws_encryption_sdk.EncryptionSDKClient(
    commitment_policy=CommitmentPolicy.REQUIRE_ENCRYPT_REQUIRE_DECRYPT
)
key_provider = aws_encryption_sdk.StrictAwsKmsMasterKeyProvider(
    key_ids=["arn:aws:kms:eu-west-1:111122223333:key/<key-id>"]
)

ciphertext, header = client.encrypt(
    source=plaintext_bytes,
    key_provider=key_provider,
    # Encryption context is AAD: authenticated, logged in CloudTrail, NOT secret
    encryption_context={"tenant": "acme", "purpose": "invoice"},
)

Two things matter here. Key commitment (REQUIRE_ENCRYPT_REQUIRE_DECRYPT, the 2.x+ default) prevents a class of attacks where one ciphertext decrypts to different plaintexts under different keys; do not lower it to interoperate with ancient clients unless you understand exactly what you give up. And encryption context is your cheapest authorization and audit tool: additional authenticated data that is not encrypted but is bound to the ciphertext, required byte-for-byte at decrypt, logged in CloudTrail, and constrainable in policy. The Encryption SDK options worth knowing:

| SDK concept | What it controls | Default (v2+) | When to change |
|---|---|---|---|
| Commitment policy | Whether key commitment is required | REQUIRE_ENCRYPT_REQUIRE_DECRYPT | Only to read legacy v1 ciphertext (temporarily) |
| Algorithm suite | Cipher + signing + commitment | AES-256-GCM + HKDF + ECDSA + commit | Drop signing only if you understand the trade |
| Key provider | Which CMK(s) wrap the data key | StrictAwsKmsMasterKeyProvider | Multi-CMK (multi-Region/multi-account) decrypt |
| Encryption context | The AAD map bound to ciphertext | empty | Always set it; it’s free authorization + audit |
| Caching CMM | Reuse data keys across messages | off | High-throughput app encryption (see below) |
| Discovery provider | Decrypt with any CMK in an account/Region | off (strict) | Multi-Region decrypt where key ARN varies |

What encryption context is and isn’t trips up almost everyone the first time. The boundaries:

| Encryption context… | Is | Is NOT |
|---|---|---|
| Secrecy | Authenticated (integrity-bound) | Encrypted / secret; it’s plaintext in logs |
| Requirement at decrypt | Required byte-for-byte | Optional metadata |
| Order sensitivity | Order-independent (it’s a map) | A positional list |
| Policy use | Constrainable via kms:EncryptionContext:<k> | A substitute for the key policy |
| Good values | tenant, purpose, table, pk | Anything secret (passwords, PII, the data itself) |
| Audit | Logged in CloudTrail on every call | Hidden; assume it’s visible |
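
The matching rules in the table can be captured in a few lines. This is a simplified model, not KMS itself: `decrypt_context_ok` and `string_equals_condition` are hypothetical helpers that mimic, respectively, the exact-match requirement at decrypt time and a `StringEquals` condition on `kms:EncryptionContext:<k>`.

```python
def decrypt_context_ok(bound: dict, supplied: dict) -> bool:
    """The supplied context must equal the bound one; pair order is irrelevant."""
    return bound == supplied


def string_equals_condition(supplied: dict, key: str, value: str) -> bool:
    """Simplified StringEquals on kms:EncryptionContext:<key>."""
    return supplied.get(key) == value


bound = {"tenant": "acme", "purpose": "invoice"}

# Same pairs in a different order still match: the context is a map.
same_pairs_reordered = decrypt_context_ok(bound, {"purpose": "invoice", "tenant": "acme"})
# Values are compared exactly, so a case change breaks the match.
wrong_case = decrypt_context_ok(bound, {"tenant": "Acme", "purpose": "invoice"})
# Dropping a pair also breaks the match: the context is required, not optional.
missing_pair = decrypt_context_ok(bound, {"tenant": "acme"})

# A policy condition scopes a role to one tenant's ciphertext.
policy_allows_acme = string_equals_condition(bound, "tenant", "acme")
policy_allows_other = string_equals_condition({"tenant": "globex"}, "tenant", "acme")
```

The practical consequence is that the writer and the reader must agree on the context exactly, so it is worth generating it from one shared function rather than assembling it by hand in two places.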

Caching: trading blast radius for throughput

Calling GenerateDataKey per object is correct but expensive — every write becomes a KMS request against your quota. The data key caching layer in the Encryption SDK reuses a data key across many messages, bounded by max_age, max_messages_encrypted, and max_bytes_encrypted:

from aws_encryption_sdk.caches.local import LocalCryptoMaterialsCache
from aws_encryption_sdk.materials_managers.caching import CachingCryptoMaterialsManager

cache = LocalCryptoMaterialsCache(capacity=1000)
cmm = CachingCryptoMaterialsManager(
    master_key_provider=key_provider,
    cache=cache,
    max_age=300.0,             # rotate the cached data key every 5 minutes
    max_messages_encrypted=1000  # ...or after 1000 messages, whichever first
)

The trade-off is explicit: a larger cache and longer max_age mean fewer KMS calls (cheaper, faster, quota-friendly) but a wider blast radius per data key and weaker cryptographic isolation between messages. Tune it; never leave it unbounded. The knobs and how to reason about each:

| Cache parameter | What it bounds | Lower value | Higher value | Sensible starting point |
|---|---|---|---|---|
| max_age | Wall-clock lifetime of a cached data key | More KMS calls, tighter isolation | Fewer calls, wider blast radius | 60–300 s |
| max_messages_encrypted | Messages per cached data key | Tighter isolation | Fewer calls | 100–1000 |
| max_bytes_encrypted | Bytes per cached data key | Tighter isolation | Fewer calls | Stay well under cipher limits |
| capacity | Distinct cache entries (by context) | More cache misses | More memory | 100–1000 |
| Per-context keying | Separate keys per encryption context | Stronger tenant isolation | More entries | Always key on tenant/purpose |
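
To get a feel for how the two main bounds interact, the following sketch counts how many data keys (and therefore KMS calls) a single-context writer would need under given limits. It is a simplified model of the idea, not the Encryption SDK's cache: `CacheBounds` and `count_kms_calls` are hypothetical, and the workload of 10 messages per second for 10 minutes is made up.

```python
from dataclasses import dataclass


@dataclass
class CacheBounds:
    max_age: float                # seconds a cached data key may live
    max_messages_encrypted: int   # messages a cached data key may protect


def count_kms_calls(message_times: list[float], bounds: CacheBounds) -> int:
    """Count GenerateDataKey calls needed for messages sent at the given times."""
    calls = 0
    key_born = None   # time the current data key was minted
    used = 0          # messages encrypted under the current data key
    for t in message_times:
        expired = (
            key_born is None
            or (t - key_born) >= bounds.max_age
            or used >= bounds.max_messages_encrypted
        )
        if expired:
            calls += 1
            key_born = t
            used = 0
        used += 1
    return calls


# Hypothetical workload: 10 messages per second for 10 minutes (6,000 messages).
times = [i / 10 for i in range(6000)]

uncached = count_kms_calls(times, CacheBounds(max_age=300.0, max_messages_encrypted=1))
message_bound = count_kms_calls(times, CacheBounds(max_age=300.0, max_messages_encrypted=1000))
age_bound = count_kms_calls(times, CacheBounds(max_age=60.0, max_messages_encrypted=1000))
```

In this model the uncached path makes 6,000 calls, while the two cached configurations make 6 and 10: whichever bound trips first decides the call volume, and also how many messages share one data key.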

3. Multi-Region keys: portability for DR

A normal KMS key is Region-locked: ciphertext produced in eu-west-1 can only be decrypted by calling KMS in eu-west-1. If that Region is down, your data is unreadable even though the bytes are safely replicated elsewhere. Multi-Region keys (MRKs) fix this. A primary and its replicas share the same key material and the same key ID (mrk-...), with ARNs that differ only in the Region, so ciphertext encrypted under the primary decrypts under any replica with no re-encryption.

# Create the primary as multi-region
aws kms create-key --multi-region \
  --description "global-table encryption primary" \
  --region eu-west-1

# Replicate it into a DR Region (same material, independent policy).
# The call is made against the primary's Region.
aws kms replicate-key \
  --key-id mrk-1234567890abcdef0 \
  --replica-region us-east-1 \
  --description "global-table encryption replica" \
  --region eu-west-1

# Terraform: a primary MRK and a replica with its OWN, narrower policy.
# Provider aliases and the replica_decrypt_only policy document are defined elsewhere.
resource "aws_kms_key" "mrk_primary" {
  provider            = aws.euwest1
  description         = "global-table encryption primary"
  multi_region        = true
  enable_key_rotation = true
}

resource "aws_kms_replica_key" "mrk_replica" {
  provider        = aws.useast1
  description     = "global-table encryption replica (decrypt-only app)"
  primary_key_arn = aws_kms_key.mrk_primary.arn
  policy          = data.aws_iam_policy_document.replica_decrypt_only.json
}

The nuances that catch people are exactly the things that make MRKs powerful and dangerous:

| MRK property | What it means | Implication |
|---|---|---|
| Shared key material | Primary and replicas hold identical material | Ciphertext is portable; isolation is weaker by design |
| Independent policies | Each replica has its own key policy, grants, tags | DR Region can be decrypt-only while primary writes |
| Rotation state | Configured on the primary; replicas follow it and receive the same material | Old ciphertext stays decryptable everywhere |
| mrk- key ID prefix | Same key ID in every Region | Same logical key; different ARNs per Region |
| Deletion protection | Can’t delete the primary while replicas exist | KMS prevents orphaning material |
| Not auto-created | You explicitly replicate-key per Region | No “automatic” global key; opt in per Region |
| Rotation propagation | Auto-rotation on the primary propagates to replicas | Don’t separately rotate replicas |
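
Because a primary and its replicas share a key ID and differ only in the Region part of the ARN, a DR reader can derive the local key ARN from the one recorded at write time. The helpers below are a minimal sketch of that string handling; `parse_key_arn` and `local_replica_arn` are hypothetical names, and the code assumes a replica really has been created in the target Region.

```python
def parse_key_arn(arn: str) -> dict:
    """Split arn:<partition>:kms:<region>:<account>:key/<key-id> into its parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "kms" or not parts[5].startswith("key/"):
        raise ValueError(f"not a KMS key ARN: {arn}")
    return {
        "partition": parts[1],
        "region": parts[3],
        "account": parts[4],
        "key_id": parts[5][len("key/"):],
    }


def is_multi_region(arn: str) -> bool:
    """Multi-Region key IDs carry the mrk- prefix."""
    return parse_key_arn(arn)["key_id"].startswith("mrk-")


def local_replica_arn(arn: str, region: str) -> str:
    """ARN of the related key in another Region (assumes the replica exists)."""
    p = parse_key_arn(arn)
    if not p["key_id"].startswith("mrk-"):
        raise ValueError("single-Region key: its ciphertext is not portable")
    return f"arn:{p['partition']}:kms:{region}:{p['account']}:key/{p['key_id']}"


primary = "arn:aws:kms:eu-west-1:111122223333:key/mrk-1234567890abcdef0"
dr_arn = local_replica_arn(primary, "us-east-1")
```

A single-Region key ARN makes `local_replica_arn` raise, which mirrors the real failure: there is no key in the DR Region that can unwrap that envelope.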

The single most important caveat, learned the hard way in the scenario below: adopting an MRK does nothing for data already wrapped under a single-Region key. Existing envelopes are bound to the old key in the old Region. Making them portable requires a ReEncrypt backfill. The single-Region vs multi-Region decision, distilled:

| If… | Choose | Why |
|---|---|---|
| Ciphertext never leaves its Region | Single-Region CMK | Stronger isolation; no shared material |
| Active-active app, both Regions read/write | Multi-Region key | Either Region decrypts any envelope |
| DynamoDB / S3 cross-Region replication of encrypted data | Multi-Region key | Replication moves bytes, not the key |
| Cross-Region copy of encrypted EBS snapshots for DR | Multi-Region key (or re-encrypt on copy) | Avoid an unreadable DR copy |
| Strict per-Region key isolation is a compliance requirement | Single-Region CMK | Material must not be duplicated |
| You “might need it someday” | Single-Region CMK | Don’t weaken isolation speculatively |

4. Key policies vs IAM vs grants: the authorization model

This is where KMS differs sharply from most AWS services and where the dangerous mistakes live. Three layers decide whether a Decrypt succeeds, and the order of trust is not what IAM-first intuition expects:

  1. The key policy — the resource policy on the key. It is the root of trust. Unlike S3, where IAM alone can grant access, a KMS key policy that does not delegate to IAM means IAM policies are ignored. The key policy is authoritative.
  2. IAM policies — only effective if the key policy enables IAM (the canonical kms:* to the account root statement). With that statement present, IAM grants behave normally.
  3. Grants — programmatic, temporary, fine-grained delegations, ideal for AWS services and short-lived workloads.
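
The order of trust in the three layers can be written down as a small decision function. This is a deliberately simplified, same-account model: `KeyModel` and `can_decrypt` are hypothetical, and the sketch ignores explicit denies, policy conditions, and SCPs, all of which the real evaluation also considers.

```python
from dataclasses import dataclass, field


@dataclass
class KeyModel:
    """Hypothetical, heavily simplified view of one KMS key's authorization state."""
    named_principals: set = field(default_factory=set)  # allowed directly by the key policy
    delegates_to_iam: bool = False                       # EnableIAMRoot-style statement present
    grantees: set = field(default_factory=set)           # principals holding a Decrypt grant


def can_decrypt(principal: str, key: KeyModel, iam_allows: bool) -> tuple[bool, str]:
    """Same-account check. Ignores explicit denies, conditions, and SCPs."""
    if principal in key.named_principals:
        return True, "key policy"
    if key.delegates_to_iam and iam_allows:
        return True, "IAM, via key policy delegation"
    if principal in key.grantees:
        return True, "grant"
    return False, "denied"


app = "role/payments-app"
locked_down = KeyModel()                       # no delegation, nobody named
delegating = KeyModel(delegates_to_iam=True)
granted = KeyModel(grantees={"role/batch-worker"})

lockout = can_decrypt(app, locked_down, iam_allows=True)
via_iam = can_decrypt(app, delegating, iam_allows=True)
via_grant = can_decrypt("role/batch-worker", granted, iam_allows=False)
```

The `lockout` case is the one worth internalizing: the IAM policy allows the call, yet the result is a denial, because the key policy never handed the decision to IAM.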

The three layers side by side, because choosing the wrong instrument is the most common architectural error:

| Mechanism | Granularity | Lifetime | Who edits it | Best for | Limit / gotcha |
|---|---|---|---|---|---|
| Key policy | Per principal + condition | Until you edit it | Key admins | The authoritative allow-list; IAM delegation | Omit IAM delegation → IAM ignored → lockout risk |
| IAM policy | Per principal, per action | Until you edit it | IAM admins | Account-internal access at scale | Only works if key policy delegates to IAM |
| Grant | Per grantee + operations + constraints | Until retired/revoked | Anyone with CreateGrant | AWS services, short-lived jobs | ~50k grants/key; can be orphaned; eventually consistent |

The minimum sane key policy delegates administration and usage to IAM rather than hard-coding every principal:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnableIAMRoot",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "KeyUsers",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/payments-app" },
      "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey*"],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "kms:EncryptionContext:tenant": "acme" }
      }
    },
    {
      "Sid": "KeyUsersDescribe",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/payments-app" },
      "Action": "kms:DescribeKey",
      "Resource": "*"
    }
  ]
}

The EnableIAMRoot statement is not a backdoor to root credentials — it delegates the decision to IAM in this account. Omit it and you must enumerate every principal in the key policy forever, including the admins who could otherwise fix a lockout. That is the classic way teams brick a key.

The KMS actions you will actually write into policies, grouped by what they let a principal do — least privilege means granting only the row you need:

| Action | What it permits | Typical grantee | Risk if over-granted |
|---|---|---|---|
| kms:Encrypt | Encrypt ≤4 KB directly | Apps writing tiny secrets | Low (encrypt-only is benign) |
| kms:Decrypt | Unwrap data keys / decrypt | Read paths | High: this is “read the data” |
| kms:GenerateDataKey* | Mint data keys for envelope encryption | Write paths | Medium (enables new encryption) |
| kms:ReEncrypt* | Re-wrap ciphertext under another key | Migration jobs | Medium (can move data between keys) |
| kms:DescribeKey | Read key metadata | Almost everything | Low |
| kms:CreateGrant | Delegate to another principal | Services, admins | High: can broaden access |
| kms:PutKeyPolicy | Replace the key policy | Key admins only | Critical: can grant anyone |
| kms:ScheduleKeyDeletion | Schedule destruction of the key | Break-glass only | Critical: can destroy data |
| kms:Sign / kms:Verify | Asymmetric sign/verify | Signing services | Medium |
| kms:GenerateMac / kms:VerifyMac | HMAC operations | Token services | Medium |

When to reach for grants

Grants shine where a static policy is wrong. They are issued via API, carry their own constraints, and can be retired the instant the work is done:

aws kms create-grant \
  --key-id <key-id> \
  --grantee-principal arn:aws:iam::111122223333:role/batch-worker \
  --operations Decrypt GenerateDataKey \
  --constraints EncryptionContextSubset={tenant=acme} \
  --retiring-principal arn:aws:iam::111122223333:role/grant-manager

This is exactly the mechanism AWS services use on your behalf: when you attach a CMK to an Auto Scaling group or an encrypted EBS volume, the service creates a grant so it can mint data keys without you widening the key policy. Prefer grants over key-policy edits for service integrations and transient access — they are revocable and don’t bloat the resource policy. The grant constraints and lifecycle, which trip teams up under eventual consistency:

| Grant aspect | Detail | Gotcha |
|---|---|---|
| Operations | Explicit list (Decrypt, GenerateDataKey…) | Grant only what the workload needs |
| EncryptionContextSubset | Grantee must include these context pairs | Looser than Equals; allows extra context |
| EncryptionContextEquals | Exact context match required | Strictest; use for tight scoping |
| Retiring principal | Who can retire the grant | Set it, or you can only revoke as admin |
| GrantToken | Returned token bridges eventual consistency | Pass it on immediate use or get transient denies |
| Revoke vs retire | Admin revokes; grantee/retirer retires | Both remove it; revoke is the admin lever |
| Limit | ~50,000 grants per key | Orphaned service grants accumulate; audit them |
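
The difference between the two context constraints is easiest to see side by side. The helpers below are a simplified model with hypothetical names; they only compare key-value pairs and say nothing about the rest of grant evaluation.

```python
def subset_constraint_ok(constraint: dict, supplied: dict) -> bool:
    """EncryptionContextSubset: every constrained pair must appear; extras are fine."""
    return all(supplied.get(k) == v for k, v in constraint.items())


def equals_constraint_ok(constraint: dict, supplied: dict) -> bool:
    """EncryptionContextEquals: the supplied context must match exactly."""
    return constraint == supplied


constraint = {"tenant": "acme"}
request_context = {"tenant": "acme", "purpose": "invoice"}

subset_allows = subset_constraint_ok(constraint, request_context)   # extra pair tolerated
equals_allows = equals_constraint_ok(constraint, request_context)   # extra pair rejected
other_tenant = subset_constraint_ok(constraint, {"tenant": "globex", "purpose": "invoice"})
```

With the grant from the CLI example above, a request carrying `tenant=acme` plus any extra pairs passes the subset check, while the same request would fail an equals constraint on `tenant=acme` alone.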

5. Cross-account encryption: S3, EBS, and snapshots

Sharing encrypted data across accounts is a two-sided handshake: the key policy in the owning account must allow the foreign principal, and that principal’s IAM policy in their own account must allow the KMS actions. Missing either side fails closed — and the failure is a generic AccessDenied, not a helpful “the other side didn’t grant you.”

Key policy in the owning account (111122223333):

{
  "Sid": "AllowConsumerAccount",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::444455556666:root" },
  "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey"],
  "Resource": "*",
  "Condition": { "StringEquals": { "aws:PrincipalOrgID": "o-exampleorgid" } }
}

IAM policy in the consumer account (444455556666) — note you must name the full key ARN, because the key lives in another account:

{
  "Effect": "Allow",
  "Action": ["kms:Decrypt", "kms:DescribeKey"],
  "Resource": "arn:aws:kms:eu-west-1:111122223333:key/<key-id>"
}

The two-sided requirement is the whole game. Which side does what:

| Side | What it must allow | Reference style | If missing |
|---|---|---|---|
| Owning account (key policy) | The foreign principal/root + actions | Principal: {AWS: arn:...:444455556666:root} | AccessDenied: owner didn’t share |
| Consumer account (IAM) | The KMS actions on the full key ARN | Resource: arn:aws:kms:...:111122223333:key/... | AccessDenied: consumer didn’t grant |
| Both (recommended) | Scope with aws:PrincipalOrgID | Condition on the key policy | Org-wide blast radius if * |
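
Since the real error is a generic AccessDenied, it helps to reason about the two sides separately. The sketch below mirrors the two policy fragments above as plain dictionaries and reports which side is missing; `diagnose` and the `EXAMPLE-KEY-ID` placeholder are hypothetical, and the model ignores conditions such as `aws:PrincipalOrgID` and any wildcard matching.

```python
KEY_ARN = "arn:aws:kms:eu-west-1:111122223333:key/EXAMPLE-KEY-ID"  # hypothetical key ID

owner_key_policy = {
    "Principal": {"AWS": "arn:aws:iam::444455556666:root"},
    "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey"],
}
consumer_iam = {"Action": ["kms:Decrypt", "kms:DescribeKey"], "Resource": KEY_ARN}


def owner_side_ok(statement: dict, consumer_account: str, action: str) -> bool:
    """Does the key policy share this action with the consumer account?"""
    shared_with = statement["Principal"]["AWS"]
    return shared_with == f"arn:aws:iam::{consumer_account}:root" and action in statement["Action"]


def consumer_side_ok(statement: dict, key_arn: str, action: str) -> bool:
    """Does the consumer's IAM policy allow this action on the full key ARN?"""
    return statement["Resource"] == key_arn and action in statement["Action"]


def diagnose(action: str, key_policy: dict, iam: dict, consumer_account: str = "444455556666") -> str:
    owner = owner_side_ok(key_policy, consumer_account, action)
    consumer = consumer_side_ok(iam, KEY_ARN, action)
    if owner and consumer:
        return "allowed"
    if not owner and not consumer:
        return "denied: neither side grants it"
    return "denied: owner key policy" if not owner else "denied: consumer IAM"


decrypt_result = diagnose("kms:Decrypt", owner_key_policy, consumer_iam)
generate_result = diagnose("kms:GenerateDataKey", owner_key_policy, consumer_iam)

# An alias instead of the full key ARN on the consumer side does not match.
alias_iam = {"Action": ["kms:Decrypt"], "Resource": "alias/payments-prod"}
alias_result = diagnose("kms:Decrypt", owner_key_policy, alias_iam)
```

With the fragments as written, Decrypt is allowed, while GenerateDataKey is shared by the owner but never granted in the consumer's IAM policy, so a cross-account writer would still be denied.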

Service-specific edges that matter, because each service layers its own access surface on top of the KMS handshake:

| Service | Extra surfaces beyond key policy + IAM | Cross-account requirement | Cost lever |
|---|---|---|---|
| S3 (SSE-KMS) | Bucket policy + object ownership | Bucket policy, key policy, reader IAM all align | S3 Bucket Keys: collapse per-object calls |
| EBS / snapshots | Snapshot share + grant on the key | CMK (not AWS-managed); target gets Decrypt + CreateGrant | Re-encrypt to target’s key on copy |
| RDS / Aurora | Snapshot share + KMS share | Share snapshot AND the CMK; target re-encrypts on copy | Storage-level SSE; few KMS calls |
| DynamoDB | Table encryption setting | MRK for cross-Region global tables | Table-level key; low call volume |
| Secrets Manager | Resource policy on the secret | Share secret + grant Decrypt on its CMK | One CMK for many secrets |

Two edges deserve calling out explicitly. For S3 with SSE-KMS, cross-account reads need the bucket policy, the object’s KMS key policy, and the reader’s IAM all aligned; set the bucket’s default encryption to your CMK and enable S3 Bucket Keys — it caches a bucket-level data key and collapses thousands of per-object KMS calls into a handful, cutting both cost and quota pressure dramatically. For EBS / snapshots, you cannot share a snapshot encrypted with an AWS-managed key cross-account — full stop. Use a CMK, share the snapshot, grant the target account Decrypt + CreateGrant on the key, and the target re-encrypts to their key on copy. This is why production fleets standardize on CMKs for EBS even when defaults would “work.”

# Enable S3 Bucket Keys on a bucket's default SSE-KMS — the single biggest KMS-call reducer
aws s3api put-bucket-encryption --bucket payments-data \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:eu-west-1:111122223333:key/<key-id>"
      },
      "BucketKeyEnabled": true
    }]
  }'

6. Key rotation: automatic vs manual

Automatic rotation is the default answer. Enable it and KMS generates new cryptographic material on a schedule (yearly by default; the rotation period is configurable from 90 to 2,560 days), retaining all prior material so old ciphertext stays decryptable. The key ID, ARN, and policy never change, so rotation is invisible to applications.

aws kms enable-key-rotation \
  --key-id <key-id> \
  --rotation-period-in-days 180

aws kms get-key-rotation-status --key-id <key-id>

Crucial subtlety: automatic rotation rotates the KMS key material, but it does not re-encrypt your existing data or your stored data keys. Old envelopes are unwrapped with retained old material; new writes use new material. If your compliance regime demands that data actually be re-wrapped under fresh material, that is a separate, application-driven re-encryption job, not something rotation does for you:

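# Conceptual re-encrypt: ReEncrypt swaps the wrapping key without exposing plaintext.
# Plaintext never returns to your process; KMS decrypts and re-wraps internally.
import boto3

kms = boto3.client("kms")

new_blob = kms.re_encrypt(
    CiphertextBlob=old_wrapped_blob,  # bytes of the wrapped data key stored with the item
    # If the blob was encrypted with a context, the same context must be supplied here.
    SourceEncryptionContext={"tenant": "acme"},
    SourceKeyId="alias/payments-prod-old",
    DestinationKeyId="alias/payments-prod-new",
    DestinationEncryptionContext={"tenant": "acme"},
)["CiphertextBlob"]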

Manual rotation (a brand-new key behind the same alias) is for cases automatic rotation can’t cover: changing key spec, moving to a new Region/account, or responding to suspected compromise where you must invalidate old material. You repoint the alias and run a re-encryption backfill. Asymmetric and HMAC keys do not support automatic rotation, so they require this; symmetric encryption keys rarely do. The two models compared:

Aspect | Automatic rotation | Manual rotation (new key + alias)
What changes | Backing material only | A whole new key (new key ID/ARN)
Key ID / ARN | Unchanged | New (alias repointed)
App impact | Invisible | Repoint alias; backfill old data
Re-encrypts old data? | No (old material retained) | Only if you run a ReEncrypt backfill
Changes key spec? | No | Yes (the reason to do it manually)
Cadence | 90–2560 days, configurable | Whenever you decide
Cost | Slightly more material stored | New key (~$1/mo) + ReEncrypt requests
Use it for | The default, compliance “rotate yearly” | Spec change, compromise, Region/account move

What rotation does and does not do — the table that prevents a false sense of security:

Statement about rotation | True? | Why it matters
New writes use fresh material | Yes | The point of rotation
Old ciphertext stays decryptable | Yes | Prior material is retained
Existing data is re-wrapped under new material | No | Needs a separate ReEncrypt job
Stored data keys are refreshed | No | They’re wrapped under retained material
The key ARN/ID changes | No | Apps and references are unaffected
Replica MRKs rotate too | Yes (from the primary) | Don’t rotate replicas separately
Compliance “data must be re-encrypted” is satisfied | Not by rotation alone | Run an application-driven backfill
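The rotation semantics in the table can be made concrete with a toy model. This is a sketch under heavy simplification: ToyKmsKey is a hypothetical class, there is no real cryptography, and a "wrapped blob" is just a dict tagged with the material version that wrapped it. It only illustrates which things change on rotation and which do not.

```python
class ToyKmsKey:
    """Toy model: one stable key ID, a growing list of retained material versions."""

    def __init__(self, key_id):
        self.key_id = key_id
        self.versions = [0]  # version numbers of retained backing material

    def rotate(self):
        # Rotation adds new material; prior material is retained, key ID is unchanged.
        self.versions.append(self.versions[-1] + 1)

    def wrap(self, data_key):
        # New wraps always use the newest material.
        return {"key_id": self.key_id, "version": self.versions[-1], "data_key": data_key}

    def unwrap(self, blob):
        # Old blobs still unwrap because their material version is retained.
        if blob["version"] not in self.versions:
            raise ValueError("material not retained")
        return blob["data_key"]

    def re_encrypt(self, blob):
        # The separate, application-driven step: unwrap then wrap under current material.
        return self.wrap(self.unwrap(blob))


key = ToyKmsKey("hypothetical-key-id")
old_blob = key.wrap("data-key-1")
key.rotate()
new_blob = key.wrap("data-key-2")
rewrapped = key.re_encrypt(old_blob)
print(old_blob["version"], new_blob["version"], rewrapped["version"])
```

The stored blob keeps its original version tag after rotate(); only the explicit re_encrypt call moves it to fresh material, which is the gap auditors tend to ask about.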

7. Auditing and controls

Every KMS API call lands in CloudTrail — including Decrypt, with the encryptionContext, the calling principal, and the source IP. This is the highest-signal security telemetry in the account; treat unexpected Decrypt calls as you would unexpected AssumeRole.

Tighten policies with condition keys so a key is usable only in the intended context:

{
  "Sid": "OnlyViaS3FromOrg",
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::111122223333:role/payments-app" },
  "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "kms:ViaService": "s3.eu-west-1.amazonaws.com",
      "aws:PrincipalOrgID": "o-exampleorgid"
    }
  }
}

kms:ViaService pins usage to a specific AWS service (the key can only be used through S3, not by a human running aws kms decrypt). aws:PrincipalOrgID confines cross-account use to your organization. For ABAC, gate Decrypt on tag parity between the principal and the key with aws:ResourceTag / aws:PrincipalTag so access scales with tags instead of hand-written ARNs:

"Condition": {
  "StringEquals": {
    "aws:ResourceTag/project": "${aws:PrincipalTag/project}"
  }
}
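To see how these conditions behave, here is a minimal sketch of StringEquals evaluation with policy-variable expansion. It is an illustration only: the real policy engine supports many more operators, multi-valued conditions, and explicit denies, and the request-context dictionaries below are hypothetical.

```python
import re


def resolve(value, request_ctx):
    """Expand ${key} policy variables from the request context (empty if absent)."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: request_ctx.get(m.group(1), ""), value)


def string_equals(condition, request_ctx):
    """True only if every condition key is present in the request and matches."""
    for key, expected in condition.items():
        actual = request_ctx.get(key)
        if actual is None or actual != resolve(expected, request_ctx):
            return False
    return True


via_s3 = {"kms:ViaService": "s3.eu-west-1.amazonaws.com",
          "aws:PrincipalOrgID": "o-exampleorgid"}

# A call made through S3 carries kms:ViaService; a direct CLI call does not.
request_through_s3 = dict(via_s3)
request_direct = {"aws:PrincipalOrgID": "o-exampleorgid"}

abac = {"aws:ResourceTag/project": "${aws:PrincipalTag/project}"}
same_project = {"aws:ResourceTag/project": "payments",
                "aws:PrincipalTag/project": "payments"}
other_project = {"aws:ResourceTag/project": "payments",
                 "aws:PrincipalTag/project": "ledger"}

print(string_equals(via_s3, request_through_s3), string_equals(via_s3, request_direct))
print(string_equals(abac, same_project), string_equals(abac, other_project))
```

The direct call fails because the kms:ViaService key is simply absent from the request, which is why the condition blocks a human running the CLI without naming that human anywhere.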

The condition keys that turn a wide key policy into a tightly-scoped one — the most useful security lever in KMS:

Condition key | Pins the key to… | Example value | Effect
kms:ViaService | One AWS service only | s3.eu-west-1.amazonaws.com | Blocks a human running aws kms decrypt directly
aws:PrincipalOrgID | Your organization | o-exampleorgid | Confines cross-account use to the org
kms:EncryptionContext:<k> | A specific context value | tenant = acme | “Decrypt only acme’s data”
kms:EncryptionContextKeys | Required context keys present | ["tenant"] | Force callers to supply context
aws:SourceVpce | A specific VPC endpoint | vpce-0abc... | Only via the KMS interface endpoint
aws:PrincipalTag / aws:ResourceTag | Tag parity (ABAC) | project match | Access scales with tags, not ARNs
kms:GrantIsForAWSResource | AWS-service-created grants only | true | Limit who can create grants
kms:CallerAccount | A specific account | 444455556666 | Scope cross-account precisely

What to watch for in CloudTrail, and what each signal usually means:

CloudTrail signal | What it suggests | Action
Decrypt with no kms:ViaService | A human/role called KMS directly | Investigate; pin the key to a service if unexpected
Decrypt with mismatched encryptionContext | Wrong tenant/purpose, or probing | Alarm like an unexpected AssumeRole
Decrypt from an unexpected sourceIPAddress | Off-network access | Correlate with VPC endpoint / aws:SourceVpce
Spike in GenerateDataKey count | Per-object KMS storm | Enable Bucket Keys / caching; check throttling
PutKeyPolicy / ScheduleKeyDeletion | Sensitive admin change | High-priority alert; should be rare and reviewed
Decrypt AccessDenied bursts | Misconfig or probing | Check policy delegation / context / cross-account
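A few of these signals can be checked mechanically once CloudTrail records are loaded as dictionaries. The sketch below assumes the standard record fields eventSource, eventName, requestParameters.encryptionContext, sourceIPAddress and errorCode; the expected context, the allow-listed IP and the sample record are hypothetical, and matching on an "AccessDenied" prefix is an assumption about how the error code is spelled.

```python
def triage(event, expected_context, allowed_ips):
    """Return a list of findings for one CloudTrail record (a dict)."""
    if event.get("eventSource") != "kms.amazonaws.com":
        return []
    findings = []
    name = event.get("eventName")
    if name in ("PutKeyPolicy", "ScheduleKeyDeletion"):
        findings.append("sensitive admin change")
    if name == "Decrypt":
        params = event.get("requestParameters") or {}
        context = params.get("encryptionContext") or {}
        if context != expected_context:
            findings.append("encryption context mismatch")
        if event.get("sourceIPAddress") not in allowed_ips:
            findings.append("unexpected source IP")
        # Assumption: denied calls carry an errorCode starting with "AccessDenied".
        if str(event.get("errorCode", "")).startswith("AccessDenied"):
            findings.append("denied decrypt")
    return findings


# Hypothetical record shaped like a CloudTrail KMS event.
sample = {
    "eventSource": "kms.amazonaws.com",
    "eventName": "Decrypt",
    "sourceIPAddress": "203.0.113.7",
    "requestParameters": {"encryptionContext": {"tenant": "other"}},
}
print(triage(sample, expected_context={"tenant": "acme"}, allowed_ips={"10.0.0.5"}))
```

In practice the same checks would more likely live in a CloudWatch or EventBridge rule; the function form is only meant to show which fields carry each signal.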

8. Cost and request-quota management

KMS pricing has two parts: roughly $1 per CMK per month (replicas billed separately) and a per-request charge. Separately, cryptographic operations are capped by a request-rate quota that is shared per account and Region: symmetric Decrypt/GenerateDataKey/Encrypt draw on one shared quota (a few tens of thousands of requests/second depending on Region and key type). A hot path that calls KMS per object will hit ThrottlingException long before you expect.

The levers, in order of impact:

Lever | Effort | KMS-call reduction | Cost impact | When to use
S3 Bucket Keys | One toggle | Drastic (thousands → ~one/bucket) | Lowers bill + quota pressure | Every SSE-KMS bucket, always
Data-key caching | Code (Encryption SDK) | Large within the window | Lowers bill | High-volume app encryption
Quota increase | Service Quotas request | None (raises ceiling) | None directly | Before a launch, proactively
Backoff + jitter | Verify SDK retry config | None (smooths spikes) | None | Always, to survive transient throttles
Fewer, broader CMKs | Design | Indirect | Fewer $1/mo keys | Don’t over-fragment keys
Right key choice | Design | Indirect | Asymmetric/ECC ops cost more | Use symmetric unless you need asymmetric

The cost trap is never the $1/month per key. It is millions of unbatched Decrypt calls from a service that should have been using Bucket Keys or a data-key cache. Architect the call volume down first; optimize the bill second.
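A back-of-envelope calculation shows why call volume dominates. The figures here are assumptions to be checked against current pricing: $1 per key-month as stated above, and a list price of $0.03 per 10,000 requests with any free allowance ignored. The 500 calls/second workload is hypothetical; the 98% reduction mirrors the caching result described later in this article.

```python
def monthly_kms_cost(keys, requests, key_price=1.00, price_per_10k=0.03):
    """Estimated monthly USD cost: key storage plus metered requests."""
    return keys * key_price + (requests / 10_000) * price_per_10k


SECONDS_PER_MONTH = 30 * 24 * 3600

# Hypothetical hot path: one KMS call per object at a sustained 500 calls/second.
per_object = 500 * SECONDS_PER_MONTH
# Same workload after caching or Bucket Keys removes 98% of the calls.
cached = per_object // 50

print(f"per-object: ${monthly_kms_cost(1, per_object):,.2f}")
print(f"cached:     ${monthly_kms_cost(1, cached):,.2f}")
```

Under these assumptions the per-object design lands in four-figure territory each month while the cached design stays under $100, and the single key contributes $1 to either total.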

The throttling-relevant limits and what each means in practice:

Limit / quota | Rough value | Shared across | What hitting it looks like | Mitigation
Symmetric crypto request rate | Tens of thousands rps / Region | Encrypt/Decrypt/GenerateDataKey | ThrottlingException under load | Bucket Keys, caching, quota raise
Asymmetric crypto request rate | Lower than symmetric | Asymmetric ops pool | Throttling on heavy asymmetric use | Cache; prefer symmetric where possible
Direct Encrypt/Decrypt payload | ≤ 4 KB | per request | Validation error on large input | Envelope-encrypt instead
Grants per key | ~50,000 | per key | LimitExceeded on CreateGrant | Audit/retire orphaned grants
Aliases per key / account | Bounded | per account | LimitExceeded on CreateAlias | Reuse aliases; clean up
Keys per account/Region | Soft limit (raisable) | per Region | LimitExceeded on CreateKey | Consolidate; request increase
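The "backoff + jitter" mitigation is worth seeing in code, if only to verify what your SDK is already doing. This is a minimal sketch of exponential backoff with full jitter: ThrottlingError is a stand-in exception defined here, and the base, cap and attempt count are illustrative values. The AWS SDKs ship their own retry logic (in boto3, configurable through the client's retry settings), so in most applications the right move is to confirm that configuration rather than hand-roll this.

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for a KMS ThrottlingException in this sketch."""


def call_with_backoff(fn, max_attempts=5, base=0.05, cap=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on ThrottlingError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            # Full jitter: sleep a random time in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap, base * 2 ** attempt))


# Usage: a fake operation that is throttled twice and then succeeds.
state = {"calls": 0}


def flaky_decrypt():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ThrottlingError()
    return "plaintext-data-key"


print(call_with_backoff(flaky_decrypt, sleep=lambda seconds: None))
```

The jitter matters as much as the exponent: without it, every throttled client retries on the same schedule and the storm simply repeats.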

Error and limit reference

The KMS errors you will actually see, what they mean, how to confirm, and the fix. The non-obvious ones are AccessDeniedException when IAM looks fine (the key policy didn’t delegate), InvalidCiphertextException from a mismatched encryption context, and KMSInvalidStateException on a key pending deletion or disabled:

Error / exception | Meaning | Likely cause | How to confirm | Fix
AccessDeniedException | Caller not authorized | Key policy lacks IAM delegation, or no allow statement | get-key-policy (no EnableIAMRoot?); simulate-principal-policy | Add IAM delegation; grant the action/condition
AccessDeniedException (cross-acct) | One side of handshake missing | Key policy or consumer IAM doesn’t allow | Check both key policy and consumer IAM (full ARN) | Align both sides; scope with PrincipalOrgID
InvalidCiphertextException | Ciphertext/context invalid | Wrong/missing encryption context, or corrupted blob | Compare encrypt vs decrypt context byte-for-byte | Pass the exact same AAD map
ThrottlingException | Request-rate quota exceeded | Per-object KMS storm | CloudTrail call count; Service Quotas usage | Bucket Keys, caching; raise quota; backoff
KMSInvalidStateException | Key in a bad state for the op | Key disabled or PendingDeletion | describe-key → KeyState | Enable the key; cancel deletion
NotFoundException | Key/alias not found | Wrong Region, wrong ID, deleted | describe-key in the right Region | Use correct ARN/Region; recreate alias
DisabledException | Key is disabled | Someone disabled it | describe-key → Enabled: false | enable-key if intended
KMSInvalidSignatureException | Signature verify failed | Wrong key/algorithm or tampered data | Check signing key + SigningAlgorithm | Use the correct key/algorithm
IncorrectKeyException | Wrong CMK for this ciphertext | Decrypting with the wrong key | Match key ARN in the ciphertext metadata | Use the key that produced the ciphertext
LimitExceededException | A KMS limit hit | Too many grants/aliases/keys | list-grants / list-aliases counts | Retire/clean up; request a limit increase
DependencyTimeoutException | KMS internal timeout | Transient | Retry pattern in logs | Exponential backoff + jitter
MalformedPolicyDocumentException | Bad key-policy JSON | Syntax/principal error in PutKeyPolicy | Validate the policy document | Fix JSON; ensure principal exists

The key states a CMK moves through, because several errors above are just “wrong state for this operation”:

Key state | Can encrypt/decrypt? | How you got here | How to leave
Enabled | Yes | Normal | (n/a)
Disabled | No | disable-key | enable-key
PendingDeletion | No | schedule-key-deletion (7–30 day window) | cancel-key-deletion
PendingImport | No | External key material, not yet imported | Import material
Creating / Updating | Transient | Replication / external import in progress | Wait
Unavailable | No | Custom key store disconnected | Reconnect the key store

Architecture at a glance

The diagram traces the write-then-fail-over path the way bytes actually move, then maps the five places authorization or availability breaks. Read it left to right. A workload needs at-rest crypto, so it calls GenerateDataKey (via the AWS Encryption SDK, with key commitment on and a bounded data-key cache). That call passes through the three-layer authorization stack — the key policy (root of trust, with the EnableIAMRoot delegation), then IAM and grants (effective only because the key policy delegated, scoped by kms:ViaService), then the encryption context check (the AAD that must match byte-for-byte at decrypt). Only if every layer allows does the request reach the multi-Region CMK: an mrk-... primary in eu-west-1 whose material is replicated — not re-encrypted — to a decrypt-only replica in eu-central-1. KMS returns the data key plaintext-and-wrapped; the app encrypts gigabytes locally with AES-256-GCM, zeroes the plaintext, and stores only the ciphertext and the wrapped blob in S3 (Bucket Keys on) or behind a DynamoDB global table that replicates the ciphertext to the second Region.

Notice the two things the diagram is built to teach. First, the dashed replicate-material arrow looping the primary to the replica is the whole DR story: it carries key material, so the same envelope decrypts in either Region — which is exactly why a single-Region key behind a global table produces “replicated-but-unreadable” data (badge 3). Second, every path converges on CloudTrail (every Decrypt, with principal and context) and a quota guard (watching ThrottlingException and request-rate headroom) — the audit-and-protect footer. The five numbered badges sit on the precise hops where it breaks: the key policy locking you out (1), a wrong encryption context failing closed (2), replicated-but-unreadable DR (3), a per-object KMS storm and throttling (4), and an unexpected Decrypt that signals an authorization gap (5). The legend narrates each as symptom, confirm command, and fix.

Figure: AWS KMS envelope-encryption and multi-Region architecture, showing the GenerateDataKey path through the three-layer authorization stack to a multi-Region CMK (primary in eu-west-1, replica in eu-central-1), local AES-256-GCM encryption, storage in S3 and a DynamoDB global table, the CloudTrail and quota-guard footer, and the five numbered failure badges.

Real-world scenario

Northwind Pay, a fictional but representative pan-European payments platform, ran active-active in eu-west-1 and eu-central-1 behind a DynamoDB global table, with the application performing client-side field-level encryption on PAN (card number) data before writing. They had used a standard, Region-scoped CMK in eu-west-1. The global table dutifully replicated the ciphertext to eu-central-1 — about 240 million encrypted items, growing 3 million a day. The platform team was six engineers; the KMS bill was a rounding error and nobody had thought hard about it.

The first incident was a planned regional failover game-day. At 10:00 they drained eu-west-1; by 10:02 the eu-central-1 application could not decrypt a single replicated record. Every read of PAN data threw AccessDeniedException / NotFoundException from KMS. The ciphertext was a KMS envelope bound to a key that only existed in eu-west-1, and kms:Decrypt in eu-central-1 had no key to call. Replicated-but-unreadable data is the worst kind of DR failure: it looks perfectly healthy — the bytes are there, the table is green — until you cut over and discover the ability to read them never replicated. The game-day was aborted; the DR posture was, on paper, a lie.

The fix was a multi-Region key, and critically, a backfill — adopting an MRK does nothing for the 240 million records already wrapped under the old single-Region key. They created an MRK primary in eu-west-1, replicated it into eu-central-1, repointed the application’s key provider, and ran a controlled ReEncrypt migration (in batches, throttle-aware, idempotent on a per-item version flag) over the existing items so every envelope was re-wrapped under the portable key. The backfill ran for nine days at a deliberately capped rate to stay well under the request-rate quota. They also tightened each replica’s policy independently: the eu-central-1 replica granted the app Decrypt only, while writes (and thus GenerateDataKey) stayed pinned to the active primary, so a failover couldn’t silently start minting keys in the standby Region.

# Primary in the active Region
aws kms create-key --multi-region --region eu-west-1 \
  --description "pan-field-encryption primary"

# Replica in the standby Region (independent, decrypt-only policy attached after)
aws kms replicate-key \
  --key-id mrk-0a1b2c3d4e5f6a7b8 \
  --replica-region eu-central-1

A second, quieter incident surfaced during the same project. With the global table now genuinely portable, traffic doubled into both Regions and the payments service began throwing intermittent ThrottlingException on GenerateDataKey during the evening peak. The cause: client-side field encryption was calling GenerateDataKey per item, and at peak that exceeded the Region’s request-rate quota. They fixed it two ways: added data-key caching in the Encryption SDK (keyed per tenant, max_age=120s, max_messages_encrypted=500), which cut KMS calls by ~98%, and proactively raised the request-rate quota via Service Quotas ahead of the next quarter’s launch. The S3 archive of settlement files, separately, had Bucket Keys enabled in the same sweep — it had been making a GenerateDataKey call per object and was the single largest line on the (newly noticed) KMS bill.
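The effect of a bounded data-key cache is easy to demonstrate without any AWS calls. The class below is a toy, not the Encryption SDK's caching implementation: DataKeyCache and its generate callable are hypothetical, and only the two bounds (a 120-second maximum age and 500 messages per key) are taken from the scenario.

```python
import time


class DataKeyCache:
    """Toy per-tenant data-key cache bounded by age and by number of uses."""

    def __init__(self, generate, max_age=120.0, max_messages=500, clock=time.monotonic):
        self._generate = generate        # stand-in for a GenerateDataKey call
        self._max_age = max_age
        self._max_messages = max_messages
        self._clock = clock
        self._entries = {}               # tenant -> [data_key, created_at, uses]
        self.kms_calls = 0

    def get(self, tenant):
        now = self._clock()
        entry = self._entries.get(tenant)
        expired = (
            entry is None
            or now - entry[1] >= self._max_age
            or entry[2] >= self._max_messages
        )
        if expired:
            self.kms_calls += 1
            entry = [self._generate(tenant), now, 0]
            self._entries[tenant] = entry
        entry[2] += 1
        return entry[0]


# Usage: 1,000 encryptions for one tenant inside the age window.
cache = DataKeyCache(generate=lambda tenant: f"data-key-for-{tenant}")
for _ in range(1000):
    cache.get("acme")
print(cache.kms_calls)
```

Both bounds are security limits as much as performance knobs: they cap how much data a single data key protects and how long its plaintext lives in memory, so widening them trades blast radius for fewer KMS calls.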

The lesson the team wrote into their standards, in three lines: if a ciphertext can travel between Regions, the key that protects it must travel too — and you must re-encrypt the data that predates that decision. Replication moves bytes; it does not move the ability to read them. And: architect the KMS call volume down before you scale up the quota. The incident timeline, because the order of discovery is the lesson:

Time / phase | Symptom | Action taken | Effect | What it should have been
Game-day 10:02 | DR Region can’t decrypt anything | Abort failover | Outage avoided in prod, DR exposed | Pre-test decrypt in DR, not just replication
Day 0 | Root cause: single-Region key | describe-key → MultiRegion: false | Diagnosis confirmed | (n/a)
Day 0–1 | Plan the fix | Create MRK primary + replica | Key now portable | Standardize MRK for replicated data from day one
Day 1–10 | 240M legacy items still bound to old key | ReEncrypt backfill, throttle-capped | Every envelope re-wrapped | Backfill is mandatory, not optional
Day 5 | ThrottlingException at peak | Per-item GenerateDataKey found | Diagnosed quota pressure | Cache data keys from the start
Day 6 | Crush the call volume | Data-key caching + S3 Bucket Keys | KMS calls −98% | Bucket Keys on every SSE-KMS bucket
Day 7 | Pre-empt the next launch | Service Quotas request-rate increase | Headroom secured | Raise quota before incidents
+1 month | Re-run game-day | Failover with portable key + scoped replicas | Clean cutover, decrypt works | The DR posture is now real

Advantages and disadvantages

KMS’s design — keys that never leave the HSM, a resource policy as the root of trust, envelope encryption as the pattern — is what makes it both powerful and full of sharp edges. Weigh it honestly:

Advantages (why this model helps you) | Disadvantages (why it bites)
Key material never leaves FIPS 140-3 HSMs; you can’t accidentally export or leak it | You can’t hold the key either: every decrypt is a network call against a quota
The key policy is an authoritative, auditable allow-list independent of IAM sprawl | A key policy without IAM delegation means IAM is ignored (the classic lockout)
Envelope encryption keeps bulk plaintext out of KMS: fast, cheap, quota-friendly | You must implement the envelope correctly (use the SDK) or you’ll roll a vuln
Encryption context gives free, logged, byte-bound authorization and audit | A context mismatch fails closed with an unhelpful InvalidCiphertext
Multi-Region keys make ciphertext portable for real cross-Region DR | They weaken isolation by design, and don’t re-encrypt your legacy data
Grants delegate to services and short jobs without bloating the key policy | Orphaned grants accumulate (50k/key) and are eventually consistent
Every Decrypt is in CloudTrail with principal + context: top-tier telemetry | High-volume per-object calls flood CloudTrail and throttle before you expect
Automatic rotation is invisible and keeps old ciphertext readable | Rotation does not re-wrap stored data; compliance needs a separate job

The model is right whenever you need provable, auditable, centrally-controlled key management — which is almost every regulated or multi-account workload. It bites hardest on teams that treat the key policy like an IAM afterthought (lockouts), that call KMS per object (throttling, cost), that assume rotation re-encrypts data (compliance gap), or that replicate ciphertext cross-Region without making the key portable (the Northwind failure). Every disadvantage is manageable — but only if you know it exists, which is the entire point of this article.

Hands-on lab

Prove envelope encryption, multi-Region portability, encryption-context enforcement, and the key-policy-is-root-of-trust behaviour end to end, at near-zero cost (a CMK is ~$1/month prorated; one MRK replica adds a second; delete at the end and the cost is pennies). Run in CloudShell with credentials that can manage KMS. You will create a multi-Region key, round-trip a data key, prove cross-Region decrypt, and prove that the wrong encryption context fails closed.

Step 1 — Variables.

PRIMARY_REGION=eu-west-1
DR_REGION=eu-central-1
ACCT=$(aws sts get-caller-identity --query Account --output text)
echo "account=$ACCT primary=$PRIMARY_REGION dr=$DR_REGION"

Step 2 — Create a multi-Region primary key and an alias.

KEY_ID=$(aws kms create-key --multi-region --region $PRIMARY_REGION \
  --description "kms-lab mrk primary" \
  --query KeyMetadata.KeyId --output text)
aws kms create-alias --region $PRIMARY_REGION \
  --alias-name alias/kms-lab --target-key-id $KEY_ID
echo "primary key: $KEY_ID"

Expected: a KeyId beginning mrk-. Confirm it is multi-Region:

aws kms describe-key --region $PRIMARY_REGION --key-id alias/kms-lab \
  --query 'KeyMetadata.{MR:MultiRegion, State:KeyState}'
# -> "MR": true, "State": "Enabled"

Step 3 — Enable rotation on a 180-day cadence.

aws kms enable-key-rotation --region $PRIMARY_REGION \
  --key-id $KEY_ID --rotation-period-in-days 180
aws kms get-key-rotation-status --region $PRIMARY_REGION --key-id $KEY_ID
# -> "KeyRotationEnabled": true, "RotationPeriodInDays": 180

Step 4 — Replicate into the DR Region with a (decrypt-only) policy.

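# No --policy is passed, so the replica gets the default key policy in this lab.
# For a decrypt-only replica, pass --policy with a restricted key policy document.
aws kms replicate-key --region $PRIMARY_REGION \
  --key-id $KEY_ID --replica-region $DR_REGION \
  --description "kms-lab mrk replica"
# Replication is asynchronous: if the next commands report the key is still
# Creating, wait a few seconds and retry.
aws kms create-alias --region $DR_REGION \
  --alias-name alias/kms-lab --target-key-id $KEY_ID
aws kms describe-key --region $DR_REGION --key-id alias/kms-lab \
  --query 'KeyMetadata.{MR:MultiRegion, State:KeyState}'
# -> "MR": true, "State": "Enabled" in the DR Region too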

Step 5 — Envelope round-trip: mint a data key, then decrypt the wrapped blob.

WRAPPED=$(aws kms generate-data-key --region $PRIMARY_REGION \
  --key-id alias/kms-lab --key-spec AES_256 \
  --encryption-context tenant=acme \
  --query CiphertextBlob --output text)

# Decrypt the wrapped data key back, in the PRIMARY Region (must supply same context)
aws kms decrypt --region $PRIMARY_REGION \
  --ciphertext-blob fileb://<(echo "$WRAPPED" | base64 --decode) \
  --encryption-context tenant=acme \
  --query KeyId --output text
# -> returns the key ARN, proving decrypt authorization + correct context

Step 6 — Prove cross-Region portability (the DR claim). Decrypt the same wrapped blob by calling KMS in the DR Region:

aws kms decrypt --region $DR_REGION \
  --ciphertext-blob fileb://<(echo "$WRAPPED" | base64 --decode) \
  --encryption-context tenant=acme \
  --query KeyId --output text
# -> returns the DR-Region key ARN: the multi-Region key made the envelope portable

Step 7 — Prove encryption context fails closed. Decrypt with the wrong context — it must error:

aws kms decrypt --region $PRIMARY_REGION \
  --ciphertext-blob fileb://<(echo "$WRAPPED" | base64 --decode) \
  --encryption-context tenant=wrong 2>&1 | grep -i "InvalidCiphertext\|AccessDenied" \
  && echo "GOOD: wrong context correctly rejected"

Validation checklist. You created a multi-Region key, enabled rotation, replicated it, round-tripped a data key, decrypted the same envelope in two Regions (portability), and proved that the wrong encryption context fails closed. The lab steps mapped to what each proves:

Step | What you did | What it proves | Real-world analogue
2 | Create MRK primary | Key is multi-Region (mrk-, MultiRegion:true) | Standardizing portable keys for replicated data
3 | Enable rotation | Rotation is on and on your cadence | Compliance “rotate yearly” baseline
4 | Replicate to DR | Replica shares material, own policy | DR Region gets decrypt-only scope
5 | Data-key round-trip | Envelope encryption + correct context decrypts | Every app write/read path
6 | Cross-Region decrypt | Same envelope decrypts in DR Region | The DR claim, actually tested
7 | Wrong context rejected | Context is enforced authorization | “Decrypt only acme’s data”

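Teardown (avoid the ~$1/key/month charge). Schedule the replica for deletion first: a primary multi-Region key cannot start its own deletion window while a replica still exists, so it waits in PendingReplicaDeletion until the replica is gone.

# Remove the aliases and schedule deletion in BOTH Regions (7-day minimum window), replica first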
aws kms delete-alias --region $DR_REGION --alias-name alias/kms-lab
aws kms schedule-key-deletion --region $DR_REGION --key-id $KEY_ID --pending-window-in-days 7
aws kms delete-alias --region $PRIMARY_REGION --alias-name alias/kms-lab
aws kms schedule-key-deletion --region $PRIMARY_REGION --key-id $KEY_ID --pending-window-in-days 7

Cost note. Two keys for a few days, scheduled for deletion at the 7-day minimum, cost a fraction of the ~$1/key/month, well under ₹100 total. Crypto requests in this lab are a handful and effectively free.

Common mistakes & troubleshooting

This is the playbook — the part you bookmark. First a scannable table you can read mid-incident, then the full reasoning for the entries that bite hardest. Every row is symptom → root cause → confirm (exact command) → fix.

# | Symptom | Root cause | Confirm (exact cmd) | Fix
1 | DR Region can’t decrypt replicated data after failover | Single-Region CMK behind a cross-Region replicated store | aws kms describe-key --key-id <id> --query KeyMetadata.MultiRegion returns false | Adopt an MRK; replicate; ReEncrypt the backlog
2 | AccessDeniedException on Decrypt though IAM clearly allows it | Key policy omits IAM delegation → IAM ignored | aws kms get-key-policy shows no EnableIAMRoot statement | Add the account-root kms:* delegation statement
3 | InvalidCiphertextException on a blob that decrypted before | Encryption context mismatch at decrypt | Diff the encrypt-time vs decrypt-time --encryption-context | Pass the exact same AAD map, byte-for-byte
4 | ThrottlingException on a hot path; surprising KMS bill | Per-object GenerateDataKey/Decrypt | CloudTrail event count per minute; Service Quotas usage | S3 Bucket Keys + data-key caching; raise quota
5 | Cross-account read fails with generic AccessDenied | Only one side of the handshake granted | Check key policy AND consumer IAM (full key ARN) | Align both sides; scope with aws:PrincipalOrgID
6 | Can’t share an encrypted EBS snapshot cross-account | Snapshot encrypted with an AWS-managed key | describe-snapshots shows a KMS key that resolves to aws/ebs | Re-encrypt to a CMK; share + grant Decrypt+CreateGrant
7 | Rotated the key but auditors say “data not re-encrypted” | Assumed rotation re-wraps stored data (it doesn’t) | get-key-rotation-status on, but no backfill ran | Run an application-driven ReEncrypt job
8 | KMSInvalidStateException on every operation | Key is disabled or PendingDeletion | aws kms describe-key --query KeyMetadata.KeyState | enable-key / cancel-key-deletion
9 | Service (ASG/EBS) can’t mint data keys | Service grant missing or revoked | aws kms list-grants --key-id <id> for the service principal | Let the service recreate the grant; don’t revoke service grants
10 | Unexpected Decrypt from a human running aws kms decrypt | Key not pinned to a service | CloudTrail event has no kms:ViaService | Add kms:ViaService condition; alarm on direct decrypt
11 | IncorrectKeyException decrypting an old object | Decrypting with the wrong CMK (alias repointed) | Compare the key ARN in the ciphertext vs the alias target | Use the key that produced the ciphertext; keep old key alive
12 | Deleted a key and lost data | ScheduleKeyDeletion ran; material gone after window | CloudTrail ScheduleKeyDeletion; key PendingDeletion/gone | cancel-key-deletion within the window; use long windows
13 | Asymmetric Verify fails on a valid signature | Wrong SigningAlgorithm or wrong public key | Check SigningAlgorithm + the verifying key | Match algorithm + key; re-download public key
14 | Grant “works” then intermittently AccessDenied | Grant eventual consistency, no GrantToken used | Compare grant creation time vs first use | Pass the returned GrantToken on immediate use

The expanded form, for the entries that cause the most lost hours:

1. The DR Region cannot decrypt replicated data after a failover. Root cause: A single-Region CMK sits behind a cross-Region replicated store (DynamoDB global table, S3 CRR of SSE-KMS objects, cross-Region snapshot copies). The bytes replicate; the key does not. Confirm: aws kms describe-key --key-id <id> --query 'KeyMetadata.{MR:MultiRegion}' returns false; the ciphertext’s key ARN names the source Region only. Fix: Create a multi-Region key, replicate-key into the DR Region, repoint the app’s key provider, and run a ReEncrypt backfill over pre-existing ciphertext. Test decrypt in DR, not just replication, in every game-day.

2. AccessDeniedException on Decrypt even though the IAM policy plainly allows kms:Decrypt. Root cause: The key policy omits the IAM-delegation statement, so IAM is ignored and the key policy is the only authority — and it doesn’t list this principal. Confirm: aws kms get-key-policy --key-id <id> --policy-name default shows no kms:* to :root statement (no EnableIAMRoot). Fix: Add the account-root delegation statement so IAM is honored; then the existing IAM allow works. This is the most common KMS lockout.

3. InvalidCiphertextException on a blob that decrypted yesterday. Root cause: Encryption context mismatch. The context passed at decrypt differs (a key, a value, or a missing pair) from encrypt — it’s required byte-for-byte. Confirm: Diff the --encryption-context (or SDK encryption_context) used at encrypt vs decrypt; check whether a kms:EncryptionContext: policy condition changed. Fix: Pass the exact same AAD map. Treat context as required authorization, not optional metadata; store/propagate it alongside the ciphertext.

4. ThrottlingException on a hot path, and a four-figure KMS bill nobody expected. Root cause: One KMS call per object — per-object GenerateDataKey on writes or Decrypt on reads — exceeding the Region’s shared request-rate quota and metering every call. Confirm: CloudTrail shows one GenerateDataKey/Decrypt per object; Service Quotas → KMS shows you near the request-rate cap. Fix: Enable S3 Bucket Keys (collapses thousands of calls to ~one per bucket), add data-key caching in the Encryption SDK, verify backoff+jitter, and raise the request-rate quota ahead of launches.

5. A cross-account read fails with a bare AccessDenied and no hint which side. Root cause: The cross-account handshake is two-sided and one side is missing — either the owner’s key policy doesn’t allow the foreign principal, or the consumer’s IAM doesn’t allow the KMS action on the full key ARN. Confirm: Read both the owning account’s key policy (get-key-policy) and the consumer’s IAM policy; the consumer must reference the full key ARN. Fix: Grant both sides; scope the key policy with aws:PrincipalOrgID rather than a bare *.

6. You cannot share an encrypted EBS snapshot with another account. Root cause: The snapshot is encrypted with the AWS-managed aws/ebs key, which cannot be shared cross-account, full stop. Confirm: aws ec2 describe-snapshots --snapshot-ids <id> --query 'Snapshots[].KmsKeyId' returns the key ARN; aws kms describe-key on that ARN shows KeyManager: AWS, which marks it as the AWS-managed key behind alias/aws/ebs. Fix: Copy the snapshot to one encrypted with a CMK, share the snapshot, grant the target account Decrypt + CreateGrant, and let the target re-encrypt to their key on copy.

7. The key was rotated, but a compliance review fails because “the data wasn’t re-encrypted.” Root cause: A misunderstanding — automatic rotation rotates key material, not your stored ciphertext. Old envelopes are unwrapped with retained old material. Confirm: get-key-rotation-status shows rotation enabled, but there’s no ReEncrypt job in your pipeline. Fix: If the regime requires fresh material on existing data, run a separate, application-driven ReEncrypt backfill. Don’t claim rotation does it.
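
The confirm step in item 2 is easy to script. What follows is a minimal Python sketch, assuming the policy JSON has already been fetched (for example with aws kms get-key-policy) and parsed; the function name has_iam_delegation, the sample policy, and the account ID 111122223333 are all hypothetical, and the check only recognizes the standard aws partition.

```python
import json


def _as_list(value):
    """Policy fields may be a single string or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def has_iam_delegation(policy: dict, account_id: str) -> bool:
    """Return True if an unconditional Allow grants kms:* to the account root."""
    root_arn = f"arn:aws:iam::{account_id}:root"
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow" or stmt.get("Condition"):
            continue
        principal = stmt.get("Principal")
        if not isinstance(principal, dict):
            continue  # e.g. "*"; not the account-root delegation we look for
        principals = _as_list(principal.get("AWS"))
        actions = _as_list(stmt.get("Action"))
        # A bare account ID in Principal is shorthand for the account root.
        if (root_arn in principals or account_id in principals) and "kms:*" in actions:
            return True
    return False


# Hypothetical policy text: it lists one role and has no delegation statement.
policy_text = """
{"Version": "2012-10-17", "Statement": [
  {"Sid": "AppRoleOnly", "Effect": "Allow",
   "Principal": {"AWS": "arn:aws:iam::111122223333:role/app"},
   "Action": ["kms:Decrypt"], "Resource": "*"}
]}
"""
print(has_iam_delegation(json.loads(policy_text), "111122223333"))
```

The sketch is deliberately strict: it looks only for the full kms:* delegation. A policy that delegates a narrower action list to the root also makes IAM effective for those actions, so treat a False result as a prompt to read the policy, not as proof of a lockout.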

Best practices

Security notes

The security controls that also prevent the operational incidents — secure and resilient pull the same way here:

| Control | Mechanism | Secures against | Also prevents |
|---|---|---|---|
| IAM-delegation statement | kms:* to :root in key policy | Permanent lockout | “AccessDenied though IAM allows” tickets |
| kms:ViaService condition | Pin key to one service | Direct human/role decrypt | Unexpected Decrypt audit noise |
| Encryption context + condition | AAD bound + kms:EncryptionContext: | Cross-tenant decrypt | Silent wrong-context reuse |
| aws:PrincipalOrgID | Org-scoped cross-account | External-account access | Over-broad sharing blast radius |
| Decrypt-only replica policy | Narrow MRK replica key policy | Standby Region minting keys | Divergent failover behaviour |
| Long deletion window + review | --pending-window-in-days 30 | Accidental/malicious key deletion | Unrecoverable data loss |
| S3 Bucket Keys | Bucket-level data-key caching | (cost/quota) | Per-object KMS storm + throttling |
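
To make the kms:ViaService and aws:PrincipalOrgID rows concrete, here is a minimal Python sketch that assembles one key-policy statement combining both conditions. The helper name via_service_org_statement, the Sid, the action list, and the organization ID o-exampleorgid are illustrative assumptions, not values from a real account.

```python
import json


def via_service_org_statement(service_endpoint: str, org_id: str) -> dict:
    """Build one key-policy statement usable only via a service, within an org."""
    return {
        "Sid": "AllowUseViaServiceWithinOrg",  # hypothetical statement ID
        "Effect": "Allow",
        # A wildcard principal is acceptable here only because the
        # conditions below narrow it to the organization.
        "Principal": {"AWS": "*"},
        "Action": ["kms:Decrypt", "kms:GenerateDataKey*"],
        "Resource": "*",
        "Condition": {
            # Keys inside one condition block are ANDed: both must match.
            "StringEquals": {
                "kms:ViaService": service_endpoint,
                "aws:PrincipalOrgID": org_id,
            }
        },
    }


statement = via_service_org_statement("s3.eu-west-1.amazonaws.com", "o-exampleorgid")
print(json.dumps(statement, indent=2))
```

The printed JSON would sit in the key policy's Statement array next to the IAM-delegation statement. Because both keys share one StringEquals block, a caller inside the organization who calls KMS directly, rather than through S3, still fails the condition.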

Cost & sizing

KMS pricing has two components and one trap. The components: roughly $1 per CMK per month (each MRK replica is billed as its own key, so a primary + one replica ≈ $2/month), plus per-request charges for cryptographic operations (a small fee per 10,000 requests, with asymmetric and the heavier GenerateDataKeyPair operations costing more). The trap: the bill is never the $1/month per key — it is millions of unbatched Decrypt/GenerateDataKey calls from a service that should have used S3 Bucket Keys or a data-key cache.

A rough monthly picture for a mid-size regulated workload — and what each line actually buys:

| Cost driver | What you pay for | Rough INR / month | What it buys | Watch-out |
|---|---|---|---|---|
| 10 CMKs (single-Region) | Per-key monthly charge | ~₹850 | Clean policy/blast-radius boundaries | Don’t fragment to a key per object |
| 1 MRK primary + 1 replica | Two keys, two Regions | ~₹170 | Cross-Region ciphertext portability (DR) | Replica is a second billed key |
| Crypto requests (with Bucket Keys + caching) | Per-10k requests, drastically reduced | ~₹400–1,500 | The actual encrypt/decrypt work | Without Bucket Keys this can be 50–100× |
| Crypto requests (naive, per-object) | Per-10k requests, unbatched | ~₹40,000+ | (same work, done wrong) | The classic surprise bill |
| Quota increase | Service Quotas request | Free | Headroom before a launch | Request early; approval isn’t instant |
| CloudHSM (if mandated) | Dedicated HSM hours | ~₹1,20,000+ | Single-tenant FIPS HSM | Only for hard HSM mandates |

The sizing rule in one line: the only KMS number that ever surprises a CFO is request volume. Enable Bucket Keys and caching, raise the quota proactively, and the bill stays in the low thousands of rupees; skip them and a per-object hot path turns a rounding error into a five-figure line.
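
The arithmetic behind that rule fits in a few lines. This is a rough Python sketch, not a pricing calculator: the $1 per key comes from the figures above, while the $0.03 per 10,000 requests rate and both request volumes are assumptions chosen for illustration, and any free-tier allowance is ignored. Check the current pricing page for your Region before relying on the numbers.

```python
def kms_monthly_usd(keys: int, requests: int,
                    usd_per_key: float = 1.00,
                    usd_per_10k_requests: float = 0.03) -> float:
    """Estimate a monthly KMS bill: flat per-key charge plus metered requests.

    Both default rates are assumptions for illustration only.
    """
    return keys * usd_per_key + (requests / 10_000) * usd_per_10k_requests


# Hypothetical workload: the same 10 keys, with and without batching.
naive = kms_monthly_usd(keys=10, requests=200_000_000)   # one call per object
batched = kms_monthly_usd(keys=10, requests=2_000_000)   # assumed 100x fewer calls
print(f"naive: ${naive:,.2f}  batched: ${batched:,.2f}")
```

The per-key term is identical in both cases; only the request term moves, which is why request volume is the number to watch.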

Interview & exam questions

1. Why does KMS “never encrypt your data,” and what does it encrypt instead? KMS is a wrapping and authorization service, not a bulk cipher: key material lives in FIPS 140-3 HSMs and never leaves. For anything over 4 KB you call GenerateDataKey, which returns a data key both in plaintext and wrapped under your CMK; you encrypt the payload locally with the plaintext key, discard it, and store only the ciphertext plus the wrapped blob. KMS protects the key that protects the data — envelope encryption.

2. A Decrypt is denied even though the caller’s IAM policy clearly allows kms:Decrypt. Why, and how do you confirm? The key policy almost certainly omits the IAM-delegation statement (kms:* to the account root), so IAM is ignored and the key policy is the sole authority — and it doesn’t list this principal. Confirm with aws kms get-key-policy; the absence of an EnableIAMRoot-style statement is the smoking gun. Fix by adding the delegation so IAM is honored.

3. A DynamoDB global table replicates client-side-encrypted data to a second Region, but the DR Region can’t decrypt it. What happened and how do you fix it? The data was encrypted under a single-Region CMK; replication moved the ciphertext but not the key, so kms:Decrypt in the DR Region has no key to call. Fix with a multi-Region key (replicate it into the DR Region) and a ReEncrypt backfill of the pre-existing ciphertext — adopting the MRK alone does nothing for data already wrapped under the old key.

4. Explain the three-layer KMS authorization model and which layer is authoritative. The key policy (resource policy on the key) is the root of trust and is authoritative; IAM policies are effective only if the key policy delegates to IAM; grants are programmatic, temporary, fine-grained delegations for services and short-lived jobs. Unlike S3, IAM alone cannot grant KMS access — the key policy must enable it.

5. What is encryption context, and what is it good for? It’s additional authenticated data (AAD): not secret, not encrypted, but bound to the ciphertext, required byte-for-byte at decrypt, and logged in CloudTrail. It’s the cheapest authorization and audit tool in KMS — constrain it with kms:EncryptionContext: conditions to say “this role can decrypt only tenant=acme data,” and read it in CloudTrail to see exactly what was decrypted.

6. Does automatic key rotation re-encrypt your existing data? If not, what does? No — automatic rotation generates new key material and retains the old, so new writes use fresh material while old ciphertext is still unwrapped with retained material; your stored data and data keys are untouched. To actually re-wrap existing data (a compliance requirement in some regimes) you run a separate, application-driven ReEncrypt backfill.

7. How do you share an encrypted EBS snapshot across accounts, and why won’t the default key work? You cannot share a snapshot encrypted with the AWS-managed aws/ebs key — it’s not shareable cross-account. Use a customer-managed key, share the snapshot, grant the target account Decrypt + CreateGrant on the key, and the target re-encrypts to their key on copy. This is why production fleets standardize on CMKs for EBS.

8. A hot path is throwing ThrottlingException and the KMS bill spiked. Diagnose and fix. The path is making one KMS call per object (GenerateDataKey/Decrypt), exceeding the Region’s shared request-rate quota and metering every call. Confirm via CloudTrail event counts and Service Quotas usage. Fix by enabling S3 Bucket Keys, adding data-key caching in the Encryption SDK, verifying backoff+jitter, and raising the request-rate quota ahead of launches.

9. When do you reach for a grant instead of editing the key policy? For AWS services and short-lived workloads — a grant is issued by API, carries its own operation list and encryption-context constraints, is retirable the instant the work is done, and doesn’t bloat the resource policy. It’s exactly how services like Auto Scaling and EBS mint data keys on your behalf. Prefer grants for transient/service access; key-policy edits for the durable allow-list.

10. How do you confine a KMS key so only a specific service in your org can use it? Add two condition keys to the policy: kms:ViaService (e.g. s3.eu-west-1.amazonaws.com) so the key is usable only through that service and not by a human running aws kms decrypt, and aws:PrincipalOrgID so cross-account use is confined to your organization. aws:SourceVpce is a separate control for direct KMS API calls: it pins them to your KMS interface endpoint, but requests that an integrated service makes on your behalf may not arrive through that endpoint, so keep it out of statements that rely on kms:ViaService.

11. What’s the difference between a single-Region and a multi-Region key, and when is each correct? A single-Region key is Region-locked — its ciphertext decrypts only in that Region, and it offers the strongest isolation. A multi-Region key shares identical material across a primary and replicas so the same envelope decrypts in any of them. Use single-Region by default; use multi-Region only where ciphertext must cross Regions (global tables, cross-Region DR), accepting weaker isolation as the trade.

12. You scheduled a key for deletion. What are the risks and the safety net? Deleting a CMK destroys the material after a 7–30 day waiting window, rendering all ciphertext under it permanently unreadable. The safety net is cancel-key-deletion within the window, and the discipline is long windows (close to 30 days) plus multi-party review, because ScheduleKeyDeletion is a data-destroying action.
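
Question 5 says the context must match exactly at decrypt, and chasing a mismatch usually means comparing two maps by eye. This small Python sketch, assuming both contexts are available as plain dicts, reports which pairs differ; the function name diff_encryption_context and the sample tenant values are hypothetical.

```python
def diff_encryption_context(at_encrypt: dict, at_decrypt: dict) -> dict:
    """Report how the decrypt-time context differs from the encrypt-time one.

    Comparison is exact and case-sensitive, mirroring how the context
    is matched at decrypt.
    """
    missing = sorted(k for k in at_encrypt if k not in at_decrypt)
    extra = sorted(k for k in at_decrypt if k not in at_encrypt)
    changed = sorted(
        k for k in at_encrypt
        if k in at_decrypt and at_encrypt[k] != at_decrypt[k]
    )
    return {"missing": missing, "extra": extra, "changed": changed}


# Hypothetical contexts: one pair dropped, one value differing only by case.
used_at_encrypt = {"tenant": "acme", "purpose": "invoices"}
used_at_decrypt = {"tenant": "Acme"}
print(diff_encryption_context(used_at_encrypt, used_at_decrypt))
```

An empty result in all three lists means the two maps are identical, so the cause lies elsewhere, for example in the ciphertext itself or in the key used.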

These map across several certs: AWS Certified Security – Specialty (SCS-C02) covers KMS authorization, grants, encryption context, cross-account, and CloudTrail auditing in depth; Solutions Architect – Professional (SAP-C02) covers multi-Region keys and DR portability; Solutions Architect – Associate (SAA-C03) covers envelope encryption, SSE integrations, and key types. A compact cert-mapping:

| Question theme | Primary cert | Objective area |
|---|---|---|
| Key policy vs IAM vs grants | Security – Specialty | Identity & access management; data protection |
| Encryption context, kms:ViaService | Security – Specialty | Data protection; logging & monitoring |
| Multi-Region keys, DR portability | SA Professional | Design for resilience / continuity |
| Envelope encryption, key types, SSE | SA Associate | Secure architectures; storage encryption |
| Quotas, Bucket Keys, cost | SA Associate / Specialty | Cost-optimized & performant architectures |
| Rotation, deletion windows, audit | Security – Specialty | Data protection; incident response |

Quick check

  1. A Decrypt is denied even though the caller’s IAM allows kms:Decrypt. What is the most likely cause, and the one command to confirm it?
  2. A global table replicated your encrypted data to a DR Region, but the DR Region can’t decrypt it. What single property of the key explains this, and what two-part fix is required?
  3. True or false: enabling automatic key rotation re-encrypts your existing stored data under the new material.
  4. You’re seeing ThrottlingException on an S3-backed hot path and a surprising KMS bill. Name the single biggest lever to fix both.
  5. You want a CMK that can only be used through S3 and only by principals in your organization. Which two condition keys do you add?

Answers

  1. The key policy omits the IAM-delegation statement, so IAM is ignored and the key policy is authoritative — and it doesn’t list this principal. Confirm with aws kms get-key-policy --key-id <id> --policy-name default; the missing kms:*-to-:root (EnableIAMRoot) statement is the cause. Add it.
  2. The key is single-Region (MultiRegion: false) — replication moved the ciphertext but not the key. The fix is two parts: adopt a multi-Region key (replicate it into the DR Region) and run a ReEncrypt backfill of the pre-existing ciphertext; the MRK alone does nothing for data already wrapped under the old key.
  3. False. Rotation generates new material and retains the old, so new writes use fresh material while old ciphertext is unwrapped with retained material — your stored data is untouched. Re-wrapping existing data needs a separate, application-driven ReEncrypt job.
  4. Enable S3 Bucket Keys. S3 then reuses a short-lived bucket-level key instead of calling KMS for every object, which AWS documents as cutting KMS request traffic by up to 99%; that shrinks the request line of the bill and relieves the request-rate quota that’s causing the throttling. (Add data-key caching and a quota increase as follow-ups.)
  5. kms:ViaService (e.g. s3.eu-west-1.amazonaws.com) to pin the key to S3 only — blocking a human running aws kms decrypt — and aws:PrincipalOrgID to confine use to your organization.
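
The data-key caching named as a follow-up in answer 4 comes down to reusing one data key for a bounded time and a bounded number of uses. The Python sketch below shows only that bounding logic, under loud assumptions: DataKeyCache and fake_generate are hypothetical names, the generator is a stand-in that never calls KMS, and a real cache would keep the wrapped copy of the key next to the plaintext one. In production the Encryption SDK's own caching is the thing to use.

```python
import time
from typing import Callable, Dict, Tuple


class DataKeyCache:
    """Reuse one data key per encryption context until it is too old or too used."""

    def __init__(self, generate: Callable[[dict], bytes],
                 max_age_s: float = 300.0, max_uses: int = 1000,
                 clock: Callable[[], float] = time.monotonic):
        self._generate = generate
        self._max_age_s = max_age_s
        self._max_uses = max_uses
        self._clock = clock
        # cache key -> [data_key, created_at, uses]
        self._entries: Dict[Tuple, list] = {}

    def get(self, context: dict) -> bytes:
        cache_key = tuple(sorted(context.items()))
        entry = self._entries.get(cache_key)
        now = self._clock()
        expired = (
            entry is None
            or now - entry[1] >= self._max_age_s
            or entry[2] >= self._max_uses
        )
        if expired:
            # Only this branch would cost a KMS request in a real system.
            entry = [self._generate(context), now, 0]
            self._entries[cache_key] = entry
        entry[2] += 1
        return entry[0]


calls = []


def fake_generate(context: dict) -> bytes:
    """Stand-in for a GenerateDataKey call; records each invocation."""
    calls.append(dict(context))
    return bytes([len(calls)]) * 32


cache = DataKeyCache(fake_generate, max_uses=3)
for _ in range(7):
    cache.get({"tenant": "acme"})
print(len(calls))  # 7 encryptions needed 3 key generations
```

The two bounds are the design choice that matters: a longer age or a higher use count saves more requests but widens how much data a single leaked data key would expose.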

Glossary

Next steps

You can now architect KMS as an authorization system with a latency budget and a quota — pick key types deliberately, scope authorization to the exact caller and context, make ciphertext portable where it travels, and crush per-object call volume. Build outward:

Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
