Cloud KMS in Depth: CMEK, Envelope Encryption, Cloud HSM, and External Key Manager

Every byte at rest in GCP is already encrypted with Google-managed keys you never see. So why does anyone bother with Cloud KMS? Because “encrypted by default” answers the wrong question. The question auditors, regulators, and your own incident-response team actually ask is: who can revoke access to the plaintext, and how fast? With default encryption the answer is “Google, and you have no lever.” Customer-Managed Encryption Keys (CMEK) put that lever in your hands — disable one key version and a petabyte of BigQuery becomes ciphertext nobody can read until you re-enable it. This guide builds the full picture: the key hierarchy, how CMEK actually wires into services, the envelope-encryption mechanics underneath, rotation, the Cloud HSM and EKM boundaries, and the separation-of-duties controls that stop a single admin from destroying it all.

1. KMS concepts: key rings, keys, versions, protection levels

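Cloud KMS is built on four concepts, and getting the vocabulary exact saves you from IAM and rotation mistakes later:

  1. Key ring: a named grouping of keys pinned to one location. It is a convenient IAM boundary and cannot be deleted once created.
  2. Key: the logical, named object that services and IAM bindings reference. Its purpose (for example, symmetric encryption) is fixed at creation.
  3. Key version: the actual key material. Each version has its own state (enabled, disabled, scheduled for destruction, destroyed), and one version is the primary used for new encryption.
  4. Protection level: where the cryptographic operations run: software, hsm (Cloud HSM), or external (Cloud EKM).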

Create a ring and a software key:

PROJECT=sec-kms-prod
LOCATION=us-central1

gcloud kms keyrings create app-keyring \
  --project="$PROJECT" --location="$LOCATION"

gcloud kms keys create gcs-cmek \
  --project="$PROJECT" --location="$LOCATION" \
  --keyring=app-keyring \
  --purpose=encryption \
  --protection-level=software \
  --rotation-period=90d \
  --next-rotation-time="$(date -u -v+90d +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -d '+90 days' +%Y-%m-%dT%H:%M:%SZ)"

The resource name you will paste everywhere is the fully-qualified key path: projects/PROJECT/locations/LOCATION/keyRings/RING/cryptoKeys/KEY. CMEK bindings reference the key, not a version — KMS always encrypts with the current primary and can decrypt with any enabled version.
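Because that path gets pasted into so many flags and configs, it can help to build and validate it in one place. The following is a minimal sketch in Python; the helper names key_path and parse_key_path are illustrative assumptions, not part of any Google library.

```python
import re

# Hypothetical helpers for illustration; not part of the google-cloud-kms library.
_KEY_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)"
    r"/keyRings/(?P<ring>[^/]+)/cryptoKeys/(?P<key>[^/]+)$"
)


def key_path(project: str, location: str, ring: str, key: str) -> str:
    """Build the fully-qualified CryptoKey name used in CMEK bindings."""
    return f"projects/{project}/locations/{location}/keyRings/{ring}/cryptoKeys/{key}"


def parse_key_path(path: str) -> dict:
    """Split a key name into its parts; version-level names are rejected."""
    match = _KEY_RE.match(path)
    if match is None:
        raise ValueError(f"not a CryptoKey resource name: {path!r}")
    return match.groupdict()


name = key_path("sec-kms-prod", "us-central1", "app-keyring", "gcs-cmek")
parts = parse_key_path(name)
```

Rejecting names that end in /cryptoKeyVersions/N is deliberate here: a CMEK binding that accidentally carries a version suffix is a common copy-paste mistake.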

2. Wiring CMEK into GCS, BigQuery, Cloud SQL, and Persistent Disk

The mechanism is consistent across services: each service runs a service agent (a Google-managed service account created for your project), and you grant that agent the roles/cloudkms.cryptoKeyEncrypterDecrypter role on the key. The service agent, not your user identity, calls KMS at write and read time. Get the IAM grant wrong and resource creation fails with a permission error naming the agent, which trips people up because the error is about an identity they did not create.

Cloud Storage. The Storage service agent is service-PROJECTNUMBER@gs-project-accounts.iam.gserviceaccount.com:

PROJECT_NUMBER=$(gcloud projects describe "$PROJECT" --format='value(projectNumber)')
KEY=projects/$PROJECT/locations/$LOCATION/keyRings/app-keyring/cryptoKeys/gcs-cmek

# Force the agent to exist, then grant it
gcloud storage service-agent --project="$PROJECT"

gcloud kms keys add-iam-policy-binding gcs-cmek \
  --project="$PROJECT" --location="$LOCATION" --keyring=app-keyring \
  --member="serviceAccount:service-${PROJECT_NUMBER}@gs-project-accounts.iam.gserviceaccount.com" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"

# Set a default CMEK on the bucket: every new object is wrapped with it
gcloud storage buckets update gs://my-cmek-bucket --default-encryption-key="$KEY"

BigQuery. Grant the BigQuery service agent, then set a default key on the dataset (and/or per-table). The agent is bq-PROJECTNUMBER@bigquery-encryption.iam.gserviceaccount.com:

gcloud kms keys add-iam-policy-binding bq-cmek \
  --project="$PROJECT" --location="$LOCATION" --keyring=app-keyring \
  --member="serviceAccount:bq-${PROJECT_NUMBER}@bigquery-encryption.iam.gserviceaccount.com" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"

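# Reference the same key the agent was granted on (bq-cmek, not the GCS key in $KEY)
bq update \
  --default_kms_key="projects/$PROJECT/locations/$LOCATION/keyRings/app-keyring/cryptoKeys/bq-cmek" \
  "$PROJECT:analytics_ds"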

Cloud SQL. The Cloud SQL service agent gets the grant, and the key is set at instance creation — you cannot retrofit CMEK onto an existing instance, you must create a new one (typically restore from backup into a CMEK instance):

SQL_SA="service-${PROJECT_NUMBER}@gcp-sa-cloud-sql.iam.gserviceaccount.com"
gcloud kms keys add-iam-policy-binding sql-cmek \
  --project="$PROJECT" --location="$LOCATION" --keyring=app-keyring \
  --member="serviceAccount:${SQL_SA}" \
  --role="roles/cloudkms.cryptoKeyEncrypterDecrypter"

gcloud sql instances create pg-cmek \
  --project="$PROJECT" --region="$LOCATION" \
  --database-version=POSTGRES_16 --edition=ENTERPRISE \
  --tier=db-custom-2-8192 \
  --disk-encryption-key="projects/$PROJECT/locations/$LOCATION/keyRings/app-keyring/cryptoKeys/sql-cmek"

Persistent Disk / Compute. The Compute Engine service agent is service-PROJECTNUMBER@compute-system.iam.gserviceaccount.com. Grant it the same role on disk-cmek as in the examples above; the disk then takes the key at create time:

gcloud compute disks create data-disk \
  --project="$PROJECT" --zone="${LOCATION}-a" --size=200 \
  --kms-key="projects/$PROJECT/locations/$LOCATION/keyRings/app-keyring/cryptoKeys/disk-cmek"

The Terraform shape for the binding is identical regardless of service — grant the agent, then reference the key:

resource "google_kms_crypto_key_iam_member" "gcs_agent" {
  crypto_key_id = google_kms_crypto_key.gcs_cmek.id
  role          = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
  member        = "serviceAccount:service-${data.google_project.p.number}@gs-project-accounts.iam.gserviceaccount.com"
}

resource "google_storage_bucket" "data" {
  name                        = "my-cmek-bucket"
  location                    = "US"
  uniform_bucket_level_access = true
  encryption {
    default_kms_key_name = google_kms_crypto_key.gcs_cmek.id
  }
  depends_on = [google_kms_crypto_key_iam_member.gcs_agent]
}

That depends_on matters: without the IAM binding in place first, bucket creation with CMEK races and fails.
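The four service agent addresses quoted in this section follow fixed patterns, so a small lookup can generate the IAM member string for each grant. This is a sketch only: the function names and the short service labels ("storage", "bigquery", and so on) are assumptions made for illustration, and the patterns are copied from the text above rather than queried from any API.

```python
# Service agent address patterns as quoted in this section.
# The dictionary keys are illustrative labels, not official service names.
AGENT_PATTERNS = {
    "storage": "service-{n}@gs-project-accounts.iam.gserviceaccount.com",
    "bigquery": "bq-{n}@bigquery-encryption.iam.gserviceaccount.com",
    "cloudsql": "service-{n}@gcp-sa-cloud-sql.iam.gserviceaccount.com",
    "compute": "service-{n}@compute-system.iam.gserviceaccount.com",
}
CMEK_ROLE = "roles/cloudkms.cryptoKeyEncrypterDecrypter"


def agent_member(service: str, project_number: int) -> str:
    """Return the IAM member string for a service agent."""
    return "serviceAccount:" + AGENT_PATTERNS[service].format(n=project_number)


def cmek_binding(service: str, project_number: int) -> dict:
    """Return an IAM binding (role plus members) for the key's policy."""
    return {"role": CMEK_ROLE, "members": [agent_member(service, project_number)]}


binding = cmek_binding("bigquery", 123456789012)
```

The project number 123456789012 is a placeholder. Note that the patterns take the project number, not the project ID, which is why the shell examples look it up first.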

3. Envelope encryption: DEKs, KEKs, and the encrypt/decrypt flow

CMEK at the service layer hides a pattern you should implement yourself whenever you encrypt application payloads, because calling KMS to encrypt every record directly is slow, rate-limited, and size-capped (the encrypt API tops out at 64 KiB of plaintext). The pattern is envelope encryption:

  1. Generate a random Data Encryption Key (DEK) locally — a 256-bit AES key.
  2. Encrypt your data with the DEK locally (fast, unlimited size, your own AES-GCM).
  3. Call KMS to encrypt (wrap) the DEK with a Key Encryption Key (KEK) that never leaves KMS.
  4. Store the wrapped DEK next to the ciphertext. Discard the plaintext DEK from memory.

To decrypt: read the wrapped DEK, call KMS decrypt to unwrap it, decrypt the data locally, drop the DEK again. KMS only ever sees the tiny DEK, never your data.

import os
from google.cloud import kms
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

client = kms.KeyManagementServiceClient()
KEK = "projects/sec-kms-prod/locations/us-central1/keyRings/app-keyring/cryptoKeys/app-kek"

def encrypt(plaintext: bytes, aad: bytes = b"") -> dict:
    dek = AESGCM.generate_key(bit_length=256)          # 1. local DEK
    nonce = os.urandom(12)
    ciphertext = AESGCM(dek).encrypt(nonce, plaintext, aad)  # 2. local encrypt
    wrapped = client.encrypt(                            # 3. wrap DEK in KMS
        request={"name": KEK, "plaintext": dek,
                 "additional_authenticated_data": aad}
    ).ciphertext
    return {"wrapped_dek": wrapped, "nonce": nonce, "ciphertext": ciphertext}

def decrypt(blob: dict, aad: bytes = b"") -> bytes:
    dek = client.decrypt(                                # unwrap DEK in KMS
        request={"name": KEK, "ciphertext": blob["wrapped_dek"],
                 "additional_authenticated_data": aad}
    ).plaintext
    return AESGCM(dek).decrypt(blob["nonce"], blob["ciphertext"], aad)

Two production notes. First, pass Additional Authenticated Data (AAD) — the same value must be supplied on encrypt and decrypt, binding the wrapped DEK to a context (e.g. a tenant ID), so a stolen ciphertext cannot be unwrapped against the wrong record. Second, for hot paths, don’t reach for raw KMS — use Tink, Google’s crypto library, with a KMS-backed KEK. Tink does envelope encryption correctly, caches nothing dangerous, and removes the foot-guns of hand-rolling nonces.
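Step 4 of the pattern says to store the wrapped DEK next to the ciphertext, while the encrypt function above returns a Python dict. One possible way to turn that dict into a single byte string is sketched below. The layout (a 2-byte length prefix, then the wrapped DEK, the 12-byte nonce, and the ciphertext) and the names pack_envelope and unpack_envelope are assumptions for illustration, not a standard format; Tink defines its own.

```python
import struct

NONCE_LEN = 12  # matches os.urandom(12) in the encrypt function above


def pack_envelope(blob: dict) -> bytes:
    """Serialize as: 2-byte wrapped-DEK length | wrapped DEK | nonce | ciphertext."""
    wrapped, nonce, ciphertext = blob["wrapped_dek"], blob["nonce"], blob["ciphertext"]
    if len(nonce) != NONCE_LEN:
        raise ValueError("expected a 12-byte AES-GCM nonce")
    return struct.pack(">H", len(wrapped)) + wrapped + nonce + ciphertext


def unpack_envelope(data: bytes) -> dict:
    """Reverse pack_envelope; raise if the header or nonce is truncated."""
    if len(data) < 2:
        raise ValueError("truncated envelope")
    (wrapped_len,) = struct.unpack_from(">H", data, 0)
    wrapped = data[2:2 + wrapped_len]
    nonce = data[2 + wrapped_len:2 + wrapped_len + NONCE_LEN]
    ciphertext = data[2 + wrapped_len + NONCE_LEN:]
    if len(wrapped) != wrapped_len or len(nonce) != NONCE_LEN:
        raise ValueError("truncated envelope")
    return {"wrapped_dek": wrapped, "nonce": nonce, "ciphertext": ciphertext}
```

The AAD is intentionally not stored in the envelope: it should be reconstructed from context (for example the tenant ID) at decrypt time, otherwise it binds nothing.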

4. Rotation: automatic, manual, and re-encryption reality

Set --rotation-period and KMS automatically generates a new primary version on schedule. This is cheap because it does not re-encrypt anything. New writes use the new primary; existing ciphertext stays wrapped under whichever version created it, and old versions remain enabled for decrypt. Rotation limits the blast radius of a single version’s compromise and satisfies “keys must rotate every N days” controls — it does not, by itself, re-protect old data.
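The rotation semantics described above can be illustrated with a toy in-memory model. It performs no real cryptography and is not the Cloud KMS API; the class ToyKey and its methods are hypothetical, written only to show which version encrypts and which versions can decrypt.

```python
class ToyKey:
    """In-memory stand-in for a CryptoKey; no real cryptography happens here."""

    def __init__(self):
        self.states = {1: "ENABLED"}  # version number -> state
        self.primary = 1

    def rotate(self) -> int:
        """Add a new version and make it primary; older versions are untouched."""
        self.primary = max(self.states) + 1
        self.states[self.primary] = "ENABLED"
        return self.primary

    def encrypt(self, data: str) -> tuple:
        # Real ciphertext records which version produced it; a tuple plays that role.
        return (self.primary, data)

    def decrypt(self, blob: tuple) -> str:
        version, data = blob
        if self.states[version] != "ENABLED":
            raise PermissionError(f"version {version} is {self.states[version]}")
        return data


key = ToyKey()
old = key.encrypt("written before rotation")
key.rotate()
new = key.encrypt("written after rotation")
```

After the rotation, old is still tied to version 1 and new to version 2, and both decrypt while both versions are enabled. Setting key.states[1] to "DISABLED" or "DESTROYED" makes decrypt(old) fail while new keeps working, which is the behaviour the rest of this section plans around.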

# Inspect and force a manual rotation
gcloud kms keys versions list --location="$LOCATION" \
  --keyring=app-keyring --key=gcs-cmek

gcloud kms keys versions create --location="$LOCATION" \
  --keyring=app-keyring --key=gcs-cmek --primary   # new primary now

# Adjust the schedule
gcloud kms keys update gcs-cmek --location="$LOCATION" \
  --keyring=app-keyring --rotation-period=30d \
  --next-rotation-time="$(date -u -d '+30 days' +%Y-%m-%dT%H:%M:%SZ)"

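If a control genuinely requires that old data be re-wrapped under the new version (true key compromise, or a hard “no data older than the current key” mandate), you must actively re-encrypt: read each object or wrapped DEK while the old version is still enabled, then write it back so it is wrapped under the current primary.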
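For application-level envelope encryption, re-encryption only has to touch the small wrapped DEK, not the bulk data. A minimal sketch, assuming a client object with the same encrypt and decrypt request shapes as the KeyManagementServiceClient used in section 3; the function name rewrap_dek is hypothetical, and the client is passed in so the logic can be exercised with a stand-in.

```python
def rewrap_dek(client, key_name: str, wrapped_dek: bytes, aad: bytes = b"") -> tuple:
    """Unwrap a DEK with whichever version wrapped it, then re-wrap under the primary.

    Returns (new_wrapped_dek, version_name). The bulk ciphertext is untouched
    because the DEK itself does not change.
    """
    dek = client.decrypt(
        request={"name": key_name, "ciphertext": wrapped_dek,
                 "additional_authenticated_data": aad}
    ).plaintext
    response = client.encrypt(
        request={"name": key_name, "plaintext": dek,
                 "additional_authenticated_data": aad}
    )
    # response.name identifies the key version that performed the encryption,
    # which is worth recording so you can later prove nothing uses an old version.
    return response.ciphertext, response.name
```

This re-wraps the DEK but keeps the same DEK. If the DEK itself may have been exposed, the data must be decrypted and re-encrypted under a fresh DEK instead.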

Do not destroy old versions just because you rotated. Any object still wrapped under version 3 becomes permanently unreadable the moment version 3 is destroyed. Disable, observe crypto_key_version usage in logs for your full retention window, then schedule destruction.

5. Cloud HSM and the FIPS 140-2 Level 3 boundary

Software-protected keys are FIPS 140-2 Level 1. Many regulated workloads require Level 3 — tamper-evident, tamper-responsive hardware with identity-based authentication. Cloud HSM gives you exactly that: keys with --protection-level=hsm are generated and used inside Google-operated, FIPS 140-2 Level 3 validated HSMs, and the private material provably never leaves the hardware in plaintext. The API surface is identical to software keys — same encrypt/decrypt, same CMEK wiring — only the protection level and (modestly higher) price change.

gcloud kms keys create payments-hsm \
  --project="$PROJECT" --location="$LOCATION" --keyring=app-keyring \
  --purpose=encryption --protection-level=hsm --rotation-period=90d \
  --next-rotation-time="$(date -u -d '+90 days' +%Y-%m-%dT%H:%M:%SZ)"

Cloud HSM also supports attestation: each version can return a signed statement from the HSM proving the key was created in genuine Google HSM hardware, which auditors increasingly ask for. Two constraints to plan around: HSM keys are not offered in the global location, so pick the region or multi-region deliberately and confirm HSM support for it in the Cloud KMS locations list; and HSM has its own cryptographic-operation quotas, so a high-QPS envelope workload should cache unwrapped DEKs for a short time rather than calling the HSM per request.

# Retrieve the signed attestation for an HSM key version
gcloud kms keys versions describe 1 \
  --location="$LOCATION" --keyring=app-keyring --key=payments-hsm \
  --attestation-file=attestation.dat
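The DEK caching suggested above for high-QPS workloads could look roughly like the following. This is a sketch under stated assumptions: the class name DekCache, the 300-second default TTL, and the injected unwrap callable (which would wrap the KMS decrypt call) are all illustrative choices, not a recommended library.

```python
import time


class DekCache:
    """Keep unwrapped DEKs in memory for a short TTL to cut HSM calls."""

    def __init__(self, unwrap, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._unwrap = unwrap      # callable: wrapped DEK bytes -> plaintext DEK bytes
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # wrapped DEK -> (expires_at, plaintext DEK)

    def get(self, wrapped_dek: bytes) -> bytes:
        now = self._clock()
        hit = self._entries.get(wrapped_dek)
        if hit is not None and hit[0] > now:
            return hit[1]
        dek = self._unwrap(wrapped_dek)  # the only path that reaches KMS / the HSM
        self._entries[wrapped_dek] = (now + self._ttl, dek)
        return dek
```

The trade-off is explicit: a cached DEK stays usable until its TTL expires even if the key version has just been disabled, so the TTL is also the worst-case delay of your kill switch. This sketch never evicts expired entries; a real cache would bound its size.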

6. External Key Manager (EKM) and EKM via VPC for hold-your-own-key

For “hold-your-own-key” / key-externalization mandates — where the organization (or its regulator) insists the key material live outside Google entirely, in a third-party manager like Fortanix, Thales, or Equinix SmartKey — use External Key Manager. With --protection-level=external, the key material stays in your external HSM/KMS; Cloud KMS holds only a reference (a key URI) and proxies crypto operations out to it. Pull your key from the external manager and Google instantly loses the ability to decrypt: that is the entire value proposition, and the entire risk (your external manager is now a hard availability dependency for your data plane).

gcloud kms keys create ekm-key \
  --project="$PROJECT" --location="$LOCATION" --keyring=app-keyring \
  --purpose=encryption --protection-level=external \
  --default-algorithm=external-symmetric-encryption \
  --skip-initial-version-creation

gcloud kms keys versions create \
  --location="$LOCATION" --keyring=app-keyring --key=ekm-key \
  --external-key-uri="https://my-ekm.example.com/v0/keys/abc-123" \
  --primary

The original EKM reached the external manager over the public internet via HTTPS, which many security teams will not accept. EKM via VPC removes that: Cloud KMS connects to your external manager over a private path through your VPC (no public exposure of the external endpoint). You first create an ekmConnection pointing at a service-attachment or hostname reachable in your VPC, then bind key versions to it:

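gcloud kms ekm-connections create ekm-vpc-conn \
  --project="$PROJECT" --location="$LOCATION" \
  --service-resolvers-from-file=resolvers.yaml

gcloud kms keys create ekm-vpc-key \
  --project="$PROJECT" --location="$LOCATION" --keyring=app-keyring \
  --purpose=encryption --protection-level=external-vpc \
  --default-algorithm=external-symmetric-encryption --skip-initial-version-creation \
  --crypto-key-backend="projects/$PROJECT/locations/$LOCATION/ekmConnections/ekm-vpc-conn"

gcloud kms keys versions create \
  --location="$LOCATION" --keyring=app-keyring --key=ekm-vpc-key \
  --ekm-connection-key-path="/keys/abc-123" --primary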

Latency and availability are real here. Every CMEK read on an EKM-backed resource is a network round-trip to your external manager. Size its HA accordingly, and prefer EKM for the keys that must be externalized (the crown-jewels dataset), not blanket across every bucket.
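To make "size its HA accordingly" concrete: when every read depends on both Cloud KMS and the external manager, their availabilities multiply. The arithmetic is sketched below; the two availability figures are hypothetical inputs chosen for illustration, not published SLAs.

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected unavailable minutes over a period of the given length."""
    return (1.0 - availability) * days * 24 * 60


# Hypothetical figures: 99.95% for the cloud side, 99.9% for the external manager.
combined = serial_availability(0.9995, 0.999)
monthly_downtime = downtime_minutes(combined)
```

With these inputs the combined figure is lower than either component alone, which is the point: the external manager's availability becomes a ceiling on the availability of every resource bound to the key.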

7. IAM separation of duties and key destruction safeguards

The whole point of CMEK collapses if one person can both destroy the key and read the data. Enforce three distinct roles, granted at the key ring level, never bundling them on one identity:

Role | Predefined role | Can do | Must NOT also have
Key admin | roles/cloudkms.admin | create keys, set rotation, schedule destruction | data-reader access to protected resources
Crypto operator (services) | roles/cloudkms.cryptoKeyEncrypterDecrypter | encrypt/decrypt (the service agents) | admin / destroy
Auditor | roles/cloudkms.viewer | read key metadata, no crypto, no admin | any write

# Key admins (a small group) — admin only, no decrypt
gcloud kms keyrings add-iam-policy-binding app-keyring \
  --project="$PROJECT" --location="$LOCATION" \
  --member="group:kms-admins@example.com" \
  --role="roles/cloudkms.admin"
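A separation-of-duties rule is only useful if something checks it. The sketch below scans the bindings list of an IAM policy (the shape returned by gcloud kms keyrings get-iam-policy with --format=json) for members holding both the admin and the encrypt/decrypt role. The function name sod_violations and the sample members are hypothetical.

```python
ADMIN_ROLE = "roles/cloudkms.admin"
CRYPTO_ROLE = "roles/cloudkms.cryptoKeyEncrypterDecrypter"


def sod_violations(bindings: list) -> set:
    """Return members that hold both the admin role and the encrypt/decrypt role."""
    members_by_role = {}
    for binding in bindings:
        members_by_role.setdefault(binding["role"], set()).update(binding["members"])
    return members_by_role.get(ADMIN_ROLE, set()) & members_by_role.get(CRYPTO_ROLE, set())


# Hypothetical policy for illustration.
example_bindings = [
    {"role": ADMIN_ROLE, "members": ["group:kms-admins@example.com", "user:alice@example.com"]},
    {"role": CRYPTO_ROLE, "members": ["serviceAccount:app@example.iam.gserviceaccount.com",
                                      "user:alice@example.com"]},
]
violations = sod_violations(example_bindings)
```

This only compares direct bindings on one policy. It does not expand group membership or account for roles inherited from the project, folder, or organization, so treat an empty result as necessary rather than sufficient.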

Two safeguards stop accidental or malicious destruction:

Destruction is a scheduled, reversible delay. Destroying a version moves it to DESTROY_SCHEDULED for a configurable period (from 24 hours up to 120 days, fixed per key at creation) before the material is actually gone. During that window you can restore it. Set this window deliberately rather than relying on whatever default your keys were created with (older keys may carry 24 hours; current documentation lists 30 days). A 24-hour window is too short to catch a bad change over a long weekend:

# destroyScheduledDuration is set per key at creation and cannot be changed later, e.g. 30 days:
gcloud kms keys create critical-cmek \
  --project="$PROJECT" --location="$LOCATION" --keyring=app-keyring \
  --purpose=encryption --destroy-scheduled-duration=2592000s

# If someone schedules a destroy in error, restore within the window:
gcloud kms keys versions restore 5 \
  --location="$LOCATION" --keyring=app-keyring --key=critical-cmek
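The 2592000s value used above is simply 30 days expressed in seconds. A small helper can do the conversion and refuse values outside the allowed range; the function name destroy_duration is hypothetical, and the bounds (1 to 120 days) are taken from the limits described in this section.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def destroy_duration(days: int) -> str:
    """Convert a whole number of days to the seconds string used by the flag."""
    if not 1 <= days <= 120:
        raise ValueError("destroy-scheduled duration must be between 1 and 120 days")
    return f"{days * SECONDS_PER_DAY}s"


thirty_days = destroy_duration(30)
```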

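Org Policy adds a backstop: constraints such as constraints/cloudkms.minimumDestroyScheduledDuration and constraints/cloudkms.disableBeforeDestroy (check that they are available in your organization) can enforce a minimum window and require a disable before any destroy. Cloud KMS Autokey (where available) can centralize key creation so app teams never hold cloudkms.admin at all.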

8. Auditing key usage and handling disabled-key incidents

Cloud KMS data-access audit logs are not on by default — and without them you are blind to who decrypted what. Turn them on, then alert on the events that matter.

# In the project/org IAM policy auditConfigs, enable KMS data-access logs
auditConfigs:
- service: cloudkms.googleapis.com
  auditLogConfigs:
  - logType: DATA_READ
  - logType: DATA_WRITE
  - logType: ADMIN_READ
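Since these logs are off by default, it is worth checking the setting rather than assuming it. The sketch below inspects an IAM policy that has already been loaded into a dict (for example from gcloud projects get-iam-policy with --format=json) and reports whether KMS data-access logging is on. The function name kms_data_access_enabled is hypothetical.

```python
KMS_SERVICE = "cloudkms.googleapis.com"
REQUIRED_LOG_TYPES = {"DATA_READ", "DATA_WRITE"}


def kms_data_access_enabled(policy: dict) -> bool:
    """True if DATA_READ and DATA_WRITE are configured for Cloud KMS (or allServices)."""
    enabled = set()
    for config in policy.get("auditConfigs", []):
        if config.get("service") in (KMS_SERVICE, "allServices"):
            for log_config in config.get("auditLogConfigs", []):
                enabled.add(log_config.get("logType"))
    return REQUIRED_LOG_TYPES <= enabled


example_policy = {
    "auditConfigs": [
        {"service": KMS_SERVICE,
         "auditLogConfigs": [{"logType": "DATA_READ"}, {"logType": "DATA_WRITE"},
                             {"logType": "ADMIN_READ"}]}
    ]
}
ok = kms_data_access_enabled(example_policy)
```

This check ignores exemptedMembers, so an identity exempted from logging would not be detected, and it looks at one policy only; audit configuration can also be set on a parent folder or organization.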

Every crypto op then lands in Cloud Logging, tagged with the exact key version used:

resource.type="cloudkms_cryptokey"
protoPayload.serviceName="cloudkms.googleapis.com"
protoPayload.methodName="Decrypt"
protoPayload.resourceName=~"cryptoKeys/payments-hsm"

The single highest-value alert is on administrative state changes — disable and destroy:

resource.type="cloudkms_cryptokeyversion"
protoPayload.methodName=("DestroyCryptoKeyVersion" OR
  "UpdateCryptoKeyVersion")

Handling a disabled-key incident. When a key version flips to DISABLED, every dependent resource starts failing reads — GCS returns 403s on objects wrapped by that version, BigQuery queries error, a Cloud SQL or Compute instance whose disk key is disabled will eventually fail to start. The recovery is fast precisely because disable is reversible:

# Triage: which version, what state, who touched it
gcloud kms keys versions list --location="$LOCATION" \
  --keyring=app-keyring --key=payments-hsm \
  --format='table(name.scope(cryptoKeyVersions), state)'

# Re-enable to restore access immediately
gcloud kms keys versions enable 4 \
  --location="$LOCATION" --keyring=app-keyring --key=payments-hsm

Disabling a key is the fastest “logical shred” you have — flip it and the data is unreadable everywhere instantly, without touching the data. That makes it a deliberate incident-response tool (kill access to a breached dataset in one command) and a self-inflicted outage waiting to happen. Alert on it, document who is allowed to do it, and never wire it into automation that can fire by accident.

Enterprise scenario

A payments platform team running a regulated tokenization service had a contractual hold-your-own-key requirement: their bank partner mandated that the bank, not Google, control the key protecting the cardholder dataset, and that key material never reside in Google’s infrastructure. The naive read was “use EKM” — but their first EKM design reached the external Fortanix cluster over the public internet, and their own VPC Service Controls perimeter plus the partner’s security review both rejected any public egress from the data plane.

The fix was EKM via VPC combined with strict separation of duties. They stood up an ekmConnection so Cloud KMS reached the external manager privately through their Shared VPC host project — no public endpoint, all traffic inside the perimeter. The cardholder BigQuery dataset and the GCS bucket holding raw card files were bound to the EKM-backed key; the bank held the actual material in Fortanix and could revoke it unilaterally. Crucially, no human at the platform team held both cloudkms.admin and BigQuery data-reader on that dataset, and a destroy-scheduled duration of 30 days plus an alert on every Disable/Destroy event meant an accidental or malicious key kill could be caught and restored long before material was lost.

# The load-bearing binding: dataset CMEK pinned to the EKM-via-VPC key,
# whose material the bank controls externally.
bq update --default_kms_key=\
"projects/pay-prod/locations/us-central1/keyRings/pci-ring/cryptoKeys/ekm-card-key" \
  pay-prod:cardholder_ds

The result satisfied the audit: the bank could prove sole control of the key, GCP never saw the material, the path was private, and a single disabled key version became the documented, alarmed kill-switch for the entire cardholder dataset rather than a silent outage.

Verify

Checklist
