A retail-analytics company runs Databricks workspaces in three places at once — production lakehouse on AWS, a marketing data science team on Azure, and a recently-acquired subsidiary still on GCP. Their data lands in S3, ADLS Gen2, and GCS buckets owned by different cloud accounts, and right now every job authenticates with a long-lived access key pasted into a cluster’s Spark config. The security team found one of those keys in a notebook checked into Git (the same class of mistake that has bitten this org before), and the mandate landed the next morning: no static cloud keys on any cluster, every storage path governed centrally, and an audit trail of who read which prefix. Unity Catalog is the answer — a metastore-level governance layer where a storage credential holds the cloud IAM principal Databricks assumes, and an external location binds that credential to a specific bucket path with grantable permissions. This guide builds that, concretely, across all three clouds, with the keys gone for good.
Prerequisites
- A Unity Catalog metastore already created and attached to each workspace (one metastore per region; this guide assumes it exists). You are a metastore admin or hold CREATE STORAGE CREDENTIAL / CREATE EXTERNAL LOCATION.
- Databricks CLI v0.205+ (the new unified CLI) authenticated with an OAuth U2M or service-principal token: databricks auth login --host https://<workspace>.cloud.databricks.com
- Cloud admin rights to create roles/identities: AWS IAM (role + trust policy), Azure (an Access Connector for Databricks managed identity + RBAC), GCP (a service account + bucket IAM).
- Terraform 1.6+ with the databricks/databricks provider ~> 1.40 and the relevant cloud provider; HashiCorp Vault reachable if you choose to source bootstrap secrets from it.
- Workforce SSO already federated Okta → Microsoft Entra ID (Okta as the IdP, Entra issuing the tokens Azure RBAC consumes); SCIM-provisioned groups synced into Databricks so grants target real teams.
- aws, az, and gcloud CLIs authenticated for whichever clouds you target.
Target topology
The governance model is a three-link chain, identical in shape on every cloud:
- A cloud IAM principal (an AWS IAM role, an Azure managed identity via an Access Connector, or a GCP service account) is granted read/write on a specific bucket — and only that bucket — by the cloud’s own IAM.
- A Unity Catalog storage credential wraps that principal. It is the secret-free handle: Databricks assumes the role/identity at query time using a short-lived token. No access key ever exists on the cluster.
- A Unity Catalog external location binds one credential to one cloud://bucket/prefix URL and becomes the object you GRANT on. Tables, volumes, and COPY INTO paths under that prefix inherit governance from it.
Around that chain sit the operational tools: Terraform (with Jenkins or GitHub Actions running plan/apply, and Argo CD syncing the workspace-config repo) provisions every credential and location as code; HashiCorp Vault holds the few bootstrap secrets and the Databricks PAT used by CI; Okta/Entra ID authenticate the humans and service principals; Wiz (and Wiz Code scanning the IaC pre-merge) continuously checks that no bucket drifted to public and that no credential over-grants; Dynatrace/Datadog ingest the Unity Catalog audit and billable-usage logs for access dashboards; ServiceNow carries the change request that approves a new external location into production.
1. Create the cloud IAM principal (per cloud)
Start on the cloud side. The storage credential cannot validate until the principal exists and trusts Databricks.
AWS: IAM role with the Databricks trust policy. Databricks assumes this role, so the trust policy must name the Databricks Unity Catalog master role as the principal and require an external ID. That external ID is only known once the storage credential exists, so first create the role with a placeholder trust policy (the /tmp/uc-trust-bootstrap.json file the create-role call reads), then tighten it in step 2 once you have the credential's external ID.
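# Permission policy: scope to the exact bucket, nothing wider
cat > /tmp/uc-s3-policy.json <<'JSON'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject","s3:PutObject","s3:DeleteObject","s3:ListBucket","s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::acme-lakehouse-prod",
        "arn:aws:s3:::acme-lakehouse-prod/*"
      ]
    }
  ]
}
JSON
# Bootstrap trust policy: "0000" is a placeholder external ID, replaced in step 2
cat > /tmp/uc-trust-bootstrap.json <<'JSON'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL" },
    "Action": "sts:AssumeRole",
    "Condition": { "StringEquals": { "sts:ExternalId": "0000" } }
  }]
}
JSON
# Create the role with the placeholder trust, then attach the bucket-scoped policy
aws iam create-role \
  --role-name databricks-uc-lakehouse-prod \
  --assume-role-policy-document file:///tmp/uc-trust-bootstrap.json
aws iam put-role-policy \
  --role-name databricks-uc-lakehouse-prod \
  --policy-name uc-s3-access \
  --policy-document file:///tmp/uc-s3-policy.json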
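If you generate the permission policy per bucket rather than hand-editing JSON, a small helper can also assert that nothing wider than the target bucket slipped in. The following is a minimal sketch, not part of the Databricks or AWS tooling; the function names s3_access_policy and is_bucket_scoped are hypothetical, and the action list simply mirrors the policy above.

```python
import json

# Actions mirror the hand-written policy in this guide.
S3_ACTIONS = [
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject",
    "s3:ListBucket",
    "s3:GetBucketLocation",
]


def s3_access_policy(bucket: str) -> dict:
    """Build an IAM policy document scoped to one bucket and its objects."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(S3_ACTIONS),
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


def is_bucket_scoped(policy: dict, bucket: str) -> bool:
    """Return True only if every Resource is the bucket or an object inside it."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    for statement in policy["Statement"]:
        resources = statement["Resource"]
        if isinstance(resources, str):
            resources = [resources]
        for resource in resources:
            # Compare on the "/" boundary so "bucket-other" does not pass as "bucket".
            if resource != bucket_arn and not resource.startswith(bucket_arn + "/"):
                return False
    return True


if __name__ == "__main__":
    policy = s3_access_policy("acme-lakehouse-prod")
    assert is_bucket_scoped(policy, "acme-lakehouse-prod")
    print(json.dumps(policy, indent=2))
```

Writing the printed JSON to /tmp/uc-s3-policy.json gives the same document as the heredoc, with the scope check run first.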
Azure: Access Connector for Databricks (managed identity). This first-party resource is the recommended way to give Unity Catalog an Azure identity, because it involves no client secret to rotate. Create it, then grant its managed identity Storage Blob Data Contributor on the storage account.
az databricks access-connector create \
--resource-group rg-data-prod \
--name ac-uc-lakehouse-prod \
--location eastus \
--identity-type SystemAssigned
PRINCIPAL_ID=$(az databricks access-connector show \
-g rg-data-prod -n ac-uc-lakehouse-prod \
--query identity.principalId -o tsv)
az role assignment create \
--assignee-object-id "$PRINCIPAL_ID" \
--assignee-principal-type ServicePrincipal \
--role "Storage Blob Data Contributor" \
--scope "/subscriptions/<sub-id>/resourceGroups/rg-data-prod/providers/Microsoft.Storage/storageAccounts/acmelakehouseprod"
GCP — service account, granted at credential-creation time. On GCP, Unity Catalog generates the service account for you when you create the credential (step 2); here you just confirm the target bucket and reserve the IAM binding you will apply once you have the generated SA email.
gsutil ls -b gs://acme-lakehouse-prod # confirm the bucket exists and is the right one
# (IAM binding applied in step 2, after Databricks returns the generated SA email)
2. Create the storage credential in Unity Catalog
The storage credential is the metastore-level object that wraps the principal. Create it once per cloud principal; many external locations can share one credential.
AWS — reference the role ARN. Databricks returns an external ID; use it to finalize the trust policy.
databricks storage-credentials create --json '{
"name": "sc-lakehouse-aws-prod",
"aws_iam_role": { "role_arn": "arn:aws:iam::111122223333:role/databricks-uc-lakehouse-prod" },
"comment": "Prod lakehouse S3 access",
"skip_validation": false
}'
Read back the external_id Databricks assigned, then replace the bootstrap trust with the real one and re-validate:
EXT_ID=$(databricks storage-credentials get sc-lakehouse-aws-prod \
| python3 -c 'import json,sys;print(json.load(sys.stdin)["aws_iam_role"]["external_id"])')
cat > /tmp/uc-trust.json <<JSON
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": { "AWS": "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL" },
"Action": "sts:AssumeRole",
"Condition": { "StringEquals": { "sts:ExternalId": "${EXT_ID}" } }
}]
}
JSON
aws iam update-assume-role-policy \
--role-name databricks-uc-lakehouse-prod \
--policy-document file:///tmp/uc-trust.json
databricks storage-credentials validate --storage-credential-name sc-lakehouse-aws-prod
The principal ARN 414351767826:...UCMasterRole... is Databricks' Unity Catalog master role. Copy the exact value the Databricks UI shows for your deployment rather than retyping it; the part of the trust policy that is unique to your credential is the external ID.
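Because a wrong principal or a missing external-ID condition only surfaces later as a failed validate call, it can help to lint the trust document before uploading it. This is a rough sketch under stated assumptions: build_trust_policy and trust_policy_problems are hypothetical helper names, and the default principal is simply the ARN quoted in this guide, which you should replace with the value your UI shows.

```python
# ARN as quoted in this guide; substitute the value shown in your Databricks UI.
UC_MASTER_ROLE_ARN = (
    "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
)


def build_trust_policy(external_id: str, principal_arn: str = UC_MASTER_ROLE_ARN) -> dict:
    """Build the trust policy that lets Databricks assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


def trust_policy_problems(
    policy: dict, external_id: str, principal_arn: str = UC_MASTER_ROLE_ARN
) -> list[str]:
    """Return human-readable problems; an empty list means the policy looks right."""
    statements = [
        s for s in policy.get("Statement", []) if s.get("Action") == "sts:AssumeRole"
    ]
    if not statements:
        return ["no sts:AssumeRole statement"]
    problems = []
    for statement in statements:
        principal = statement.get("Principal")
        if isinstance(principal, dict):
            principal = principal.get("AWS")
        if principal != principal_arn:
            problems.append(f"unexpected principal: {principal}")
        actual = (
            statement.get("Condition", {}).get("StringEquals", {}).get("sts:ExternalId")
        )
        if actual is None:
            problems.append("missing sts:ExternalId condition")
        elif actual != external_id:
            problems.append("sts:ExternalId does not match the credential")
    return problems


if __name__ == "__main__":
    # "example-external-id" stands in for the value read back from the credential.
    policy = build_trust_policy("example-external-id")
    print(trust_policy_problems(policy, "example-external-id"))
```

Running the check against the placeholder bootstrap policy reports the external-ID mismatch, which is a quick way to confirm the role was tightened after step 2.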
Azure — reference the Access Connector resource ID:
databricks storage-credentials create --json '{
"name": "sc-lakehouse-azure-prod",
"azure_managed_identity": {
"access_connector_id": "/subscriptions/<sub-id>/resourceGroups/rg-data-prod/providers/Microsoft.Databricks/accessConnectors/ac-uc-lakehouse-prod"
},
"comment": "Prod lakehouse ADLS Gen2 access"
}'
GCP — create the credential, then bind the generated service account on the bucket:
databricks storage-credentials create --json '{
"name": "sc-lakehouse-gcp-prod",
"databricks_gcp_service_account": {},
"comment": "Prod lakehouse GCS access"
}'
GCP_SA=$(databricks storage-credentials get sc-lakehouse-gcp-prod \
| python3 -c 'import json,sys;print(json.load(sys.stdin)["databricks_gcp_service_account"]["email"])')
gcloud storage buckets add-iam-policy-binding gs://acme-lakehouse-prod \
--member="serviceAccount:${GCP_SA}" \
--role="roles/storage.objectAdmin"
3. Create the external location
The external location binds a credential to one storage URL and is the object you grant on. Use distinct prefixes per domain (/bronze, /silver, /gold) so you can grant teams different paths under the same bucket.
# AWS
databricks external-locations create --json '{
"name": "ext-lakehouse-aws-silver",
"url": "s3://acme-lakehouse-prod/silver",
"credential_name": "sc-lakehouse-aws-prod",
"comment": "Silver-layer curated tables (AWS)"
}'
# Azure (abfss scheme, container@account)
databricks external-locations create --json '{
"name": "ext-lakehouse-azure-silver",
"url": "abfss://lakehouse@acmelakehouseprod.dfs.core.windows.net/silver",
"credential_name": "sc-lakehouse-azure-prod"
}'
# GCP
databricks external-locations create --json '{
"name": "ext-lakehouse-gcp-silver",
"url": "gs://acme-lakehouse-prod/silver",
"credential_name": "sc-lakehouse-gcp-prod"
}'
Databricks rejects overlapping URLs — you cannot have two external locations where one path is a prefix of the other — which is exactly the guardrail that stops a careless re-grant of a parent path.
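You can catch an overlap while planning the prefix hierarchy instead of at create time. The sketch below assumes overlap means "same scheme and bucket, and one path is a segment-wise prefix of the other"; that is my reading of the rule above, not Databricks' exact implementation, and the function names are hypothetical.

```python
from itertools import combinations
from urllib.parse import urlsplit


def _parts(url: str) -> tuple[str, str, list[str]]:
    """Split a storage URL into scheme, bucket/authority, and path segments."""
    split = urlsplit(url)
    segments = [segment for segment in split.path.split("/") if segment]
    return split.scheme, split.netloc, segments


def urls_overlap(a: str, b: str) -> bool:
    """True when both URLs share a bucket and one path is a prefix of the other."""
    scheme_a, bucket_a, path_a = _parts(a)
    scheme_b, bucket_b, path_b = _parts(b)
    if (scheme_a, bucket_a) != (scheme_b, bucket_b):
        return False
    shorter = min(len(path_a), len(path_b))
    # Compare whole segments so /silver and /silver2 are treated as distinct.
    return path_a[:shorter] == path_b[:shorter]


def find_overlaps(urls: list[str]) -> list[tuple[str, str]]:
    """Return every overlapping pair from a planned list of location URLs."""
    return [(a, b) for a, b in combinations(urls, 2) if urls_overlap(a, b)]


if __name__ == "__main__":
    planned = [
        "s3://acme-lakehouse-prod/silver",
        "s3://acme-lakehouse-prod/gold",
        "s3://acme-lakehouse-prod/",
    ]
    for a, b in find_overlaps(planned):
        print(f"overlap: {a} <-> {b}")
```

Comparing path segments rather than raw strings is deliberate: a plain startswith check would wrongly flag /silver against /silver2.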
4. Grant least-privilege access
Grants are where governance becomes real. Target the SCIM-synced groups that came from Okta/Entra ID, never individuals. READ FILES / WRITE FILES govern raw path access; CREATE EXTERNAL TABLE and CREATE EXTERNAL VOLUME let teams register objects under the location.
-- Engineers write silver; analysts read it
GRANT READ FILES, WRITE FILES, CREATE EXTERNAL TABLE
ON EXTERNAL LOCATION `ext-lakehouse-aws-silver`
TO `data-engineers`;
GRANT READ FILES
ON EXTERNAL LOCATION `ext-lakehouse-aws-silver`
TO `marketing-analysts`;
-- Never grant on the storage credential itself for data access;
-- CREATE EXTERNAL LOCATION on the *credential* is an admin-only delegation
GRANT CREATE EXTERNAL LOCATION
ON STORAGE CREDENTIAL `sc-lakehouse-aws-prod`
TO `platform-admins`;
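When the group-to-privilege mapping lives in a reviewed file, the GRANT statements can be rendered from it instead of typed by hand. A minimal sketch follows, assuming the four external-location privileges named in this section are the only ones you want to allow; grant_statements is a hypothetical helper, not a Databricks API.

```python
# Privileges named in this section; anything else is rejected as a likely typo.
ALLOWED_PRIVILEGES = {
    "READ FILES",
    "WRITE FILES",
    "CREATE EXTERNAL TABLE",
    "CREATE EXTERNAL VOLUME",
}


def grant_statements(location: str, grants: dict[str, list[str]]) -> list[str]:
    """Render one GRANT statement per group for a single external location."""
    statements = []
    for group, privileges in sorted(grants.items()):
        unknown = set(privileges) - ALLOWED_PRIVILEGES
        if unknown:
            raise ValueError(f"unknown privileges for {group}: {sorted(unknown)}")
        statements.append(
            f"GRANT {', '.join(privileges)} "
            f"ON EXTERNAL LOCATION `{location}` TO `{group}`;"
        )
    return statements


if __name__ == "__main__":
    mapping = {
        "data-engineers": ["READ FILES", "WRITE FILES", "CREATE EXTERNAL TABLE"],
        "marketing-analysts": ["READ FILES"],
    }
    for statement in grant_statements("ext-lakehouse-aws-silver", mapping):
        print(statement)
```

Rejecting unknown privilege names up front keeps a misspelled grant from reaching the workspace as a confusing SQL error.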
5. Codify everything in Terraform
Manual CLI is for the first proof; production lives in code. Jenkins (or GitHub Actions) runs terraform plan on every PR and apply on merge to main; Argo CD keeps a separate app-of-apps reconciling the workspace-config repo; Wiz Code scans the plan for over-broad IAM before it can merge. Pull the Databricks PAT and any cloud bootstrap secret from HashiCorp Vault rather than CI variables.
data "vault_generic_secret" "dbx" {
path = "secret/data/databricks/prod"
}
provider "databricks" {
host = var.workspace_url
token = data.vault_generic_secret.dbx.data["pat"]
}
resource "databricks_storage_credential" "aws_prod" {
name = "sc-lakehouse-aws-prod"
aws_iam_role { role_arn = aws_iam_role.uc_lakehouse.arn }
comment = "Prod lakehouse S3 access"
}
resource "databricks_external_location" "aws_silver" {
name = "ext-lakehouse-aws-silver"
url = "s3://acme-lakehouse-prod/silver"
credential_name = databricks_storage_credential.aws_prod.name
comment = "Silver-layer curated tables (AWS)"
}
resource "databricks_grants" "aws_silver" {
external_location = databricks_external_location.aws_silver.id
grant {
principal = "data-engineers"
privileges = ["READ_FILES", "WRITE_FILES", "CREATE_EXTERNAL_TABLE"]
}
grant {
principal = "marketing-analysts"
privileges = ["READ_FILES"]
}
}
A new external location flows: PR opened, Wiz Code + terraform plan post to the PR, a ServiceNow change request links the PR for the prod approval gate, merge triggers apply, Argo CD confirms drift-free.
6. Wire identity and observability
The whole point was to remove static keys and gain an audit trail — finish that.
- Identity: service principals that run jobs authenticate via OAuth (M2M) tokens minted through the Entra-federated trust, not PATs. Sync data-engineers, marketing-analysts, and platform-admins from Okta → Entra ID via SCIM so a leaver removed in Okta loses Databricks access automatically.
- Audit & monitoring: enable the Unity Catalog audit log and system tables (system.access.audit, system.access.table_lineage). Ship them to Dynatrace or Datadog so you have a dashboard of "who read which external location," with alerts on WRITE FILES to gold or any credential-creation event. Wiz independently watches the underlying buckets for public-exposure drift; CrowdStrike Falcon runs on any self-managed driver/worker images and the bootstrap virtual appliances (e.g., a NAT or proxy VM fronting on-prem connectivity) for runtime threat detection. If the org fronts notebook/portal access through a CDN, Akamai terminates TLS and applies WAF at the edge. (A separate Moodle LMS hosts the internal enablement course engineers take before they get platform-admins.)
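The two alert rules named above (writes to gold, and any credential creation) can be expressed as a small filter over exported audit rows before you translate them into Dynatrace or Datadog monitors. Treat this as a sketch built on assumptions: the action names, the url and operation request parameters, and the sample rows are all illustrative guesses at the shape of system.access.audit records, so confirm them against events your own workspace emits.

```python
from urllib.parse import urlsplit

# Assumed action names for credential changes; verify against your audit data.
CREDENTIAL_ACTIONS = {"createStorageCredential", "updateStorageCredential"}


def alert_reasons(event: dict) -> list[str]:
    """Return the reasons an audit event should page someone (empty if none)."""
    reasons = []
    action = event.get("action_name", "")
    params = event.get("request_params") or {}
    if action in CREDENTIAL_ACTIONS:
        reasons.append(f"credential change: {action}")
    # Assumption: path-access events carry a target "url" and an "operation".
    url = params.get("url", "")
    first_segment = urlsplit(url).path.strip("/").split("/")[0]
    is_write = "WRITE" in params.get("operation", "").upper()
    if is_write and first_segment == "gold":
        reasons.append(f"write to gold layer: {url}")
    return reasons


if __name__ == "__main__":
    # Hypothetical rows, shaped like a flattened audit export.
    sample = [
        {"action_name": "createStorageCredential", "request_params": {}},
        {
            "action_name": "generateTemporaryPathCredential",
            "request_params": {
                "url": "s3://acme-lakehouse-prod/gold/revenue",
                "operation": "PATH_READ_WRITE",
            },
        },
    ]
    for row in sample:
        print(row["action_name"], alert_reasons(row))
```

Matching on the first path segment rather than a substring avoids paging on a silver table that merely has "gold" somewhere in its name.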
Validation
Confirm the chain end to end before handing it to teams.
# 1. Credential and location validate cleanly (no IAM/trust errors)
databricks storage-credentials validate --storage-credential-name sc-lakehouse-aws-prod
databricks storage-credentials validate \
--external-location-name ext-lakehouse-aws-silver
# 2. List shows expected URL + credential binding
databricks external-locations get ext-lakehouse-aws-silver
Then prove real data access from a cluster as a member of data-engineers, and prove the negative — that a static key is no longer needed and an ungranted team is blocked:
-- Should succeed for data-engineers:
LIST 's3://acme-lakehouse-prod/silver/';
CREATE TABLE silver.orders USING DELTA
LOCATION 's3://acme-lakehouse-prod/silver/orders';
-- Run as marketing-analysts: WRITE must fail with PERMISSION_DENIED
-- COPY INTO ... -> expect "User does not have WRITE FILES on External Location"
Confirm in system.access.audit that the getCredential / LIST events for your user appear with the external-location name — that is the audit trail the mandate required.
Rollback / teardown
Order matters: you cannot delete a credential while a location references it, nor a location while tables sit under it. Unwind inside-out.
# 1. Revoke grants (so nothing new can be created under the path)
databricks grants update external_location ext-lakehouse-aws-silver \
--json '{"changes":[{"principal":"marketing-analysts","remove":["READ_FILES"]}]}'
# 2. Drop dependent tables/volumes (or repoint them) first, then the location
databricks external-locations delete ext-lakehouse-aws-silver --force
# 3. Delete the credential (fails while any location still references it, unless you pass --force)
databricks storage-credentials delete sc-lakehouse-aws-prod
# 4. Cloud side: detach IAM last
aws iam delete-role-policy --role-name databricks-uc-lakehouse-prod --policy-name uc-s3-access
aws iam delete-role --role-name databricks-uc-lakehouse-prod
With Terraform, the same unwind is terraform destroy: Terraform walks the dependency graph in reverse, so you do not race deletes by hand. Add -target only when you want to remove a subset of resources. Keep the bucket itself; teardown of governance should never delete data.
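The inside-out rule is a reverse topological order over the dependency chain, which is easy to see in a few lines. This is an illustrative sketch only; the resource labels are shorthand for the objects in this guide, not Terraform addresses.

```python
from graphlib import TopologicalSorter

# Each key depends on the resources in its set (shorthand labels, not real IDs).
DEPENDS_ON = {
    "iam_role": set(),
    "storage_credential": {"iam_role"},
    "external_location": {"storage_credential"},
    "grants": {"external_location"},
    "tables_and_volumes": {"external_location"},
}


def teardown_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return a safe deletion order: dependents first, their dependencies last."""
    creation_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(creation_order))


if __name__ == "__main__":
    for step, resource in enumerate(teardown_order(DEPENDS_ON), start=1):
        print(step, resource)
```

Grants and tables have no ordering between themselves, so either may come first; what matters is that both precede the external location, which precedes the credential, which precedes the IAM role.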
Common pitfalls
- Trust policy / external ID mismatch (AWS). The single most common failure: the credential is created but validate returns AbacException or InvalidClientTokenId. Cause is a missing sts:ExternalId condition or the wrong Databricks UCMasterRole ARN. Always read the real external ID back from the credential and copy the master-role ARN from your UI.
- Wrong RBAC role (Azure). Storage Blob Data Reader is read-only; jobs that write Delta need Storage Blob Data Contributor. Assign the role on the storage account or container rather than the resource group, so the identity is scoped to the data it actually needs and nothing else in the group.
- Overlapping URLs. Creating ext-...-prod on s3://bucket/ and another on s3://bucket/silver is rejected. Plan the prefix hierarchy first; grant at the narrowest prefix a team needs.
- Granting on the credential instead of the location. Data access is granted on the external location; the credential only receives the admin-level CREATE EXTERNAL LOCATION delegation. Mixing these up either over-grants or denies access confusingly.
- Firewalled storage with no Databricks exception. If the storage account/bucket has a network firewall, add the Databricks workspace's VPC/VNet (or the NCC private endpoint / virtual appliance egress) to its allow-list, or validation times out rather than erroring clearly.
- Leftover legacy keys. Removing the external-location grant does not remove a key still pasted in a cluster's Spark config. Grep cluster policies and notebooks, strip fs.s3a.access.key and friends, and let Wiz Code fail any PR that reintroduces one.
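For the leftover-key pitfall, the grep can be a short script run over exported cluster configs and notebook sources. The patterns below are a starting sketch, not a complete secret scanner: they cover the S3A and ADLS account-key settings and the AKIA prefix of AWS access key IDs, and find_static_keys is a hypothetical helper name.

```python
import re

# Starting set of patterns; extend it for whatever legacy settings your clusters used.
KEY_PATTERNS = {
    "S3A static key setting": re.compile(r"fs\.s3a\.(?:access|secret)\.key"),
    "ADLS account key setting": re.compile(r"fs\.azure\.account\.key"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_static_keys(text: str) -> list[tuple[int, str]]:
    """Return (line number, finding label) for every suspicious line in text."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for label, pattern in KEY_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, label))
    return findings


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is AWS's documented example key ID, not a real secret.
    spark_conf = (
        "spark.hadoop.fs.s3a.access.key AKIAIOSFODNN7EXAMPLE\n"
        "spark.databricks.delta.preview.enabled true\n"
    )
    for line_no, label in find_static_keys(spark_conf):
        print(f"line {line_no}: {label}")
```

A non-empty result is a reasonable condition for failing a CI step, alongside whatever Wiz Code reports on the same PR.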
Security notes
The architecture is secret-free by design: no cloud access key ever lands on a cluster — Databricks assumes a role/identity for a short-lived token, scoped by IAM to one bucket and by the external-location grant to one prefix and one team. Keep the chain tight: least-privilege IAM on the cloud side, least-privilege GRANT on the Databricks side, and SCIM from Okta/Entra ID so deprovisioning is automatic. Treat CREATE EXTERNAL LOCATION and storage-credential ownership as privileged — only platform-admins, gated behind a ServiceNow change and the Moodle enablement sign-off. Let Wiz verify continuously that no governed bucket drifted public and no credential over-grants, with Wiz Code catching the same in IaC before merge, and CrowdStrike Falcon covering runtime on any self-managed or virtual-appliance compute. Pipe the Unity Catalog audit log to Dynatrace/Datadog with alerts on credential creation and gold-layer writes, so a misuse is a page, not a forensic discovery.
Cost notes
Unity Catalog governance itself adds no Databricks charge: you pay for the DBUs the clusters consume and the underlying cloud storage and egress. The cost levers are operational: (a) cross-cloud egress: a job in an AWS workspace reading a GCS external location pays GCP egress and cross-cloud transfer, so co-locate compute with the bucket it reads and reserve cross-cloud reads for genuine federation; (b) consolidate credentials: one storage credential can back many external locations, so you do not need (and should not create) a credential per bucket; (c) ship audit/system-table data to Datadog/Dynatrace with sensible retention rather than indexing every event forever; (d) right-size and auto-terminate clusters via cluster policies, since the governance layer does nothing to stop an idle all-purpose cluster from burning DBUs. The Wiz and CrowdStrike subscriptions are fixed platform costs amortized across all workspaces, not per-location.
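Lever (a) can be checked mechanically: the URL scheme of an external location tells you which cloud holds the data, so a job definition can be flagged when its workspace cloud differs. A minimal sketch, assuming the three schemes used in this guide and lowercase cloud labels of my own choosing:

```python
from urllib.parse import urlsplit

# Schemes used by the external locations in this guide.
SCHEME_TO_CLOUD = {"s3": "aws", "abfss": "azure", "gs": "gcp"}


def is_cross_cloud(workspace_cloud: str, location_url: str) -> bool:
    """True when the storage URL lives on a different cloud than the workspace."""
    scheme = urlsplit(location_url).scheme
    storage_cloud = SCHEME_TO_CLOUD.get(scheme)
    if storage_cloud is None:
        raise ValueError(f"unrecognized storage scheme: {scheme!r}")
    return storage_cloud != workspace_cloud.lower()


if __name__ == "__main__":
    reads = [
        ("aws", "s3://acme-lakehouse-prod/silver"),
        ("aws", "gs://acme-lakehouse-prod/silver"),
    ]
    for cloud, url in reads:
        flag = "cross-cloud egress" if is_cross_cloud(cloud, url) else "co-located"
        print(f"{cloud} workspace -> {url}: {flag}")
```

This only detects cross-cloud reads; same-cloud, cross-region transfer also costs money and would need the bucket region as an extra input.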
The destination is a lakehouse where every storage path on every cloud is governed by one model: a credential Databricks assumes without a key, an external location you grant on by prefix, and an audit log that names every reader. The day a key leaks again — and someone will try — there is no key to leak, the blast radius is one prefix, and the team that touched it is on a dashboard before the incident review opens. That is the win the mandate was really asking for.