A national health insurer’s chief data officer walks out of an audit with a finding that cannot be argued away: the actuarial team, the fraud team, and three regional marketing teams each maintain their own copy of the member-claims data, extracted on different days through different scripts, and nobody can say who can see protected health information, where a given column came from, or whether the “risk score” in the board deck was computed from the same data the regulator was shown. Seven copies of the truth, four of them stale, none of them governed. The mandate that lands on the data platform team is precise and uncomfortable: one governed copy of every dataset, row- and column-level control over PHI that survives an HHS audit, end-to-end lineage from raw feed to the dashboard, and self-service for 600 analysts and data scientists who will revolt if you make them file a ticket for every table. A pile of notebooks pointed at a storage account is not the answer. This article is the reference architecture for building that platform properly on Azure Databricks with Unity Catalog — a private-networked, identity-gated, lineage-tracked lakehouse that a HIPAA compliance officer and a CISO will actually sign.
The pressures stack the way they always do in regulated data. Regulation (HIPAA, plus state insurance law) means every access to PHI needs an enforced policy and an audit trail, and the data cannot sit on the public internet. Scale means tens of terabytes of claims, eligibility, and clinical feeds landing daily, queried by 600 people who each think their workload is the priority. Governance means one catalog, one permission model, one lineage graph — not a per-team patchwork. And cost means compute that today runs 24/7 because nobody trusts auto-termination, billed to a single cost center instead of charged back. The lakehouse is the pattern that satisfies all four at once: open Delta tables on cheap object storage with the ACID guarantees, governance, and performance people used to buy a warehouse for — one copy of the data, governed in place, serving SQL analytics and ML from the same tables.
Why not the obvious shortcuts
The naive fixes each fail predictably, and naming why matters because someone on the project will propose all three.
A classic data warehouse (Synapse dedicated pools, Snowflake) governs beautifully but forces a copy: claims data lands in a lake, then gets loaded again into the warehouse, and now you maintain two systems, two security models, and a nightly ETL that is the thing most likely to break at 3 a.m. — and the data scientists still can’t train models on warehouse-locked tables. A plain data lake (just ADLS Gen2 with notebooks) is cheap and open but has no real governance: permissions are storage-account ACLs that nobody can reason about, there is no column masking, no lineage, and “who queried this PHI” is unanswerable. Per-workspace Hive metastores — the old Databricks default — scatter your metadata across every workspace, so a table governed in the actuarial workspace is invisible and ungoverned in the fraud one, which is exactly the seven-copies problem you were hired to kill.
The lakehouse with Unity Catalog threads the needle. Data lives once as Delta tables on ADLS Gen2. Unity Catalog sits above every workspace as a single account-level governance plane: one three-level namespace (catalog.schema.table), one place to grant access, fine-grained row filters and column masks for PHI, and automatic lineage captured down to the column. Governance becomes a property of the data, enforced by the engine on every query, instead of a hope in each team’s scripts.
Architecture overview
The platform runs two distinct concerns that share storage and governance but live on different schedules: data engineering pipelines that refine raw feeds into trustworthy tables, and consumption workloads — SQL analytics, BI, and ML — that read those tables. Keeping them separate in your head is the first step to operating this well.
The defining property of the topology is the one the security team cares about most: storage and the control plane are private, and every public data-plane surface is disabled. ADLS Gen2, the Databricks workspaces, Key Vault, and the supporting PaaS expose private IPs inside the VNet via Private Endpoints; the workspaces use Secure Cluster Connectivity (no-public-IP) so cluster nodes have no inbound public address, and Private Link carries both the front-end (user → workspace) and back-end (cluster → control plane) traffic. No claims record and no query result ever traverses the public internet — which is what makes a HIPAA story defensible.
The medallion data flow, following the path a claim takes:
- Source systems — the claims adjudication platform, the eligibility system, an HL7/FHIR clinical feed, and a fraud-signal vendor — land raw files and CDC streams into the Bronze zone on ADLS Gen2. Ingestion uses Databricks Auto Loader (incremental file discovery) and, for streaming feeds, Event Hubs. Bronze is the immutable landing record: raw, append-only, exactly as received, so you can always replay.
- Delta Live Tables (DLT) pipelines refine Bronze → Silver: schema is enforced, types are cast, duplicates are dropped, member identifiers are standardized, and DLT expectations (declarative data-quality constraints) quarantine or drop rows that fail —
EXPECT member_id IS NOT NULL ON VIOLATION DROP ROW. Silver is the cleaned, conformed, queryable record of claims and members. - DLT continues Silver → Gold: business-level aggregates and features — per-member risk scores, fraud-likelihood features, monthly loss ratios — modeled for consumption. Gold tables are what the board deck and the actuarial models read, and because they are derived through governed, lineage-tracked DLT, you can prove what every number came from.
- Unity Catalog governs all three layers as one namespace. A SQL analyst opens Databricks SQL against a serverless SQL warehouse and queries
prod.gold.member_risk; the engine checks UC permissions, applies the row filter that scopes the analyst to their permitted regions, and applies the column mask that redacts SSN and diagnosis codes unless the caller is in thephi_clearedgroup — all before a single row is returned. - Data scientists work in the same catalog from notebooks and MLflow, reading governed Silver/Gold tables, registering models in the Unity Catalog model registry, and serving features through the online store — never extracting PHI to a laptop, because the governed table is easier to use than a copy would be.
Identity and provisioning, independent and event-driven: the insurer’s workforce IdP is Okta, federated to Microsoft Entra ID so Azure and Databricks see a first-class Entra token. Okta provisions users and groups into the Databricks account via SCIM — when HR moves an analyst into the actuarial group in Okta, SCIM pushes that group into Databricks within minutes, and Unity Catalog grants attached to the group take effect immediately. Deprovisioning is the same path in reverse, which is the control an auditor actually tests. Secrets that are not managed identities — the fraud vendor’s API key, JDBC credentials for the source systems — live in HashiCorp Vault (or Azure Key Vault behind a Databricks secret scope), leased dynamically and never written into notebook source.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge / BI delivery | Akamai | TLS, anycast, WAF for the analytics portal and embedded dashboards | WAF rules on the BI front door; origin shield to the private workspace endpoint |
| Identity / SSO | Okta + Microsoft Entra ID | Workforce SSO (Okta) federated to Entra for native Azure + Databricks auth | OIDC federation; Okta SCIM provisions users/groups to the Databricks account |
| Governance plane | Unity Catalog | One account-level catalog: grants, row filters, column masks, lineage | Three-level namespace; storage credentials + external locations; ABAC tags |
| Storage | ADLS Gen2 | Single governed copy of Bronze/Silver/Gold Delta tables | Hierarchical namespace; Private Endpoint; public access disabled |
| Ingestion | Auto Loader + Event Hubs | Incremental file and stream ingestion into Bronze | cloudFiles schema evolution; checkpointed exactly-once |
| Pipelines | Delta Live Tables | Declarative medallion refinement with quality expectations | Expectations (drop/quarantine); Enhanced Autoscaling; serverless DLT |
| SQL analytics | Databricks SQL (serverless) | Governed BI/SQL on Gold; auto-suspend warehouses | Serverless warehouses; auto-stop 5–10 min; Photon engine |
| ML | MLflow + UC model registry | Train on governed tables, register/serve models with lineage | Models in UC; feature/online store; no PHI extraction |
| Secrets | HashiCorp Vault / Key Vault | Source-system creds, vendor API keys | Dynamic leases; Databricks secret scope backed by Key Vault |
| Catalog / lineage (enterprise) | Microsoft Purview | Enterprise-wide catalog, classification, business glossary, lineage rollup | UC ↔ Purview integration; auto-classify PHI; scan ADLS |
| CSPM / data posture | Wiz + Wiz Code | Cloud posture, sensitive-data exposure, attack paths; IaC scanning | Agentless scan of ADLS/Databricks; Wiz Code gates Terraform PRs |
| Runtime security | CrowdStrike Falcon | Runtime threat detection on cluster/driver compute | Sensor via init script on node pools; detections to the SOC |
| Observability | Dynatrace / Datadog | Job/pipeline telemetry, cost-per-workload, SLA dashboards | System tables + REST metrics ingested; anomaly detection |
| ITSM / approvals | ServiceNow | Access-request approvals, change gates, incident records | Catalog access request → grant; auto-ticket on policy breach |
| CI / IaC | GitHub Actions + Terraform | Asset-bundle deploys; infrastructure as code | OIDC to Azure (no stored creds); Databricks Asset Bundles |
A few of these choices deserve the why, because they are the ones teams get wrong.
Why Unity Catalog, not per-workspace Hive metastores. The single most consequential decision is making governance account-level, above the workspaces, not bolted into each one. Unity Catalog gives you one identity model (UC trusts the Entra/Okta-provisioned principals), one permission grammar (GRANT SELECT ON ... TO <group>), one lineage graph that spans every workspace, and one audit log. With per-workspace metastores, a table governed in the actuarial workspace is a different, ungoverned object in the fraud workspace — the architecture guarantees divergence. UC makes prod.gold.member_risk mean exactly one governed thing everywhere.
Why row filters and column masks belong in the catalog, not the query. It is tempting to put PHI redaction in each report’s SQL or in the BI tool. Do not — that means every new query is a new place to get masking wrong, and one missed WHERE clause leaks diagnosis codes across a HIPAA line. Instead, attach the policy to the table in Unity Catalog so the engine enforces it on every read, from SQL, notebooks, and BI alike:
-- Column mask: redact SSN unless the caller is PHI-cleared
CREATE FUNCTION prod.security.mask_ssn(ssn STRING)
RETURN CASE WHEN is_account_group_member('phi_cleared')
THEN ssn ELSE 'XXX-XX-' || right(ssn, 4) END;
ALTER TABLE prod.silver.member
ALTER COLUMN ssn SET MASK prod.security.mask_ssn;
-- Row filter: scope analysts to their permitted regions
CREATE FUNCTION prod.security.region_filter(region STRING)
RETURN is_account_group_member('phi_cleared')
OR region = current_recipient('region'); -- via session/group tag
ALTER TABLE prod.gold.member_risk
SET ROW FILTER prod.security.region_filter ON (region);
Protection stays a property of the data, enforced by the engine — not a discipline you have to re-impose in every notebook.
Why Delta Live Tables, not hand-written notebook ETL. Medallion pipelines built as a chain of scheduled notebooks become a tangle of MERGE statements, manual checkpoints, and silent data-quality drift. DLT makes the pipeline declarative: you describe the Silver and Gold tables and their constraints, and DLT manages dependency ordering, incremental processing, retries, and autoscaling. The expectations are the part that pays for itself — a row that fails EXPECT valid_claim_date is dropped or quarantined with a recorded count, so a bad upstream feed shows up as a quality metric instead of a wrong number in the board deck three weeks later.
Implementation guidance
Provision with Terraform, and treat the network and the metastore as the first deliverables. Two things, gotten wrong, sink this architecture silently: private DNS (endpoints that resolve to firewalled public IPs and hang) and metastore topology (workspaces attached to the wrong or to multiple metastores).
- A hub/spoke or single VNet with the two delegated subnets every Databricks workspace requires (host + container), plus a subnet for the Private Endpoints.
- Private Endpoints and private DNS zones for ADLS Gen2 (
privatelink.dfs.core.windows.net,privatelink.blob.core.windows.net), the Databricks back-end and browser-auth (privatelink.azuredatabricks.net), and Key Vault — linked to the VNet. Forgetting one zone is the single most common failure on this architecture. - The Databricks workspaces with Secure Cluster Connectivity (no public IP) and VNet injection, public network access disabled, front-end and back-end Private Link enabled.
- One Unity Catalog metastore per region, with a dedicated ADLS Gen2 container as its managed root, attached to every workspace in that region. One metastore is the whole point — resist the urge to make several.
- Storage credentials + external locations in UC pointing at your ADLS containers through an Entra managed identity (an Access Connector for Azure Databricks), so UC — not a storage key — brokers every data-plane read.
A minimal Terraform shape for the metastore and its storage binding communicates the intent — UC owns the storage, no keys:
resource "databricks_metastore" "this" {
name = "uc-insurer-cin"
region = "centralindia"
storage_root = "abfss://uc-root@stucinsurer.dfs.core.windows.net/"
force_destroy = false
}
resource "databricks_metastore_data_access" "mi" {
metastore_id = databricks_metastore.this.id
name = "uc-access-connector"
azure_managed_identity {
access_connector_id = azurerm_databricks_access_connector.uc.id
}
is_default = true # UC brokers storage via managed identity, not account keys
}
resource "databricks_metastore_assignment" "prod" {
metastore_id = databricks_metastore.this.id
workspace_id = azurerm_databricks_workspace.prod.workspace_id
}
The pipeline that applies this runs in GitHub Actions, authenticating to Azure via OIDC federation so there is no stored service-principal secret to leak. Application and pipeline code ship as Databricks Asset Bundles from the same pipeline, and Wiz Code scans the Terraform and bundle definitions on every pull request — flagging a workspace defined with public access or a storage account missing its Private Endpoint before it merges, not after an auditor finds it.
Identity: provision the humans, kill the keys. Human SSO flows Okta → Entra → Databricks: analysts log in with the bank’s Okta credentials and conditional-access policies, Okta federates to Entra over OIDC, and — critically — Okta SCIM provisions the user and their group memberships into the Databricks account, so the Entra group an analyst belongs to is the same principal Unity Catalog grants against. You manage access by granting to groups in UC and managing group membership in Okta; deprovisioning a leaver in Okta removes their Databricks access on the next SCIM sync. Storage is reached only through UC’s managed identity — there are no account keys in any notebook — and the residual secrets (source-system JDBC creds, the fraud vendor’s API key) live in HashiCorp Vault or a Key Vault-backed secret scope, leased short and never committed.
Governance wiring. Model the namespace deliberately: a catalog per environment (dev, staging, prod) or per domain, schemas for bronze/silver/gold, and grants only ever to groups. Tag PHI columns with UC tags (pii=phi) so policy and discovery can key off the tag rather than column names. Carry the medallion contract in the table itself — Bronze append-only, Silver conformed, Gold consumption — and let DLT’s lineage plus UC’s column lineage answer “where did this number come from” without a meeting.
Enterprise considerations
Security & Zero Trust. The architecture is Zero Trust by construction: identity-based access only (UC principals from Okta/Entra), least-privilege grants to groups, no public data-plane surface, no storage keys. Layer on top: (a) row filters and column masks so PHI exposure is enforced by the engine, not by query discipline; (b) Wiz running continuous CSPM and sensitive-data-exposure scanning across ADLS and the workspaces, alerting the moment a container drifts to public exposure or a grant widens access too far — the posture backstop behind UC’s controls — with Wiz Code shifting that check left into IaC pull requests; © CrowdStrike Falcon sensors deployed via cluster init script on the driver and worker nodes for runtime threat detection, feeding the insurer’s SOC; (d) UC and workspace audit logs (and system tables) streamed to the SIEM so “who queried which PHI table, when, and what policy applied” is a query, not an investigation; (e) a policy breach — a failed access attempt, a grant that violates a guardrail — auto-raises a ServiceNow incident so security has a ticket, not just a log line. Azure Policy denies any storage account or workspace created with public network access, and Wiz independently verifies the policy is actually holding.
Cost optimization. Compute dominates a lakehouse bill, and the default failure is clusters that never turn off.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Serverless + auto-stop | Databricks SQL serverless warehouses auto-suspend after 5–10 min idle | Stops paying for idle BI compute overnight |
| DLT Enhanced Autoscaling | Pipelines scale workers to backlog, down to zero between runs | Right-sizes ingestion to actual load |
| Spot / Photon | Spot workers for fault-tolerant jobs; Photon for vectorized SQL | Lower $/job and faster queries (fewer node-hours) |
| Storage tiering + OPTIMIZE | OPTIMIZE/Z-ORDER + Vacuum; cool tier for cold Bronze |
Less data scanned per query; cheaper at-rest |
| Tagging + system tables | Tag clusters/warehouses by team; bill from system.billing.usage |
Real chargeback per cost center |
Tag every cluster and warehouse by team, read system.billing.usage for actual consumption, and pipe it to Dynatrace or Datadog for the chargeback dashboard the CFO sees — so the fraud team owns its spend and the marketing teams can no longer hide a 24/7 cluster.
Scalability. Each concern scales independently. Ingestion scales with Auto Loader’s incremental discovery and Event Hubs partitions; DLT scales workers to backlog via Enhanced Autoscaling. Databricks SQL scales up (warehouse t-shirt size) for heavier queries and out (multi-cluster load balancing) for analyst concurrency at month-end close. ML training scales with job clusters sized per run. Delta itself scales through partitioning, Z-ORDER clustering on common filter columns (member_id, region, claim_date), and OPTIMIZE compaction so the 600th analyst’s query doesn’t scan ten thousand small files. The natural ceiling is ADLS throughput and your regional Databricks quota, which is why a 600-seat rollout plans capacity and a paired region early.
Failure modes, and what each one looks like. Name them before they page you.
- A missing private DNS zone link — the workspace or storage endpoint deploys clean but resolves to a firewalled public IP, and clusters fail to start or jobs hang until timeout. Mitigation: assert every zone link in Terraform and in a post-deploy smoke test.
- The small-files problem — streaming ingestion writes millions of tiny Delta files and query latency degrades badly over weeks. Mitigation: scheduled
OPTIMIZE+VACUUM, and auto-compaction on the streaming tables. - Schema drift from an upstream feed — the claims vendor adds a column or changes a type and a naive pipeline silently corrupts Silver. Mitigation: Auto Loader schema evolution plus DLT expectations that quarantine offending rows and surface a quality alert.
- Ungoverned escape hatch — a team spins up a workspace not attached to the metastore, or reads ADLS with a storage key, and creates an ungoverned copy. Mitigation: Azure Policy + Wiz to forbid key access and unattached workspaces; UC managed identity as the only storage path.
- Concurrent-write conflicts — overlapping
MERGEs on the same Delta table throw conflicts under heavy ingestion. Mitigation: partition to isolate writers, and let Delta’s optimistic concurrency retry. - Regional outage — see DR below.
Reliability & DR (RTO/RPO). Decide the numbers per tier. The durable source of truth is ADLS Gen2 with geo-redundant storage (RA-GRS), so the data survives a regional loss with near-zero RPO. Delta’s transaction log makes recovery point-in-time precise. For the platform, DR means a paired-region workspace and metastore kept warm, with Delta Deep Clone replicating critical Gold tables cross-region on a schedule and DLT pipelines redeployable from the Asset Bundle in the secondary. A pragmatic target for this platform: RTO 1–2 hours to bring analytics and pipelines back in the paired region, RPO 15 minutes for replicated Gold tables, with full rebuild from geo-redundant Bronze possible within hours if needed. Akamai health checks drive failover for the BI front door.
Observability. Instrument pipelines and warehouses end to end in Dynatrace or Datadog by ingesting UC system tables (system.access.audit, system.billing.usage, query history) and DLT/job run metrics via the REST API. Emit the metrics the business actually cares about — pipeline freshness/SLA (is Gold current?), DLT expectation pass-rate (data-quality health), cost per team, query p95 latency, and access anomalies (a principal reading a PHI table it never touched before). New datasets and access grants pass through a ServiceNow request-and-approval flow before going live, giving compliance a documented gate.
Governance & lineage. This is the heart of the mandate. Unity Catalog captures automatic column-level lineage for every table, view, and DLT pipeline, so “this board-deck risk score derives from these Silver claims columns through this Gold transform” is a graph you can show an auditor, not a tribal memory. Roll that up to enterprise scope with Microsoft Purview: Purview scans and classifies the catalog (auto-tagging PHI/PII), maintains the business glossary and data-product catalog the whole insurer browses, and federates UC lineage into an organization-wide view alongside non-Databricks sources. Pin the contract: grants only to groups, PHI columns tagged and masked, model versions registered in UC, pipeline code in version control and reviewable. Together UC (enforcement + fine-grained lineage) and Purview (enterprise catalog + classification) answer the three questions that started this project — who can see PHI, where every column came from, and whether the board’s number matches the regulator’s.
Explicit tradeoffs
Accept these or do not build it. Unity Catalog is a real migration if you are coming from per-workspace Hive metastores — you must consolidate to a single metastore, move tables to external locations brokered by managed identity, and rewrite key-based access; it is worth it, but it is not free. Some legacy patterns (certain init-script tricks, direct DBFS-root writes, a few third-party connectors) are constrained on UC-enabled clusters, so audit your existing notebooks before you flip the switch. The fully private posture that makes the security team sign costs you setup complexity — VNet injection, no-public-IP, multiple private DNS zones, front-end and back-end Private Link — and the price of forgetting one piece is a silent cluster-start hang, not a clear error. The Okta-to-Entra-to-Databricks SCIM chain adds moving parts the single-IdP shops will not need, but it is the only way group changes and deprovisioning stay honest at 600 seats. And DLT’s declarative model trades some low-level control for managed reliability — most teams should take that trade; the handful with exotic streaming logic may keep those few jobs as Structured Streaming jobs and let DLT own the rest.
The alternatives, and when they win. If your governance needs are modest and you live entirely in the Microsoft stack, Microsoft Fabric / Synapse with Purview is a coherent first-party option — fewer vendors, tighter Power BI integration, less flexibility for heavy data science. If you are a pure-SQL analytics shop with no ML and no streaming, a cloud data warehouse (Snowflake, Synapse dedicated) is simpler to operate — accept the copy and the second security model. If your data volumes are small and static, you may not need a lakehouse at all; a governed warehouse or even Postgres will do. Graduate to this full private-networked, Unity-Catalog-governed lakehouse when one governed copy, fine-grained PHI control, column-level lineage, and self-service for hundreds of users at scale are all simultaneously non-negotiable — which, for a regulated insurer, they are.
The shape of the win
For the insurer, the payoff is not “Databricks.” It is that an actuary queries prod.gold.member_risk, sees exactly the regions and the non-PHI columns they are cleared for — enforced by the catalog, not by a hope in the query — and the lineage graph proves the risk score traces cleanly from the raw claims feed through governed Silver and Gold transforms, so the number in the board deck is provably the same number the regulator was shown. Seven copies of the truth collapse to one. The CDO can answer “who can see PHI and where did this come from” in seconds, the analysts got self-service instead of a ticket queue, and the HHS auditor got an enforced policy and a complete audit trail. Everything upstream — the Private Link posture, the Okta-to-Entra SCIM provisioning, the Vault-held secrets, the Wiz and Wiz Code posture scanning, the CrowdStrike sensors, the Purview classification, the Dynatrace cost-and-freshness dashboards — exists to make a CISO, a compliance officer, and a CFO each say yes. The architecture here is the destination; consolidate one domain first if you must, but this is where a governed enterprise lakehouse has to land.