A regional hospital network — eleven hospitals, an outpatient arm, and a newly acquired telehealth startup — has a problem that its CIO can no longer push down the road. Patient data lives in three electronic health record (EHR) systems that do not speak to each other, a lab information system, an imaging archive, and a pile of CSV exports that an analytics team emails around to build readmission models. Two things happened in the same quarter: a payer launched a value-based-care contract that pays the network on outcomes and demands a clean, longitudinal patient record, and a junior analyst left an unencrypted extract of 60,000 records on a shared drive. The board’s instruction is blunt: build one governed place for clinical and operational data, make it useful for analytics and the new contract, and make it the kind of thing a HIPAA auditor and the network’s privacy officer will sign off on. This article is the reference architecture for that platform on Azure — a private-networked, encrypted, governed data platform built on Azure Health Data Services, a Data Lake, and Synapse, with a real de-identification path and an audit trail that survives a Business Associate Agreement (BAA) review.
The pressures in healthcare stack differently from finance, but they stack just as hard. Regulation here is HIPAA, specifically the Privacy Rule and the Security Rule, which means protected health information (PHI) must be access-controlled, encrypted at rest and in transit, and logged on every touch in a tamper-evident audit trail. It also means every Azure service you put PHI into must be covered by the BAA Microsoft signs, so you use only BAA-covered services. Interoperability means the platform has to ingest and normalize the messy reality of HL7 v2 feeds and FHIR APIs into one model. Analytics means data scientists need access to data at scale without ever needing raw PHI for most of their work. And cost means a non-profit hospital network cannot fund a platform sized for a hyperscaler. The design below satisfies all four without pretending any of them away.
Why not the obvious shortcuts
Three shortcuts will be proposed in the first design meeting, and each fails in a way worth naming.
A single big SQL database for everything collapses under the shape of the data: FHIR resources are deeply nested JSON, imaging is binary, HL7 v2 is pipe-delimited, and clinical notes are free text. Forcing all of it into relational tables loses fidelity and makes the FHIR interoperability story — the thing the payer contract requires — impossible. Letting analysts query the EHR directly puts a reporting workload on a clinical system of record (a patient-safety risk if it slows order entry) and, worse, exposes raw PHI to every analyst for work that rarely needs identities at all. A lift-and-shift of the CSV-emailing habit into a cloud share is exactly what caused the 60,000-record incident; it just moves the breach surface to a new drive.
The platform threads these by separating concerns: a FHIR service as the interoperability and clinical-API layer, a Data Lake as the durable, multi-format store organized in quality tiers, Synapse as the analytics engine, and a de-identification pipeline that means the analytics tier is, by default, working on data with no direct identifiers in it. PHI is concentrated where it must be and removed everywhere it need not be.
Architecture overview
The platform runs two cooperating flows: an ingestion-and-normalization flow that lands clinical data and turns it into FHIR, and an analytics flow that refines, de-identifies, and serves it. They share a lake and a governance plane but run on different schedules and trust boundaries.
The defining property of the whole topology is the one the privacy officer cares about most: every PHI-bearing Azure service is reachable only over a Private Endpoint, every public endpoint is disabled, and every data store is encrypted with customer-managed keys (CMK). Azure Health Data Services, the Data Lake (ADLS Gen2), Synapse, Service Bus, and Key Vault expose private IPs inside the VNet only. Once inside Azure, PHI never traverses the public internet, and the network layer carries much of the burden that a HIPAA Security Rule risk assessment will scrutinize.
Ingestion and normalization flow, following the data:
- Source systems push to the platform at the edge through Akamai, which terminates TLS, provides global anycast for the telehealth front door, and runs WAF/bot protection before any traffic reaches Azure. Inbound clinical feeds — HL7 v2 from the lab and the legacy EHRs, FHIR bundles from the modern EHR — arrive at an ingestion endpoint fronted by Azure API Management (APIM) in internal VNet mode, the single audited front door for every external data exchange.
- APIM validates the caller (a partner system or an internal interface engine), attaches identity, and routes the message. HL7 v2 messages flow to the MLLP/HL7 ingestion tier — a set of virtual appliances (interface-engine VMs such as a Mirth/Rhapsody-class engine) running in a locked-down subnet, because hospital integration still lives and dies on MLLP over HL7 v2, and these appliances do the protocol handling Azure-native services do not.
- Normalized events land on Azure Service Bus, decoupling bursty inbound feeds from downstream processing so a lab batch at shift change cannot overwhelm the converter.
- The FHIR converter (Azure Health Data Services’ `$convert-data` operation, with Liquid templates) turns HL7 v2 and C-CDA into FHIR R4 resources, which are persisted in the Azure Health Data Services FHIR service, the clinical system-of-reference and the API the payer’s value-based-care platform calls.
- In parallel, the FHIR service exports to the lake via `$export` (bulk NDJSON) on a schedule, and raw inbound files are archived to the Bronze tier of the Data Lake as the immutable source of truth.
Analytics flow, refining and protecting the data:
- Azure Data Factory / Synapse pipelines promote data through a medallion architecture in ADLS Gen2: Bronze (raw, as-received), Silver (flattened FHIR resources, validated, deduplicated, parsed into tabular Delta/Parquet), and Gold (analytics-ready marts — a longitudinal patient record, readmission features, quality measures).
- Between Silver and Gold sits the de-identification pipeline: the Health Data Services de-identification service (and the `$de-identify` capability / the open-source FHIR de-id tooling) strips or transforms the 18 HIPAA Safe Harbor identifiers, such as names, full dates, geographies finer than state, MRNs, and device IDs, producing a de-identified Gold dataset that data scientists use by default. Re-identification keys, where a limited data set is genuinely needed, live in HashiCorp Vault, not in the lake.
- Synapse (serverless SQL for ad-hoc exploration over the lake, dedicated SQL pools or Spark for heavy transformation and ML feature engineering) serves the analytics and reporting workloads. Power BI dashboards for clinical quality and the payer contract read from Gold.
- Microsoft Purview scans Bronze through Gold, classifies columns (it has built-in healthcare/PHI classifiers), maps lineage from source to dashboard, and enforces that sensitive columns are labeled — the governance plane an auditor walks.
Component breakdown
| Component | Service / tool | Role in the platform | Key configuration choices |
|---|---|---|---|
| Edge | Akamai | TLS, anycast, WAF, bot mitigation for the telehealth and partner front doors | WAF rules for the public API; origin shield to APIM’s private origin |
| API front door | Azure API Management | Single audited gateway for FHIR APIs and partner data exchange | Internal VNet mode; validate-jwt; subscription keys per partner; rate limiting |
| Identity / SSO | Okta + Microsoft Entra ID | Clinician/analyst SSO (Okta) federated to Entra for native Azure RBAC | OIDC federation; SMART-on-FHIR scopes via Entra; conditional access |
| HL7 ingestion | Virtual appliances (interface engine) | MLLP/HL7 v2 handling, channel routing the platform cannot do natively | Hardened VMs in an isolated subnet; CrowdStrike sensor; no public IP |
| Messaging | Azure Service Bus | Decouple bursty inbound feeds from conversion | Premium tier (VNet/PE); dead-letter queues; sessions per source |
| Clinical data | Azure Health Data Services — FHIR service | FHIR R4 system-of-reference and interoperability API | CMK; $convert-data; $export; SMART-on-FHIR; Azure RBAC via Entra |
| Lake | ADLS Gen2 (Bronze/Silver/Gold) | Durable multi-format store, medallion tiers | Hierarchical namespace; CMK; PE only; immutable Bronze; lifecycle tiering |
| De-identification | Health Data Services de-id / FHIR de-id tools | Safe Harbor de-identification before analytics | Config-driven redact/transform of the 18 identifiers; date-shift with patient key |
| Analytics | Azure Synapse Analytics | Serverless + Spark + dedicated SQL over the lake | Managed VNet; data exfiltration protection on; serverless for ad-hoc |
| Governance | Microsoft Purview | Classification, lineage, PHI labeling, catalog | Healthcare classifiers; scheduled scans; lineage to Power BI |
| Secrets / keys | HashiCorp Vault + Key Vault (HSM) | Re-id keys, partner tokens; CMK key material in HSM-backed Key Vault | Entra auth method; Managed HSM for CMK; rotation policies |
| CSPM / data posture | Wiz + Wiz Code | Cloud posture, PHI-exposure detection, IaC scanning pre-merge | Agentless scan of lake/FHIR/Synapse; Wiz Code gates Terraform PRs |
| Runtime security | CrowdStrike Falcon | Runtime protection on the interface-engine and Spark/VM compute | Sensor on appliance VMs and node pools; detections to the SOC |
| Observability | Dynatrace / Datadog | Pipeline health, latency, freshness, audit-export monitoring | OneAgent/agents on compute; SLOs on data freshness and conversion success |
| ITSM / approvals | ServiceNow | Data-access requests, change approvals, breach/incident records | Access-request workflow; change gate; auto-ticket on audit anomaly |
| CI / IaC | GitHub Actions / Jenkins + Argo CD; Terraform / Ansible | Pipeline build/test; GitOps deploy; infra and VM config as code | OIDC to Azure; Argo CD syncs Synapse/AKS manifests; Ansible hardens appliances |
| Training | Moodle | HIPAA workforce training, tracked completion as an audit artifact | SSO via Okta; completion records exported for compliance evidence |
A few choices deserve the why, because they are where healthcare data platforms go wrong.
Why a FHIR service, not just a JSON column. You could dump FHIR bundles into the lake and parse them in Spark, and for analytics you partly will. But the FHIR service gives you a queryable clinical API with the right semantics — search by patient, by encounter, by observation code — plus SMART-on-FHIR authorization scopes, $convert-data, and $export. The payer’s value-based-care platform expects to call FHIR; the modern EHR speaks FHIR; building the interoperability layer yourself is reinventing a standard Microsoft already operates under a BAA.
Why de-identify between Silver and Gold, not at query time. It is tempting to keep one identified Gold dataset and mask in the BI tool or in views. Don’t — that means raw PHI sits in the analytics tier and one misconfigured view or one over-broad role leaks it. Instead, materialize a de-identified Gold dataset as the default analytics surface, so a data scientist building a readmission model never has identifiers in scope at all. The much smaller population that genuinely needs a limited data set (with dates and zip for, say, geographic outcome analysis) gets a separate, tightly-RBAC’d path with a Vault-held re-identification key and a ServiceNow-approved justification on record.
Why the de-identification matters legally, not just technically. HIPAA’s Safe Harbor method de-identifies data by removing 18 specified identifiers, on the condition that the organization has no actual knowledge that what remains could identify an individual. Data that clears it is, for most purposes, no longer PHI and can be used far more freely for research and model-building. That single fact is what lets the network’s data scientists move fast on the analytics the value-based contract needs without every project becoming a privacy review.
Implementation guidance
Provision with Terraform, and treat the network and keys as the first deliverables. The order matters: get private DNS or CMK wrong and services either hang silently or refuse to start.
- A hub/spoke VNet with subnets for the interface-engine appliances, the private endpoints, APIM (delegated subnet), and Synapse’s managed-VNet integration runtime.
- Private DNS zones for each PaaS service, linked to the VNet: `privatelink.azurehealthcareapis.com` (Health Data Services), `privatelink.dfs.core.windows.net` (ADLS Gen2), `privatelink.sql.azuresynapse.net` and `privatelink.dev.azuresynapse.net` (Synapse), `privatelink.servicebus.windows.net`, and `privatelink.vaultcore.azure.net` (Key Vault). A forgotten zone is the single most common silent failure here.
- An HSM-backed Key Vault (or Managed HSM) holding the CMK, before the data stores, because each store references the key at creation.
- The data stores (FHIR service, ADLS Gen2, Synapse, Service Bus), each with `public_network_access = Disabled`, a Private Endpoint, and CMK configured.
- The interface-engine VMs, hardened by Ansible (CIS baseline, MLLP listeners, no public IP), with CrowdStrike Falcon sensors.
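The zone list above can be declared once and linked in a loop, which makes a forgotten link harder to ship. This is a minimal sketch, not the network's actual module: the variables `resource_group_name` and `hub_vnet_id` are hypothetical inputs introduced for illustration, while the zone names are the ones listed above.

```hcl
# Hypothetical inputs for illustration; these names are assumptions.
variable "resource_group_name" { type = string }
variable "hub_vnet_id" { type = string }

locals {
  # Private DNS zones named in the provisioning list.
  private_dns_zones = [
    "privatelink.azurehealthcareapis.com", # Health Data Services
    "privatelink.dfs.core.windows.net",    # ADLS Gen2
    "privatelink.sql.azuresynapse.net",    # Synapse SQL
    "privatelink.dev.azuresynapse.net",    # Synapse dev endpoint
    "privatelink.servicebus.windows.net",  # Service Bus
    "privatelink.vaultcore.azure.net",     # Key Vault
  ]
}

resource "azurerm_private_dns_zone" "zones" {
  for_each            = toset(local.private_dns_zones)
  name                = each.value
  resource_group_name = var.resource_group_name
}

# One VNet link per zone, so every zone resolves from inside the VNet.
resource "azurerm_private_dns_zone_virtual_network_link" "links" {
  for_each              = azurerm_private_dns_zone.zones
  name                  = "link-${replace(each.key, ".", "-")}"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = each.value.name
  virtual_network_id    = var.hub_vnet_id
  registration_enabled  = false
}
```

Because the links iterate over the zone resources themselves, adding a zone to the list creates its link in the same apply.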
A minimal Terraform shape for the lake communicates the intent — CMK on, public access off, hierarchical namespace for the medallion layout:
    resource "azurerm_storage_account" "lake" {
      name                          = "stlakehealthprodcin"
      resource_group_name           = var.resource_group_name # assumed variable; required argument
      location                      = var.location            # assumed variable; required argument
      account_tier                  = "Standard"
      account_replication_type      = "ZRS"
      is_hns_enabled                = true  # ADLS Gen2 / medallion
      public_network_access_enabled = false # PE only
      shared_access_key_enabled     = false # Entra auth only, no account keys

      # The identity the account uses to reach the CMK in Key Vault.
      identity {
        type         = "UserAssigned"
        identity_ids = [azurerm_user_assigned_identity.lake.id]
      }

      customer_managed_key {
        key_vault_key_id          = azurerm_key_vault_key.cmk.id # HSM-backed
        user_assigned_identity_id = azurerm_user_assigned_identity.lake.id
      }
    }
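With public access off, the lake is only reachable once it has a Private Endpoint registered in the matching DNS zone. The fragment below is a sketch of that pairing for the `dfs` subresource. The four variables are hypothetical inputs, and the endpoint and connection names are placeholders; only `azurerm_storage_account.lake` comes from the snippet above.

```hcl
# Hypothetical inputs for illustration; these names are assumptions.
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "private_endpoint_subnet_id" { type = string }
variable "dfs_private_dns_zone_id" { type = string } # privatelink.dfs.core.windows.net

resource "azurerm_private_endpoint" "lake_dfs" {
  name                = "pe-lake-dfs" # placeholder name
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.private_endpoint_subnet_id

  private_service_connection {
    name                           = "psc-lake-dfs"
    private_connection_resource_id = azurerm_storage_account.lake.id
    subresource_names              = ["dfs"] # the ADLS Gen2 endpoint
    is_manual_connection           = false
  }

  # Registers the endpoint's private IP in the linked private DNS zone.
  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [var.dfs_private_dns_zone_id]
  }
}
```

Some tooling also talks to the account's `blob` endpoint, in which case a second endpoint and the `privatelink.blob.core.windows.net` zone would be needed; whether that applies here depends on the clients in use.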
The pipeline that applies this runs in GitHub Actions (or Jenkins where the network standardizes on it), authenticating to Azure via OIDC federation so there is no stored service-principal secret to leak — the platform team has read the news. Wiz Code scans the Terraform on every pull request and blocks a merge that would create a store with public access or without CMK, catching the misconfiguration before it exists. Application and Synapse/Spark manifests deploy via Argo CD in a GitOps model, so the running state always matches git.
Identity: federate the humans, scope the machines. Clinicians and analysts authenticate through Okta as the workforce IdP, federated to Microsoft Entra ID over OIDC, so Azure resources see a first-class Entra token carrying the group and role claims that drive RBAC. SMART-on-FHIR scopes on the FHIR service are mapped to Entra app roles, so an application gets patient/Observation.read and nothing more. Set shared_access_key_enabled = false on the lake and disable local auth on Synapse and the FHIR service, so the only way in is Entra and every access is attributable to a named principal — which is what makes the audit trail meaningful. Pipeline identities use managed identity with least-privilege role assignments (Storage Blob Data Contributor scoped to the relevant container, FHIR Data Exporter for the export job). The few residual secrets that are not managed identities — partner-system credentials, re-identification keys — live in HashiCorp Vault, leased dynamically, never written to a config file.
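The least-privilege assignments described above might look like the following in Terraform. This is a sketch under stated assumptions: the principal IDs and the FHIR service ID are hypothetical variables, and the container name `silver` is assumed from the medallion layout rather than taken from a real deployment.

```hcl
# Hypothetical inputs for illustration; these names are assumptions.
variable "pipeline_principal_id" { type = string } # managed identity of the promotion pipeline
variable "export_principal_id" { type = string }   # managed identity of the export job
variable "fhir_service_id" { type = string }       # resource ID of the FHIR service

# Pipeline identity: write access to one container only, not the whole account.
resource "azurerm_role_assignment" "pipeline_silver" {
  scope                = "${azurerm_storage_account.lake.id}/blobServices/default/containers/silver"
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.pipeline_principal_id
}

# Export job identity: allowed to run bulk export on the FHIR service, nothing else.
resource "azurerm_role_assignment" "fhir_export" {
  scope                = var.fhir_service_id
  role_definition_name = "FHIR Data Exporter"
  principal_id         = var.export_principal_id
}
```

Scoping at the container rather than the account is what keeps a Silver pipeline identity from reading or writing Bronze or Gold.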
The audit trail, BAA-aligned. This is the deliverable an auditor opens first. Enable diagnostic settings on the FHIR service, the lake, Synapse, Key Vault, and APIM, streaming to an immutable Log Analytics workspace (and an append-only Storage archive with a legal-hold policy for long retention). The FHIR service emits an AuditLogs stream recording who read or wrote which resource; that, plus the storage and Synapse access logs, gives you the who-touched-what-PHI-when record the HIPAA Security Rule expects. Purview lineage ties a Gold dashboard back to its Bronze source so you can prove provenance. Workforce HIPAA training completion is tracked in Moodle and exported as evidence, because “we trained our people” is an audit question with a documented answer.
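As an illustration of the diagnostic settings described above, the fragment below routes the FHIR service's `AuditLogs` category to both a Log Analytics workspace and an archive storage account. All three variables are hypothetical inputs; the same pattern would be repeated, with the appropriate log categories, for the lake, Synapse, Key Vault, and APIM.

```hcl
# Hypothetical inputs for illustration; these names are assumptions.
variable "fhir_service_id" { type = string }
variable "log_analytics_workspace_id" { type = string }
variable "audit_archive_storage_account_id" { type = string } # append-only archive account

resource "azurerm_monitor_diagnostic_setting" "fhir_audit" {
  name                       = "fhir-audit"
  target_resource_id         = var.fhir_service_id
  log_analytics_workspace_id = var.log_analytics_workspace_id
  storage_account_id         = var.audit_archive_storage_account_id

  # The who-read-or-wrote-which-resource stream from the FHIR service.
  enabled_log {
    category = "AuditLogs"
  }
}
```

The immutability and legal-hold policy on the archive account is configured on that account itself and is not shown here.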
Enterprise considerations
Security & Zero Trust. The architecture is Zero Trust by construction: identity-based access only, least-privilege RBAC scoped per resource and per FHIR scope, no public data-plane surface, and CMK so the network controls the keys that protect PHI. Layer on top: (a) Wiz running continuous CSPM and PHI-exposure detection across the lake, FHIR service, and Synapse, alerting the moment any resource drifts to public exposure or a sensitive column lands in an unexpected place, as the posture backstop behind the policy controls; (b) Wiz Code shifting that left into the IaC pipeline; (c) CrowdStrike Falcon sensors on the interface-engine appliances and the Spark/analytics compute for runtime threat detection, feeding the network’s SOC (the interface engines are the most exposed tier because they terminate external feeds, so they get the most scrutiny); (d) automatic ServiceNow incidents on any access anomaly or policy breach, so the privacy officer gets a ticket, not just a log line. Azure Policy denies any healthcare-relevant resource created without CMK or with public network access, and Wiz independently verifies the policy is actually holding.
Cost optimization. A non-profit network funds this carefully, and the lake/analytics shape gives you real levers.
| Lever | Mechanism | Typical effect |
|---|---|---|
| Synapse serverless for ad-hoc | Pay per TB scanned on the lake instead of a running cluster | No idle dedicated-pool cost for exploration |
| Lake lifecycle tiering | Auto-move cold Bronze to Cool/Archive after N days | Large saving on the immutable raw tier |
| Delta + partition pruning | Partition Gold by date/facility; query prunes files | Cuts TB scanned (and serverless cost) per query |
| Spark autoscale + spot | Autoscale pools; spot nodes for non-urgent batch ML | Lower compute for tolerant workloads |
| Right-size FHIR throughput | Provision FHIR capacity to steady load, scale for bulk | Avoids paying for peak $export continuously |
Meter pipeline and query cost and pipe it to Dynatrace/Datadog, which the platform team uses for the chargeback view per service line (cardiology’s models vs. the payer-contract reporting).
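The lifecycle-tiering lever in the table can be sketched as a storage management policy on the lake account. The variable `bronze_cool_after_days` and its default are placeholders for the table's "N days", and the `bronze/` prefix assumes a container named `bronze`; neither is specified in the text above.

```hcl
# Placeholder for the "N days" threshold; the default is an assumption.
variable "bronze_cool_after_days" {
  type    = number
  default = 90
}

resource "azurerm_storage_management_policy" "lake_tiering" {
  storage_account_id = azurerm_storage_account.lake.id

  rule {
    name    = "bronze-to-cool"
    enabled = true

    filters {
      prefix_match = ["bronze/"] # assumes a container named "bronze"
      blob_types   = ["blockBlob"]
    }

    actions {
      base_blob {
        # Move raw files to the Cool tier once they stop changing.
        tier_to_cool_after_days_since_modification_greater_than = var.bronze_cool_after_days
      }
    }
  }
}
```

This sketch stops at Cool on purpose: Archive tier availability depends on the account's redundancy option, so check that against the replication setting before adding an archive step.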
Scalability. Each tier scales independently. The interface-engine appliances scale out behind a load balancer per feed volume; Service Bus Premium absorbs bursts so the converter is never the bottleneck. The lake scales effectively without limit. Synapse scales serverless automatically and Spark pools on demand; dedicated SQL scales by DWU when a quality-measure refresh needs muscle. The FHIR service scales its provisioned throughput for normal API traffic and bursts for scheduled bulk $export. The natural ceiling is conversion throughput at shift-change batch peaks, which is why Service Bus and a horizontally-scaled converter tier are in the design from day one.
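The burst-absorbing messaging tier could be declared roughly as follows. This is a sketch, not a sizing recommendation: the namespace and queue names, the capacity of one messaging unit, and the delivery count of five are all assumptions chosen for illustration.

```hcl
# Hypothetical inputs for illustration; these names are assumptions.
variable "resource_group_name" { type = string }
variable "location" { type = string }

resource "azurerm_servicebus_namespace" "ingest" {
  name                          = "sb-health-ingest" # placeholder; must be globally unique
  location                      = var.location
  resource_group_name           = var.resource_group_name
  sku                           = "Premium" # needed for Private Endpoint support
  capacity                      = 1         # messaging units; assumed starting point
  premium_messaging_partitions  = 1
  public_network_access_enabled = false
}

resource "azurerm_servicebus_queue" "hl7_inbound" {
  name         = "hl7-inbound" # placeholder queue name
  namespace_id = azurerm_servicebus_namespace.ingest.id

  requires_session                     = true # keeps per-source ordering via sessions
  max_delivery_count                   = 5    # after this many failed deliveries, dead-letter
  dead_lettering_on_message_expiration = true
}
```

With `max_delivery_count` set, a message the converter repeatedly fails on moves to the dead-letter queue instead of being dropped, which is the behavior the conversion failure mode below relies on.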
Failure modes, and what each one looks like. Name them before they page you.
- A missing private DNS zone link: the endpoint deploys clean but resolves to a firewalled public IP and every pipeline call hangs until timeout. Mitigation: assert all zone links in Terraform and in a post-deploy smoke test.
- CMK key rotation or access loss: if the lake’s identity loses `Key Vault Crypto Service Encryption User` or the key is disabled, the store goes inaccessible and the platform stops. Mitigation: soft-delete + purge protection on Key Vault, careful rotation with versioned keys, and an alert on key-access failures.
- HL7 conversion failure: a malformed v2 message or an unmapped segment fails `$convert-data`, and silently dropping it loses a clinical record. Mitigation: dead-letter the message on Service Bus, alert, and never acknowledge upstream until conversion succeeds.
- De-identification gap: a new source introduces a free-text field carrying a name the Safe Harbor config did not anticipate, leaking an identifier into “de-identified” Gold. Mitigation: Purview PHI classifiers scan Gold and alert on any identifier detected where none should be; treat de-id config as reviewed code.
- Regional outage: see DR below.
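The CMK mitigations above (soft-delete, purge protection, and the encryption role for the lake's identity) can be sketched in Terraform as follows. The vault name, key name, and variables are hypothetical; `azurerm_key_vault_key.cmk` and `azurerm_user_assigned_identity.lake` are the names the lake snippet earlier already references. Storage CMK requires soft-delete and purge protection on the vault, so these are prerequisites as well as mitigations.

```hcl
# Hypothetical inputs for illustration; these names are assumptions.
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "tenant_id" { type = string }

resource "azurerm_key_vault" "cmk" {
  name                       = "kv-health-cmk" # placeholder; must be globally unique
  location                   = var.location
  resource_group_name        = var.resource_group_name
  tenant_id                  = var.tenant_id
  sku_name                   = "premium" # allows HSM-backed keys
  soft_delete_retention_days = 90
  purge_protection_enabled   = true # a deleted key cannot be purged early
  enable_rbac_authorization  = true # argument name may differ in newer provider versions
}

resource "azurerm_key_vault_key" "cmk" {
  name         = "lake-cmk" # placeholder key name
  key_vault_id = azurerm_key_vault.cmk.id
  key_type     = "RSA-HSM"
  key_size     = 2048
  key_opts     = ["wrapKey", "unwrapKey"]
}

# The role the lake's identity must keep, or the store becomes inaccessible.
resource "azurerm_role_assignment" "lake_cmk" {
  scope                = azurerm_key_vault.cmk.id
  role_definition_name = "Key Vault Crypto Service Encryption User"
  principal_id         = azurerm_user_assigned_identity.lake.principal_id
}
```

Keeping the role assignment in the same module as the key makes an accidental removal show up in a plan diff rather than as an outage.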
Reliability & DR (RTO/RPO). Decide the numbers per tier. The lake is the durable source of truth: ZRS lets Bronze survive a zone failure, GZRS extends that to a regional event, and Bronze immutability means you can always rebuild Silver and Gold by re-running pipelines. The FHIR service is regional; for DR, replicate via scheduled $export to a paired region’s lake and stand the service back up from FHIR-formatted exports, or run a warm secondary. A pragmatic target for this platform: RTO 4 hours, RPO 1 hour for the analytics platform (it is not life-critical, since the EHRs remain the clinical systems of record), with the FHIR API restored faster where the payer contract has its own SLA. Akamai health checks drive edge failover for the telehealth and partner endpoints.
Observability. Instrument the pipeline end to end in Dynatrace or Datadog: a trace per ingestion covering receive → convert → persist → export → promote, with conversion success rate, data freshness / latency-to-Gold, dead-letter depth, and audit-export health as first-class SLOs. The business-facing metrics that matter are freshness (how stale is the longitudinal record), conversion error rate (are we losing clinical data), and de-identification coverage. New data sources and new analyst access pass through a ServiceNow change/access workflow, giving compliance a documented gate and the data team a queue.
Governance. Microsoft Purview is the spine: it scans every tier, classifies PHI with healthcare-aware classifiers, maintains lineage from source feed to Power BI tile, and publishes a searchable catalog so an analyst finds the right Gold table instead of re-deriving it from raw. Apply Azure Policy to require CMK, deny public network access, and require diagnostic settings on every relevant resource, with Wiz as the independent check the controls are real. Pin FHIR converter templates and de-identification configs in version control, reviewable and revertable. And keep the BAA scope tight — only put PHI into Azure services Microsoft covers under the BAA, and let Purview and Policy enforce that boundary so a well-meaning engineer cannot route PHI into an uncovered service.
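One of the Azure Policy guardrails named above, denying public network access, might be expressed for storage accounts like this. It is a sketch of a custom definition under assumptions: the definition and assignment names and the `platform_resource_group_id` variable are hypothetical, and a built-in policy may already cover the same rule in a given tenant.

```hcl
# Hypothetical input for illustration; this name is an assumption.
variable "platform_resource_group_id" { type = string }

resource "azurerm_policy_definition" "deny_public_storage" {
  name         = "deny-storage-public-network-access"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Deny storage accounts with public network access"

  # Deny any storage account whose publicNetworkAccess is not "Disabled".
  policy_rule = jsonencode({
    "if" = {
      allOf = [
        { field = "type", equals = "Microsoft.Storage/storageAccounts" },
        { field = "Microsoft.Storage/storageAccounts/publicNetworkAccess", notEquals = "Disabled" },
      ]
    }
    "then" = { effect = "deny" }
  })
}

resource "azurerm_resource_group_policy_assignment" "deny_public_storage" {
  name                 = "deny-public-storage"
  resource_group_id    = var.platform_resource_group_id
  policy_definition_id = azurerm_policy_definition.deny_public_storage.id
}
```

Equivalent definitions for CMK and for diagnostic settings would follow the same shape, each with its own resource type and field alias.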
Explicit tradeoffs
Accept these or do not build it. The medallion-plus-FHIR design has more moving parts than a single database: an interface-engine tier to operate, a conversion step that can fail, three lake tiers to keep promoting, and a de-identification pipeline whose config is now safety-critical code. The private-networking and CMK posture that makes the privacy officer sign costs you setup complexity — several private DNS zones, an HSM-backed key whose access you must never lose, no public debugging shortcuts — and the price of forgetting a piece is a silent hang or a store that will not open, not a clear error. The Okta-to-Entra federation adds a hop the single-IdP shops will not need. And the de-identified-by-default analytics surface means the small set of projects that genuinely need a limited data set take a deliberate, approved detour — which is the point, but it is friction.
The alternatives, and when they win. If your data is small, already FHIR-native, and analytics is light, you can skip the lake and run analytics straight off the FHIR service’s $export into Synapse serverless — simpler, fewer tiers. If you are not doing cross-system interoperability and never will, the FHIR service is overhead and a plain lake-plus-Synapse stack is enough. If your organization is large enough to want a turnkey clinical-research substrate, Microsoft Fabric’s healthcare data solutions package much of this medallion-and-FHIR pattern with less assembly — graduate to or from this hand-built platform as your control and customization needs dictate. And if you are a tiny clinic, none of this is for you; a BAA-covered SaaS analytics product is the right answer. This architecture earns its complexity precisely when you have many source systems, a regulated analytics mandate, and an auditor who will read the logs.
The shape of the win
For the hospital network, the payoff is not “a data lake.” It is that the value-based-care contract gets a clean, current, longitudinal patient record over a FHIR API the payer already speaks; that a data scientist builds the readmission model on a de-identified Gold dataset that was never PHI in their hands, so the project is not a privacy review; and that when the auditor asks “show me who accessed this patient’s record and prove this dashboard’s data came from a legitimate source,” the answer is a Purview lineage graph and an immutable FHIR audit log, not a shrug. The 60,000-record incident becomes structurally hard to repeat, because the place analysts work has no identifiers in it and the place identifiers live is private, encrypted with the network’s own keys, and logged on every touch. Everything upstream — the Private Endpoints, the CMK in a Managed HSM, the Okta-to-Entra federation, the Vault-held re-id keys, the Wiz posture scanning, the Safe Harbor de-identification, the Purview lineage — exists to make a HIPAA auditor, a privacy officer, and a CFO each say yes. Start narrower if you must, but this is where a regulated, multi-source healthcare data platform has to land.