Confidential Computing for Sensitive Analytics on Azure

A national pharmacy retailer and a large health insurer want to answer one commercially explosive question together: which of the insurer’s members are abandoning a newly prescribed chronic-disease medication in the first 90 days, so the two can fund a targeted adherence programme that keeps those patients on therapy. The retailer holds dispensing and refill records. The insurer holds claims, diagnoses, and member contact details. Joined, the two datasets are gold — and joining them the obvious way is illegal. This is PHI under HIPAA on the insurer side and prescription records under state pharmacy law on the retailer’s, and neither legal team will let a single raw record of the other party’s data land in their counterpart’s environment, even briefly, even encrypted at rest, because “at rest” is not the exposure they fear — they fear a privileged engineer, a compromised host, or a subpoena reaching into memory while the join runs. Each side has a standing rule: we do not hand the other party our members’ data, full stop. And yet the analysis needs both datasets in plaintext, in the same process, at the same instant, to compute the overlap.

This is the canonical clean room problem, and it is exactly what Azure confidential computing exists to solve. The trick is to build a neutral compute environment that neither party operates, that both parties can cryptographically prove is running exactly the code they agreed to, and that sits inside a hardware-encrypted boundary that even Microsoft, the cloud operator, cannot read into. Inside that boundary the two datasets are decrypted and joined, the cohort is computed, and only an aggregate result (counts, plus a de-identified target list subject to an agreed minimum cohort size) ever comes back out. The raw rows are decrypted only inside silicon that the host OS, the hypervisor, and any human with root cannot inspect. This article is the reference architecture for building that clean room properly on Azure.

Why the obvious approaches fail

Every team that hits this problem proposes the same three shortcuts first, and naming why each fails is the fastest way to get to the real design.

“Just encrypt the data and run the join.” Encryption at rest and in transit is table stakes and solves nothing here. To join two datasets you must decrypt them into memory, and the moment plaintext sits in RAM on an ordinary VM it is readable by the host hypervisor, by Microsoft’s operators in principle, by a memory-scraping attacker who escapes the guest, and by any administrator with a debugger. The legal objection is specifically about plaintext-in-use, which classical encryption never addresses.

“One party hands the other a hashed/tokenised file.” Hashing identifiers for a private set intersection helps at the margins but breaks on real healthcare data: names and dates of birth are low-entropy and trivially brute-forced from a hash, the two sides normalise fields differently, and the moment you need any attribute beyond the join key (the dispensing date, the diagnosis) you are back to sharing real data. It also produces a brittle, single-purpose pipeline that cannot answer the next question.
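To make the low-entropy point concrete, here is a minimal sketch of how little work it takes to reverse an unsalted hash of a date of birth. The hashing scheme, the sample date, and the function names are illustrative assumptions, not something either party is known to use.

```python
import hashlib
from datetime import date, timedelta
from typing import Optional

def hash_identifier(value: str) -> str:
    """Unsalted SHA-256 of an identifier, as a naive tokenisation scheme might do."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def brute_force_dob(target_hash: str, start: date, end: date) -> Optional[date]:
    """Try every calendar date in [start, end] until one hashes to target_hash."""
    day = start
    while day <= end:
        if hash_identifier(day.isoformat()) == target_hash:
            return day
        day += timedelta(days=1)
    return None

# A "tokenised" date of birth as it might appear in a shared file (made-up value).
leaked = hash_identifier("1957-03-14")

# A century of birthdays is a tiny search space for a modern CPU.
window_start, window_end = date(1925, 1, 1), date(2025, 1, 1)
search_space = (window_end - window_start).days + 1
recovered = brute_force_dob(leaked, window_start, window_end)
print(f"recovered {recovered} after searching at most {search_space} candidates")
```

Adding a name or postcode to the hashed value enlarges the search space, but both come from enumerable lists, so the attack still scales.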

“Use a trusted third party.” A consultancy that ingests both datasets is the breach the lawyers are describing: you have just moved the plaintext to yet another party, one with worse controls. The whole point is to remove the need to trust any operator, including the cloud and including each other.

Confidential computing threads the needle by shrinking the trusted computing base down to the CPU package and the specific, attested code you both signed off on. Trust moves from “we trust the people and the company running the machine” to “we trust AMD’s silicon and a binary whose measurement we both verified.” That is a boundary a CISO and a regulator can reason about.

Architecture overview

[Figure: architecture diagram, Confidential Computing for Sensitive Analytics on Azure]

The design has three planes that are deliberately kept distinct: a trust/attestation plane that proves the environment before any key is released, a data plane where the encrypted datasets flow in and only aggregates flow out, and a control plane that builds, governs, and operates the whole thing without ever holding the data. The single most important property of the topology is this: the decryption keys are released to the workload only after Microsoft Azure Attestation has cryptographically verified that the code running inside the SEV-SNP boundary is exactly the agreed binary on a genuine confidential platform. No attestation, no key, no plaintext. Everything else serves that one gate.

The hardware foundation is AMD SEV-SNP (Secure Encrypted Virtualization — Secure Nested Paging). The clean-room workload runs either on Azure confidential VMs (the DCasv5/ECasv5 families) for a single batch join, or on a confidential AKS node pool built from those same SEV-SNP VMs when the analysis needs to scale out or run as a service. SEV-SNP encrypts the VM’s memory with a key held in the CPU’s secure processor that the hypervisor never sees, and — the part that matters most for clean rooms — it adds integrity protection so a malicious host cannot corrupt, replay, or remap guest memory without detection. The host can schedule the VM; it cannot read into it.

Following the data and control flow, end to end:

Governed build. The clean-room application — a small, auditable program that reads two encrypted inputs, performs the agreed join and aggregation, and emits only the approved output — is built by a GitHub Actions pipeline (or Jenkins, in shops standardised on it) into a container image. Argo CD then syncs that image to the confidential AKS cluster via GitOps, so what runs in the clean room is exactly what is in the signed, reviewed Git commit — there is no kubectl apply by a human. The pipeline records the image’s measurement; both parties review and approve the exact build that will touch their data.
Provision the boundary. Terraform stands up the confidential VM / AKS node pool with securityType = ConfidentialVM, a vTPM, and Secure Boot enabled; Ansible handles any in-guest hardening that is not baked into the image. The infrastructure is itself code-reviewed and policy-checked before it exists.
Each party encrypts and uploads. The pharmacy and the insurer each independently encrypt their dataset with a data-encryption key and place the ciphertext in storage the clean room can reach. Crucially, the key that unwraps each dataset lives in that party’s own Azure Key Vault Managed HSM — a single-tenant, FIPS 140-3 Level 3 hardware module — under a secure key release (SKR) policy that the party controls. Neither raw key ever leaves the HSM except into an attested enclave.
The workload boots and requests attestation. When the clean-room container starts inside the SEV-SNP VM, it asks the platform for a hardware attestation report — a signed quote from the AMD secure processor and the vTPM describing the firmware, the boot state, and a measurement of the workload. It sends that report to Microsoft Azure Attestation (MAA).
MAA verifies and issues a token. MAA validates the AMD signature chain (proving this is genuine SEV-SNP silicon, not an emulator), checks the report against a policy both parties agreed to, and returns a signed attestation token (JWT) asserting “this is the approved code on a real confidential platform.” This token is the linchpin of the whole system.
Secure key release. The workload presents the MAA token to each party’s Managed HSM. The HSM’s SKR policy says, in effect, release this wrapping key only to a holder of a valid MAA token whose claims match this exact measurement and this issuer. If — and only if — the token matches, the HSM releases the key, wrapped to the enclave. The workload now holds both parties’ keys, inside encrypted memory only.
Decrypt, join, aggregate — all in the encrypted boundary. The two datasets are decrypted inside SEV-SNP-protected RAM, joined on the agreed identifiers, the 90-day-abandonment cohort is computed, a minimum-cohort-size (k-anonymity) threshold is enforced so no small group is re-identifiable, and only the aggregate result is produced. The plaintext never exists outside the silicon boundary.
Emit the approved output only. The result — counts, and a de-identified target list above the agreed k — is written out, signed with a provenance record, and handed to a downstream system (for example a ServiceNow workflow that kicks off the outreach programme). The raw inputs are discarded with the VM.
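The decrypt-join-aggregate step can be sketched in a few lines once both inputs are in memory. This is an illustration under stated assumptions: the record layouts, the `diagnosis_group` field, and the definition of abandonment (no refill within 90 days of the first fill) are hypothetical stand-ins for whatever the two parties actually agree.

```python
from collections import Counter
from datetime import date

# Hypothetical decrypted inputs; real schemas would differ.
pharmacy_fills = [
    {"member_id": "m1", "fill_date": date(2024, 1, 10)},
    {"member_id": "m1", "fill_date": date(2024, 2, 9)},
    {"member_id": "m2", "fill_date": date(2024, 1, 15)},
    {"member_id": "m3", "fill_date": date(2024, 3, 1)},
    {"member_id": "m3", "fill_date": date(2024, 7, 1)},
]
insurer_members = [
    {"member_id": "m1", "diagnosis_group": "diabetes"},
    {"member_id": "m2", "diagnosis_group": "diabetes"},
    {"member_id": "m3", "diagnosis_group": "hypertension"},
    {"member_id": "m4", "diagnosis_group": "hypertension"},
]

def abandoned(fill_dates, window_days=90):
    """True if no refill falls within window_days after the first fill."""
    ordered = sorted(fill_dates)
    first = ordered[0]
    return not any(0 < (d - first).days <= window_days for d in ordered[1:])

def abandonment_counts(fills, members, window_days=90):
    """Inner-join on member_id and return only per-group counts, never rows."""
    fills_by_member = {}
    for row in fills:
        fills_by_member.setdefault(row["member_id"], []).append(row["fill_date"])
    counts = Counter()
    for member in members:
        dates = fills_by_member.get(member["member_id"])
        if dates and abandoned(dates, window_days):
            counts[member["diagnosis_group"]] += 1
    return dict(counts)

result = abandonment_counts(pharmacy_fills, insurer_members)
print(result)
```

The function returns group counts only; a production version would also apply the agreed minimum-cohort threshold before anything leaves the boundary.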

Component breakdown

Component	Service / tool	Role in the clean room	Key configuration choices
Hardware boundary	AMD SEV-SNP (DCasv5 / ECasv5)	Memory encryption + integrity for VM/in-use data	`securityType=ConfidentialVM`; vTPM + Secure Boot on
Compute	Confidential VM or confidential AKS node pool	Runs the join/aggregation workload	CVM for batch; SEV-SNP node pool for scale/service
Attestation	Microsoft Azure Attestation (MAA)	Verifies the boundary + code, issues signed JWT	Custom attestation policy mutually agreed; pinned issuer
Key custody (party A)	Azure Key Vault Managed HSM	Holds pharmacy’s wrapping key, releases only on attestation	FIPS 140-3 L3; SKR policy bound to MAA claims
Key custody (party B)	Azure Key Vault Managed HSM	Holds insurer’s wrapping key, same guarantees	Separate HSM, separate quorum admins
Identity	Microsoft Entra ID (+ Okta federation)	Workload identity to HSM/storage; human SSO to consoles	Workload Identity on AKS; conditional access on operators
Secrets (non-HSM)	HashiCorp Vault	Non-HSM secrets, lease management for build/ops creds	Entra auth; dynamic short-lived leases; never holds the join keys
CI / GitOps	GitHub Actions / Jenkins + Argo CD	Reproducible signed build → attested deploy	OIDC to Azure; image digest pinning; signed manifests
IaC / config	Terraform + Ansible	Provision CVM/AKS boundary; in-guest hardening	`vtpm_enabled`, `secure_boot_enabled`, `security_encryption_type`; policy-as-code gate pre-apply
Cloud posture	Wiz / Wiz Code	CSPM + IaC scanning; proves boundary stays configured	Alert on any node dropping out of confidential mode; scan Terraform pre-merge
Runtime security	CrowdStrike Falcon	Threat detection on the host fleet and control plane	Sensor on node pool; detections to both SOCs
Observability	Dynatrace / Datadog	Health, attestation success rate, job telemetry — metadata only	OneAgent/agent emits no record-level data; trace the attestation hop
ITSM / approvals	ServiceNow	Joint change approval per analysis; triggers outreach on result	Dual-party change gate before a job runs; auto-ticket on attestation failure
Edge / delivery	Akamai	TLS, WAF, anycast for the operator consoles and result API	Bot mitigation; private origin to the gateway

A few choices carry the design and deserve the why.

Why SEV-SNP integrity, not just memory encryption. Earlier AMD SEV generations (SEV and SEV-ES) encrypted guest memory but did not protect its integrity, so a malicious hypervisor could still replay or remap pages. For a clean room where the host is, by assumption, outside your trust boundary, encryption alone is insufficient; you need the SNP integrity guarantees that detect host tampering. That is precisely why the architecture pins SEV-SNP specifically and verifies it through attestation rather than trusting the platform label.

Why secure key release is the heart of it. Anyone can stand up a confidential VM; that proves nothing by itself. The security comes from binding the release of each party’s data key to a fresh attestation of the exact code. The Managed HSM’s SKR policy is the enforcement point: it will not export the wrapping key to anything that cannot present a valid MAA token whose measurement matches the approved binary. Change one line of the clean-room code and its measurement changes, the token no longer matches, and the HSM refuses the key — the data simply cannot be decrypted by tampered code. This is the property that lets each legal team say “our key never leaves our HSM except into code we approved.”
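A toy model shows why a one-line code change is enough to lose the key. This is a sketch, not the Azure mechanism: a shared HMAC key stands in for MAA's token signing, the measurement is just a SHA-256 of the workload bytes, and `ToyHsm` is a hypothetical class.

```python
import hashlib
import hmac
import json

# Assumption: an HMAC key stands in for the attestation service's signing key.
# Real MAA tokens are JWTs signed with an asymmetric key.
TOY_MAA_KEY = b"toy-maa-signing-key"

def measure(binary: bytes) -> str:
    """Measurement of the workload: here simply the SHA-256 of its bytes."""
    return hashlib.sha256(binary).hexdigest()

def _sign(claims: dict) -> str:
    payload = json.dumps(claims, sort_keys=True).encode("utf-8")
    return hmac.new(TOY_MAA_KEY, payload, hashlib.sha256).hexdigest()

def issue_token(binary: bytes) -> dict:
    """Toy attestation service: measures the workload and signs the claim."""
    claims = {"iss": "toy-maa", "measurement": measure(binary)}
    return {"claims": claims, "sig": _sign(claims)}

class ToyHsm:
    """Releases the wrapping key only for a validly signed, matching measurement."""

    def __init__(self, wrapping_key: bytes, approved_measurement: str):
        self._key = wrapping_key
        self._approved = approved_measurement

    def release(self, token: dict) -> bytes:
        if not hmac.compare_digest(_sign(token["claims"]), token["sig"]):
            raise PermissionError("token signature invalid")
        if token["claims"]["measurement"] != self._approved:
            raise PermissionError("measurement does not match the approved binary")
        return self._key

approved_binary = b"join(); aggregate(); emit_counts()"
tampered_binary = approved_binary + b"; dump_rows()"
hsm = ToyHsm(b"pharmacy-wrapping-key", measure(approved_binary))

released = hsm.release(issue_token(approved_binary))
try:
    hsm.release(issue_token(tampered_binary))
    refused = False
except PermissionError:
    refused = True
print("released to approved code:", released is not None, "| tampered refused:", refused)
```

The tampered build attests honestly and still gets nothing, because the HSM compares the measurement against the value both parties approved rather than trusting the caller.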

Why two separate Managed HSMs. Each party keeps custody of its own key in its own single-tenant HSM with its own quorum of admins. There is no shared key and no shared HSM. Either party can revoke at any time by changing its SKR policy, unilaterally cutting off any further key release for that party’s data. A job that already holds a released key keeps it only until its boundary is torn down, which is one more reason to run each analysis on an ephemeral VM. Mutual, independent revocation is what makes the arrangement politically and legally acceptable.

Implementation guidance

Provision the boundary as code, and make confidentiality a hard flag. Terraform is the source of truth; the confidential properties are non-negotiable settings, not defaults you hope are on. A minimal confidential-VM shape communicates the intent:

resource "azurerm_linux_virtual_machine" "cleanroom" {
  name                = "cvm-cleanroom-prod"
  size                = "Standard_DC4as_v5"   # AMD SEV-SNP family
  # ...network, image...

  vtpm_enabled        = true   # must be true when security_encryption_type is set
  secure_boot_enabled = true

  os_disk {
    caching                  = "ReadWrite"
    storage_account_type     = "Premium_LRS"
    # Confidential OS disk: encrypts the disk together with the VM guest state
    security_encryption_type = "DiskWithVMGuestState"
  }
}

For the scale-out path, the AKS confidential node pool carries the equivalent flags so every pod that touches data lands on SEV-SNP silicon:

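resource "azurerm_kubernetes_cluster_node_pool" "conf" {
  name                  = "conf"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cleanroom.id   # cluster resource name is assumed
  vm_size               = "Standard_DC4as_v5"   # SEV-SNP size: pods here run on confidential VM nodes
  # Label that data-touching pods select on
  node_labels = { "kloudvin.io/confidential" = "true" }
  # Taint that repels every pod lacking a matching toleration
  node_taints = ["confidential=true:NoSchedule"]
}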

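The taint and the label do different jobs. The NoSchedule taint keeps pods that lack a matching toleration off the confidential nodes; it does not pin anything to them. What stops the join workload from landing on a non-confidential node, where its memory would be readable, is a nodeSelector (or required node affinity) on the kloudvin.io/confidential label in the pod spec, paired with a toleration for the taint.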
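A deploy-time check for this confinement could look like the sketch below. It assumes the pod spec is available as a plain dictionary using the standard Kubernetes field names `nodeSelector` and `tolerations`; the function names are hypothetical, and the check requires both the label selector and a toleration for the pool's taint.

```python
LABEL_KEY, LABEL_VALUE = "kloudvin.io/confidential", "true"
TAINT = {"key": "confidential", "value": "true", "effect": "NoSchedule"}

def tolerates(tolerations, taint):
    """True if any toleration matches the taint's key, effect, and value."""
    for tol in tolerations:
        if tol.get("key") != taint["key"]:
            continue
        # An empty or missing effect on a toleration matches every effect.
        if tol.get("effect") not in (None, "", taint["effect"]):
            continue
        if tol.get("operator", "Equal") == "Exists" or tol.get("value") == taint["value"]:
            return True
    return False

def confined_to_confidential_pool(pod_spec):
    """The pod must both select the confidential label and tolerate the taint."""
    selector = pod_spec.get("nodeSelector", {})
    pinned = selector.get(LABEL_KEY) == LABEL_VALUE
    return pinned and tolerates(pod_spec.get("tolerations", []), TAINT)

toleration = {"key": "confidential", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
good_pod = {"nodeSelector": {LABEL_KEY: LABEL_VALUE}, "tolerations": [toleration]}
toleration_only = {"tolerations": [toleration]}   # may still land on ordinary nodes
selector_only = {"nodeSelector": {LABEL_KEY: LABEL_VALUE}}   # never schedulable on the pool

for name, spec in [("good", good_pod), ("toleration only", toleration_only), ("selector only", selector_only)]:
    print(name, "->", confined_to_confidential_pool(spec))
```

The `toleration_only` case is the dangerous one: the pod is allowed onto the confidential pool but nothing prevents the scheduler from placing it elsewhere.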

Write the secure key release policy carefully — it is your real perimeter. The SKR policy on each Managed HSM key is what the whole guarantee rests on. It binds release to the MAA issuer and to specific claims from the attestation token. Conceptually:

release if:
  token.iss            == "https://cleanroom-maa.<region>.attest.azure.net"
  token["x-ms-isolation-tee"]["x-ms-attestation-type"] == "sevsnpvm"
  token["x-ms-isolation-tee"]["x-ms-compliance-status"] == "azure-compliant-cvm"
  token.measurement    == <approved-image-digest>   # the binary both parties signed

Pin the issuer to your MAA instance, require the SEV-SNP isolation type, and bind to the approved measurement so a different binary cannot pull the key. Treat the MAA policy itself as a jointly reviewed artifact in Git — both parties approve the attestation policy and the SKR policy together, and changes route through ServiceNow dual approval.
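The release rule can be prototyped as a small claim matcher, which is useful for testing policy changes against sample tokens before they reach an HSM. The policy shape here (an `authority` plus a list of `claim`/`equals` pairs) is a simplification loosely modelled on key release policies, the top-level `measurement` claim mirrors the conceptual rule above rather than a real token field, and signature verification is deliberately out of scope.

```python
def get_claim(claims: dict, dotted_path: str):
    """Walk nested claims using a dotted path; return None if any part is missing."""
    node = claims
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def release_allowed(policy: dict, claims: dict) -> bool:
    """Release only if the issuer is pinned and every required claim matches exactly."""
    if claims.get("iss") != policy["authority"]:
        return False
    return all(get_claim(claims, rule["claim"]) == rule["equals"] for rule in policy["allOf"])

policy = {
    "authority": "https://cleanroom-maa.<region>.attest.azure.net",
    "allOf": [
        {"claim": "x-ms-isolation-tee.x-ms-attestation-type", "equals": "sevsnpvm"},
        {"claim": "x-ms-isolation-tee.x-ms-compliance-status", "equals": "azure-compliant-cvm"},
        {"claim": "measurement", "equals": "<approved-image-digest>"},  # conceptual claim
    ],
}

good_claims = {
    "iss": "https://cleanroom-maa.<region>.attest.azure.net",
    "x-ms-isolation-tee": {
        "x-ms-attestation-type": "sevsnpvm",
        "x-ms-compliance-status": "azure-compliant-cvm",
    },
    "measurement": "<approved-image-digest>",
}
wrong_binary = dict(good_claims, measurement="<some-other-digest>")
wrong_issuer = dict(good_claims, iss="https://attacker.example/attest")

print(release_allowed(policy, good_claims), release_allowed(policy, wrong_binary), release_allowed(policy, wrong_issuer))
```

Every rule is an exact-match conjunction, so a missing claim fails the same way as a wrong one.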

Kill standing keys; attest everything. No API keys to storage or HSM — the workload authenticates with Entra Workload Identity on AKS, and human operators reach the consoles via Okta federated to Entra ID with conditional access, so the people who operate the platform are strongly authenticated yet, by design, can never see plaintext. The few non-HSM secrets the build and ops tooling need (registry tokens, third-party API creds) live in HashiCorp Vault with short dynamic leases — and Vault deliberately never holds the dataset-wrapping keys, which belong only in the Managed HSMs.

Make the running code provably the reviewed code. The clean-room image is built by GitHub Actions (OIDC to Azure, no stored secrets) or Jenkins, its digest is recorded, and Argo CD deploys by digest via GitOps so there is no out-of-band path to run unreviewed code. Wiz Code scans the Terraform and manifests before merge; Wiz continuously verifies in production that every node in the pool is still in confidential mode and alerts the instant one is not.
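Deploy-by-digest is easy to enforce mechanically. The sketch below flags any container image in a Deployment-shaped manifest that is not pinned to a `sha256` digest; the registry name and digest value are placeholders, and the manifest is assumed to be already parsed into a dictionary.

```python
import re

# repository@sha256:<64 lowercase hex chars>; a bare tag such as ":latest" does not match
DIGEST_REF = re.compile(r"[^\s@]+@sha256:[0-9a-f]{64}")

def is_pinned(image_ref: str) -> bool:
    """True only for image references that name an immutable digest."""
    return DIGEST_REF.fullmatch(image_ref) is not None

def unpinned_images(manifest: dict) -> list:
    """Return every container image in a Deployment manifest that uses a floating tag."""
    containers = manifest["spec"]["template"]["spec"]["containers"]
    return [c["image"] for c in containers if not is_pinned(c["image"])]

placeholder_digest = "a" * 64  # stand-in for the recorded build digest
manifest = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "join", "image": f"cleanroom.example.io/join@sha256:{placeholder_digest}"},
        {"name": "sidecar", "image": "cleanroom.example.io/sidecar:latest"},
    ]}}},
}

offenders = unpinned_images(manifest)
print("unpinned images:", offenders)
```

A non-empty result would fail the pipeline, since a floating tag can be repointed at unreviewed code after approval.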

Enterprise considerations

Security and the trust boundary. State the trusted computing base explicitly, because that clarity is the entire value proposition: you trust the AMD CPU package, the attested firmware/boot chain, and the specific measured binary — and nothing else, not the hypervisor, not the host OS, not Microsoft’s operators, not the other party’s administrators, not your own cluster admins. Layer defence in depth on top: CrowdStrike Falcon sensors on the host fleet and control plane feed both organisations’ SOCs; Wiz provides continuous CSPM and catches configuration drift such as a node silently falling out of confidential mode or a public-exposure change; a failed or unexpected attestation auto-raises a ServiceNow incident so security has a ticket, not just a log line. Akamai fronts the operator consoles and the result-retrieval API with TLS, WAF, and bot mitigation. And critically, observability is constrained: Dynatrace or Datadog instrument health, throughput, and the all-important attestation success rate, but the agents emit metadata only — never a record-level field — so the telemetry pipeline can never become the leak the architecture was built to prevent.

Cost. Confidential SKUs carry a premium and the design adds moving parts, so engineer for it.

Lever	Mechanism	Typical effect
Right-size the boundary	Batch join on an ephemeral CVM, not a standing cluster	Pay only for the hours a job runs
CVM vs confidential AKS	Use a single CVM for one-off joins; reserve the node pool for recurring/service use	Avoids a 24×7 cluster premium
Managed HSM is per-pool	One HSM pool covers many keys; share within a party	Amortise the fixed HSM cost across analyses
Spot for non-data build	Run CI/build on cheap nodes; only the join needs confidential silicon	Keeps the expensive SKU scoped to the actual workload
Auto-teardown	Terraform-destroy the boundary after the result is emitted	No idle confidential capacity

The honest framing: confidential VMs cost noticeably more than equivalent general-purpose VMs, and Managed HSM has a real monthly floor. The premium buys a legal capability — a collaboration that otherwise simply cannot happen — so the comparison is not “cheaper than a normal VM” but “cheaper than the deal not existing.”

Scalability. A single SEV-SNP VM has a fixed memory ceiling, which bounds how large a join you can hold in encrypted RAM. Two paths scale past it: move to the confidential AKS node pool and partition the join (shard by a hash of the identifier so each pod joins a slice inside its own attested boundary), or pre-aggregate so the in-enclave step works on summaries rather than raw rows. Attestation adds a few seconds to cold start — fine for batch, and amortised for a long-running service. The realistic ceiling is confidential-SKU regional capacity, so a large recurring programme plans region and quota early.
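The sharding idea depends on both datasets mapping the same identifier to the same shard. A minimal sketch, assuming a shared `member_id` join key and a fixed shard count agreed in advance (both hypothetical names and values):

```python
import hashlib

def shard_for(member_id: str, num_shards: int) -> int:
    """Deterministic shard index from a SHA-256 of the join key."""
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def partition(rows, num_shards):
    """Split rows into num_shards lists keyed by the hash of member_id."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row["member_id"], num_shards)].append(row)
    return shards

NUM_SHARDS = 4
pharmacy_rows = [{"member_id": f"m{i}", "fills": i % 3} for i in range(20)]
insurer_rows = [{"member_id": f"m{i}", "plan": "gold"} for i in range(0, 20, 2)]

pharmacy_shards = partition(pharmacy_rows, NUM_SHARDS)
insurer_shards = partition(insurer_rows, NUM_SHARDS)

# Each pod joins only its own slice; matching ids are guaranteed to be co-located.
for index in range(NUM_SHARDS):
    left = {r["member_id"] for r in pharmacy_shards[index]}
    right = {r["member_id"] for r in insurer_shards[index]}
    print(f"shard {index}: {len(left & right)} matching members")
```

Python's built-in `hash()` is avoided on purpose: it is randomised per process for strings, so two pods would disagree on shard placement.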

Failure modes, named before they page you.

Attestation fails after a platform update. Azure host firmware updates can change the attestation report; an over-tight MAA policy then rejects a legitimately healthy platform and the HSM withholds the key, so the job cannot start. Mitigation: track the attestation success-rate metric, version the MAA policy, and test policy changes against current platform reports before rolling them.
SKR policy / measurement drift. A rebuilt image changes its measurement; if the SKR policy was not updated in lockstep, key release fails and the join silently cannot decrypt. Mitigation: treat image digest and SKR policy as a single versioned, jointly-approved unit promoted together through the pipeline.
Accidental scheduling onto a non-confidential node. Without the taint/label discipline, a pod handling plaintext could land on ordinary silicon. Mitigation: a nodeSelector (or required node affinity) on the confidential label plus a toleration for the NoSchedule taint, and a Wiz check that fails the deploy if any data-touching workload is not confined to the confidential pool.
Result leakage through small cohorts. A join that returns a cohort of three is effectively re-identifying. Mitigation: enforce the minimum-cohort-size threshold inside the enclave and suppress or generalise anything below it before output.
HSM unavailability. If a party’s Managed HSM is unreachable, no key is released and the job halts — by design. Mitigation: this is fail-closed and correct; plan availability and a documented re-run, never a bypass.
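The small-cohort mitigation above is the simplest of these to express in code. This sketch suppresses any group below a threshold `k` and rolls the suppressed rows into a single bucket, which is published only if it clears `k` as well; the group names, counts, and bucket label are made up.

```python
def suppress_small_cohorts(counts: dict, k: int) -> dict:
    """Release only groups with at least k members; pool the rest if the pool reaches k."""
    released = {}
    suppressed_total = 0
    for group, size in counts.items():
        if size >= k:
            released[group] = size
        else:
            suppressed_total += size
    # The pooled bucket is itself a cohort, so it must also meet the threshold.
    if suppressed_total >= k:
        released["other (suppressed groups)"] = suppressed_total
    return released

raw_counts = {"diabetes": 412, "hypertension": 3, "asthma": 9, "copd": 57}
safe_counts = suppress_small_cohorts(raw_counts, k=10)
print(safe_counts)
```

Suppression alone does not stop differencing: if a grand total is published alongside, a single suppressed group can be recovered by subtraction, so totals need the same treatment.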

Reliability and DR. Decide the numbers per plane, and accept that the clean room is fail-closed: if attestation or key release cannot complete, the analysis does not run, which is the safe direction. The encrypted inputs live in geo-redundant storage and are the durable source of truth; a job is simply re-run from them, so the conventional RTO/RPO conversation is replaced by “can we re-execute the join in the paired region.” Managed HSM supports backup/restore and multi-availability-zone resilience; replicate the HSM and its SKR policy to a paired region, and keep the MAA policy and Terraform in Git so the entire boundary is reproducible. A pragmatic target: re-run a failed job within hours in-region, and reconstruct the whole clean room in a paired region from code and geo-redundant ciphertext within a day. There is no live state to lose, only the (re-runnable) computation.

Governance and auditability. This is where confidential computing shines for regulators. Every run produces an attestation evidence record — the MAA token, the image measurement, the SKR decision — that proves which exact code touched the data and under what policy, retained as the audit trail. Pin image versions by digest, never a floating tag; keep the clean-room code, the MAA attestation policy, and the SKR policies all in version control under joint review; and route every new analysis through a ServiceNow dual-party change approval so both legal teams sign each specific question before a single record is decrypted. Wiz independently verifies the controls are actually holding in production.
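One way to keep that evidence trail tamper-evident is to chain each run's record to the previous one by hash. The article does not prescribe this; the hash chain, the field names, and the sample values below are all assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64

def record_digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def append_evidence(log: list, *, token_sha256: str, image_digest: str,
                    skr_decision: str, policy_version: str, run_at: str) -> dict:
    """Append one run's evidence, linked to the digest of the previous entry."""
    body = {
        "token_sha256": token_sha256,
        "image_digest": image_digest,
        "skr_decision": skr_decision,
        "policy_version": policy_version,
        "run_at": run_at,
        "prev": log[-1]["digest"] if log else GENESIS,
    }
    entry = dict(body, digest=record_digest(body))
    log.append(entry)
    return entry

def verify_chain(log: list) -> bool:
    """Recompute every digest and check each entry points at its predecessor."""
    prev = GENESIS
    for entry in log:
        body = {key: value for key, value in entry.items() if key != "digest"}
        if body["prev"] != prev or record_digest(body) != entry["digest"]:
            return False
        prev = entry["digest"]
    return True

audit_log = []
append_evidence(audit_log, token_sha256="t1", image_digest="sha256:img1",
                skr_decision="released", policy_version="v3", run_at="2024-05-01T02:00:00Z")
append_evidence(audit_log, token_sha256="t2", image_digest="sha256:img1",
                skr_decision="refused", policy_version="v3", run_at="2024-05-02T02:00:00Z")
print("chain intact:", verify_chain(audit_log))
```

Editing or deleting an earlier entry changes its digest and breaks every later link, so either party can verify the full history independently.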

Comparison: how the privacy techniques line up

Confidential computing is one of several privacy-enhancing technologies, and the honest engineering choice depends on the workload. The clean room above can even compose these — for instance, running a secure-multiparty protocol inside an attested enclave for defence in depth.

Technique	What it protects	Strength	Cost / limitation	Best fit
Confidential computing (SEV-SNP + attestation)	Plaintext in use, in memory	Near-native speed; runs ordinary code; provable code identity	Trust roots in the CPU vendor; SKU premium	A real join/analysis on full datasets between distrustful parties — this case
Homomorphic encryption	Data stays encrypted even during compute	No plaintext ever, even in RAM	Orders-of-magnitude slower; limited operations	Narrow, fixed computations on highly sensitive inputs
Secure multi-party computation	No single party sees others’ inputs	Strong cryptographic guarantee, no trusted hardware	Heavy network rounds; complex to engineer at scale	Specific agreed functions (e.g., a private set intersection)
Differential privacy	Re-identification from outputs	Mathematical privacy bound on results	Adds noise; degrades precision	Publishing aggregate statistics safely (complements the above)

The pattern in this article picks confidential computing for the compute and adds k-anonymity-style small-cohort suppression (the minimum-cohort threshold) for the output: hardware protects the join while the k-anonymity gate protects the answer. Suppression is not differential privacy, which would instead add calibrated noise to the released numbers.
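For contrast with threshold suppression, the differential-privacy row of the table corresponds to something like the Laplace mechanism sketched below. The epsilon value and the sample count are arbitrary illustrations, and this is a textbook construction rather than anything the clean room described here runs.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Noisy count; a counting query has sensitivity 1, so the scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # seeded only to make the illustration repeatable
true_cohort = 100
samples = [dp_count(true_cohort, epsilon=1.0, rng=rng) for _ in range(5000)]
average = sum(samples) / len(samples)
print(f"one noisy release: {samples[0]:.2f}, average over many: {average:.2f}")
```

Each individual release is perturbed, while the average stays close to the true value; that precision loss is the cost the table refers to, and a smaller epsilon makes it larger.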

Explicit tradeoffs

Accept these or do not build it. You are moving your trust to AMD’s silicon and firmware and to Microsoft’s attestation service: a different, smaller, but real trust assumption than “trust the operator,” and one your security team must consciously endorse. The confidential SKUs cost more and the moving parts multiply: attestation policies that can reject healthy platforms after an update, SKR policies that must track image measurements exactly, GitOps discipline so no unreviewed code can run, and node selectors plus taints so no plaintext touches ordinary nodes. The system is deliberately fail-closed, which means a misconfigured policy or an unreachable HSM stops the analysis cold rather than degrading; that is correct, but operationally demanding. Debugging is genuinely harder: you cannot attach a memory debugger to a workload whose entire value is that no one can read its memory, so you lean on metadata-only telemetry and the attestation-success metric instead of a heap dump.

When something simpler wins. If only one party’s data is involved and the threat you care about is at-rest or in-transit, ordinary Key Vault encryption and Private Endpoints are enough — you do not need confidential computing. If the question is a single, fixed function like “how many identifiers overlap,” a purpose-built private set intersection via secure multi-party computation may be lighter than standing up enclaves. If you only need to publish aggregate statistics and never join raw rows, differential privacy on each party’s own outputs sidesteps the joint-compute problem entirely. And if there is no genuine mutual distrust — same legal entity, same trust domain — a normal governed analytics platform is far cheaper. Confidential computing earns its complexity precisely when two parties who will not share raw data must nonetheless compute over it together, with a result both can trust and a regulator can audit.

The shape of the win

For the pharmacy and the insurer, the payoff is not “a secure VM.” It is that two organisations who were legally forbidden from sharing a single member record can now, together, identify exactly the patients abandoning their therapy in the first 90 days — fund an intervention that keeps those people on treatment — and do it with a cryptographic audit trail proving that neither party ever saw the other’s data in the clear and that the only code that touched it was the binary both legal teams signed. The attestation token is the document that makes the deal legal; the Managed HSM’s refusal to release a key to unapproved code is the control that makes it safe; and the minimum-cohort gate is what makes the output safe to act on. Everything upstream — the SEV-SNP boundary, the MAA policy, the dual HSMs, the GitOps build, the Wiz posture checks, the metadata-only Dynatrace telemetry — exists so a CISO, a compliance officer, and two competing legal teams each say yes to a collaboration that, done any other way, would never have left the meeting room.

Written by Vinod
