Terraform Module: Azure Microsoft Purview — a governed, private-by-default data catalog account

Quick take — A reusable hashicorp/azurerm ~> 4.0 Terraform module for azurerm_purview_account: managed identity, a named managed resource group, public-access lockdown, optional private endpoints for the portal and Atlas Kafka ingestion, and the endpoints/identity wired out for downstream RBAC. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
  # azurerm 4.x needs a subscription: set subscription_id here or export ARM_SUBSCRIPTION_ID.
}

module "purview" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-purview?ref=v1.0.0"

  name                = "..."  # Purview account name (3-63 chars, lowercase alphanumeri…
  resource_group_name = "..."  # Resource group for the account and private endpoints (n…
  location            = "..."  # Azure region for the account.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
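
To see what the module hands back after the first apply, the root configuration can re-export a couple of its outputs. This is a minimal sketch: it assumes the module block is labelled purview as in the Quickstart, and the output names on the left are illustrative rather than anything the module defines.

```hcl
# Re-export selected module outputs from the root configuration.
# The output names on the left are illustrative; the attributes on the
# right come from the module's outputs.tf.
output "purview_identity_principal_id" {
  description = "Principal to grant reader roles on the sources you want scanned."
  value       = module.purview.identity_principal_id
}

output "purview_catalog_endpoint" {
  description = "Catalog API endpoint of the new account."
  value       = module.purview.catalog_endpoint
}
```

After the apply, terraform output purview_identity_principal_id prints the ID to use in role assignments.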

What this module is

Microsoft Purview is Azure’s unified data-governance service: a managed data map, catalog, and lineage engine that scans your data estate (ADLS Gen2, Azure SQL, Synapse, Power BI, even on-prem and multi-cloud sources), classifies columns against built-in and custom sensitivity rules, and stitches the results into a searchable catalog with end-to-end lineage. The thing Terraform provisions — the Purview account — is the long-lived control-plane object behind all of that. Each account is created through a single resource, azurerm_purview_account, and quietly stands up a whole supporting cast in a separate managed resource group: a locked-down Storage account and an Event Hubs namespace that back the Apache Atlas data map and stream scan/lineage events.

The raw resource has two sharp edges that make it worth wrapping. First, the identity block is mandatory — a Purview account simply will not create without a managed identity, because that identity is the principal you later grant Storage Blob Data Reader / SQL db_datareader so scans can actually read your sources. Second, the account ships with its public endpoint on, and its data map exposes an Atlas Kafka ingestion endpoint plus a portal endpoint that, in any regulated estate, you want reachable only over a private link. This module bakes in the correct shape: a system-assigned (or user-assigned) identity, a deterministically-named managed resource group, public_network_enabled = false by default, and optional private endpoints for both the account/portal and the atlas_kafka sub-resources — then it surfaces the identity principal ID and the catalog/scan/Atlas endpoints as outputs so the calling configuration can grant RBAC and bootstrap scans without screen-scraping the portal.

When to use it

You are rolling out data governance across many subscriptions or business units and want every Purview account born identical — same identity model, same private-networking posture, same managed-RG naming — instead of hand-rolled accounts that drift.
Your security baseline mandates no public data-plane access: you need public_network_enabled = false plus private endpoints on the account (portal) and atlas_kafka (ingestion) sub-resources, resolving through privatelink.purview.azure.com and privatelink.servicebus.windows.net.
You want the account’s managed identity to be the single scanning principal, surfaced as an output so the root module can grant it reader roles on ADLS Gen2 / Azure SQL / Synapse via azurerm_role_assignment.
You need the managed resource group name codified (for cost allocation, policy scoping, and lock placement) rather than letting Azure auto-generate a managed-rg-<guid> you can’t predict.
Skip the module for a throwaway evaluation account in a sandbox — a bare azurerm_purview_account with a system-assigned identity is enough there, and private endpoints just slow the demo down.
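
For the sandbox case in the last point, the bare resource could look roughly like the sketch below. The resource group, account name, and region are placeholders chosen for illustration, and the public endpoint is left at the resource's default.

```hcl
# Throwaway evaluation account: no module, no private endpoints.
# All names below are placeholders chosen for illustration.
resource "azurerm_resource_group" "sandbox" {
  name     = "rg-purview-sandbox"
  location = "westeurope"
}

resource "azurerm_purview_account" "eval" {
  name                = "pvw-eval-sandbox"
  resource_group_name = azurerm_resource_group.sandbox.name
  location            = azurerm_resource_group.sandbox.location

  # The identity block is mandatory even for a demo account.
  identity {
    type = "SystemAssigned"
  }
}
```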

Module structure

terraform-module-azure-purview/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # azurerm_purview_account + optional portal/atlas private endpoints
├── variables.tf     # typed, validated inputs
└── outputs.tf       # id, name, identity principal_id, managed resources, endpoints

# versions.tf
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

# main.tf

locals {
  # Purview always materialises a managed RG (Storage + Event Hubs that back
  # the Atlas data map). Pin its name deterministically so it can be policy-
  # scoped, locked, and shown in cost reports instead of "managed-rg-<guid>".
  managed_resource_group_name = coalesce(
    var.managed_resource_group_name,
    "${var.name}-managed-rg",
  )

  # A user-assigned identity is only legal when at least one ID is supplied.
  identity_type = length(var.user_assigned_identity_ids) > 0 ? "UserAssigned" : "SystemAssigned"

  common_tags = merge(
    {
      managed_by = "terraform"
      module     = "terraform-module-azure-purview"
    },
    var.tags,
  )
}

resource "azurerm_purview_account" "this" {
  name                = var.name
  resource_group_name = var.resource_group_name
  location            = var.location

  # The managed RG must NOT pre-exist; Purview creates and owns it.
  managed_resource_group_name = local.managed_resource_group_name

  # Private-by-default: lock the portal + Atlas ingestion to private link.
  public_network_enabled = var.public_network_enabled

  # Identity is REQUIRED. This principal is what you grant reader roles on
  # the data sources you want scanned.
  identity {
    type         = local.identity_type
    identity_ids = local.identity_type == "UserAssigned" ? var.user_assigned_identity_ids : null
  }

  tags = local.common_tags
}

# Private endpoint for the portal/account sub-resource. Resolves the Purview
# studio and catalog API privately via privatelink.purview.azure.com.
resource "azurerm_private_endpoint" "portal" {
  count = var.account_private_endpoint == null ? 0 : 1

  name                = coalesce(var.account_private_endpoint.name, "${var.name}-account-pe")
  resource_group_name = var.resource_group_name
  location            = var.location
  subnet_id           = var.account_private_endpoint.subnet_id

  private_service_connection {
    name                           = "${var.name}-account-psc"
    private_connection_resource_id = azurerm_purview_account.this.id
    subresource_names              = ["account"]
    is_manual_connection           = false
  }

  dynamic "private_dns_zone_group" {
    for_each = length(var.account_private_endpoint.private_dns_zone_ids) == 0 ? [] : [1]
    content {
      name                 = "default"
      private_dns_zone_ids = var.account_private_endpoint.private_dns_zone_ids
    }
  }

  tags = local.common_tags
}

# Private endpoint for the Atlas Kafka ingestion path. The Kafka endpoint is
# served by the Purview-managed Event Hubs namespace, so the connection
# targets that namespace (group ID "namespace"); it resolves via
# privatelink.servicebus.windows.net.
resource "azurerm_private_endpoint" "atlas_kafka" {
  count = var.atlas_kafka_private_endpoint == null ? 0 : 1

  name                = coalesce(var.atlas_kafka_private_endpoint.name, "${var.name}-atlas-pe")
  resource_group_name = var.resource_group_name
  location            = var.location
  subnet_id           = var.atlas_kafka_private_endpoint.subnet_id

  private_service_connection {
    name                           = "${var.name}-atlas-psc"
    private_connection_resource_id = azurerm_purview_account.this.managed_resources[0].event_hub_namespace_id
    subresource_names              = ["namespace"]
    is_manual_connection           = false
  }

  dynamic "private_dns_zone_group" {
    for_each = length(var.atlas_kafka_private_endpoint.private_dns_zone_ids) == 0 ? [] : [1]
    content {
      name                 = "default"
      private_dns_zone_ids = var.atlas_kafka_private_endpoint.private_dns_zone_ids
    }
  }

  tags = local.common_tags
}

# variables.tf

variable "name" {
  description = "Name of the Purview account (3-63 chars, lowercase letters/numbers/hyphens, must start with a letter and end alphanumeric). Forms part of the catalog/scan endpoint hostnames."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,61}[a-z0-9]$", var.name))
    error_message = "name must be 3-63 chars, lowercase alphanumeric or hyphens, start with a letter and end with a letter or number."
  }
}

variable "resource_group_name" {
  description = "Resource group that will hold the Purview account and any private endpoints (NOT the managed resource group)."
  type        = string
}

variable "location" {
  description = "Azure region for the account (e.g. westeurope, eastus2). Purview is region-bound; pick a region close to the data sources you scan."
  type        = string
}

variable "managed_resource_group_name" {
  description = "Name for the Purview-managed resource group (holds the backing Storage + Event Hubs). Must not already exist. Defaults to \"<name>-managed-rg\" when null."
  type        = string
  default     = null

  validation {
    condition     = var.managed_resource_group_name == null || can(regex("^[-\\w._()]{1,90}$", var.managed_resource_group_name))
    error_message = "managed_resource_group_name must be a valid resource group name (1-90 chars: letters, numbers, and -_.())."
  }
}

variable "public_network_enabled" {
  description = "Whether the portal and Atlas ingestion endpoints are reachable over the public network. Defaults to false (private-by-default); set true only for dev or when private endpoints are not yet in place."
  type        = bool
  default     = false
}

variable "user_assigned_identity_ids" {
  description = "Optional list of user-assigned managed identity resource IDs. When empty, a system-assigned identity is created instead. Exactly one identity model is used; Purview requires an identity."
  type        = list(string)
  default     = []

  validation {
    condition     = length(var.user_assigned_identity_ids) <= 1
    error_message = "azurerm_purview_account supports at most one user-assigned identity."
  }
}

variable "account_private_endpoint" {
  description = "Optional private endpoint for the portal/account sub-resource. private_dns_zone_ids should point at the privatelink.purview.azure.com zone."
  type = object({
    name                 = optional(string)
    subnet_id            = string
    private_dns_zone_ids = optional(list(string), [])
  })
  default = null
}

variable "atlas_kafka_private_endpoint" {
  description = "Optional private endpoint for the Atlas Kafka (Event Hubs) ingestion sub-resource. private_dns_zone_ids should point at the privatelink.servicebus.windows.net zone."
  type = object({
    name                 = optional(string)
    subnet_id            = string
    private_dns_zone_ids = optional(list(string), [])
  })
  default = null
}

variable "tags" {
  description = "Additional tags merged onto the account and private endpoints."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "id" {
  description = "Resource ID of the Purview account."
  value       = azurerm_purview_account.this.id
}

output "name" {
  description = "Name of the Purview account."
  value       = azurerm_purview_account.this.name
}

output "identity_principal_id" {
  description = "Object (principal) ID of the account's managed identity — grant THIS reader RBAC on the data sources (ADLS Gen2, SQL, Synapse) you want scanned."
  value       = azurerm_purview_account.this.identity[0].principal_id
}

output "identity_tenant_id" {
  description = "Tenant ID of the account's managed identity."
  value       = azurerm_purview_account.this.identity[0].tenant_id
}

output "catalog_endpoint" {
  description = "Catalog (data map) API endpoint of the account."
  value       = azurerm_purview_account.this.catalog_endpoint
}

output "scan_endpoint" {
  description = "Scan API endpoint of the account — used to register sources and trigger scans."
  value       = azurerm_purview_account.this.scan_endpoint
}

output "managed_resources" {
  description = "IDs of the Purview-managed backing resources: the managed resource group, Storage account, and Event Hubs namespace."
  value = {
    resource_group_id      = azurerm_purview_account.this.managed_resources[0].resource_group_id
    storage_account_id     = azurerm_purview_account.this.managed_resources[0].storage_account_id
    event_hub_namespace_id = azurerm_purview_account.this.managed_resources[0].event_hub_namespace_id
  }
}

output "atlas_kafka_endpoint_primary_connection_string" {
  description = "Primary Atlas Kafka (Event Hubs) endpoint connection string for the data map. Sensitive."
  value       = azurerm_purview_account.this.atlas_kafka_endpoint_primary_connection_string
  sensitive   = true
}

How to use it

The example below provisions a private-by-default Purview account with a system-assigned identity, a predictable managed resource group, and private endpoints for both the portal and the Atlas Kafka ingestion path. It then references the module’s identity_principal_id output to grant the scanning identity read access to a data lake — the role assignment every real Purview deployment needs before a scan can return a single asset.

module "microsoft_purview" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-purview?ref=v1.0.0"

  name                = "pvw-kloudvin-prod-weu"
  resource_group_name = azurerm_resource_group.governance.name
  location            = "westeurope"

  # Lock the portal + ingestion behind private link.
  public_network_enabled      = false
  managed_resource_group_name = "rg-purview-managed-prod-weu"

  # Portal/catalog API over privatelink.purview.azure.com.
  account_private_endpoint = {
    subnet_id            = azurerm_subnet.privatelink.id
    private_dns_zone_ids = [azurerm_private_dns_zone.purview.id]
  }

  # Atlas Kafka ingestion over privatelink.servicebus.windows.net.
  atlas_kafka_private_endpoint = {
    subnet_id            = azurerm_subnet.privatelink.id
    private_dns_zone_ids = [azurerm_private_dns_zone.servicebus.id]
  }

  tags = {
    environment = "prod"
    owner       = "data-governance"
    cost_center = "CC-7140"
  }
}

# Downstream: let the Purview scanning identity read the data lake so scans
# can crawl, classify, and build lineage. Without this, every scan returns 0
# assets with an authorization error.
resource "azurerm_role_assignment" "purview_lake_reader" {
  scope                = azurerm_storage_account.lake.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = module.microsoft_purview.identity_principal_id
}

# Read access to the data inside an Azure SQL logical server is granted
# in-database (db_datareader, via the server's AAD admin); this assignment
# only lets Purview enumerate the server.
resource "azurerm_role_assignment" "purview_sql_reader" {
  scope                = azurerm_mssql_server.analytics.id
  role_definition_name = "Reader"
  principal_id         = module.microsoft_purview.identity_principal_id
}
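
The module call above references two private DNS zones that the calling configuration has to provide. A minimal sketch of them follows; the hub resource group and virtual network (azurerm_resource_group.hub, azurerm_virtual_network.hub) and the link names are hypothetical and would normally live in your networking stack.

```hcl
# Private DNS zones referenced by the module call. The hub resource group
# and virtual network are hypothetical and must be defined elsewhere.
resource "azurerm_private_dns_zone" "purview" {
  name                = "privatelink.purview.azure.com"
  resource_group_name = azurerm_resource_group.hub.name
}

resource "azurerm_private_dns_zone" "servicebus" {
  name                = "privatelink.servicebus.windows.net"
  resource_group_name = azurerm_resource_group.hub.name
}

# Link each zone to the VNet so clients inside it resolve the private IPs.
resource "azurerm_private_dns_zone_virtual_network_link" "purview" {
  name                  = "purview-hub-link"
  resource_group_name   = azurerm_resource_group.hub.name
  private_dns_zone_name = azurerm_private_dns_zone.purview.name
  virtual_network_id    = azurerm_virtual_network.hub.id
}

resource "azurerm_private_dns_zone_virtual_network_link" "servicebus" {
  name                  = "servicebus-hub-link"
  resource_group_name   = azurerm_resource_group.hub.name
  private_dns_zone_name = azurerm_private_dns_zone.servicebus.name
  virtual_network_id    = azurerm_virtual_network.hub.id
}
```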

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state storage account/container + key per path...
  }
}

2. Module config — live/prod/purview/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-purview?ref=v1.0.0"
}

inputs = {
  name                = "..."
  resource_group_name = "..."
  location            = "..."
}

3. Deploy this module on its own, or roll out every module in the environment together (both commands run from the repository root):

(cd live/prod/purview && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)   # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
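
The root config shown in step 1 only generates the backend. For the provider to live in one place as well, the root terragrunt.hcl would also need a generate block. A minimal sketch, assuming authentication and the subscription come from the environment (for example ARM_SUBSCRIPTION_ID) and with the file name provider.tf chosen for illustration:

```hcl
# live/terragrunt.hcl (addition): write a provider.tf into every module
# directory Terragrunt runs in. The generated file name is an assumption.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "azurerm" {
  features {}
  # subscription_id is read from ARM_SUBSCRIPTION_ID when not set here.
}
EOF
}
```

Because the module itself declares no provider block, each environment picks up this generated file without any change to the module source.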

Inputs

Name	Type	Default	Required	Description
`name`	`string`	—	Yes	Purview account name (3-63 chars, lowercase alphanumeric/hyphens, starts with a letter). Forms the endpoint hostnames.
`resource_group_name`	`string`	—	Yes	Resource group for the account and private endpoints (not the managed RG).
`location`	`string`	—	Yes	Azure region for the account.
`managed_resource_group_name`	`string`	`null`	No	Name for the Purview-managed RG (backing Storage + Event Hubs). Must not pre-exist. Defaults to `<name>-managed-rg`.
`public_network_enabled`	`bool`	`false`	No	Whether the portal/ingestion endpoints are publicly reachable. Keep `false` in production.
`user_assigned_identity_ids`	`list(string)`	`[]`	No	At most one UAMI resource ID; when empty a system-assigned identity is used. Purview requires an identity.
`account_private_endpoint`	`object`	`null`	No	Private endpoint for the `account`/portal sub-resource: `subnet_id` (required), optional `name`, `private_dns_zone_ids` (point at `privatelink.purview.azure.com`).
`atlas_kafka_private_endpoint`	`object`	`null`	No	Private endpoint for the `atlas_kafka` ingestion sub-resource: `subnet_id` (required), optional `name`, `private_dns_zone_ids` (point at `privatelink.servicebus.windows.net`).
`tags`	`map(string)`	`{}`	No	Additional tags merged onto the account and private endpoints.
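
As an illustration of the identity input, the sketch below passes a single user-assigned identity. The identity name and the module label purview_uami are hypothetical, and azurerm_resource_group.governance and azurerm_storage_account.lake are borrowed from the usage example. It grants the role through the identity resource's own principal_id, on the assumption that the account's identity block may not report a principal ID when only a user-assigned identity is attached.

```hcl
# A user-assigned identity created outside the module (names are illustrative).
resource "azurerm_user_assigned_identity" "purview" {
  name                = "id-purview-dev-weu"
  resource_group_name = azurerm_resource_group.governance.name
  location            = azurerm_resource_group.governance.location
}

module "purview_uami" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-purview?ref=v1.0.0"

  name                = "pvw-kloudvin-dev-weu"
  resource_group_name = azurerm_resource_group.governance.name
  location            = azurerm_resource_group.governance.location

  # Exactly one ID: the module's validation rejects longer lists.
  user_assigned_identity_ids = [azurerm_user_assigned_identity.purview.id]

  # Dev only: leave the public endpoint on until private endpoints exist.
  public_network_enabled = true
}

# Grant the scanning role to the identity resource's own principal ID.
resource "azurerm_role_assignment" "uami_lake_reader" {
  scope                = azurerm_storage_account.lake.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_user_assigned_identity.purview.principal_id
}
```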

Outputs

Name	Description
`id`	Resource ID of the Purview account.
`name`	Name of the Purview account.
`identity_principal_id`	Object ID of the managed identity — grant it reader RBAC on the data sources you want scanned.
`identity_tenant_id`	Tenant ID of the managed identity.
`catalog_endpoint`	Catalog (data map) API endpoint.
`scan_endpoint`	Scan API endpoint used to register sources and trigger scans.
`managed_resources`	Object of managed backing resource IDs: `resource_group_id`, `storage_account_id`, `event_hub_namespace_id`.
`atlas_kafka_endpoint_primary_connection_string`	Primary Atlas Kafka (Event Hubs) connection string for the data map (sensitive).

Enterprise scenario

A pharmaceutical group running a clinical-data platform must prove to auditors that every dataset holding patient identifiers is catalogued, classified, and access-traced — and that none of the governance plane is reachable from the public internet. They deploy this module once per region with public_network_enabled = false and both private endpoints wired into the shared hub’s privatelink.purview.azure.com and privatelink.servicebus.windows.net zones, so the Purview studio and the Atlas ingestion path resolve only inside the VNet. The account’s system-assigned identity (surfaced via identity_principal_id) is granted Storage Blob Data Reader on the ADLS Gen2 trial-data lake and a scoped reader on the Synapse workspace, after which automated scans classify columns against a custom “PHI” sensitivity rule set and publish lineage — giving the compliance team a single, private catalog they can attest against during GxP audits.
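
One way to express "once per region" is a for_each over a small region map, sketched below. Everything in it except the module inputs and outputs is hypothetical: the region keys, the subnets, the DNS zones, the resource group, and the trial-data lake are placeholders for resources defined elsewhere.

```hcl
# One Purview account per region. All referenced resources are hypothetical.
locals {
  purview_regions = {
    weu = { location = "westeurope", subnet_id = azurerm_subnet.privatelink_weu.id }
    neu = { location = "northeurope", subnet_id = azurerm_subnet.privatelink_neu.id }
  }
}

module "purview" {
  for_each = local.purview_regions
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-purview?ref=v1.0.0"

  name                = "pvw-clinical-prod-${each.key}"
  resource_group_name = azurerm_resource_group.governance.name
  location            = each.value.location

  public_network_enabled = false

  account_private_endpoint = {
    subnet_id            = each.value.subnet_id
    private_dns_zone_ids = [azurerm_private_dns_zone.purview.id]
  }

  atlas_kafka_private_endpoint = {
    subnet_id            = each.value.subnet_id
    private_dns_zone_ids = [azurerm_private_dns_zone.servicebus.id]
  }
}

# Each regional account's identity gets read access to the trial-data lake.
resource "azurerm_role_assignment" "trial_lake_reader" {
  for_each = local.purview_regions

  scope                = azurerm_storage_account.trial_lake.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = module.purview[each.key].identity_principal_id
}
```

Iterating both blocks over the same static map keeps the role assignments in step with the set of accounts and avoids using module outputs as for_each keys.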

Best practices

The identity is the whole point — grant it, don’t recreate it. A Purview account is useless until its managed identity (output identity_principal_id) has reader RBAC on each source: Storage Blob Data Reader for ADLS Gen2, in-database db_datareader for Azure SQL/Synapse via the AAD admin. Prefer the system-assigned identity unless a shared UAMI is mandated, and never destroy/recreate the account casually — a new identity orphans every scan permission you painstakingly granted.
Go private on both sub-resources, not just one. Setting public_network_enabled = false is half the job; the portal (account) and the ingestion path (atlas_kafka) are separate private-endpoint targets. Wire both, into the correct zones (privatelink.purview.azure.com and privatelink.servicebus.windows.net), or scans will silently fail to publish lineage even though the portal opens.
Pin and protect the managed resource group. Let the module name it deterministically (<name>-managed-rg or your own value) so it is predictable in cost reports and policy scope — but treat it as Purview-owned: do not put your own resources in it, and avoid a CanNotDelete lock that blocks Purview’s own lifecycle operations on the backing Storage/Event Hubs.
Mind the cost model: capacity units, not requests. A Purview account bills on always-on capacity units plus metered scan vCore-hours; it is not free at idle. Run one well-RBAC’d account per region/business unit rather than sprawling per-team accounts, and schedule scans (incremental where possible) instead of full re-scans to keep vCore-hours down.
Keep the Atlas Kafka connection string in state, out of outputs you log. The atlas_kafka_endpoint_primary_connection_string output is marked sensitive for a reason — it grants ingestion into the data map. Never echo it through a non-sensitive output or local-exec; consume it only where a custom lineage publisher genuinely needs it, and rely on the managed identity for everything else.
Name for region and lifecycle. Encode purpose, environment, and region in the account name (e.g. pvw-kloudvin-prod-weu) since it drives the catalog/scan endpoint hostnames and shows up across audit logs and cost views — making prod vs non-prod accounts unambiguous at a glance.
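
For the connection-string point above, one hedged option is to hand the value to Key Vault and never print it. In this sketch azurerm_key_vault.governance is a hypothetical vault defined elsewhere, the secret and output names are illustrative, and the module label matches the usage example.

```hcl
# Store the connection string in Key Vault instead of printing it.
# azurerm_key_vault.governance is a hypothetical vault defined elsewhere.
resource "azurerm_key_vault_secret" "atlas_kafka" {
  name         = "purview-atlas-kafka-connection-string"
  value        = module.microsoft_purview.atlas_kafka_endpoint_primary_connection_string
  key_vault_id = azurerm_key_vault.governance.id
}

# If the root module must re-export the value, keep the sensitive flag:
# Terraform rejects a root output that carries a sensitive value without it.
output "purview_atlas_kafka_connection_string" {
  value     = module.microsoft_purview.atlas_kafka_endpoint_primary_connection_string
  sensitive = true
}
```

The value is still written to Terraform state, so the state backend needs the same access controls as the vault.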

Written by Vinod
