Terraform Module: Azure Machine Learning Workspace — Private, Governed MLOps Foundations

Quick take: this module provisions an Azure Machine Learning Workspace with Terraform, with customer-managed keys, a private endpoint for the workspace, a system-assigned identity, and clean MLOps-ready outputs. Private connectivity for the Storage/Key Vault/ACR dependencies stays with the caller, who owns those resources. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
  # azurerm 4.x requires a subscription: set subscription_id here or export ARM_SUBSCRIPTION_ID.
}

module "machine_learning" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-machine-learning?ref=v1.0.0"

  name                    = "..."  # Workspace name (3-33 chars, alphanumeric/hyphen, valida…
  location                = "..."  # Azure region; must match the dependent resources.
  resource_group_name     = "..."  # Resource group containing the workspace.
  application_insights_id = "..."  # App Insights resource ID for telemetry.
  key_vault_id            = "..."  # Key Vault resource ID for connection secrets.
  storage_account_id      = "..."  # Storage Account resource ID (default datastore).
}

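Then run terraform init && terraform apply. Every other input has a default; see Inputs below to override behaviour. Note that the defaults disable public network access and create no private endpoint, so supply private_endpoint (or set public_network_access_enabled = true for a sandbox) before you expect to reach the workspace.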

What this module is

An Azure Machine Learning (Azure ML) Workspace is the top-level resource that anchors everything in Azure ML: compute targets, datastores, registered models, environments, jobs, endpoints, and the experiment/run history. Critically, a workspace is not self-contained. At creation time it binds to three mandatory dependent resources: a Storage Account (for artifacts, snapshots and datasets), a Key Vault (for connection secrets and credentials), and an Application Insights instance (for telemetry from training jobs and online endpoints). A fourth, a Container Registry (ACR, for environment images used by jobs and deployments), is optional. Those bindings are immutable for the life of the workspace, which makes the workspace an awkward thing to click together in the portal: get one dependency wrong and you are recreating the whole thing.

That immutability is exactly why this belongs in a reusable Terraform module. Wrapping azurerm_machine_learning_workspace lets you encode the non-negotiable production posture once: system-assigned managed identity, customer-managed encryption keys (CMK), public network access disabled in favour of a private endpoint, high-business-impact (HBI) data handling, and consistent tagging. You can then stamp out identical dev, staging, and prod workspaces that differ only by inputs. The module also grants the workspace identity its role on the CMK key vault in the same apply, as soon as that identity exists, so consumers do not have to wire the assignment up by hand.

When to use it

You are standing up an MLOps platform and need workspaces that are reproducible across environments and regions, not hand-built snowflakes.
Your data is sensitive (PII, PHI, financial) and you must enforce CMK encryption, HBI workspaces, and no public endpoint for compliance (HIPAA, PCI-DSS, ISO 27001).
You want the workspace wired correctly to its dependencies (Storage, Key Vault, App Insights, and optionally ACR) and reachable only over a private endpoint from day one.
You are managing many teams and want every Azure ML workspace to carry the same identity model, the same outputs for wiring up diagnostics, and the same tags for cost allocation.

Reach for the raw resource instead only for a quick throwaway sandbox where you will accept the Azure-managed defaults and a public endpoint.
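
For comparison, a throwaway sandbox built on the raw resource might look like the sketch below. It is illustrative only: the sandbox resource names and the referenced resource group, Application Insights, Key Vault, and Storage Account are hypothetical and assumed to be defined elsewhere in the same configuration.

```hcl
# Sandbox-only sketch: raw resource, Microsoft-managed keys, public endpoint.
# Every azurerm_*.sandbox reference is a hypothetical resource defined elsewhere.
resource "azurerm_machine_learning_workspace" "sandbox" {
  name                    = "mlw-sandbox"
  location                = azurerm_resource_group.sandbox.location
  resource_group_name     = azurerm_resource_group.sandbox.name
  application_insights_id = azurerm_application_insights.sandbox.id
  key_vault_id            = azurerm_key_vault.sandbox.id
  storage_account_id      = azurerm_storage_account.sandbox.id

  # Acceptable for a sandbox; the module defaults this to false.
  public_network_access_enabled = true

  identity {
    type = "SystemAssigned"
  }
}
```

Nothing here sets CMK, HBI, or a private endpoint, which is the gap the module is meant to close.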

Module structure

terraform-module-azure-machine-learning/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # workspace + identity + CMK + private endpoint
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # id/name + identity + discovery URLs

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

main.tf

# The workspace binds to three mandatory dependent resources (plus an optional
# ACR) at creation. These IDs are passed in so the module stays composable
# (callers own lifecycle of the deps).
resource "azurerm_machine_learning_workspace" "this" {
  name                = var.name
  location            = var.location
  resource_group_name = var.resource_group_name

  application_insights_id = var.application_insights_id
  key_vault_id            = var.key_vault_id
  storage_account_id      = var.storage_account_id
  # Container registry is optional but required to build custom environments.
  container_registry_id   = var.container_registry_id

  friendly_name = var.friendly_name
  description   = var.description
  sku_name      = var.sku_name

  # High Business Impact: scrubs diagnostic data that could leak sensitive
  # content out of the workspace boundary. Immutable after creation.
  high_business_impact = var.high_business_impact

  # Lock the data plane down; reach the workspace only via private endpoint.
  public_network_access_enabled = var.public_network_access_enabled

  # v1 legacy mode keeps the workspace on the v1 API and blocks v2 API features.
  # Leave false on new workspaces so you get the current (v2) API.
  v1_legacy_mode_enabled = var.v1_legacy_mode_enabled

  identity {
    type         = var.user_assigned_identity_ids == null ? "SystemAssigned" : "SystemAssigned, UserAssigned"
    identity_ids = var.user_assigned_identity_ids
  }

  # Customer-managed key encryption for workspace metadata (Cosmos DB, the
  # internal search and storage that Azure ML provisions on your behalf).
  dynamic "encryption" {
    for_each = var.customer_managed_key == null ? [] : [var.customer_managed_key]
    content {
      key_vault_id              = encryption.value.key_vault_id
      key_id                    = encryption.value.key_id
      user_assigned_identity_id = try(encryption.value.user_assigned_identity_id, null)
    }
  }

  tags = var.tags
}

# Grant the workspace's system-assigned identity rights to use the CMK.
# Created whenever a CMK is supplied, after the workspace exists; assumes the
# key vault uses Azure RBAC rather than access policies.
resource "azurerm_role_assignment" "cmk_crypto_user" {
  count = var.customer_managed_key == null ? 0 : 1

  scope                = var.customer_managed_key.key_vault_id
  role_definition_name = "Key Vault Crypto Service Encryption User"
  principal_id         = azurerm_machine_learning_workspace.this.identity[0].principal_id
}

# Private endpoint for the workspace control/data plane. The amlworkspace
# sub-resource also fronts the notebook and inference scoring traffic.
resource "azurerm_private_endpoint" "this" {
  count = var.private_endpoint == null ? 0 : 1

  name                = "${var.name}-pe"
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.private_endpoint.subnet_id

  private_service_connection {
    name                           = "${var.name}-psc"
    private_connection_resource_id = azurerm_machine_learning_workspace.this.id
    subresource_names              = ["amlworkspace"]
    is_manual_connection           = false
  }

  dynamic "private_dns_zone_group" {
    for_each = length(var.private_endpoint.private_dns_zone_ids) > 0 ? [1] : []
    content {
      name                 = "aml-dns"
      private_dns_zone_ids = var.private_endpoint.private_dns_zone_ids
    }
  }

  tags = var.tags
}

variables.tf

variable "name" {
  type        = string
  description = "Name of the Azure Machine Learning workspace."

  validation {
    # ML workspace names: 3-33 chars, alphanumerics and hyphens.
    condition     = can(regex("^[a-zA-Z0-9][a-zA-Z0-9-]{1,31}[a-zA-Z0-9]$", var.name))
    error_message = "name must be 3-33 chars, alphanumeric or hyphens, and start/end alphanumeric."
  }
}

variable "location" {
  type        = string
  description = "Azure region for the workspace (must match its dependent resources)."
}

variable "resource_group_name" {
  type        = string
  description = "Resource group that will contain the workspace."
}

variable "application_insights_id" {
  type        = string
  description = "Resource ID of the Application Insights instance for job/endpoint telemetry."
}

variable "key_vault_id" {
  type        = string
  description = "Resource ID of the Key Vault that stores workspace connection secrets."
}

variable "storage_account_id" {
  type        = string
  description = "Resource ID of the Storage Account used as the default datastore."
}

variable "container_registry_id" {
  type        = string
  description = "Resource ID of the ACR used to build/store environment images. Premium SKU required when private."
  default     = null
}

variable "friendly_name" {
  type        = string
  description = "Display name shown in Azure ML Studio."
  default     = null
}

variable "description" {
  type        = string
  description = "Description of the workspace shown in Azure ML Studio."
  default     = null
}

variable "sku_name" {
  type        = string
  description = "Workspace SKU tier."
  default     = "Basic"

  validation {
    condition     = contains(["Basic", "Standard"], var.sku_name)
    error_message = "sku_name must be 'Basic' or 'Standard'."
  }
}

variable "high_business_impact" {
  type        = bool
  description = "Enable HBI to suppress collection of sensitive diagnostic data. Immutable after creation."
  default     = true
}

variable "public_network_access_enabled" {
  type        = bool
  description = "Allow public network access to the workspace. Set false and use a private endpoint in production."
  default     = false
}

variable "v1_legacy_mode_enabled" {
  type        = bool
  description = "Enable v1 legacy mode (v1 API only; v2 API features are unavailable). Keep false for new workspaces."
  default     = false
}

variable "user_assigned_identity_ids" {
  type        = list(string)
  description = "Optional list of user-assigned identity IDs to attach alongside the system-assigned identity."
  default     = null
}

variable "customer_managed_key" {
  type = object({
    key_vault_id              = string
    key_id                    = string
    user_assigned_identity_id = optional(string)
  })
  description = "Customer-managed key for workspace data encryption. Null uses Microsoft-managed keys."
  default     = null
}

variable "private_endpoint" {
  type = object({
    subnet_id            = string
    private_dns_zone_ids = optional(list(string), [])
  })
  description = "Private endpoint config for the 'amlworkspace' sub-resource. Null skips PE creation."
  default     = null
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the workspace and private endpoint."
  default     = {}
}

outputs.tf

output "id" {
  description = "Resource ID of the Azure Machine Learning workspace."
  value       = azurerm_machine_learning_workspace.this.id
}

output "name" {
  description = "Name of the workspace."
  value       = azurerm_machine_learning_workspace.this.name
}

output "discovery_url" {
  description = "Discovery URL used by the Azure ML SDK/CLI to locate workspace endpoints."
  value       = azurerm_machine_learning_workspace.this.discovery_url
}

output "workspace_id" {
  description = "Immutable GUID of the workspace (used in diagnostic/metric queries)."
  value       = azurerm_machine_learning_workspace.this.workspace_id
}

output "principal_id" {
  description = "Principal ID of the workspace's system-assigned managed identity."
  value       = azurerm_machine_learning_workspace.this.identity[0].principal_id
}

output "tenant_id" {
  description = "Tenant ID of the workspace's system-assigned managed identity."
  value       = azurerm_machine_learning_workspace.this.identity[0].tenant_id
}

output "private_endpoint_id" {
  description = "Resource ID of the workspace private endpoint, if created."
  value       = try(azurerm_private_endpoint.this[0].id, null)
}

How to use it

module "machine_learning_workspace" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-machine-learning?ref=v1.0.0"

  name                = "mlw-fraud-prod-eus2"
  location            = "eastus2"
  resource_group_name = azurerm_resource_group.ml.name

  # Three mandatory dependencies plus the optional ACR, created/owned by the caller.
  application_insights_id = azurerm_application_insights.ml.id
  key_vault_id            = azurerm_key_vault.ml.id
  storage_account_id      = azurerm_storage_account.ml.id
  container_registry_id   = azurerm_container_registry.ml.id

  friendly_name = "Fraud Detection (Prod)"
  description   = "Production training + scoring for the fraud risk models."
  sku_name      = "Basic"

  high_business_impact          = true
  public_network_access_enabled = false

  # Encrypt workspace metadata with our own key.
  customer_managed_key = {
    key_vault_id = azurerm_key_vault.cmk.id
    key_id       = azurerm_key_vault_key.aml.id
  }

  # Land the workspace inside the platform VNet.
  private_endpoint = {
    subnet_id            = azurerm_subnet.ml_pe.id
    private_dns_zone_ids = [azurerm_private_dns_zone.aml_api.id, azurerm_private_dns_zone.aml_notebooks.id]
  }

  tags = {
    env         = "prod"
    workload    = "fraud-detection"
    cost_center = "ml-platform"
  }
}

# Downstream: grant a CI/CD service principal the ability to submit jobs and
# manage assets in the workspace, scoped to its resource ID output.
resource "azurerm_role_assignment" "ci_ml_contributor" {
  scope                = module.machine_learning_workspace.id
  role_definition_name = "AzureML Data Scientist"
  principal_id         = azuread_service_principal.ml_pipeline.object_id
}
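
The example above references caller-owned resources that are not shown. A minimal sketch of those prerequisites could look like the following; all names and SKU choices are illustrative assumptions, and the CMK vault and key, the subnet, and the DNS zone VNet links are left out for brevity.

```hcl
# Hypothetical caller-side prerequisites; names and SKUs are illustrative.
data "azurerm_client_config" "current" {}

resource "azurerm_resource_group" "ml" {
  name     = "rg-fraud-prod-eus2"
  location = "eastus2"
}

resource "azurerm_log_analytics_workspace" "ml" {
  name                = "log-fraud-prod-eus2"
  location            = azurerm_resource_group.ml.location
  resource_group_name = azurerm_resource_group.ml.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# Workspace-based Application Insights for job/endpoint telemetry.
resource "azurerm_application_insights" "ml" {
  name                = "appi-fraud-prod-eus2"
  location            = azurerm_resource_group.ml.location
  resource_group_name = azurerm_resource_group.ml.name
  workspace_id        = azurerm_log_analytics_workspace.ml.id
  application_type    = "web"
}

resource "azurerm_key_vault" "ml" {
  name                     = "kv-fraud-prod-eus2" # must be globally unique
  location                 = azurerm_resource_group.ml.location
  resource_group_name      = azurerm_resource_group.ml.name
  tenant_id                = data.azurerm_client_config.current.tenant_id
  sku_name                 = "standard"
  purge_protection_enabled = true
}

resource "azurerm_storage_account" "ml" {
  name                     = "stfraudprodeus2" # must be globally unique
  resource_group_name      = azurerm_resource_group.ml.name
  location                 = azurerm_resource_group.ml.location
  account_tier             = "Standard"
  account_replication_type = "GRS"
}

# Premium is needed once the registry itself goes behind a private endpoint.
resource "azurerm_container_registry" "ml" {
  name                = "crfraudprodeus2" # must be globally unique
  resource_group_name = azurerm_resource_group.ml.name
  location            = azurerm_resource_group.ml.location
  sku                 = "Premium"
}

# Private DNS zones consumed by the module's private_endpoint input.
resource "azurerm_private_dns_zone" "aml_api" {
  name                = "privatelink.api.azureml.ms"
  resource_group_name = azurerm_resource_group.ml.name
}

resource "azurerm_private_dns_zone" "aml_notebooks" {
  name                = "privatelink.notebooks.azure.net"
  resource_group_name = azurerm_resource_group.ml.name
}
```

In a locked-down deployment each of these dependencies would also need public access disabled and its own private endpoint, which this sketch does not attempt.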

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state storage account/container + key per path...
  }
}
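
The root config above only generates the backend. If the provider should also live in one place, a generate block along these lines could sit in the same root file. This is a sketch, not part of the module: the file name provider.tf is an assumption, and with azurerm 4.x the subscription is expected to come from ARM_SUBSCRIPTION_ID or an explicit subscription_id.

```hcl
# live/terragrunt.hcl (continued): hypothetical provider generation block.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "azurerm" {
  features {}
  # Subscription is read from ARM_SUBSCRIPTION_ID unless subscription_id is set here.
}
EOF
}
```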

2. Module config — live/prod/machine_learning/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-machine-learning?ref=v1.0.0"
}

inputs = {
  name                    = "..."
  location                = "..."
  resource_group_name     = "..."
  application_insights_id = "..."
  key_vault_id            = "..."
  storage_account_id      = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/machine_learning && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)            # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
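
As a sketch of how run-all can order modules, the workspace config could declare dependency blocks and read IDs from their outputs instead of hard-coding them. The sibling directories ../key_vault and ../storage, and the assumption that each exposes an output named id, are hypothetical.

```hcl
# live/prod/machine_learning/terragrunt.hcl: hypothetical sibling modules.
dependency "key_vault" {
  config_path = "../key_vault"
}

dependency "storage" {
  config_path = "../storage"
}

inputs = {
  # Assumes each sibling module exposes an output named "id".
  key_vault_id       = dependency.key_vault.outputs.id
  storage_account_id = dependency.storage.outputs.id
}
```

With these blocks in place, Terragrunt applies the Key Vault and Storage modules before the workspace when run-all is used.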

Inputs

Name	Type	Default	Required	Description
`name`	`string`	—	Yes	Workspace name (3-33 chars, alphanumeric/hyphen, validated).
`location`	`string`	—	Yes	Azure region; must match the dependent resources.
`resource_group_name`	`string`	—	Yes	Resource group containing the workspace.
`application_insights_id`	`string`	—	Yes	App Insights resource ID for telemetry.
`key_vault_id`	`string`	—	Yes	Key Vault resource ID for connection secrets.
`storage_account_id`	`string`	—	Yes	Storage Account resource ID (default datastore).
`container_registry_id`	`string`	`null`	No	ACR resource ID for environment images (Premium when private).
`friendly_name`	`string`	`null`	No	Display name in Azure ML Studio.
`description`	`string`	`null`	No	Workspace description shown in Studio.
`sku_name`	`string`	`"Basic"`	No	Workspace SKU: `Basic` or `Standard` (validated).
`high_business_impact`	`bool`	`true`	No	Suppress sensitive diagnostic data collection (immutable).
`public_network_access_enabled`	`bool`	`false`	No	Allow public access; keep false with a private endpoint.
`v1_legacy_mode_enabled`	`bool`	`false`	No	Enable v1 legacy mode (v1 API only); keep false for new workspaces.
`user_assigned_identity_ids`	`list(string)`	`null`	No	Extra user-assigned identities alongside the system identity.
`customer_managed_key`	`object`	`null`	No	CMK config (`key_vault_id`, `key_id`, optional `user_assigned_identity_id`).
`private_endpoint`	`object`	`null`	No	PE config (`subnet_id`, optional `private_dns_zone_ids`).
`tags`	`map(string)`	`{}`	No	Tags applied to the workspace and private endpoint.

Outputs

Name	Description
`id`	Resource ID of the Azure Machine Learning workspace.
`name`	Name of the workspace.
`discovery_url`	Discovery URL used by the Azure ML SDK/CLI to locate endpoints.
`workspace_id`	Immutable workspace GUID for diagnostic/metric queries.
`principal_id`	Principal ID of the workspace system-assigned managed identity.
`tenant_id`	Tenant ID of the workspace system-assigned managed identity.
`private_endpoint_id`	Resource ID of the workspace private endpoint, if created.

Enterprise scenario

A retail bank’s fraud-analytics team runs three Azure ML workspaces — dev, staging, and prod — each in a separate spoke VNet under the platform landing zone. Because transaction data is HBI, every workspace is provisioned through this module with high_business_impact = true, public_network_access_enabled = false, a customer-managed key rotated quarterly in a dedicated Key Vault, and a private endpoint into the spoke so data scientists reach Azure ML Studio only over ExpressRoute. The module’s principal_id output feeds RBAC assignments that let the workspace identity pull environment images from a shared Premium ACR, while the id output wires a GitHub Actions service principal as AzureML Data Scientist for automated retraining — giving the team identical, audited workspaces with zero portal clicks.
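
The ACR part of that scenario can be sketched as a single role assignment driven by the module's principal_id output. The shared registry reference below is hypothetical and assumed to be defined in the platform configuration.

```hcl
# Let the workspace identity pull environment images from the shared registry.
resource "azurerm_role_assignment" "workspace_acr_pull" {
  scope                = azurerm_container_registry.shared.id # hypothetical shared Premium ACR
  role_definition_name = "AcrPull"
  principal_id         = module.machine_learning_workspace.principal_id
}
```

The same pattern applies to other narrowly scoped roles, such as Storage Blob Data Contributor on a single storage account.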

Best practices

Treat the dependencies as part of the workspace’s blast radius. Storage, Key Vault, App Insights and (when attached) ACR are bound immutably, so keep them in the same resource group, region and lifecycle as the workspace. Disable public access on each and give each its own private endpoint, since this module only creates the endpoint for the workspace itself.
Always use CMK + HBI for regulated data. Set high_business_impact = true (it can never be changed later) and supply a customer_managed_key; the module auto-grants the workspace identity the Key Vault Crypto Service Encryption User role so provisioning of the managed Cosmos DB store succeeds.
Prefer the system-assigned identity and scope RBAC tightly. Hand the principal_id output narrowly-scoped roles (e.g. AcrPull on one registry, Storage Blob Data Contributor on one account) rather than subscription-wide Contributor.
Require a Premium ACR when networking is locked down. Private endpoints and trusted-service access for ACR are Premium-only; passing a Basic/Standard registry ID will leave environment image builds unable to push or pull.
Control cost at the compute layer, not the workspace. The workspace itself is nearly free; spend comes from compute clusters and managed online endpoints. Keep the default Basic SKU unless you have a confirmed requirement for another tier, and enforce idle-shutdown and min-node-count of 0 on the compute defined alongside this module.
Name and tag for cost allocation and discovery. Use a predictable mlw-<workload>-<env>-<region> convention (the name validation only checks length and character set, so the convention itself needs review or policy to enforce) and always set cost_center/workload tags so chargeback and the workspace_id in Log Analytics queries line up with a real team.
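
The compute-layer cost advice above could be expressed as a cluster that scales to zero when idle. This is a sketch under stated assumptions: the cluster name, VM size, node counts, and idle duration are illustrative, and a workspace with public access disabled would normally also need the cluster placed in a subnet.

```hcl
# Hypothetical training cluster defined alongside the module call.
resource "azurerm_machine_learning_compute_cluster" "train" {
  name                          = "cpu-train"
  location                      = "eastus2"
  machine_learning_workspace_id = module.machine_learning_workspace.id
  vm_priority                   = "LowPriority"
  vm_size                       = "Standard_DS3_v2"

  scale_settings {
    min_node_count                       = 0      # scale to zero when idle
    max_node_count                       = 4
    scale_down_nodes_after_idle_duration = "PT15M" # ISO 8601 duration
  }

  identity {
    type = "SystemAssigned"
  }
}
```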