Terraform Module: Azure Data Factory — A Managed-Identity-First Orchestration Factory

Quick take: Provision Azure Data Factory with Terraform and azurerm ~> 4.0: system-assigned identity, Git integration, a managed VNet with an Azure integration runtime (the prerequisite for managed private endpoints), and a Key Vault linked service, all variable-driven. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
}

module "data_factory" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"

  name                = "..."  # Globally unique factory name (3-63 chars, alphanumeric/…
  location            = "..."  # Azure region for the factory and managed IR.
  resource_group_name = "..."  # Resource group that will contain the factory.
}

Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.
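
One caveat worth sketching: azurerm 4.x expects the subscription ID to be set explicitly, either through the ARM_SUBSCRIPTION_ID environment variable or in the provider block. If the environment variable is not set, a provider block along these lines could stand in for the bare one in the Quickstart. The subscription_id variable is an assumption for illustration and is not part of the module.

```hcl
# Hypothetical root-level variable; supply it via TF_VAR_subscription_id or a tfvars file.
variable "subscription_id" {
  description = "Azure subscription to deploy the factory into."
  type        = string
}

provider "azurerm" {
  features {}

  # Required by azurerm 4.x unless ARM_SUBSCRIPTION_ID is exported in the shell.
  subscription_id = var.subscription_id
}
```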

What this module is

Azure Data Factory (ADF) is Azure’s serverless data-integration and orchestration service. It is the thing that runs your ELT/ETL: copying data between 100+ connectors, kicking off Databricks notebooks, calling stored procedures, and stitching all of that into scheduled or event-triggered pipelines. The control plane — the factory itself, its managed identity, its Git wiring, its integration runtimes — is long-lived infrastructure, while the pipelines and datasets inside it are usually authored in the ADF Studio and shipped through Git. That split is exactly why a Terraform module pays off: you want the factory shell to be reproducible, policy-compliant, and identical across dev/test/prod, without Terraform fighting the data engineers over every dataset JSON.

This module wraps azurerm_data_factory plus the three companions you almost always need in production: a managed virtual network with an Azure integration runtime so copy activities run on private, isolated compute; Git (Azure DevOps or GitHub) configuration so the factory runs in Git-backed collaboration mode; and a Key Vault linked service so pipelines fetch secrets via the factory’s managed identity instead of inline credentials. Everything is variable-driven with validation, and the identity’s principal ID and tenant ID are exposed as outputs so you can grant it RBAC (Storage, Key Vault, SQL) from the calling configuration.

When to use it

Module structure

terraform-module-azure-data-factory/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # azurerm_data_factory + managed IR + Key Vault linked service
├── variables.tf     # typed, validated inputs
└── outputs.tf       # id, name, identity principal_id, managed IR id

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

# main.tf

locals {
  # Git config is mutually exclusive; pick at most one provider.
  use_vsts   = var.git_integration != null && var.git_integration.type == "vsts"
  use_github = var.git_integration != null && var.git_integration.type == "github"

  common_tags = merge(
    {
      managed_by = "terraform"
      module     = "terraform-module-azure-data-factory"
    },
    var.tags,
  )
}

resource "azurerm_data_factory" "this" {
  name                = var.name
  location            = var.location
  resource_group_name = var.resource_group_name

  # Managed VNet keeps copy/dataflow compute off the public network.
  managed_virtual_network_enabled = var.managed_virtual_network_enabled

  # Lock the data plane down; consumers reach it via private endpoint.
  public_network_enabled = var.public_network_enabled

  identity {
    type         = var.user_assigned_identity_ids == null ? "SystemAssigned" : "SystemAssigned, UserAssigned"
    identity_ids = var.user_assigned_identity_ids
  }

  # Azure DevOps (VSTS) Git collaboration mode.
  dynamic "vsts_configuration" {
    for_each = local.use_vsts ? [var.git_integration] : []
    content {
      account_name    = vsts_configuration.value.account_name
      branch_name     = vsts_configuration.value.collaboration_branch
      project_name    = vsts_configuration.value.project_name
      repository_name = vsts_configuration.value.repository_name
      root_folder     = vsts_configuration.value.root_folder
      tenant_id       = vsts_configuration.value.tenant_id
    }
  }

  # GitHub Git collaboration mode.
  dynamic "github_configuration" {
    for_each = local.use_github ? [var.git_integration] : []
    content {
      account_name    = github_configuration.value.account_name
      branch_name     = github_configuration.value.collaboration_branch
      git_url         = github_configuration.value.git_url
      repository_name = github_configuration.value.repository_name
      root_folder     = github_configuration.value.root_folder
    }
  }

  tags = local.common_tags

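  lifecycle {
    # Pipelines/datasets are authored in Studio + Git, not Terraform.
    # prevent_destroy only accepts a literal value, so it cannot be
    # variable-driven. It is left off here; set it to true to guard a
    # Git-backed factory against accidental destroy.
    prevent_destroy = false
  }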
}

# Managed Azure Integration Runtime inside the managed VNet.
# Required for managed private endpoints + isolated copy compute.
resource "azurerm_data_factory_integration_runtime_azure" "managed" {
  count = var.managed_virtual_network_enabled ? 1 : 0

  name                    = var.managed_integration_runtime_name
  data_factory_id         = azurerm_data_factory.this.id
  location                = var.location
  virtual_network_enabled = true
  compute_type            = var.managed_ir_compute_type
  core_count              = var.managed_ir_core_count
  time_to_live_min        = var.managed_ir_time_to_live_min
}

# Linked service so pipelines pull secrets from Key Vault via the
# factory's managed identity (no inline connection strings).
resource "azurerm_data_factory_linked_service_key_vault" "kv" {
  count = var.key_vault_id == null ? 0 : 1

  name            = var.key_vault_linked_service_name
  data_factory_id = azurerm_data_factory.this.id
  key_vault_id    = var.key_vault_id
}

# variables.tf

variable "name" {
  description = "Globally unique name of the Data Factory (3-63 chars, alphanumeric and hyphens, must start/end alphanumeric)."
  type        = string

  validation {
    condition     = can(regex("^[a-zA-Z0-9]([a-zA-Z0-9-]{1,61}[a-zA-Z0-9])$", var.name))
    error_message = "name must be 3-63 chars, alphanumeric/hyphens, starting and ending with an alphanumeric character."
  }
}

variable "location" {
  description = "Azure region for the factory and its managed integration runtime."
  type        = string
}

variable "resource_group_name" {
  description = "Name of the resource group that will contain the Data Factory."
  type        = string
}

variable "managed_virtual_network_enabled" {
  description = "Enable the managed virtual network. Required for managed private endpoints and isolated copy compute."
  type        = bool
  default     = true
}

variable "public_network_enabled" {
  description = "Whether the data plane is reachable over the public network. Set false and use a private endpoint for production."
  type        = bool
  default     = false
}

variable "user_assigned_identity_ids" {
  description = "Optional list of user-assigned managed identity resource IDs. When null, only the system-assigned identity is created."
  type        = list(string)
  default     = null
}

variable "managed_integration_runtime_name" {
  description = "Name of the managed Azure integration runtime (created only when managed VNet is enabled)."
  type        = string
  default     = "managed-vnet-ir"
}

variable "managed_ir_compute_type" {
  description = "Compute size class for the managed Azure IR data flow cluster."
  type        = string
  default     = "General"

  validation {
    condition     = contains(["General", "ComputeOptimized", "MemoryOptimized"], var.managed_ir_compute_type)
    error_message = "managed_ir_compute_type must be one of: General, ComputeOptimized, MemoryOptimized."
  }
}

variable "managed_ir_core_count" {
  description = "Core count for the managed Azure IR data flow cluster (8, 16, 32, 48, 80, 144, or 272)."
  type        = number
  default     = 8

  validation {
    condition     = contains([8, 16, 32, 48, 80, 144, 272], var.managed_ir_core_count)
    error_message = "managed_ir_core_count must be one of: 8, 16, 32, 48, 80, 144, 272."
  }
}

variable "managed_ir_time_to_live_min" {
  description = "Minutes to keep the data flow cluster warm before tear-down (0 disables TTL). Higher TTL trades cost for lower cold-start latency."
  type        = number
  default     = 10

  validation {
    condition     = var.managed_ir_time_to_live_min >= 0 && var.managed_ir_time_to_live_min <= 1440
    error_message = "managed_ir_time_to_live_min must be between 0 and 1440."
  }
}

variable "key_vault_id" {
  description = "Resource ID of the Key Vault to wire as a linked service. When null, no Key Vault linked service is created."
  type        = string
  default     = null
}

variable "key_vault_linked_service_name" {
  description = "Name of the Key Vault linked service inside the factory."
  type        = string
  default     = "ls_keyvault"
}

variable "git_integration" {
  description = <<-EOT
    Optional Git collaboration configuration. Set type to "vsts" (Azure DevOps) or "github".
    For vsts: account_name, project_name, repository_name, collaboration_branch, root_folder, tenant_id are required.
    For github: account_name, repository_name, collaboration_branch, root_folder, git_url are required.
  EOT
  type = object({
    type                 = string
    account_name         = string
    repository_name      = string
    collaboration_branch = optional(string, "main")
    root_folder          = optional(string, "/")
    project_name         = optional(string)
    tenant_id            = optional(string)
    git_url              = optional(string)
  })
  default = null

  validation {
    condition     = var.git_integration == null || contains(["vsts", "github"], try(var.git_integration.type, ""))
    error_message = "git_integration.type must be either \"vsts\" or \"github\"."
  }

  validation {
    condition = var.git_integration == null || var.git_integration.type != "vsts" || (
      try(var.git_integration.project_name, null) != null && try(var.git_integration.tenant_id, null) != null
    )
    error_message = "vsts git_integration requires both project_name and tenant_id."
  }
}

variable "tags" {
  description = "Additional tags merged onto the factory."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "id" {
  description = "Resource ID of the Data Factory."
  value       = azurerm_data_factory.this.id
}

output "name" {
  description = "Name of the Data Factory."
  value       = azurerm_data_factory.this.name
}

output "identity_principal_id" {
  description = "Object (principal) ID of the factory's system-assigned managed identity — grant this RBAC on Storage, Key Vault, SQL, etc."
  value       = azurerm_data_factory.this.identity[0].principal_id
}

output "identity_tenant_id" {
  description = "Tenant ID of the factory's system-assigned managed identity."
  value       = azurerm_data_factory.this.identity[0].tenant_id
}

output "managed_integration_runtime_id" {
  description = "Resource ID of the managed Azure integration runtime, or null when managed VNet is disabled."
  value       = try(azurerm_data_factory_integration_runtime_azure.managed[0].id, null)
}

output "key_vault_linked_service_name" {
  description = "Name of the Key Vault linked service, or null when no Key Vault was wired."
  value       = try(azurerm_data_factory_linked_service_key_vault.kv[0].name, null)
}
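
The validation rules in variables.tf can be exercised without touching Azure. The following is a minimal sketch using Terraform's native test framework; it assumes a newer Terraform than the module's >= 1.5.0 pin (terraform test arrived in 1.6 and mock_provider in 1.7), and the file name and run names are hypothetical.

```hcl
# tests/validation.tftest.hcl (hypothetical file, not part of the original module)

# Mock the provider so the plan needs no Azure credentials.
mock_provider "azurerm" {}

run "rejects_name_ending_in_hyphen" {
  command = plan

  variables {
    name                = "adf-bad-name-"
    location            = "westeurope"
    resource_group_name = "rg-test"
  }

  # The run passes only if the validation block on var.name fails.
  expect_failures = [var.name]
}

run "rejects_unsupported_core_count" {
  command = plan

  variables {
    name                  = "adf-good-name"
    location              = "westeurope"
    resource_group_name   = "rg-test"
    managed_ir_core_count = 12
  }

  expect_failures = [var.managed_ir_core_count]
}
```

Running terraform test from the module root would pick up files under tests/ by default.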

How to use it

module "data_factory" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"

  name                = "adf-kloudvin-prod-weu"
  location            = "westeurope"
  resource_group_name = azurerm_resource_group.data.name

  # Production posture: locked-down data plane on a managed VNet.
  managed_virtual_network_enabled = true
  public_network_enabled          = false

  # Keep the data flow cluster warm for chained activities.
  managed_ir_compute_type     = "MemoryOptimized"
  managed_ir_core_count       = 16
  managed_ir_time_to_live_min = 15

  # Pipelines fetch secrets from this vault via the factory identity.
  key_vault_id = azurerm_key_vault.adf.id

  # Git-backed collaboration mode (Azure DevOps).
  git_integration = {
    type                 = "vsts"
    account_name         = "teknohut"
    project_name         = "kloudvin"
    repository_name      = "adf-pipelines"
    collaboration_branch = "main"
    root_folder          = "/factory"
    tenant_id            = data.azurerm_client_config.current.tenant_id
  }

  tags = {
    environment = "prod"
    owner       = "data-platform"
    cost_center = "CC-4821"
  }
}

# Downstream: grant the factory's managed identity read/write access to
# the data lake so Copy activities can land curated files.
resource "azurerm_role_assignment" "adf_lake_writer" {
  scope                = azurerm_storage_account.lake.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = module.data_factory.identity_principal_id
}

# And let the same identity read secrets from the wired Key Vault.
resource "azurerm_role_assignment" "adf_kv_secrets" {
  scope                = azurerm_key_vault.adf.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = module.data_factory.identity_principal_id
}
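
The example above wires Azure DevOps. For GitHub, the same git_integration input takes git_url and leaves project_name and tenant_id unset. A minimal sketch, assuming a hypothetical organisation, repository, and resource names:

```hcl
module "data_factory_github" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"

  name                = "adf-example-dev-weu" # hypothetical
  location            = "westeurope"
  resource_group_name = "rg-data-dev" # hypothetical

  # GitHub-backed collaboration mode.
  git_integration = {
    type                 = "github"
    account_name         = "example-org" # hypothetical GitHub org or user
    repository_name      = "adf-pipelines"
    collaboration_branch = "main"
    root_folder          = "/factory"
    git_url              = "https://github.com" # or a GitHub Enterprise host URL
  }
}
```

Because project_name and tenant_id are declared optional in the object type, omitting them leaves them null, and the vsts-only validation does not apply.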

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state storage account/container + key per path...
  }
}

2. Module config, live/prod/data_factory/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"
}

inputs = {
  name = "..."
  location = "..."
  resource_group_name = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/data_factory && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply        # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
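
To make the per-environment override concrete, a dev counterpart to the prod module config might look like the sketch below. The live/dev path, the resource names, and the choice to relax the network posture in dev are all assumptions for illustration; only the input names come from the module.

```hcl
# live/dev/data_factory/terragrunt.hcl (hypothetical path mirroring the prod layout)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"
}

inputs = {
  name                = "adf-example-dev-weu" # hypothetical
  location            = "westeurope"
  resource_group_name = "rg-data-dev" # hypothetical

  # Only the inputs that differ from the module defaults are listed.
  public_network_enabled      = true # dev only; keep false in production
  managed_ir_time_to_live_min = 0    # no warm cluster, lowest cost
}
```

Everything not listed falls back to the module defaults, so the dev and prod files differ only where the environments actually do.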

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| name | string | - | Yes | Globally unique factory name (3-63 chars, alphanumeric/hyphens, must start/end alphanumeric). |
| location | string | - | Yes | Azure region for the factory and managed IR. |
| resource_group_name | string | - | Yes | Resource group that will contain the factory. |
| managed_virtual_network_enabled | bool | true | No | Enable managed VNet; required for managed private endpoints and isolated compute. |
| public_network_enabled | bool | false | No | Whether the data plane is reachable publicly. Keep false in production. |
| user_assigned_identity_ids | list(string) | null | No | Optional user-assigned identity IDs; system-assigned is always created. |
| managed_integration_runtime_name | string | "managed-vnet-ir" | No | Name of the managed Azure IR (created only when managed VNet is on). |
| managed_ir_compute_type | string | "General" | No | Data flow cluster class: General, ComputeOptimized, or MemoryOptimized. |
| managed_ir_core_count | number | 8 | No | Data flow core count (8/16/32/48/80/144/272). |
| managed_ir_time_to_live_min | number | 10 | No | Minutes to keep the data flow cluster warm (0-1440; 0 disables TTL). |
| key_vault_id | string | null | No | Key Vault resource ID to wire as a linked service; null skips it. |
| key_vault_linked_service_name | string | "ls_keyvault" | No | Name of the Key Vault linked service. |
| git_integration | object | null | No | Git collaboration config (type = vsts or github); null leaves the factory in Live mode. |
| tags | map(string) | {} | No | Additional tags merged onto the factory. |

Outputs

| Name | Description |
| --- | --- |
| id | Resource ID of the Data Factory. |
| name | Name of the Data Factory. |
| identity_principal_id | Object ID of the system-assigned managed identity; grant it RBAC on Storage, Key Vault, SQL. |
| identity_tenant_id | Tenant ID of the system-assigned managed identity. |
| managed_integration_runtime_id | Resource ID of the managed Azure IR, or null when managed VNet is disabled. |
| key_vault_linked_service_name | Name of the Key Vault linked service, or null when no vault was wired. |

Enterprise scenario

A retail analytics team runs nightly ELT that lands point-of-sale exports into ADLS Gen2, transforms them with mapping data flows, and loads a Synapse dedicated pool. Security forbids any data movement over the public internet, so the factory is deployed with public_network_enabled = false and managed_virtual_network_enabled = true, and the team adds managed private endpoints in the factory’s managed VNet for the storage account and the Synapse workspace. The factory’s system-assigned identity (surfaced via identity_principal_id) is granted Storage Blob Data Contributor and Key Vault Secrets User from the root module, so no SAS tokens or connection strings ever live in pipeline JSON. Because git_integration points at the team’s Azure DevOps repo, every pipeline change ships through a PR and a release pipeline rather than click-ops in Studio.
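
The module itself does not create managed private endpoints, so in this scenario they would live in the calling configuration. A minimal sketch for the data lake, assuming the azurerm_storage_account.lake resource from the usage example and a hypothetical endpoint name:

```hcl
# Managed private endpoint from the factory's managed VNet to the ADLS Gen2
# account. Requires managed_virtual_network_enabled = true on the module.
resource "azurerm_data_factory_managed_private_endpoint" "lake_dfs" {
  name               = "mpe-lake-dfs" # hypothetical name
  data_factory_id    = module.data_factory.id
  target_resource_id = azurerm_storage_account.lake.id # assumed to exist in the caller
  subresource_name   = "dfs"                           # ADLS Gen2 (Data Lake) endpoint
}
```

The connection typically shows up as pending on the target resource until someone with rights on the storage account approves it; a second endpoint for the Synapse workspace would follow the same shape with its own target ID and subresource.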

Best practices
