Terraform Module: Azure Data Factory — A Managed-Identity-First Orchestration Factory

Quick take: provision Azure Data Factory with Terraform (azurerm ~> 4.0), all variable-driven: system-assigned identity, Git integration, a managed VNet with its own Azure integration runtime (the prerequisite for managed private endpoints), and a Key Vault linked service. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
}

module "data_factory" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"

  name                = "..."  # Globally unique factory name (3-63 chars, alphanumeric/…
  location            = "..."  # Azure region for the factory and managed IR.
  resource_group_name = "..."  # Resource group that will contain the factory.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Azure Data Factory (ADF) is Azure’s serverless data-integration and orchestration service. It is the thing that runs your ELT/ETL: copying data between 100+ connectors, kicking off Databricks notebooks, calling stored procedures, and stitching all of that into scheduled or event-triggered pipelines. The control plane — the factory itself, its managed identity, its Git wiring, its integration runtimes — is long-lived infrastructure, while the pipelines and datasets inside it are usually authored in the ADF Studio and shipped through Git. That split is exactly why a Terraform module pays off: you want the factory shell to be reproducible, policy-compliant, and identical across dev/test/prod, without Terraform fighting the data engineers over every dataset JSON.

This module wraps azurerm_data_factory plus the three companions you almost always need in production: a managed virtual network with an Azure integration runtime, so copy activities run on private, isolated compute; Git configuration (Azure DevOps or GitHub), so the factory runs in Git-backed collaboration mode; and a Key Vault linked service, so pipelines fetch secrets via the factory’s managed identity instead of inline credentials. Everything is variable-driven with validation, and the identity’s principal and tenant IDs are exposed as outputs so you can grant it RBAC (Storage, Key Vault, SQL) from the calling configuration.

When to use it

You are standing up ADF in more than one environment and need the factory, its identity, and its Git binding to be byte-for-byte consistent.
Your security baseline mandates no public data exfiltration — you need a managed VNet and private endpoints, not the public Azure IR.
You want the factory’s system-assigned identity to be the single auth principal for downstream services (Key Vault, ADLS Gen2, Synapse), granted via Terraform RBAC.
You are practising Git-backed CI/CD for ADF and want Terraform to own the vsts_configuration / github_configuration so collaboration_branch and root_folder are codified.
Skip this module if you only need a throwaway factory for a demo — a bare azurerm_data_factory resource is enough, and Git configuration on a short-lived factory just gets in the way.
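
For that throwaway case, a minimal sketch of the bare resource might look like the following. The resource group and factory names are hypothetical placeholders, and keeping the system-assigned identity is an assumption that the demo still needs to authenticate to something.

```hcl
# Hypothetical names throughout; adjust to your demo subscription.
resource "azurerm_resource_group" "demo" {
  name     = "rg-adf-demo"
  location = "westeurope"
}

# A bare factory: no Git binding, no managed VNet, no linked services.
resource "azurerm_data_factory" "demo" {
  name                = "adf-demo-throwaway" # must still be globally unique
  location            = azurerm_resource_group.demo.location
  resource_group_name = azurerm_resource_group.demo.name

  identity {
    type = "SystemAssigned"
  }
}
```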

Module structure

terraform-module-azure-data-factory/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # azurerm_data_factory + managed IR + Key Vault linked service
├── variables.tf     # typed, validated inputs
└── outputs.tf       # id, name, identity principal_id, managed IR id

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

# main.tf

locals {
  # Git config is mutually exclusive; pick at most one provider.
  # try() keeps these safe when git_integration is null: older Terraform
  # releases evaluate both sides of && even when the left side is false.
  use_vsts   = try(var.git_integration.type, "") == "vsts"
  use_github = try(var.git_integration.type, "") == "github"

  common_tags = merge(
    {
      managed_by = "terraform"
      module     = "terraform-module-azure-data-factory"
    },
    var.tags,
  )
}

resource "azurerm_data_factory" "this" {
  name                = var.name
  location            = var.location
  resource_group_name = var.resource_group_name

  # Managed VNet keeps copy/dataflow compute off the public network.
  managed_virtual_network_enabled = var.managed_virtual_network_enabled

  # Lock the data plane down; consumers reach it via private endpoint.
  public_network_enabled = var.public_network_enabled

  identity {
    type         = var.user_assigned_identity_ids == null ? "SystemAssigned" : "SystemAssigned, UserAssigned"
    identity_ids = var.user_assigned_identity_ids
  }

  # Azure DevOps (VSTS) Git collaboration mode.
  dynamic "vsts_configuration" {
    for_each = local.use_vsts ? [var.git_integration] : []
    content {
      account_name    = vsts_configuration.value.account_name
      branch_name     = vsts_configuration.value.collaboration_branch
      project_name    = vsts_configuration.value.project_name
      repository_name = vsts_configuration.value.repository_name
      root_folder     = vsts_configuration.value.root_folder
      tenant_id       = vsts_configuration.value.tenant_id
    }
  }

  # GitHub Git collaboration mode.
  dynamic "github_configuration" {
    for_each = local.use_github ? [var.git_integration] : []
    content {
      account_name    = github_configuration.value.account_name
      branch_name     = github_configuration.value.collaboration_branch
      git_url         = github_configuration.value.git_url
      repository_name = github_configuration.value.repository_name
      root_folder     = github_configuration.value.root_folder
    }
  }

  tags = local.common_tags

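  lifecycle {
    # Pipelines/datasets are authored in Studio + Git, not Terraform.
    # prevent_destroy must be a literal (it cannot come from a variable), so
    # the module ships with the guard off; set it to true in a pinned copy
    # to block accidental destroy of a Git-backed factory.
    prevent_destroy = false
  }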
}

# Managed Azure Integration Runtime inside the managed VNet.
# Required for managed private endpoints + isolated copy compute.
resource "azurerm_data_factory_integration_runtime_azure" "managed" {
  count = var.managed_virtual_network_enabled ? 1 : 0

  name                    = var.managed_integration_runtime_name
  data_factory_id         = azurerm_data_factory.this.id
  location                = var.location
  virtual_network_enabled = true
  compute_type            = var.managed_ir_compute_type
  core_count              = var.managed_ir_core_count
  time_to_live_min        = var.managed_ir_time_to_live_min
}

# Linked service so pipelines pull secrets from Key Vault via the
# factory's managed identity (no inline connection strings).
resource "azurerm_data_factory_linked_service_key_vault" "kv" {
  count = var.key_vault_id == null ? 0 : 1

  name            = var.key_vault_linked_service_name
  data_factory_id = azurerm_data_factory.this.id
  key_vault_id    = var.key_vault_id
}

# variables.tf

variable "name" {
  description = "Globally unique name of the Data Factory (3-63 chars, alphanumeric and hyphens, must start/end alphanumeric)."
  type        = string

  validation {
    condition     = can(regex("^[a-zA-Z0-9]([a-zA-Z0-9-]{1,61}[a-zA-Z0-9])$", var.name))
    error_message = "name must be 3-63 chars, alphanumeric/hyphens, starting and ending with an alphanumeric character."
  }
}

variable "location" {
  description = "Azure region for the factory and its managed integration runtime."
  type        = string
}

variable "resource_group_name" {
  description = "Name of the resource group that will contain the Data Factory."
  type        = string
}

variable "managed_virtual_network_enabled" {
  description = "Enable the managed virtual network. Required for managed private endpoints and isolated copy compute."
  type        = bool
  default     = true
}

variable "public_network_enabled" {
  description = "Whether the data plane is reachable over the public network. Set false and use a private endpoint for production."
  type        = bool
  default     = false
}

variable "user_assigned_identity_ids" {
  description = "Optional list of user-assigned managed identity resource IDs. When null, only the system-assigned identity is created."
  type        = list(string)
  default     = null
}

variable "managed_integration_runtime_name" {
  description = "Name of the managed Azure integration runtime (created only when managed VNet is enabled)."
  type        = string
  default     = "managed-vnet-ir"
}

variable "managed_ir_compute_type" {
  description = "Compute size class for the managed Azure IR data flow cluster."
  type        = string
  default     = "General"

  validation {
    condition     = contains(["General", "ComputeOptimized", "MemoryOptimized"], var.managed_ir_compute_type)
    error_message = "managed_ir_compute_type must be one of: General, ComputeOptimized, MemoryOptimized."
  }
}

variable "managed_ir_core_count" {
  description = "Core count for the managed Azure IR data flow cluster (8, 16, 32, 48, 80, 144, or 272)."
  type        = number
  default     = 8

  validation {
    condition     = contains([8, 16, 32, 48, 80, 144, 272], var.managed_ir_core_count)
    error_message = "managed_ir_core_count must be one of: 8, 16, 32, 48, 80, 144, 272."
  }
}

variable "managed_ir_time_to_live_min" {
  description = "Minutes to keep the data flow cluster warm before tear-down (0 disables TTL). Higher TTL trades cost for lower cold-start latency."
  type        = number
  default     = 10

  validation {
    condition     = var.managed_ir_time_to_live_min >= 0 && var.managed_ir_time_to_live_min <= 1440
    error_message = "managed_ir_time_to_live_min must be between 0 and 1440."
  }
}

variable "key_vault_id" {
  description = "Resource ID of the Key Vault to wire as a linked service. When null, no Key Vault linked service is created."
  type        = string
  default     = null
}

variable "key_vault_linked_service_name" {
  description = "Name of the Key Vault linked service inside the factory."
  type        = string
  default     = "ls_keyvault"
}

variable "git_integration" {
  description = <<-EOT
    Optional Git collaboration configuration. Set type to "vsts" (Azure DevOps) or "github".
    For vsts: account_name, project_name, repository_name, collaboration_branch, root_folder, tenant_id are required.
    For github: account_name, repository_name, collaboration_branch, root_folder, git_url are required.
  EOT
  type = object({
    type                 = string
    account_name         = string
    repository_name      = string
    collaboration_branch = optional(string, "main")
    root_folder          = optional(string, "/")
    project_name         = optional(string)
    tenant_id            = optional(string)
    git_url              = optional(string)
  })
  default = null

  validation {
    condition     = var.git_integration == null || contains(["vsts", "github"], try(var.git_integration.type, ""))
    error_message = "git_integration.type must be either \"vsts\" or \"github\"."
  }

  validation {
    # try() avoids a null-attribute error when git_integration is null
    # (the default) on Terraform releases that evaluate both sides of ||.
    condition = try(var.git_integration.type, "") != "vsts" || (
      try(var.git_integration.project_name, null) != null && try(var.git_integration.tenant_id, null) != null
    )
    error_message = "vsts git_integration requires both project_name and tenant_id."
  }
}

variable "tags" {
  description = "Additional tags merged onto the factory."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "id" {
  description = "Resource ID of the Data Factory."
  value       = azurerm_data_factory.this.id
}

output "name" {
  description = "Name of the Data Factory."
  value       = azurerm_data_factory.this.name
}

output "identity_principal_id" {
  description = "Object (principal) ID of the factory's system-assigned managed identity — grant this RBAC on Storage, Key Vault, SQL, etc."
  value       = azurerm_data_factory.this.identity[0].principal_id
}

output "identity_tenant_id" {
  description = "Tenant ID of the factory's system-assigned managed identity."
  value       = azurerm_data_factory.this.identity[0].tenant_id
}

output "managed_integration_runtime_id" {
  description = "Resource ID of the managed Azure integration runtime, or null when managed VNet is disabled."
  value       = try(azurerm_data_factory_integration_runtime_azure.managed[0].id, null)
}

output "key_vault_linked_service_name" {
  description = "Name of the Key Vault linked service, or null when no Key Vault was wired."
  value       = try(azurerm_data_factory_linked_service_key_vault.kv[0].name, null)
}

How to use it

module "data_factory" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"

  name                = "adf-kloudvin-prod-weu"
  location            = "westeurope"
  resource_group_name = azurerm_resource_group.data.name

  # Production posture: locked-down data plane on a managed VNet.
  managed_virtual_network_enabled = true
  public_network_enabled          = false

  # Keep the data flow cluster warm for chained activities.
  managed_ir_compute_type     = "MemoryOptimized"
  managed_ir_core_count       = 16
  managed_ir_time_to_live_min = 15

  # Pipelines fetch secrets from this vault via the factory identity.
  key_vault_id = azurerm_key_vault.adf.id

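  # Git-backed collaboration mode (Azure DevOps). tenant_id below assumes a
  # data "azurerm_client_config" "current" {} block in the calling configuration.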
  git_integration = {
    type                 = "vsts"
    account_name         = "teknohut"
    project_name         = "kloudvin"
    repository_name      = "adf-pipelines"
    collaboration_branch = "main"
    root_folder          = "/factory"
    tenant_id            = data.azurerm_client_config.current.tenant_id
  }

  tags = {
    environment = "prod"
    owner       = "data-platform"
    cost_center = "CC-4821"
  }
}

# Downstream: grant the factory's managed identity read/write access to the
# data lake so Copy activities can land curated files.
resource "azurerm_role_assignment" "adf_lake_writer" {
  scope                = azurerm_storage_account.lake.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = module.data_factory.identity_principal_id
}

# And let the same identity read secrets from the wired Key Vault.
resource "azurerm_role_assignment" "adf_kv_secrets" {
  scope                = azurerm_key_vault.adf.id
  role_definition_name = "Key Vault Secrets User"
  principal_id         = module.data_factory.identity_principal_id
}
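
The example above binds the factory to Azure DevOps. A GitHub-backed variant could be sketched as below, using the same git_integration input with type set to "github". The organisation, resource group, and factory names are hypothetical, and git_url assumes public GitHub rather than a GitHub Enterprise host.

```hcl
module "data_factory_github" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"

  name                = "adf-kloudvin-dev-weu" # hypothetical
  location            = "westeurope"
  resource_group_name = "rg-data-dev" # hypothetical

  git_integration = {
    type                 = "github"
    account_name         = "example-org" # hypothetical GitHub org or user
    repository_name      = "adf-pipelines"
    collaboration_branch = "main"
    root_folder          = "/factory"
    git_url              = "https://github.com" # assumed: public GitHub
  }
}
```

project_name and tenant_id are omitted because the variable declares them optional and only the vsts validation requires them.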

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state storage account/container + key per path...
  }
}

2. Module config — live/prod/data_factory/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"
}

inputs = {
  name = "..."
  location = "..."
  resource_group_name = "..."
}

3. Deploy this one module, or roll out every module in the environment (both commands run from the repository root):

cd live/prod/data_factory && terragrunt apply   # this module
cd live/prod && terragrunt run-all apply        # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
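
To make the per-environment override concrete, here is a sketch of what a dev counterpart to the prod config might contain. The path live/dev/data_factory/terragrunt.hcl and every input value are assumptions for illustration; only the module source and input names come from the module above.

```hcl
# live/dev/data_factory/terragrunt.hcl (hypothetical path and values)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"
}

inputs = {
  name                = "adf-kloudvin-dev-weu"
  location            = "westeurope"
  resource_group_name = "rg-data-dev"

  # Dev-only overrides: smallest data flow cluster, no warm TTL.
  managed_ir_core_count       = 8
  managed_ir_time_to_live_min = 0

  tags = {
    environment = "dev"
  }
}
```

Only the inputs block differs between environments; the include and source lines stay identical, which is the point of the layout.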

Inputs

Name	Type	Default	Required	Description
`name`	`string`	—	Yes	Globally unique factory name (3-63 chars, alphanumeric/hyphens, must start/end alphanumeric).
`location`	`string`	—	Yes	Azure region for the factory and managed IR.
`resource_group_name`	`string`	—	Yes	Resource group that will contain the factory.
`managed_virtual_network_enabled`	`bool`	`true`	No	Enable managed VNet; required for managed private endpoints and isolated compute.
`public_network_enabled`	`bool`	`false`	No	Whether the data plane is reachable publicly. Keep `false` in production.
`user_assigned_identity_ids`	`list(string)`	`null`	No	Optional user-assigned identity IDs; system-assigned is always created.
`managed_integration_runtime_name`	`string`	`"managed-vnet-ir"`	No	Name of the managed Azure IR (created only when managed VNet is on).
`managed_ir_compute_type`	`string`	`"General"`	No	Data flow cluster class: `General`, `ComputeOptimized`, or `MemoryOptimized`.
`managed_ir_core_count`	`number`	`8`	No	Data flow core count (8/16/32/48/80/144/272).
`managed_ir_time_to_live_min`	`number`	`10`	No	Minutes to keep the data flow cluster warm (0-1440; 0 disables TTL).
`key_vault_id`	`string`	`null`	No	Key Vault resource ID to wire as a linked service; `null` skips it.
`key_vault_linked_service_name`	`string`	`"ls_keyvault"`	No	Name of the Key Vault linked service.
`git_integration`	`object`	`null`	No	Git collaboration config (`type` = `vsts` or `github`); `null` leaves the factory in Live mode.
`tags`	`map(string)`	`{}`	No	Additional tags merged onto the factory.

Outputs

Name	Description
`id`	Resource ID of the Data Factory.
`name`	Name of the Data Factory.
`identity_principal_id`	Object ID of the system-assigned managed identity — grant it RBAC on Storage, Key Vault, SQL.
`identity_tenant_id`	Tenant ID of the system-assigned managed identity.
`managed_integration_runtime_id`	Resource ID of the managed Azure IR, or `null` when managed VNet is disabled.
`key_vault_linked_service_name`	Name of the Key Vault linked service, or `null` when no vault was wired.

Enterprise scenario

A retail analytics team runs nightly ELT that lands point-of-sale exports into ADLS Gen2, transforms them with mapping data flows, and loads a Synapse dedicated pool. Security forbids any data movement over the public internet, so the factory is deployed with public_network_enabled = false and managed_virtual_network_enabled = true, and the team adds managed private endpoints from the managed VNet to the storage account and the Synapse workspace. The factory’s system-assigned identity (surfaced via identity_principal_id) is granted Storage Blob Data Contributor and Key Vault Secrets User from the root module, so no SAS tokens or connection strings ever live in pipeline JSON. Because git_integration points at the team’s Azure DevOps repo, every pipeline change ships through a PR and a release pipeline rather than click-ops in Studio.
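
The module itself does not create managed private endpoints, so in this scenario they would live in the calling configuration. A minimal sketch for the data lake, assuming the module.data_factory call and the azurerm_storage_account.lake resource from the usage example, with a hypothetical endpoint name:

```hcl
# Assumes module.data_factory (with the managed VNet enabled) and an
# existing azurerm_storage_account.lake in the calling configuration.
resource "azurerm_data_factory_managed_private_endpoint" "lake_dfs" {
  name               = "mpe-lake-dfs" # hypothetical name
  data_factory_id    = module.data_factory.id
  target_resource_id = azurerm_storage_account.lake.id
  subresource_name   = "dfs" # ADLS Gen2 endpoint; "blob" targets the Blob endpoint
}
```

The connection this creates typically has to be approved on the target storage account before traffic flows, so plan for that step in the rollout.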

Best practices

Identity over secrets. Always consume identity_principal_id to grant RBAC on Storage, Key Vault, and SQL instead of embedding connection strings or SAS in linked services. Pair it with the Key Vault linked service this module provisions so any unavoidable secret is referenced by name, never inlined.
Lock the data plane. Set public_network_enabled = false and managed_virtual_network_enabled = true in non-dev environments, then attach managed private endpoints from the managed VNet to your data stores; this is what keeps copy traffic on private links and off the public internet.
Tune the managed IR for cost, not just speed. managed_ir_time_to_live_min keeps the data flow cluster warm to cut cold-start latency, but every warm minute bills; use a short TTL (or 0) for sporadic jobs and a longer one only for tight back-to-back data flow chains. Right-size managed_ir_core_count per workload rather than defaulting large.
Let Git own pipelines, let Terraform own the shell. Configure git_integration so authoring happens in collaboration branches and ships via CI/CD; do not manage individual azurerm_data_factory_pipeline/dataset resources in this module, or Terraform and the data engineers will fight over state.
Name for environment and region. Because the factory name is globally unique, encode purpose, environment, and region (e.g. adf-kloudvin-prod-weu) so prod and non-prod factories never collide and are obvious in the portal and cost reports.
Treat the factory as long-lived. Guard production factories from accidental teardown (review prevent_destroy and your CI plan gates), since destroying a Git-backed factory can drop the collaboration configuration and trigger-state that pipelines depend on.
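
Because prevent_destroy lives inside the module and cannot be toggled by a caller, one alternative guard is an Azure management lock applied from the calling configuration. This is a different technique from prevent_destroy, not something the module does, and the lock name is hypothetical:

```hcl
# Assumes module.data_factory from the usage example.
resource "azurerm_management_lock" "adf_no_delete" {
  name       = "lock-adf-prod" # hypothetical name
  scope      = module.data_factory.id
  lock_level = "CanNotDelete"
  notes      = "Production Data Factory; remove this lock deliberately before any teardown."
}
```

A CanNotDelete lock also blocks deletion of child resources under the factory, so it suits long-lived production factories better than environments that are rebuilt often.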

Written by Vinod
