Quick take: provision Azure Data Factory with Terraform and azurerm ~> 4.0, with a system-assigned identity, Git integration, a managed VNet and integration runtime ready for managed private endpoints, and a Key Vault linked service, all variable-driven. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "azurerm" {
features {}
}
module "data_factory" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"
name = "..." # Globally unique factory name (3-63 chars, alphanumeric/…
location = "..." # Azure region for the factory and managed IR.
resource_group_name = "..." # Resource group that will contain the factory.
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
Azure Data Factory (ADF) is Azure’s serverless data-integration and orchestration service. It is the thing that runs your ELT/ETL: copying data between 100+ connectors, kicking off Databricks notebooks, calling stored procedures, and stitching all of that into scheduled or event-triggered pipelines. The control plane — the factory itself, its managed identity, its Git wiring, its integration runtimes — is long-lived infrastructure, while the pipelines and datasets inside it are usually authored in the ADF Studio and shipped through Git. That split is exactly why a Terraform module pays off: you want the factory shell to be reproducible, policy-compliant, and identical across dev/test/prod, without Terraform fighting the data engineers over every dataset JSON.
This module wraps azurerm_data_factory plus the three companions you almost always need in production: a managed virtual network with an Azure integration runtime so copy activities run on private, isolated compute; Git (Azure DevOps or GitHub) configuration so the factory runs in Git collaboration mode; and a Key Vault linked service so pipelines fetch secrets via the factory’s managed identity instead of inline credentials. Everything is variable-driven with validation, and the identity’s principal and tenant IDs are exposed as outputs so you can grant it RBAC (Storage, Key Vault, SQL) from the calling configuration.
When to use it
- You are standing up ADF in more than one environment and need the factory, its identity, and its Git binding to be byte-for-byte consistent.
- Your security baseline mandates no public data exfiltration — you need a managed VNet and private endpoints, not the public Azure IR.
- You want the factory’s system-assigned identity to be the single auth principal for downstream services (Key Vault, ADLS Gen2, Synapse), granted via Terraform RBAC.
- You are practising Git-backed CI/CD for ADF and want Terraform to own the vsts_configuration/github_configuration so collaboration_branch and root_folder are codified.
- Skip this module if you only need a throwaway factory for a demo: a bare azurerm_data_factory resource is enough, and Git configuration on a short-lived factory just gets in the way.
Module structure
terraform-module-azure-data-factory/
├── versions.tf # provider + Terraform version pins
├── main.tf # azurerm_data_factory + managed IR + Key Vault linked service
├── variables.tf # typed, validated inputs
└── outputs.tf # id, name, identity principal_id, managed IR id
# versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 4.0"
}
}
}
# main.tf
locals {
# Git config is mutually exclusive; pick at most one provider.
use_vsts = var.git_integration != null && var.git_integration.type == "vsts"
use_github = var.git_integration != null && var.git_integration.type == "github"
common_tags = merge(
{
managed_by = "terraform"
module = "terraform-module-azure-data-factory"
},
var.tags,
)
}
resource "azurerm_data_factory" "this" {
name = var.name
location = var.location
resource_group_name = var.resource_group_name
# Managed VNet keeps copy/dataflow compute off the public network.
managed_virtual_network_enabled = var.managed_virtual_network_enabled
# Lock the data plane down; consumers reach it via private endpoint.
public_network_enabled = var.public_network_enabled
identity {
type = var.user_assigned_identity_ids == null ? "SystemAssigned" : "SystemAssigned, UserAssigned"
identity_ids = var.user_assigned_identity_ids
}
# Azure DevOps (VSTS) Git collaboration mode.
dynamic "vsts_configuration" {
for_each = local.use_vsts ? [var.git_integration] : []
content {
account_name = vsts_configuration.value.account_name
branch_name = vsts_configuration.value.collaboration_branch
project_name = vsts_configuration.value.project_name
repository_name = vsts_configuration.value.repository_name
root_folder = vsts_configuration.value.root_folder
tenant_id = vsts_configuration.value.tenant_id
}
}
# GitHub Git collaboration mode.
dynamic "github_configuration" {
for_each = local.use_github ? [var.git_integration] : []
content {
account_name = github_configuration.value.account_name
branch_name = github_configuration.value.collaboration_branch
git_url = github_configuration.value.git_url
repository_name = github_configuration.value.repository_name
root_folder = github_configuration.value.root_folder
}
}
tags = local.common_tags
lifecycle {
# Pipelines/datasets are authored in Studio + Git, not Terraform.
# prevent_destroy must be a literal (it cannot come from a variable), so it
# ships as false; set it to true in a fork to block accidental destroys.
prevent_destroy = false
}
}
# Managed Azure Integration Runtime inside the managed VNet.
# Required for managed private endpoints + isolated copy compute.
resource "azurerm_data_factory_integration_runtime_azure" "managed" {
count = var.managed_virtual_network_enabled ? 1 : 0
name = var.managed_integration_runtime_name
data_factory_id = azurerm_data_factory.this.id
location = var.location
virtual_network_enabled = true
compute_type = var.managed_ir_compute_type
core_count = var.managed_ir_core_count
time_to_live_min = var.managed_ir_time_to_live_min
}
# Linked service so pipelines pull secrets from Key Vault via the
# factory's managed identity (no inline connection strings).
resource "azurerm_data_factory_linked_service_key_vault" "kv" {
count = var.key_vault_id == null ? 0 : 1
name = var.key_vault_linked_service_name
data_factory_id = azurerm_data_factory.this.id
key_vault_id = var.key_vault_id
}
# variables.tf
variable "name" {
description = "Globally unique name of the Data Factory (3-63 chars, alphanumeric and hyphens, must start/end alphanumeric)."
type = string
validation {
condition = can(regex("^[a-zA-Z0-9]([a-zA-Z0-9-]{1,61}[a-zA-Z0-9])$", var.name))
error_message = "name must be 3-63 chars, alphanumeric/hyphens, starting and ending with an alphanumeric character."
}
}
variable "location" {
description = "Azure region for the factory and its managed integration runtime."
type = string
}
variable "resource_group_name" {
description = "Name of the resource group that will contain the Data Factory."
type = string
}
variable "managed_virtual_network_enabled" {
description = "Enable the managed virtual network. Required for managed private endpoints and isolated copy compute."
type = bool
default = true
}
variable "public_network_enabled" {
description = "Whether the data plane is reachable over the public network. Set false and use a private endpoint for production."
type = bool
default = false
}
variable "user_assigned_identity_ids" {
description = "Optional list of user-assigned managed identity resource IDs. When null, only the system-assigned identity is created."
type = list(string)
default = null
}
variable "managed_integration_runtime_name" {
description = "Name of the managed Azure integration runtime (created only when managed VNet is enabled)."
type = string
default = "managed-vnet-ir"
}
variable "managed_ir_compute_type" {
description = "Compute size class for the managed Azure IR data flow cluster."
type = string
default = "General"
validation {
condition = contains(["General", "ComputeOptimized", "MemoryOptimized"], var.managed_ir_compute_type)
error_message = "managed_ir_compute_type must be one of: General, ComputeOptimized, MemoryOptimized."
}
}
variable "managed_ir_core_count" {
description = "Core count for the managed Azure IR data flow cluster (8, 16, 32, 48, 80, 144, or 272)."
type = number
default = 8
validation {
condition = contains([8, 16, 32, 48, 80, 144, 272], var.managed_ir_core_count)
error_message = "managed_ir_core_count must be one of: 8, 16, 32, 48, 80, 144, 272."
}
}
variable "managed_ir_time_to_live_min" {
description = "Minutes to keep the data flow cluster warm before tear-down (0 disables TTL). Higher TTL trades cost for lower cold-start latency."
type = number
default = 10
validation {
condition = var.managed_ir_time_to_live_min >= 0 && var.managed_ir_time_to_live_min <= 1440
error_message = "managed_ir_time_to_live_min must be between 0 and 1440."
}
}
variable "key_vault_id" {
description = "Resource ID of the Key Vault to wire as a linked service. When null, no Key Vault linked service is created."
type = string
default = null
}
variable "key_vault_linked_service_name" {
description = "Name of the Key Vault linked service inside the factory."
type = string
default = "ls_keyvault"
}
variable "git_integration" {
description = <<-EOT
Optional Git collaboration configuration. Set type to "vsts" (Azure DevOps) or "github".
Always required: account_name and repository_name. collaboration_branch defaults to "main" and root_folder to "/".
For vsts: project_name and tenant_id are also required (enforced by validation).
For github: git_url is expected as well (not enforced by validation).
EOT
type = object({
type = string
account_name = string
repository_name = string
collaboration_branch = optional(string, "main")
root_folder = optional(string, "/")
project_name = optional(string)
tenant_id = optional(string)
git_url = optional(string)
})
default = null
validation {
condition = var.git_integration == null || contains(["vsts", "github"], try(var.git_integration.type, ""))
error_message = "git_integration.type must be either \"vsts\" or \"github\"."
}
validation {
condition = var.git_integration == null || var.git_integration.type != "vsts" || (
try(var.git_integration.project_name, null) != null && try(var.git_integration.tenant_id, null) != null
)
error_message = "vsts git_integration requires both project_name and tenant_id."
}
}
variable "tags" {
description = "Additional tags merged onto the factory."
type = map(string)
default = {}
}
# outputs.tf
output "id" {
description = "Resource ID of the Data Factory."
value = azurerm_data_factory.this.id
}
output "name" {
description = "Name of the Data Factory."
value = azurerm_data_factory.this.name
}
output "identity_principal_id" {
description = "Object (principal) ID of the factory's system-assigned managed identity — grant this RBAC on Storage, Key Vault, SQL, etc."
value = azurerm_data_factory.this.identity[0].principal_id
}
output "identity_tenant_id" {
description = "Tenant ID of the factory's system-assigned managed identity."
value = azurerm_data_factory.this.identity[0].tenant_id
}
output "managed_integration_runtime_id" {
description = "Resource ID of the managed Azure integration runtime, or null when managed VNet is disabled."
value = try(azurerm_data_factory_integration_runtime_azure.managed[0].id, null)
}
output "key_vault_linked_service_name" {
description = "Name of the Key Vault linked service, or null when no Key Vault was wired."
value = try(azurerm_data_factory_linked_service_key_vault.kv[0].name, null)
}
How to use it
module "data_factory" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"
name = "adf-kloudvin-prod-weu"
location = "westeurope"
resource_group_name = azurerm_resource_group.data.name
# Production posture: locked-down data plane on a managed VNet.
managed_virtual_network_enabled = true
public_network_enabled = false
# Keep the data flow cluster warm for chained activities.
managed_ir_compute_type = "MemoryOptimized"
managed_ir_core_count = 16
managed_ir_time_to_live_min = 15
# Pipelines fetch secrets from this vault via the factory identity.
key_vault_id = azurerm_key_vault.adf.id
# Git-backed collaboration mode (Azure DevOps).
git_integration = {
type = "vsts"
account_name = "teknohut"
project_name = "kloudvin"
repository_name = "adf-pipelines"
collaboration_branch = "main"
root_folder = "/factory"
tenant_id = data.azurerm_client_config.current.tenant_id # needs a data "azurerm_client_config" "current" {} block in the caller
}
tags = {
environment = "prod"
owner = "data-platform"
cost_center = "CC-4821"
}
}
# Downstream: grant the factory's managed identity read/write access to the
# data lake so Copy activities can land curated files.
resource "azurerm_role_assignment" "adf_lake_writer" {
scope = azurerm_storage_account.lake.id
role_definition_name = "Storage Blob Data Contributor"
principal_id = module.data_factory.identity_principal_id
}
# And let the same identity read secrets from the wired Key Vault.
resource "azurerm_role_assignment" "adf_kv_secrets" {
scope = azurerm_key_vault.adf.id
role_definition_name = "Key Vault Secrets User"
principal_id = module.data_factory.identity_principal_id
}
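For a GitHub-hosted repository, the same module call would swap the git_integration object. The following is a minimal sketch rather than something taken from the module's own documentation: the factory name, the example-org account, and the git_url value are illustrative assumptions, and collaboration_branch and root_folder are left to fall back to the defaults declared in variables.tf.

```hcl
module "data_factory_dev" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"

  name                = "adf-kloudvin-dev-weu" # illustrative name
  location            = "westeurope"
  resource_group_name = azurerm_resource_group.data.name

  # GitHub collaboration mode; project_name and tenant_id are vsts-only.
  git_integration = {
    type            = "github"
    account_name    = "example-org"        # hypothetical GitHub organisation
    repository_name = "adf-pipelines"
    git_url         = "https://github.com" # assumed public GitHub host
    # collaboration_branch and root_folder default to "main" and "/".
  }
}
```

The module's validation only enforces project_name and tenant_id for vsts, so a missing git_url would not be rejected by the module itself.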
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "azurerm"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...azurerm storage account/container + key per path...
}
}
2. Module config — live/prod/data_factory/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"
}
inputs = {
name = "..."
location = "..."
resource_group_name = "..."
}
3. Deploy one environment, or roll out all modules together:
cd live/prod/data_factory && terragrunt apply # this module only
cd live/prod && terragrunt run-all apply # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
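As a sketch of the per-environment override described above, a dev unit could keep the managed IR cheap and read the resource group name from a sibling unit. The live/dev path, the ../resource_group unit, and its name output are hypothetical; substitute whatever your live tree actually exposes.

```hcl
# live/dev/data_factory/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-factory?ref=v1.0.0"
}

# Assumed sibling unit that creates the resource group and exposes a `name` output.
dependency "resource_group" {
  config_path = "../resource_group"
}

inputs = {
  name                = "adf-kloudvin-dev-weu" # illustrative name
  location            = "westeurope"
  resource_group_name = dependency.resource_group.outputs.name

  # Dev posture: smallest data flow cluster and no warm TTL.
  managed_ir_core_count       = 8
  managed_ir_time_to_live_min = 0
}
```

With the dependency block in place, a run-all from the environment folder applies the resource group unit before the factory.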
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name | string | (none) | Yes | Globally unique factory name (3-63 chars, alphanumeric/hyphens, must start/end alphanumeric). |
| location | string | (none) | Yes | Azure region for the factory and managed IR. |
| resource_group_name | string | (none) | Yes | Resource group that will contain the factory. |
| managed_virtual_network_enabled | bool | true | No | Enable managed VNet; required for managed private endpoints and isolated compute. |
| public_network_enabled | bool | false | No | Whether the data plane is reachable publicly. Keep false in production. |
| user_assigned_identity_ids | list(string) | null | No | Optional user-assigned identity IDs; system-assigned is always created. |
| managed_integration_runtime_name | string | "managed-vnet-ir" | No | Name of the managed Azure IR (created only when managed VNet is on). |
| managed_ir_compute_type | string | "General" | No | Data flow cluster class: General, ComputeOptimized, or MemoryOptimized. |
| managed_ir_core_count | number | 8 | No | Data flow core count (8/16/32/48/80/144/272). |
| managed_ir_time_to_live_min | number | 10 | No | Minutes to keep the data flow cluster warm (0-1440; 0 disables TTL). |
| key_vault_id | string | null | No | Key Vault resource ID to wire as a linked service; null skips it. |
| key_vault_linked_service_name | string | "ls_keyvault" | No | Name of the Key Vault linked service. |
| git_integration | object | null | No | Git collaboration config (type = vsts or github); null leaves the factory in Live mode. |
| tags | map(string) | {} | No | Additional tags merged onto the factory. |
Outputs
| Name | Description |
|---|---|
| id | Resource ID of the Data Factory. |
| name | Name of the Data Factory. |
| identity_principal_id | Object ID of the system-assigned managed identity; grant it RBAC on Storage, Key Vault, SQL. |
| identity_tenant_id | Tenant ID of the system-assigned managed identity. |
| managed_integration_runtime_id | Resource ID of the managed Azure IR, or null when managed VNet is disabled. |
| key_vault_linked_service_name | Name of the Key Vault linked service, or null when no vault was wired. |
Enterprise scenario
A retail analytics team runs nightly ELT that lands point-of-sale exports into ADLS Gen2, transforms them with mapping data flows, and loads a Synapse dedicated pool. Security forbids any data movement over the public internet, so the factory is deployed with public_network_enabled = false and managed_virtual_network_enabled = true, and the team adds managed private endpoints from the managed IR to the storage and Synapse accounts. The factory’s system-assigned identity (surfaced via identity_principal_id) is granted Storage Blob Data Contributor and Key Vault Secrets User from the root module, so no SAS tokens or connection strings ever live in pipeline JSON — and because git_integration points at the team’s Azure DevOps repo, every pipeline change ships through a PR and a release pipeline rather than click-ops in Studio.
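The managed private endpoints in this scenario are not created by the module; they would live in the calling configuration. A minimal sketch, assuming the caller already defines azurerm_storage_account.lake and a hypothetical azurerm_synapse_workspace.analytics, and assuming the dfs and Sql sub-resource names match the targets in your subscription:

```hcl
# Managed private endpoint from the factory's managed VNet to the data lake.
resource "azurerm_data_factory_managed_private_endpoint" "lake_dfs" {
  name               = "mpe-lake-dfs" # illustrative name
  data_factory_id    = module.data_factory.id
  target_resource_id = azurerm_storage_account.lake.id
  subresource_name   = "dfs" # ADLS Gen2 endpoint (assumed)
}

# Managed private endpoint to the Synapse dedicated SQL endpoint.
resource "azurerm_data_factory_managed_private_endpoint" "synapse_sql" {
  name               = "mpe-synapse-sql" # illustrative name
  data_factory_id    = module.data_factory.id
  target_resource_id = azurerm_synapse_workspace.analytics.id # hypothetical resource
  subresource_name   = "Sql" # dedicated pool endpoint (assumed)
}
```

Managed private endpoints are created in a pending state and usually need approval on the target resource before traffic can flow, so plan for that step in the rollout.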
Best practices
- Identity over secrets. Always consume identity_principal_id to grant RBAC on Storage, Key Vault, and SQL instead of embedding connection strings or SAS in linked services. Pair it with the Key Vault linked service this module provisions so any unavoidable secret is referenced by name, never inlined.
- Lock the data plane. Set public_network_enabled = false and managed_virtual_network_enabled = true in non-dev environments, then attach managed private endpoints from the managed IR to your data stores; this is what keeps copy activities off the public internet.
- Tune the managed IR for cost, not just speed. managed_ir_time_to_live_min keeps the data flow cluster warm to cut cold-start latency, but every warm minute bills; use a short TTL (or 0) for sporadic jobs and a longer one only for tight back-to-back data flow chains. Right-size managed_ir_core_count per workload rather than defaulting large.
- Let Git own pipelines, let Terraform own the shell. Configure git_integration so authoring happens in collaboration branches and ships via CI/CD; do not manage individual azurerm_data_factory_pipeline/dataset resources in this module, or Terraform and the data engineers will fight over state.
- Name for environment and region. Because the factory name is globally unique, encode purpose, environment, and region (e.g. adf-kloudvin-prod-weu) so prod and non-prod factories never collide and are obvious in the portal and cost reports.
- Treat the factory as long-lived. Guard production factories from accidental teardown (review prevent_destroy and your CI plan gates), since destroying a Git-backed factory can drop the collaboration configuration and trigger state that pipelines depend on.
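Because prevent_destroy is hard-coded to false inside the module and lifecycle arguments cannot be set from a module call, one option for guarding a production factory is an Azure management lock in the calling configuration. This swaps in a different mechanism than prevent_destroy, and the lock name below is an illustrative assumption:

```hcl
# CanNotDelete lock on the factory; blocks deletes until the lock is removed.
resource "azurerm_management_lock" "adf_no_delete" {
  name       = "adf-prod-no-delete" # illustrative name
  scope      = module.data_factory.id
  lock_level = "CanNotDelete"
  notes      = "Production Data Factory; remove this lock deliberately before any teardown."
}
```

A lock like this guards against out-of-band deletes (portal, CLI, another pipeline). A terraform destroy of the same configuration would remove the lock first, so keep the lock in a separate state if it should also stop Terraform.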