Terraform Module: Azure HDInsight — production Spark clusters with VNet, ADLS Gen2 and autoscale

Quick take — A reusable azurerm ~> 4.0 Terraform module for Azure HDInsight Spark: head/worker/zookeeper node sizing, ADLS Gen2 storage, VNet injection, schedule and load-based autoscale, and gateway auth. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
  # azurerm 4.x needs a subscription ID: set subscription_id here or export ARM_SUBSCRIPTION_ID.
}

module "hdinsight" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-hdinsight?ref=v1.0.0"

  cluster_name                = "..."  # Globally-unique cluster name; becomes `<name>.azurehdin…
  resource_group_name         = "..."  # Resource group for the cluster.
  location                    = "..."  # Azure region, e.g. `centralindia`.
  gateway_password            = "..."  # Gateway admin password (≥10 chars); source from Key Vau…
  ssh_public_key              = "..."  # OpenSSH public key for node SSH access.
  storage_account_id          = "..."  # Resource ID of the ADLS Gen2 storage account.
  storage_filesystem_id       = "..."  # Resource ID of the ADLS Gen2 filesystem (default FS).
  storage_managed_identity_id = "..."  # UAMI with Storage Blob Data Owner on the filesystem.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Azure HDInsight is Microsoft’s managed, open-source analytics service: it spins up fully-provisioned Hadoop-ecosystem clusters — Spark, Hive (Interactive Query), Kafka, HBase — on top of Azure VMs without you having to install or patch the stack yourself. A Spark cluster on HDInsight gives you a multi-node Apache Spark runtime with Jupyter/Zeppelin notebooks, a Livy REST endpoint and a Thrift/JDBC server, backed by Azure storage as the cluster filesystem.

The problem is that a correct Spark cluster is verbose to declare. You have three distinct node roles (head, worker, zookeeper), each with its own VM SKU and credentials; a gateway with HTTPS basic-auth; an SSH login; a storage account container or — for production — an ADLS Gen2 filesystem with a managed identity; usually VNet injection so the cluster lands in a private subnet; and an autoscale block that is different depending on whether you scale by schedule or by load. Hand-writing all of that per environment invites drift and copy-paste mistakes (mismatched usernames, a worker count below the autoscale floor, a storage key in plain HCL).

This module wraps azurerm_hdinsight_spark_cluster so the caller passes a handful of vetted inputs — name, tier, Spark version, node SKUs, the ADLS Gen2 filesystem + identity, optional VNet subnet, and an optional autoscale policy — and gets back a hardened cluster plus its HTTPS and SSH endpoints as outputs. Validations stop the most common foot-guns before apply.

When to use it

You run batch or streaming Spark jobs (ETL, feature engineering, ad-hoc Notebook analytics) and want a managed cluster rather than self-hosting Spark on AKS or VMs.
You need the cluster inside a VNet for private connectivity to data sources (SQL MI, private storage, on-prem over ExpressRoute) and governed egress.
You want a repeatable, reviewable cluster definition across dev/test/prod with consistent sizing, tags and autoscale, instead of clicking through the portal.
Your data lake is ADLS Gen2 and you want the cluster authenticated to it via a user-assigned managed identity, not storage keys.
Skip this module if you only need short-lived, on-demand Spark with no cluster to keep warm; Azure Synapse Spark pools or Databricks job clusters are a better fit. HDInsight clusters bill for as long as they exist, so they suit long-running workloads or ones that run on a regular schedule.

Module structure

terraform-module-azure-hdinsight/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # azurerm_hdinsight_spark_cluster + autoscale wiring
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # id, name, https/ssh endpoints

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

main.tf

locals {
  # The HDInsight gateway always exposes HTTPS on the cluster's public/private
  # endpoint as https://<name>.azurehdinsight.net. SSH lands on the -ssh host.
  https_endpoint = "https://${var.cluster_name}.azurehdinsight.net"
  ssh_endpoint   = "${var.cluster_name}-ssh.azurehdinsight.net"

  # VNet injection is all-or-nothing: both the subnet id and the vnet id must
  # be supplied together. If either is null, both attributes are left unset on
  # every node role and the cluster is created without VNet injection.
  network_enabled = var.subnet_id != null && var.virtual_network_id != null
}

resource "azurerm_hdinsight_spark_cluster" "this" {
  name                = var.cluster_name
  resource_group_name = var.resource_group_name
  location            = var.location
  cluster_version     = var.cluster_version
  tier                = var.tier
  tags                = var.tags

  # Spark engine version, e.g. "3.3" on HDInsight 5.1.
  component_version {
    spark = var.spark_version
  }

  gateway {
    username = var.gateway_username
    password = var.gateway_password
  }

  # ADLS Gen2 as the default cluster filesystem, authenticated with a
  # user-assigned managed identity (no storage account keys in state).
  storage_account_gen2 {
    storage_resource_id          = var.storage_account_id
    filesystem_id                = var.storage_filesystem_id
    managed_identity_resource_id = var.storage_managed_identity_id
    is_default                   = true
  }

  roles {
    head_node {
      vm_size            = var.head_node_vm_size
      username           = var.ssh_username
      ssh_keys           = [var.ssh_public_key]
      subnet_id          = local.network_enabled ? var.subnet_id : null
      virtual_network_id = local.network_enabled ? var.virtual_network_id : null
    }

    worker_node {
      vm_size               = var.worker_node_vm_size
      username              = var.ssh_username
      ssh_keys              = [var.ssh_public_key]
      target_instance_count = var.worker_node_count
      subnet_id             = local.network_enabled ? var.subnet_id : null
      virtual_network_id    = local.network_enabled ? var.virtual_network_id : null

      dynamic "autoscale" {
        for_each = var.autoscale != null ? [var.autoscale] : []
        content {
          # Load-based autoscale: min/max worker bounds.
          dynamic "capacity" {
            for_each = autoscale.value.capacity != null ? [autoscale.value.capacity] : []
            content {
              min_instance_count = capacity.value.min_instance_count
              max_instance_count = capacity.value.max_instance_count
            }
          }

          # Schedule-based autoscale: one or more time/day rules.
          dynamic "recurrence" {
            for_each = autoscale.value.recurrence != null ? [autoscale.value.recurrence] : []
            content {
              timezone = recurrence.value.timezone

              dynamic "schedule" {
                for_each = recurrence.value.schedule
                content {
                  days                  = schedule.value.days
                  time                  = schedule.value.time
                  target_instance_count = schedule.value.target_instance_count
                }
              }
            }
          }
        }
      }
    }

    zookeeper_node {
      vm_size            = var.zookeeper_node_vm_size
      username           = var.ssh_username
      ssh_keys           = [var.ssh_public_key]
      subnet_id          = local.network_enabled ? var.subnet_id : null
      virtual_network_id = local.network_enabled ? var.virtual_network_id : null
    }
  }
}

variables.tf

variable "cluster_name" {
  type        = string
  description = "Globally-unique HDInsight cluster name (becomes <name>.azurehdinsight.net)."

  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9-]{1,57}[a-z0-9]$", var.cluster_name))
    error_message = "cluster_name must be 3-59 chars, lowercase alphanumeric or hyphen, not starting/ending with a hyphen."
  }
}

variable "resource_group_name" {
  type        = string
  description = "Resource group that will hold the cluster."
}

variable "location" {
  type        = string
  description = "Azure region for the cluster, e.g. centralindia."
}

variable "cluster_version" {
  type        = string
  description = "HDInsight platform version, e.g. \"5.1\"."
  default     = "5.1"
}

variable "spark_version" {
  type        = string
  description = "Apache Spark component version, e.g. \"3.3\" on HDInsight 5.1."
  default     = "3.3"
}

variable "tier" {
  type        = string
  description = "Cluster tier: Standard or Premium (Premium enables ESP/Kerberos with AAD-DS)."
  default     = "Standard"

  validation {
    condition     = contains(["Standard", "Premium"], var.tier)
    error_message = "tier must be either \"Standard\" or \"Premium\"."
  }
}

variable "gateway_username" {
  type        = string
  description = "HTTPS gateway (Ambari) admin username."
  default     = "admin"

  validation {
    # Reject empty or whitespace-only usernames; the default "admin" passes.
    condition     = length(trimspace(var.gateway_username)) > 0
    error_message = "gateway_username must be set."
  }
}

variable "gateway_password" {
  type        = string
  description = "HTTPS gateway admin password. Source from Key Vault; do not hard-code."
  sensitive   = true

  validation {
    condition     = length(var.gateway_password) >= 10
    error_message = "gateway_password must be at least 10 characters."
  }
}

variable "ssh_username" {
  type        = string
  description = "SSH login user applied to every node role."
  default     = "sshuser"
}

variable "ssh_public_key" {
  type        = string
  description = "OpenSSH public key (ssh-rsa/ssh-ed25519 ...) for node SSH access."

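  validation {
    # Key type prefix followed by a space; ECDSA keys start with ecdsa-sha2-nistp<size>.
    condition     = can(regex("^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp(256|384|521)) ", var.ssh_public_key))
    error_message = "ssh_public_key must be a valid OpenSSH public key string."
  }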
}

variable "storage_account_id" {
  type        = string
  description = "Resource ID of the ADLS Gen2 (StorageV2 + HNS) storage account."
}

variable "storage_filesystem_id" {
  type        = string
  description = "Resource ID of the ADLS Gen2 filesystem (container) used as the default FS."
}

variable "storage_managed_identity_id" {
  type        = string
  description = "Resource ID of the user-assigned managed identity with Storage Blob Data Owner on the filesystem."
}

variable "head_node_vm_size" {
  type        = string
  description = "VM SKU for the 2 head nodes."
  default     = "Standard_E8_v3"
}

variable "worker_node_vm_size" {
  type        = string
  description = "VM SKU for worker nodes."
  default     = "Standard_E8_v3"
}

variable "zookeeper_node_vm_size" {
  type        = string
  description = "VM SKU for the 3 zookeeper nodes."
  default     = "Standard_A2_v2"
}

variable "worker_node_count" {
  type        = number
  description = "Initial (fixed) worker count. When autoscale is set, keep this within its bounds."
  default     = 3

  validation {
    condition     = var.worker_node_count >= 1 && var.worker_node_count <= 200
    error_message = "worker_node_count must be between 1 and 200."
  }
}

variable "subnet_id" {
  type        = string
  description = "Subnet resource ID for VNet injection. Set together with virtual_network_id."
  default     = null
}

variable "virtual_network_id" {
  type        = string
  description = "Virtual network resource ID for VNet injection. Set together with subnet_id."
  default     = null
}

variable "autoscale" {
  description = <<-EOT
    Optional autoscale policy. Provide EITHER capacity (load-based) OR recurrence
    (schedule-based), not both. Leave null for a fixed-size cluster.
  EOT
  type = object({
    capacity = optional(object({
      min_instance_count = number
      max_instance_count = number
    }))
    recurrence = optional(object({
      timezone = string
      schedule = list(object({
        days                  = list(string)
        time                  = string
        target_instance_count = number
      }))
    }))
  })
  default = null

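  validation {
    # A conditional is used instead of || so that a null autoscale object cannot
    # raise an attribute-access error on Terraform versions whose || evaluates
    # both operands.
    condition = (
      var.autoscale == null ? true :
      ((var.autoscale.capacity != null) != (var.autoscale.recurrence != null))
    )
    error_message = "autoscale must set exactly one of capacity (load-based) or recurrence (schedule-based)."
  }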
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the cluster."
  default     = {}
}

outputs.tf

output "id" {
  description = "Resource ID of the HDInsight Spark cluster."
  value       = azurerm_hdinsight_spark_cluster.this.id
}

output "name" {
  description = "Name of the HDInsight Spark cluster."
  value       = azurerm_hdinsight_spark_cluster.this.name
}

output "https_endpoint" {
  description = "HTTPS (Ambari/Livy) endpoint of the cluster."
  value       = azurerm_hdinsight_spark_cluster.this.https_endpoint
}

output "ssh_endpoint" {
  description = "SSH endpoint of the cluster."
  value       = azurerm_hdinsight_spark_cluster.this.ssh_endpoint
}

output "cluster_version" {
  description = "Resolved HDInsight platform version of the cluster."
  value       = azurerm_hdinsight_spark_cluster.this.cluster_version
}

How to use it

module "hdinsight" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-hdinsight?ref=v1.0.0"

  cluster_name        = "kv-spark-prod"
  resource_group_name = azurerm_resource_group.analytics.name
  location            = azurerm_resource_group.analytics.location
  cluster_version     = "5.1"
  spark_version       = "3.3"
  tier                = "Standard"

  gateway_username = "ambariadmin"
  gateway_password = data.azurerm_key_vault_secret.hdi_gw.value

  ssh_username   = "sshuser"
  ssh_public_key = file("${path.module}/keys/hdi_id_ed25519.pub")

  # ADLS Gen2 default filesystem + managed identity
  storage_account_id          = azurerm_storage_account.lake.id
  storage_filesystem_id       = azurerm_storage_data_lake_gen2_filesystem.spark.id
  storage_managed_identity_id = azurerm_user_assigned_identity.hdi.id

  # Land the cluster in a private subnet
  subnet_id          = azurerm_subnet.hdinsight.id
  virtual_network_id = azurerm_virtual_network.analytics.id

  head_node_vm_size      = "Standard_E8_v3"
  worker_node_vm_size    = "Standard_E16_v3"
  zookeeper_node_vm_size = "Standard_A2_v2"
  worker_node_count      = 4

  # Scale up for the morning batch window, down overnight (IST)
  autoscale = {
    recurrence = {
      timezone = "India Standard Time"
      schedule = [
        {
          days                  = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
          time                  = "06:00"
          target_instance_count = 10
        },
        {
          days                  = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
          time                  = "20:00"
          target_instance_count = 4
        }
      ]
    }
  }

  tags = {
    env     = "prod"
    owner   = "data-platform"
    service = "spark"
  }
}

# Downstream: surface the Livy/HTTPS endpoint to a pipeline that submits Spark jobs.

output "spark_livy_url" {
  description = "Livy URL used by orchestrators (ADF/Synapse) to submit Spark batches."
  # https_endpoint is a bare hostname (see Outputs), so add the scheme and the /livy path.
  value       = "https://${module.hdinsight.https_endpoint}/livy"
}
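
The module call above references several resources it does not define. As a hedged sketch, the supporting pieces might look like the following. Every name and value here (the Key Vault, secret, storage account, filesystem and identity names) is a hypothetical placeholder, and the resource group, VNet and subnet are assumed to be declared elsewhere in the same configuration.

```hcl
# Sketch only: all names below are hypothetical placeholders.
data "azurerm_key_vault" "analytics" {
  name                = "kv-analytics-prod"
  resource_group_name = azurerm_resource_group.analytics.name
}

# Gateway password is read from Key Vault instead of being written in HCL.
data "azurerm_key_vault_secret" "hdi_gw" {
  name         = "hdi-gateway-password"
  key_vault_id = data.azurerm_key_vault.analytics.id
}

# ADLS Gen2 is a StorageV2 account with the hierarchical namespace enabled.
resource "azurerm_storage_account" "lake" {
  name                     = "kvlakeprod001"
  resource_group_name      = azurerm_resource_group.analytics.name
  location                 = azurerm_resource_group.analytics.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = true
}

# Filesystem used as the cluster's default FS.
resource "azurerm_storage_data_lake_gen2_filesystem" "spark" {
  name               = "spark"
  storage_account_id = azurerm_storage_account.lake.id
}

resource "azurerm_user_assigned_identity" "hdi" {
  name                = "id-hdi-spark-prod"
  resource_group_name = azurerm_resource_group.analytics.name
  location            = azurerm_resource_group.analytics.location
}

# Scoped to the whole account for simplicity; narrow the scope if policy requires it.
resource "azurerm_role_assignment" "hdi_lake_owner" {
  scope                = azurerm_storage_account.lake.id
  role_definition_name = "Storage Blob Data Owner"
  principal_id         = azurerm_user_assigned_identity.hdi.principal_id
}
```

The cluster needs the role assignment to exist when it is created, so adding depends_on = [azurerm_role_assignment.hdi_lake_owner] to the module block is a reasonable safeguard.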

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state bucket/container + key per path...
  }
}
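
The config map above is left elided because it depends on where your state lives. As a hedged sketch, a filled-in root file might look like this; the resource group, storage account and container names are hypothetical, and the generated provider block assumes the subscription ID arrives through ARM_SUBSCRIPTION_ID.

```hcl
# live/terragrunt.hcl (sketch; all names are hypothetical placeholders)
remote_state {
  backend = "azurerm"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "sttfstateexample"
    container_name       = "tfstate"
    # One state blob per module directory, e.g. prod/hdinsight/terraform.tfstate
    key = "${path_relative_to_include()}/terraform.tfstate"
  }
}

# Provider written once here and generated into every module's working directory.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "azurerm" {
  features {}
}
EOF
}
```

Keying the state blob on path_relative_to_include() is what lets every child terragrunt.hcl inherit the backend without repeating it.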

2. Module config — live/prod/hdinsight/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-hdinsight?ref=v1.0.0"
}

inputs = {
  cluster_name                = "..."
  resource_group_name         = "..."
  location                    = "..."
  gateway_password            = "..."
  ssh_public_key              = "..."
  storage_account_id          = "..."
  storage_filesystem_id       = "..."
  storage_managed_identity_id = "..."
}

3. Deploy one environment, or roll out all modules together:

cd live/prod/hdinsight && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply     # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.

Inputs

Name	Type	Default	Required	Description
`cluster_name`	`string`	—	Yes	Globally-unique cluster name; becomes `<name>.azurehdinsight.net`.
`resource_group_name`	`string`	—	Yes	Resource group for the cluster.
`location`	`string`	—	Yes	Azure region, e.g. `centralindia`.
`cluster_version`	`string`	`"5.1"`	No	HDInsight platform version.
`spark_version`	`string`	`"3.3"`	No	Apache Spark component version.
`tier`	`string`	`"Standard"`	No	`Standard` or `Premium` (Premium enables ESP/Kerberos).
`gateway_username`	`string`	`"admin"`	No	HTTPS gateway (Ambari) admin user.
`gateway_password`	`string` (sensitive)	—	Yes	Gateway admin password (≥10 chars); source from Key Vault.
`ssh_username`	`string`	`"sshuser"`	No	SSH login user for all node roles.
`ssh_public_key`	`string`	—	Yes	OpenSSH public key for node SSH access.
`storage_account_id`	`string`	—	Yes	Resource ID of the ADLS Gen2 storage account.
`storage_filesystem_id`	`string`	—	Yes	Resource ID of the ADLS Gen2 filesystem (default FS).
`storage_managed_identity_id`	`string`	—	Yes	UAMI with Storage Blob Data Owner on the filesystem.
`head_node_vm_size`	`string`	`"Standard_E8_v3"`	No	VM SKU for the 2 head nodes.
`worker_node_vm_size`	`string`	`"Standard_E8_v3"`	No	VM SKU for worker nodes.
`zookeeper_node_vm_size`	`string`	`"Standard_A2_v2"`	No	VM SKU for the 3 zookeeper nodes.
`worker_node_count`	`number`	`3`	No	Initial worker count (1–200); keep within autoscale bounds.
`subnet_id`	`string`	`null`	No	Subnet ID for VNet injection (set with `virtual_network_id`).
`virtual_network_id`	`string`	`null`	No	VNet ID for VNet injection (set with `subnet_id`).
`autoscale`	`object`	`null`	No	Load-based (`capacity`) or schedule-based (`recurrence`) policy.
`tags`	`map(string)`	`{}`	No	Tags applied to the cluster.
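
The autoscale input accepts either shape described in the table. For comparison with the schedule-based example in How to use it, a load-based policy could be passed as in the fragment below, which is meant to sit inside the module block alongside the required inputs. The bounds of 3 and 10 are illustrative values, not recommendations.

```hcl
  # Fragment for the module "hdinsight" block (illustrative values).
  worker_node_count = 3 # starts at the floor of the range below

  autoscale = {
    # Load-based: HDInsight adds or removes workers within these bounds.
    capacity = {
      min_instance_count = 3
      max_instance_count = 10
    }
    # recurrence is omitted; the module accepts exactly one of the two.
  }
```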

Outputs

Name	Description
`id`	Resource ID of the HDInsight Spark cluster.
`name`	Name of the cluster.
`https_endpoint`	HTTPS (Ambari/Livy) endpoint, e.g. `kv-spark-prod.azurehdinsight.net`.
`ssh_endpoint`	SSH endpoint of the cluster.
`cluster_version`	Resolved HDInsight platform version.

Enterprise scenario

A logistics company runs a daily ETL that joins six months of shipment telemetry from ADLS Gen2 and writes curated Delta tables for the next day’s dashboards. The data platform team consumes this module pinned to ref=v1.0.0 from their Azure DevOps terraform-modules repo, deploying a VNet-injected Spark 3.3 cluster into a private subnet that reaches the lake over a service endpoint and SQL Managed Instance over a peered VNet. Schedule-based autoscale ramps workers from 4 to 10 at 06:00 IST on weekdays for the heavy join window and back to 4 at 20:00, and Azure Data Factory submits the job through the module’s https_endpoint (Livy). The result is a reviewed, identical cluster definition in dev/test/prod with no storage keys in state: storage auth flows entirely through the user-assigned managed identity.
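
The private subnet in a scenario like this still has to admit HDInsight's management traffic. A minimal sketch, assuming the azurerm_resource_group.analytics and azurerm_subnet.hdinsight resources from the usage example, is shown below. The NSG name, rule name and priority are hypothetical, and the single inbound rule on port 443 from the HDInsight service tag should be checked against current Microsoft guidance for your region before relying on it.

```hcl
# Sketch only: names and priority are hypothetical.
resource "azurerm_network_security_group" "hdinsight" {
  name                = "nsg-hdinsight"
  location            = azurerm_resource_group.analytics.location
  resource_group_name = azurerm_resource_group.analytics.name
}

# Allow the HDInsight health and management services to reach the cluster.
resource "azurerm_network_security_rule" "hdi_management_in" {
  name                        = "Allow-HDInsight-Management-In"
  priority                    = 300
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "443"
  source_address_prefix       = "HDInsight" # Azure service tag
  destination_address_prefix  = "VirtualNetwork"
  resource_group_name         = azurerm_resource_group.analytics.name
  network_security_group_name = azurerm_network_security_group.hdinsight.name
}

# Attach the NSG to the subnet the cluster is injected into.
resource "azurerm_subnet_network_security_group_association" "hdinsight" {
  subnet_id                 = azurerm_subnet.hdinsight.id
  network_security_group_id = azurerm_network_security_group.hdinsight.id
}
```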

Best practices

Authenticate to storage with a managed identity, never keys. Use storage_account_gen2 with a user-assigned identity that holds Storage Blob Data Owner on the filesystem, so no account key ever lands in Terraform state or the cluster config.
Inject into a VNet and lock down the gateway. Place head/worker/zookeeper nodes in a dedicated subnet, apply the HDInsight service-tag NSG rules, and keep the HTTPS gateway off the public internet; reach Ambari/Livy from inside the VNet or via private connectivity.
Choose autoscale to match the workload shape. Use recurrence (schedule) for predictable batch windows and capacity (load) for bursty interactive use — and make sure worker_node_count sits within the min/max so the first apply does not immediately trigger a resize.
Right-size and shut down for cost. HDInsight bills per node-hour while the cluster exists; pick memory-optimized E-series workers for Spark, keep zookeepers on cheap A2_v2, and tear down or scale to the floor when no jobs run — there is no “pause”.
Keep credentials out of HCL and pin Spark/platform versions. Pull gateway_password from Key Vault, supply SSH via a public key file, and pin cluster_version/spark_version explicitly so a provider upgrade never silently moves you to a new HDInsight image.
Name and tag for governance. Use a consistent kv-spark-<env> convention (lowercase, hyphenated, ≤59 chars) and tag every cluster with env/owner/service for cost allocation and lifecycle automation.
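
The advice to keep worker_node_count inside the autoscale bounds is not enforced by the validations shown in variables.tf. One way to enforce it, sketched here as a hypothetical addition to the module's main.tf, is a precondition on a built-in terraform_data resource (the resource name autoscale_guard is an assumption). It only applies to load-based policies, since schedule-based ones carry no min/max.

```hcl
# Hypothetical addition to main.tf. terraform_data is built into Terraform
# (>= 1.4), so no extra provider is needed.
resource "terraform_data" "autoscale_guard" {
  lifecycle {
    precondition {
      # Nested conditionals keep attribute lookups away from null objects:
      # no autoscale, or a schedule-based policy, passes automatically.
      condition = (
        var.autoscale == null ? true : (
          var.autoscale.capacity == null ? true : (
            var.worker_node_count >= var.autoscale.capacity.min_instance_count &&
            var.worker_node_count <= var.autoscale.capacity.max_instance_count
          )
        )
      )
      error_message = "worker_node_count must sit between autoscale.capacity.min_instance_count and max_instance_count."
    }
  }
}
```

The same precondition block could instead be placed in a lifecycle block on the cluster resource itself; a separate guard resource simply keeps the check apart from the cluster definition.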