
Terraform Module: Azure HDInsight — production Spark clusters with VNet, ADLS Gen2 and autoscale

Quick take — A reusable azurerm ~> 4.0 Terraform module for Azure HDInsight Spark: head/worker/zookeeper node sizing, ADLS Gen2 storage, VNet injection, schedule and load-based autoscale, and gateway auth. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
}

module "hdinsight" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-hdinsight?ref=v1.0.0"

  cluster_name                = "..."  # Globally-unique cluster name; becomes `<name>.azurehdin…
  resource_group_name         = "..."  # Resource group for the cluster.
  location                    = "..."  # Azure region, e.g. `centralindia`.
  gateway_password            = "..."  # Gateway admin password (≥10 chars); source from Key Vau…
  ssh_public_key              = "..."  # OpenSSH public key for node SSH access.
  storage_account_id          = "..."  # Resource ID of the ADLS Gen2 storage account.
  storage_filesystem_id       = "..."  # Resource ID of the ADLS Gen2 filesystem (default FS).
  storage_managed_identity_id = "..."  # UAMI with Storage Blob Data Owner on the filesystem.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Azure HDInsight is Microsoft’s managed, open-source analytics service: it spins up fully-provisioned Hadoop-ecosystem clusters — Spark, Hive (Interactive Query), Kafka, HBase — on top of Azure VMs without you having to install or patch the stack yourself. A Spark cluster on HDInsight gives you a multi-node Apache Spark runtime with Jupyter/Zeppelin notebooks, a Livy REST endpoint and a Thrift/JDBC server, backed by Azure storage as the cluster filesystem.

The problem is that a correct Spark cluster is verbose to declare. You have three distinct node roles (head, worker, zookeeper), each with its own VM SKU and credentials; a gateway with HTTPS basic-auth; an SSH login; a storage account container or — for production — an ADLS Gen2 filesystem with a managed identity; usually VNet injection so the cluster lands in a private subnet; and an autoscale block that is different depending on whether you scale by schedule or by load. Hand-writing all of that per environment invites drift and copy-paste mistakes (mismatched usernames, a worker count below the autoscale floor, a storage key in plain HCL).

This module wraps azurerm_hdinsight_spark_cluster so the caller passes a handful of vetted inputs — name, tier, Spark version, node SKUs, the ADLS Gen2 filesystem + identity, optional VNet subnet, and an optional autoscale policy — and gets back a hardened cluster plus its HTTPS and SSH endpoints as outputs. Validations stop the most common foot-guns before apply.

When to use it

Module structure

terraform-module-azure-hdinsight/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # azurerm_hdinsight_spark_cluster + autoscale wiring
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # id, name, https/ssh endpoints

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

main.tf

locals {
  # The HDInsight gateway always exposes HTTPS on the cluster's public/private
  # endpoint as https://<name>.azurehdinsight.net. SSH lands on the -ssh host.
  https_endpoint = "https://${var.cluster_name}.azurehdinsight.net"
  ssh_endpoint   = "${var.cluster_name}-ssh.azurehdinsight.net"

  # VNet injection is all-or-nothing: both the subnet id and the vnet id must
  # be supplied together, otherwise the network block is omitted entirely.
  network_enabled = var.subnet_id != null && var.virtual_network_id != null
}

resource "azurerm_hdinsight_spark_cluster" "this" {
  name                = var.cluster_name
  resource_group_name = var.resource_group_name
  location            = var.location
  cluster_version     = var.cluster_version
  tier                = var.tier
  tags                = var.tags

  # Spark engine version, e.g. "3.3" on HDInsight 5.1.
  component_version {
    spark = var.spark_version
  }

  gateway {
    username = var.gateway_username
    password = var.gateway_password
  }

  # ADLS Gen2 as the default cluster filesystem, authenticated with a
  # user-assigned managed identity (no storage account keys in state).
  storage_account_gen2 {
    storage_resource_id          = var.storage_account_id
    filesystem_id                = var.storage_filesystem_id
    managed_identity_resource_id = var.storage_managed_identity_id
    is_default                   = true
  }

  roles {
    head_node {
      vm_size            = var.head_node_vm_size
      username           = var.ssh_username
      ssh_keys           = [var.ssh_public_key]
      subnet_id          = local.network_enabled ? var.subnet_id : null
      virtual_network_id = local.network_enabled ? var.virtual_network_id : null
    }

    worker_node {
      vm_size               = var.worker_node_vm_size
      username              = var.ssh_username
      ssh_keys              = [var.ssh_public_key]
      target_instance_count = var.worker_node_count
      subnet_id             = local.network_enabled ? var.subnet_id : null
      virtual_network_id    = local.network_enabled ? var.virtual_network_id : null

      dynamic "autoscale" {
        for_each = var.autoscale != null ? [var.autoscale] : []
        content {
          # Load-based autoscale: min/max worker bounds.
          dynamic "capacity" {
            for_each = autoscale.value.capacity != null ? [autoscale.value.capacity] : []
            content {
              min_instance_count = capacity.value.min_instance_count
              max_instance_count = capacity.value.max_instance_count
            }
          }

          # Schedule-based autoscale: one or more time/day rules.
          dynamic "recurrence" {
            for_each = autoscale.value.recurrence != null ? [autoscale.value.recurrence] : []
            content {
              timezone = recurrence.value.timezone

              dynamic "schedule" {
                for_each = recurrence.value.schedule
                content {
                  days                  = schedule.value.days
                  time                  = schedule.value.time
                  target_instance_count = schedule.value.target_instance_count
                }
              }
            }
          }
        }
      }
    }

    zookeeper_node {
      vm_size            = var.zookeeper_node_vm_size
      username           = var.ssh_username
      ssh_keys           = [var.ssh_public_key]
      subnet_id          = local.network_enabled ? var.subnet_id : null
      virtual_network_id = local.network_enabled ? var.virtual_network_id : null
    }
  }
}
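The `network_enabled` local above quietly drops VNet injection when only one of `subnet_id` and `virtual_network_id` is supplied. One possible way to surface that mistake is a `check` block, which is available from Terraform 1.5 and so fits the `required_version` pin. The following is a sketch, not part of the published module: the block name `vnet_inputs_paired` is hypothetical, and `check` assertions report warnings rather than failing the run.

```hcl
# Hypothetical addition to main.tf: warn when the two VNet inputs are not set as a pair.
check "vnet_inputs_paired" {
  assert {
    # True when both inputs are null or both are non-null.
    condition     = (var.subnet_id == null) == (var.virtual_network_id == null)
    error_message = "Set subnet_id and virtual_network_id together, or leave both null; with only one set, the cluster is created without VNet injection."
  }
}
```

A warning is a deliberately soft choice here; a caller who wants a hard stop could move the same condition into a `precondition` on the cluster resource.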

variables.tf

variable "cluster_name" {
  type        = string
  description = "Globally-unique HDInsight cluster name (becomes <name>.azurehdinsight.net)."

  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9-]{1,57}[a-z0-9]$", var.cluster_name))
    error_message = "cluster_name must be 3-59 chars, lowercase alphanumeric or hyphen, not starting/ending with a hyphen."
  }
}

variable "resource_group_name" {
  type        = string
  description = "Resource group that will hold the cluster."
}

variable "location" {
  type        = string
  description = "Azure region for the cluster, e.g. centralindia."
}

variable "cluster_version" {
  type        = string
  description = "HDInsight platform version, e.g. \"5.1\"."
  default     = "5.1"
}

variable "spark_version" {
  type        = string
  description = "Apache Spark component version, e.g. \"3.3\" on HDInsight 5.1."
  default     = "3.3"
}

variable "tier" {
  type        = string
  description = "Cluster tier: Standard or Premium (Premium enables ESP/Kerberos with AAD-DS)."
  default     = "Standard"

  validation {
    condition     = contains(["Standard", "Premium"], var.tier)
    error_message = "tier must be either \"Standard\" or \"Premium\"."
  }
}

variable "gateway_username" {
  type        = string
  description = "HTTPS gateway (Ambari) admin username."
  default     = "admin"

  validation {
    # Reject empty or whitespace-only usernames.
    condition     = length(trimspace(var.gateway_username)) > 0
    error_message = "gateway_username must be set."
  }
}

variable "gateway_password" {
  type        = string
  description = "HTTPS gateway admin password. Source from Key Vault; do not hard-code."
  sensitive   = true

  validation {
    condition     = length(var.gateway_password) >= 10
    error_message = "gateway_password must be at least 10 characters."
  }
}

variable "ssh_username" {
  type        = string
  description = "SSH login user applied to every node role."
  default     = "sshuser"
}

variable "ssh_public_key" {
  type        = string
  description = "OpenSSH public key (ssh-rsa/ssh-ed25519 ...) for node SSH access."

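  validation {
    # Match the key-type prefix followed by a space; ECDSA keys start with ecdsa-sha2-nistp256/384/521.
    condition     = can(regex("^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp(256|384|521)) ", var.ssh_public_key))
    error_message = "ssh_public_key must be a valid OpenSSH public key string."
  }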
}

variable "storage_account_id" {
  type        = string
  description = "Resource ID of the ADLS Gen2 (StorageV2 + HNS) storage account."
}

variable "storage_filesystem_id" {
  type        = string
  description = "Resource ID of the ADLS Gen2 filesystem (container) used as the default FS."
}

variable "storage_managed_identity_id" {
  type        = string
  description = "Resource ID of the user-assigned managed identity with Storage Blob Data Owner on the filesystem."
}

variable "head_node_vm_size" {
  type        = string
  description = "VM SKU for the 2 head nodes."
  default     = "Standard_E8_v3"
}

variable "worker_node_vm_size" {
  type        = string
  description = "VM SKU for worker nodes."
  default     = "Standard_E8_v3"
}

variable "zookeeper_node_vm_size" {
  type        = string
  description = "VM SKU for the 3 zookeeper nodes."
  default     = "Standard_A2_v2"
}

variable "worker_node_count" {
  type        = number
  description = "Initial (fixed) worker count. When autoscale is set, keep this within its bounds."
  default     = 3

  validation {
    condition     = var.worker_node_count >= 1 && var.worker_node_count <= 200
    error_message = "worker_node_count must be between 1 and 200."
  }
}

variable "subnet_id" {
  type        = string
  description = "Subnet resource ID for VNet injection. Set together with virtual_network_id."
  default     = null
}

variable "virtual_network_id" {
  type        = string
  description = "Virtual network resource ID for VNet injection. Set together with subnet_id."
  default     = null
}

variable "autoscale" {
  description = <<-EOT
    Optional autoscale policy. Provide EITHER capacity (load-based) OR recurrence
    (schedule-based), not both. Leave null for a fixed-size cluster.
  EOT
  type = object({
    capacity = optional(object({
      min_instance_count = number
      max_instance_count = number
    }))
    recurrence = optional(object({
      timezone = string
      schedule = list(object({
        days                  = list(string)
        time                  = string
        target_instance_count = number
      }))
    }))
  })
  default = null

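  validation {
    # Use a conditional rather than ||: older Terraform releases evaluate both
    # operands of ||, so dereferencing a null var.autoscale would raise an error.
    condition = (
      var.autoscale == null ? true :
      (var.autoscale.capacity != null) != (var.autoscale.recurrence != null)
    )
    error_message = "autoscale must set exactly one of capacity (load-based) or recurrence (schedule-based)."
  }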
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the cluster."
  default     = {}
}
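The `worker_node_count` description asks callers to keep the initial count within the autoscale bounds, but the validations above do not check this, because a variable validation cannot reference another variable on the Terraform versions allowed by the `>= 1.5.0` pin. A minimal sketch of one way to flag it, assuming a `check` block is acceptable (it warns rather than fails) and using the hypothetical block name `worker_count_within_autoscale`:

```hcl
# Hypothetical addition: warn when the initial worker count is outside load-based bounds.
check "worker_count_within_autoscale" {
  assert {
    # Nested conditionals keep the null cases safe: no bounds apply when
    # autoscale is unset or when it is schedule-based (capacity is null).
    condition = (
      var.autoscale == null ? true :
      var.autoscale.capacity == null ? true :
      (
        var.worker_node_count >= var.autoscale.capacity.min_instance_count &&
        var.worker_node_count <= var.autoscale.capacity.max_instance_count
      )
    )
    error_message = "worker_node_count should sit between autoscale.capacity.min_instance_count and max_instance_count."
  }
}
```

Only the load-based case is covered; schedule-based policies carry their own `target_instance_count` per rule, so there is no single floor or ceiling to compare against.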

outputs.tf

output "id" {
  description = "Resource ID of the HDInsight Spark cluster."
  value       = azurerm_hdinsight_spark_cluster.this.id
}

output "name" {
  description = "Name of the HDInsight Spark cluster."
  value       = azurerm_hdinsight_spark_cluster.this.name
}

output "https_endpoint" {
  description = "HTTPS (Ambari/Livy) endpoint of the cluster."
  value       = azurerm_hdinsight_spark_cluster.this.https_endpoint
}

output "ssh_endpoint" {
  description = "SSH endpoint of the cluster."
  value       = azurerm_hdinsight_spark_cluster.this.ssh_endpoint
}

output "cluster_version" {
  description = "Resolved HDInsight platform version of the cluster."
  value       = azurerm_hdinsight_spark_cluster.this.cluster_version
}

How to use it

module "hdinsight" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-hdinsight?ref=v1.0.0"

  cluster_name        = "kv-spark-prod"
  resource_group_name = azurerm_resource_group.analytics.name
  location            = azurerm_resource_group.analytics.location
  cluster_version     = "5.1"
  spark_version       = "3.3"
  tier                = "Standard"

  gateway_username = "ambariadmin"
  gateway_password = data.azurerm_key_vault_secret.hdi_gw.value

  ssh_username   = "sshuser"
  ssh_public_key = file("${path.module}/keys/hdi_id_ed25519.pub")

  # ADLS Gen2 default filesystem + managed identity
  storage_account_id          = azurerm_storage_account.lake.id
  storage_filesystem_id       = azurerm_storage_data_lake_gen2_filesystem.spark.id
  storage_managed_identity_id = azurerm_user_assigned_identity.hdi.id

  # Land the cluster in a private subnet
  subnet_id          = azurerm_subnet.hdinsight.id
  virtual_network_id = azurerm_virtual_network.analytics.id

  head_node_vm_size      = "Standard_E8_v3"
  worker_node_vm_size    = "Standard_E16_v3"
  zookeeper_node_vm_size = "Standard_A2_v2"
  worker_node_count      = 4

  # Scale up for the morning batch window, down overnight (IST)
  autoscale = {
    recurrence = {
      timezone = "India Standard Time"
      schedule = [
        {
          days                  = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
          time                  = "06:00"
          target_instance_count = 10
        },
        {
          days                  = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
          time                  = "20:00"
          target_instance_count = 4
        }
      ]
    }
  }

  tags = {
    env     = "prod"
    owner   = "data-platform"
    service = "spark"
  }
}

# Downstream: surface the Livy URL to a pipeline (ADF/Synapse) that submits Spark jobs.
# https_endpoint is the gateway hostname (see the Outputs table), so prepend the
# scheme and append the /livy path.
output "spark_livy_url" {
  description = "Livy REST URL used by orchestrators (ADF/Synapse) to submit Spark batches."
  value       = "https://${module.hdinsight.https_endpoint}/livy"
}
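The module call above references a Key Vault secret and a user-assigned managed identity that it does not create. A rough sketch of those supporting resources follows, under stated assumptions: `azurerm_resource_group.analytics` and `azurerm_storage_account.lake` are the resources already referenced in the example, while `azurerm_key_vault.platform`, the secret name, and the identity name are hypothetical placeholders. The role is assigned at storage-account scope here for simplicity; a narrower scope may suit your environment better.

```hcl
# Identity the cluster uses to reach ADLS Gen2 (name is illustrative).
resource "azurerm_user_assigned_identity" "hdi" {
  name                = "id-hdi-spark"
  resource_group_name = azurerm_resource_group.analytics.name
  location            = azurerm_resource_group.analytics.location
}

# Grant the identity Storage Blob Data Owner, as the module's inputs require.
resource "azurerm_role_assignment" "hdi_lake_owner" {
  scope                = azurerm_storage_account.lake.id
  role_definition_name = "Storage Blob Data Owner"
  principal_id         = azurerm_user_assigned_identity.hdi.principal_id
}

# Gateway password read from Key Vault instead of being hard-coded.
data "azurerm_key_vault_secret" "hdi_gw" {
  name         = "hdi-gateway-password"        # hypothetical secret name
  key_vault_id = azurerm_key_vault.platform.id # hypothetical Key Vault
}
```

Because role assignments can take a short while to propagate, it may be worth adding `depends_on = [azurerm_role_assignment.hdi_lake_owner]` to the module block so the cluster is not created before the identity can reach the filesystem.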

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state bucket/container + key per path...
  }
}

2. Module config, live/prod/hdinsight/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-hdinsight?ref=v1.0.0"
}

inputs = {
  cluster_name                = "..."
  resource_group_name         = "..."
  location                    = "..."
  gateway_password            = "..."
  ssh_public_key              = "..."
  storage_account_id          = "..."
  storage_filesystem_id       = "..."
  storage_managed_identity_id = "..."
}

3. Deploy one environment, or roll out all modules together:

# Run either command from the repository root.
cd live/prod/hdinsight && terragrunt apply    # this module only
cd live/prod && terragrunt run-all apply      # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
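The dependency orchestration mentioned above could look roughly like this for VNet injection. This is a sketch that assumes a sibling `live/prod/network` Terragrunt module exporting `subnet_id` and `virtual_network_id` outputs; neither that module nor its output names come from this article.

```hcl
# live/prod/hdinsight/terragrunt.hcl (sketch)
# Assumes a hypothetical sibling module at live/prod/network.
dependency "network" {
  config_path = "../network"
}

inputs = {
  # Add these alongside the required inputs shown in step 2.
  subnet_id          = dependency.network.outputs.subnet_id
  virtual_network_id = dependency.network.outputs.virtual_network_id
}
```

With the `dependency` block in place, `run-all apply` should order the network module before this one.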

Inputs

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| cluster_name | string | - | Yes | Globally-unique cluster name; becomes <name>.azurehdinsight.net. |
| resource_group_name | string | - | Yes | Resource group for the cluster. |
| location | string | - | Yes | Azure region, e.g. centralindia. |
| cluster_version | string | "5.1" | No | HDInsight platform version. |
| spark_version | string | "3.3" | No | Apache Spark component version. |
| tier | string | "Standard" | No | Standard or Premium (Premium enables ESP/Kerberos). |
| gateway_username | string | "admin" | No | HTTPS gateway (Ambari) admin user. |
| gateway_password | string (sensitive) | - | Yes | Gateway admin password (≥10 chars); source from Key Vault. |
| ssh_username | string | "sshuser" | No | SSH login user for all node roles. |
| ssh_public_key | string | - | Yes | OpenSSH public key for node SSH access. |
| storage_account_id | string | - | Yes | Resource ID of the ADLS Gen2 storage account. |
| storage_filesystem_id | string | - | Yes | Resource ID of the ADLS Gen2 filesystem (default FS). |
| storage_managed_identity_id | string | - | Yes | UAMI with Storage Blob Data Owner on the filesystem. |
| head_node_vm_size | string | "Standard_E8_v3" | No | VM SKU for the 2 head nodes. |
| worker_node_vm_size | string | "Standard_E8_v3" | No | VM SKU for worker nodes. |
| zookeeper_node_vm_size | string | "Standard_A2_v2" | No | VM SKU for the 3 zookeeper nodes. |
| worker_node_count | number | 3 | No | Initial worker count (1–200); keep within autoscale bounds. |
| subnet_id | string | null | No | Subnet ID for VNet injection (set with virtual_network_id). |
| virtual_network_id | string | null | No | VNet ID for VNet injection (set with subnet_id). |
| autoscale | object | null | No | Load-based (capacity) or schedule-based (recurrence) policy. |
| tags | map(string) | {} | No | Tags applied to the cluster. |
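The usage example earlier shows the schedule-based form of `autoscale`. For comparison, a load-based value might look like the sketch below; the local name `hdi_autoscale_load_based` and the bounds are illustrative assumptions, not recommendations.

```hcl
# Sketch: a load-based autoscale value to pass as the module's autoscale input.
locals {
  hdi_autoscale_load_based = {
    capacity = {
      min_instance_count = 3  # illustrative floor
      max_instance_count = 10 # illustrative ceiling
    }
    # recurrence is left unset: the module accepts exactly one of the two.
  }
}

# In the module block:
#   autoscale = local.hdi_autoscale_load_based
```

The default `worker_node_count` of 3 sits inside these bounds; if you raise the floor, raise the initial count with it.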

Outputs

| Name | Description |
|---|---|
| id | Resource ID of the HDInsight Spark cluster. |
| name | Name of the cluster. |
| https_endpoint | HTTPS (Ambari/Livy) endpoint, e.g. kv-spark-prod.azurehdinsight.net. |
| ssh_endpoint | SSH endpoint of the cluster. |
| cluster_version | Resolved HDInsight platform version. |

Enterprise scenario

A logistics company runs an overnight ETL that joins six months of shipment telemetry from ADLS Gen2 and writes curated Delta tables for the next day’s dashboards. The data platform team consumes this module pinned to ref=v1.0.0 from their Azure DevOps terraform-modules repo, deploying a VNet-injected Spark 3.3 cluster into a private subnet that reaches the lake over a service endpoint and SQL Managed Instance over a peered VNet. Schedule-based autoscale ramps workers from 4 to 10 at 06:00 IST for the heavy join window and back to 4 at 20:00, and Azure Data Factory submits the job through the module’s https_endpoint (Livy). The result is a reviewed, identical cluster in dev/test/prod with no storage keys in state — auth flows entirely through the user-assigned managed identity.

Best practices
