Quick take — A reusable azurerm ~> 4.0 Terraform module for Azure HDInsight Spark: head/worker/zookeeper node sizing, ADLS Gen2 storage, VNet injection, schedule and load-based autoscale, and gateway auth. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "azurerm" {
features {}
}
module "hdinsight" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-hdinsight?ref=v1.0.0"
cluster_name = "..." # Globally-unique cluster name; becomes `<name>.azurehdin…
resource_group_name = "..." # Resource group for the cluster.
location = "..." # Azure region, e.g. `centralindia`.
gateway_password = "..." # Gateway admin password (≥10 chars); source from Key Vau…
ssh_public_key = "..." # OpenSSH public key for node SSH access.
storage_account_id = "..." # Resource ID of the ADLS Gen2 storage account.
storage_filesystem_id = "..." # Resource ID of the ADLS Gen2 filesystem (default FS).
storage_managed_identity_id = "..." # UAMI with Storage Blob Data Owner on the filesystem.
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
Azure HDInsight is Microsoft’s managed, open-source analytics service: it spins up fully-provisioned Hadoop-ecosystem clusters — Spark, Hive (Interactive Query), Kafka, HBase — on top of Azure VMs without you having to install or patch the stack yourself. A Spark cluster on HDInsight gives you a multi-node Apache Spark runtime with Jupyter/Zeppelin notebooks, a Livy REST endpoint and a Thrift/JDBC server, backed by Azure storage as the cluster filesystem.
The problem is that a correct Spark cluster is verbose to declare. You have three distinct node roles (head, worker, zookeeper), each with its own VM SKU and credentials; a gateway with HTTPS basic-auth; an SSH login; a storage account container or — for production — an ADLS Gen2 filesystem with a managed identity; usually VNet injection so the cluster lands in a private subnet; and an autoscale block that is different depending on whether you scale by schedule or by load. Hand-writing all of that per environment invites drift and copy-paste mistakes (mismatched usernames, a worker count below the autoscale floor, a storage key in plain HCL).
This module wraps azurerm_hdinsight_spark_cluster so the caller passes a handful of vetted inputs — name, tier, Spark version, node SKUs, the ADLS Gen2 filesystem + identity, optional VNet subnet, and an optional autoscale policy — and gets back a hardened cluster plus its HTTPS and SSH endpoints as outputs. Validations stop the most common foot-guns before apply.
When to use it
- You run batch or streaming Spark jobs (ETL, feature engineering, ad-hoc Notebook analytics) and want a managed cluster rather than self-hosting Spark on AKS or VMs.
- You need the cluster inside a VNet for private connectivity to data sources (SQL MI, private storage, on-prem over ExpressRoute) and governed egress.
- You want a repeatable, reviewable cluster definition across dev/test/prod with consistent sizing, tags and autoscale, instead of clicking through the portal.
- Your data lake is ADLS Gen2 and you want the cluster authenticated to it via a user-assigned managed identity, not storage keys.
- Skip this module if you only need short-lived, on-demand Spark with no cluster to keep warm: Azure Synapse Spark pools or Databricks job clusters are a better fit. HDInsight clusters bill for as long as they exist, so they suit long-running workloads or ones that run on a regular schedule.
Module structure
terraform-module-azure-hdinsight/
├── versions.tf # provider + Terraform version pins
├── main.tf # azurerm_hdinsight_spark_cluster + autoscale wiring
├── variables.tf # var-driven inputs with validations
└── outputs.tf # id, name, https/ssh endpoints
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 4.0"
}
}
}
main.tf
locals {
# The HDInsight gateway always exposes HTTPS on the cluster's public/private
# endpoint as https://<name>.azurehdinsight.net. SSH lands on the -ssh host.
https_endpoint = "https://${var.cluster_name}.azurehdinsight.net"
ssh_endpoint = "${var.cluster_name}-ssh.azurehdinsight.net"
# VNet injection is all-or-nothing: subnet_id and virtual_network_id are passed
# to the node roles only when both are set; otherwise both are left null.
network_enabled = var.subnet_id != null && var.virtual_network_id != null
}
resource "azurerm_hdinsight_spark_cluster" "this" {
name = var.cluster_name
resource_group_name = var.resource_group_name
location = var.location
cluster_version = var.cluster_version
tier = var.tier
tags = var.tags
# Spark engine version, e.g. "3.3" on HDInsight 5.1.
component_version {
spark = var.spark_version
}
gateway {
username = var.gateway_username
password = var.gateway_password
}
# ADLS Gen2 as the default cluster filesystem, authenticated with a
# user-assigned managed identity (no storage account keys in state).
storage_account_gen2 {
storage_resource_id = var.storage_account_id
filesystem_id = var.storage_filesystem_id
managed_identity_resource_id = var.storage_managed_identity_id
is_default = true
}
roles {
head_node {
vm_size = var.head_node_vm_size
username = var.ssh_username
ssh_keys = [var.ssh_public_key]
subnet_id = local.network_enabled ? var.subnet_id : null
virtual_network_id = local.network_enabled ? var.virtual_network_id : null
}
worker_node {
vm_size = var.worker_node_vm_size
username = var.ssh_username
ssh_keys = [var.ssh_public_key]
target_instance_count = var.worker_node_count
subnet_id = local.network_enabled ? var.subnet_id : null
virtual_network_id = local.network_enabled ? var.virtual_network_id : null
dynamic "autoscale" {
for_each = var.autoscale != null ? [var.autoscale] : []
content {
# Load-based autoscale: min/max worker bounds.
dynamic "capacity" {
for_each = autoscale.value.capacity != null ? [autoscale.value.capacity] : []
content {
min_instance_count = capacity.value.min_instance_count
max_instance_count = capacity.value.max_instance_count
}
}
# Schedule-based autoscale: one or more time/day rules.
dynamic "recurrence" {
for_each = autoscale.value.recurrence != null ? [autoscale.value.recurrence] : []
content {
timezone = recurrence.value.timezone
dynamic "schedule" {
for_each = recurrence.value.schedule
content {
days = schedule.value.days
time = schedule.value.time
target_instance_count = schedule.value.target_instance_count
}
}
}
}
}
}
}
zookeeper_node {
vm_size = var.zookeeper_node_vm_size
username = var.ssh_username
ssh_keys = [var.ssh_public_key]
subnet_id = local.network_enabled ? var.subnet_id : null
virtual_network_id = local.network_enabled ? var.virtual_network_id : null
}
}
}
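The motivation above mentions a worker count below the autoscale floor, but main.tf as shown does not check for it. One possible guard is a plan-time precondition. The sketch below is an assumption and not part of the published module: terraform_data.autoscale_guard is a hypothetical resource (terraform_data is built into Terraform 1.4 and later), and it only covers load-based (capacity) policies.

```hcl
# Hypothetical guard, not part of the published module.
# Fails the plan when load-based autoscale is configured and
# worker_node_count falls outside its min/max bounds.
resource "terraform_data" "autoscale_guard" {
  lifecycle {
    precondition {
      # try() returns true when autoscale or autoscale.capacity is null,
      # because reading an attribute of a null object raises an error.
      condition = try(
        var.worker_node_count >= var.autoscale.capacity.min_instance_count &&
        var.worker_node_count <= var.autoscale.capacity.max_instance_count,
        true
      )
      error_message = "worker_node_count must sit within autoscale.capacity min/max."
    }
  }
}
```

With a guard like this in place, a caller passing the inputs below would plan cleanly, while worker_node_count = 2 would be rejected before anything reaches Azure:

```hcl
worker_node_count = 4

autoscale = {
  capacity = {
    min_instance_count = 3
    max_instance_count = 10
  }
}
```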
variables.tf
variable "cluster_name" {
type = string
description = "Globally-unique HDInsight cluster name (becomes <name>.azurehdinsight.net)."
validation {
condition = can(regex("^[a-z0-9][a-z0-9-]{1,57}[a-z0-9]$", var.cluster_name))
error_message = "cluster_name must be 3-59 chars, lowercase alphanumeric or hyphen, not starting/ending with a hyphen."
}
}
variable "resource_group_name" {
type = string
description = "Resource group that will hold the cluster."
}
variable "location" {
type = string
description = "Azure region for the cluster, e.g. centralindia."
}
variable "cluster_version" {
type = string
description = "HDInsight platform version, e.g. \"5.1\"."
default = "5.1"
}
variable "spark_version" {
type = string
description = "Apache Spark component version, e.g. \"3.3\" on HDInsight 5.1."
default = "3.3"
}
variable "tier" {
type = string
description = "Cluster tier: Standard or Premium (Premium enables ESP/Kerberos with AAD-DS)."
default = "Standard"
validation {
condition = contains(["Standard", "Premium"], var.tier)
error_message = "tier must be either \"Standard\" or \"Premium\"."
}
}
variable "gateway_username" {
type = string
description = "HTTPS gateway (Ambari) admin username."
default = "admin"
validation {
condition = length(trimspace(var.gateway_username)) > 0
error_message = "gateway_username must be a non-empty string."
}
}
variable "gateway_password" {
type = string
description = "HTTPS gateway admin password. Source from Key Vault; do not hard-code."
sensitive = true
validation {
condition = length(var.gateway_password) >= 10
error_message = "gateway_password must be at least 10 characters."
}
}
variable "ssh_username" {
type = string
description = "SSH login user applied to every node role."
default = "sshuser"
}
variable "ssh_public_key" {
type = string
description = "OpenSSH public key (ssh-rsa/ssh-ed25519 ...) for node SSH access."
validation {
condition = can(regex("^(ssh-rsa|ssh-ed25519|ecdsa-sha2-nistp(256|384|521)) ", var.ssh_public_key))
error_message = "ssh_public_key must be a valid OpenSSH public key string."
}
}
variable "storage_account_id" {
type = string
description = "Resource ID of the ADLS Gen2 (StorageV2 + HNS) storage account."
}
variable "storage_filesystem_id" {
type = string
description = "Resource ID of the ADLS Gen2 filesystem (container) used as the default FS."
}
variable "storage_managed_identity_id" {
type = string
description = "Resource ID of the user-assigned managed identity with Storage Blob Data Owner on the filesystem."
}
variable "head_node_vm_size" {
type = string
description = "VM SKU for the 2 head nodes."
default = "Standard_E8_v3"
}
variable "worker_node_vm_size" {
type = string
description = "VM SKU for worker nodes."
default = "Standard_E8_v3"
}
variable "zookeeper_node_vm_size" {
type = string
description = "VM SKU for the 3 zookeeper nodes."
default = "Standard_A2_v2"
}
variable "worker_node_count" {
type = number
description = "Initial (fixed) worker count. When autoscale is set, keep this within its bounds."
default = 3
validation {
condition = var.worker_node_count >= 1 && var.worker_node_count <= 200
error_message = "worker_node_count must be between 1 and 200."
}
}
variable "subnet_id" {
type = string
description = "Subnet resource ID for VNet injection. Set together with virtual_network_id."
default = null
}
variable "virtual_network_id" {
type = string
description = "Virtual network resource ID for VNet injection. Set together with subnet_id."
default = null
}
variable "autoscale" {
description = <<-EOT
Optional autoscale policy. Provide EITHER capacity (load-based) OR recurrence
(schedule-based), not both. Leave null for a fixed-size cluster.
EOT
type = object({
capacity = optional(object({
min_instance_count = number
max_instance_count = number
}))
recurrence = optional(object({
timezone = string
schedule = list(object({
days = list(string)
time = string
target_instance_count = number
}))
}))
})
default = null
validation {
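# Conditional rather than ||, so a null autoscale is never dereferenced on
# Terraform versions where || evaluates both operands.
condition = var.autoscale == null ? true : (
(var.autoscale.capacity != null) != (var.autoscale.recurrence != null)
)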
error_message = "autoscale must set exactly one of capacity (load-based) or recurrence (schedule-based)."
}
}
variable "tags" {
type = map(string)
description = "Tags applied to the cluster."
default = {}
}
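To see the validations above reject bad input without touching Azure, a native Terraform test is one option. This is a sketch under assumptions: the file name tests/validation.tftest.hcl is hypothetical, mock_provider needs a Terraform 1.7 or newer CLI (the module itself only pins >= 1.5.0), and every value is a dummy. Treat it as a starting point rather than a verified test suite.

```hcl
# tests/validation.tftest.hcl (hypothetical file; needs Terraform >= 1.7)

# Mock the provider so planning requires no Azure credentials.
mock_provider "azurerm" {}

# Baseline inputs shared by every run block. All values are dummies.
variables {
  cluster_name                = "kv-spark-test"
  resource_group_name         = "rg-test"
  location                    = "centralindia"
  gateway_password            = "Dummy-Passw0rd-123"
  ssh_public_key              = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQ== test"
  storage_account_id          = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-test/providers/Microsoft.Storage/storageAccounts/sttest"
  storage_filesystem_id       = "https://sttest.dfs.core.windows.net/spark"
  storage_managed_identity_id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-test/providers/Microsoft.ManagedIdentity/userAssignedIdentities/id-test"
}

# Uppercase letters and underscores violate the cluster_name regex.
run "rejects_invalid_cluster_name" {
  command = plan

  variables {
    cluster_name = "KV_Spark"
  }

  expect_failures = [var.cluster_name]
}

# Fewer than 10 characters violates the gateway_password length rule.
run "rejects_short_gateway_password" {
  command = plan

  variables {
    gateway_password = "short"
  }

  expect_failures = [var.gateway_password]
}
```

Run it with terraform test from the module directory. Each run block passes only if the named variable's validation fails.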
outputs.tf
output "id" {
description = "Resource ID of the HDInsight Spark cluster."
value = azurerm_hdinsight_spark_cluster.this.id
}
output "name" {
description = "Name of the HDInsight Spark cluster."
value = azurerm_hdinsight_spark_cluster.this.name
}
output "https_endpoint" {
description = "HTTPS (Ambari/Livy) endpoint of the cluster."
value = azurerm_hdinsight_spark_cluster.this.https_endpoint
}
output "ssh_endpoint" {
description = "SSH endpoint of the cluster."
value = azurerm_hdinsight_spark_cluster.this.ssh_endpoint
}
output "cluster_version" {
description = "Resolved HDInsight platform version of the cluster."
value = azurerm_hdinsight_spark_cluster.this.cluster_version
}
How to use it
module "hdinsight" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-hdinsight?ref=v1.0.0"
cluster_name = "kv-spark-prod"
resource_group_name = azurerm_resource_group.analytics.name
location = azurerm_resource_group.analytics.location
cluster_version = "5.1"
spark_version = "3.3"
tier = "Standard"
gateway_username = "ambariadmin"
gateway_password = data.azurerm_key_vault_secret.hdi_gw.value
ssh_username = "sshuser"
ssh_public_key = file("${path.module}/keys/hdi_id_ed25519.pub")
# ADLS Gen2 default filesystem + managed identity
storage_account_id = azurerm_storage_account.lake.id
storage_filesystem_id = azurerm_storage_data_lake_gen2_filesystem.spark.id
storage_managed_identity_id = azurerm_user_assigned_identity.hdi.id
# Land the cluster in a private subnet
subnet_id = azurerm_subnet.hdinsight.id
virtual_network_id = azurerm_virtual_network.analytics.id
head_node_vm_size = "Standard_E8_v3"
worker_node_vm_size = "Standard_E16_v3"
zookeeper_node_vm_size = "Standard_A2_v2"
worker_node_count = 4
# Scale up for the morning batch window, down overnight (IST)
autoscale = {
recurrence = {
timezone = "India Standard Time"
schedule = [
{
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
time = "06:00"
target_instance_count = 10
},
{
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
time = "20:00"
target_instance_count = 4
}
]
}
}
tags = {
env = "prod"
owner = "data-platform"
service = "spark"
}
}
# Downstream: surface the HTTPS endpoint (which fronts Livy) to a pipeline
# that submits Spark jobs.
output "spark_livy_url" {
description = "HTTPS endpoint used by orchestrators (ADF/Synapse) to submit Spark batches via Livy."
value = module.hdinsight.https_endpoint
}
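The example above references a storage account, filesystem, managed identity and Key Vault secret that it does not define. A minimal sketch of those supporting resources could look like the following. All resource names, the Key Vault name kv-platform-prod and the secret name hdi-gateway-password are assumptions for illustration; azurerm_resource_group.analytics is the same resource group the example already refers to.

```hcl
# Supporting resources for the module call above. Names are hypothetical.

resource "azurerm_storage_account" "lake" {
  name                     = "kvsparklakeprod" # must be globally unique
  resource_group_name      = azurerm_resource_group.analytics.name
  location                 = azurerm_resource_group.analytics.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = true # hierarchical namespace, i.e. ADLS Gen2
}

resource "azurerm_storage_data_lake_gen2_filesystem" "spark" {
  name               = "spark"
  storage_account_id = azurerm_storage_account.lake.id
}

resource "azurerm_user_assigned_identity" "hdi" {
  name                = "id-hdi-spark-prod"
  resource_group_name = azurerm_resource_group.analytics.name
  location            = azurerm_resource_group.analytics.location
}

# Grant the identity Storage Blob Data Owner, scoped to the filesystem only.
resource "azurerm_role_assignment" "hdi_lake_owner" {
  scope                = "${azurerm_storage_account.lake.id}/blobServices/default/containers/${azurerm_storage_data_lake_gen2_filesystem.spark.name}"
  role_definition_name = "Storage Blob Data Owner"
  principal_id         = azurerm_user_assigned_identity.hdi.principal_id
}

# Read the gateway password from an existing Key Vault.
data "azurerm_key_vault" "platform" {
  name                = "kv-platform-prod"
  resource_group_name = azurerm_resource_group.analytics.name
}

data "azurerm_key_vault_secret" "hdi_gw" {
  name         = "hdi-gateway-password"
  key_vault_id = data.azurerm_key_vault.platform.id
}
```

The role assignment has to exist before the cluster is created, so it is worth adding depends_on = [azurerm_role_assignment.hdi_lake_owner] to the module block. Note that a secret read through a data source is still stored in Terraform state, so the state backend needs the same protection as the vault.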
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "azurerm"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...azurerm state bucket/container + key per path...
}
}
2. Module config — live/prod/hdinsight/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-hdinsight?ref=v1.0.0"
}
inputs = {
cluster_name = "..."
resource_group_name = "..."
location = "..."
gateway_password = "..."
ssh_public_key = "..."
storage_account_id = "..."
storage_filesystem_id = "..."
storage_managed_identity_id = "..."
}
3. Deploy one environment, or roll out all modules together:
(cd live/prod/hdinsight && terragrunt apply) # this module only
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
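The cross-module ordering mentioned above is declared with Terragrunt dependency blocks. As a sketch, assuming a hypothetical sibling stack at live/prod/network that exports outputs named subnet_id and virtual_network_id, the HDInsight config could take its VNet inputs like this:

```hcl
# live/prod/hdinsight/terragrunt.hcl (fragment)
# The ../network stack and its output names are assumptions for illustration.

dependency "network" {
  config_path = "../network"

  # Placeholder values so validate/plan can run before the network is applied.
  mock_outputs = {
    subnet_id          = "mock-subnet-id"
    virtual_network_id = "mock-vnet-id"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  # ...the required inputs shown earlier...
  subnet_id          = dependency.network.outputs.subnet_id
  virtual_network_id = dependency.network.outputs.virtual_network_id
}
```

With the dependency declared, run-all applies the network stack before this one.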
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| cluster_name | string | n/a | Yes | Globally-unique cluster name; becomes <name>.azurehdinsight.net. |
| resource_group_name | string | n/a | Yes | Resource group for the cluster. |
| location | string | n/a | Yes | Azure region, e.g. centralindia. |
| cluster_version | string | "5.1" | No | HDInsight platform version. |
| spark_version | string | "3.3" | No | Apache Spark component version. |
| tier | string | "Standard" | No | Standard or Premium (Premium enables ESP/Kerberos). |
| gateway_username | string | "admin" | No | HTTPS gateway (Ambari) admin user. |
| gateway_password | string (sensitive) | n/a | Yes | Gateway admin password (≥10 chars); source from Key Vault. |
| ssh_username | string | "sshuser" | No | SSH login user for all node roles. |
| ssh_public_key | string | n/a | Yes | OpenSSH public key for node SSH access. |
| storage_account_id | string | n/a | Yes | Resource ID of the ADLS Gen2 storage account. |
| storage_filesystem_id | string | n/a | Yes | Resource ID of the ADLS Gen2 filesystem (default FS). |
| storage_managed_identity_id | string | n/a | Yes | UAMI with Storage Blob Data Owner on the filesystem. |
| head_node_vm_size | string | "Standard_E8_v3" | No | VM SKU for the 2 head nodes. |
| worker_node_vm_size | string | "Standard_E8_v3" | No | VM SKU for worker nodes. |
| zookeeper_node_vm_size | string | "Standard_A2_v2" | No | VM SKU for the 3 zookeeper nodes. |
| worker_node_count | number | 3 | No | Initial worker count (1–200); keep within autoscale bounds. |
| subnet_id | string | null | No | Subnet ID for VNet injection (set with virtual_network_id). |
| virtual_network_id | string | null | No | VNet ID for VNet injection (set with subnet_id). |
| autoscale | object | null | No | Load-based (capacity) or schedule-based (recurrence) policy. |
| tags | map(string) | {} | No | Tags applied to the cluster. |
Outputs
| Name | Description |
|---|---|
| id | Resource ID of the HDInsight Spark cluster. |
| name | Name of the cluster. |
| https_endpoint | HTTPS (Ambari/Livy) endpoint, e.g. kv-spark-prod.azurehdinsight.net. |
| ssh_endpoint | SSH endpoint of the cluster. |
| cluster_version | Resolved HDInsight platform version. |
Enterprise scenario
A logistics company runs an overnight ETL that joins six months of shipment telemetry from ADLS Gen2 and writes curated Delta tables for the next day’s dashboards. The data platform team consumes this module pinned to ref=v1.0.0 from their Azure DevOps terraform-modules repo, deploying a VNet-injected Spark 3.3 cluster into a private subnet that reaches the lake over a service endpoint and SQL Managed Instance over a peered VNet. Schedule-based autoscale ramps workers from 4 to 10 at 06:00 IST for the heavy join window and back to 4 at 20:00, and Azure Data Factory submits the job through the module’s https_endpoint (Livy). The result is a reviewed, identical cluster in dev/test/prod with no storage keys in state — auth flows entirely through the user-assigned managed identity.
Best practices
- Authenticate to storage with a managed identity, never keys. Use storage_account_gen2 with a user-assigned identity that holds Storage Blob Data Owner on the filesystem, so no account key ever lands in Terraform state or the cluster config.
- Inject into a VNet and lock down the gateway. Place head/worker/zookeeper nodes in a dedicated subnet, apply the HDInsight service-tag NSG rules, and keep the HTTPS gateway off the public internet; reach Ambari/Livy from inside the VNet or via private connectivity.
- Choose autoscale to match the workload shape. Use recurrence (schedule) for predictable batch windows and capacity (load) for bursty interactive use, and make sure worker_node_count sits within the min/max so the first apply does not immediately trigger a resize.
- Right-size and shut down for cost. HDInsight bills per node-hour while the cluster exists; pick memory-optimized E-series workers for Spark, keep zookeepers on cheap A2_v2, and tear down or scale to the floor when no jobs run, because there is no “pause”.
- Keep credentials out of HCL and pin Spark/platform versions. Pull gateway_password from Key Vault, supply SSH via a public key file, and pin cluster_version/spark_version explicitly so a provider upgrade never silently moves you to a new HDInsight image.
- Name and tag for governance. Use a consistent kv-spark-<env> convention (lowercase, hyphenated, ≤59 chars) and tag every cluster with env/owner/service for cost allocation and lifecycle automation.
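The service-tag NSG rule from the VNet bullet can be sketched as follows. It assumes a hypothetical azurerm_network_security_group.hdinsight already associated with the cluster subnet and reuses azurerm_resource_group.analytics from the usage example; the rule name and priority are arbitrary. It allows inbound management traffic on port 443 from the HDInsight service tag, and your environment may need further rules.

```hcl
# Hypothetical rule; the NSG, rule name and priority are assumptions.
resource "azurerm_network_security_rule" "hdinsight_management" {
  name                        = "allow-hdinsight-management"
  resource_group_name         = azurerm_resource_group.analytics.name
  network_security_group_name = azurerm_network_security_group.hdinsight.name

  priority  = 300
  direction = "Inbound"
  access    = "Allow"
  protocol  = "Tcp"

  source_address_prefix      = "HDInsight" # Azure service tag
  source_port_range          = "*"
  destination_address_prefix = "VirtualNetwork"
  destination_port_range     = "443"
}
```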