Terraform Module: Azure Data Lake Storage Gen2 — secure, HNS-enabled lake with governed filesystems

Quick take: A reusable hashicorp/azurerm ~> 4.0 module for Azure Data Lake Storage Gen2: an HNS-enabled storage account, governed filesystems, configurable redundancy (ZRS by default), private networking, and lifecycle tiering, all driven by variables. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
}

module "data_lake" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"

  storage_account_name = "..."  # Globally unique account name (3-24 chars, lowercase alp…
  resource_group_name  = "..."  # Resource group to deploy into.
  location             = "..."  # Azure region (e.g. `centralindia`).
}

Then run terraform init && terraform apply. Every other input has a sensible default; see Inputs below to override behaviour.
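
Two provider details are worth checking before the first apply. The sketch below assumes azurerm 4.x, which expects a subscription ID either in the provider block or through the ARM_SUBSCRIPTION_ID environment variable. It also sets storage_use_azuread = true so the provider talks to the storage data plane with Entra ID instead of an account key, which matters here because shared_access_key_enabled defaults to false. The variable name subscription_id is a placeholder for illustration, not one of the module's inputs.

```hcl
# Hypothetical root-level variable; not an input of the module.
variable "subscription_id" {
  description = "Azure subscription to deploy into."
  type        = string
}

provider "azurerm" {
  features {}

  # azurerm 4.x needs this here or via ARM_SUBSCRIPTION_ID.
  subscription_id = var.subscription_id

  # Use Entra ID (not shared keys) for storage data-plane calls,
  # such as creating the Data Lake Gen2 filesystems.
  storage_use_azuread = true
}
```

With this setup the identity running Terraform would likely also need a data-plane role on the account (for example Storage Blob Data Contributor) and network reach to the dfs endpoint when public access is off.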

What this module is

Azure Data Lake Storage Gen2 (ADLS Gen2) is not a separate service; it is an Azure Storage account with the Hierarchical Namespace (HNS) feature switched on. That one flag (is_hns_enabled = true) is what turns flat blob containers into true filesystems with real directories, atomic directory renames/deletes, and POSIX-style ACLs that Spark, Databricks, Synapse, and Fabric all expect from abfss:// endpoints. HNS cannot be switched off once it is on, and enabling it on an existing flat account means a one-way upgrade rather than a simple toggle, so getting the storage account right on the first apply matters more here than almost anywhere else in Azure.

This module wraps the two resources that define a lake — azurerm_storage_account (the HNS-enabled account) and azurerm_storage_data_lake_gen2_filesystem (the abfss:// filesystems, e.g. bronze, silver, gold) — and bundles the production options teams always end up bolting on: a lifecycle management policy for hot→cool→archive tiering, optional private-endpoint-friendly network rules, blob soft delete, and infrastructure encryption. Wrapping it in a module means every lake in the estate is born HNS-enabled, TLS 1.2-only, with public blob access off and the medallion filesystems pre-created — instead of someone clicking “create storage account,” forgetting the HNS checkbox, and discovering three sprints later that Databricks can’t mount it.
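
To make the abfss:// addressing concrete, here is a minimal sketch that builds one URI per filesystem from the module's filesystem_names and primary_dfs_host outputs. It assumes the module is instantiated as module.data_lake, as in the Quickstart; the local and output names are illustrative.

```hcl
locals {
  # Map of filesystem name => abfss:// root URI, e.g.
  # bronze => abfss://bronze@<account>.dfs.core.windows.net/
  abfss_uris = {
    for name in module.data_lake.filesystem_names :
    name => "abfss://${name}@${module.data_lake.primary_dfs_host}/"
  }
}

# Expose the map so Spark/Databricks configuration can consume it.
output "abfss_uris" {
  description = "abfss:// root URI for each filesystem in the lake."
  value       = local.abfss_uris
}
```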

When to use it

You are standing up a medallion / lakehouse (bronze/silver/gold) for Databricks, Synapse Spark, or Microsoft Fabric and need abfss:// filesystems, not flat blob containers.
You want every data lake in the org to be consistent by construction: HNS on, TLS 1.2 minimum, public access disabled, soft delete enabled, and lifecycle tiering attached.
You are landing raw data that ages predictably and want cost control via automatic hot→cool→archive transitions instead of paying hot rates for cold history.
You need the lake to be reachable only over private endpoints with default_action = "Deny" and a curated IP/subnet allowlist for governance and audit-ready compliance.
Do not use this for general-purpose blob/object storage, static website hosting, or queue/table workloads: HNS adds per-transaction cost and constraints (e.g., limits on combining immutable WORM policies with blob versioning) that those workloads don’t need.

Module structure

terraform-module-azure-data-lake/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # storage account (HNS) + filesystems + lifecycle policy
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # ids, dfs endpoint, filesystem ids

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

# main.tf

# HNS-enabled storage account — this is what makes it a Data Lake Gen2 account.
# is_hns_enabled CANNOT be changed after creation, so it is forced true here.
resource "azurerm_storage_account" "this" {
  name                = var.storage_account_name
  resource_group_name = var.resource_group_name
  location            = var.location

  account_tier             = "Standard"
  account_kind             = "StorageV2"
  account_replication_type = var.replication_type

  # The defining ADLS Gen2 setting.
  is_hns_enabled = true

  # Hardened defaults.
  min_tls_version                   = "TLS1_2"
  https_traffic_only_enabled        = true
  allow_nested_items_to_be_public   = false
  shared_access_key_enabled         = var.shared_access_key_enabled
  public_network_access_enabled     = var.public_network_access_enabled
  default_to_oauth_authentication   = true
  infrastructure_encryption_enabled = var.infrastructure_encryption_enabled

  blob_properties {
    delete_retention_policy {
      days = var.blob_soft_delete_retention_days
    }
    container_delete_retention_policy {
      days = var.container_soft_delete_retention_days
    }
  }

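  # When restricting access, default-deny and allow only curated subnets/IPs.
  # Caveat (worth verifying for your setup): with public network access disabled,
  # Azure accepts traffic only through private endpoints, so IP and subnet
  # allowlists apply only where the public endpoint stays enabled.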
  dynamic "network_rules" {
    for_each = var.public_network_access_enabled ? [] : [1]
    content {
      default_action             = "Deny"
      bypass                     = var.network_bypass
      ip_rules                   = var.allowed_ip_rules
      virtual_network_subnet_ids = var.allowed_subnet_ids
    }
  }

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags
}

# Medallion (or arbitrary) filesystems exposed over abfss://.
resource "azurerm_storage_data_lake_gen2_filesystem" "this" {
  for_each = toset(var.filesystems)

  name               = each.value
  storage_account_id = azurerm_storage_account.this.id
}

# Lifecycle tiering: age data hot -> cool -> archive -> delete to control cost.
resource "azurerm_storage_management_policy" "this" {
  count              = var.enable_lifecycle_policy ? 1 : 0
  storage_account_id = azurerm_storage_account.this.id

  rule {
    name    = "tier-and-expire"
    enabled = true

    filters {
      blob_types = ["blockBlob"]
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = var.tier_to_cool_after_days
        tier_to_archive_after_days_since_modification_greater_than = var.tier_to_archive_after_days
        delete_after_days_since_modification_greater_than          = var.delete_after_days
      }
      snapshot {
        delete_after_days_since_creation_greater_than = var.snapshot_delete_after_days
      }
    }
  }
}

# variables.tf

variable "storage_account_name" {
  description = "Globally unique storage account name (3-24 chars, lowercase letters and numbers only)."
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9]{3,24}$", var.storage_account_name))
    error_message = "storage_account_name must be 3-24 characters, lowercase letters and numbers only."
  }
}

variable "resource_group_name" {
  description = "Name of the resource group to deploy into."
  type        = string
}

variable "location" {
  description = "Azure region (e.g. 'centralindia', 'eastus')."
  type        = string
}

variable "replication_type" {
  description = "Account replication type. Use ZRS/GZRS/RAGZRS for production lakes; LRS only for dev."
  type        = string
  default     = "ZRS"

  validation {
    condition     = contains(["LRS", "ZRS", "GRS", "RAGRS", "GZRS", "RAGZRS"], var.replication_type)
    error_message = "replication_type must be one of LRS, ZRS, GRS, RAGRS, GZRS, RAGZRS."
  }
}

variable "filesystems" {
  description = "List of ADLS Gen2 filesystem (container) names to create, e.g. medallion zones."
  type        = list(string)
  default     = ["bronze", "silver", "gold"]

  validation {
    condition     = length(var.filesystems) == length(distinct(var.filesystems))
    error_message = "filesystems must not contain duplicate names."
  }

  validation {
    condition     = alltrue([for f in var.filesystems : can(regex("^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$", f))])
    error_message = "Each filesystem name must be 3-63 chars, lowercase alphanumeric or hyphens, and start/end alphanumeric."
  }
}

variable "shared_access_key_enabled" {
  description = "Allow account-key/SAS auth. Disable to enforce Entra ID (AAD)-only access."
  type        = bool
  default     = false
}

variable "public_network_access_enabled" {
  description = "Allow public network access. Set false to require private endpoints / curated network rules."
  type        = bool
  default     = false
}

variable "infrastructure_encryption_enabled" {
  description = "Enable double (infrastructure) encryption at rest. Cannot be changed after creation."
  type        = bool
  default     = true
}

variable "network_bypass" {
  description = "Traffic allowed to bypass network rules when public access is disabled."
  type        = list(string)
  default     = ["AzureServices"]

  validation {
    condition     = alltrue([for b in var.network_bypass : contains(["AzureServices", "Logging", "Metrics", "None"], b)])
    error_message = "network_bypass values must be from: AzureServices, Logging, Metrics, None."
  }
}

variable "allowed_ip_rules" {
  description = "Public IP ranges (CIDR) allowed when network rules are enforced. Note: private IPs are not permitted here."
  type        = list(string)
  default     = []
}

variable "allowed_subnet_ids" {
  description = "Subnet resource IDs (with Microsoft.Storage service endpoint) allowed when network rules are enforced."
  type        = list(string)
  default     = []
}

variable "blob_soft_delete_retention_days" {
  description = "Days to retain soft-deleted blobs (1-365)."
  type        = number
  default     = 14

  validation {
    condition     = var.blob_soft_delete_retention_days >= 1 && var.blob_soft_delete_retention_days <= 365
    error_message = "blob_soft_delete_retention_days must be between 1 and 365."
  }
}

variable "container_soft_delete_retention_days" {
  description = "Days to retain soft-deleted containers/filesystems (1-365)."
  type        = number
  default     = 14

  validation {
    condition     = var.container_soft_delete_retention_days >= 1 && var.container_soft_delete_retention_days <= 365
    error_message = "container_soft_delete_retention_days must be between 1 and 365."
  }
}

variable "enable_lifecycle_policy" {
  description = "Attach a lifecycle management policy for hot -> cool -> archive -> delete tiering."
  type        = bool
  default     = true
}

variable "tier_to_cool_after_days" {
  description = "Days since last modification before transitioning a block blob to Cool."
  type        = number
  default     = 30
}

variable "tier_to_archive_after_days" {
  description = "Days since last modification before transitioning a block blob to Archive."
  type        = number
  default     = 120
}

variable "delete_after_days" {
  description = "Days since last modification before deleting a block blob."
  type        = number
  default     = 2555
}

variable "snapshot_delete_after_days" {
  description = "Days since creation before deleting a blob snapshot."
  type        = number
  default     = 90
}

variable "tags" {
  description = "Tags applied to the storage account."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "storage_account_id" {
  description = "Resource ID of the HNS-enabled storage account."
  value       = azurerm_storage_account.this.id
}

output "storage_account_name" {
  description = "Name of the storage account."
  value       = azurerm_storage_account.this.name
}

output "dfs_endpoint" {
  description = "Primary Data Lake (DFS) endpoint, e.g. https://<name>.dfs.core.windows.net/."
  value       = azurerm_storage_account.this.primary_dfs_endpoint
}

output "primary_dfs_host" {
  description = "DFS host only (no scheme), useful for building abfss:// URIs."
  value       = azurerm_storage_account.this.primary_dfs_host
}

output "filesystem_ids" {
  description = "Map of filesystem name => resource ID for each created ADLS Gen2 filesystem."
  value       = { for k, fs in azurerm_storage_data_lake_gen2_filesystem.this : k => fs.id }
}

output "filesystem_names" {
  description = "List of created ADLS Gen2 filesystem names."
  value       = [for fs in azurerm_storage_data_lake_gen2_filesystem.this : fs.name]
}

output "identity_principal_id" {
  description = "Principal ID of the account's system-assigned managed identity (for CMK / RBAC grants)."
  value       = azurerm_storage_account.this.identity[0].principal_id
}
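
The lifecycle inputs above are validated individually but not against each other, so nothing stops a caller from setting the archive threshold below the cool threshold. A minimal sketch of one way to surface that, assuming it is added to the module's main.tf: a check block (available from Terraform 1.5, which matches the version pin). The second assertion encodes the Azure rule that the Archive tier is offered only on LRS, GRS, and RA-GRS accounts. The check name and messages are illustrative.

```hcl
# Check blocks report failed assertions as warnings during plan/apply;
# they do not block the run.
check "lifecycle_settings" {
  assert {
    condition = (
      !var.enable_lifecycle_policy ||
      (
        var.tier_to_cool_after_days < var.tier_to_archive_after_days &&
        var.tier_to_archive_after_days < var.delete_after_days
      )
    )
    error_message = "Lifecycle thresholds should increase: cool < archive < delete."
  }

  assert {
    condition = (
      !var.enable_lifecycle_policy ||
      contains(["LRS", "GRS", "RAGRS"], var.replication_type)
    )
    error_message = "The Archive tier is not available on ZRS, GZRS, or RAGZRS accounts, so the archive step will not take effect."
  }
}
```

If a hard failure is preferred over a warning, the same conditions could be moved into a lifecycle precondition on azurerm_storage_management_policy.this.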

How to use it

module "data_lake_storage_gen2" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"

  storage_account_name = "kvlakeprodcin01"
  resource_group_name  = azurerm_resource_group.data.name
  location             = "centralindia"

  # Zone-redundant for a production lake.
  replication_type = "ZRS"

  # Medallion zones exposed over abfss://.
  filesystems = ["bronze", "silver", "gold", "sandbox"]

  # Lock it down: Entra ID auth only, no public network.
  shared_access_key_enabled     = false
  public_network_access_enabled = false
  allowed_subnet_ids            = [azurerm_subnet.databricks_private.id]

  # Cost control for raw history.
  enable_lifecycle_policy    = true
  tier_to_cool_after_days    = 30
  tier_to_archive_after_days = 90
  delete_after_days          = 2555 # ~7 years retention for raw

  tags = {
    environment = "prod"
    workload    = "lakehouse"
    owner       = "data-platform"
  }
}

# Downstream: grant a Databricks access connector RBAC on the lake, and hand
# the bronze filesystem URI to a Databricks external location / mount config.
resource "azurerm_role_assignment" "databricks_lake_access" {
  scope                = module.data_lake_storage_gen2.storage_account_id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_databricks_access_connector.this.identity[0].principal_id
}

locals {
  bronze_abfss = "abfss://bronze@${module.data_lake_storage_gen2.primary_dfs_host}/"
}
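
The module does not create a private endpoint itself, so a locked-down lake needs one alongside it. The following is a sketch under stated assumptions: azurerm_subnet.private_endpoints is a hypothetical subnet reserved for private endpoints, and the resource names are placeholders.

```hcl
# Private DNS zone that resolves the account's dfs hostname to the
# private endpoint's IP.
resource "azurerm_private_dns_zone" "dfs" {
  name                = "privatelink.dfs.core.windows.net"
  resource_group_name = azurerm_resource_group.data.name
}

resource "azurerm_private_endpoint" "lake_dfs" {
  name                = "pe-kvlakeprodcin01-dfs"
  location            = azurerm_resource_group.data.location
  resource_group_name = azurerm_resource_group.data.name
  subnet_id           = azurerm_subnet.private_endpoints.id # hypothetical subnet

  private_service_connection {
    name                           = "psc-kvlakeprodcin01-dfs"
    private_connection_resource_id = module.data_lake_storage_gen2.storage_account_id
    subresource_names              = ["dfs"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "dfs"
    private_dns_zone_ids = [azurerm_private_dns_zone.dfs.id]
  }
}
```

The DNS zone still has to be linked to the consuming VNet (azurerm_private_dns_zone_virtual_network_link). Some clients also call the blob endpoint, in which case a second endpoint with subresource_names = ["blob"] may be needed.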

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state storage account/container + key per path...
  }
}

2. Module config — live/prod/data_lake/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"
}

inputs = {
  storage_account_name = "..."
  resource_group_name  = "..."
  location             = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/data_lake && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)     # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
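
As an illustration of per-environment overrides, a dev configuration might look like the sketch below. The path live/dev/data_lake/terragrunt.hcl, the account name, and the resource group name are all hypothetical; only the input names come from the module.

```hcl
# live/dev/data_lake/terragrunt.hcl (hypothetical dev environment)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"
}

inputs = {
  storage_account_name = "kvlakedevcin01" # hypothetical name
  resource_group_name  = "rg-data-dev"    # hypothetical resource group
  location             = "centralindia"

  # Dev-only overrides: cheaper redundancy, extra sandbox zone, no tiering.
  replication_type        = "LRS"
  filesystems             = ["bronze", "silver", "gold", "sandbox"]
  enable_lifecycle_policy = false

  tags = {
    environment = "dev"
  }
}
```

The prod file keeps the same include and source and changes only the inputs, which is the point of the layout.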

Inputs

Name	Type	Default	Required	Description
`storage_account_name`	`string`	—	Yes	Globally unique account name (3-24 chars, lowercase alphanumeric).
`resource_group_name`	`string`	—	Yes	Resource group to deploy into.
`location`	`string`	—	Yes	Azure region (e.g. `centralindia`).
`replication_type`	`string`	`"ZRS"`	No	LRS/ZRS/GRS/RAGRS/GZRS/RAGZRS.
`filesystems`	`list(string)`	`["bronze","silver","gold"]`	No	ADLS Gen2 filesystem names to create.
`shared_access_key_enabled`	`bool`	`false`	No	Allow account-key/SAS auth; false enforces Entra ID-only.
`public_network_access_enabled`	`bool`	`false`	No	false enforces private endpoints / curated network rules.
`infrastructure_encryption_enabled`	`bool`	`true`	No	Double encryption at rest (immutable after creation).
`network_bypass`	`list(string)`	`["AzureServices"]`	No	Bypass categories when network rules apply.
`allowed_ip_rules`	`list(string)`	`[]`	No	Public CIDR ranges allowed under network rules.
`allowed_subnet_ids`	`list(string)`	`[]`	No	Service-endpoint subnet IDs allowed under network rules.
`blob_soft_delete_retention_days`	`number`	`14`	No	Soft-delete retention for blobs (1-365).
`container_soft_delete_retention_days`	`number`	`14`	No	Soft-delete retention for containers (1-365).
`enable_lifecycle_policy`	`bool`	`true`	No	Attach hot→cool→archive→delete lifecycle policy.
`tier_to_cool_after_days`	`number`	`30`	No	Days before block blobs move to Cool.
`tier_to_archive_after_days`	`number`	`120`	No	Days before block blobs move to Archive.
`delete_after_days`	`number`	`2555`	No	Days before block blobs are deleted.
`snapshot_delete_after_days`	`number`	`90`	No	Days before snapshots are deleted.
`tags`	`map(string)`	`{}`	No	Tags applied to the storage account.

Outputs

Name	Description
`storage_account_id`	Resource ID of the HNS-enabled storage account.
`storage_account_name`	Name of the storage account.
`dfs_endpoint`	Primary DFS endpoint (`https://<name>.dfs.core.windows.net/`).
`primary_dfs_host`	DFS host only, for building `abfss://` URIs.
`filesystem_ids`	Map of filesystem name => resource ID.
`filesystem_names`	List of created filesystem names.
`identity_principal_id`	Principal ID of the account’s system-assigned managed identity.

Enterprise scenario

A retail analytics team runs a Databricks lakehouse on Azure and needs a governed landing zone for point-of-sale and clickstream data across regions. They call this module once per environment; in production it provisions kvlakeprodcin01 with bronze/silver/gold filesystems, public_network_access_enabled = false, and access pinned to the Databricks private subnet, so raw data only ever traverses the VNet. The lifecycle policy ages bronze POS files to Cool after 30 days and Archive after 90, cutting storage spend on multi-year history by roughly 60% while still meeting a 7-year retention mandate. The storage_account_id output feeds a Storage Blob Data Contributor grant to the Databricks access connector, so Unity Catalog external locations work without a single account key.

Best practices

Never plan to “turn on” HNS later. Enabling HNS is irreversible, and converting an existing flat account is a one-way upgrade with its own prerequisites; this module forces is_hns_enabled = true, so always create the lake through it rather than importing a flat StorageV2 account and hoping to convert.
Prefer Entra ID over keys. Keep shared_access_key_enabled = false and default_to_oauth_authentication = true (both defaults here) and grant data-plane access with RBAC roles like Storage Blob Data Contributor plus POSIX ACLs on directories — account keys bypass all of that governance.
Lock the network by default. Run with public_network_access_enabled = false, reach the lake via a dfs private endpoint, and use allowed_subnet_ids for your Spark/Databricks subnets; remember ip_rules accepts only public IPs, never RFC1918 ranges.
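Tier aggressively for cost. Raw/bronze data is write-once, read-rarely, so let the lifecycle policy push it to Cool then Archive, and avoid Archive for silver/gold that analysts query interactively (rehydration latency is hours). Note that the Archive tier is available only on LRS, GRS, and RA-GRS accounts, so the archive step does not take effect on the ZRS default or on GZRS/RA-GZRS.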
Pick redundancy to match the blast radius. Use ZRS/GZRS for production lakes to survive a zone failure; LRS is acceptable only for ephemeral dev/sandbox lakes where re-ingestion is cheap.
Name and tag for the estate. Account names are globally unique and limited to 24 lowercase alphanumeric chars, so encode workload + env + region (e.g. kvlakeprodcin01) and tag environment/owner/workload so cost and ownership are queryable across every lake.
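
The naming and tagging convention can be sketched with locals in the calling configuration. The prefix kv and the region code cin are inferred from the example name kvlakeprodcin01 and should be treated as assumptions, as should the local names themselves.

```hcl
locals {
  org         = "kv"   # assumed organisation prefix
  workload    = "lake"
  environment = "prod"
  region_code = "cin"  # assumed short code for centralindia
  instance    = 1

  # %02d pads the instance number to two digits: "kvlakeprodcin01".
  storage_account_name = lower(format(
    "%s%s%s%s%02d",
    local.org, local.workload, local.environment, local.region_code, local.instance
  ))

  # Tags that keep cost and ownership queryable across lakes.
  common_tags = {
    environment = local.environment
    workload    = "lakehouse"
    owner       = "data-platform"
  }
}
```

Passing local.storage_account_name and local.common_tags into the module keeps names within the 24-character lowercase alphanumeric limit that the module's validation enforces.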

Written by Vinod
