Quick take: a reusable module for hashicorp/azurerm ~> 4.0 that provisions Azure Data Lake Storage Gen2: an HNS-enabled storage account, governed filesystems, configurable redundancy (ZRS by default), default-deny network rules for private access, and lifecycle tiering, all driven by variables. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
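provider "azurerm" {
features {}
# The module disables shared-key auth by default, so the provider must use
# Entra ID for storage data-plane calls (e.g. creating the filesystems).
storage_use_azuread = true
}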
module "data_lake" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"
storage_account_name = "..." # Globally unique account name (3-24 chars, lowercase alp…
resource_group_name = "..." # Resource group to deploy into.
location = "..." # Azure region (e.g. `centralindia`).
}
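Then run terraform init && terraform apply. Every other input has a default; see Inputs below to override behaviour. Because the defaults disable shared-key auth and public network access, the identity running the apply likely needs an Entra ID data-plane role on the account and a network path that the account's rules allow, otherwise filesystem creation can fail.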
What this module is
Azure Data Lake Storage Gen2 (ADLS Gen2) is not a separate service. It is an Azure Storage account with the Hierarchical Namespace (HNS) feature switched on. That one flag (is_hns_enabled = true) is what turns flat blob containers into true filesystems with real directories, atomic directory renames/deletes, and POSIX-style ACLs that Spark, Databricks, Synapse, and Fabric all expect from abfss:// endpoints. HNS is effectively a creation-time decision: it can never be disabled once on, and enabling it on an existing flat account is a one-way upgrade rather than a simple toggle. Getting the storage account right on the first apply therefore matters more here than almost anywhere else in Azure.
This module wraps the two resources that define a lake — azurerm_storage_account (the HNS-enabled account) and azurerm_storage_data_lake_gen2_filesystem (the abfss:// filesystems, e.g. bronze, silver, gold) — and bundles the production options teams always end up bolting on: a lifecycle management policy for hot→cool→archive tiering, optional private-endpoint-friendly network rules, blob soft delete, and infrastructure encryption. Wrapping it in a module means every lake in the estate is born HNS-enabled, TLS 1.2-only, with public blob access off and the medallion filesystems pre-created — instead of someone clicking “create storage account,” forgetting the HNS checkbox, and discovering three sprints later that Databricks can’t mount it.
When to use it
- You are standing up a medallion / lakehouse (bronze/silver/gold) for Databricks, Synapse Spark, or Microsoft Fabric and need abfss:// filesystems, not flat blob containers.
- You want every data lake in the org to be consistent by construction: HNS on, TLS 1.2 minimum, public access disabled, soft delete enabled, and lifecycle tiering attached.
- You are landing raw data that ages predictably and want cost control via automatic hot→cool→archive transitions instead of paying hot rates for cold history.
- You need the lake to be reachable only over private endpoints with default_action = "Deny" and a curated IP/subnet allowlist for governance and audit-ready compliance.
- Do not use this for general-purpose blob/object storage, static website hosting, or queue/table workloads. HNS adds per-transaction cost and feature constraints (for example, limits on combining immutable WORM policies with blob versioning) that those workloads don't need.
Module structure
terraform-module-azure-data-lake/
├── versions.tf # provider + Terraform version pins
├── main.tf # storage account (HNS) + filesystems + lifecycle policy
├── variables.tf # var-driven inputs with validations
└── outputs.tf # ids, dfs endpoint, filesystem ids
# versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 4.0"
}
}
}
# main.tf
# HNS-enabled storage account — this is what makes it a Data Lake Gen2 account.
# is_hns_enabled CANNOT be changed after creation, so it is forced true here.
resource "azurerm_storage_account" "this" {
name = var.storage_account_name
resource_group_name = var.resource_group_name
location = var.location
account_tier = "Standard"
account_kind = "StorageV2"
account_replication_type = var.replication_type
# The defining ADLS Gen2 setting.
is_hns_enabled = true
# Hardened defaults.
min_tls_version = "TLS1_2"
https_traffic_only_enabled = true
allow_nested_items_to_be_public = false
shared_access_key_enabled = var.shared_access_key_enabled
public_network_access_enabled = var.public_network_access_enabled
default_to_oauth_authentication = true
infrastructure_encryption_enabled = var.infrastructure_encryption_enabled
blob_properties {
delete_retention_policy {
days = var.blob_soft_delete_retention_days
}
container_delete_retention_policy {
days = var.container_soft_delete_retention_days
}
}
# When restricting access, default-deny and allow only curated subnets/IPs.
dynamic "network_rules" {
for_each = var.public_network_access_enabled ? [] : [1]
content {
default_action = "Deny"
bypass = var.network_bypass
ip_rules = var.allowed_ip_rules
virtual_network_subnet_ids = var.allowed_subnet_ids
}
}
identity {
type = "SystemAssigned"
}
tags = var.tags
}
# Medallion (or arbitrary) filesystems exposed over abfss://.
resource "azurerm_storage_data_lake_gen2_filesystem" "this" {
for_each = toset(var.filesystems)
name = each.value
storage_account_id = azurerm_storage_account.this.id
}
# Lifecycle tiering: age data hot -> cool -> archive -> delete to control cost.
resource "azurerm_storage_management_policy" "this" {
count = var.enable_lifecycle_policy ? 1 : 0
storage_account_id = azurerm_storage_account.this.id
rule {
name = "tier-and-expire"
enabled = true
filters {
blob_types = ["blockBlob"]
}
actions {
base_blob {
tier_to_cool_after_days_since_modification_greater_than = var.tier_to_cool_after_days
tier_to_archive_after_days_since_modification_greater_than = var.tier_to_archive_after_days
delete_after_days_since_modification_greater_than = var.delete_after_days
}
snapshot {
delete_after_days_since_creation_greater_than = var.snapshot_delete_after_days
}
}
}
}
# variables.tf
variable "storage_account_name" {
description = "Globally unique storage account name (3-24 chars, lowercase letters and numbers only)."
type = string
validation {
condition = can(regex("^[a-z0-9]{3,24}$", var.storage_account_name))
error_message = "storage_account_name must be 3-24 characters, lowercase letters and numbers only."
}
}
variable "resource_group_name" {
description = "Name of the resource group to deploy into."
type = string
}
variable "location" {
description = "Azure region (e.g. 'centralindia', 'eastus')."
type = string
}
variable "replication_type" {
description = "Account replication type. Use ZRS/GZRS/RAGZRS for production lakes; LRS only for dev."
type = string
default = "ZRS"
validation {
condition = contains(["LRS", "ZRS", "GRS", "RAGRS", "GZRS", "RAGZRS"], var.replication_type)
error_message = "replication_type must be one of LRS, ZRS, GRS, RAGRS, GZRS, RAGZRS."
}
}
variable "filesystems" {
description = "List of ADLS Gen2 filesystem (container) names to create, e.g. medallion zones."
type = list(string)
default = ["bronze", "silver", "gold"]
validation {
condition = length(var.filesystems) == length(distinct(var.filesystems))
error_message = "filesystems must not contain duplicate names."
}
validation {
condition = alltrue([for f in var.filesystems : can(regex("^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$", f))])
error_message = "Each filesystem name must be 3-63 chars, lowercase alphanumeric or hyphens, and start/end alphanumeric."
}
}
variable "shared_access_key_enabled" {
description = "Allow account-key/SAS auth. Disable to enforce Entra ID (AAD)-only access."
type = bool
default = false
}
variable "public_network_access_enabled" {
description = "Allow public network access. Set false to require private endpoints / curated network rules."
type = bool
default = false
}
variable "infrastructure_encryption_enabled" {
description = "Enable double (infrastructure) encryption at rest. Cannot be changed after creation."
type = bool
default = true
}
variable "network_bypass" {
description = "Traffic allowed to bypass network rules when public access is disabled."
type = list(string)
default = ["AzureServices"]
validation {
condition = alltrue([for b in var.network_bypass : contains(["AzureServices", "Logging", "Metrics", "None"], b)])
error_message = "network_bypass values must be from: AzureServices, Logging, Metrics, None."
}
}
variable "allowed_ip_rules" {
description = "Public IP ranges (CIDR) allowed when network rules are enforced. Note: private IPs are not permitted here."
type = list(string)
default = []
}
variable "allowed_subnet_ids" {
description = "Subnet resource IDs (with Microsoft.Storage service endpoint) allowed when network rules are enforced."
type = list(string)
default = []
}
variable "blob_soft_delete_retention_days" {
description = "Days to retain soft-deleted blobs (1-365)."
type = number
default = 14
validation {
condition = var.blob_soft_delete_retention_days >= 1 && var.blob_soft_delete_retention_days <= 365
error_message = "blob_soft_delete_retention_days must be between 1 and 365."
}
}
variable "container_soft_delete_retention_days" {
description = "Days to retain soft-deleted containers/filesystems (1-365)."
type = number
default = 14
validation {
condition = var.container_soft_delete_retention_days >= 1 && var.container_soft_delete_retention_days <= 365
error_message = "container_soft_delete_retention_days must be between 1 and 365."
}
}
variable "enable_lifecycle_policy" {
description = "Attach a lifecycle management policy for hot -> cool -> archive -> delete tiering."
type = bool
default = true
}
variable "tier_to_cool_after_days" {
description = "Days since last modification before transitioning a block blob to Cool."
type = number
default = 30
}
variable "tier_to_archive_after_days" {
description = "Days since last modification before transitioning a block blob to Archive."
type = number
default = 120
}
variable "delete_after_days" {
description = "Days since last modification before deleting a block blob."
type = number
default = 2555
}
variable "snapshot_delete_after_days" {
description = "Days since creation before deleting a blob snapshot."
type = number
default = 90
}
variable "tags" {
description = "Tags applied to the storage account."
type = map(string)
default = {}
}
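The tiering inputs above are declared without any cross-checks, so nothing stops a caller from setting an archive threshold lower than the cool threshold. A minimal sketch of one way to surface that, assuming Terraform 1.5+ `check` blocks (the block name `lifecycle_tier_ordering` is hypothetical and not part of the published module):

```hcl
# Hypothetical addition to main.tf: warn when the tiering thresholds are
# not strictly increasing. Only evaluated when the lifecycle policy is enabled.
check "lifecycle_tier_ordering" {
  assert {
    condition = (
      !var.enable_lifecycle_policy ||
      (
        var.tier_to_cool_after_days < var.tier_to_archive_after_days &&
        var.tier_to_archive_after_days < var.delete_after_days
      )
    )
    error_message = "Expected tier_to_cool_after_days < tier_to_archive_after_days < delete_after_days."
  }
}
```

A failed `check` assertion is reported as a warning and does not block the apply. If a hard failure is preferred, the same condition could instead be placed in a `lifecycle { precondition { ... } }` block on the management policy resource.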
# outputs.tf
output "storage_account_id" {
description = "Resource ID of the HNS-enabled storage account."
value = azurerm_storage_account.this.id
}
output "storage_account_name" {
description = "Name of the storage account."
value = azurerm_storage_account.this.name
}
output "dfs_endpoint" {
description = "Primary Data Lake (DFS) endpoint, e.g. https://<name>.dfs.core.windows.net/."
value = azurerm_storage_account.this.primary_dfs_endpoint
}
output "primary_dfs_host" {
description = "DFS host only (no scheme), useful for building abfss:// URIs."
value = azurerm_storage_account.this.primary_dfs_host
}
output "filesystem_ids" {
description = "Map of filesystem name => resource ID for each created ADLS Gen2 filesystem."
value = { for k, fs in azurerm_storage_data_lake_gen2_filesystem.this : k => fs.id }
}
output "filesystem_names" {
description = "List of created ADLS Gen2 filesystem names."
value = [for fs in azurerm_storage_data_lake_gen2_filesystem.this : fs.name]
}
output "identity_principal_id" {
description = "Principal ID of the account's system-assigned managed identity (for CMK / RBAC grants)."
value = azurerm_storage_account.this.identity[0].principal_id
}
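Since most consumers want a ready-made abfss:// URI rather than a host and a name to stitch together, an extra output along these lines may be convenient. This is a sketch only; the output name `filesystem_abfss_uris` is an assumption and does not exist in the module as published:

```hcl
# Hypothetical extra output: map of filesystem name => abfss:// root URI,
# built from the same resources the outputs above already reference.
output "filesystem_abfss_uris" {
  description = "Map of filesystem name => abfss:// root URI."
  value = {
    for k, fs in azurerm_storage_data_lake_gen2_filesystem.this :
    k => "abfss://${fs.name}@${azurerm_storage_account.this.primary_dfs_host}/"
  }
}
```

It follows the same pattern as the bronze_abfss local in the usage example, just computed once inside the module for every filesystem.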
How to use it
module "data_lake_storage_gen2" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"
storage_account_name = "kvlakeprodcin01"
resource_group_name = azurerm_resource_group.data.name
location = "centralindia"
# Zone-redundant for a production lake.
replication_type = "ZRS"
# Medallion zones exposed over abfss://.
filesystems = ["bronze", "silver", "gold", "sandbox"]
# Lock it down: Entra ID auth only, no public network.
shared_access_key_enabled = false
public_network_access_enabled = false
allowed_subnet_ids = [azurerm_subnet.databricks_private.id]
# Cost control for raw history.
enable_lifecycle_policy = true
tier_to_cool_after_days = 30
tier_to_archive_after_days = 90
delete_after_days = 2555 # ~7 years retention for raw
tags = {
environment = "prod"
workload = "lakehouse"
owner = "data-platform"
}
}
# Downstream: grant a Databricks access connector RBAC on the lake, and hand
# the bronze filesystem URI to a Databricks external location / mount config.
resource "azurerm_role_assignment" "databricks_lake_access" {
scope = module.data_lake_storage_gen2.storage_account_id
role_definition_name = "Storage Blob Data Contributor"
principal_id = azurerm_databricks_access_connector.this.identity[0].principal_id
}
locals {
bronze_abfss = "abfss://bronze@${module.data_lake_storage_gen2.primary_dfs_host}/"
}
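The module sets default-deny network rules but does not itself create a private endpoint. A minimal sketch of adding one for the dfs sub-resource next to the module call, assuming a dedicated subnet named `azurerm_subnet.private_endpoints` exists in the calling configuration (that subnet and the resource names below are hypothetical):

```hcl
# Hypothetical private endpoint for the lake's DFS (abfss://) endpoint.
# azurerm_subnet.private_endpoints is assumed to be defined by the caller.
resource "azurerm_private_endpoint" "lake_dfs" {
  name                = "pe-kvlakeprodcin01-dfs"
  location            = "centralindia"
  resource_group_name = azurerm_resource_group.data.name
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "psc-kvlakeprodcin01-dfs"
    private_connection_resource_id = module.data_lake_storage_gen2.storage_account_id
    subresource_names              = ["dfs"]
    is_manual_connection           = false
  }
}
```

Clients still need to resolve the account's DFS hostname to the endpoint's private IP, typically through a private DNS zone linked to the VNet. That wiring is left out here to keep the sketch short.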
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "azurerm"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...azurerm state bucket/container + key per path...
}
}
2. Module config — live/prod/data_lake/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"
}
inputs = {
storage_account_name = "..."
resource_group_name = "..."
location = "..."
}
3. Deploy one environment, or roll out all modules together:
(cd live/prod/data_lake && terragrunt apply) # this module only
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
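To make the per-environment override concrete, here is a sketch of what a dev counterpart to the prod config might look like. The path live/dev/data_lake/terragrunt.hcl and the specific overrides are assumptions chosen for illustration; only the input names come from the module:

```hcl
# live/dev/data_lake/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"
}

inputs = {
  storage_account_name = "..."
  resource_group_name  = "..."
  location             = "..."

  # Dev-only overrides: cheaper redundancy, no tiering, an extra sandbox zone.
  replication_type        = "LRS"
  enable_lifecycle_policy = false
  filesystems             = ["bronze", "silver", "gold", "sandbox"]
}
```

Everything not listed under inputs falls back to the module defaults, so the dev and prod files differ only in the lines that matter.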
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| storage_account_name | string | n/a | Yes | Globally unique account name (3-24 chars, lowercase alphanumeric). |
| resource_group_name | string | n/a | Yes | Resource group to deploy into. |
| location | string | n/a | Yes | Azure region (e.g. centralindia). |
| replication_type | string | "ZRS" | No | LRS/ZRS/GRS/RAGRS/GZRS/RAGZRS. |
| filesystems | list(string) | ["bronze","silver","gold"] | No | ADLS Gen2 filesystem names to create. |
| shared_access_key_enabled | bool | false | No | Allow account-key/SAS auth; false enforces Entra ID-only. |
| public_network_access_enabled | bool | false | No | false enforces private endpoints / curated network rules. |
| infrastructure_encryption_enabled | bool | true | No | Double encryption at rest (immutable after creation). |
| network_bypass | list(string) | ["AzureServices"] | No | Bypass categories when network rules apply. |
| allowed_ip_rules | list(string) | [] | No | Public CIDR ranges allowed under network rules. |
| allowed_subnet_ids | list(string) | [] | No | Service-endpoint subnet IDs allowed under network rules. |
| blob_soft_delete_retention_days | number | 14 | No | Soft-delete retention for blobs (1-365). |
| container_soft_delete_retention_days | number | 14 | No | Soft-delete retention for containers (1-365). |
| enable_lifecycle_policy | bool | true | No | Attach hot→cool→archive→delete lifecycle policy. |
| tier_to_cool_after_days | number | 30 | No | Days before block blobs move to Cool. |
| tier_to_archive_after_days | number | 120 | No | Days before block blobs move to Archive. |
| delete_after_days | number | 2555 | No | Days before block blobs are deleted. |
| snapshot_delete_after_days | number | 90 | No | Days before snapshots are deleted. |
| tags | map(string) | {} | No | Tags applied to the storage account. |
Outputs
| Name | Description |
|---|---|
| storage_account_id | Resource ID of the HNS-enabled storage account. |
| storage_account_name | Name of the storage account. |
| dfs_endpoint | Primary DFS endpoint (https://<name>.dfs.core.windows.net/). |
| primary_dfs_host | DFS host only, for building abfss:// URIs. |
| filesystem_ids | Map of filesystem name => resource ID. |
| filesystem_names | List of created filesystem names. |
| identity_principal_id | Principal ID of the account's system-assigned managed identity. |
Enterprise scenario
A retail analytics team runs a Databricks lakehouse on Azure and needs a governed landing zone for point-of-sale and clickstream data across regions. They call this module once per environment to provision kvlakeprodcin01 with bronze/silver/gold filesystems, public_network_access_enabled = false, and access pinned to the Databricks private subnet — so raw data only ever traverses the VNet. The lifecycle policy ages bronze POS files to Cool after 30 days and Archive after 90, cutting storage spend on multi-year history by roughly 60% while still meeting a 7-year retention mandate, and the storage_account_id output feeds a Storage Blob Data Contributor grant to the Databricks access connector so Unity Catalog external locations work without a single account key.
Best practices
- Never plan to "turn on" HNS later. is_hns_enabled is best treated as a creation-time decision and is irreversible once on. This module forces it true, so always create the lake through it rather than importing a flat StorageV2 account and hoping to convert.
- Prefer Entra ID over keys. Keep shared_access_key_enabled = false and default_to_oauth_authentication = true (both defaults here) and grant data-plane access with RBAC roles like Storage Blob Data Contributor plus POSIX ACLs on directories. Account keys bypass all of that governance.
- Lock the network by default. Run with public_network_access_enabled = false, reach the lake via a dfs private endpoint, and use allowed_subnet_ids for your Spark/Databricks subnets. Remember that ip_rules accepts only public IPs, never RFC1918 ranges.
- Tier aggressively for cost. Raw/bronze data is write-once, read-rarely, so let the lifecycle policy push it to Cool then Archive, and avoid Archive for silver/gold that analysts query interactively (rehydration latency is hours).
- Pick redundancy to match the blast radius. Use ZRS/GZRS for production lakes to survive a zone failure; LRS is acceptable only for ephemeral dev/sandbox lakes where re-ingestion is cheap. Azure documents the archive tier as unsupported on ZRS, GZRS, and RA-GZRS accounts, so confirm that the archive transition fits the redundancy you choose.
- Name and tag for the estate. Account names are globally unique and limited to 24 lowercase alphanumeric chars, so encode workload + env + region (e.g. kvlakeprodcin01) and tag environment/owner/workload so cost and ownership are queryable across every lake.
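The "RBAC plus POSIX ACLs on directories" advice can be sketched with the provider's azurerm_storage_data_lake_gen2_path resource. This is an illustration under assumptions, not part of the module: the variable `data_engineers_group_object_id`, the directory name `pos`, and the chosen permissions are all hypothetical, and it reuses the module call name from the usage example.

```hcl
# Hypothetical input: object ID of an Entra ID group that should own the directory's data access.
variable "data_engineers_group_object_id" {
  description = "Object ID of the Entra ID group granted access to the bronze/pos directory."
  type        = string
}

# Hypothetical directory inside the bronze filesystem with a group ACL.
resource "azurerm_storage_data_lake_gen2_path" "bronze_pos" {
  storage_account_id = module.data_lake_storage_gen2.storage_account_id
  filesystem_name    = "bronze"
  path               = "pos"
  resource           = "directory"

  # Access ACL: applies to the directory itself.
  ace {
    scope       = "access"
    type        = "group"
    id          = var.data_engineers_group_object_id
    permissions = "rwx"
  }

  # Default ACL: inherited by children created under the directory.
  ace {
    scope       = "default"
    type        = "group"
    id          = var.data_engineers_group_object_id
    permissions = "rwx"
  }

  # The filesystem name is a literal, so wait for the module explicitly.
  depends_on = [module.data_lake_storage_gen2]
}
```

Depending on provider behaviour, the base user/group/other/mask entries may also need to be declared to avoid a perpetual diff, so treat this as a starting point and check the plan output before relying on it.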