Terraform Module: Azure Data Lake Storage Gen2 — secure, HNS-enabled lake with governed filesystems

Quick take: a reusable module for hashicorp/azurerm ~> 4.0 that builds an Azure Data Lake Storage Gen2 lake: an HNS-enabled storage account, governed filesystems, configurable redundancy (ZRS by default), locked-down networking, and lifecycle tiering, all variable-driven. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

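provider "azurerm" {
  features {}
  # azurerm 4.x also needs subscription_id here or ARM_SUBSCRIPTION_ID in the environment.
  # Data-plane calls must use Entra ID because the module defaults to shared_access_key_enabled = false.
  storage_use_azuread = true
}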

module "data_lake" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"

  storage_account_name = "..."  # Globally unique account name (3-24 chars, lowercase alp…
  resource_group_name  = "..."  # Resource group to deploy into.
  location             = "..."  # Azure region (e.g. `centralindia`).
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
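To see what the Quickstart produced, the module's outputs can be re-exported from the root configuration. This is a minimal sketch: the output names lake_dfs_endpoint and lake_filesystem_ids are illustrative choices, not part of the module.

```hcl
# Root-level outputs (names are illustrative) that surface values
# exported by the module "data_lake" block from the Quickstart.
output "lake_dfs_endpoint" {
  description = "Primary DFS endpoint of the new lake."
  value       = module.data_lake.dfs_endpoint
}

output "lake_filesystem_ids" {
  description = "Map of filesystem name => resource ID."
  value       = module.data_lake.filesystem_ids
}
```

After apply, terraform output lake_dfs_endpoint prints the endpoint without opening the portal.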

What this module is

Azure Data Lake Storage Gen2 (ADLS Gen2) is not a separate service: it is an Azure Storage account with the Hierarchical Namespace (HNS) feature switched on. That one flag (is_hns_enabled = true) is what turns flat blob containers into true filesystems with real directories, atomic directory renames/deletes, and POSIX-style ACLs that Spark, Databricks, Synapse, and Fabric all expect from abfss:// endpoints. HNS cannot be switched off once it is on, and enabling it on an existing account is a one-way upgrade rather than a simple toggle, so getting the storage account right on the first apply matters more here than almost anywhere else in Azure.
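One way to catch an account that was created without HNS is a check block (Terraform 1.5+, which matches the module's version pin) that reads the account and asserts on its is_hns_enabled attribute. This is only a sketch: the account name examplelake01 and resource group rg-example-data are hypothetical placeholders.

```hcl
# Warns during plan/apply if an existing account is not a real Data Lake Gen2 account.
# "examplelake01" and "rg-example-data" are hypothetical names.
check "lake_has_hns" {
  data "azurerm_storage_account" "candidate" {
    name                = "examplelake01"
    resource_group_name = "rg-example-data"
  }

  assert {
    condition     = data.azurerm_storage_account.candidate.is_hns_enabled
    error_message = "Storage account is not HNS-enabled, so it is not a Data Lake Gen2 account."
  }
}
```

A failed check is reported as a warning rather than blocking the run, which suits an audit of accounts that predate the module.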

This module wraps the two resources that define a lake — azurerm_storage_account (the HNS-enabled account) and azurerm_storage_data_lake_gen2_filesystem (the abfss:// filesystems, e.g. bronze, silver, gold) — and bundles the production options teams always end up bolting on: a lifecycle management policy for hot→cool→archive tiering, optional private-endpoint-friendly network rules, blob soft delete, and infrastructure encryption. Wrapping it in a module means every lake in the estate is born HNS-enabled, TLS 1.2-only, with public blob access off and the medallion filesystems pre-created — instead of someone clicking “create storage account,” forgetting the HNS checkbox, and discovering three sprints later that Databricks can’t mount it.

When to use it

Module structure

terraform-module-azure-data-lake/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # storage account (HNS) + filesystems + lifecycle policy
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # ids, dfs endpoint, filesystem ids

# versions.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

# main.tf

# HNS-enabled storage account — this is what makes it a Data Lake Gen2 account.
# HNS cannot be disabled once it is enabled, so is_hns_enabled is forced true here.
resource "azurerm_storage_account" "this" {
  name                = var.storage_account_name
  resource_group_name = var.resource_group_name
  location            = var.location

  account_tier             = "Standard"
  account_kind             = "StorageV2"
  account_replication_type = var.replication_type

  # The defining ADLS Gen2 setting.
  is_hns_enabled = true

  # Hardened defaults.
  min_tls_version                   = "TLS1_2"
  https_traffic_only_enabled        = true
  allow_nested_items_to_be_public   = false
  shared_access_key_enabled         = var.shared_access_key_enabled
  public_network_access_enabled     = var.public_network_access_enabled
  default_to_oauth_authentication   = true
  infrastructure_encryption_enabled = var.infrastructure_encryption_enabled

  blob_properties {
    delete_retention_policy {
      days = var.blob_soft_delete_retention_days
    }
    container_delete_retention_policy {
      days = var.container_soft_delete_retention_days
    }
  }

  # When restricting access, default-deny and allow only curated subnets/IPs.
  dynamic "network_rules" {
    for_each = var.public_network_access_enabled ? [] : [1]
    content {
      default_action             = "Deny"
      bypass                     = var.network_bypass
      ip_rules                   = var.allowed_ip_rules
      virtual_network_subnet_ids = var.allowed_subnet_ids
    }
  }

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags
}

# Medallion (or arbitrary) filesystems exposed over abfss://.
resource "azurerm_storage_data_lake_gen2_filesystem" "this" {
  for_each = toset(var.filesystems)

  name               = each.value
  storage_account_id = azurerm_storage_account.this.id
}

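# Lifecycle tiering: age data hot -> cool -> archive -> delete to control cost.
# Note: Azure has historically not supported the archive tier on ZRS, GZRS, or RA-GZRS
# accounts; confirm against current docs before combining it with the ZRS default.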
resource "azurerm_storage_management_policy" "this" {
  count              = var.enable_lifecycle_policy ? 1 : 0
  storage_account_id = azurerm_storage_account.this.id

  rule {
    name    = "tier-and-expire"
    enabled = true

    filters {
      blob_types = ["blockBlob"]
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = var.tier_to_cool_after_days
        tier_to_archive_after_days_since_modification_greater_than = var.tier_to_archive_after_days
        delete_after_days_since_modification_greater_than          = var.delete_after_days
      }
      snapshot {
        delete_after_days_since_creation_greater_than = var.snapshot_delete_after_days
      }
    }
  }
}

# variables.tf

variable "storage_account_name" {
  description = "Globally unique storage account name (3-24 chars, lowercase letters and numbers only)."
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9]{3,24}$", var.storage_account_name))
    error_message = "storage_account_name must be 3-24 characters, lowercase letters and numbers only."
  }
}

variable "resource_group_name" {
  description = "Name of the resource group to deploy into."
  type        = string
}

variable "location" {
  description = "Azure region (e.g. 'centralindia', 'eastus')."
  type        = string
}

variable "replication_type" {
  description = "Account replication type. Use ZRS/GZRS/RAGZRS for production lakes; LRS only for dev."
  type        = string
  default     = "ZRS"

  validation {
    condition     = contains(["LRS", "ZRS", "GRS", "RAGRS", "GZRS", "RAGZRS"], var.replication_type)
    error_message = "replication_type must be one of LRS, ZRS, GRS, RAGRS, GZRS, RAGZRS."
  }
}

variable "filesystems" {
  description = "List of ADLS Gen2 filesystem (container) names to create, e.g. medallion zones."
  type        = list(string)
  default     = ["bronze", "silver", "gold"]

  validation {
    condition     = length(var.filesystems) == length(distinct(var.filesystems))
    error_message = "filesystems must not contain duplicate names."
  }

  validation {
    condition     = alltrue([for f in var.filesystems : can(regex("^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$", f))])
    error_message = "Each filesystem name must be 3-63 chars, lowercase alphanumeric or hyphens, and start/end alphanumeric."
  }
}

variable "shared_access_key_enabled" {
  description = "Allow account-key/SAS auth. Disable to enforce Entra ID (AAD)-only access."
  type        = bool
  default     = false
}

variable "public_network_access_enabled" {
  description = "Allow public network access. Set false to require private endpoints / curated network rules."
  type        = bool
  default     = false
}

variable "infrastructure_encryption_enabled" {
  description = "Enable double (infrastructure) encryption at rest. Cannot be changed after creation."
  type        = bool
  default     = true
}

variable "network_bypass" {
  description = "Traffic allowed to bypass network rules when public access is disabled."
  type        = list(string)
  default     = ["AzureServices"]

  validation {
    condition     = alltrue([for b in var.network_bypass : contains(["AzureServices", "Logging", "Metrics", "None"], b)])
    error_message = "network_bypass values must be from: AzureServices, Logging, Metrics, None."
  }
}

variable "allowed_ip_rules" {
  description = "Public IP ranges (CIDR) allowed when network rules are enforced. Note: private IPs are not permitted here."
  type        = list(string)
  default     = []
}

variable "allowed_subnet_ids" {
  description = "Subnet resource IDs (with Microsoft.Storage service endpoint) allowed when network rules are enforced."
  type        = list(string)
  default     = []
}

variable "blob_soft_delete_retention_days" {
  description = "Days to retain soft-deleted blobs (1-365)."
  type        = number
  default     = 14

  validation {
    condition     = var.blob_soft_delete_retention_days >= 1 && var.blob_soft_delete_retention_days <= 365
    error_message = "blob_soft_delete_retention_days must be between 1 and 365."
  }
}

variable "container_soft_delete_retention_days" {
  description = "Days to retain soft-deleted containers/filesystems (1-365)."
  type        = number
  default     = 14

  validation {
    condition     = var.container_soft_delete_retention_days >= 1 && var.container_soft_delete_retention_days <= 365
    error_message = "container_soft_delete_retention_days must be between 1 and 365."
  }
}

variable "enable_lifecycle_policy" {
  description = "Attach a lifecycle management policy for hot -> cool -> archive -> delete tiering."
  type        = bool
  default     = true
}

variable "tier_to_cool_after_days" {
  description = "Days since last modification before transitioning a block blob to Cool."
  type        = number
  default     = 30
}

variable "tier_to_archive_after_days" {
  description = "Days since last modification before transitioning a block blob to Archive."
  type        = number
  default     = 120
}

variable "delete_after_days" {
  description = "Days since last modification before deleting a block blob."
  type        = number
  default     = 2555
}

variable "snapshot_delete_after_days" {
  description = "Days since creation before deleting a blob snapshot."
  type        = number
  default     = 90
}

variable "tags" {
  description = "Tags applied to the storage account."
  type        = map(string)
  default     = {}
}

# outputs.tf

output "storage_account_id" {
  description = "Resource ID of the HNS-enabled storage account."
  value       = azurerm_storage_account.this.id
}

output "storage_account_name" {
  description = "Name of the storage account."
  value       = azurerm_storage_account.this.name
}

output "dfs_endpoint" {
  description = "Primary Data Lake (DFS) endpoint, e.g. https://<name>.dfs.core.windows.net/."
  value       = azurerm_storage_account.this.primary_dfs_endpoint
}

output "primary_dfs_host" {
  description = "DFS host only (no scheme), useful for building abfss:// URIs."
  value       = azurerm_storage_account.this.primary_dfs_host
}

output "filesystem_ids" {
  description = "Map of filesystem name => resource ID for each created ADLS Gen2 filesystem."
  value       = { for k, fs in azurerm_storage_data_lake_gen2_filesystem.this : k => fs.id }
}

output "filesystem_names" {
  description = "List of created ADLS Gen2 filesystem names."
  value       = [for fs in azurerm_storage_data_lake_gen2_filesystem.this : fs.name]
}

output "identity_principal_id" {
  description = "Principal ID of the account's system-assigned managed identity (for CMK / RBAC grants)."
  value       = azurerm_storage_account.this.identity[0].principal_id
}
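The lifecycle variables above are validated individually at best, but nothing stops a caller from setting tier_to_archive_after_days below tier_to_cool_after_days. Variable validation blocks can only reference other variables in newer Terraform releases, so with the >= 1.5.0 pin one option is a precondition on a terraform_data resource. The following is a sketch of a possible addition to main.tf, not something the module ships; the resource name lifecycle_order_guard is made up for illustration.

```hcl
# Hypothetical guard: fails the plan when the tiering thresholds are out of order.
# terraform_data is built in (Terraform 1.4+) and creates nothing in Azure.
resource "terraform_data" "lifecycle_order_guard" {
  lifecycle {
    precondition {
      condition = !var.enable_lifecycle_policy || (
        var.tier_to_cool_after_days < var.tier_to_archive_after_days &&
        var.tier_to_archive_after_days < var.delete_after_days
      )
      error_message = "Expected tier_to_cool_after_days < tier_to_archive_after_days < delete_after_days."
    }
  }
}
```

With the defaults (30, 120, 2555) the condition holds; the guard is skipped entirely when enable_lifecycle_policy is false.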

How to use it

module "data_lake_storage_gen2" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"

  storage_account_name = "kvlakeprodcin01"
  resource_group_name  = azurerm_resource_group.data.name
  location             = "centralindia"

  # Zone-redundant for a production lake.
  replication_type = "ZRS"

  # Medallion zones exposed over abfss://.
  filesystems = ["bronze", "silver", "gold", "sandbox"]

  # Lock it down: Entra ID auth only, no public network.
  shared_access_key_enabled     = false
  public_network_access_enabled = false
  allowed_subnet_ids            = [azurerm_subnet.databricks_private.id]

  # Cost control for raw history.
  enable_lifecycle_policy    = true
  tier_to_cool_after_days    = 30
  tier_to_archive_after_days = 90
  delete_after_days          = 2555 # ~7 years retention for raw

  tags = {
    environment = "prod"
    workload    = "lakehouse"
    owner       = "data-platform"
  }
}

# Downstream: grant a Databricks access connector RBAC on the lake, and hand
# the bronze filesystem URI to a Databricks external location / mount config.
resource "azurerm_role_assignment" "databricks_lake_access" {
  scope                = module.data_lake_storage_gen2.storage_account_id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_databricks_access_connector.this.identity[0].principal_id
}

locals {
  bronze_abfss = "abfss://bronze@${module.data_lake_storage_gen2.primary_dfs_host}/"
}
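With public_network_access_enabled = false, clients typically reach the lake through a private endpoint on the dfs sub-resource. The module as listed does not create one, so the following is a sketch of what a caller might add next to the module block. The subnet azurerm_subnet.private_endpoints and the endpoint names are assumptions for illustration.

```hcl
# Private endpoint for the Data Lake (DFS) endpoint of the account created above.
# azurerm_subnet.private_endpoints is a hypothetical subnet defined elsewhere.
resource "azurerm_private_endpoint" "lake_dfs" {
  name                = "pe-kvlakeprodcin01-dfs"
  location            = azurerm_resource_group.data.location
  resource_group_name = azurerm_resource_group.data.name
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "psc-kvlakeprodcin01-dfs"
    private_connection_resource_id = module.data_lake_storage_gen2.storage_account_id
    subresource_names              = ["dfs"]
    is_manual_connection           = false
  }
}
```

Name resolution for the endpoint normally also needs a privatelink.dfs.core.windows.net private DNS zone linked to the VNet, which is left out here to keep the sketch short.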

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config: live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state bucket/container + key per path...
  }
}

2. Module config: live/prod/data_lake/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"
}

inputs = {
  storage_account_name = "..."
  resource_group_name  = "..."
  location             = "..."
}

3. Deploy one environment, or roll out all modules together:

# Each command assumes you start from the repository root.
cd live/prod/data_lake && terragrunt apply    # this module only
cd live/prod && terragrunt run-all apply      # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
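As an illustration of the per-environment override idea, a dev counterpart to the prod config might look like the sketch below. The path live/dev/data_lake/terragrunt.hcl and the specific dev values are assumptions; the input names come from the module's variables.

```hcl
# live/dev/data_lake/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-data-lake?ref=v1.0.0"
}

inputs = {
  storage_account_name = "..."
  resource_group_name  = "..."
  location             = "..."

  # Dev-only overrides: cheaper redundancy, shorter retention, no tiering.
  replication_type                     = "LRS"
  blob_soft_delete_retention_days      = 7
  container_soft_delete_retention_days = 7
  enable_lifecycle_policy              = false
}
```

Only the inputs block differs from prod, so the module source and version stay identical across environments.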

Inputs

Name | Type | Default | Required | Description
storage_account_name | string | (none) | Yes | Globally unique account name (3-24 chars, lowercase alphanumeric).
resource_group_name | string | (none) | Yes | Resource group to deploy into.
location | string | (none) | Yes | Azure region (e.g. centralindia).
replication_type | string | "ZRS" | No | LRS/ZRS/GRS/RAGRS/GZRS/RAGZRS.
filesystems | list(string) | ["bronze","silver","gold"] | No | ADLS Gen2 filesystem names to create.
shared_access_key_enabled | bool | false | No | Allow account-key/SAS auth; false enforces Entra ID-only.
public_network_access_enabled | bool | false | No | false enforces private endpoints / curated network rules.
infrastructure_encryption_enabled | bool | true | No | Double encryption at rest (immutable after creation).
network_bypass | list(string) | ["AzureServices"] | No | Bypass categories when network rules apply.
allowed_ip_rules | list(string) | [] | No | Public CIDR ranges allowed under network rules.
allowed_subnet_ids | list(string) | [] | No | Service-endpoint subnet IDs allowed under network rules.
blob_soft_delete_retention_days | number | 14 | No | Soft-delete retention for blobs (1-365).
container_soft_delete_retention_days | number | 14 | No | Soft-delete retention for containers (1-365).
enable_lifecycle_policy | bool | true | No | Attach hot→cool→archive→delete lifecycle policy.
tier_to_cool_after_days | number | 30 | No | Days before block blobs move to Cool.
tier_to_archive_after_days | number | 120 | No | Days before block blobs move to Archive.
delete_after_days | number | 2555 | No | Days before block blobs are deleted.
snapshot_delete_after_days | number | 90 | No | Days before snapshots are deleted.
tags | map(string) | {} | No | Tags applied to the storage account.

Outputs

Name | Description
storage_account_id | Resource ID of the HNS-enabled storage account.
storage_account_name | Name of the storage account.
dfs_endpoint | Primary DFS endpoint (https://<name>.dfs.core.windows.net/).
primary_dfs_host | DFS host only, for building abfss:// URIs.
filesystem_ids | Map of filesystem name => resource ID.
filesystem_names | List of created filesystem names.
identity_principal_id | Principal ID of the account’s system-assigned managed identity.
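The filesystem_names and primary_dfs_host outputs combine naturally into one abfss:// URI per filesystem. A small sketch, assuming the module block is named data_lake_storage_gen2 as in "How to use it"; the local name lake_abfss_uris is illustrative.

```hcl
# Build a map of filesystem name => abfss:// root URI from the module outputs.
locals {
  lake_abfss_uris = {
    for name in module.data_lake_storage_gen2.filesystem_names :
    name => "abfss://${name}@${module.data_lake_storage_gen2.primary_dfs_host}/"
  }
}
```

Downstream resources can then reference local.lake_abfss_uris["bronze"] instead of assembling the string by hand.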

Enterprise scenario

A retail analytics team runs a Databricks lakehouse on Azure and needs a governed landing zone for point-of-sale and clickstream data across regions. They call this module once per environment to provision kvlakeprodcin01 with bronze/silver/gold filesystems, public_network_access_enabled = false, and access pinned to the Databricks private subnet — so raw data only ever traverses the VNet. The lifecycle policy ages bronze POS files to Cool after 30 days and Archive after 90, cutting storage spend on multi-year history by roughly 60% while still meeting a 7-year retention mandate, and the storage_account_id output feeds a Storage Blob Data Contributor grant to the Databricks access connector so Unity Catalog external locations work without a single account key.
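The module's built-in policy applies to every block blob in the account. If the tiering in this scenario should touch only the bronze filesystem, one option is to set enable_lifecycle_policy = false on the module and manage a prefix-scoped policy alongside it, since a storage account carries a single management policy. The sketch below assumes that setup; the resource name bronze_only and the rule name are illustrative.

```hcl
# Assumes the module was called with enable_lifecycle_policy = false,
# because a storage account can hold only one management policy.
resource "azurerm_storage_management_policy" "bronze_only" {
  storage_account_id = module.data_lake_storage_gen2.storage_account_id

  rule {
    name    = "bronze-tier-and-expire"
    enabled = true

    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["bronze/"] # prefix starts with the filesystem (container) name
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = 30
        tier_to_archive_after_days_since_modification_greater_than = 90
        delete_after_days_since_modification_greater_than          = 2555
      }
    }
  }
}
```

Silver and gold data then stay in the hot tier unless further rules are added to the same policy.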

Best practices

Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.