Terraform Module: Azure Service Fabric Managed Cluster — production-grade microservices clusters without the ARM sprawl

Quick take: a reusable Terraform module (hashicorp/azurerm ~> 4.0) for azurerm_service_fabric_managed_cluster, with Standard SKU, separate primary and secondary node types, client certificate auth, and load-balancing rules wired up. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

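Minimal configuration: drop this in a .tf file, fill in the "..." placeholders (each required input is commented), and populate node_types with at least one node type marked primary = true, since an empty map fails the module's validation: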

provider "azurerm" {
  features {}
}

module "service_fabric" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-service-fabric?ref=v1.0.0"

  cluster_name        = "..."  # Cluster name + global DNS prefix (3–23 chars, alphanume…
  resource_group_name = "..."  # Resource group for the cluster.
  location            = "..."  # Azure region.
  node_types          = {}     # Node types; exactly one must be `primary = true`, each …
}

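Then run terraform init && terraform apply. The azurerm 4.x provider needs a subscription ID, so export ARM_SUBSCRIPTION_ID or set subscription_id in the provider block first. Every other input has a sensible default; see Inputs below to override behaviour.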

What this module is

Azure Service Fabric Managed Cluster is the second-generation, fully managed flavour of Service Fabric. Unlike the classic azurerm_service_fabric_cluster, where you hand-assemble the underlying VM scale sets, load balancer, public IP, NSGs and storage accounts yourself, the managed cluster hides all of that behind a single control-plane resource. You declare node types (each of which becomes a VMSS under the hood) and Azure owns the scale-set lifecycle, the load balancer, the reverse proxy and the certificate plumbing. The trade-off is that the resource model is opinionated: you get a Basic or Standard SKU, you authenticate with client certificates or Entra ID, and you open ports through cluster-level load-balancing rules rather than touching an NSG directly.

That opinionated surface is exactly why it deserves a module. The azurerm_service_fabric_managed_cluster resource has a long tail of fields that must be internally consistent — the SKU determines the minimum primary node-type count (Standard requires ≥ 5), the client_connection_port and http_gateway_port have to be reflected in your load-balancing rules, and at least one node type must be flagged primary = true. Wrapping it in a module lets you encode those invariants once with validation blocks, expose a small set of knobs (cluster name, SKU, node-type sizing, client thumbprints), and hand every team a cluster that is correct-by-construction instead of a 300-line copy-paste that drifts.

When to use it

You are running stateful or stateless microservices on Service Fabric (Reliable Services / Reliable Actors, or containers) and want the managed control plane rather than babysitting scale sets.
You are migrating off a classic Service Fabric cluster and want the v2 managed model with its simpler upgrade story and built-in reverse proxy.
You need multiple node types — for example a primary system-services node type plus one or more secondary workload node types with different VM sizes — provisioned consistently across dev, test and prod.
You want certificate- or Entra-based client authentication and Azure-managed load-balancing rules expressed as code, reviewed in a pull request, instead of clicked into the portal.
Reach for the classic azurerm_service_fabric_cluster resource instead only when you genuinely need to own the VMSS/NSG/LB resources (e.g. exotic networking) — most greenfield clusters in 2026 should be managed.

Module structure

terraform-module-azure-service-fabric/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # managed cluster + node types + LB rules
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # id, name, fqdn, node-type ids

versions.tf

terraform {
  required_version = ">= 1.6.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

main.tf

locals {
  # Service Fabric DNS names are global and must be lowercase + DNS-safe.
  cluster_dns_name = lower(var.cluster_name)

  # Standard SKU requires a 5-node primary node type for production durability.
  primary_min_nodes = var.sku == "Standard" ? 5 : 3
}

resource "azurerm_service_fabric_managed_cluster" "this" {
  name                = var.cluster_name
  resource_group_name = var.resource_group_name
  location            = var.location
  sku                 = var.sku

  dns_name = local.cluster_dns_name

  # Cluster management endpoints.
  client_connection_port = var.client_connection_port
  http_gateway_port      = var.http_gateway_port

  # Pin (or auto-upgrade) the runtime. "Wave" rings let Azure roll upgrades for you.
  upgrade_wave         = var.upgrade_wave
  cluster_code_version = var.cluster_code_version

  # Client authentication: certificate thumbprints and/or Entra ID.
  # entra_clients is an object (or null), so test it for null rather than with length().
  dynamic "authentication" {
    for_each = length(var.client_certificate_thumbprints) > 0 || var.entra_clients != null ? [1] : []
    content {
      dynamic "certificate" {
        for_each = var.client_certificate_thumbprints
        content {
          thumbprint = certificate.value.thumbprint
          # "AdminClient" can run management ops; "ReadOnlyClient" can only query.
          # The provider takes this label in `type`; the module's common_name field carries it.
          type = certificate.value.common_name
        }
      }

      dynamic "active_directory" {
        for_each = var.entra_clients != null ? [var.entra_clients] : []
        content {
          client_application_id  = active_directory.value.client_application_id
          cluster_application_id = active_directory.value.cluster_application_id
          tenant_id              = active_directory.value.tenant_id
        }
      }
    }
  }

  # Cluster-level load-balancing rules: this is how you expose app ports,
  # since a managed cluster owns its own Standard load balancer.
  dynamic "lb_rule" {
    for_each = var.lb_rules
    content {
      backend_port       = lb_rule.value.backend_port
      frontend_port      = lb_rule.value.frontend_port
      probe_protocol     = lb_rule.value.probe_protocol
      probe_request_path = lb_rule.value.probe_request_path
      protocol           = lb_rule.value.protocol
    }
  }

  tags = var.tags
}

# Each node type is a managed VM scale set. Exactly one must be primary.
resource "azurerm_service_fabric_managed_cluster_node_type" "this" {
  for_each = var.node_types

  name              = each.key
  cluster_id        = azurerm_service_fabric_managed_cluster.this.id
  primary           = each.value.primary
  vm_size           = each.value.vm_size
  vm_instance_count = each.value.vm_instance_count
  data_disk_size_gb = each.value.data_disk_size_gb
  data_disk_type    = each.value.data_disk_type

  vm_image_publisher = each.value.vm_image_publisher
  vm_image_offer     = each.value.vm_image_offer
  vm_image_sku       = each.value.vm_image_sku
  vm_image_version   = each.value.vm_image_version

  # Marks the node type as hosting stateless services only.
  stateless                         = each.value.stateless
  multiple_placement_groups_enabled = each.value.multiple_placement_groups_enabled

  application_port_range = each.value.application_port_range
  ephemeral_port_range   = each.value.ephemeral_port_range
}
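
Note that local.primary_min_nodes is computed in main.tf, but nothing in the listing above reads it, and the node_types validation only enforces 3 instances per node type. One way to close that gap, as a minimal sketch rather than part of the published module, is a guard resource with a precondition. The resource name primary_node_count_guard is hypothetical; terraform_data is built into Terraform 1.4 and later, so it fits the module's >= 1.6.0 pin.

```hcl
# Hypothetical guard (not in the original module): fails the plan when the
# primary node type is smaller than the minimum implied by the SKU.
resource "terraform_data" "primary_node_count_guard" {
  lifecycle {
    precondition {
      # Only the node type flagged primary is compared against the SKU minimum.
      condition = alltrue([
        for nt in var.node_types :
        nt.vm_instance_count >= local.primary_min_nodes if nt.primary
      ])
      error_message = "The primary node type needs at least ${local.primary_min_nodes} instances for the ${var.sku} SKU."
    }
  }
}
```

A precondition is used here instead of variable validation because, before Terraform 1.9, a validation block could only reference its own variable, and this check needs both var.sku and var.node_types.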

variables.tf

variable "cluster_name" {
  description = "Name of the Service Fabric managed cluster (also used as the global DNS prefix)."
  type        = string

  validation {
    condition     = can(regex("^[a-zA-Z0-9-]{3,23}$", var.cluster_name))
    error_message = "cluster_name must be 3-23 chars, alphanumeric or hyphen (it becomes a global DNS name)."
  }
}

variable "resource_group_name" {
  description = "Resource group that will hold the managed cluster."
  type        = string
}

variable "location" {
  description = "Azure region for the cluster."
  type        = string
}

variable "sku" {
  description = "Cluster SKU. Basic = dev/test (3-node primary), Standard = production (5-node primary, zone resilient)."
  type        = string
  default     = "Standard"

  validation {
    condition     = contains(["Basic", "Standard"], var.sku)
    error_message = "sku must be either \"Basic\" or \"Standard\"."
  }
}

variable "client_connection_port" {
  description = "TCP port the SF client/FabricClient connects on (default 19000)."
  type        = number
  default     = 19000
}

variable "http_gateway_port" {
  description = "HTTP gateway / Service Fabric Explorer port (default 19080)."
  type        = number
  default     = 19080
}

variable "upgrade_wave" {
  description = "Runtime upgrade ring: Wave0 (early), Wave1, or Wave2 (most conservative)."
  type        = string
  default     = "Wave1"

  validation {
    condition     = contains(["Wave0", "Wave1", "Wave2"], var.upgrade_wave)
    error_message = "upgrade_wave must be Wave0, Wave1, or Wave2."
  }
}

variable "cluster_code_version" {
  description = "Optional pinned Service Fabric runtime version. Leave null to let the wave manage it."
  type        = string
  default     = null
}

variable "client_certificate_thumbprints" {
  description = "Client certificates allowed to access the cluster. common_name carries the auth type label (AdminClient or ReadOnlyClient)."
  type = list(object({
    thumbprint  = string
    common_name = string
  }))
  default = []
}

variable "entra_clients" {
  description = "Optional Entra ID (Azure AD) authentication config for the cluster."
  type = object({
    client_application_id  = string
    cluster_application_id = string
    tenant_id              = string
  })
  default = null
}

variable "lb_rules" {
  description = "Cluster load-balancing rules used to expose application ports."
  type = list(object({
    backend_port       = number
    frontend_port      = number
    probe_protocol     = string
    probe_request_path = optional(string)
    protocol           = string
  }))
  default = []

  validation {
    condition = alltrue([
      for r in var.lb_rules : contains(["tcp", "udp"], lower(r.protocol))
    ])
    error_message = "Each lb_rule.protocol must be tcp or udp."
  }
}

variable "node_types" {
  description = "Map of node types. Exactly one must have primary = true."
  type = map(object({
    primary                           = bool
    vm_size                           = string
    vm_instance_count                 = number
    data_disk_size_gb                 = number
    data_disk_type                    = optional(string, "StandardSSD_LRS")
    vm_image_publisher                = optional(string, "MicrosoftWindowsServer")
    vm_image_offer                    = optional(string, "WindowsServer")
    vm_image_sku                      = optional(string, "2022-datacenter")
    vm_image_version                  = optional(string, "latest")
    stateless                         = optional(bool, false)
    multiple_placement_groups_enabled = optional(bool, false)
    application_port_range            = optional(string, "20000-30000")
    ephemeral_port_range              = optional(string, "49152-65534")
  }))

  validation {
    condition     = length([for nt in var.node_types : nt if nt.primary]) == 1
    error_message = "Exactly one node type must be marked primary = true."
  }

  validation {
    condition = alltrue([
      for nt in var.node_types : nt.vm_instance_count >= 3
    ])
    error_message = "Every node type needs at least 3 instances for a reliable cluster."
  }
}

variable "tags" {
  description = "Tags applied to the cluster."
  type        = map(string)
  default     = {}
}
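
The client_certificate_thumbprints variable above accepts any strings. A hedged sketch of a stricter declaration follows, meant as a drop-in alternative to the one in variables.tf rather than the module's actual code. It assumes thumbprints are SHA-1 (40 hexadecimal characters, as in the usage example) and that common_name only ever carries the two labels named in main.tf.

```hcl
variable "client_certificate_thumbprints" {
  description = "Client certificates allowed to access the cluster."
  type = list(object({
    thumbprint  = string
    common_name = string
  }))
  default = []

  # Assumption: SHA-1 thumbprints, i.e. exactly 40 hex characters.
  validation {
    condition = alltrue([
      for c in var.client_certificate_thumbprints :
      can(regex("^[0-9A-Fa-f]{40}$", c.thumbprint))
    ])
    error_message = "Each thumbprint must be a 40-character hexadecimal string."
  }

  # Assumption: common_name is used only as the auth type label.
  validation {
    condition = alltrue([
      for c in var.client_certificate_thumbprints :
      contains(["AdminClient", "ReadOnlyClient"], c.common_name)
    ])
    error_message = "Each common_name must be AdminClient or ReadOnlyClient."
  }
}
```

Both checks run at plan time, so a mistyped thumbprint is rejected before any Azure API call is made.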

outputs.tf

output "id" {
  description = "Resource ID of the Service Fabric managed cluster."
  value       = azurerm_service_fabric_managed_cluster.this.id
}

output "name" {
  description = "Name of the managed cluster."
  value       = azurerm_service_fabric_managed_cluster.this.name
}

output "dns_name" {
  description = "Global DNS name of the cluster."
  value       = azurerm_service_fabric_managed_cluster.this.dns_name
}

output "client_connection_endpoint" {
  description = "TCP endpoint a FabricClient uses to connect (dns_name:client_connection_port)."
  value       = "${azurerm_service_fabric_managed_cluster.this.dns_name}:${var.client_connection_port}"
}

output "management_endpoint" {
  description = "HTTPS management / Service Fabric Explorer endpoint."
  value       = "https://${azurerm_service_fabric_managed_cluster.this.dns_name}:${var.http_gateway_port}"
}

output "node_type_ids" {
  description = "Map of node-type name => resource ID."
  value       = { for k, nt in azurerm_service_fabric_managed_cluster_node_type.this : k => nt.id }
}

How to use it

module "service_fabric_managed_cluster" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-service-fabric?ref=v1.0.0"

  cluster_name        = "kvprod-sfmc"
  resource_group_name = azurerm_resource_group.platform.name
  location            = "centralindia"
  sku                 = "Standard"

  upgrade_wave = "Wave2" # most conservative ring for production

  client_certificate_thumbprints = [
    {
      thumbprint  = "A1B2C3D4E5F60718293A4B5C6D7E8F90A1B2C3D4"
      common_name = "AdminClient"
    }
  ]

  # Expose the ingress port of the platform's stateless gateway service.
  lb_rules = [
    {
      frontend_port      = 443
      backend_port       = 8443
      protocol           = "tcp"
      probe_protocol     = "http"
      probe_request_path = "/healthz"
    }
  ]

  node_types = {
    # Primary node type runs the system services — keep it Standard SSD + 5 nodes.
    system = {
      primary           = true
      vm_size           = "Standard_D4s_v5"
      vm_instance_count = 5
      data_disk_size_gb = 256
    }
    # Secondary stateless workload pool, sized independently.
    workload = {
      primary           = false
      stateless         = true
      vm_size           = "Standard_D8s_v5"
      vm_instance_count = 6
      data_disk_size_gb = 128
    }
  }

  tags = {
    environment = "prod"
    workload    = "payments-platform"
    owner       = "platform-engineering"
  }
}

# Downstream: publish the management endpoint to Key Vault so the
# deployment pipeline (sfctl / Azure DevOps) can discover the cluster.
resource "azurerm_key_vault_secret" "sf_management_endpoint" {
  name         = "sf-management-endpoint"
  value        = module.service_fabric_managed_cluster.management_endpoint
  key_vault_id = azurerm_key_vault.platform.id
}

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm storage account/container + key per path...
  }
}

2. Module config — live/prod/service_fabric/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-service-fabric?ref=v1.0.0"
}

inputs = {
  cluster_name        = "..."
  resource_group_name = "..."
  location            = "..."
  node_types          = {}
}

3. Deploy one environment, or roll out all modules together:

# Run either command from the repository root.
(cd live/prod/service_fabric && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)          # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
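
The root config shown in step 1 only generates the backend. A minimal sketch of the matching provider definition, using Terragrunt's generate block, could sit alongside it in live/terragrunt.hcl. The file name provider.tf is an assumption, and the subscription is expected to come from the ARM_SUBSCRIPTION_ID environment variable.

```hcl
# live/terragrunt.hcl (root): writes provider.tf into each module's working
# directory so the provider block is declared in exactly one place.
generate "provider" {
  path      = "provider.tf" # assumed file name
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "azurerm" {
  # Subscription ID is read from ARM_SUBSCRIPTION_ID.
  features {}
}
EOF
}
```

The overwrite_terragrunt setting only replaces files that Terragrunt itself generated, so a hand-written provider.tf in a module would raise an error instead of being silently overwritten.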

Inputs

Name	Type	Default	Required	Description
`cluster_name`	`string`	—	Yes	Cluster name + global DNS prefix (3–23 chars, alphanumeric/hyphen).
`resource_group_name`	`string`	—	Yes	Resource group for the cluster.
`location`	`string`	—	Yes	Azure region.
`sku`	`string`	`"Standard"`	No	`Basic` (dev/test) or `Standard` (production, zone resilient).
`client_connection_port`	`number`	`19000`	No	FabricClient TCP connection port.
`http_gateway_port`	`number`	`19080`	No	HTTP gateway / Service Fabric Explorer port.
`upgrade_wave`	`string`	`"Wave1"`	No	Runtime upgrade ring: `Wave0`, `Wave1`, or `Wave2`.
`cluster_code_version`	`string`	`null`	No	Pinned SF runtime version; `null` lets the wave manage it.
`client_certificate_thumbprints`	`list(object)`	`[]`	No	Client certs allowed to manage the cluster (`thumbprint`, `common_name`).
`entra_clients`	`object`	`null`	No	Entra ID auth (`client_application_id`, `cluster_application_id`, `tenant_id`).
`lb_rules`	`list(object)`	`[]`	No	Cluster load-balancing rules exposing app ports.
`node_types`	`map(object)`	—	Yes	Node types; exactly one must be `primary = true`, each ≥ 3 instances.
`tags`	`map(string)`	`{}`	No	Tags applied to the cluster.

Outputs

Name	Description
`id`	Resource ID of the Service Fabric managed cluster.
`name`	Name of the managed cluster.
`dns_name`	Global DNS name of the cluster.
`client_connection_endpoint`	TCP endpoint (`dns_name:client_connection_port`) for FabricClient.
`management_endpoint`	HTTPS management / Service Fabric Explorer endpoint.
`node_type_ids`	Map of node-type name to resource ID.

Enterprise scenario

A payments platform team runs a latency-sensitive transaction-routing service as stateless Reliable Services and needs strict isolation between the system services and the heavy workload pool. They consume this module once per environment from their landing-zone repo: Standard SKU with a 5-node system primary node type on Standard_D4s_v5, plus a 6-node stateless workload pool on Standard_D8s_v5 that they can scale independently during end-of-month settlement spikes. The management_endpoint output is written straight into the platform Key Vault, so the Azure DevOps release pipeline resolves the cluster with sfctl without anyone hand-copying an FQDN. Pinning upgrade_wave = "Wave2" keeps runtime upgrades on the most conservative ring while dev clusters ride Wave0 to surface breakage early.

Best practices

Use the Standard SKU with ≥ 5 primary nodes for anything production. Service Fabric places its own system services (Naming, Failover Manager, Cluster Manager) on the primary node type; a 3-node Basic primary cannot survive a node loss during an upgrade, so reserve Basic strictly for ephemeral dev/test clusters.
Separate system and workload node types. Keep applications off the primary node type by adding placement constraints and a dedicated secondary node type — this stops a noisy workload from starving the system services and lets you scale or use Spot/stateless capacity for workloads without touching the cluster’s stability.
Authenticate with named client certificates or Entra ID, never anonymously. Map an AdminClient cert for management and a ReadOnlyClient cert for dashboards, keep thumbprints in a per-environment variable file rather than hard-coding them in the module call, and rotate certs through the client_certificate_thumbprints input rather than the portal.
Pin the upgrade wave per environment. Ride Wave0 in dev to catch runtime regressions early and Wave2 in production for the slowest, most-vetted rollout; only set cluster_code_version explicitly when you must freeze a specific runtime for a compliance window.
Right-size data disks and prefer StandardSSD_LRS by default. Premium disks on every node type quietly inflate cost; reserve Premium_LRS for stateful node types with real IOPS needs and keep stateless pools on Standard SSD.
Name clusters DNS-first and tag for ownership. Because cluster_name becomes a global DNS label, adopt a <org><env>-sfmc convention (e.g. kvprod-sfmc) to avoid collisions, and always set environment/owner/workload tags so cost and incident routing work without guesswork.
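
The per-environment wave, naming, and tagging practices above can be derived from a single environment input in the calling root module. The following is a sketch under stated assumptions: var.environment, the lookup map, and the local names are all hypothetical and not part of the module; the kv prefix and tag values are taken from the usage example.

```hcl
# Hypothetical root-module input; not defined by the module itself.
variable "environment" {
  description = "Deployment environment: dev, test, or prod."
  type        = string

  validation {
    condition     = contains(["dev", "test", "prod"], var.environment)
    error_message = "environment must be dev, test, or prod."
  }
}

locals {
  # Earliest ring in dev, most conservative ring in prod.
  upgrade_wave_by_env = {
    dev  = "Wave0"
    test = "Wave1"
    prod = "Wave2"
  }

  # <org><env>-sfmc convention, e.g. kvprod-sfmc.
  cluster_name = "kv${var.environment}-sfmc"

  common_tags = {
    environment = var.environment
    owner       = "platform-engineering"
    workload    = "payments-platform"
  }
}
```

The module call would then take cluster_name = local.cluster_name, upgrade_wave = local.upgrade_wave_by_env[var.environment], and tags = local.common_tags, so the three conventions cannot drift apart between environments.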