Quick take: a reusable hashicorp/azurerm ~> 4.0 Terraform module for azurerm_service_fabric_managed_cluster, with Standard SKU, a primary node type plus secondary node types, client certificate auth, and load-balancing rules wired up. New here? Jump to the Quickstart below to deploy it; read on for how it works and when to reach for it.
Quickstart (copy-paste)
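Minimal configuration: drop this in a .tf file, fill in the "..." placeholders, and replace the empty node_types map with at least a primary node type (an empty map fails the module's exactly-one-primary validation). Each required input is commented: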
provider "azurerm" {
features {}
subscription_id = "..." # Required by azurerm 4.x; alternatively set ARM_SUBSCRIPTION_ID.
}
module "service_fabric" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-service-fabric?ref=v1.0.0"
cluster_name = "..." # Cluster name + global DNS prefix (3–23 chars, alphanume…
resource_group_name = "..." # Resource group for the cluster.
location = "..." # Azure region.
node_types = {} # Node types; exactly one must be `primary = true`, each …
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
Azure Service Fabric Managed Cluster is the second-generation, fully managed flavour of Service Fabric. Unlike the classic azurerm_service_fabric_cluster, where you hand-assemble the underlying VM scale sets, load balancer, public IP, NSGs and storage accounts yourself, the managed cluster hides all of that behind a single control-plane resource. You declare node types (each of which becomes a VMSS under the hood) and Azure owns the scale-set lifecycle, the load balancer, the reverse proxy and the certificate plumbing. The trade-off is that the resource model is opinionated: you get a Basic or Standard SKU, you authenticate with client certificates or Entra ID, and you open ports through cluster-level load-balancing rules rather than touching an NSG directly.
That opinionated surface is exactly why it deserves a module. The azurerm_service_fabric_managed_cluster resource has a long tail of fields that must be internally consistent: the SKU determines the minimum node count of the primary node type (Standard requires ≥ 5), the client_connection_port and http_gateway_port have to be reflected in your load-balancing rules, and exactly one node type must be flagged primary = true. Wrapping it in a module lets you encode those invariants once with validation blocks, expose a small set of knobs (cluster name, SKU, node-type sizing, client thumbprints), and hand every team a cluster that is correct-by-construction instead of a 300-line copy-paste that drifts.
When to use it
- You are running stateful or stateless microservices on Service Fabric (Reliable Services / Reliable Actors, or containers) and want the managed control plane rather than babysitting scale sets.
- You are migrating off a classic Service Fabric cluster and want the v2 managed model with its simpler upgrade story and built-in reverse proxy.
- You need multiple node types (for example a primary system-services node type plus one or more secondary workload node types with different VM sizes) provisioned consistently across dev, test and prod.
- You want certificate- or Entra-based client authentication and Azure-managed load-balancing rules expressed as code, reviewed in a pull request, instead of clicked into the portal.
- Reach for the classic azurerm_service_fabric_cluster resource instead only when you genuinely need to own the VMSS/NSG/LB resources (e.g. exotic networking); most greenfield clusters in 2026 should be managed.
Module structure
terraform-module-azure-service-fabric/
├── versions.tf # provider + Terraform version pins
├── main.tf # managed cluster + node types + LB rules
├── variables.tf # var-driven inputs with validation
└── outputs.tf # id, name, fqdn, node-type ids
versions.tf
terraform {
required_version = ">= 1.6.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 4.0"
}
}
}
main.tf
locals {
# Service Fabric DNS names are global and must be lowercase + DNS-safe.
cluster_dns_name = lower(var.cluster_name)
# Standard SKU requires a 5-node primary node type for production durability.
primary_min_nodes = var.sku == "Standard" ? 5 : 3
}
resource "azurerm_service_fabric_managed_cluster" "this" {
name = var.cluster_name
resource_group_name = var.resource_group_name
location = var.location
sku = var.sku
dns_name = local.cluster_dns_name
# Cluster management endpoints.
client_connection_port = var.client_connection_port
http_gateway_port = var.http_gateway_port
# Pin (or auto-upgrade) the runtime. "Wave" rings let Azure roll upgrades for you.
upgrade_wave = var.upgrade_wave
cluster_code_version = var.cluster_code_version
# Client authentication: certificates and/or Entra ID.
dynamic "authentication" {
# entra_clients is a nullable object, so test it against null rather than with length().
for_each = length(var.client_certificate_thumbprints) > 0 || var.entra_clients != null ? [1] : []
content {
dynamic "certificate" {
for_each = var.client_certificate_thumbprints
content {
thumbprint = certificate.value.thumbprint
# The provider's `type` argument takes "AdminClient" (management ops) or
# "ReadOnlyClient" (queries only); this module carries that label in common_name.
type = certificate.value.common_name
}
}
dynamic "active_directory" {
for_each = var.entra_clients != null ? [var.entra_clients] : []
content {
client_application_id = active_directory.value.client_application_id
cluster_application_id = active_directory.value.cluster_application_id
tenant_id = active_directory.value.tenant_id
}
}
}
}
# Cluster-level load-balancing rules: this is how you expose app ports,
# since a managed cluster owns its own Standard load balancer.
dynamic "lb_rule" {
for_each = var.lb_rules
content {
backend_port = lb_rule.value.backend_port
frontend_port = lb_rule.value.frontend_port
probe_protocol = lb_rule.value.probe_protocol
probe_request_path = lb_rule.value.probe_request_path
protocol = lb_rule.value.protocol
}
}
tags = var.tags
}
# Each node type is a managed VM scale set. Exactly one must be primary.
resource "azurerm_service_fabric_managed_cluster_node_type" "this" {
for_each = var.node_types
name = each.key
cluster_id = azurerm_service_fabric_managed_cluster.this.id
primary = each.value.primary
vm_size = each.value.vm_size
vm_instance_count = each.value.vm_instance_count
data_disk_size_gb = each.value.data_disk_size_gb
data_disk_type = each.value.data_disk_type
vm_image_publisher = each.value.vm_image_publisher
vm_image_offer = each.value.vm_image_offer
vm_image_sku = each.value.vm_image_sku
vm_image_version = each.value.vm_image_version
# Stateless node types can be backed by Spot capacity to cut cost.
stateless = each.value.stateless
multiple_placement_groups_enabled = each.value.multiple_placement_groups_enabled
application_port_range = each.value.application_port_range
ephemeral_port_range = each.value.ephemeral_port_range
}
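The primary_min_nodes local in main.tf is computed but not referenced by either resource, so the Standard-needs-5-nodes rule is not actually enforced as written. One way to enforce it is a precondition, sketched below. The terraform_data resource and its name primary_node_count_guard are assumptions introduced for illustration, not part of the original module; the guard only reads var.node_types, var.sku and the existing local.

```hcl
# Hypothetical guard resource (not in the original module): it creates
# nothing in Azure and exists only to fail the plan early.
resource "terraform_data" "primary_node_count_guard" {
  lifecycle {
    precondition {
      # Only node types flagged primary are checked; secondary pools are
      # still covered by the ">= 3 instances" validation in variables.tf.
      condition = alltrue([
        for nt in var.node_types :
        nt.vm_instance_count >= local.primary_min_nodes if nt.primary
      ])
      error_message = "The primary node type needs at least ${local.primary_min_nodes} instances for the ${var.sku} SKU."
    }
  }
}
```

A precondition is used here rather than a variable validation block because it can reference both a local and a second variable on the Terraform versions that versions.tf allows (>= 1.6.0).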
variables.tf
variable "cluster_name" {
description = "Name of the Service Fabric managed cluster (also used as the global DNS prefix)."
type = string
validation {
condition = can(regex("^[a-zA-Z0-9-]{3,23}$", var.cluster_name))
error_message = "cluster_name must be 3-23 chars, alphanumeric or hyphen (it becomes a global DNS name)."
}
}
variable "resource_group_name" {
description = "Resource group that will hold the managed cluster."
type = string
}
variable "location" {
description = "Azure region for the cluster."
type = string
}
variable "sku" {
description = "Cluster SKU. Basic = dev/test (3-node primary), Standard = production (5-node primary, zone resilient)."
type = string
default = "Standard"
validation {
condition = contains(["Basic", "Standard"], var.sku)
error_message = "sku must be either \"Basic\" or \"Standard\"."
}
}
variable "client_connection_port" {
description = "TCP port the SF client/FabricClient connects on (default 19000)."
type = number
default = 19000
}
variable "http_gateway_port" {
description = "HTTP gateway / Service Fabric Explorer port (default 19080)."
type = number
default = 19080
}
variable "upgrade_wave" {
description = "Runtime upgrade ring: Wave0 (early), Wave1, or Wave2 (most conservative)."
type = string
default = "Wave1"
validation {
condition = contains(["Wave0", "Wave1", "Wave2"], var.upgrade_wave)
error_message = "upgrade_wave must be Wave0, Wave1, or Wave2."
}
}
variable "cluster_code_version" {
description = "Optional pinned Service Fabric runtime version. Leave null to let the wave manage it."
type = string
default = null
}
variable "client_certificate_thumbprints" {
description = "Client certificates allowed to connect to the cluster. common_name carries the role label: AdminClient or ReadOnlyClient."
type = list(object({
thumbprint = string
common_name = string
}))
default = []
}
variable "entra_clients" {
description = "Optional Entra ID (Azure AD) authentication config for the cluster."
type = object({
client_application_id = string
cluster_application_id = string
tenant_id = string
})
default = null
}
variable "lb_rules" {
description = "Cluster load-balancing rules used to expose application ports."
type = list(object({
backend_port = number
frontend_port = number
probe_protocol = string
probe_request_path = optional(string)
protocol = string
}))
default = []
validation {
condition = alltrue([
for r in var.lb_rules : contains(["tcp", "udp"], lower(r.protocol))
])
error_message = "Each lb_rule.protocol must be tcp or udp."
}
}
variable "node_types" {
description = "Map of node types. Exactly one must have primary = true."
type = map(object({
primary = bool
vm_size = string
vm_instance_count = number
data_disk_size_gb = number
data_disk_type = optional(string, "StandardSSD_LRS")
vm_image_publisher = optional(string, "MicrosoftWindowsServer")
vm_image_offer = optional(string, "WindowsServer")
vm_image_sku = optional(string, "2022-datacenter")
vm_image_version = optional(string, "latest")
stateless = optional(bool, false)
multiple_placement_groups_enabled = optional(bool, false)
application_port_range = optional(string, "20000-30000")
ephemeral_port_range = optional(string, "49152-65534")
}))
validation {
condition = length([for nt in var.node_types : nt if nt.primary]) == 1
error_message = "Exactly one node type must be marked primary = true."
}
validation {
condition = alltrue([
for nt in var.node_types : nt.vm_instance_count >= 3
])
error_message = "Every node type needs at least 3 instances for a reliable cluster."
}
}
variable "tags" {
description = "Tags applied to the cluster."
type = map(string)
default = {}
}
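The client_certificate_thumbprints variable above accepts any strings. If you want malformed thumbprints or unknown role labels rejected at plan time, a stricter version of the same variable could look like the sketch below. It assumes SHA-1 thumbprints (40 hexadecimal characters) and that common_name only ever holds AdminClient or ReadOnlyClient, as the variable's description implies; both validation blocks are additions, not part of the original module.

```hcl
variable "client_certificate_thumbprints" {
  description = "Client certificates allowed to manage the cluster. common_name acts as the auth type label."
  type = list(object({
    thumbprint  = string
    common_name = string
  }))
  default = []

  # Assumption: thumbprints are SHA-1, i.e. exactly 40 hex characters.
  validation {
    condition = alltrue([
      for c in var.client_certificate_thumbprints :
      can(regex("^[0-9A-Fa-f]{40}$", c.thumbprint))
    ])
    error_message = "Each thumbprint must be 40 hexadecimal characters."
  }

  # Assumption: common_name is used purely as the role label.
  validation {
    condition = alltrue([
      for c in var.client_certificate_thumbprints :
      contains(["AdminClient", "ReadOnlyClient"], c.common_name)
    ])
    error_message = "common_name must be AdminClient or ReadOnlyClient."
  }
}
```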
outputs.tf
output "id" {
description = "Resource ID of the Service Fabric managed cluster."
value = azurerm_service_fabric_managed_cluster.this.id
}
output "name" {
description = "Name of the managed cluster."
value = azurerm_service_fabric_managed_cluster.this.name
}
output "dns_name" {
description = "Global DNS name of the cluster."
value = azurerm_service_fabric_managed_cluster.this.dns_name
}
output "client_connection_endpoint" {
description = "TCP endpoint a FabricClient uses to connect (dns_name:client_connection_port)."
value = "${azurerm_service_fabric_managed_cluster.this.dns_name}:${var.client_connection_port}"
}
output "management_endpoint" {
description = "HTTPS management / Service Fabric Explorer endpoint."
value = "https://${azurerm_service_fabric_managed_cluster.this.dns_name}:${var.http_gateway_port}"
}
output "node_type_ids" {
description = "Map of node-type name => resource ID."
value = { for k, nt in azurerm_service_fabric_managed_cluster_node_type.this : k => nt.id }
}
How to use it
module "service_fabric_managed_cluster" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-service-fabric?ref=v1.0.0"
cluster_name = "kvprod-sfmc"
resource_group_name = azurerm_resource_group.platform.name
location = "centralindia"
sku = "Standard"
upgrade_wave = "Wave2" # most conservative ring for production
client_certificate_thumbprints = [
{
thumbprint = "A1B2C3D4E5F60718293A4B5C6D7E8F90A1B2C3D4"
common_name = "AdminClient"
}
]
# Expose the ingress port of the platform's stateless gateway service.
lb_rules = [
{
frontend_port = 443
backend_port = 8443
protocol = "tcp"
probe_protocol = "http"
probe_request_path = "/healthz"
}
]
node_types = {
# Primary node type runs the system services — keep it Standard SSD + 5 nodes.
system = {
primary = true
vm_size = "Standard_D4s_v5"
vm_instance_count = 5
data_disk_size_gb = 256
}
# Secondary stateless workload pool, sized independently.
workload = {
primary = false
stateless = true
vm_size = "Standard_D8s_v5"
vm_instance_count = 6
data_disk_size_gb = 128
}
}
tags = {
environment = "prod"
workload = "payments-platform"
owner = "platform-engineering"
}
}
# Downstream: publish the management endpoint to Key Vault so the
# deployment pipeline (sfctl / Azure DevOps) can discover the cluster.
resource "azurerm_key_vault_secret" "sf_management_endpoint" {
name = "sf-management-endpoint"
value = module.service_fabric_managed_cluster.management_endpoint
key_vault_id = azurerm_key_vault.platform.id
}
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "azurerm"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...azurerm state bucket/container + key per path...
}
}
2. Module config — live/prod/service_fabric/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-service-fabric?ref=v1.0.0"
}
inputs = {
cluster_name = "..."
resource_group_name = "..."
location = "..."
node_types = {}
}
3. Deploy one environment, or roll out all modules together:
(cd live/prod/service_fabric && terragrunt apply) # this module only
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
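The root config shown earlier declares only remote_state, so the provider still has to come from somewhere. A Terragrunt generate block is one common way to keep it in the root file. The sketch below assumes the subscription ID is supplied through the ARM_SUBSCRIPTION_ID environment variable; the generated file name provider.tf is likewise an assumption.

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "azurerm" {
  features {}
  # azurerm 4.x needs a subscription ID; this sketch assumes it is read
  # from the ARM_SUBSCRIPTION_ID environment variable.
}
EOF
}
```

Terragrunt writes provider.tf into each module's working directory before running Terraform, so the module source itself stays free of provider blocks.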
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| cluster_name | string | — | Yes | Cluster name + global DNS prefix (3–23 chars, alphanumeric/hyphen). |
| resource_group_name | string | — | Yes | Resource group for the cluster. |
| location | string | — | Yes | Azure region. |
| sku | string | "Standard" | No | Basic (dev/test) or Standard (production, zone resilient). |
| client_connection_port | number | 19000 | No | FabricClient TCP connection port. |
| http_gateway_port | number | 19080 | No | HTTP gateway / Service Fabric Explorer port. |
| upgrade_wave | string | "Wave1" | No | Runtime upgrade ring: Wave0, Wave1, or Wave2. |
| cluster_code_version | string | null | No | Pinned SF runtime version; null lets the wave manage it. |
| client_certificate_thumbprints | list(object) | [] | No | Client certs allowed to manage the cluster (thumbprint, common_name). |
| entra_clients | object | null | No | Entra ID auth (client_application_id, cluster_application_id, tenant_id). |
| lb_rules | list(object) | [] | No | Cluster load-balancing rules exposing app ports. |
| node_types | map(object) | — | Yes | Node types; exactly one must be primary = true, each ≥ 3 instances. |
| tags | map(string) | {} | No | Tags applied to the cluster. |
Outputs
| Name | Description |
|---|---|
| id | Resource ID of the Service Fabric managed cluster. |
| name | Name of the managed cluster. |
| dns_name | Global DNS name of the cluster. |
| client_connection_endpoint | TCP endpoint (dns_name:client_connection_port) for FabricClient. |
| management_endpoint | HTTPS management / Service Fabric Explorer endpoint. |
| node_type_ids | Map of node-type name to resource ID. |
Enterprise scenario
A payments platform team runs a latency-sensitive transaction-routing service as stateless Reliable Services and needs strict isolation between the system services and the heavy workload pool. They consume this module once per environment from their landing-zone repo: Standard SKU with a 5-node system primary node type on Standard_D4s_v5, plus a 6-node stateless workload pool on Standard_D8s_v5 that they can scale independently during end-of-month settlement spikes. The management_endpoint output is written straight into the platform Key Vault, so the Azure DevOps release pipeline resolves the cluster with sfctl without anyone hand-copying an FQDN. Pinning upgrade_wave = "Wave2" keeps production runtime upgrades on the most conservative ring, while dev clusters ride Wave0 to surface breakage early.
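For contrast, the dev cluster that rides Wave0 in this scenario might be declared roughly as follows. This is a sketch: the module label, cluster name, resource group name, VM size, disk size and the dev_admin_thumbprint variable are all hypothetical, chosen only to satisfy the module's validations (Basic SKU, one primary node type, at least 3 instances).

```hcl
# Hypothetical input: thumbprint of the dev admin client certificate.
variable "dev_admin_thumbprint" {
  type = string
}

module "service_fabric_dev" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-service-fabric?ref=v1.0.0"

  cluster_name        = "kvdev-sfmc"      # hypothetical name
  resource_group_name = "rg-platform-dev" # hypothetical resource group
  location            = "centralindia"
  sku                 = "Basic" # dev/test only
  upgrade_wave        = "Wave0" # earliest ring, surfaces breakage first

  client_certificate_thumbprints = [
    {
      thumbprint  = var.dev_admin_thumbprint
      common_name = "AdminClient"
    }
  ]

  node_types = {
    # A single 3-node primary node type is enough for an ephemeral dev cluster.
    system = {
      primary           = true
      vm_size           = "Standard_D2s_v5"
      vm_instance_count = 3
      data_disk_size_gb = 128
    }
  }
}
```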
Best practices
- Use the Standard SKU with ≥ 5 primary nodes for anything production. Service Fabric places its own system services (Naming, Failover Manager, Cluster Manager) on the primary node type; a 3-node Basic primary cannot survive a node loss during an upgrade, so reserve Basic strictly for ephemeral dev/test clusters.
- Separate system and workload node types. Keep applications off the primary node type by adding placement constraints and a dedicated secondary node type — this stops a noisy workload from starving the system services and lets you scale or use Spot/stateless capacity for workloads without touching the cluster’s stability.
- Authenticate with named client certificates or Entra ID, never anonymously. Map an AdminClient cert for management and a ReadOnlyClient cert for dashboards, store thumbprints in a variable file rather than inline in the cluster definition, and rotate certs through the client_certificate_thumbprints input rather than the portal.
- Pin the upgrade wave per environment. Ride Wave0 in dev to catch runtime regressions early and Wave2 in production for the slowest, most-vetted rollout; only set cluster_code_version explicitly when you must freeze a specific runtime for a compliance window.
- Right-size data disks and prefer StandardSSD_LRS by default. Premium disks on every node type quietly inflate cost; reserve Premium_LRS for stateful node types with real IOPS needs and keep stateless pools on Standard SSD.
- Name clusters DNS-first and tag for ownership. Because cluster_name becomes a global DNS label, adopt a <org><env>-sfmc convention (e.g. kvprod-sfmc) to avoid collisions, and always set environment / owner / workload tags so cost and incident routing work without guesswork.
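The disk guidance above maps onto the module's per-node-type data_disk_type attribute. The sketch below shows one way to lay it out; the node-type names, VM sizes, instance counts and disk sizes are illustrative assumptions, and the local would be passed to the module as node_types = local.node_types.

```hcl
locals {
  node_types = {
    # System services: module default disk type (StandardSSD_LRS).
    system = {
      primary           = true
      vm_size           = "Standard_D4s_v5"
      vm_instance_count = 5
      data_disk_size_gb = 256
    }
    # Stateful pool with real IOPS needs: the only place Premium is paid for.
    stateful = {
      primary           = false
      vm_size           = "Standard_D8s_v5"
      vm_instance_count = 5
      data_disk_size_gb = 512
      data_disk_type    = "Premium_LRS"
    }
    # Stateless pool: stays on the StandardSSD_LRS default.
    stateless = {
      primary           = false
      stateless         = true
      vm_size           = "Standard_D8s_v5"
      vm_instance_count = 6
      data_disk_size_gb = 128
    }
  }
}
```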