Quick take — A reusable hashicorp/azurerm ~> 4.0 Terraform module for Azure Kubernetes Service: system/user node pools, cluster autoscaler, Workload Identity, Azure CNI Overlay, and Entra-integrated RBAC. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "azurerm" {
features {}
}
module "aks" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-aks?ref=v1.0.0"
cluster_name = "..." # Cluster name; also the DNS prefix if `dns_prefix` is un…
location = "..." # Azure region for the control plane.
resource_group_name = "..." # Existing resource group to deploy into.
environment = "..." # One of `dev`, `stg`, `prod`, `sandbox`; applied as a ta…
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
Azure Kubernetes Service (AKS) is the managed Kubernetes offering on Azure: Microsoft runs the control plane (API server, etcd, scheduler), the Free tier carries no control-plane charge and no SLA, and you pay for the worker nodes plus the Standard or Premium tier if you want the uptime SLA. A raw azurerm_kubernetes_cluster resource, however, has well over a hundred arguments and a dozen nested blocks (identity, network profile, default node pool, auto-scaler profile, Entra integration, add-ons, maintenance windows), and getting them subtly wrong (kubenet instead of CNI Overlay, local accounts left enabled, a non-zonal system pool) is how clusters end up insecure or impossible to scale later.
This module wraps azurerm_kubernetes_cluster plus a separate azurerm_kubernetes_cluster_node_pool for user workloads into a single opinionated, var-driven unit. It defaults to the patterns you actually want in production: a SystemAssigned (or user-assigned) managed identity instead of a service principal, Azure CNI Overlay networking, the cluster autoscaler on the default node pool, Workload Identity with the OIDC issuer enabled, Entra ID RBAC with local accounts disabled, and an automatic upgrade channel with a weekly maintenance window. It still exposes the knobs (Kubernetes version, VM SKU, node counts, network plugin, SKU tier) that teams legitimately need to differentiate dev from prod.
When to use it
- You are standing up multiple AKS clusters (per environment, per region, per team) and want them provisioned identically rather than hand-edited.
- You need clusters that are secure by default: managed identity, disabled local Kubernetes accounts, Entra group-based admin, and Workload Identity so pods get tokens without secrets.
- You want elastic, zonal node pools — a small autoscaling system pool for add-ons and a separate user pool for application workloads — without rewriting the same 80 lines of HCL each time.
- You are running GitOps or platform engineering and want the cluster’s OIDC issuer URL, kubelet identity, and node resource group exposed as outputs for downstream Workload Identity federation, ACR role assignments, and DNS wiring.
If you only need a throwaway single-node sandbox, the raw resource is fine. Reach for this module the moment a cluster needs to survive an audit or scale under load.
Module structure
terraform-module-azure-aks/
├── versions.tf # provider + Terraform version pinning
├── main.tf # azurerm_kubernetes_cluster + user node pool
├── variables.tf # var-driven inputs with validation
└── outputs.tf # id/name, OIDC issuer, kubelet identity, kube_config
versions.tf
terraform {
required_version = ">= 1.6.0"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 4.0"
}
}
}
main.tf
locals {
# AKS DNS prefix must be 1-54 chars, alphanumeric or hyphen, start/end alphanumeric.
dns_prefix = coalesce(var.dns_prefix, replace(var.cluster_name, "_", "-"))
tags = merge(
{
managed_by = "terraform"
module = "terraform-module-azure-aks"
environment = var.environment
},
var.tags,
)
}
resource "azurerm_kubernetes_cluster" "this" {
name = var.cluster_name
location = var.location
resource_group_name = var.resource_group_name
dns_prefix = local.dns_prefix
kubernetes_version = var.kubernetes_version
sku_tier = var.sku_tier
automatic_upgrade_channel = var.automatic_upgrade_channel
node_os_upgrade_channel = var.node_os_upgrade_channel
# Node resource group (the MC_* RG that holds VMSS, NICs, disks).
node_resource_group = var.node_resource_group
# Hardening: no local "clusterAdmin" kubeconfig, Entra ID is the only auth path.
local_account_disabled = true
oidc_issuer_enabled = true
workload_identity_enabled = true
default_node_pool {
name = var.system_node_pool.name
vm_size = var.system_node_pool.vm_size
orchestrator_version = coalesce(var.system_node_pool.orchestrator_version, var.kubernetes_version)
os_sku = var.system_node_pool.os_sku
zones = var.system_node_pool.zones
vnet_subnet_id = var.system_node_pool.vnet_subnet_id
# Cluster autoscaler bounds for the system pool.
auto_scaling_enabled = true
min_count = var.system_node_pool.min_count
max_count = var.system_node_pool.max_count
# CriticalAddonsOnly taint keeps app pods off the system pool.
only_critical_addons_enabled = var.system_node_pool.only_critical_addons_enabled
max_pods = var.system_node_pool.max_pods
node_labels = var.system_node_pool.node_labels
upgrade_settings {
max_surge = var.system_node_pool.max_surge
}
tags = local.tags
}
identity {
type = var.identity_type
identity_ids = var.identity_type == "UserAssigned" ? var.identity_ids : null
}
network_profile {
network_plugin = var.network_profile.network_plugin
network_plugin_mode = var.network_profile.network_plugin_mode
network_policy = var.network_profile.network_policy
network_data_plane = var.network_profile.network_data_plane
pod_cidr = var.network_profile.pod_cidr
service_cidr = var.network_profile.service_cidr
dns_service_ip = var.network_profile.dns_service_ip
load_balancer_sku = "standard"
outbound_type = var.network_profile.outbound_type
}
# Entra ID (AAD) managed RBAC: cluster-admin granted only to these object IDs.
azure_active_directory_role_based_access_control {
azure_rbac_enabled = var.azure_rbac_enabled
admin_group_object_ids = var.admin_group_object_ids
tenant_id = var.tenant_id
}
# Key Vault Secrets Store CSI driver add-on: mounts Key Vault secrets into pods and rotates them.
dynamic "key_vault_secrets_provider" {
for_each = var.enable_secrets_store_csi ? [1] : []
content {
secret_rotation_enabled = true
secret_rotation_interval = "2m"
}
}
dynamic "microsoft_defender" {
for_each = var.log_analytics_workspace_id != null && var.enable_defender ? [1] : []
content {
log_analytics_workspace_id = var.log_analytics_workspace_id
}
}
dynamic "oms_agent" {
for_each = var.log_analytics_workspace_id != null ? [1] : []
content {
log_analytics_workspace_id = var.log_analytics_workspace_id
msi_auth_for_monitoring_enabled = true
}
}
auto_scaler_profile {
balance_similar_node_groups = true
expander = "least-waste"
scale_down_unneeded = "10m"
scale_down_delay_after_add = "10m"
}
maintenance_window_auto_upgrade {
frequency = "Weekly"
interval = 1
day_of_week = var.maintenance_day_of_week
start_time = var.maintenance_start_time
utc_offset = "+00:00"
duration = 4
}
tags = local.tags
lifecycle {
ignore_changes = [
# Autoscaler manages live node count; don't let plan revert it.
default_node_pool[0].node_count,
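# The auto-upgrade channel moves the control-plane version out of band; ignoring it avoids perpetual drift.
kubernetes_version,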
]
}
}
# Dedicated user node pool(s) for application workloads.
resource "azurerm_kubernetes_cluster_node_pool" "user" {
for_each = var.user_node_pools
name = each.value.name
kubernetes_cluster_id = azurerm_kubernetes_cluster.this.id
vm_size = each.value.vm_size
orchestrator_version = coalesce(each.value.orchestrator_version, var.kubernetes_version)
os_sku = each.value.os_sku
mode = "User"
zones = each.value.zones
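# coalesce() errors when every argument is null, so fall back to null (AKS-managed VNet) explicitly.
vnet_subnet_id = try(coalesce(each.value.vnet_subnet_id, var.system_node_pool.vnet_subnet_id), null)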
auto_scaling_enabled = true
min_count = each.value.min_count
max_count = each.value.max_count
max_pods = each.value.max_pods
node_labels = each.value.node_labels
node_taints = each.value.node_taints
priority = each.value.priority
spot_max_price = each.value.priority == "Spot" ? each.value.spot_max_price : null
eviction_policy = each.value.priority == "Spot" ? "Delete" : null
upgrade_settings {
max_surge = each.value.max_surge
}
tags = local.tags
}
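The comment in main.tf notes that an AKS DNS prefix must be 1-54 characters and start and end with an alphanumeric, while local.dns_prefix only swaps underscores for hyphens and cluster_name may run to 63 characters. A minimal sketch of a stricter derivation follows. The derived_dns_prefix local and the sample default are assumptions for illustration, not part of the module as published.

```hcl
variable "cluster_name" {
  type    = string
  default = "aks_platform_prod_weu" # sample value for illustration only
}

variable "dns_prefix" {
  type    = string
  default = null
}

locals {
  # Hypothetical helper: swap underscores for hyphens, cap the result at
  # 54 characters, then strip any hyphens left dangling at the cut point.
  derived_dns_prefix = trim(substr(replace(var.cluster_name, "_", "-"), 0, 54), "-")

  # An explicit dns_prefix still wins over the derived value.
  dns_prefix = coalesce(var.dns_prefix, local.derived_dns_prefix)
}

output "dns_prefix" {
  value = local.dns_prefix
}
```

An alternative is to leave the local untouched and add a length check to the dns_prefix and cluster_name validations, which fails early instead of silently truncating.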
variables.tf
variable "cluster_name" {
description = "Name of the AKS cluster (also used for the DNS prefix if dns_prefix is unset)."
type = string
validation {
condition = can(regex("^[a-zA-Z0-9]([a-zA-Z0-9_-]{0,61}[a-zA-Z0-9])?$", var.cluster_name))
error_message = "cluster_name must be 1-63 chars, alphanumeric/underscore/hyphen, starting and ending alphanumeric."
}
}
variable "location" {
description = "Azure region for the cluster control plane."
type = string
}
variable "resource_group_name" {
description = "Name of the existing resource group to deploy the cluster into."
type = string
}
variable "environment" {
description = "Environment short name (e.g. dev, stg, prod). Applied as a tag."
type = string
validation {
condition = contains(["dev", "stg", "prod", "sandbox"], var.environment)
error_message = "environment must be one of: dev, stg, prod, sandbox."
}
}
variable "dns_prefix" {
description = "DNS prefix for the cluster API server FQDN. Defaults to cluster_name with underscores replaced by hyphens."
type = string
default = null
}
variable "kubernetes_version" {
description = "Kubernetes minor version (e.g. \"1.31\"). Pin a minor and let the upgrade channel handle patches."
type = string
default = "1.31"
validation {
condition = can(regex("^1\\.(2[89]|3[0-9])(\\.[0-9]+)?$", var.kubernetes_version))
error_message = "kubernetes_version must be a supported 1.28+ version, e.g. \"1.31\" or \"1.31.3\"."
}
}
variable "sku_tier" {
description = "Control plane SKU. \"Standard\" adds the uptime SLA (99.95% with availability zones, 99.9% without); \"Free\" has no SLA; \"Premium\" adds long-term support."
type = string
default = "Standard"
validation {
condition = contains(["Free", "Standard", "Premium"], var.sku_tier)
error_message = "sku_tier must be Free, Standard, or Premium."
}
}
variable "automatic_upgrade_channel" {
description = "Cluster auto-upgrade channel: patch, rapid, stable, node-image, or null to disable."
type = string
default = "patch"
validation {
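# The conditional avoids calling contains() with null; Terraform has not always short-circuited ||.
condition = var.automatic_upgrade_channel == null ? true : contains(["patch", "rapid", "stable", "node-image"], var.automatic_upgrade_channel)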
error_message = "automatic_upgrade_channel must be one of patch, rapid, stable, node-image, or null."
}
}
variable "node_os_upgrade_channel" {
description = "Node OS image auto-upgrade channel: NodeImage, SecurityPatch, Unmanaged, or None."
type = string
default = "NodeImage"
validation {
condition = contains(["NodeImage", "SecurityPatch", "Unmanaged", "None"], var.node_os_upgrade_channel)
error_message = "node_os_upgrade_channel must be NodeImage, SecurityPatch, Unmanaged, or None."
}
}
variable "node_resource_group" {
description = "Name of the auto-created MC_* node resource group. Null lets AKS choose the default name."
type = string
default = null
}
variable "identity_type" {
description = "Cluster control-plane identity type: SystemAssigned or UserAssigned."
type = string
default = "SystemAssigned"
validation {
condition = contains(["SystemAssigned", "UserAssigned"], var.identity_type)
error_message = "identity_type must be SystemAssigned or UserAssigned."
}
}
variable "identity_ids" {
description = "User-assigned managed identity resource IDs. Required when identity_type is UserAssigned."
type = list(string)
default = []
}
variable "azure_rbac_enabled" {
description = "Enable Azure RBAC for Kubernetes authorization (manage k8s RBAC via Azure role assignments)."
type = bool
default = true
}
variable "admin_group_object_ids" {
description = "Entra ID group object IDs granted cluster-admin via managed AAD integration."
type = list(string)
default = []
}
variable "tenant_id" {
description = "Entra ID tenant ID for AAD RBAC. Null defaults to the current subscription tenant."
type = string
default = null
}
variable "system_node_pool" {
description = "Configuration for the system (default) node pool that hosts critical add-ons."
type = object({
name = optional(string, "system")
vm_size = optional(string, "Standard_D4ds_v5")
orchestrator_version = optional(string)
os_sku = optional(string, "AzureLinux")
zones = optional(list(string), ["1", "2", "3"])
vnet_subnet_id = optional(string)
min_count = optional(number, 2)
max_count = optional(number, 5)
max_pods = optional(number, 110)
max_surge = optional(string, "33%")
only_critical_addons_enabled = optional(bool, true)
node_labels = optional(map(string), {})
})
default = {}
validation {
condition = var.system_node_pool.min_count <= var.system_node_pool.max_count
error_message = "system_node_pool.min_count must be <= max_count."
}
validation {
condition = length(var.system_node_pool.zones) > 0
error_message = "system_node_pool.zones must list at least one availability zone for resilience."
}
}
variable "user_node_pools" {
description = "Map of additional user node pools keyed by a logical name. Supports Spot via priority."
type = map(object({
name = string
vm_size = optional(string, "Standard_D8ds_v5")
orchestrator_version = optional(string)
os_sku = optional(string, "AzureLinux")
zones = optional(list(string), ["1", "2", "3"])
vnet_subnet_id = optional(string)
min_count = optional(number, 1)
max_count = optional(number, 10)
max_pods = optional(number, 110)
max_surge = optional(string, "33%")
node_labels = optional(map(string), {})
node_taints = optional(list(string), [])
priority = optional(string, "Regular")
spot_max_price = optional(number, -1)
}))
default = {}
validation {
condition = alltrue([
for p in values(var.user_node_pools) : contains(["Regular", "Spot"], p.priority)
])
error_message = "Each user_node_pools[*].priority must be Regular or Spot."
}
validation {
condition = alltrue([
for p in values(var.user_node_pools) : p.min_count <= p.max_count
])
error_message = "Each user node pool's min_count must be <= max_count."
}
}
variable "network_profile" {
description = "Cluster networking. Defaults to Azure CNI Overlay with the Cilium dataplane and Cilium network policy."
type = object({
network_plugin = optional(string, "azure")
network_plugin_mode = optional(string, "overlay")
network_policy = optional(string, "cilium")
network_data_plane = optional(string, "cilium")
pod_cidr = optional(string, "10.244.0.0/16")
service_cidr = optional(string, "10.0.0.0/16")
dns_service_ip = optional(string, "10.0.0.10")
outbound_type = optional(string, "loadBalancer")
})
default = {}
validation {
condition = contains(["azure", "kubenet", "none"], var.network_profile.network_plugin)
error_message = "network_profile.network_plugin must be azure, kubenet, or none."
}
}
variable "enable_secrets_store_csi" {
description = "Enable the Azure Key Vault Secrets Store CSI driver add-on with 2m rotation."
type = bool
default = true
}
variable "enable_defender" {
description = "Enable Microsoft Defender for Containers (requires log_analytics_workspace_id)."
type = bool
default = true
}
variable "log_analytics_workspace_id" {
description = "Log Analytics workspace resource ID for Container Insights and Defender. Null disables both."
type = string
default = null
}
variable "maintenance_day_of_week" {
description = "Day of week for the weekly auto-upgrade maintenance window."
type = string
default = "Sunday"
}
variable "maintenance_start_time" {
description = "Start time (HH:MM, UTC) for the maintenance window."
type = string
default = "03:00"
}
variable "tags" {
description = "Additional tags merged onto the cluster and node pools."
type = map(string)
default = {}
}
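The identity_ids description says the list is required when identity_type is UserAssigned, but the two variables are validated independently, so an empty list slips through to apply time. One hedged way to catch this at plan time is a precondition, which can reference several variables and is supported by the Terraform versions this module pins. The terraform_data resource name below is hypothetical and creates nothing in Azure.

```hcl
variable "identity_type" {
  type    = string
  default = "SystemAssigned"
}

variable "identity_ids" {
  type    = list(string)
  default = []
}

# Hypothetical guard: fails the plan when a user-assigned identity is
# requested without at least one identity resource ID.
resource "terraform_data" "identity_guard" {
  lifecycle {
    precondition {
      condition     = var.identity_type != "UserAssigned" || length(var.identity_ids) > 0
      error_message = "identity_ids must contain at least one identity ID when identity_type is \"UserAssigned\"."
    }
  }
}
```

The same precondition block could instead sit inside the lifecycle block of azurerm_kubernetes_cluster.this, which avoids the extra resource at the cost of a longer cluster definition.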
outputs.tf
output "id" {
description = "Resource ID of the AKS cluster."
value = azurerm_kubernetes_cluster.this.id
}
output "name" {
description = "Name of the AKS cluster."
value = azurerm_kubernetes_cluster.this.name
}
output "fqdn" {
description = "Public FQDN of the cluster API server."
value = azurerm_kubernetes_cluster.this.fqdn
}
output "node_resource_group" {
description = "Auto-created MC_* resource group holding the cluster's node VMSS, NICs, and disks."
value = azurerm_kubernetes_cluster.this.node_resource_group
}
output "oidc_issuer_url" {
description = "OIDC issuer URL — use this to federate Workload Identity credentials to pods."
value = azurerm_kubernetes_cluster.this.oidc_issuer_url
}
output "kubelet_identity_object_id" {
description = "Object ID of the kubelet managed identity — grant it AcrPull on your container registries."
value = azurerm_kubernetes_cluster.this.kubelet_identity[0].object_id
}
output "cluster_identity_principal_id" {
description = "Principal ID of the cluster control-plane identity (for Network Contributor on custom subnets)."
value = try(azurerm_kubernetes_cluster.this.identity[0].principal_id, null)
}
output "kube_config_raw" {
description = "Raw kubeconfig for the cluster (Entra-backed). Sensitive: the value is also stored in Terraform state, so restrict access to the state backend."
value = azurerm_kubernetes_cluster.this.kube_config_raw
sensitive = true
}
output "user_node_pool_ids" {
description = "Map of user node pool logical names to their resource IDs."
value = { for k, v in azurerm_kubernetes_cluster_node_pool.user : k => v.id }
}
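The cluster_identity_principal_id output exists for role assignments such as Network Contributor on a custom subnet. A minimal sketch, assuming the module is declared as module "aks" (as in the Quickstart), identity_type is left at SystemAssigned, and the subnet ID arrives through a hypothetical aks_subnet_id variable:

```hcl
# Hypothetical input: resource ID of the subnet passed to
# system_node_pool.vnet_subnet_id.
variable "aks_subnet_id" {
  type        = string
  description = "Resource ID of the subnet that hosts the AKS nodes."
}

# Lets the control-plane identity manage load balancers and NICs
# inside the custom subnet.
resource "azurerm_role_assignment" "aks_subnet_network_contributor" {
  scope                = var.aks_subnet_id
  role_definition_name = "Network Contributor"
  principal_id         = module.aks.cluster_identity_principal_id
}
```

With a user-assigned identity the principal is known before the cluster exists, so the assignment is usually made against that identity directly and ahead of cluster creation rather than through this output.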
How to use it
data "azurerm_client_config" "current" {}
resource "azurerm_resource_group" "platform" {
name = "rg-platform-prod-weu"
location = "westeurope"
}
module "aks_cluster" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-aks?ref=v1.0.0"
cluster_name = "aks-platform-prod-weu"
location = azurerm_resource_group.platform.location
resource_group_name = azurerm_resource_group.platform.name
environment = "prod"
kubernetes_version = "1.31"
sku_tier = "Standard"
# Entra group that gets cluster-admin; local accounts are disabled by the module.
admin_group_object_ids = ["11111111-2222-3333-4444-555555555555"]
tenant_id = data.azurerm_client_config.current.tenant_id
system_node_pool = {
vm_size = "Standard_D4ds_v5"
min_count = 2
max_count = 4
}
user_node_pools = {
apps = {
name = "apps"
vm_size = "Standard_D8ds_v5"
min_count = 3
max_count = 12
}
batch = {
name = "batch"
vm_size = "Standard_D16ds_v5"
min_count = 0
max_count = 20
priority = "Spot"
spot_max_price = -1 # pay up to the on-demand price
node_taints = ["workload=batch:NoSchedule"]
}
}
log_analytics_workspace_id = azurerm_log_analytics_workspace.platform.id
tags = {
cost_center = "platform-eng"
owner = "vinod"
}
}
# Downstream: grant the kubelet identity pull rights on the team's ACR
# so pods can pull images without imagePullSecrets.
resource "azurerm_role_assignment" "acr_pull" {
scope = azurerm_container_registry.platform.id
role_definition_name = "AcrPull"
principal_id = module.aks_cluster.kubelet_identity_object_id
}
# Downstream: federate a Workload Identity credential using the OIDC issuer.
resource "azurerm_federated_identity_credential" "argocd" {
name = "fic-argocd"
resource_group_name = azurerm_resource_group.platform.name
parent_id = azurerm_user_assigned_identity.argocd.id
audience = ["api://AzureADTokenExchange"]
issuer = module.aks_cluster.oidc_issuer_url
subject = "system:serviceaccount:argocd:argocd-application-controller"
}
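The example above references a Log Analytics workspace, a container registry, and a user-assigned identity without declaring them. A minimal sketch of those supporting resources follows; every name, SKU, and retention value is an assumption chosen for illustration, and the resource group is the one declared in the example.

```hcl
# Workspace consumed by Container Insights and Defender.
resource "azurerm_log_analytics_workspace" "platform" {
  name                = "log-platform-prod-weu" # hypothetical name
  location            = azurerm_resource_group.platform.location
  resource_group_name = azurerm_resource_group.platform.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}

# Registry the kubelet identity pulls images from.
resource "azurerm_container_registry" "platform" {
  name                = "acrplatformprodweu" # hypothetical; must be globally unique and alphanumeric
  location            = azurerm_resource_group.platform.location
  resource_group_name = azurerm_resource_group.platform.name
  sku                 = "Standard"
}

# Identity that the Argo CD service account federates to.
resource "azurerm_user_assigned_identity" "argocd" {
  name                = "id-argocd-prod-weu" # hypothetical name
  location            = azurerm_resource_group.platform.location
  resource_group_name = azurerm_resource_group.platform.name
}
```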
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "azurerm"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...azurerm state bucket/container + key per path...
}
}
2. Module config — live/prod/aks/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-aks?ref=v1.0.0"
}
inputs = {
cluster_name = "..."
location = "..."
resource_group_name = "..."
environment = "..."
}
3. Deploy one environment, or roll out all modules together:
# Both commands are run from the repository root.
(cd live/prod/aks && terragrunt apply)     # this module only
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stg / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules. For a single stack, the plain Quickstart above is enough.
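To make the per-environment override concrete, here is a hedged sketch of what live/dev/aks/terragrunt.hcl might look like next to the prod config. The names, region, and pool sizes are illustrative assumptions; only the inputs that differ from the module defaults are set.

```hcl
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-aks?ref=v1.0.0"
}

inputs = {
  # Hypothetical dev values.
  cluster_name        = "aks-platform-dev-weu"
  location            = "westeurope"
  resource_group_name = "rg-platform-dev-weu"
  environment         = "dev"

  # No SLA needed in dev, so skip the paid control-plane tier.
  sku_tier = "Free"

  # Smaller system pool than the module default.
  system_node_pool = {
    vm_size   = "Standard_D2ds_v5"
    min_count = 1
    max_count = 2
  }

  # The system pool is tainted CriticalAddonsOnly by default,
  # so application pods still need one user pool.
  user_node_pools = {
    apps = {
      name      = "apps"
      vm_size   = "Standard_D4ds_v5"
      min_count = 1
      max_count = 3
    }
  }
}
```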
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| `cluster_name` | `string` | n/a | Yes | Cluster name; also the DNS prefix if `dns_prefix` is unset. Validated to AKS naming rules. |
| `location` | `string` | n/a | Yes | Azure region for the control plane. |
| `resource_group_name` | `string` | n/a | Yes | Existing resource group to deploy into. |
| `environment` | `string` | n/a | Yes | One of `dev`, `stg`, `prod`, `sandbox`; applied as a tag. |
| `dns_prefix` | `string` | `null` | No | API server FQDN prefix; defaults to `cluster_name` with underscores replaced. |
| `kubernetes_version` | `string` | `"1.31"` | No | Pinned minor version; patches handled by the upgrade channel. Validated to 1.28+. |
| `sku_tier` | `string` | `"Standard"` | No | Control plane SKU: `Free`, `Standard` (SLA), or `Premium` (LTS). |
| `automatic_upgrade_channel` | `string` | `"patch"` | No | Cluster upgrade channel: `patch`, `rapid`, `stable`, `node-image`, or `null`. |
| `node_os_upgrade_channel` | `string` | `"NodeImage"` | No | Node OS upgrade channel: `NodeImage`, `SecurityPatch`, `Unmanaged`, `None`. |
| `node_resource_group` | `string` | `null` | No | Name of the auto-created MC_* node resource group. |
| `identity_type` | `string` | `"SystemAssigned"` | No | Control-plane identity: `SystemAssigned` or `UserAssigned`. |
| `identity_ids` | `list(string)` | `[]` | No | User-assigned identity IDs; required when `identity_type = "UserAssigned"`. |
| `azure_rbac_enabled` | `bool` | `true` | No | Enable Azure RBAC for Kubernetes authorization. |
| `admin_group_object_ids` | `list(string)` | `[]` | No | Entra group object IDs granted cluster-admin. |
| `tenant_id` | `string` | `null` | No | Entra tenant ID for AAD RBAC; defaults to current tenant. |
| `system_node_pool` | `object(...)` | `{}` | No | System/default pool: SKU, zones, autoscaler bounds, CriticalAddonsOnly taint. |
| `user_node_pools` | `map(object(...))` | `{}` | No | Additional user pools keyed by name; supports Spot, taints, and labels. |
| `network_profile` | `object(...)` | `{}` | No | Networking; defaults to Azure CNI Overlay with the Cilium dataplane. |
| `enable_secrets_store_csi` | `bool` | `true` | No | Enable the Key Vault Secrets Store CSI driver with 2m rotation. |
| `enable_defender` | `bool` | `true` | No | Enable Microsoft Defender for Containers (needs a workspace). |
| `log_analytics_workspace_id` | `string` | `null` | No | Workspace ID for Container Insights and Defender; null disables both. |
| `maintenance_day_of_week` | `string` | `"Sunday"` | No | Day of the weekly auto-upgrade maintenance window. |
| `maintenance_start_time` | `string` | `"03:00"` | No | Start time (HH:MM UTC) of the maintenance window. |
| `tags` | `map(string)` | `{}` | No | Extra tags merged onto the cluster and node pools. |
Outputs
| Name | Description |
|---|---|
| `id` | Resource ID of the AKS cluster. |
| `name` | Name of the AKS cluster. |
| `fqdn` | Public FQDN of the cluster API server. |
| `node_resource_group` | Auto-created MC_* resource group holding node VMSS, NICs, and disks. |
| `oidc_issuer_url` | OIDC issuer URL for federating Workload Identity credentials to pods. |
| `kubelet_identity_object_id` | Object ID of the kubelet identity; grant it AcrPull on your registries. |
| `cluster_identity_principal_id` | Principal ID of the control-plane identity (e.g. for Network Contributor on custom subnets). |
| `kube_config_raw` | Raw Entra-backed kubeconfig (sensitive). |
| `user_node_pool_ids` | Map of user node pool logical names to resource IDs. |
Enterprise scenario
A fintech platform team runs a regulated payments workload across westeurope and northeurope. They instantiate this module twice — once per region — with sku_tier = "Standard" for the SLA, a zonal system pool tainted CriticalAddonsOnly, an autoscaling apps pool for the always-on services, and a Spot batch pool that scales from zero for nightly reconciliation jobs. Because local_account_disabled is forced on and admin_group_object_ids points at their PIM-eligible “AKS Operators” Entra group, every cluster login is auditable and just-in-time, and the exported oidc_issuer_url feeds Workload Identity federation so no Kubernetes secret ever holds a long-lived credential.
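The two-region rollout described above could be sketched with a single module block and for_each instead of two copies. This is an illustration under stated assumptions, not the team's actual configuration: the naming scheme, the pre-existing resource groups, and the aks_operators_group_object_id variable are all hypothetical.

```hcl
# Hypothetical input: object ID of the PIM-eligible "AKS Operators" Entra group.
variable "aks_operators_group_object_id" {
  type = string
}

locals {
  # Short region code => Azure region name.
  regions = {
    weu = "westeurope"
    neu = "northeurope"
  }
}

module "aks" {
  for_each = local.regions
  source   = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-aks?ref=v1.0.0"

  cluster_name        = "aks-payments-prod-${each.key}"
  location            = each.value
  resource_group_name = "rg-payments-prod-${each.key}" # assumed to exist already
  environment         = "prod"
  sku_tier            = "Standard"

  admin_group_object_ids = [var.aks_operators_group_object_id]

  user_node_pools = {
    # Always-on services.
    apps = {
      name = "apps"
    }
    # Nightly reconciliation jobs on Spot capacity, scaling from zero.
    batch = {
      name        = "batch"
      min_count   = 0
      priority    = "Spot"
      node_taints = ["workload=batch:NoSchedule"]
    }
  }
}

# One issuer URL per region, ready for Workload Identity federation.
output "oidc_issuer_urls" {
  value = { for region, cluster in module.aks : region => cluster.oidc_issuer_url }
}
```

Two explicit module blocks work just as well and keep per-region differences easier to read; for_each mainly pays off when the regions are meant to stay identical.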
Best practices
- Disable local accounts and use Entra group RBAC. This module sets `local_account_disabled = true` and `azure_rbac_enabled = true` so the only path in is an Entra identity; pair it with a PIM-eligible group in `admin_group_object_ids` for just-in-time admin and a clean audit trail.
- Use Workload Identity over secrets. The OIDC issuer and Workload Identity are enabled by default; federate pod service accounts to user-assigned identities (via the `oidc_issuer_url` output) instead of mounting client secrets or storage keys.
- Pull from ACR with the kubelet identity. Grant the `kubelet_identity_object_id` the `AcrPull` role rather than creating `imagePullSecrets`; it rotates automatically and never appears in a manifest.
- Right-size with autoscaling and Spot. Keep the system pool small (2–4 nodes) and tainted `CriticalAddonsOnly`, run apps on a separate autoscaling user pool, and put interruptible batch work on a `Spot` pool that scales from zero to cut compute cost dramatically.
- Spread across availability zones. The defaults pin node pools to zones `["1","2","3"]`; never run a production cluster single-zone, and choose `sku_tier = "Standard"` (or `Premium`) so the control-plane SLA actually applies.
- Name and tag predictably. Use the `aks-<workload>-<env>-<region>` convention for `cluster_name`, set a meaningful `node_resource_group`, and rely on the merged `tags` (which include `environment` and `managed_by`) so cost reports and policy assignments resolve cleanly.
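The Workload Identity practice has a Kubernetes-side half as well: the service account named in the federated credential needs to carry the client ID of the identity it federates to. A minimal sketch, assuming the hashicorp/kubernetes provider is already configured against the cluster and that azurerm_user_assigned_identity.argocd is declared elsewhere in the same configuration; the resource name argocd_controller is hypothetical.

```hcl
# Service account matching the federated credential subject
# "system:serviceaccount:argocd:argocd-application-controller".
resource "kubernetes_service_account_v1" "argocd_controller" {
  metadata {
    name      = "argocd-application-controller"
    namespace = "argocd"

    annotations = {
      # Tells the Workload Identity webhook which managed identity to request tokens for.
      "azure.workload.identity/client-id" = azurerm_user_assigned_identity.argocd.client_id
    }
  }
}
```

Pods that use the service account also need the label azure.workload.identity/use set to "true". If a Helm chart creates the service account, the annotation is better set through the chart's values than through a second Terraform-managed object.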