Terraform Module: Azure Cosmos DB — globally-distributed NoSQL with sane defaults

Quick take: provision a production-ready Azure Cosmos DB account with Terraform and azurerm 4.x, covering multi-region writes, consistency tuning, autoscale or serverless capacity, private endpoints, and locked-down keys. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "azurerm" {
  features {}
}

module "cosmos_db" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-cosmos-db?ref=v1.0.0"

  account_name        = "..."  # Globally-unique account name (3-44 chars, lowercase, va…
  resource_group_name = "..."  # Resource group holding the account.
  primary_region      = "..."  # Write region (failover priority 0).
}

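Then run terraform init && terraform apply. Every other input has a default; see Inputs below to override behaviour. Note that the defaults disable public network access and create no private endpoint, so set private_endpoint_subnet_id (or enable public access and set ip_range_filter) before clients can reach the account.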

What this module is

Azure Cosmos DB is Microsoft’s globally-distributed, multi-model database. It gives you single-digit-millisecond reads, turnkey replication across any number of Azure regions, five well-defined consistency levels, and SLA-backed throughput. The catch is that azurerm_cosmosdb_account is one of the most option-heavy resources in the entire provider: consistency_policy, geo_location, capabilities, backup, analytical_storage, multi-region write settings, and network ACLs are nested blocks and arguments that interact with each other in subtle ways. Get one wrong and terraform apply either fails (for example, when multi-region writes are combined with Strong consistency) or, worse, succeeds and quietly provisions more Request Units than you need, billed 24/7.

This module wraps azurerm_cosmosdb_account so the dangerous knobs become validated, defaulted variables. It defaults to the Session consistency level (Cosmos DB's own default and a good fit for most workloads), wires up the geo-replication list as a simple variable, and optionally creates a SQL database and container with autoscale throughput. Critically for security, it disables local key-based auth and public network access by default and creates a private endpoint when you supply a subnet ID. Teams consume one module block instead of hand-assembling forty lines of nested HCL per environment.
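As an illustration of how the validated variables are meant to be used, here is a minimal sketch of a multi-region-write call. The module label, account name, and resource group name are hypothetical placeholders; the input names are the ones defined in variables.tf further down.

```hcl
# Sketch only: names below are illustrative, not part of the module.
module "cosmos_multi_write" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-cosmos-db?ref=v1.0.0"

  account_name        = "example-multiwrite" # hypothetical account name
  resource_group_name = "rg-example"         # hypothetical resource group
  primary_region      = "centralindia"
  replica_regions     = ["southindia"]

  # Every region accepts writes. The module's validation rejects
  # consistency_level = "Strong" in this mode at plan time.
  multi_region_writes = true
  consistency_level   = "Session"
}
```

Setting consistency_level to "Strong" in this block should surface the module's own error message during terraform plan, before any Azure API call is made.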

When to use it

Module structure

terraform-module-azure-cosmos-db/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # cosmosdb_account + optional sql database/container + private endpoint
├── variables.tf     # validated, var-driven inputs
└── outputs.tf       # id, endpoint, connection strings (sensitive)

versions.tf

terraform {
  # 1.9+ is needed because several validations in variables.tf reference other variables.
  required_version = ">= 1.9.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.0"
    }
  }
}

main.tf

locals {
  # Build the geo_location set from the primary region plus any replicas.
  # failover_priority 0 must be the write region; replicas start at 1.
  geo_locations = concat(
    [{
      location          = var.primary_region
      failover_priority = 0
      zone_redundant    = var.zone_redundant
    }],
    [for idx, region in var.replica_regions : {
      location          = region
      failover_priority = idx + 1
      zone_redundant    = var.zone_redundant
    }]
  )

  # This module turns automatic failover on whenever multi-region writes are enabled.
  effective_automatic_failover = var.multi_region_writes ? true : var.automatic_failover
}

resource "azurerm_cosmosdb_account" "this" {
  name                = var.account_name
  resource_group_name = var.resource_group_name
  location            = var.primary_region
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  # --- Throughput model ---
  # Serverless and provisioned/autoscale are mutually exclusive at the account level.
  dynamic "capabilities" {
    for_each = var.enable_serverless ? [1] : []
    content {
      name = "EnableServerless"
    }
  }

  # --- Auth & network hardening ---
  local_authentication_disabled = var.local_authentication_disabled
  public_network_access_enabled = var.public_network_access_enabled
  minimal_tls_version           = "Tls12"
  ip_range_filter               = var.ip_range_filter

  # --- Multi-region writes & failover ---
  multiple_write_locations_enabled = var.multi_region_writes
  automatic_failover_enabled       = local.effective_automatic_failover

  consistency_policy {
    consistency_level       = var.consistency_level
    max_interval_in_seconds = var.consistency_level == "BoundedStaleness" ? var.bounded_staleness_max_interval : null
    max_staleness_prefix    = var.consistency_level == "BoundedStaleness" ? var.bounded_staleness_max_prefix : null
  }

  dynamic "geo_location" {
    for_each = local.geo_locations
    content {
      location          = geo_location.value.location
      failover_priority = geo_location.value.failover_priority
      zone_redundant    = geo_location.value.zone_redundant
    }
  }

  backup {
    type                = var.backup_type
    interval_in_minutes = var.backup_type == "Periodic" ? var.backup_interval_in_minutes : null
    retention_in_hours  = var.backup_type == "Periodic" ? var.backup_retention_in_hours : null
    storage_redundancy  = var.backup_type == "Periodic" ? var.backup_storage_redundancy : null
  }

  tags = var.tags
}

# --- Optional SQL (Core) API database ---
resource "azurerm_cosmosdb_sql_database" "this" {
  count               = var.create_sql_database ? 1 : 0
  name                = var.sql_database_name
  resource_group_name = azurerm_cosmosdb_account.this.resource_group_name
  account_name        = azurerm_cosmosdb_account.this.name

  # Throughput is set on the DB only in provisioned mode; serverless rejects it.
  dynamic "autoscale_settings" {
    for_each = var.enable_serverless ? [] : [1]
    content {
      max_throughput = var.sql_database_max_throughput
    }
  }
}

# --- Optional SQL container ---
resource "azurerm_cosmosdb_sql_container" "this" {
  count                 = var.create_sql_database && var.create_sql_container ? 1 : 0
  name                  = var.sql_container_name
  resource_group_name   = azurerm_cosmosdb_account.this.resource_group_name
  account_name          = azurerm_cosmosdb_account.this.name
  database_name         = azurerm_cosmosdb_sql_database.this[0].name
  partition_key_paths   = [var.sql_container_partition_key_path]
  partition_key_version = 2

  dynamic "autoscale_settings" {
    for_each = var.enable_serverless ? [] : [1]
    content {
      max_throughput = var.sql_container_max_throughput
    }
  }

  indexing_policy {
    indexing_mode = "consistent"

    included_path {
      path = "/*"
    }
  }
}

# --- Optional private endpoint ---
resource "azurerm_private_endpoint" "this" {
  count               = var.private_endpoint_subnet_id == null ? 0 : 1
  name                = "pe-${var.account_name}"
  location            = var.primary_region
  resource_group_name = var.resource_group_name
  subnet_id           = var.private_endpoint_subnet_id

  private_service_connection {
    name                           = "psc-${var.account_name}"
    private_connection_resource_id = azurerm_cosmosdb_account.this.id
    is_manual_connection           = false
    subresource_names              = ["Sql"]
  }

  tags = var.tags
}
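A private endpoint on its own does not give clients a name that resolves to the private IP; that normally comes from a privatelink.documents.azure.com private DNS zone. The module as listed does not manage DNS, so the following is only a sketch of how the private endpoint resource could be extended. The private_dns_zone_ids variable is a hypothetical addition, not an existing input, and this variant would replace the resource above rather than sit beside it.

```hcl
# Hypothetical input: not part of the module's variables.tf as published.
variable "private_dns_zone_ids" {
  type        = list(string)
  description = "IDs of privatelink.documents.azure.com private DNS zones to register the endpoint in."
  default     = []
}

resource "azurerm_private_endpoint" "this" {
  count               = var.private_endpoint_subnet_id == null ? 0 : 1
  name                = "pe-${var.account_name}"
  location            = var.primary_region
  resource_group_name = var.resource_group_name
  subnet_id           = var.private_endpoint_subnet_id

  private_service_connection {
    name                           = "psc-${var.account_name}"
    private_connection_resource_id = azurerm_cosmosdb_account.this.id
    is_manual_connection           = false
    subresource_names              = ["Sql"]
  }

  # Only emitted when the caller passes at least one zone ID.
  dynamic "private_dns_zone_group" {
    for_each = length(var.private_dns_zone_ids) > 0 ? [1] : []
    content {
      name                 = "default"
      private_dns_zone_ids = var.private_dns_zone_ids
    }
  }

  tags = var.tags
}
```

Keeping the zone IDs as an input leaves zone ownership with whoever runs the hub network, which is a common split in landing-zone setups.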

variables.tf

variable "account_name" {
  type        = string
  description = "Globally-unique Cosmos DB account name (3-44 chars, lowercase letters, numbers, hyphens)."

  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9-]{1,42}[a-z0-9]$", var.account_name))
    error_message = "account_name must be 3-44 chars, lowercase alphanumerics/hyphens, and not start or end with a hyphen."
  }
}

variable "resource_group_name" {
  type        = string
  description = "Name of the resource group that will hold the Cosmos DB account."
}

variable "primary_region" {
  type        = string
  description = "Azure region for the write region (failover priority 0), e.g. 'centralindia'."
}

variable "replica_regions" {
  type        = list(string)
  description = "Ordered list of read replica regions; priority increments from 1 in list order."
  default     = []

  validation {
    condition     = length(var.replica_regions) == length(distinct(var.replica_regions)) && !contains(var.replica_regions, var.primary_region)
    error_message = "replica_regions must be unique and must not include primary_region."
  }
}

variable "consistency_level" {
  type        = string
  description = "Default consistency level for the account."
  default     = "Session"

  validation {
    condition     = contains(["Strong", "BoundedStaleness", "Session", "ConsistentPrefix", "Eventual"], var.consistency_level)
    error_message = "consistency_level must be one of: Strong, BoundedStaleness, Session, ConsistentPrefix, Eventual."
  }
}

variable "bounded_staleness_max_interval" {
  type        = number
  description = "Max lag in seconds for BoundedStaleness (5-86400). Ignored for other levels."
  default     = 300

  validation {
    condition     = var.bounded_staleness_max_interval >= 5 && var.bounded_staleness_max_interval <= 86400
    error_message = "bounded_staleness_max_interval must be between 5 and 86400 seconds."
  }
}

variable "bounded_staleness_max_prefix" {
  type        = number
  description = "Max number of stale requests for BoundedStaleness (>= 10). Ignored for other levels."
  default     = 100000

  validation {
    condition     = var.bounded_staleness_max_prefix >= 10
    error_message = "bounded_staleness_max_prefix must be at least 10."
  }
}

variable "multi_region_writes" {
  type        = bool
  description = "Enable multi-region (multi-master) writes. Forces automatic_failover on and is incompatible with Strong consistency."
  default     = false

  validation {
    condition     = !(var.multi_region_writes && var.consistency_level == "Strong")
    error_message = "multi_region_writes cannot be used with Strong consistency; choose BoundedStaleness or weaker."
  }
}

variable "automatic_failover" {
  type        = bool
  description = "Enable service-managed failover to a read region when the write region is unavailable."
  default     = true
}

variable "zone_redundant" {
  type        = bool
  description = "Spread each region's replicas across availability zones (region must support AZs)."
  default     = false
}

variable "enable_serverless" {
  type        = bool
  description = "Use the serverless capacity mode instead of provisioned/autoscale throughput. Single-region only."
  default     = false

  validation {
    condition     = !(var.enable_serverless && length(var.replica_regions) > 0)
    error_message = "Serverless accounts cannot have replica_regions; serverless is single-region only."
  }
}

variable "local_authentication_disabled" {
  type        = bool
  description = "Disable primary/secondary key auth and require Entra ID (AAD) RBAC."
  default     = true
}

variable "public_network_access_enabled" {
  type        = bool
  description = "Allow access from public networks. Set false when using a private endpoint."
  default     = false
}

variable "ip_range_filter" {
  type        = list(string)
  description = "Allowed source IPs/CIDRs when public access is on (e.g. Azure Portal/Functions ranges)."
  default     = []
}

variable "backup_type" {
  type        = string
  description = "Backup mode: 'Periodic' or 'Continuous'."
  default     = "Continuous"

  validation {
    condition     = contains(["Periodic", "Continuous"], var.backup_type)
    error_message = "backup_type must be 'Periodic' or 'Continuous'."
  }
}

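variable "backup_interval_in_minutes" {
  type        = number
  description = "Periodic backup interval in minutes (60-1440). Ignored for Continuous."
  default     = 240

  validation {
    condition     = var.backup_interval_in_minutes >= 60 && var.backup_interval_in_minutes <= 1440
    error_message = "backup_interval_in_minutes must be between 60 and 1440."
  }
}

variable "backup_retention_in_hours" {
  type        = number
  description = "Periodic backup retention in hours (8-720). Ignored for Continuous."
  default     = 168

  validation {
    condition     = var.backup_retention_in_hours >= 8 && var.backup_retention_in_hours <= 720
    error_message = "backup_retention_in_hours must be between 8 and 720."
  }
}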

variable "backup_storage_redundancy" {
  type        = string
  description = "Periodic backup storage redundancy: Geo, Local, or Zone. Ignored for Continuous."
  default     = "Geo"

  validation {
    condition     = contains(["Geo", "Local", "Zone"], var.backup_storage_redundancy)
    error_message = "backup_storage_redundancy must be Geo, Local, or Zone."
  }
}

variable "create_sql_database" {
  type        = bool
  description = "Provision a SQL (Core) API database in the account."
  default     = false
}

variable "sql_database_name" {
  type        = string
  description = "Name of the SQL database to create when create_sql_database is true."
  default     = "appdb"
}

variable "sql_database_max_throughput" {
  type        = number
  description = "Autoscale max RU/s for the database (provisioned mode only). Must be a multiple of 1000, >= 1000."
  default     = 4000

  validation {
    condition     = var.sql_database_max_throughput >= 1000 && var.sql_database_max_throughput % 1000 == 0
    error_message = "sql_database_max_throughput must be >= 1000 and a multiple of 1000."
  }
}

variable "create_sql_container" {
  type        = bool
  description = "Provision a SQL container in the database (requires create_sql_database)."
  default     = false
}

variable "sql_container_name" {
  type        = string
  description = "Name of the SQL container to create."
  default     = "items"
}

variable "sql_container_partition_key_path" {
  type        = string
  description = "Partition key path for the container, e.g. '/tenantId'."
  default     = "/id"

  validation {
    condition     = startswith(var.sql_container_partition_key_path, "/")
    error_message = "sql_container_partition_key_path must start with '/'."
  }
}

variable "sql_container_max_throughput" {
  type        = number
  description = "Autoscale max RU/s for the container (provisioned mode only). Multiple of 1000, >= 1000."
  default     = 4000

  validation {
    condition     = var.sql_container_max_throughput >= 1000 && var.sql_container_max_throughput % 1000 == 0
    error_message = "sql_container_max_throughput must be >= 1000 and a multiple of 1000."
  }
}

variable "private_endpoint_subnet_id" {
  type        = string
  description = "Subnet resource ID for a private endpoint. Null disables the private endpoint."
  default     = null
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to the account and private endpoint."
  default     = {}
}
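The validation blocks for replica_regions, multi_region_writes, and enable_serverless each reference a second variable, which Terraform only allows from version 1.9 onward. Where an older Terraform release has to be supported, one alternative is to move those cross-variable checks into precondition blocks on the account resource (available since Terraform 1.2). The fragment below is a sketch of that approach, not the module's published code; the rest of the resource body is assumed to stay as shown in main.tf.

```hcl
resource "azurerm_cosmosdb_account" "this" {
  # ... all arguments and nested blocks exactly as in main.tf ...

  lifecycle {
    # Same rule as the multi_region_writes validation.
    precondition {
      condition     = !(var.multi_region_writes && var.consistency_level == "Strong")
      error_message = "multi_region_writes cannot be used with Strong consistency; choose BoundedStaleness or weaker."
    }

    # Same rule as the enable_serverless validation.
    precondition {
      condition     = !(var.enable_serverless && length(var.replica_regions) > 0)
      error_message = "Serverless accounts cannot have replica_regions; serverless is single-region only."
    }

    # Same rule as the cross-variable half of the replica_regions validation.
    precondition {
      condition     = !contains(var.replica_regions, var.primary_region)
      error_message = "replica_regions must not include primary_region."
    }
  }
}
```

Preconditions are evaluated during plan, so the failure still shows up before anything is created; the trade-off is that the error is attached to the resource rather than to the offending variable.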

outputs.tf

output "id" {
  description = "Resource ID of the Cosmos DB account."
  value       = azurerm_cosmosdb_account.this.id
}

output "name" {
  description = "Name of the Cosmos DB account."
  value       = azurerm_cosmosdb_account.this.name
}

output "endpoint" {
  description = "Document endpoint URI used by SDK clients."
  value       = azurerm_cosmosdb_account.this.endpoint
}

output "read_endpoints" {
  description = "Ordered list of read endpoints across all configured regions."
  value       = azurerm_cosmosdb_account.this.read_endpoints
}

output "write_endpoints" {
  description = "Ordered list of write endpoints (multiple when multi-region writes are enabled)."
  value       = azurerm_cosmosdb_account.this.write_endpoints
}

output "primary_sql_connection_string" {
  description = "Primary SQL connection string. Empty when local auth is disabled."
  value       = try(azurerm_cosmosdb_account.this.primary_sql_connection_string, null)
  sensitive   = true
}

output "sql_database_name" {
  description = "Name of the created SQL database, or null if none was created."
  value       = var.create_sql_database ? azurerm_cosmosdb_sql_database.this[0].name : null
}

output "private_endpoint_ip" {
  description = "Private IP allocated to the private endpoint, or null when none is configured."
  value       = var.private_endpoint_subnet_id == null ? null : azurerm_private_endpoint.this[0].private_service_connection[0].private_ip_address
}

How to use it

module "cosmos_db" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-cosmos-db?ref=v1.0.0"

  account_name        = "kv-orders-prod"
  resource_group_name = azurerm_resource_group.data.name
  primary_region      = "centralindia"
  replica_regions     = ["southindia"]

  # Geo-distributed reads, single write region, Session consistency.
  consistency_level   = "Session"
  multi_region_writes = false
  automatic_failover  = true
  zone_redundant      = true

  # Provisioned autoscale: a SQL database + tenant-partitioned container.
  create_sql_database              = true
  sql_database_name                = "orders"
  sql_database_max_throughput      = 10000
  create_sql_container             = true
  sql_container_name               = "order-events"
  sql_container_partition_key_path = "/tenantId"
  sql_container_max_throughput     = 10000

  # Security baseline: keyless + private only.
  local_authentication_disabled = true
  public_network_access_enabled = false
  private_endpoint_subnet_id    = azurerm_subnet.data.id

  backup_type = "Continuous"

  tags = {
    env   = "prod"
    owner = "platform-data"
  }
}

# Downstream: hand the document endpoint to an App Service so the app
# authenticates with its managed identity (no keys in app settings).
resource "azurerm_linux_web_app" "orders_api" {
  name                = "kv-orders-api-prod"
  resource_group_name = azurerm_resource_group.data.name
  location            = "centralindia"
  service_plan_id     = azurerm_service_plan.api.id

  identity {
    type = "SystemAssigned"
  }

  app_settings = {
    "COSMOS_ENDPOINT" = module.cosmos_db.endpoint
    "COSMOS_DATABASE" = module.cosmos_db.sql_database_name
  }

  site_config {}
}
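With key auth disabled, the web app's managed identity also needs a Cosmos DB data-plane role before the SDK can read or write documents. The example above does not show that step, so here is a minimal sketch using azurerm_cosmosdb_sql_role_assignment and the built-in Data Contributor role (definition ID ending in 0002). The resource label orders_api is illustrative, and scoping the assignment to the whole account is an assumption; a narrower database or container scope may suit production better.

```hcl
# Grants the web app's system-assigned identity read/write access to data
# in the account. Scope and role choice are assumptions for illustration.
resource "azurerm_cosmosdb_sql_role_assignment" "orders_api" {
  resource_group_name = azurerm_resource_group.data.name
  account_name        = module.cosmos_db.name

  # Built-in "Cosmos DB Built-in Data Contributor" role definition.
  role_definition_id = "${module.cosmos_db.id}/sqlRoleDefinitions/00000000-0000-0000-0000-000000000002"

  principal_id = azurerm_linux_web_app.orders_api.identity[0].principal_id
  scope        = module.cosmos_db.id
}
```

For read-only consumers, the built-in Data Reader role (definition ID ending in 0001) can be substituted in the same way.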

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config, live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "azurerm"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...azurerm state storage account/container + key per path...
  }
}

2. Module config, live/prod/cosmos_db/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-cosmos-db?ref=v1.0.0"
}

inputs = {
  account_name = "..."
  resource_group_name = "..."
  primary_region = "..."
}

3. Deploy one environment, or roll out all modules together:

# Run either command from the repository root.
cd live/prod/cosmos_db && terragrunt apply   # this module only
cd live/prod && terragrunt run-all apply     # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
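To make the per-environment override concrete, a dev environment might reuse the same module source with cheaper settings. The sketch below assumes a hypothetical live/dev/cosmos_db/terragrunt.hcl path and illustrative values; the "..." placeholders are left for you to fill in, as in the prod example.

```hcl
# live/dev/cosmos_db/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-azure-cosmos-db?ref=v1.0.0"
}

inputs = {
  account_name        = "..."
  resource_group_name = "..."
  primary_region      = "..."

  # Dev-only differences: single region, smallest autoscale ceiling,
  # periodic backups instead of continuous.
  replica_regions             = []
  create_sql_database         = true
  sql_database_max_throughput = 1000
  backup_type                 = "Periodic"
}
```

Only the inputs map differs from prod; the module source and root include stay identical, which is the point of the layout.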

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| account_name | string | - | Yes | Globally-unique account name (3-44 chars, lowercase, validated). |
| resource_group_name | string | - | Yes | Resource group holding the account. |
| primary_region | string | - | Yes | Write region (failover priority 0). |
| replica_regions | list(string) | [] | No | Ordered read replica regions; must be unique and exclude the primary. |
| consistency_level | string | "Session" | No | One of Strong, BoundedStaleness, Session, ConsistentPrefix, Eventual. |
| bounded_staleness_max_interval | number | 300 | No | Max lag seconds (5-86400) for BoundedStaleness only. |
| bounded_staleness_max_prefix | number | 100000 | No | Max stale requests (>= 10) for BoundedStaleness only. |
| multi_region_writes | bool | false | No | Multi-master writes; forces failover on, blocks Strong consistency. |
| automatic_failover | bool | true | No | Service-managed failover to a read region. |
| zone_redundant | bool | false | No | Spread replicas across availability zones per region. |
| enable_serverless | bool | false | No | Serverless capacity mode; single-region only. |
| local_authentication_disabled | bool | true | No | Disable key auth and require Entra ID RBAC. |
| public_network_access_enabled | bool | false | No | Allow public network access; set false with private endpoints. |
| ip_range_filter | list(string) | [] | No | Allowed source IPs/CIDRs when public access is on. |
| backup_type | string | "Continuous" | No | Periodic or Continuous backup mode. |
| backup_interval_in_minutes | number | 240 | No | Periodic backup interval (60-1440). |
| backup_retention_in_hours | number | 168 | No | Periodic backup retention (8-720). |
| backup_storage_redundancy | string | "Geo" | No | Periodic backup redundancy: Geo, Local, or Zone. |
| create_sql_database | bool | false | No | Provision a SQL (Core) API database. |
| sql_database_name | string | "appdb" | No | Database name when created. |
| sql_database_max_throughput | number | 4000 | No | Autoscale max RU/s for the DB (provisioned, multiple of 1000). |
| create_sql_container | bool | false | No | Provision a container (requires the database). |
| sql_container_name | string | "items" | No | Container name when created. |
| sql_container_partition_key_path | string | "/id" | No | Partition key path; must start with /. |
| sql_container_max_throughput | number | 4000 | No | Autoscale max RU/s for the container (provisioned, multiple of 1000). |
| private_endpoint_subnet_id | string | null | No | Subnet ID for a private endpoint; null disables it. |
| tags | map(string) | {} | No | Tags applied to the account and private endpoint. |

Outputs

| Name | Description |
| --- | --- |
| id | Resource ID of the Cosmos DB account. |
| name | Name of the Cosmos DB account. |
| endpoint | Document endpoint URI for SDK clients. |
| read_endpoints | Ordered list of read endpoints across all regions. |
| write_endpoints | Ordered list of write endpoints (multiple under multi-region writes). |
| primary_sql_connection_string | Primary SQL connection string (sensitive; empty when local auth is disabled). |
| sql_database_name | Name of the created SQL database, or null. |
| private_endpoint_ip | Private IP of the private endpoint, or null when none is configured. |

Enterprise scenario

A multi-tenant SaaS order-management platform runs its API in Central India and serves a growing customer base in South India. The platform team consumes this module to stand up a kv-orders-prod account with a southindia read replica, zone-redundant replicas, and Session consistency, exposing a /tenantId-partitioned container at 10,000 autoscale RU/s so per-tenant load spikes are absorbed without manual scaling. Because local_authentication_disabled defaults to true, public network access defaults to off, and the team passes a private_endpoint_subnet_id, the order API reaches Cosmos over the VNet using its system-assigned managed identity. No connection strings ever land in app settings or pipeline variables, satisfying the company’s “no long-lived secrets” control.

Best practices
