
Terraform Module: AWS X-Ray — Codify Trace Sampling, Groups, and KMS Encryption as One Unit

Quick take — A reusable hashicorp/aws ~> 5.0 Terraform module for AWS X-Ray: version-controlled sampling rules with priority/reservoir/rate, filter-expression trace groups, and account-wide KMS encryption config. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration: drop this in a .tf file and fill in the "..." placeholders. The numeric values are illustrative starting points, and each required field is commented:

provider "aws" {
  region = "us-east-1"
}

module "xray" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-xray?ref=v1.0.0"

  sampling_rules = [
    {
      rule_name      = "..." # Unique name of the sampling rule.
      priority       = 9000  # Evaluation order, 1–9999; lower runs first. Must be unique.
      reservoir_size = 1     # Guaranteed traces/sec captured before `fixed_rate` applies.
      fixed_rate     = 0.05  # Sampling rate (0.0–1.0) for matching requests beyond the reservoir.
    },
  ]

  groups = [
    {
      group_name        = "..." # Unique name of the X-Ray group.
      filter_expression = "..." # X-Ray filter expression scoping the group (e.g. `http.status >= 500`).
    },
  ]
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

AWS X-Ray is the distributed-tracing service that stitches together the path a single request takes across your microservices, Lambda functions, queues, and databases — turning a vague “checkout is slow” complaint into a flame graph that says “the 800ms is a synchronous DynamoDB call inside the payments service.” The traces themselves are emitted by the X-Ray SDK or the OpenTelemetry/ADOT collector running next to your code; what you actually configure on the AWS side is the control plane around them: how much of your traffic gets traced (sampling rules), how you slice the resulting traces for dashboards and insights (groups), and whether trace data is encrypted with your own key (encryption config).

That control plane is small but easy to get wrong by hand. Sampling rules are priority-ordered and evaluated lowest number first; a single mis-prioritised rule with reservoir_size = 0 and fixed_rate = 0.0 can silently stop tracing an entire service, and the default 1-request-per-second reservoir is rarely what a high-traffic API actually wants. Groups need valid filter expressions (service("payments") AND http.status >= 500) or they create empty, useless views. And the X-Ray encryption setting is account- and region-wide singleton state: exactly the kind of global config that should live in version control with a clear owner, not be flipped in the console by whoever logged in last.
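To make the shadowing hazard concrete, here is a minimal sketch that uses this module's sampling_rules input (defined in variables.tf further down). The rule names and numbers are hypothetical; the point is only that the rule with the lower priority number is evaluated first and decides the outcome for every request it matches.

```hcl
module "xray" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-xray?ref=v1.0.0"

  sampling_rules = [
    {
      # Hypothetical mistake: priority 10 is evaluated before priority 100,
      # and a 0 reservoir with a 0.0 rate samples none of the matching requests.
      rule_name      = "payments-mute"
      priority       = 10
      reservoir_size = 0
      fixed_rate     = 0.0
      service_name   = "payments"
    },
    {
      # Intended rule: never reached for "payments", because the first
      # matching rule in priority order has already made the decision.
      rule_name      = "payments-sampled"
      priority       = 100
      reservoir_size = 5
      fixed_rate     = 0.10
      service_name   = "payments"
    },
  ]
}
```

Both rules pass the module's validations, which is why this kind of mistake is best caught in code review of the priority numbers rather than at plan time.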

This module wraps aws_xray_sampling_rule together with aws_xray_group and the singleton aws_xray_encryption_config into one var-driven unit. You declare your sampling matrix and your trace groups as data, point at a KMS key, and get back consistent, reviewable tracing configuration with sane production defaults — instead of bespoke, drift-prone X-Ray HCL copied between every service repo.

When to use it

If you only need the default sampling behaviour (a reservoir of 1 request per second plus 5% of any additional requests) and no grouping, you may not need any X-Ray resources at all, because the service ships a built-in Default rule. Reach for this module the moment you want anything beyond that default.

Module structure

terraform-module-aws-xray/
├── versions.tf
├── main.tf
├── variables.tf
└── outputs.tf

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  tags = merge(var.tags, {
    ManagedBy = "terraform"
    Module    = "terraform-module-aws-xray"
  })
}

# ---------------------------------------------------------------------------
# Sampling rules
# Priority-ordered (lower number = evaluated first). Each rule reserves a
# guaranteed number of traces/sec (reservoir_size), then samples the remainder
# at fixed_rate. "*" wildcards match any value for that dimension.
# ---------------------------------------------------------------------------
resource "aws_xray_sampling_rule" "this" {
  for_each = { for r in var.sampling_rules : r.rule_name => r }

  rule_name = each.value.rule_name
  priority  = each.value.priority
  version   = 1

  # Sampling knobs.
  reservoir_size = each.value.reservoir_size
  fixed_rate     = each.value.fixed_rate

  # Matching dimensions — which requests this rule applies to.
  service_name = each.value.service_name
  service_type = each.value.service_type
  host         = each.value.host
  http_method  = each.value.http_method
  url_path     = each.value.url_path
  resource_arn = each.value.resource_arn

  attributes = each.value.attributes

  tags = local.tags
}

# ---------------------------------------------------------------------------
# Trace groups
# A group is a saved filter expression that drives the Service Map view,
# group-scoped CloudWatch metrics, and (optionally) X-Ray Insights.
# ---------------------------------------------------------------------------
resource "aws_xray_group" "this" {
  for_each = { for g in var.groups : g.group_name => g }

  group_name        = each.value.group_name
  filter_expression = each.value.filter_expression

  insights_configuration {
    insights_enabled      = each.value.insights_enabled
    notifications_enabled = each.value.notifications_enabled
  }

  tags = local.tags
}

# ---------------------------------------------------------------------------
# Encryption configuration (account + region singleton)
# Only one of these exists per account/region. Manage it here so the choice
# of default encryption (NONE) vs a customer-managed KMS key is version-controlled.
# ---------------------------------------------------------------------------
resource "aws_xray_encryption_config" "this" {
  count = var.manage_encryption_config ? 1 : 0

  type   = var.encryption_kms_key_id != null ? "KMS" : "NONE"
  key_id = var.encryption_kms_key_id
}

variables.tf

variable "sampling_rules" {
  description = <<-EOT
    List of X-Ray sampling rules. Rules are evaluated by ascending priority
    (1 first). Each rule guarantees `reservoir_size` traces/sec, then samples
    additional matching requests at `fixed_rate` (0.0-1.0). Use "*" to match any
    value for a dimension. Define a low-priority catch-all so every request is
    covered by some rule.
  EOT
  type = list(object({
    rule_name      = string
    priority       = number
    reservoir_size = number
    fixed_rate     = number
    service_name   = optional(string, "*")
    service_type   = optional(string, "*")
    host           = optional(string, "*")
    http_method    = optional(string, "*")
    url_path       = optional(string, "*")
    resource_arn   = optional(string, "*")
    attributes     = optional(map(string), null)
  }))
  default = []

  validation {
    condition     = alltrue([for r in var.sampling_rules : r.fixed_rate >= 0 && r.fixed_rate <= 1])
    error_message = "fixed_rate must be between 0.0 and 1.0 for every sampling rule."
  }

  validation {
    condition     = alltrue([for r in var.sampling_rules : r.priority >= 1 && r.priority <= 9999])
    error_message = "priority must be between 1 and 9999 for every sampling rule."
  }

  validation {
    condition     = alltrue([for r in var.sampling_rules : r.reservoir_size >= 0 && floor(r.reservoir_size) == r.reservoir_size])
    error_message = "reservoir_size must be a non-negative integer for every sampling rule."
  }

  validation {
    condition     = length(distinct([for r in var.sampling_rules : r.rule_name])) == length(var.sampling_rules)
    error_message = "sampling rule_name values must be unique."
  }

  validation {
    condition     = length(distinct([for r in var.sampling_rules : r.priority])) == length(var.sampling_rules)
    error_message = "sampling rule priorities must be unique so evaluation order is deterministic."
  }
}

variable "groups" {
  description = <<-EOT
    List of X-Ray groups. Each is a saved filter expression that scopes the
    Service Map, group CloudWatch metrics, and optional Insights. Example
    filter: service("payments") AND http.status >= 500
  EOT
  type = list(object({
    group_name            = string
    filter_expression     = string
    insights_enabled      = optional(bool, false)
    notifications_enabled = optional(bool, false)
  }))
  default = []

  validation {
    condition     = alltrue([for g in var.groups : length(g.filter_expression) > 0])
    error_message = "filter_expression must be a non-empty X-Ray filter expression for every group."
  }

  validation {
    condition     = length(distinct([for g in var.groups : g.group_name])) == length(var.groups)
    error_message = "group_name values must be unique."
  }

  validation {
    # Notifications require Insights to be enabled on the group.
    condition     = alltrue([for g in var.groups : g.notifications_enabled == false || g.insights_enabled == true])
    error_message = "notifications_enabled can only be true when insights_enabled is also true."
  }
}

variable "manage_encryption_config" {
  description = <<-EOT
    Whether this module manages the account+region X-Ray encryption singleton.
    Set true in exactly ONE Terraform configuration per account/region to avoid
    two states fighting over the same global setting.
  EOT
  type    = bool
  default = false
}

variable "encryption_kms_key_id" {
  description = <<-EOT
    KMS key ARN/ID used to encrypt X-Ray trace data when manage_encryption_config
    is true. The key policy must grant the X-Ray service principal
    kms:GenerateDataKey* and kms:Decrypt. Null = AWS-owned key (type NONE).
  EOT
  type    = string
  default = null

  validation {
    condition     = var.encryption_kms_key_id == null || can(regex("^(arn:aws[a-z-]*:kms:|[0-9a-f-]{36}$|alias/)", var.encryption_kms_key_id))
    error_message = "encryption_kms_key_id must be a KMS key ARN, key UUID, alias, or null."
  }
}

variable "tags" {
  description = "Tags applied to sampling rules and groups created by the module."
  type        = map(string)
  default     = {}
}

outputs.tf

output "sampling_rule_arns" {
  description = "Map of rule_name => sampling rule ARN."
  value       = { for k, r in aws_xray_sampling_rule.this : k => r.arn }
}

output "sampling_rule_names" {
  description = "List of sampling rule names managed by this module."
  value       = [for r in aws_xray_sampling_rule.this : r.rule_name]
}

output "group_arns" {
  description = "Map of group_name => X-Ray group ARN."
  value       = { for k, g in aws_xray_group.this : k => g.arn }
}

output "group_names" {
  description = "List of X-Ray group names managed by this module."
  value       = [for g in aws_xray_group.this : g.group_name]
}

output "encryption_type" {
  description = "Active X-Ray encryption type (KMS or NONE), if managed by this module."
  value       = var.manage_encryption_config ? aws_xray_encryption_config.this[0].type : null
}

output "encryption_key_id" {
  description = "KMS key used for X-Ray trace encryption, if managed and set to KMS."
  value       = var.manage_encryption_config ? aws_xray_encryption_config.this[0].key_id : null
}
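The validation rules in variables.tf can be exercised without touching AWS. The following is a sketch using Terraform's native test framework; it assumes Terraform 1.7 or later (newer than the module's own >= 1.5.0 floor, because mock_provider arrived in 1.7), and the file name and run labels are hypothetical.

```hcl
# tests/validation.tftest.hcl (hypothetical file name)

# A mocked provider means the plan needs no AWS credentials.
mock_provider "aws" {}

run "rejects_duplicate_priorities" {
  command = plan

  variables {
    sampling_rules = [
      { rule_name = "a", priority = 100, reservoir_size = 1, fixed_rate = 0.1 },
      { rule_name = "b", priority = 100, reservoir_size = 1, fixed_rate = 0.1 },
    ]
  }

  # The unique-priority validation on var.sampling_rules should fail.
  expect_failures = [var.sampling_rules]
}

run "rejects_notifications_without_insights" {
  command = plan

  variables {
    groups = [
      {
        group_name            = "faults"
        filter_expression     = "fault = true"
        insights_enabled      = false
        notifications_enabled = true
      },
    ]
  }

  # Notifications without Insights should trip the validation on var.groups.
  expect_failures = [var.groups]
}

run "accepts_minimal_rule" {
  command = plan

  variables {
    sampling_rules = [
      { rule_name = "catch-all", priority = 9000, reservoir_size = 1, fixed_rate = 0.05 },
    ]
  }

  assert {
    condition     = length(aws_xray_sampling_rule.this) == 1
    error_message = "Expected exactly one sampling rule to be planned."
  }
}
```

Running terraform test from the module directory would pick this file up from a tests/ folder.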

How to use it

This example traces a payments platform: every request to the payments service is captured at top priority so that none of its error traces are dropped (X-Ray applies sampling rules when a request starts, so a rule cannot match on the outcome), a separate rule samples POST /checkout traffic at 5%, and a low-priority catch-all keeps a thin baseline across everything else. Two groups slice the Service Map by faults and by the payments service (with Insights on), and trace data is encrypted with a customer-managed KMS key. A downstream CloudWatch alarm consumes the module's group ARN to page on a fault spike.

module "x_ray" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-xray?ref=v1.0.0"

  sampling_rules = [
    {
      rule_name      = "payments-errors-always"
      priority       = 100
      reservoir_size = 1
      fixed_rate     = 1.0 # capture 100% of matching traffic
      service_name   = "payments"
      http_method    = "*"
      url_path       = "*"
    },
    {
      rule_name      = "checkout-baseline"
      priority       = 200
      reservoir_size = 2
      fixed_rate     = 0.05 # 5% of healthy checkout traffic
      service_name   = "*"
      http_method    = "POST"
      url_path       = "/checkout"
    },
    {
      rule_name      = "catch-all"
      priority       = 9000
      reservoir_size = 1
      fixed_rate     = 0.01 # thin 1% baseline everywhere else
    },
  ]

  groups = [
    {
      group_name        = "faults-5xx"
      filter_expression = "fault = true OR http.status >= 500"
    },
    {
      group_name            = "payments-service"
      filter_expression     = "service(\"payments\")"
      insights_enabled      = true
      notifications_enabled = true
    },
  ]

  manage_encryption_config = true
  encryption_kms_key_id    = aws_kms_key.xray.arn

  tags = {
    Environment = "prod"
    Team        = "payments-platform"
  }
}

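# Downstream: alarm on a spike in traces matching the "faults-5xx" group.
# X-Ray publishes a per-group trace count to CloudWatch (ApproximateTraceCount,
# keyed by group name). The group ARN from the module goes in the description,
# which also orders the alarm after the group.
resource "aws_cloudwatch_metric_alarm" "xray_faults" {
  alarm_name          = "xray-faults-5xx-spike"
  alarm_description   = "Fault spike in X-Ray group ${module.x_ray.group_arns["faults-5xx"]}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  period              = 300
  threshold           = 25
  statistic           = "Sum"
  namespace           = "AWS/X-Ray"
  metric_name         = "ApproximateTraceCount"
  treat_missing_data  = "notBreaching"

  dimensions = {
    GroupName = "faults-5xx"
  }

  alarm_actions = [aws_sns_topic.observability.arn]
}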
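The example above references aws_kms_key.xray and aws_sns_topic.observability without defining them. A minimal sketch of those supporting resources might look like the following; the alias and topic names are hypothetical, and the key policy is left at the provider default, so check it against the permissions listed in the encryption_kms_key_id description before relying on it.

```hcl
# Hypothetical supporting resources assumed by the usage example.
resource "aws_kms_key" "xray" {
  description             = "Customer-managed key for X-Ray trace encryption"
  enable_key_rotation     = true
  deletion_window_in_days = 30
}

# Optional alias so the key is easy to find in the console.
resource "aws_kms_alias" "xray" {
  name          = "alias/xray-traces"
  target_key_id = aws_kms_key.xray.key_id
}

# Topic that receives the CloudWatch alarm notifications.
resource "aws_sns_topic" "observability" {
  name = "observability-alerts"
}
```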

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config: live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config: live/prod/xray/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-xray?ref=v1.0.0"
}

inputs = {
  sampling_rules = [
    {
      rule_name      = "..."
      priority       = 9000 # 1–9999; lower runs first
      reservoir_size = 1
      fixed_rate     = 0.05
    },
  ]

  groups = [
    {
      group_name        = "..."
      filter_expression = "..."
    },
  ]
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/xray && terragrunt apply)     # this module only
(cd live/prod && terragrunt run-all apply)  # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
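As an illustration of the per-environment override, a dev counterpart to the prod config could look like the sketch below. The live/dev/xray path, the rule values, and the assumption that dev lives in a separate AWS account (sampling rule and group names must not collide within one account and region) are all hypothetical.

```hcl
# live/dev/xray/terragrunt.hcl (hypothetical path, mirroring live/prod/xray)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-xray?ref=v1.0.0"
}

inputs = {
  # Dev traffic is low, so sample generously with a single catch-all rule.
  sampling_rules = [
    {
      rule_name      = "catch-all"
      priority       = 9000
      reservoir_size = 1
      fixed_rate     = 0.5
    },
  ]

  groups = [
    {
      group_name        = "faults-5xx"
      filter_expression = "fault = true OR http.status >= 500"
    },
  ]

  # Leave the account/region encryption singleton to whichever stack owns it.
  manage_encryption_config = false
}
```

Only the inputs block differs from prod; the module source and version pin stay identical, which is what keeps the environments comparable.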

Inputs

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| sampling_rules | list(object) | [] | No | Priority-ordered sampling rules (reservoir_size, fixed_rate, and match dimensions). Validated for unique names/priorities and rate 0.0–1.0. |
| groups | list(object) | [] | No | X-Ray groups: a filter_expression plus optional Insights/notifications toggles. |
| manage_encryption_config | bool | false | No | Manage the account+region encryption singleton from this config. Enable in only one place per account/region. |
| encryption_kms_key_id | string | null | No | KMS key ARN/ID/alias for trace encryption; null uses the AWS-owned key (type NONE). |
| tags | map(string) | {} | No | Tags applied to sampling rules and groups. |

Per-rule fields (sampling_rules[*])

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| rule_name | string | (none) | Yes | Unique name of the sampling rule. |
| priority | number | (none) | Yes | Evaluation order, 1–9999; lower runs first. Must be unique. |
| reservoir_size | number | (none) | Yes | Guaranteed traces/sec captured before fixed_rate applies. |
| fixed_rate | number | (none) | Yes | Sampling rate (0.0–1.0) for matching requests beyond the reservoir. |
| service_name | string | "*" | No | Match on the instrumented service name. |
| service_type | string | "*" | No | Match on origin (e.g. AWS::Lambda::Function). |
| host | string | "*" | No | Match on the Host header. |
| http_method | string | "*" | No | Match on HTTP method. |
| url_path | string | "*" | No | Match on request path. |
| resource_arn | string | "*" | No | Match on the ARN of the AWS resource the rule applies to. |
| attributes | map(string) | null | No | Match on key/value attributes derived from the request and known when the sampling decision is made. |
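A quick worked example of how reservoir_size and fixed_rate combine, written as plain Terraform locals so it can be evaluated with no provider. The traffic figure is hypothetical, and the formula is a steady-state approximation for a single rule, not an exact model of how X-Ray distributes the reservoir across instances.

```hcl
locals {
  # Hypothetical steady traffic matching one rule, in requests per second.
  requests_per_sec = 100

  # Same knobs as the "checkout-baseline" rule in the usage example.
  reservoir_size = 2
  fixed_rate     = 0.05

  # The reservoir is filled first, then fixed_rate applies to the remainder.
  reservoir_traces = min(local.requests_per_sec, local.reservoir_size)
  sampled_overflow = (local.requests_per_sec - local.reservoir_traces) * local.fixed_rate
}

output "approx_traces_per_sec" {
  description = "Approximate traces recorded per second under these assumptions."
  value       = local.reservoir_traces + local.sampled_overflow
}
```

With these numbers that is 2 reservoir traces plus 5% of the remaining 98 requests, or roughly 6.9 traces per second, which is a useful sanity check before raising fixed_rate on a high-traffic path.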

Per-group fields (groups[*])

| Name | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| group_name | string | (none) | Yes | Unique name of the X-Ray group. |
| filter_expression | string | (none) | Yes | X-Ray filter expression scoping the group (e.g. http.status >= 500). |
| insights_enabled | bool | false | No | Enable X-Ray Insights anomaly detection for the group. |
| notifications_enabled | bool | false | No | Send Insights notifications (requires insights_enabled = true). |

Outputs

| Name | Description |
| --- | --- |
| sampling_rule_arns | Map of rule_name to sampling rule ARN. |
| sampling_rule_names | List of sampling rule names managed by the module. |
| group_arns | Map of group_name to X-Ray group ARN. |
| group_names | List of X-Ray group names managed by the module. |
| encryption_type | Active encryption type (KMS or NONE), if managed here. |
| encryption_key_id | KMS key used for trace encryption, if managed and set to KMS. |

Enterprise scenario

A retail company runs ~40 microservices behind an API Gateway, all instrumented with the ADOT collector. The platform team deploys this module once per region from a shared observability Terraform stack: a sampling matrix captures 10% of checkout and payment traffic and a 2% catch-all everywhere else, keeping the X-Ray bill predictable during Black Friday. Because sampling rules are applied when a request starts and cannot match on its outcome, fault visibility comes from the faults-5xx group over the sampled traces rather than from a rule that keeps every failed request. Named groups (faults-5xx, checkout-funnel, per-domain service groups) drive team-specific Service Map dashboards and Insights, and manage_encryption_config = true with a customer-managed KMS key satisfies the PCI requirement that trace payloads, which can carry request metadata, are encrypted with a key the company controls and can rotate.
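The once-per-region rollout in this scenario could be sketched as below. The regions, provider aliases, service names, and the xray_kms_key_arns variable are hypothetical; the 10% and 2% rates come from the scenario. KMS keys are regional, so each region gets its own key ARN, and each region's stack owns that region's encryption singleton.

```hcl
variable "xray_kms_key_arns" {
  description = "Hypothetical map of region => customer-managed KMS key ARN."
  type        = map(string)
}

provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

provider "aws" {
  alias  = "euw1"
  region = "eu-west-1"
}

locals {
  # One sampling matrix shared by every region.
  sampling_rules = [
    { rule_name = "checkout", priority = 100, reservoir_size = 5, fixed_rate = 0.10, service_name = "checkout" },
    { rule_name = "payments", priority = 110, reservoir_size = 5, fixed_rate = 0.10, service_name = "payments" },
    { rule_name = "catch-all", priority = 9000, reservoir_size = 1, fixed_rate = 0.02 },
  ]

  groups = [
    { group_name = "faults-5xx", filter_expression = "fault = true OR http.status >= 500", insights_enabled = true },
  ]
}

module "xray_use1" {
  source    = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-xray?ref=v1.0.0"
  providers = { aws = aws.use1 }

  sampling_rules           = local.sampling_rules
  groups                   = local.groups
  manage_encryption_config = true
  encryption_kms_key_id    = var.xray_kms_key_arns["us-east-1"]
}

module "xray_euw1" {
  source    = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-xray?ref=v1.0.0"
  providers = { aws = aws.euw1 }

  sampling_rules           = local.sampling_rules
  groups                   = local.groups
  manage_encryption_config = true
  encryption_kms_key_id    = var.xray_kms_key_arns["eu-west-1"]
}
```

Keeping the matrix in locals means a rate change is reviewed once and lands in every region on the same apply.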

Best practices


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
