Terraform Module: AWS ElastiCache — Production-Ready Redis Replication Groups with Failover and Encryption

Quick take: A reusable Terraform module for AWS ElastiCache (Redis OSS / Valkey) that provisions Multi-AZ replication groups with automatic failover, encryption in transit and at rest, a parameter group, a subnet group, and a dedicated security group. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "elasticache" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"

  name_prefix = "..."           # Short prefix for all resource names (service/app name).
  environment = "..."           # One of `dev`, `staging`, `prod`; used in naming and tags.
  vpc_id      = "..."           # VPC in which the cache security group is created.
  subnet_ids  = ["...", "..."]  # Private subnet IDs across >= 2 AZs for the subnet group.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

AWS ElastiCache is a managed in-memory data store that runs the Redis OSS and Valkey engines (and Memcached, which this module does not cover). It is the default choice when an application needs sub-millisecond reads for session state, hot cache entries, rate-limit counters, leaderboards, or pub/sub fan-out, and you do not want to operate Redis on EC2 yourself. The aws_elasticache_replication_group resource is the recommended way to model a node-based Redis/Valkey cluster: it represents one or more shards (node groups), each with a primary and zero or more read replicas, and for node-based caches it is the only path to features like Multi-AZ automatic failover, online cluster resizing, and encryption.

Wrapping it in a reusable module matters because a correct ElastiCache deployment has many easy-to-get-wrong parts that rarely change between teams: a dedicated subnet group spanning private subnets across at least two AZs, a security group that locks port 6379 to your application tier, encryption in transit and at rest, an auth_token (or, better, IAM/RBAC) so the endpoint is not open to anyone inside the VPC, a parameter group that pins maxmemory-policy, and failover wiring, where automatic_failover_enabled = true needs at least one replica and is required for Multi-AZ and for cluster mode. Centralising all of that into one versioned module means every cache your organisation ships is Multi-AZ, encrypted, and network-restricted by default, instead of someone hand-rolling a single-node, unencrypted group that becomes a 2 a.m. incident.

When to use it

You need a Redis OSS or Valkey cache in AWS and want it Multi-AZ with automatic failover from day one, not retrofitted later.
You want encryption in transit and at rest on by default, with an auth_token one input away, rather than left to reviewer vigilance.
You are deploying the same cache shape repeatedly, per environment (dev/staging/prod) or per service, and want one audited definition with environment-specific node sizes.
You want to optionally turn on cluster mode (multiple shards) for datasets that exceed a single node’s memory, using the same module by flipping the shard count.
You do not need a Memcached cluster (use aws_elasticache_cluster for that) or a serverless cache (use aws_elasticache_serverless_cache) — this module is specifically for replication-group-based Redis/Valkey.

Module structure

terraform-module-aws-elasticache/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # subnet group, SG, parameter group, replication group
├── variables.tf     # var-driven inputs with validations
└── outputs.tf       # ids, endpoints, port, SG id

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  name = "${var.name_prefix}-${var.environment}"

  # Cluster mode (sharded) is enabled only when more than one node group is requested.
  cluster_mode_enabled = var.num_node_groups > 1

  tags = merge(
    {
      Name        = local.name
      Environment = var.environment
      Engine      = var.engine
      ManagedBy   = "terraform"
      Module      = "terraform-module-aws-elasticache"
    },
    var.tags,
  )
}

# Subnet group: spans the private subnets ElastiCache nodes are placed into.
resource "aws_elasticache_subnet_group" "this" {
  name        = "${local.name}-subnets"
  description = "Subnet group for ${local.name} ElastiCache"
  subnet_ids  = var.subnet_ids
  tags        = local.tags
}

# Dedicated security group; ingress is opened only to the CIDRs/SGs you pass in.
resource "aws_security_group" "this" {
  name        = "${local.name}-cache-sg"
  description = "Access to ${local.name} ElastiCache on port ${var.port}"
  vpc_id      = var.vpc_id
  tags        = local.tags
}

resource "aws_security_group_rule" "ingress_cidr" {
  count = length(var.allowed_cidr_blocks) > 0 ? 1 : 0

  type              = "ingress"
  description       = "Redis/Valkey from allowed CIDRs"
  from_port         = var.port
  to_port           = var.port
  protocol          = "tcp"
  cidr_blocks       = var.allowed_cidr_blocks
  security_group_id = aws_security_group.this.id
}

resource "aws_security_group_rule" "ingress_sg" {
  for_each = toset(var.allowed_security_group_ids)

  type                     = "ingress"
  description              = "Redis/Valkey from app security group"
  from_port                = var.port
  to_port                  = var.port
  protocol                 = "tcp"
  source_security_group_id = each.value
  security_group_id        = aws_security_group.this.id
}

resource "aws_security_group_rule" "egress_all" {
  type              = "egress"
  description       = "Allow all outbound"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.this.id
}

# Parameter group: pins engine-family behaviour (eviction policy, keyspace events, etc.).
resource "aws_elasticache_parameter_group" "this" {
  name        = "${local.name}-params"
  family      = var.parameter_group_family
  description = "Parameter group for ${local.name}"

  parameter {
    name  = "maxmemory-policy"
    value = var.maxmemory_policy
  }

  dynamic "parameter" {
    for_each = var.parameters
    content {
      name  = parameter.value.name
      value = parameter.value.value
    }
  }

  tags = local.tags

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_elasticache_replication_group" "this" {
  replication_group_id = local.name
  description          = var.description

  engine         = var.engine
  engine_version = var.engine_version
  node_type      = var.node_type
  port           = var.port

  # Topology: either replicas-per-primary (non-clustered) or sharded (cluster mode).
  num_node_groups         = local.cluster_mode_enabled ? var.num_node_groups : null
  replicas_per_node_group = local.cluster_mode_enabled ? var.replicas_per_node_group : null
  num_cache_clusters      = local.cluster_mode_enabled ? null : var.num_cache_clusters

  # Multi-AZ + failover. automatic_failover needs at least one replica and MUST be on for Multi-AZ or cluster mode.
  automatic_failover_enabled = var.automatic_failover_enabled
  multi_az_enabled           = var.multi_az_enabled

  subnet_group_name    = aws_elasticache_subnet_group.this.name
  security_group_ids   = concat([aws_security_group.this.id], var.extra_security_group_ids)
  parameter_group_name = aws_elasticache_parameter_group.this.name

  # Encryption: at rest (optionally with a CMK) and in transit (TLS).
  at_rest_encryption_enabled = var.at_rest_encryption_enabled
  kms_key_id                 = var.kms_key_id
  transit_encryption_enabled = var.transit_encryption_enabled
  auth_token                 = var.auth_token
  auth_token_update_strategy = var.auth_token != null ? "ROTATE" : null

  # Operational guardrails.
  maintenance_window         = var.maintenance_window
  snapshot_window            = var.snapshot_window
  snapshot_retention_limit   = var.snapshot_retention_limit
  apply_immediately          = var.apply_immediately
  auto_minor_version_upgrade = var.auto_minor_version_upgrade

  # Push slow/engine logs to CloudWatch when a destination is supplied.
  dynamic "log_delivery_configuration" {
    for_each = var.log_delivery_configurations
    content {
      destination      = log_delivery_configuration.value.destination
      destination_type = log_delivery_configuration.value.destination_type
      log_format       = log_delivery_configuration.value.log_format
      log_type         = log_delivery_configuration.value.log_type
    }
  }

  tags = local.tags

  lifecycle {
    # Ignore node-count drift (e.g. replicas added or removed outside Terraform).
    # Note: later changes to var.num_cache_clusters are then not applied either.
    ignore_changes = [num_cache_clusters]
  }
}
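
The failover rules noted in the comments above are not checked by the module as written, so a bad combination only surfaces as an API error at apply time. A minimal sketch of a plan-time guard, assuming a terraform_data resource (Terraform 1.4+) added to main.tf; the names topology_guard and has_failover_target are hypothetical and not part of the published module:

```hcl
locals {
  # True when every shard has at least one replica that could be promoted.
  has_failover_target = (
    var.num_node_groups > 1
    ? var.replicas_per_node_group >= 1
    : var.num_cache_clusters >= 2
  )
}

# Hypothetical guard resource: creates nothing in AWS, only carries the checks.
resource "terraform_data" "topology_guard" {
  lifecycle {
    precondition {
      condition     = !var.automatic_failover_enabled || local.has_failover_target
      error_message = "automatic_failover_enabled requires at least one replica per shard."
    }

    precondition {
      condition     = !var.multi_az_enabled || var.automatic_failover_enabled
      error_message = "multi_az_enabled requires automatic_failover_enabled = true."
    }
  }
}
```

With this in place, a call such as num_cache_clusters = 1 with the default flags would fail during plan with a readable message rather than midway through an apply.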

variables.tf

variable "name_prefix" {
  description = "Short prefix for all resource names (e.g. the service or app name)."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,30}$", var.name_prefix))
    error_message = "name_prefix must be lowercase alphanumeric/hyphens, start with a letter, 2-31 chars."
  }
}

variable "environment" {
  description = "Deployment environment, used in naming and tags."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}

variable "description" {
  description = "Human-readable description for the replication group."
  type        = string
  default     = "Managed by Terraform"
}

variable "vpc_id" {
  description = "VPC in which the cache security group is created."
  type        = string
}

variable "subnet_ids" {
  description = "Private subnet IDs (>= 2 AZs) for the ElastiCache subnet group."
  type        = list(string)

  validation {
    condition     = length(var.subnet_ids) >= 2
    error_message = "Provide at least two subnets in different AZs for Multi-AZ failover."
  }
}

variable "engine" {
  description = "Cache engine: 'redis' or 'valkey'."
  type        = string
  default     = "redis"

  validation {
    condition     = contains(["redis", "valkey"], var.engine)
    error_message = "engine must be 'redis' or 'valkey'."
  }
}

variable "engine_version" {
  description = "Engine version (e.g. '7.1' for Redis OSS, '8.0' for Valkey)."
  type        = string
  default     = "7.1"
}

variable "parameter_group_family" {
  description = "Parameter group family matching the engine/version (e.g. 'redis7', 'valkey8')."
  type        = string
  default     = "redis7"
}

variable "node_type" {
  description = "Instance class for each node (e.g. cache.t4g.micro, cache.r7g.large)."
  type        = string
  default     = "cache.t4g.micro"

  validation {
    condition     = can(regex("^cache\\.", var.node_type))
    error_message = "node_type must be an ElastiCache instance class starting with 'cache.'."
  }
}

variable "port" {
  description = "TCP port the cache listens on."
  type        = number
  default     = 6379
}

# --- Topology (non-clustered) ---
variable "num_cache_clusters" {
  description = "Number of nodes (1 primary + N-1 replicas) when cluster mode is OFF."
  type        = number
  default     = 2

  validation {
    condition     = var.num_cache_clusters >= 1 && var.num_cache_clusters <= 6
    error_message = "num_cache_clusters must be between 1 and 6."
  }
}

# --- Topology (cluster mode / sharded) ---
variable "num_node_groups" {
  description = "Number of shards. >1 enables cluster mode. Leave at 1 for a single-shard group."
  type        = number
  default     = 1

  validation {
    condition     = var.num_node_groups >= 1
    error_message = "num_node_groups must be >= 1."
  }
}

variable "replicas_per_node_group" {
  description = "Read replicas per shard when cluster mode is ON (num_node_groups > 1)."
  type        = number
  default     = 1
}

# --- Failover / HA ---
variable "automatic_failover_enabled" {
  description = "Enable automatic failover. Needs at least one replica; must be true for Multi-AZ or cluster mode."
  type        = bool
  default     = true
}

variable "multi_az_enabled" {
  description = "Place replicas in multiple AZs. Requires automatic_failover_enabled = true."
  type        = bool
  default     = true
}

# --- Encryption / auth ---
variable "at_rest_encryption_enabled" {
  description = "Enable encryption at rest."
  type        = bool
  default     = true
}

variable "kms_key_id" {
  description = "Optional KMS CMK ARN for at-rest encryption. Null = AWS-managed key."
  type        = string
  default     = null
}

variable "transit_encryption_enabled" {
  description = "Enable TLS in transit. Required to use auth_token."
  type        = bool
  default     = true
}

variable "auth_token" {
  description = "Redis AUTH token (16-128 printable chars). Requires transit encryption. Pass via a secret, never hardcode."
  type        = string
  default     = null
  sensitive   = true

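  validation {
    # Conditional form avoids calling length() on null; || is not guaranteed to
    # short-circuit on every Terraform version this module supports.
    condition     = var.auth_token == null ? true : (length(var.auth_token) >= 16 && length(var.auth_token) <= 128)
    error_message = "auth_token must be 16-128 characters when set."
  }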
}

# --- Networking ---
variable "allowed_cidr_blocks" {
  description = "CIDR blocks permitted to reach the cache port."
  type        = list(string)
  default     = []
}

variable "allowed_security_group_ids" {
  description = "Source security group IDs (app tier) permitted to reach the cache port."
  type        = list(string)
  default     = []
}

variable "extra_security_group_ids" {
  description = "Additional pre-existing SG IDs to attach to the replication group."
  type        = list(string)
  default     = []
}

# --- Parameters ---
variable "maxmemory_policy" {
  description = "Eviction policy when memory is full (e.g. allkeys-lru, volatile-lru, noeviction)."
  type        = string
  default     = "volatile-lru"
}

variable "parameters" {
  description = "Extra engine parameters to set in the parameter group."
  type = list(object({
    name  = string
    value = string
  }))
  default = []
}

# --- Operations ---
variable "maintenance_window" {
  description = "Weekly maintenance window (UTC), e.g. 'sun:05:00-sun:06:00'."
  type        = string
  default     = "sun:05:00-sun:06:00"
}

variable "snapshot_window" {
  description = "Daily window for automatic snapshots (UTC), e.g. '03:00-04:00'."
  type        = string
  default     = "03:00-04:00"
}

variable "snapshot_retention_limit" {
  description = "Days to retain automatic snapshots. 0 disables snapshots."
  type        = number
  default     = 7
}

variable "apply_immediately" {
  description = "Apply modifications immediately instead of during the maintenance window."
  type        = bool
  default     = false
}

variable "auto_minor_version_upgrade" {
  description = "Allow automatic minor engine version upgrades during maintenance."
  type        = bool
  default     = true
}

variable "log_delivery_configurations" {
  description = "CloudWatch Logs / Kinesis Data Firehose log delivery configs (slow-log, engine-log)."
  type = list(object({
    destination      = string
    destination_type = string
    log_format       = string
    log_type         = string
  }))
  default = []
}

variable "tags" {
  description = "Additional tags merged onto every resource."
  type        = map(string)
  default     = {}
}

outputs.tf

output "replication_group_id" {
  description = "The ElastiCache replication group ID."
  value       = aws_elasticache_replication_group.this.id
}

output "arn" {
  description = "ARN of the replication group."
  value       = aws_elasticache_replication_group.this.arn
}

output "primary_endpoint_address" {
  description = "Primary write endpoint (non-cluster mode)."
  value       = aws_elasticache_replication_group.this.primary_endpoint_address
}

output "reader_endpoint_address" {
  description = "Reader endpoint that load-balances across replicas (non-cluster mode)."
  value       = aws_elasticache_replication_group.this.reader_endpoint_address
}

output "configuration_endpoint_address" {
  description = "Configuration endpoint (cluster mode / sharded only)."
  value       = aws_elasticache_replication_group.this.configuration_endpoint_address
}

output "port" {
  description = "Port the cache listens on."
  value       = aws_elasticache_replication_group.this.port
}

output "member_clusters" {
  description = "Identifiers of all individual cache nodes in the group."
  value       = aws_elasticache_replication_group.this.member_clusters
}

output "security_group_id" {
  description = "ID of the security group created for the cache."
  value       = aws_security_group.this.id
}

output "subnet_group_name" {
  description = "Name of the created subnet group."
  value       = aws_elasticache_subnet_group.this.name
}

output "parameter_group_name" {
  description = "Name of the created parameter group."
  value       = aws_elasticache_parameter_group.this.name
}
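
Only one of the primary and configuration endpoints is populated, depending on whether cluster mode is on. A caller can pick whichever is set. A minimal sketch, assuming a module call named elasticache as in the examples in this article, transit encryption left on (hence the rediss:// scheme that many Redis clients use for TLS), and hypothetical names cache_host and cache_url:

```hcl
locals {
  # coalesce() returns the first argument that is neither null nor an empty string,
  # so this resolves to the configuration endpoint in cluster mode and to the
  # primary endpoint otherwise.
  cache_host = coalesce(
    module.elasticache.configuration_endpoint_address,
    module.elasticache.primary_endpoint_address,
  )
}

output "cache_url" {
  description = "TLS connection URL for the cache (hypothetical convenience output)."
  value       = "rediss://${local.cache_host}:${module.elasticache.port}"
}
```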

How to use it

# Generate and store the AUTH token in Secrets Manager; never hardcode it.
resource "random_password" "redis_auth" {
  length  = 32
  special = false # Redis AUTH tokens disallow some symbols; alphanumeric is safe.
}

resource "aws_secretsmanager_secret" "redis_auth" {
  name = "checkout/redis-auth-token"
}

resource "aws_secretsmanager_secret_version" "redis_auth" {
  secret_id     = aws_secretsmanager_secret.redis_auth.id
  secret_string = random_password.redis_auth.result
}

module "elasticache" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"

  name_prefix = "checkout-cache"
  environment = "prod"
  description = "Session + idempotency cache for the checkout service"

  vpc_id     = module.network.vpc_id
  subnet_ids = module.network.private_subnet_ids

  engine                 = "valkey"
  engine_version         = "8.0"
  parameter_group_family = "valkey8"
  node_type              = "cache.r7g.large"

  # Non-clustered: 1 primary + 2 replicas across 3 AZs.
  num_cache_clusters         = 3
  automatic_failover_enabled = true
  multi_az_enabled           = true

  # Lock down access to the app tier only.
  allowed_security_group_ids = [module.checkout_service.app_security_group_id]

  # Security.
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = aws_secretsmanager_secret_version.redis_auth.secret_string

  # Sensible eviction for a session cache.
  maxmemory_policy         = "volatile-lru"
  snapshot_retention_limit = 7

  tags = {
    Team       = "payments"
    CostCenter = "cc-4471"
  }
}

# Downstream: hand the endpoint + secret to the ECS task definition.
resource "aws_ssm_parameter" "redis_endpoint" {
  name  = "/checkout/redis/primary-endpoint"
  type  = "String"
  value = "${module.elasticache.primary_endpoint_address}:${module.elasticache.port}"
}
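
For the sharded case, the same module can be called with a shard count above one. A minimal sketch under stated assumptions: the service name catalog-cache and module.catalog_service are hypothetical, module.network is the same network module used above, and the cluster-enabled parameter is passed explicitly on the assumption that a custom parameter group does not turn it on by itself:

```hcl
module "elasticache_sharded" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"

  name_prefix = "catalog-cache" # hypothetical service name
  environment = "prod"

  vpc_id     = module.network.vpc_id
  subnet_ids = module.network.private_subnet_ids

  node_type = "cache.r7g.large"

  # Cluster mode: 3 shards, each with 1 primary + 1 replica.
  num_node_groups         = 3
  replicas_per_node_group = 1

  # Assumption: the module's parameter group needs this set for a sharded group.
  parameters = [
    { name = "cluster-enabled", value = "yes" },
  ]

  # Hypothetical app-tier security group output.
  allowed_security_group_ids = [module.catalog_service.app_security_group_id]
}

# Cluster-aware clients connect through the configuration endpoint.
output "catalog_cache_configuration_endpoint" {
  value = module.elasticache_sharded.configuration_endpoint_address
}
```

In this mode num_cache_clusters is ignored by the module, and primary_endpoint_address is not the endpoint clients should use.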

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend  = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...S3 state bucket + key per path...
  }
}

2. Module config — live/prod/elasticache/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"
}

inputs = {
  name_prefix = "..."
  environment = "..."
  vpc_id      = "..."
  subnet_ids  = ["...", "..."]
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/elasticache && terragrunt apply)   # this module only
(cd live/prod && terragrunt run-all apply)       # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules. For a single stack, the plain Quickstart above is enough.
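
To make the per-environment override concrete, here is a sketch of what a dev counterpart to the prod config might look like. The path live/dev/elasticache/terragrunt.hcl is an assumption that mirrors the prod layout, and the "..." placeholders are left for you to fill in:

```hcl
# live/dev/elasticache/terragrunt.hcl (hypothetical path)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"
}

inputs = {
  name_prefix = "..."
  environment = "dev"
  vpc_id      = "..."
  subnet_ids  = ["...", "..."]

  # Single throwaway node: there is no replica to promote, so failover and
  # Multi-AZ are switched off and snapshots are disabled.
  num_cache_clusters         = 1
  automatic_failover_enabled = false
  multi_az_enabled           = false
  snapshot_retention_limit   = 0
}
```

Only the inputs differ from prod; the module source and version pin stay identical, which is the point of the layout.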

Inputs

Name	Type	Default	Required	Description
`name_prefix`	`string`	—	Yes	Short prefix for all resource names (service/app name).
`environment`	`string`	—	Yes	One of `dev`, `staging`, `prod`; used in naming and tags.
`description`	`string`	`"Managed by Terraform"`	No	Description for the replication group.
`vpc_id`	`string`	—	Yes	VPC in which the cache security group is created.
`subnet_ids`	`list(string)`	—	Yes	Private subnet IDs across >= 2 AZs for the subnet group.
`engine`	`string`	`"redis"`	No	Cache engine: `redis` or `valkey`.
`engine_version`	`string`	`"7.1"`	No	Engine version (e.g. `7.1`, `8.0`).
`parameter_group_family`	`string`	`"redis7"`	No	Parameter group family matching engine/version.
`node_type`	`string`	`"cache.t4g.micro"`	No	Instance class for each node.
`port`	`number`	`6379`	No	TCP port the cache listens on.
`num_cache_clusters`	`number`	`2`	No	Node count (primary + replicas) when cluster mode is off.
`num_node_groups`	`number`	`1`	No	Shard count; `>1` enables cluster mode.
`replicas_per_node_group`	`number`	`1`	No	Replicas per shard when cluster mode is on.
`automatic_failover_enabled`	`bool`	`true`	No	Enable automatic failover (needs >= 1 replica; required for Multi-AZ and cluster mode).
`multi_az_enabled`	`bool`	`true`	No	Spread replicas across AZs (requires failover).
`at_rest_encryption_enabled`	`bool`	`true`	No	Enable encryption at rest.
`kms_key_id`	`string`	`null`	No	KMS CMK ARN for at-rest encryption; null = AWS-managed key.
`transit_encryption_enabled`	`bool`	`true`	No	Enable TLS in transit (required for `auth_token`).
`auth_token`	`string`	`null`	No	Redis AUTH token (16-128 chars); pass via a secret.
`allowed_cidr_blocks`	`list(string)`	`[]`	No	CIDRs permitted to reach the cache port.
`allowed_security_group_ids`	`list(string)`	`[]`	No	Source SG IDs (app tier) permitted to reach the cache.
`extra_security_group_ids`	`list(string)`	`[]`	No	Additional existing SGs to attach to the group.
`maxmemory_policy`	`string`	`"volatile-lru"`	No	Eviction policy when memory is full.
`parameters`	`list(object)`	`[]`	No	Extra engine parameters for the parameter group.
`maintenance_window`	`string`	`"sun:05:00-sun:06:00"`	No	Weekly maintenance window (UTC).
`snapshot_window`	`string`	`"03:00-04:00"`	No	Daily automatic snapshot window (UTC).
`snapshot_retention_limit`	`number`	`7`	No	Days to retain snapshots; 0 disables.
`apply_immediately`	`bool`	`false`	No	Apply changes immediately vs. during maintenance.
`auto_minor_version_upgrade`	`bool`	`true`	No	Allow automatic minor version upgrades.
`log_delivery_configurations`	`list(object)`	`[]`	No	Slow-log/engine-log delivery to CloudWatch Logs or Kinesis Data Firehose.
`tags`	`map(string)`	`{}`	No	Additional tags merged onto every resource.

Outputs

Name	Description
`replication_group_id`	The ElastiCache replication group ID.
`arn`	ARN of the replication group.
`primary_endpoint_address`	Primary write endpoint (non-cluster mode).
`reader_endpoint_address`	Reader endpoint load-balancing across replicas (non-cluster mode).
`configuration_endpoint_address`	Configuration endpoint (cluster mode only).
`port`	Port the cache listens on.
`member_clusters`	Identifiers of all individual cache nodes in the group.
`security_group_id`	ID of the security group created for the cache.
`subnet_group_name`	Name of the created subnet group.
`parameter_group_name`	Name of the created parameter group.

Enterprise scenario

A payments platform runs its checkout service on ECS Fargate and needs a low-latency store for user sessions and idempotency keys that must stay available through an AZ outage. The platform team consumes this module pinned at ref=v1.0.0 to stand up a Valkey 8 replication group with one primary and two replicas spread across three AZs (num_cache_clusters = 3, multi_az_enabled = true), TLS plus an auth_token sourced from Secrets Manager, and maxmemory-policy = volatile-lru so only TTL’d session keys are evicted under pressure. Because the module defaults to automatic_failover_enabled = true and creates a dedicated security group scoped to the checkout app SG, every team that adopts it inherits a Multi-AZ, encrypted, access-controlled cache without re-deriving the dozen settings that make ElastiCache production-safe.

Best practices

Always pair transit_encryption_enabled = true with an auth_token (or migrate to ElastiCache IAM/RBAC users). TLS without auth leaves the endpoint reachable by anything inside the VPC; store the token in Secrets Manager and rotate it with the ROTATE update strategy rather than recreating the group.
Keep automatic_failover_enabled and multi_az_enabled on for any group with replicas and spread subnet_ids across at least two AZs. A single-node group has no failover target, so num_cache_clusters = 1 also requires setting both flags to false; reserve it for throwaway dev caches only.
Pin maxmemory-policy deliberately per workload. Use volatile-lru/volatile-ttl for caches where keys carry TTLs, allkeys-lru for a pure cache, and noeviction only when the cache is a system of record you must not silently drop.
Right-size and reserve for cost. Use Graviton-based cache.r7g/cache.t4g node types, scale reads with replicas (num_cache_clusters, or replicas_per_node_group in cluster mode) rather than oversizing the primary, and buy ElastiCache Reserved Nodes for steady prod workloads to cut roughly a third off on-demand pricing.
Choose topology by data size, not habit. Stay non-clustered (num_node_groups = 1) until a single node’s memory is the bottleneck, then enable cluster mode and use the configuration_endpoint_address output — application clients must be cluster-aware to follow shard slots.
Name and tag consistently via name_prefix + environment, enable snapshots (snapshot_retention_limit >= 1) for any stateful use, and ship slow-log/engine-log to CloudWatch through log_delivery_configurations so latency regressions are observable before they page you.
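
The log-shipping practice above maps onto the log_delivery_configurations input. A minimal sketch, assuming a CloudWatch Logs destination; the log group name and the 30-day retention are hypothetical choices, and the "..." placeholders are the same required inputs as in the Quickstart:

```hcl
# Hypothetical log group for the slow log.
resource "aws_cloudwatch_log_group" "redis_slow" {
  name              = "/elasticache/checkout-cache-prod/slow-log"
  retention_in_days = 30
}

module "elasticache" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"

  name_prefix = "..."
  environment = "..."
  vpc_id      = "..."
  subnet_ids  = ["...", "..."]

  # One entry per log type; add a second object with log_type = "engine-log" if needed.
  log_delivery_configurations = [
    {
      destination      = aws_cloudwatch_log_group.redis_slow.name
      destination_type = "cloudwatch-logs"
      log_format       = "json"
      log_type         = "slow-log"
    },
  ]
}
```

Depending on the account setup, CloudWatch Logs may also need a resource policy that allows ElastiCache log delivery; that is outside what this module manages.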