Quick take — A reusable Terraform module for AWS ElastiCache (Redis OSS / Valkey) that provisions Multi-AZ replication groups with automatic failover, encryption in transit and at rest, parameter groups, and a hardened subnet group. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "elasticache" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"
name_prefix = "..." # Short prefix for all resource names (service/app name).
environment = "..." # One of `dev`, `staging`, `prod`; used in naming and tag…
vpc_id = "..." # VPC in which the cache security group is created.
subnet_ids = ["...", "..."] # Private subnet IDs across >= 2 AZs for the subnet group.
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
What this module is
AWS ElastiCache is a managed in-memory data store that runs the Redis OSS and Valkey engines (and Memcached, which this module does not cover). It is the default choice when an application needs sub-millisecond reads for session state, hot cache entries, rate-limit counters, leaderboards, or pub/sub fan-out, and you do not want to operate Redis on EC2 yourself. The aws_elasticache_replication_group resource is the modern, recommended way to model a Redis/Valkey cluster: it represents one or more shards (node groups), each with a primary and zero-or-more read replicas, and it is the only path to features like Multi-AZ automatic failover, online cluster resizing, and encryption.
Wrapping it in a reusable module matters because a correct ElastiCache deployment has many easy-to-get-wrong parts that rarely change between teams: a dedicated subnet group spanning private subnets across at least two AZs, a security group that locks port 6379 to your application tier, encryption in transit and at rest, an auth_token (or, better, IAM/RBAC) so the endpoint is not open to anyone inside the VPC, a parameter group that pins maxmemory-policy, and failover wiring in which multi_az_enabled and cluster mode both require automatic_failover_enabled = true, which in turn needs at least one replica. Centralising all of that into one versioned module means every cache your organisation ships is Multi-AZ, encrypted, and access-controlled by default, instead of someone hand-rolling a single-node, unencrypted group that becomes a 2 a.m. incident.
When to use it
- You need a Redis OSS or Valkey cache in AWS and want it Multi-AZ with automatic failover from day one, not retrofitted later.
- You want encryption in transit and at rest enabled by default, with auth_token support built in, rather than left to reviewer vigilance.
- You are deploying the same cache shape repeatedly, per environment (dev/staging/prod) or per service, and want one audited definition with environment-specific node sizes.
- You want to optionally turn on cluster mode (multiple shards) for datasets that exceed a single node’s memory, using the same module by flipping the shard count.
- You do not need a Memcached cluster (use aws_elasticache_cluster for that) or a serverless cache (use aws_elasticache_serverless_cache); this module is specifically for replication-group-based Redis/Valkey.
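As a rough illustration of the cluster-mode bullet above, the sketch below calls the same module with a shard count greater than one. The module name, the service name, and the two `var.*` references are hypothetical and assumed to be declared by the caller. The `cluster-enabled` entry reflects an assumption that the custom parameter group must switch cluster mode on itself, since main.tf as shown does not set it.

```hcl
# Sketch only: same module, sharded topology. All values are illustrative.
module "elasticache_sharded" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"

  name_prefix = "catalog-cache"        # hypothetical service name
  environment = "prod"
  vpc_id      = var.vpc_id             # assumed to be declared by the caller
  subnet_ids  = var.private_subnet_ids # assumed to be declared by the caller

  node_type = "cache.r7g.large"

  # num_node_groups > 1 switches the module to cluster mode.
  num_node_groups         = 3
  replicas_per_node_group = 1

  # Cluster mode requires automatic failover (already the module default).
  automatic_failover_enabled = true

  # Assumption: a custom parameter group needs cluster-enabled = yes for a
  # sharded group; verify against your engine family before relying on it.
  parameters = [
    {
      name  = "cluster-enabled"
      value = "yes"
    },
  ]
}
```

In this mode clients would connect through the configuration_endpoint_address output rather than the primary endpoint.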
Module structure
terraform-module-aws-elasticache/
├── versions.tf # provider + Terraform version pins
├── main.tf # subnet group, SG, parameter group, replication group
├── variables.tf # var-driven inputs with validations
└── outputs.tf # ids, endpoints, port, SG id
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
main.tf
locals {
name = "${var.name_prefix}-${var.environment}"
# Cluster mode (sharded) is enabled only when num_node_groups > 1.
cluster_mode_enabled = var.num_node_groups > 1
tags = merge(
{
Name = local.name
Environment = var.environment
Engine = var.engine
ManagedBy = "terraform"
Module = "terraform-module-aws-elasticache"
},
var.tags,
)
}
# Subnet group: spans the private subnets ElastiCache nodes are placed into.
resource "aws_elasticache_subnet_group" "this" {
name = "${local.name}-subnets"
description = "Subnet group for ${local.name} ElastiCache"
subnet_ids = var.subnet_ids
tags = local.tags
}
# Dedicated security group; ingress is opened only to the CIDRs/SGs you pass in.
resource "aws_security_group" "this" {
name = "${local.name}-cache-sg"
description = "Access to ${local.name} ElastiCache on port ${var.port}"
vpc_id = var.vpc_id
tags = local.tags
}
resource "aws_security_group_rule" "ingress_cidr" {
count = length(var.allowed_cidr_blocks) > 0 ? 1 : 0
type = "ingress"
description = "Redis/Valkey from allowed CIDRs"
from_port = var.port
to_port = var.port
protocol = "tcp"
cidr_blocks = var.allowed_cidr_blocks
security_group_id = aws_security_group.this.id
}
resource "aws_security_group_rule" "ingress_sg" {
for_each = toset(var.allowed_security_group_ids)
type = "ingress"
description = "Redis/Valkey from app security group"
from_port = var.port
to_port = var.port
protocol = "tcp"
source_security_group_id = each.value
security_group_id = aws_security_group.this.id
}
resource "aws_security_group_rule" "egress_all" {
type = "egress"
description = "Allow all outbound"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
security_group_id = aws_security_group.this.id
}
# Parameter group: pins engine-family behaviour (eviction policy, keyspace events, etc.).
resource "aws_elasticache_parameter_group" "this" {
name = "${local.name}-params"
family = var.parameter_group_family
description = "Parameter group for ${local.name}"
parameter {
name = "maxmemory-policy"
value = var.maxmemory_policy
}
dynamic "parameter" {
for_each = var.parameters
content {
name = parameter.value.name
value = parameter.value.value
}
}
tags = local.tags
lifecycle {
create_before_destroy = true
}
}
resource "aws_elasticache_replication_group" "this" {
replication_group_id = local.name
description = var.description
engine = var.engine
engine_version = var.engine_version
node_type = var.node_type
port = var.port
# Topology: either replicas-per-primary (non-clustered) or sharded (cluster mode).
num_node_groups = local.cluster_mode_enabled ? var.num_node_groups : null
replicas_per_node_group = local.cluster_mode_enabled ? var.replicas_per_node_group : null
num_cache_clusters = local.cluster_mode_enabled ? null : var.num_cache_clusters
# Multi-AZ + failover. Multi-AZ and cluster mode both require automatic_failover_enabled = true,
# which in turn needs at least one replica.
automatic_failover_enabled = var.automatic_failover_enabled
multi_az_enabled = var.multi_az_enabled
subnet_group_name = aws_elasticache_subnet_group.this.name
security_group_ids = concat([aws_security_group.this.id], var.extra_security_group_ids)
parameter_group_name = aws_elasticache_parameter_group.this.name
# Encryption: at rest (optionally with a CMK) and in transit (TLS).
at_rest_encryption_enabled = var.at_rest_encryption_enabled
kms_key_id = var.kms_key_id
transit_encryption_enabled = var.transit_encryption_enabled
auth_token = var.auth_token
auth_token_update_strategy = var.auth_token != null ? "ROTATE" : null
# Operational guardrails.
maintenance_window = var.maintenance_window
snapshot_window = var.snapshot_window
snapshot_retention_limit = var.snapshot_retention_limit
apply_immediately = var.apply_immediately
auto_minor_version_upgrade = var.auto_minor_version_upgrade
# Push slow/engine logs to CloudWatch when a destination is supplied.
dynamic "log_delivery_configuration" {
for_each = var.log_delivery_configurations
content {
destination = log_delivery_configuration.value.destination
destination_type = log_delivery_configuration.value.destination_type
log_format = log_delivery_configuration.value.log_format
log_type = log_delivery_configuration.value.log_type
}
}
tags = local.tags
lifecycle {
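# Ignore node-count drift so replicas added or removed outside Terraform are not reverted.
# Side effect: later changes to var.num_cache_clusters are not applied to an existing group.
ignore_changes = [num_cache_clusters]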
}
}
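The failover rules in main.tf are only stated in comments, so an invalid combination would surface as an API error at apply time. One possible way to catch it earlier is a check block, which fits the Terraform >= 1.5.0 pin in versions.tf. This is a sketch, not part of the module as published; the block name ha_settings is hypothetical.

```hcl
# Sketch: report invalid HA combinations during plan (Terraform >= 1.5).
# Check blocks emit warnings; they do not block the run.
check "ha_settings" {
  assert {
    condition     = !var.multi_az_enabled || var.automatic_failover_enabled
    error_message = "multi_az_enabled = true requires automatic_failover_enabled = true."
  }

  assert {
    condition     = !var.automatic_failover_enabled || local.cluster_mode_enabled || var.num_cache_clusters >= 2
    error_message = "automatic_failover_enabled = true needs at least one replica (num_cache_clusters >= 2)."
  }

  assert {
    condition     = !local.cluster_mode_enabled || var.automatic_failover_enabled
    error_message = "Cluster mode (num_node_groups > 1) requires automatic_failover_enabled = true."
  }
}
```

If a hard failure is preferred over a warning, the same conditions could be expressed as precondition blocks inside the replication group's lifecycle block.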
variables.tf
variable "name_prefix" {
description = "Short prefix for all resource names (e.g. the service or app name)."
type = string
validation {
condition = can(regex("^[a-z][a-z0-9-]{1,30}$", var.name_prefix))
error_message = "name_prefix must be lowercase alphanumeric/hyphens, start with a letter, 2-31 chars."
}
}
variable "environment" {
description = "Deployment environment, used in naming and tags."
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "environment must be one of: dev, staging, prod."
}
}
variable "description" {
description = "Human-readable description for the replication group."
type = string
default = "Managed by Terraform"
}
variable "vpc_id" {
description = "VPC in which the cache security group is created."
type = string
}
variable "subnet_ids" {
description = "Private subnet IDs (>= 2 AZs) for the ElastiCache subnet group."
type = list(string)
validation {
condition = length(var.subnet_ids) >= 2
error_message = "Provide at least two subnets in different AZs for Multi-AZ failover."
}
}
variable "engine" {
description = "Cache engine: 'redis' or 'valkey'."
type = string
default = "redis"
validation {
condition = contains(["redis", "valkey"], var.engine)
error_message = "engine must be 'redis' or 'valkey'."
}
}
variable "engine_version" {
description = "Engine version (e.g. '7.1' for Redis OSS, '8.0' for Valkey)."
type = string
default = "7.1"
}
variable "parameter_group_family" {
description = "Parameter group family matching the engine/version (e.g. 'redis7', 'valkey8')."
type = string
default = "redis7"
}
variable "node_type" {
description = "Instance class for each node (e.g. cache.t4g.micro, cache.r7g.large)."
type = string
default = "cache.t4g.micro"
validation {
condition = can(regex("^cache\\.", var.node_type))
error_message = "node_type must be an ElastiCache instance class starting with 'cache.'."
}
}
variable "port" {
description = "TCP port the cache listens on."
type = number
default = 6379
}
# --- Topology (non-clustered) ---
variable "num_cache_clusters" {
description = "Number of nodes (1 primary + N-1 replicas) when cluster mode is OFF."
type = number
default = 2
validation {
condition = var.num_cache_clusters >= 1 && var.num_cache_clusters <= 6
error_message = "num_cache_clusters must be between 1 and 6."
}
}
# --- Topology (cluster mode / sharded) ---
variable "num_node_groups" {
description = "Number of shards. >1 enables cluster mode. Leave at 1 for a single-shard group."
type = number
default = 1
validation {
condition = var.num_node_groups >= 1
error_message = "num_node_groups must be >= 1."
}
}
variable "replicas_per_node_group" {
description = "Read replicas per shard when cluster mode is ON (num_node_groups > 1)."
type = number
default = 1
}
# --- Failover / HA ---
variable "automatic_failover_enabled" {
description = "Enable automatic failover. Must be true for Multi-AZ or cluster mode; requires at least one replica."
type = bool
default = true
}
variable "multi_az_enabled" {
description = "Place replicas in multiple AZs. Requires automatic_failover_enabled = true."
type = bool
default = true
}
# --- Encryption / auth ---
variable "at_rest_encryption_enabled" {
description = "Enable encryption at rest."
type = bool
default = true
}
variable "kms_key_id" {
description = "Optional KMS CMK ARN for at-rest encryption. Null = AWS-managed key."
type = string
default = null
}
variable "transit_encryption_enabled" {
description = "Enable TLS in transit. Required to use auth_token."
type = bool
default = true
}
variable "auth_token" {
description = "Redis AUTH token (16-128 printable chars). Requires transit encryption. Pass via a secret, never hardcode."
type = string
default = null
sensitive = true
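validation {
# Conditional form avoids calling length(null): older Terraform releases evaluate both operands of ||.
condition = var.auth_token == null ? true : (length(var.auth_token) >= 16 && length(var.auth_token) <= 128)
error_message = "auth_token must be 16-128 characters when set."
}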
}
# --- Networking ---
variable "allowed_cidr_blocks" {
description = "CIDR blocks permitted to reach the cache port."
type = list(string)
default = []
}
variable "allowed_security_group_ids" {
description = "Source security group IDs (app tier) permitted to reach the cache port."
type = list(string)
default = []
}
variable "extra_security_group_ids" {
description = "Additional pre-existing SG IDs to attach to the replication group."
type = list(string)
default = []
}
# --- Parameters ---
variable "maxmemory_policy" {
description = "Eviction policy when memory is full (e.g. allkeys-lru, volatile-lru, noeviction)."
type = string
default = "volatile-lru"
}
variable "parameters" {
description = "Extra engine parameters to set in the parameter group."
type = list(object({
name = string
value = string
}))
default = []
}
# --- Operations ---
variable "maintenance_window" {
description = "Weekly maintenance window (UTC), e.g. 'sun:05:00-sun:06:00'."
type = string
default = "sun:05:00-sun:06:00"
}
variable "snapshot_window" {
description = "Daily window for automatic snapshots (UTC), e.g. '03:00-04:00'."
type = string
default = "03:00-04:00"
}
variable "snapshot_retention_limit" {
description = "Days to retain automatic snapshots. 0 disables snapshots."
type = number
default = 7
}
variable "apply_immediately" {
description = "Apply modifications immediately instead of during the maintenance window."
type = bool
default = false
}
variable "auto_minor_version_upgrade" {
description = "Allow automatic minor engine version upgrades during maintenance."
type = bool
default = true
}
variable "log_delivery_configurations" {
description = "CloudWatch/Kinesis log delivery configs (slow-log, engine-log)."
type = list(object({
destination = string
destination_type = string
log_format = string
log_type = string
}))
default = []
}
variable "tags" {
description = "Additional tags merged onto every resource."
type = map(string)
default = {}
}
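The parameters input takes a list of name/value objects, with every value passed as a string. A minimal sketch of how it might look in a terraform.tfvars file or a module call follows; both parameters are illustrative choices and neither is set by the module itself.

```hcl
# Illustrative values only. maxmemory-policy is already managed through
# var.maxmemory_policy, so it should not be repeated here.
parameters = [
  {
    name  = "notify-keyspace-events"
    value = "Ex" # publish key-expiry events
  },
  {
    name  = "timeout"
    value = "300" # close idle client connections after 300 seconds
  },
]
```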
outputs.tf
output "replication_group_id" {
description = "The ElastiCache replication group ID."
value = aws_elasticache_replication_group.this.id
}
output "arn" {
description = "ARN of the replication group."
value = aws_elasticache_replication_group.this.arn
}
output "primary_endpoint_address" {
description = "Primary write endpoint (non-cluster mode)."
value = aws_elasticache_replication_group.this.primary_endpoint_address
}
output "reader_endpoint_address" {
description = "Reader endpoint that load-balances across replicas (non-cluster mode)."
value = aws_elasticache_replication_group.this.reader_endpoint_address
}
output "configuration_endpoint_address" {
description = "Configuration endpoint (cluster mode / sharded only)."
value = aws_elasticache_replication_group.this.configuration_endpoint_address
}
output "port" {
description = "Port the cache listens on."
value = aws_elasticache_replication_group.this.port
}
output "member_clusters" {
description = "Identifiers of all individual cache nodes in the group."
value = aws_elasticache_replication_group.this.member_clusters
}
output "security_group_id" {
description = "ID of the security group created for the cache."
value = aws_security_group.this.id
}
output "subnet_group_name" {
description = "Name of the created subnet group."
value = aws_elasticache_subnet_group.this.name
}
output "parameter_group_name" {
description = "Name of the created parameter group."
value = aws_elasticache_parameter_group.this.name
}
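Because the endpoint outputs are mode-specific, a caller that supports both topologies has to pick one. A small sketch of one way to do that in the calling configuration is shown here; the local name cache_endpoint is hypothetical, and the approach assumes the endpoint that does not apply to the current mode comes back as null or an empty string.

```hcl
# Sketch for a calling configuration, not part of the module.
locals {
  # coalesce() returns the first argument that is neither null nor an empty string,
  # so the configuration endpoint wins in cluster mode and the primary endpoint otherwise.
  cache_endpoint = coalesce(
    module.elasticache.configuration_endpoint_address,
    module.elasticache.primary_endpoint_address,
  )
}
```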
How to use it
# Generate and store the AUTH token in Secrets Manager; never hardcode it.
resource "random_password" "redis_auth" {
length = 32
special = false # Redis AUTH tokens disallow some symbols; alphanumeric is safe.
}
resource "aws_secretsmanager_secret" "redis_auth" {
name = "checkout/redis-auth-token"
}
resource "aws_secretsmanager_secret_version" "redis_auth" {
secret_id = aws_secretsmanager_secret.redis_auth.id
secret_string = random_password.redis_auth.result
}
module "elasticache" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"
name_prefix = "checkout-cache"
environment = "prod"
description = "Session + idempotency cache for the checkout service"
vpc_id = module.network.vpc_id
subnet_ids = module.network.private_subnet_ids
engine = "valkey"
engine_version = "8.0"
parameter_group_family = "valkey8"
node_type = "cache.r7g.large"
# Non-clustered: 1 primary + 2 replicas across 3 AZs.
num_cache_clusters = 3
automatic_failover_enabled = true
multi_az_enabled = true
# Lock down access to the app tier only.
allowed_security_group_ids = [module.checkout_service.app_security_group_id]
# Security.
at_rest_encryption_enabled = true
transit_encryption_enabled = true
auth_token = aws_secretsmanager_secret_version.redis_auth.secret_string
# Sensible eviction for a session cache.
maxmemory_policy = "volatile-lru"
snapshot_retention_limit = 7
tags = {
Team = "payments"
CostCenter = "cc-4471"
}
}
# Downstream: hand the endpoint + secret to the ECS task definition.
resource "aws_ssm_parameter" "redis_endpoint" {
name = "/checkout/redis/primary-endpoint"
type = "String"
value = "${module.elasticache.primary_endpoint_address}:${module.elasticache.port}"
}
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...s3 state bucket/container + key per path...
}
}
2. Module config — live/prod/elasticache/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"
}
inputs = {
name_prefix = "..."
environment = "..."
vpc_id = "..."
subnet_ids = ["...", "..."]
}
3. Deploy one environment, or roll out all modules together:
(cd live/prod/elasticache && terragrunt apply) # this module only
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / staging / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules; for a single stack, the plain Quickstart above is enough.
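To make the per-environment override concrete, here is a sketch of what a dev counterpart to the prod file above might look like. The path and the chosen values are assumptions; the "..." placeholders are left for you to fill in, as in the Quickstart.

```hcl
# live/dev/elasticache/terragrunt.hcl (hypothetical path, mirroring the prod layout)
include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-elasticache?ref=v1.0.0"
}

inputs = {
  name_prefix = "..."
  environment = "dev"
  vpc_id      = "..."
  subnet_ids  = ["...", "..."]

  # Single throwaway node: no replica exists, so failover and Multi-AZ must be off.
  node_type                  = "cache.t4g.micro"
  num_cache_clusters         = 1
  automatic_failover_enabled = false
  multi_az_enabled           = false
  snapshot_retention_limit   = 0 # no automatic snapshots for a disposable cache
}
```

Only the inputs differ from prod; the module source and version pin stay identical, which is the point of keeping one audited definition.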
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name_prefix | string | - | Yes | Short prefix for all resource names (service/app name). |
| environment | string | - | Yes | One of dev, staging, prod; used in naming and tags. |
| description | string | "Managed by Terraform" | No | Description for the replication group. |
| vpc_id | string | - | Yes | VPC in which the cache security group is created. |
| subnet_ids | list(string) | - | Yes | Private subnet IDs across >= 2 AZs for the subnet group. |
| engine | string | "redis" | No | Cache engine: redis or valkey. |
| engine_version | string | "7.1" | No | Engine version (e.g. 7.1, 8.0). |
| parameter_group_family | string | "redis7" | No | Parameter group family matching engine/version. |
| node_type | string | "cache.t4g.micro" | No | Instance class for each node. |
| port | number | 6379 | No | TCP port the cache listens on. |
| num_cache_clusters | number | 2 | No | Node count (primary + replicas) when cluster mode is off. |
| num_node_groups | number | 1 | No | Shard count; >1 enables cluster mode. |
| replicas_per_node_group | number | 1 | No | Replicas per shard when cluster mode is on. |
| automatic_failover_enabled | bool | true | No | Enable automatic failover (required for Multi-AZ and cluster mode; needs at least one replica). |
| multi_az_enabled | bool | true | No | Spread replicas across AZs (requires failover). |
| at_rest_encryption_enabled | bool | true | No | Enable encryption at rest. |
| kms_key_id | string | null | No | KMS CMK ARN for at-rest encryption; null = AWS-managed key. |
| transit_encryption_enabled | bool | true | No | Enable TLS in transit (required for auth_token). |
| auth_token | string | null | No | Redis AUTH token (16-128 chars); pass via a secret. |
| allowed_cidr_blocks | list(string) | [] | No | CIDRs permitted to reach the cache port. |
| allowed_security_group_ids | list(string) | [] | No | Source SG IDs (app tier) permitted to reach the cache. |
| extra_security_group_ids | list(string) | [] | No | Additional existing SGs to attach to the group. |
| maxmemory_policy | string | "volatile-lru" | No | Eviction policy when memory is full. |
| parameters | list(object) | [] | No | Extra engine parameters for the parameter group. |
| maintenance_window | string | "sun:05:00-sun:06:00" | No | Weekly maintenance window (UTC). |
| snapshot_window | string | "03:00-04:00" | No | Daily automatic snapshot window (UTC). |
| snapshot_retention_limit | number | 7 | No | Days to retain snapshots; 0 disables. |
| apply_immediately | bool | false | No | Apply changes immediately vs. during maintenance. |
| auto_minor_version_upgrade | bool | true | No | Allow automatic minor version upgrades. |
| log_delivery_configurations | list(object) | [] | No | Slow-log/engine-log delivery to CloudWatch/Kinesis. |
| tags | map(string) | {} | No | Additional tags merged onto every resource. |
Outputs
| Name | Description |
|---|---|
| replication_group_id | The ElastiCache replication group ID. |
| arn | ARN of the replication group. |
| primary_endpoint_address | Primary write endpoint (non-cluster mode). |
| reader_endpoint_address | Reader endpoint load-balancing across replicas (non-cluster mode). |
| configuration_endpoint_address | Configuration endpoint (cluster mode only). |
| port | Port the cache listens on. |
| member_clusters | Identifiers of all individual cache nodes in the group. |
| security_group_id | ID of the security group created for the cache. |
| subnet_group_name | Name of the created subnet group. |
| parameter_group_name | Name of the created parameter group. |
Enterprise scenario
A payments platform runs its checkout service on ECS Fargate and needs a low-latency store for user sessions and idempotency keys that must stay available through an AZ outage (replication is asynchronous, so a failover can still drop the most recent writes). The platform team consumes this module pinned at ref=v1.0.0 to stand up a Valkey 8 replication group with one primary and two replicas spread across three AZs (num_cache_clusters = 3, multi_az_enabled = true), TLS plus an auth_token sourced from Secrets Manager, and maxmemory-policy = volatile-lru so only TTL’d session keys are evicted under pressure. Because the module defaults automatic_failover_enabled to true and creates a security group scoped to the checkout app SG, every team that adopts it inherits a Multi-AZ, encrypted, access-controlled cache without re-deriving the dozen settings that make ElastiCache production-safe.
Best practices
- Always pair transit_encryption_enabled = true with an auth_token (or migrate to ElastiCache IAM/RBAC users). TLS without auth leaves the endpoint reachable by anything inside the VPC; store the token in Secrets Manager and rotate it with the ROTATE update strategy rather than recreating the group.
- Keep automatic_failover_enabled and multi_az_enabled on for any group with replicas and spread subnet_ids across at least two AZs. A single-node group has no failover target, so reserve num_cache_clusters = 1 for throwaway dev caches only.
- Pin maxmemory-policy deliberately per workload. Use volatile-lru / volatile-ttl for caches where keys carry TTLs, allkeys-lru for a pure cache, and noeviction only when the cache is a system of record you must not silently drop.
- Right-size and reserve for cost. Use Graviton-based cache.r7g / cache.t4g node types, scale reads with replicas_per_node_group rather than oversizing the primary, and buy ElastiCache Reserved Nodes for steady prod workloads to cut roughly a third off on-demand pricing.
- Choose topology by data size, not habit. Stay non-clustered (num_node_groups = 1) until a single node’s memory is the bottleneck, then enable cluster mode and use the configuration_endpoint_address output; application clients must be cluster-aware to follow shard slots.
- Name and tag consistently via name_prefix + environment, enable snapshots (snapshot_retention_limit >= 1) for any stateful use, and ship slow-log/engine-log to CloudWatch through log_delivery_configurations so latency regressions are observable before they page you.