Most “Redis is down” pages I have been dragged into were not Redis failing. They were a client library that opened a single connection to a single node, hardcoded a regional hostname, and treated MOVED as a fatal error instead of a routing hint. Azure Cache for Redis Enterprise gives you clustering, multi-region active-active replication, and durable persistence – but every one of those features changes the contract your client must honor. Cross-slot operations are no longer free. A node can move under you mid-request. Two regions can both accept a write to the same key and you have to decide who wins. This guide wires up the Enterprise tier correctly and, just as importantly, builds the client-side behavior that survives the day the topology shifts.
Everything here targets the Enterprise and Enterprise Flash tiers (the ones built on Redis Enterprise software), with notes on where Premium diverges. The provider resource is Microsoft.Cache/redisEnterprise, which is a different ARM resource type from the Microsoft.Cache/redis you use for Basic/Standard/Premium. That distinction trips up Terraform and Bicep modules constantly.
1. Tier selection: Standard, Premium, and Enterprise/Enterprise Flash
Pick the tier from your durability and topology requirements, not from raw memory size. The tiers are not a linear ladder – Enterprise is a separate runtime.
| Capability | Standard | Premium | Enterprise | Enterprise Flash |
|---|---|---|---|---|
| Runtime | OSS Redis | OSS Redis | Redis Enterprise | Redis Enterprise |
| SLA | 99.9% | 99.9% (99.99% zone-redundant) | up to 99.999% | up to 99.999% |
| Clustering | no | OSS only | OSS or Enterprise policy | OSS or Enterprise policy |
| Active geo-replication | no | passive (geo-replica link) | active-active (CRDB) | active-active (CRDB) |
| Persistence | no | RDB + AOF | RDB + AOF | RDB + AOF |
| Redis modules (Search, JSON, etc.) | no | no | yes | yes |
| Storage medium | RAM | RAM | RAM | RAM + NVMe tier |
The deciding factors in practice:
- Enterprise Flash keeps all keys and the hot values in RAM and tiers colder values to local NVMe. It is dramatically cheaper per GB for large datasets with skewed access (session stores, large caches). It is the wrong choice for uniformly hot workloads – the flash hop adds latency you will see at p99.
- Active geo-replication (true multi-write, conflict-free) is Enterprise-only. Premium offers a passive geo-replica: a one-directional link where the secondary is read-only and you fail over manually. If you need both regions to accept writes, you need Enterprise.
- Redis modules (RediSearch, RedisJSON, RedisTimeSeries, RedisBloom) only exist on Enterprise. If your “cache” is actually doing secondary-index queries, this is the line.
# Enterprise tier uses a distinct resource: redisenterprise, with a child database.
# --no-database skips the auto-created default database so it can be defined explicitly below.
az redisenterprise create \
--name kv-redis-prod \
--resource-group rg-data-prod \
--location eastus2 \
--sku Enterprise_E10 \
--capacity 2 \
--zones 1 2 3 \
--no-database
# The database (the actual Redis endpoint) is a child resource
az redisenterprise database create \
--cluster-name kv-redis-prod \
--resource-group rg-data-prod \
--client-protocol Encrypted \
--clustering-policy EnterpriseCluster \
--eviction-policy NoEviction \
--persistence aof-enabled=true aof-frequency=1s
The SKU name encodes both the engine (Enterprise_E10, EnterpriseFlash_F300) and a capacity unit. --capacity must be an even number for Enterprise SKUs because nodes are deployed in pairs for HA. Always pass --zones 1 2 3 at create time; you cannot add zone redundancy to an existing cluster in place.
2. Clustering policies (OSS vs Enterprise) and key distribution
This is the single most consequential decision and it is permanent for the database’s lifetime: you choose it at creation and it dictates how your client connects and how multi-key operations behave.
OSS clustering policy exposes the native Redis Cluster API. The client discovers all shards, computes the CRC16 hash slot for each key (16384 slots), and connects directly to the owning node. This gives the lowest latency and highest throughput because there is no proxy hop – but it requires a cluster-aware client, and the client sees every node’s address, which complicates private networking.
Enterprise clustering policy puts a proxy in front of the shards. The client connects to a single endpoint as if it were a standalone Redis. The proxy routes commands to the correct shard. This is far simpler for clients (any standard client works, no cluster mode) and for networking (one endpoint), at the cost of a proxy hop.
The behavior that surprises people is multi-key commands. In any clustered Redis, a command touching multiple keys requires all those keys to live in the same hash slot:
# This fails across slots -- the keys hash to different slots
MSET user:1001 alice user:1002 bob # CROSSSLOT error under OSS policy
# Hash tags force keys into the same slot using the {...} substring
MSET user:{tenant42}:1001 alice user:{tenant42}:1002 bob # both hash on "tenant42"
Only the substring inside the first {} is hashed. Design your keyspace with hash tags around the entity you co-access (a tenant, an order, a session) so transactions and MGET/MSET stay single-slot. Under the Enterprise policy, the proxy makes some cross-slot multi-key commands appear to work by fanning out, but MULTI/EXEC transactions and Lua scripts still require single-slot keys – so the hash-tag discipline is non-negotiable either way.
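To make the hash-tag rule concrete, here is a minimal sketch of the slot calculation a cluster-aware client performs. It assumes the standard Redis Cluster scheme (CRC16, XMODEM variant, modulo 16384) and uses Python's `binascii.crc_hqx` for the checksum; the function names `hash_tag` and `hash_slot` are illustrative, not part of any client library.

```python
import binascii

SLOT_COUNT = 16384  # fixed number of hash slots in Redis Cluster


def hash_tag(key: str) -> str:
    """Return the part of the key that gets hashed, honoring the {...} rule."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        # The tag only counts if a closing brace exists and the tag is non-empty.
        if end != -1 and end > start + 1:
            return key[start + 1:end]
    return key


def hash_slot(key: str) -> int:
    """CRC16 (XMODEM) of the hashed part of the key, modulo 16384."""
    data = hash_tag(key).encode("utf-8")
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


if __name__ == "__main__":
    keys = ("user:1001", "user:1002",
            "user:{tenant42}:1001", "user:{tenant42}:1002")
    for k in keys:
        print(f"{k:24} -> slot {hash_slot(k)}")
```

Running this against your own key names before you ship is a cheap way to confirm that keys you plan to touch in one MSET, transaction, or script really share a slot.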
Choose OSS policy when you control the client and want maximum performance, and you are comfortable with cluster-aware libraries (StackExchange.Redis, Lettuce, redis-py with cluster mode, go-redis ClusterClient). Choose Enterprise policy when you need a single endpoint for private networking simplicity, or your client cannot do cluster mode. You cannot change it later without recreating the database.
3. Active geo-replication topologies and conflict handling
Enterprise active geo-replication builds an active-active database (an Active-Active CRDB – conflict-free replicated database). Every participating cluster accepts both reads and writes, and changes replicate to all peers. There is no primary. A region outage means you keep serving from the survivors with no failover step.
The mechanism that makes concurrent writes safe is CRDTs (conflict-free replicated data types). Redis Enterprise reimplements each data type as a CRDT so concurrent writes in different regions converge deterministically:
- Strings / SET: last-write-wins by timestamp. Concurrent writes resolve to the latest wall-clock write; you can lose one of two concurrent writes to the same key.
- Counters (INCR/DECRBY): additive. Concurrent increments in two regions both apply – this is the killer feature, no lost updates.
- Sets, Hashes, Sorted Sets: element-level merge (observed-remove semantics). Adds and removes converge per element rather than per key.
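The difference between the additive counter and the last-write-wins string is easier to see in a toy model. The sketch below is a conceptual illustration only, not how Redis Enterprise implements its CRDTs: `GCounter` and `LwwRegister` are hypothetical classes, the counter supports increments only, and breaking exact timestamp ties by region name is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one running total per region, value is the sum."""
    totals: dict = field(default_factory=dict)

    def incr(self, region: str, amount: int = 1) -> None:
        self.totals[region] = self.totals.get(region, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Each region only ever grows its own total, so per-region max is safe.
        for region, total in other.totals.items():
            self.totals[region] = max(self.totals.get(region, 0), total)

    def value(self) -> int:
        return sum(self.totals.values())


@dataclass
class LwwRegister:
    """Last-write-wins string: the write with the highest stamp survives."""
    value: str = ""
    stamp: tuple = (0.0, "")  # (timestamp, region); region breaks ties (assumed)

    def set(self, value: str, timestamp: float, region: str) -> None:
        if (timestamp, region) > self.stamp:
            self.value, self.stamp = value, (timestamp, region)

    def merge(self, other: "LwwRegister") -> None:
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp


if __name__ == "__main__":
    east, west = GCounter(), GCounter()
    east.incr("east")
    west.incr("west", 2)
    east.merge(west)
    west.merge(east)
    print("counter:", east.value(), west.value())  # both regions agree on the sum

    a, b = LwwRegister(), LwwRegister()
    a.set("alice", 100.0, "east")
    b.set("bob", 100.5, "west")
    a.merge(b)
    b.merge(a)
    print("string:", a.value, b.value)  # the earlier write is discarded
```

Both structures converge, but only the counter preserves every concurrent update. That is the practical reason to model contended state as a counter, set, or hash rather than a plain string.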
# Create an active geo-replication group spanning two regions.
# Each region is its own redisenterprise cluster + database; you link them
# via a shared group name and mutual linkedDatabase references.
az redisenterprise database create \
--cluster-name kv-redis-eastus2 \
--resource-group rg-data-eastus2 \
--client-protocol Encrypted \
--clustering-policy EnterpriseCluster \
--group-nickname global-sessions \
--linked-databases id="/subscriptions/<sub>/resourceGroups/rg-data-eastus2/providers/Microsoft.Cache/redisEnterprise/kv-redis-eastus2/databases/default" \
--linked-databases id="/subscriptions/<sub>/resourceGroups/rg-data-westeurope/providers/Microsoft.Cache/redisEnterprise/kv-redis-westeurope/databases/default"
The --linked-databases list must include this database plus every peer, and the same group nickname must be used on every member. Designing the topology:
- Keep the geo group to regions you can tolerate replicating all writes to – replication is full mesh, so N regions means each write fans out to N-1 peers. Bandwidth and cross-region latency cost scale with the mesh.
- Active-active forces NoEviction semantics conceptually: do not run an active-active cache as an LRU eviction cache, because evictions are local and create divergence. Use it for data you intend to keep (sessions, counters, feature flags), and size for the full working set.
- Conflict resolution is per data type and automatic. You do not get to plug in custom logic the way Cosmos DB lets you. If LWW on strings is unacceptable for a key, model it as a counter, set, or hash instead.
4. Data persistence with RDB/AOF and durability tradeoffs
Enterprise supports both persistence mechanisms, and they answer different questions. Persistence is about surviving a full cluster restart; it is orthogonal to replication, which is about surviving node loss.
RDB (snapshot) writes a point-in-time dump on an interval (e.g., every 1h/6h/12h). Cheap, low overhead, but you lose everything since the last snapshot on a hard failure.
AOF (append-only file) logs every write. With fsync every second (aof-frequency=1s), worst-case data loss is ~1 second. The cost is write amplification and larger files. This is the right default for anything you cannot regenerate.
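The tradeoff is easier to reason about as a number. The sketch below is back-of-the-envelope arithmetic rather than a measurement: it multiplies a write rate by the persistence window to bound how many writes a hard failure could cost. The function names and the 500 writes/s figure are illustrative assumptions.

```python
# Persistence windows in seconds for the modes discussed above.
WINDOWS = {
    "aof-1s": 1,
    "rdb-1h": 3600,
    "rdb-6h": 6 * 3600,
    "rdb-12h": 12 * 3600,
}


def worst_case_lost_writes(writes_per_second: float, window_seconds: float) -> float:
    """Upper bound on writes lost if the cluster dies just before the next flush."""
    return writes_per_second * window_seconds


def exposure_table(writes_per_second: float) -> dict:
    """Worst-case lost writes per persistence mode at a given write rate."""
    return {mode: worst_case_lost_writes(writes_per_second, window)
            for mode, window in WINDOWS.items()}


if __name__ == "__main__":
    for mode, lost in exposure_table(500).items():
        print(f"{mode:8} up to {lost:,.0f} writes at risk")
```

At a steady 500 writes/s, per-second AOF bounds the exposure at roughly 500 writes, while an hourly RDB snapshot leaves up to 1.8 million writes at risk.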
# AOF with per-second fsync -- the resilient default for stateful caches
az redisenterprise database update \
--cluster-name kv-redis-prod \
--resource-group rg-data-prod \
--persistence aof-enabled=true aof-frequency=1s
# RDB hourly -- acceptable only for regenerable caches where restart speed matters
az redisenterprise database update \
--cluster-name kv-redis-prod \
--resource-group rg-data-prod \
--persistence rdb-enabled=true rdb-frequency=1h
Two correctness notes. First, in an active-active geo group you generally rely on the peer regions for recovery and persistence is a secondary safety net – a surviving region rehydrates a recovered one. Second, persistence is not a backup. It protects against process restart, not against a bad FLUSHALL or a logic bug that corrupts data; that is what export/snapshot-to-storage is for. Enterprise persists to the cluster’s local managed disks, not to your storage account, so treat exports separately if you need point-in-time backups.
5. Private endpoint, VNet injection, and TLS hardening
Never expose a production cache to the public internet. The Enterprise tier supports Private Link, which projects the cache into your VNet via a private endpoint and a private IP – the public FQDN resolves to a private address through Private DNS.
resource cache 'Microsoft.Cache/redisEnterprise@2024-09-01-preview' = {
name: 'kv-redis-prod'
location: 'eastus2'
sku: { name: 'Enterprise_E10', capacity: 2 }
zones: ['1', '2', '3']
}
resource db 'Microsoft.Cache/redisEnterprise/databases@2024-09-01-preview' = {
parent: cache
name: 'default'
properties: {
clientProtocol: 'Encrypted' // TLS-only; rejects plaintext
clusteringPolicy: 'EnterpriseCluster'
evictionPolicy: 'NoEviction'
port: 10000
persistence: { aofEnabled: true, aofFrequency: '1s' }
}
}
resource pe 'Microsoft.Network/privateEndpoints@2024-05-01' = {
name: 'pe-kv-redis-prod'
location: 'eastus2'
properties: {
subnet: { id: dataSubnetId }
privateLinkServiceConnections: [
{
name: 'redis'
properties: {
privateLinkServiceId: cache.id
groupIds: ['redisEnterprise']
}
}
]
}
}
Hardening checklist that actually matters:
- clientProtocol: 'Encrypted' forces TLS. The Enterprise tier listens on port 10000 (not 6380 like Premium) – a frequent connection-string bug when migrating. Set your client’s TLS port accordingly.
- Wire a Private DNS zone (privatelink.redisenterprise.cache.azure.net) linked to the VNet so the public FQDN resolves privately. Without the zone link, in-VNet clients still resolve the public IP and the private endpoint does nothing for them.
- Prefer Microsoft Entra ID (token) authentication over the access key where your client supports it; it removes the long-lived shared secret. The access key still exists as a fallback – rotate it and store it in Key Vault, never in app config.
6. Client resilience: connection multiplexing, retries, and reconnect
This is where most outages are actually caused or prevented. A correctly provisioned cluster behind a broken client is still an outage.
Multiplex one connection, do not pool-per-request. Redis clients like StackExchange.Redis are built around a single long-lived multiplexer that pipelines all commands over a few connections. Opening a connection per operation exhausts ports and ignores the library’s pipelining. Create the multiplexer once as a singleton:
// Singleton ConnectionMultiplexer -- created once, shared process-wide.
var config = new ConfigurationOptions
{
EndPoints = { "kv-redis-prod.eastus2.redisenterprise.cache.azure.net:10000" },
Ssl = true,
AbortOnConnectFail = false, // keep retrying instead of throwing at startup
ConnectRetry = 5,
ConnectTimeout = 15000,
KeepAlive = 30,
ReconnectRetryPolicy = new ExponentialRetry(5000)
};
// Token auth (Entra ID) instead of an access key:
await config.ConfigureForAzureWithTokenCredentialAsync(new DefaultAzureCredential());
var muxer = await ConnectionMultiplexer.ConnectAsync(config);
The non-obvious settings that matter on Azure:
- AbortOnConnectFail = false is mandatory. With true, the initial connect throws if the first attempt fails (e.g., during a maintenance window) and you never get a multiplexer that could recover. With false, the multiplexer is returned and reconnects in the background. StackExchange.Redis documents the default as true except for endpoints it recognizes as Azure, so set it explicitly rather than relying on detection.
- During scaling and patching, Azure issues a brief connection blip per node. Your code must retry the operation, not just rely on the multiplexer reconnecting. Wrap commands in a bounded retry (Polly) that handles RedisConnectionException and RedisTimeoutException with jittered backoff.
- Under OSS clustering policy, the client must follow MOVED/ASK redirects automatically – every mainstream cluster client does, but only if you enabled cluster mode. A MOVED reaching your application code means the client is misconfigured.
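The bounded, jittered retry described above is library-agnostic. Here is a minimal Python sketch of the idea, assuming a "full jitter" strategy (a random delay between zero and an exponentially growing cap). The helper `retry_with_jitter` and its parameters are hypothetical; the default exception types are Python built-ins, so pass your client's own connection and timeout exception classes via `retry_on`.

```python
import random
import time


def retry_with_jitter(op, attempts=4, base=0.1, cap=2.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Run op(); on a transient error, wait a random capped delay and retry."""
    for attempt in range(attempts):
        try:
            return op()
        except retry_on:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the error to the caller
            # Full jitter: uniform delay in [0, min(cap, base * 2**attempt)].
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_ping():
        # Simulates two connection blips followed by a healthy reply.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("connection reset during node patching")
        return "PONG"

    print(retry_with_jitter(flaky_ping), "after", state["calls"], "calls")
```

The bound matters as much as the backoff: an unbounded retry turns a short blip into a thread pile-up, while a small attempt budget with jitter spreads reconnecting clients out instead of stampeding the node that just came back.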
# redis-py against the Enterprise (proxy) policy -- a single endpoint, TLS, retry on timeout
from redis import Redis
from redis.retry import Retry
from redis.backoff import ExponentialBackoff
from redis.exceptions import ConnectionError, TimeoutError
r = Redis(
host="kv-redis-prod.eastus2.redisenterprise.cache.azure.net",
port=10000, ssl=True,
socket_timeout=5, socket_connect_timeout=5,
retry=Retry(ExponentialBackoff(cap=2, base=0.1), retries=3),
retry_on_error=[ConnectionError, TimeoutError],
health_check_interval=30,
)
health_check_interval makes redis-py send a PING before reusing a connection that has been idle longer than the interval, so connections that were silently dropped (by a node move or an Azure load-balancer idle timeout) are detected and rebuilt before a real request hits the dead socket. Without it, the first request after an idle period eats the failure.
7. Scaling, reshard operations, and zero-downtime maintenance
Enterprise scales two ways: scale up (a bigger SKU – E10 to E20) and scale out (more capacity units, which add shards and rebalance slots). Both are online operations, but “online” assumes a resilient client (section 6).
# Scale up the SKU (more memory/throughput per node)
az redisenterprise update --name kv-redis-prod --resource-group rg-data-prod \
--sku Enterprise_E20
# Scale out capacity (adds nodes/shards; triggers a reshard/rebalance)
az redisenterprise update --name kv-redis-prod --resource-group rg-data-prod \
--capacity 4
What happens during a reshard, and how to survive it:
- Hash slots migrate between shards. Under OSS policy, in-flight keys briefly answer ASK/MOVED and the client re-routes – transparent only if the client handles redirects. Under Enterprise policy, the proxy absorbs this and clients see at most brief latency.
- A small number of connections drop as nodes are added. This is exactly the blip your retry policy exists for. Validate by running a scale operation in a load test and confirming zero application errors, only a latency bump.
- Maintenance windows. Enterprise patches the OS and Redis software with rolling, one-node-at-a-time updates so the database stays available. Configure a maintenance window aligned to your low-traffic hours, and never assume “no failover during maintenance” – assume a connection reset per node and make the client idempotent. Caches are naturally idempotent for reads; for write paths, ensure a retried SET/INCR is safe (an INCR retried after a successful-but-unacknowledged write double-counts, so use idempotency keys or SET with a known value for critical counters).
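The double-counting hazard and the idempotency-key fix can be shown without a server. The sketch below is an in-memory stand-in, not Redis code: `IdempotentCounter` and `incr_once` are hypothetical names. Against a real database the check-and-increment would have to be atomic (for example a Lua script or MULTI/EXEC on keys that share a hash tag), and the recorded operation ids would need a TTL.

```python
class IdempotentCounter:
    """Counter whose increments carry an operation id, so retries are safe."""

    def __init__(self):
        self.value = 0
        self.applied = set()  # operation ids that have already been counted

    def incr_once(self, op_id: str, amount: int = 1) -> int:
        # Apply the increment only the first time this op_id is seen.
        if op_id not in self.applied:
            self.applied.add(op_id)
            self.value += amount
        return self.value


if __name__ == "__main__":
    signups = IdempotentCounter()
    signups.incr_once("req-7f3c")  # first attempt: applied, but the ack is lost
    signups.incr_once("req-7f3c")  # client retry: recognized, not re-applied
    signups.incr_once("req-9a21")  # a genuinely new operation
    print(signups.value)
```

The caller generates the operation id once per logical request and reuses it on every retry; a fresh id per attempt would defeat the purpose.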
8. Monitoring memory pressure, evictions, and latency percentiles
Redis fails loudly on CPU and silently on memory. Watch both, and alert on the leading indicators rather than the outage.
The metrics that predict incidents (all available in Azure Monitor for the Enterprise resource):
- Used Memory Percentage – the leading indicator. With NoEviction, writes return OOM errors once the memory limit is reached; with an eviction policy, you start losing keys. Above ~80% there is little headroom left for a traffic spike, so alert at 75%.
- Evicted Keys / Expired Keys – a rising eviction rate means the cache is undersized for its working set. On an active-active database, evictions are a correctness problem (divergence), not just a hit-rate problem.
- Server Load – the percentage of time the Redis main thread was busy. Sustained > 80% means you are CPU-bound; scale up or shard out, because a single slow KEYS or large MGET can stall everything.
- Connections Created Per Second – a high, sustained value is the fingerprint of a client opening connections per request (section 6). Healthy multiplexed clients create a handful and reuse them.
// Memory pressure trend + eviction correlation over the last 24h
AzureMetrics
| where ResourceProvider == "MICROSOFT.CACHE"
| where ResourceId contains "kv-redis-prod"
| where MetricName in ("usedmemorypercentage", "evictedkeys", "serverLoad")
| summarize avg(Average), max(Maximum) by MetricName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc
For latency, do not trust server-side averages – measure client-side percentiles, because an average of 1ms hides a p99 of 200ms caused by a single hot shard or a GC pause in your own process. Track p50/p99 per operation from the application, and correlate p99 spikes against serverLoad and reshard events. A latency cliff that lines up with a scaling operation is your retry policy working; one that does not is a hot key or a cross-slot fan-out.
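A small sketch shows why the average misleads. It uses the nearest-rank percentile definition (one of several in common use) on synthetic samples; the helper names and the sample distribution are assumptions for illustration, and in production you would normally let your metrics library keep the histogram.

```python
def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile: smallest sample with >= pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct% of n)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Mean plus the two percentiles worth alerting on."""
    return {
        "mean": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p99": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # 98 fast requests and 2 that hit a hot shard (synthetic data).
    samples = [1.0] * 98 + [200.0] * 2
    print(summarize(samples))
```

With 98 requests at 1 ms and two at 200 ms, the mean is under 5 ms and the median is 1 ms, yet p99 is 200 ms. Only the percentile exposes the tail that users actually feel.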
Verify
Prove the cluster behaves before you depend on it. Run these against the deployed Enterprise database from inside the VNet:
# 1. TLS-only on port 10000; plaintext must be refused
redis-cli -h kv-redis-prod.eastus2.redisenterprise.cache.azure.net -p 10000 --tls PING
# -> PONG (and a non-TLS connect to 10000 should fail)
# 2. Cluster topology (OSS policy) -- confirm shards and slot coverage
redis-cli -h <host> -p 10000 --tls CLUSTER SHARDS
# 3. Cross-slot discipline -- this SHOULD fail, proving keys are distributed
redis-cli -h <host> -p 10000 --tls MSET a 1 b 2 # CROSSSLOT (OSS policy)
redis-cli -h <host> -p 10000 --tls MSET k:{t1} 1 j:{t1} 2 # OK -- hash tag co-locates
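# 4. Persistence is on (CONFIG is blocked on Azure-managed Redis, so check the control plane)
az redisenterprise database show --cluster-name kv-redis-prod --resource-group rg-data-prod --query persistence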
# 5. Active-active counter convergence: INCR in region A and region B, give replication
# a moment to settle, then read from either -- the value is the SUM, not a lost update.
redis-cli -h <hostA> -p 10000 --tls INCR global:signups
redis-cli -h <hostB> -p 10000 --tls INCR global:signups
redis-cli -h <hostA> -p 10000 --tls GET global:signups # -> 2
Then run a load test through a scale-out (--capacity 4) and confirm application error count stays at zero while p99 shows only a transient bump. That single test validates sections 6, 7, and 8 at once – it is the closest thing to a real failover you can run on demand.
Enterprise scenario
A payments platform ran a global idempotency cache on Premium with a passive geo-replica: EU writes went to West Europe, a one-way replica fed East US for reads, and failover was a manual DNS swap. During a West Europe zone incident the replica was read-only, so for eleven minutes every in-flight payment in the US that needed to check “have I already processed this request id?” either blocked on the manual failover or fell back to the database and ran at a fraction of normal throughput. Worse, after failover, a handful of duplicate captures slipped through because the idempotency keys written in the US during the gap had not replicated back.
The fix was to move to Enterprise active-active geo-replication across West Europe and East US, with idempotency state modeled as CRDT sets keyed by request id. Both regions now accept writes; a request id recorded in either region converges to the other with no primary and no manual failover. Because adds to a CRDT set are commutative (observed-remove semantics), two regions independently recording the same request id merge cleanly instead of conflicting. They kept AOF at 1s as a restart safety net and sized for NoEviction (idempotency keys expire through a TTL set with SET ... EX, never through LRU eviction, so eviction cannot make the regions diverge).
# Idempotency check, region-local, on an active-active CRDB.
# SET NX EX is the primitive: succeeds only if the key is new, with a TTL.
# Caveat: NX is evaluated against the local replica. Two regions that see the same
# request id within the replication lag can both get OK, and string LWW then keeps
# the later write -- so keep an authoritative uniqueness check in the system of record.
SET payment:idem:7f3c-9a21 processing NX EX 86400
# -> OK (not yet seen in this region's replica: proceed)
# -> nil (already recorded locally or replicated from a peer: duplicate, reject)
The measurable result: regional zone failure became a non-event (no manual step, p99 unchanged), and the duplicate-capture class of bug was designed out rather than monitored for. The lesson was not “Enterprise is better” – it was that passive replication is a DR tool, not an availability tool, and a system that must never lose a write across regions has to be active-active and conflict-free by construction.