Azure Cache for Redis Enterprise: Clustering, Active Geo-Replication, and Resilient Failover Patterns

Most “Redis is down” pages I have been dragged into were not Redis failing. They were a client library that opened a single connection to a single node, hardcoded a regional hostname, and treated MOVED as a fatal error instead of a routing hint. Azure Cache for Redis Enterprise gives you clustering, multi-region active-active replication, and durable persistence – but every one of those features changes the contract your client must honor. Cross-slot operations are no longer free. A node can move under you mid-request. Two regions can both accept a write to the same key, and you have to decide who wins. This guide wires up the Enterprise tier correctly and, just as importantly, builds the client-side behavior that survives the day the topology shifts.

Everything here targets the Enterprise and Enterprise Flash tiers (the ones built on Redis Enterprise software), with notes on where Premium diverges. The provider resource is Microsoft.Cache/redisEnterprise, which is a different ARM resource type from the Microsoft.Cache/redis you use for Basic/Standard/Premium. That distinction trips up Terraform and Bicep modules constantly.

1. Tier selection: Standard, Premium, and Enterprise/Enterprise Flash

Pick the tier from your durability and topology requirements, not from raw memory size. The tiers are not a linear ladder – Enterprise is a separate runtime.

| Capability | Standard | Premium | Enterprise | Enterprise Flash |
|---|---|---|---|---|
| Runtime | OSS Redis | OSS Redis | Redis Enterprise | Redis Enterprise |
| SLA | 99.9% | 99.9% (99.99% zone-redundant) | up to 99.999% | up to 99.999% |
| Clustering | no | OSS only | OSS or Enterprise policy | OSS or Enterprise policy |
| Active geo-replication | no | passive (geo-replica link) | active-active (CRDB) | active-active (CRDB) |
| Persistence | no | RDB + AOF | RDB + AOF | RDB + AOF |
| Redis modules (Search, JSON, etc.) | no | no | yes | yes |
| Storage medium | RAM | RAM | RAM | RAM + NVMe tier |

The deciding factors in practice:

# Enterprise tier uses a distinct resource: redisenterprise, with a child database
az redisenterprise create \
  --name kv-redis-prod \
  --resource-group rg-data-prod \
  --location eastus2 \
  --sku Enterprise_E10 \
  --capacity 2 \
  --zones 1 2 3

# The database (the actual Redis endpoint) is a child resource
az redisenterprise database create \
  --cluster-name kv-redis-prod \
  --resource-group rg-data-prod \
  --client-protocol Encrypted \
  --clustering-policy EnterpriseCluster \
  --eviction-policy NoEviction \
  --persistence aof-enabled=true aof-frequency=1s

The SKU name encodes both the engine (Enterprise_E10, EnterpriseFlash_F300) and a capacity unit. --capacity must be an even number for Enterprise SKUs because nodes are deployed in pairs for HA. Always pass --zones 1 2 3 at create time; you cannot add zone redundancy to an existing cluster in place.

2. Clustering policies (OSS vs Enterprise) and key distribution

This is the single most consequential decision and it is permanent for the database’s lifetime: you choose it at creation and it dictates how your client connects and how multi-key operations behave.

OSS clustering policy exposes the native Redis Cluster API. The client discovers all shards, computes the CRC16 hash slot for each key (16384 slots), and connects directly to the owning node. This gives the lowest latency and highest throughput because there is no proxy hop – but it requires a cluster-aware client, and the client sees every node’s address, which complicates private networking.

Enterprise clustering policy puts a proxy in front of the shards. The client connects to a single endpoint as if it were a standalone Redis. The proxy routes commands to the correct shard. This is far simpler for clients (any standard client works, no cluster mode) and for networking (one endpoint), at the cost of a proxy hop.

The behavior that surprises people is multi-key commands. In any clustered Redis, a command touching multiple keys requires all those keys to live in the same hash slot:

# This fails across slots -- the keys hash to different slots
MSET user:1001 alice user:1002 bob   # CROSSSLOT error under OSS policy

# Hash tags force keys into the same slot using the {...} substring
MSET user:{tenant42}:1001 alice user:{tenant42}:1002 bob   # both hash on "tenant42"

Only the substring inside the first {} is hashed. Design your keyspace with hash tags around the entity you co-access (a tenant, an order, a session) so transactions and MGET/MSET stay single-slot. Under the Enterprise policy, the proxy makes some cross-slot multi-key commands appear to work by fanning out, but MULTI/EXEC transactions and Lua scripts still require single-slot keys – so the hash-tag discipline is non-negotiable either way.
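To check a keyspace design before deploying it, the slot mapping can be reproduced offline. The following is a minimal sketch of the Redis Cluster key-to-slot rule (CRC16/XMODEM modulo 16384, with hash-tag extraction), using only the Python standard library. The helper names `hash_tag_part`, `key_slot`, and `same_slot` are illustrative and not part of any client library.

```python
import binascii

SLOTS = 16384  # fixed number of hash slots in Redis Cluster

def hash_tag_part(key: bytes) -> bytes:
    """Return the bytes that are actually hashed for this key."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" falls back to the whole key.
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key

def key_slot(key: str) -> int:
    # binascii.crc_hqx with initial value 0 is CRC16/XMODEM (poly 0x1021).
    return binascii.crc_hqx(hash_tag_part(key.encode()), 0) % SLOTS

def same_slot(*keys: str) -> bool:
    """True if a multi-key command over these keys would be single-slot."""
    return len({key_slot(k) for k in keys}) == 1

# Tagged keys hash only on "tenant42", so they always share a slot.
print(same_slot("user:{tenant42}:1001", "user:{tenant42}:1002"))  # prints True
print(key_slot("a"), key_slot("b"))  # prints 15495 3300 (different slots)
```

Running a check like `same_slot` over the key groups your transactions and Lua scripts touch is a cheap unit test that catches CROSSSLOT errors before they reach production.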

Choose OSS policy when you control the client and want maximum performance, and you are comfortable with cluster-aware libraries (StackExchange.Redis, Lettuce, redis-py with cluster mode, go-redis ClusterClient). Choose Enterprise policy when you need a single endpoint for private networking simplicity, or your client cannot do cluster mode. You cannot change it later without recreating the database.

3. Active geo-replication topologies and conflict handling

Enterprise active geo-replication builds an active-active database (an Active-Active CRDB – conflict-free replicated database). Every participating cluster accepts both reads and writes, and changes replicate to all peers. There is no primary. A region outage means you keep serving from the survivors with no failover step.

The mechanism that makes concurrent writes safe is CRDTs (conflict-free replicated data types). Redis Enterprise reimplements each data type as a CRDT so concurrent writes in different regions converge deterministically:
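To make "converge deterministically" concrete, here is a minimal sketch of a textbook grow-only counter CRDT. It is an illustration of the idea only: the class name `GCounter` and the region labels are assumptions, and this is not a description of how Redis Enterprise implements its counters internally.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""
    region: str
    counts: Dict[str, int] = field(default_factory=dict)

    def incr(self, amount: int = 1) -> None:
        # A region only ever advances its own slot.
        self.counts[self.region] = self.counts.get(self.region, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # max() is commutative, associative, and idempotent, so replicas
        # converge regardless of delivery order or duplicate deliveries.
        for region, n in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

east = GCounter("eastus2")
west = GCounter("westeurope")
east.incr()        # concurrent increment in region A
west.incr()        # concurrent increment in region B
east.merge(west)   # replication in one direction
west.merge(east)   # ...and the other
print(east.value(), west.value())  # prints "2 2": the sum, not a lost update
```

Because each region writes only its own slot, neither increment can overwrite the other, which is the property a plain replicated integer lacks.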

# Create an active geo-replication group spanning two regions.
# Each region is its own redisenterprise cluster + database; you link them
# via a shared group name and mutual linkedDatabase references.

az redisenterprise database create \
  --cluster-name kv-redis-eastus2 \
  --resource-group rg-data-eastus2 \
  --client-protocol Encrypted \
  --clustering-policy EnterpriseCluster \
  --group-nickname global-sessions \
  --linked-databases id="/subscriptions/<sub>/resourceGroups/rg-data-eastus2/providers/Microsoft.Cache/redisEnterprise/kv-redis-eastus2/databases/default" \
  --linked-databases id="/subscriptions/<sub>/resourceGroups/rg-data-westeurope/providers/Microsoft.Cache/redisEnterprise/kv-redis-westeurope/databases/default"

The --linked-databases list must include this database plus every peer, and the same group nickname must be used on every member. Designing the topology:

4. Data persistence with RDB/AOF and durability tradeoffs

Enterprise supports both persistence mechanisms, and they answer different questions. Persistence is about surviving a full cluster restart; it is orthogonal to replication, which is about surviving node loss.

RDB (snapshot) writes a point-in-time dump on an interval (e.g., every 1h/6h/12h). Cheap, low overhead, but you lose everything since the last snapshot on a hard failure.

AOF (append-only file) logs every write. With fsync every second (aof-frequency=1s), worst-case data loss is ~1 second. The cost is write amplification and larger files. This is the right default for anything you cannot regenerate.
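The gap between the two modes is easiest to see as arithmetic. The sketch below assumes a hypothetical sustained write rate of 2,000 writes per second; substitute your own figure. It computes an upper bound only, for the case where the process dies just before the next flush.

```python
def worst_case_lost_writes(writes_per_second: float, window_seconds: float) -> int:
    """Upper bound on writes lost if the process dies just before the next flush."""
    return int(writes_per_second * window_seconds)

WRITE_RATE = 2_000  # hypothetical sustained writes/second

aof_loss = worst_case_lost_writes(WRITE_RATE, 1)        # AOF, fsync every 1s
rdb_loss = worst_case_lost_writes(WRITE_RATE, 60 * 60)  # RDB, hourly snapshot

print(f"AOF 1s : up to {aof_loss:,} writes")   # up to 2,000 writes
print(f"RDB 1h : up to {rdb_loss:,} writes")   # up to 7,200,000 writes
```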

# AOF with per-second fsync -- the resilient default for stateful caches
az redisenterprise database update \
  --cluster-name kv-redis-prod \
  --resource-group rg-data-prod \
  --persistence aof-enabled=true aof-frequency=1s

# RDB hourly -- acceptable only for regenerable caches where restart speed matters
az redisenterprise database update \
  --cluster-name kv-redis-prod \
  --resource-group rg-data-prod \
  --persistence rdb-enabled=true rdb-frequency=1h

Two correctness notes. First, in an active-active geo group you generally rely on the peer regions for recovery and persistence is a secondary safety net – a surviving region rehydrates a recovered one. Second, persistence is not a backup. It protects against process restart, not against a bad FLUSHALL or a logic bug that corrupts data; that is what export/snapshot-to-storage is for. Enterprise persists to the cluster’s local managed disks, not to your storage account, so treat exports separately if you need point-in-time backups.

5. Private endpoint, VNet injection, and TLS hardening

Never expose a production cache to the public internet. The Enterprise tier supports Private Link, which projects the cache into your VNet via a private endpoint and a private IP – the public FQDN resolves to a private address through Private DNS.
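One way to confirm the Private DNS wiring is to resolve the cache hostname from a machine inside the VNet and check that every returned address is private. The following is a minimal sketch using only the standard library; the helper names are illustrative, and the hostname in the usage comment is the example name used throughout this guide.

```python
import ipaddress
import socket

def is_private_address(ip: str) -> bool:
    """True for private-range addresses (for example 10.0.0.0/8)."""
    return ipaddress.ip_address(ip).is_private

def resolves_privately(hostname: str, port: int = 10000) -> bool:
    """True only if every address the resolver returns is a private IP."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and all(is_private_address(a) for a in addresses)

# Usage, from a host inside the VNet:
# resolves_privately("kv-redis-prod.eastus2.redisenterprise.cache.azure.net")
```

If this returns False from inside the VNet, the usual suspects are a missing Private DNS zone link or a custom DNS server that does not forward to Azure DNS.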

resource cache 'Microsoft.Cache/redisEnterprise@2024-09-01-preview' = {
  name: 'kv-redis-prod'
  location: 'eastus2'
  sku: { name: 'Enterprise_E10', capacity: 2 }
  zones: ['1', '2', '3']
}

resource db 'Microsoft.Cache/redisEnterprise/databases@2024-09-01-preview' = {
  parent: cache
  name: 'default'
  properties: {
    clientProtocol: 'Encrypted'        // TLS-only; rejects plaintext
    clusteringPolicy: 'EnterpriseCluster'
    evictionPolicy: 'NoEviction'
    port: 10000
    persistence: { aofEnabled: true, aofFrequency: '1s' }
  }
}

resource pe 'Microsoft.Network/privateEndpoints@2024-05-01' = {
  name: 'pe-kv-redis-prod'
  location: 'eastus2'
  properties: {
    subnet: { id: dataSubnetId }
    privateLinkServiceConnections: [
      {
        name: 'redis'
        properties: {
          privateLinkServiceId: cache.id
          groupIds: ['redisEnterprise']
        }
      }
    ]
  }
}

Hardening checklist that actually matters:

6. Client resilience: connection multiplexing, retries, and reconnect

This is where most outages are actually caused or prevented. A correctly provisioned cluster behind a broken client is still an outage.

Multiplex one connection, do not pool-per-request. Redis clients like StackExchange.Redis are built around a single long-lived multiplexer that pipelines all commands over a few connections. Opening a connection per operation exhausts ports and ignores the library’s pipelining. Create the multiplexer once as a singleton:

using Azure.Identity;
using StackExchange.Redis;
// ConfigureForAzureWithTokenCredentialAsync ships in the Microsoft.Azure.StackExchangeRedis package.

// Singleton ConnectionMultiplexer -- created once, shared process-wide.
var config = new ConfigurationOptions
{
    EndPoints = { "kv-redis-prod.eastus2.redisenterprise.cache.azure.net:10000" },
    Ssl = true,
    AbortOnConnectFail = false,          // keep retrying instead of throwing at startup
    ConnectRetry = 5,                    // attempts during the initial connect
    ConnectTimeout = 15000,              // milliseconds
    KeepAlive = 30,                      // seconds between keep-alive pings
    ReconnectRetryPolicy = new ExponentialRetry(5000)  // backoff base in milliseconds
};
// Token auth (Entra ID) instead of an access key:
await config.ConfigureForAzureWithTokenCredentialAsync(new DefaultAzureCredential());

var muxer = await ConnectionMultiplexer.ConnectAsync(config);

The non-obvious settings that matter on Azure:

# redis-py against the Enterprise (proxy) policy -- a single endpoint, TLS, retry on timeout
from redis import Redis
from redis.retry import Retry
from redis.backoff import ExponentialBackoff
from redis.exceptions import ConnectionError, TimeoutError

r = Redis(
    host="kv-redis-prod.eastus2.redisenterprise.cache.azure.net",
    port=10000, ssl=True,
    socket_timeout=5, socket_connect_timeout=5,
    retry=Retry(ExponentialBackoff(cap=2, base=0.1), retries=3),
    retry_on_error=[ConnectionError, TimeoutError],
    health_check_interval=30,
)

health_check_interval sends a periodic PING so idle connections that were silently dropped (by a node move or an Azure load-balancer idle timeout) are detected and rebuilt before a real request hits the dead socket. Without it, the first request after an idle period eats the failure.
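The retry behavior configured in the redis-py example can be pictured as capped exponential backoff around each call. The following is a generic sketch of that pattern with added jitter, not redis-py's internal implementation. The names `backoff_delay` and `call_with_retry` are illustrative, and it catches Python's built-in `ConnectionError` and `TimeoutError` rather than the `redis.exceptions` classes so that it runs without any dependency.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 2.0) -> float:
    """Capped exponential delay for the given 0-based retry attempt."""
    return min(cap, base * (2 ** attempt))

def call_with_retry(
    op: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Run op(), retrying transient errors with full-jitter backoff."""
    for attempt in range(retries + 1):
        try:
            return op()
        except retry_on:
            if attempt == retries:
                raise  # retry budget exhausted: surface the error to the caller
            # Full jitter spreads out reconnect storms after a node move.
            sleep(rng() * backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The `sleep` and `rng` parameters exist so the policy can be unit-tested without real delays. Only retry operations that are safe to repeat; a non-idempotent write that timed out may already have been applied.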

7. Scaling, reshard operations, and zero-downtime maintenance

Enterprise scales two ways: scale up (a bigger SKU – E10 to E20) and scale out (more capacity units, which add shards and rebalance slots). Both are online operations, but “online” assumes a resilient client (section 6).

# Scale up the SKU (more memory/throughput per node)
az redisenterprise update --name kv-redis-prod --resource-group rg-data-prod \
  --sku Enterprise_E20

# Scale out capacity (adds nodes/shards; triggers a reshard/rebalance)
az redisenterprise update --name kv-redis-prod --resource-group rg-data-prod \
  --capacity 4

What happens during a reshard, and how to survive it:
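Under the OSS clustering policy, one part of surviving a reshard is treating MOVED and ASK replies as routing hints rather than failures. Cluster-aware clients already do this for you; the sketch below only illustrates the logic, using the reply format from the Redis Cluster specification (`MOVED <slot> <host>:<port>`). The names `Redirect`, `parse_redirect`, and `next_target`, and the addresses in the usage note, are hypothetical.

```python
from typing import Dict, NamedTuple, Optional, Tuple

class Redirect(NamedTuple):
    kind: str   # "MOVED" (permanent) or "ASK" (one-shot, mid-migration)
    slot: int
    host: str
    port: int

def parse_redirect(error_message: str) -> Optional[Redirect]:
    """Parse 'MOVED <slot> <host>:<port>' or 'ASK <slot> <host>:<port>'."""
    parts = error_message.split()
    if len(parts) != 3 or parts[0] not in ("MOVED", "ASK"):
        return None  # a genuine error, not a routing hint
    host, _, port = parts[2].rpartition(":")
    return Redirect(parts[0], int(parts[1]), host, int(port))

def next_target(
    slot_map: Dict[int, Tuple[str, int]], error_message: str
) -> Optional[Tuple[str, int]]:
    """Return where to resend the command; update the map only on MOVED."""
    redirect = parse_redirect(error_message)
    if redirect is None:
        return None
    target = (redirect.host, redirect.port)
    if redirect.kind == "MOVED":
        slot_map[redirect.slot] = target  # slot ownership changed for good
    # ASK: resend once (after sending ASKING) but keep the old owner in the map.
    return target
```

For example, with `slot_map = {3999: ("10.0.0.4", 8500)}`, calling `next_target(slot_map, "MOVED 3999 10.0.0.7:8501")` returns `("10.0.0.7", 8501)` and updates the map, while an ASK reply for the same slot returns the new target without touching it.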

8. Monitoring memory pressure, evictions, and latency percentiles

Redis fails loudly on CPU and silently on memory. Watch both, and alert on the leading indicators rather than the outage.

The metrics that predict incidents (all available in Azure Monitor for the Enterprise resource):

// Memory pressure trend + eviction correlation over the last 24h
AzureMetrics
| where TimeGenerated > ago(24h)
| where ResourceProvider == "MICROSOFT.CACHE"
| where ResourceId contains "kv-redis-prod"
// in~ matches case-insensitively, so metric-name casing differences do not drop rows
| where MetricName in~ ("usedmemorypercentage", "evictedkeys", "serverLoad")
| summarize avg(Average), max(Maximum) by MetricName, bin(TimeGenerated, 5m)
| order by TimeGenerated desc

For latency, do not trust server-side averages – measure client-side percentiles, because an average of 1ms hides a p99 of 200ms caused by a single hot shard or a GC pause in your own process. Track p50/p99 per operation from the application, and correlate p99 spikes against serverLoad and reshard events. A latency cliff that lines up with a scaling operation is your retry policy working; one that does not is a hot key or a cross-slot fan-out.
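A minimal sketch of why the average misleads, using a nearest-rank percentile over client-side samples. The latency values are hypothetical, and in a real service you would feed this from your own per-operation timers or use your metrics library's histogram.

```python
import math
from statistics import mean
from typing import Sequence

def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical window: 980 fast calls and 20 stalled on a hot shard (milliseconds).
latencies_ms = [1.0] * 980 + [200.0] * 20

print(f"mean={mean(latencies_ms):.2f}ms "
      f"p50={percentile(latencies_ms, 50)}ms "
      f"p99={percentile(latencies_ms, 99)}ms")
# prints: mean=4.98ms p50=1.0ms p99=200.0ms
```

The mean barely moves while one request in fifty takes 200 ms, which is why the alert belongs on p99 rather than on the average.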

Verify

Prove the cluster behaves before you depend on it. Run these against the deployed Enterprise database from inside the VNet:

# 1. TLS-only on port 10000; plaintext must be refused
redis-cli -h kv-redis-prod.eastus2.redisenterprise.cache.azure.net -p 10000 --tls PING
# -> PONG   (and a non-TLS connect to 10000 should fail)

# 2. Cluster topology (OSS policy) -- confirm shards and slot coverage
#    CLUSTER SHARDS needs Redis 7.0+; on older versions use CLUSTER NODES or CLUSTER SLOTS
redis-cli -h <host> -p 10000 --tls CLUSTER SHARDS

# 3. Cross-slot discipline -- this SHOULD fail, proving keys are distributed
redis-cli -h <host> -p 10000 --tls MSET a 1 b 2          # CROSSSLOT (OSS policy)
redis-cli -h <host> -p 10000 --tls MSET k:{t1} 1 j:{t1} 2 # OK -- hash tag co-locates

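# 4. Persistence is on
redis-cli -h <host> -p 10000 --tls CONFIG GET appendonly  # -> appendonly yes
#    If CONFIG is restricted on the managed endpoint, read the setting from the control plane instead:
az redisenterprise database show --cluster-name kv-redis-prod --resource-group rg-data-prod --query persistence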

# 5. Active-active counter convergence: INCR in region A and region B,
#    then read from either -- the value is the SUM, not a lost update.
redis-cli -h <hostA> -p 10000 --tls INCR global:signups
redis-cli -h <hostB> -p 10000 --tls INCR global:signups
redis-cli -h <hostA> -p 10000 --tls GET  global:signups   # -> 2

Then run a load test through a scale-out (--capacity 4) and confirm application error count stays at zero while p99 shows only a transient bump. That single test validates sections 6, 7, and 8 at once – it is the closest thing to a real failover you can run on demand.

Enterprise scenario

A payments platform ran a global idempotency cache on Premium with a passive geo-replica: EU writes went to West Europe, a one-way replica fed East US for reads, and failover was a manual DNS swap. During a West Europe zone incident the replica was read-only, so for eleven minutes every in-flight payment in the US that needed to check “have I already processed this request id?” either blocked on the manual failover or fell back to the database and ran at a fraction of normal throughput. Worse, after failover, a handful of duplicate captures slipped through because the idempotency keys written in the US during the gap had not replicated back.

The fix was to move to Enterprise active-active geo-replication across West Europe and East US, with idempotency state modeled as CRDT sets keyed by request id. Both regions now accept writes; a request id recorded in either region converges to the other with no primary and no manual failover. Because adds to a CRDT set are commutative and observed-remove, two regions independently recording the same request id merge cleanly instead of conflicting. They kept AOF at 1s as a restart safety net and sized for NoEviction (idempotency keys carry a TTL via SET ... EX, never LRU eviction, so divergence is impossible).

# Idempotency check, region-local, on an active-active CRDB.
# SET NX EX is the primitive: succeeds only if the key is new, with a TTL.
# NX is evaluated against the local replica; the string value then converges by
# last-write-wins, so a key written in either region is visible in both once
# replication catches up. Two regions racing inside the replication lag can both see OK.
SET payment:idem:7f3c-9a21 processing NX EX 86400
# -> OK    (not yet seen by this region: proceed)
# -> nil   (already recorded here or replicated from a peer: this is a duplicate, reject)

The measurable result: regional zone failure became a non-event (no manual step, p99 unchanged), and the duplicate-capture class of bug was designed out rather than monitored for. The lesson was not “Enterprise is better” – it was that passive replication is a DR tool, not an availability tool, and a system that must never lose a write across regions has to be active-active and conflict-free by construction.
