Amazon ElastiCache High Availability and Global Design
Explore how to design highly available Amazon ElastiCache clusters using Multi-AZ automatic failover for node and Availability Zone resilience, and Global Datastore for cross-Region disaster recovery and regional read scaling. Understand failover mechanics, replica roles, endpoint behavior, and replication lag to develop resilient caching architectures that maintain performance during node failures or regional outages.
With the node-based cluster architecture, replication groups, shards, replicas, and endpoints already established, the next question follows naturally: once a replication group is running in production, how do you keep it available when a primary node crashes, an entire Availability Zone goes offline, or your application needs a presence in another AWS Region? This lesson answers that question by covering two mechanisms that solve different problems. Multi-AZ with automatic failover protects a cluster against node and AZ failures within a single Region, while Global Datastore extends a node-based cluster across Regions for low-latency regional reads and disaster recovery.
These two mechanisms are complementary layers, not interchangeable options. Conflating them is one of the most common architectural mistakes in ElastiCache design.
Before diving in, several authoritative terms will recur throughout this discussion. A
Multi-AZ and automatic failover
For production node-based Valkey and Redis OSS clusters, Multi-AZ with automatic failover is the AWS-recommended mechanism for surviving AZ-level failures. Enabling it requires at least one replica per shard, and that replica should be placed in a different Availability Zone from the shard's primary node so that a single AZ outage cannot take out both.
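The placement rule above can be checked before a cluster is ever created. The following is a minimal sketch, not an AWS-provided tool: `ShardPlacement`, `multi_az_problems`, and `multi_az_request_fields` are hypothetical names introduced for illustration. The dictionary keys are intended to mirror fields of the ElastiCache `CreateReplicationGroup` request, but the result is only a fragment of a full request and should be checked against the current API reference before use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ShardPlacement:
    """Hypothetical description of where one shard's nodes are placed."""
    shard_id: str
    primary_az: str
    replica_azs: List[str]


def multi_az_problems(shards: List[ShardPlacement]) -> List[str]:
    """Return a list of placement problems; an empty list means the layout passes."""
    problems = []
    for shard in shards:
        if not shard.replica_azs:
            # Without a replica there is nothing to promote during failover.
            problems.append(f"{shard.shard_id}: no replica to promote")
        elif all(az == shard.primary_az for az in shard.replica_azs):
            # Every replica shares the primary's AZ, so an AZ outage takes out all of them.
            problems.append(
                f"{shard.shard_id}: all replicas share {shard.primary_az} with the primary"
            )
    return problems


def multi_az_request_fields(shards: List[ShardPlacement]) -> Dict[str, object]:
    """Build the Multi-AZ related portion of a request (assumed field names)."""
    problems = multi_az_problems(shards)
    if problems:
        raise ValueError("; ".join(problems))
    return {
        "AutomaticFailoverEnabled": True,
        "MultiAZEnabled": True,
        "NodeGroupConfiguration": [
            {
                "PrimaryAvailabilityZone": shard.primary_az,
                "ReplicaAvailabilityZones": list(shard.replica_azs),
                "ReplicaCount": len(shard.replica_azs),
            }
            for shard in shards
        ],
    }


if __name__ == "__main__":
    layout = [
        ShardPlacement("0001", "us-east-1a", ["us-east-1b"]),
        ShardPlacement("0002", "us-east-1b", ["us-east-1c"]),
    ]
    print(multi_az_request_fields(layout))
```

Running the validation as a separate step keeps a misconfigured layout, such as a shard whose only replica sits beside its primary, from reaching production unnoticed.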
How failover executes
When a primary node fails, ElastiCache detects the failure through continuous health monitoring. The service then selects the replica with the least ...