Amazon ElastiCache High Availability and Global Design

Explore how to design highly available Amazon ElastiCache clusters using Multi-AZ automatic failover for node and Availability Zone resilience, and Global Datastore for cross-Region disaster recovery and regional read scaling. Understand failover mechanics, replica roles, endpoint behavior, and replication lag to develop resilient caching architectures that maintain performance during node failures or regional outages.

With the node-based cluster architecture, replication groups, shards, replicas, and endpoints already established, the next question is a practical one: once a replication group is running in production, how do you keep it available when a primary node crashes, an entire Availability Zone goes offline, or your application needs a presence in another AWS Region? This lesson covers two distinct mechanisms that solve fundamentally different problems. Multi-AZ with automatic failover protects a cluster against node and AZ failures within a single Region, while Global Datastore extends a node-based cluster across Regions for low-latency regional reads and disaster recovery.

These two mechanisms are complementary layers, not interchangeable options. Conflating them is one of the most common architectural mistakes in ElastiCache design.

Before diving in, several key terms will recur throughout this discussion. A replication group is a logical collection consisting of a primary node and its read replicas that share the same data; it is the unit of replication and failover in ElastiCache for Valkey and Redis OSS. A replication group exposes a primary endpoint that abstracts the current primary node for writes and a reader endpoint that distributes read connections across replicas. Automatic failover is the managed promotion of a replica to primary when the existing primary becomes unavailable. Replica lag measures the delay between a write landing on the primary and appearing on a replica. Finally, cross-Region asynchronous replication describes how Global Datastore copies data between Regions with an inherent delay. By the end of this lesson, you will be able to design for AZ-level resilience and Region-level disaster recovery as independent architectural concerns.
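To make the endpoint roles concrete, here is a minimal Python sketch of how an application might choose between the primary endpoint and the reader endpoint. It is an illustration only: the hostnames, the `ReplicationGroupEndpoints` class, and the short list of read-only commands are assumptions introduced here, not part of ElastiCache or any client library.

```python
from dataclasses import dataclass

# Assumption: a small, illustrative set of read-only commands. A real client
# would rely on its library's own command classification.
READ_ONLY_COMMANDS = {"GET", "MGET", "EXISTS", "TTL", "HGET", "HGETALL"}


@dataclass(frozen=True)
class ReplicationGroupEndpoints:
    """Hypothetical holder for the two endpoints of a replication group."""

    primary: str  # abstracts whichever node is currently the primary
    reader: str   # distributes read connections across replicas

    def endpoint_for(self, command: str, allow_stale: bool = True) -> str:
        """Pick the endpoint for a command.

        Reads go to the reader endpoint only when the caller tolerates
        replica lag; writes and lag-sensitive reads go to the primary.
        """
        is_read = command.upper() in READ_ONLY_COMMANDS
        if is_read and allow_stale:
            return self.reader
        return self.primary


endpoints = ReplicationGroupEndpoints(
    primary="primary.example-cache.internal",  # hypothetical hostname
    reader="reader.example-cache.internal",    # hypothetical hostname
)

print(endpoints.endpoint_for("SET"))
print(endpoints.endpoint_for("GET"))
print(endpoints.endpoint_for("GET", allow_stale=False))
```

The `allow_stale` flag reflects replica lag: a read that must observe the most recent write is sent to the primary endpoint, because a replica may not have received that write yet.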

Multi-AZ and automatic failover

For production node-based Valkey and Redis OSS clusters, Multi-AZ with automatic failover is the AWS-preferred mechanism for surviving AZ-level failures. Enabling it requires at least one replica per shard, and that replica should reside in a different Availability Zone from the shard's primary node.
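The placement rule above can be checked before a cluster is ever created. The following Python sketch validates a planned shard layout and then assembles request parameters. The key names are intended to mirror the boto3 `create_replication_group` call and should be verified against the current API reference. The `ShardLayout` class, the helper functions, the node type, and the description string are hypothetical placeholders, and the sketch only builds a dictionary; it does not call AWS.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class ShardLayout:
    """Hypothetical description of where one shard's nodes are placed."""

    primary_az: str
    replica_azs: List[str]


def multi_az_problems(shards: List[ShardLayout]) -> List[str]:
    """Return human-readable violations of the Multi-AZ placement rule."""
    problems = []
    for index, shard in enumerate(shards):
        if not shard.replica_azs:
            problems.append(f"shard {index}: needs at least one replica")
        elif all(az == shard.primary_az for az in shard.replica_azs):
            problems.append(f"shard {index}: no replica outside {shard.primary_az}")
    return problems


def replication_group_request(group_id: str, shards: List[ShardLayout]) -> Dict[str, Any]:
    """Build request parameters, refusing layouts that break the rule."""
    problems = multi_az_problems(shards)
    if problems:
        raise ValueError("; ".join(problems))
    return {
        "ReplicationGroupId": group_id,
        "ReplicationGroupDescription": "Multi-AZ sketch",  # placeholder text
        "Engine": "valkey",
        "CacheNodeType": "cache.r7g.large",  # placeholder node type
        "AutomaticFailoverEnabled": True,
        "MultiAZEnabled": True,
        "NumNodeGroups": len(shards),
        "NodeGroupConfiguration": [
            {
                "PrimaryAvailabilityZone": shard.primary_az,
                "ReplicaAvailabilityZones": list(shard.replica_azs),
                "ReplicaCount": len(shard.replica_azs),
            }
            for shard in shards
        ],
    }


planned = [ShardLayout("us-east-1a", ["us-east-1b"])]
request = replication_group_request("demo-group", planned)
print(request["MultiAZEnabled"], request["NodeGroupConfiguration"])
```

Validating the layout up front catches a shard whose only replica shares an AZ with its primary, which would leave that shard with no failover target if the AZ went offline.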

How failover executes

When a primary node fails, ElastiCache detects the failure through continuous health monitoring. The service then selects the replica with the least ...