Amazon ElastiCache High Availability and Global Design

Explore Amazon ElastiCache's high availability strategies including multi-AZ with automatic failover to protect against node and Availability Zone failures within a region. Understand how Global Datastore enables asynchronous cross-region replication for disaster recovery and low-latency regional reads. Learn the operational roles of replicas, endpoint behavior during failover, and how these layers complement each other in resilient cache architectures.

We'll cover the following...

Multi-AZ and automatic failover
- How failover executes
- Cluster mode and blast radius
Replica roles and endpoint behavior
- Endpoint mechanics during normal operation and failover
- Replica lag and data loss exposure
- Placement trade-offs per shard
Global Datastore for cross-region design
- Use cases and replication model
- Promotion and write path
Separating HA from DR in design
Conclusion

With the node-based cluster architecture, replication groups, shards, replicas, and endpoints already established, the next question follows naturally: once a working replication group is running in production, how do you keep it available when a primary node crashes, an entire Availability Zone goes offline, or your application needs a presence in another AWS Region? This lesson answers that question by covering two distinct mechanisms that solve fundamentally different problems. Multi-AZ with automatic failover protects a cluster against node and AZ failures within a single region, while Global Datastore extends a node-based cluster across regions for low-latency regional reads and disaster recovery.

These two mechanisms are complementary layers, not interchangeable options. Conflating them is one of the most common architectural mistakes in ElastiCache design.

Before diving in, several key terms will recur throughout this discussion. A replication group is a logical collection consisting of a primary node and its read replicas that share the same data; it forms the unit of replication and failover in ElastiCache for Valkey and Redis OSS. A replication group exposes a primary endpoint that abstracts the current primary node for writes and a reader endpoint that distributes read connections across replicas. Automatic failover is the managed promotion of a replica to primary when the existing primary becomes unavailable. Replica lag measures the delay between a write landing on the primary and appearing on a replica. Finally, cross-region asynchronous replication describes how Global Datastore copies data between regions with an inherent delay. By the end of this lesson, you will be able to design for AZ-level resilience and region-level disaster recovery as independent architectural concerns.
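To make the endpoint and failover terms concrete, here is a minimal toy model in Python. It is not the ElastiCache API: the `ReplicationGroup` class, the node names, the round-robin read distribution, and the choice to promote the first replica in the list are all simplifying assumptions made for illustration. The point is only that the primary endpoint keeps resolving to whichever node currently holds the primary role.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class ReplicationGroup:
    """Toy model of one primary plus read replicas (hypothetical, not an AWS class)."""
    primary: str
    replicas: list[str] = field(default_factory=list)

    def primary_endpoint(self) -> str:
        # The primary endpoint abstracts whichever node is primary right now.
        return self.primary

    def reader_targets(self, n: int) -> list[str]:
        # The reader endpoint spreads connections across replicas; round-robin
        # is used here only as a simple stand-in for that distribution.
        if not self.replicas:
            return []
        rr = cycle(self.replicas)
        return [next(rr) for _ in range(n)]

    def fail_over(self) -> None:
        # Automatic failover: promote a replica when the primary is unavailable.
        # Assumption: the first replica in the list is the one promoted.
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.primary = self.replicas.pop(0)


group = ReplicationGroup(primary="node-a", replicas=["node-b", "node-c"])
before = group.primary_endpoint()
group.fail_over()
after = group.primary_endpoint()
print(before, "->", after)  # node-a -> node-b
```

The application keeps calling `primary_endpoint()` before and after the failover; only the node behind it changes. The model deliberately omits replica lag and the replacement of the failed node.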

Multi-AZ and automatic failover

For production node-based Valkey and Redis OSS clusters, multi-AZ with automatic failover is the AWS-preferred mechanism for surviving AZ-level failures. Enabling it requires at least one replica per shard, and ...
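The replica requirement stated above can be sketched as a small validation step that runs before any cluster is created. The helper name `build_replication_group_params`, the group ID, and the node type are hypothetical. The dictionary keys are intended to mirror keyword arguments of boto3's ElastiCache `create_replication_group` call, but treat that mapping as an assumption to verify against the current AWS documentation rather than a definitive configuration.

```python
def build_replication_group_params(
    group_id: str,
    shards: int,
    replicas_per_shard: int,
    multi_az: bool = True,
) -> dict:
    """Return request parameters, rejecting Multi-AZ setups with no replicas."""
    if shards < 1:
        raise ValueError("a replication group needs at least one shard")
    if multi_az and replicas_per_shard < 1:
        # Without a replica there is nothing to promote during failover.
        raise ValueError(
            "Multi-AZ with automatic failover needs at least one replica per shard"
        )
    return {
        "ReplicationGroupId": group_id,
        "ReplicationGroupDescription": "HA example (illustrative)",
        "Engine": "valkey",
        "CacheNodeType": "cache.r7g.large",  # hypothetical choice of node type
        "NumNodeGroups": shards,
        "ReplicasPerNodeGroup": replicas_per_shard,
        "AutomaticFailoverEnabled": multi_az,
        "MultiAZEnabled": multi_az,
    }


params = build_replication_group_params("demo-cache", shards=2, replicas_per_shard=1)
print(params["NumNodeGroups"], params["ReplicasPerNodeGroup"], params["MultiAZEnabled"])
```

Validating locally keeps the failure close to the design decision instead of surfacing later as a rejected API request.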
