Durability, HA, and DR

Explore how Amazon MemoryDB achieves data durability with its transaction log, supports high availability through Multi-AZ replica promotion, and enables disaster recovery via snapshots and multi-Region replication. Understand key concepts like RPO, RTO, and how to design resilient in-memory databases for cloud applications.

The previous lesson focused on performance and scale, covering shard sizing, replica count, throughput tuning, and latency optimization. Those techniques ensure Amazon MemoryDB for Redis delivers sub-millisecond reads and high write throughput. But speed means nothing if data disappears after a failure. A cache that loses its contents during a node crash forces the application to rebuild state from a slower backend, and for workloads that treat the in-memory layer as the primary database, that loss is unacceptable. This lesson shifts the conversation from how fast MemoryDB operates to how reliably it preserves data and recovers from failures.

In-memory stores deployed as caches, such as ElastiCache for Redis in a typical configuration, treat data as ephemeral. If a node fails, any data that existed only in that node's memory is gone, and the application must repopulate the cache from its source of truth. MemoryDB breaks that assumption by functioning as a durable primary database that happens to serve data from memory. Understanding how it achieves this requires examining four resilience pillars that build on each other.

  • Durability model: The transaction log mechanism that persists every write before acknowledging it to the client.

  • High availability via Multi-AZ: Automatic replica promotion that keeps the cluster operational when a primary node fails.

  • Backup and restore through snapshots: Cluster-level point-in-time captures that protect against logical errors and support operational recovery.

  • Multi-Region awareness: Cross-Region replication patterns that extend disaster recovery beyond a single Region.

Along the way, you will encounter two key terms. RPO (Recovery Point Objective) is the maximum acceptable amount of data loss measured in time, that is, how far back in time a recovery point can be from the moment of failure. RTO (Recovery Time Objective) is the maximum acceptable duration of downtime before a service must be restored to operational status. By the end of this lesson, you will be able to distinguish AZ-level resilience from Region-level resilience and select the right mechanism for each failure scenario.
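To make the two terms concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the timestamps, the targets, and the helper names `data_loss_window` and `downtime` are hypothetical and do not come from any MemoryDB API.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Time between the last usable recovery point and the failure (compare against RPO)."""
    return failure_time - last_recovery_point


def downtime(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Time between the failure and the service being operational again (compare against RTO)."""
    return service_restored - failure_time


# Hypothetical incident timeline, chosen only for illustration.
failure = datetime(2024, 1, 1, 12, 0)
last_recovery_point = datetime(2024, 1, 1, 11, 15)
restored = datetime(2024, 1, 1, 12, 20)

# Hypothetical business targets.
rpo_target = timedelta(hours=1)
rto_target = timedelta(minutes=15)

loss = data_loss_window(last_recovery_point, failure)
outage = downtime(failure, restored)

meets_rpo = loss <= rpo_target
meets_rto = outage <= rto_target
```

In this made-up timeline the data loss window fits within the RPO target while the outage exceeds the RTO target, which shows that the two objectives are evaluated independently.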

The MemoryDB durability model

MemoryDB is not just another Redis cache with persistence bolted on. Every mutating operation, whether a simple SET command or a complex sorted-set update, is written to a distributed transaction log before the primary node acknowledges the write back to the client. The transaction log is a Multi-AZ append-only log that durably records every write operation, which enables data reconstruction even after complete node loss. This commit-before-acknowledge design is what elevates MemoryDB from a cache to a durable primary database.
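The commit-before-acknowledge idea can be illustrated with a toy model. The sketch below is not MemoryDB's implementation: `ToyTransactionLog`, `ToyPrimary`, `rebuild_from_log`, and the AZ names are hypothetical, and the exact ordering of in-memory execution relative to the log append is simplified. The one property it is meant to show is that the client receives an acknowledgment only after the write is recorded in the log, so the data can be rebuilt after the primary is lost.

```python
class ToyTransactionLog:
    """Hypothetical stand-in for a Multi-AZ append-only log."""

    def __init__(self, az_names=("az-a", "az-b", "az-c")):
        # One independent copy of the log per (simulated) Availability Zone.
        self.copies = {az: [] for az in az_names}

    def append(self, entry):
        # Record the entry in every AZ copy before reporting success.
        for copy in self.copies.values():
            copy.append(entry)
        return True

    def entries(self, az):
        return list(self.copies[az])


class ToyPrimary:
    """Hypothetical primary node that acknowledges only after the log append."""

    def __init__(self, log):
        self.log = log
        self.memory = {}

    def set(self, key, value):
        # 1. Persist the operation to the log first.
        if not self.log.append(("SET", key, value)):
            raise RuntimeError("log append failed; write is not acknowledged")
        # 2. Apply the write to the in-memory dataset.
        self.memory[key] = value
        # 3. Only now acknowledge the client.
        return "OK"


def rebuild_from_log(log, az):
    """Reconstruct the dataset by replaying one AZ's copy of the log in order."""
    memory = {}
    for op, key, value in log.entries(az):
        if op == "SET":
            memory[key] = value
    return memory


log = ToyTransactionLog()
primary = ToyPrimary(log)
ack = primary.set("cart:42", "3 items")
primary.set("cart:42", "4 items")

del primary  # simulate complete loss of the primary node's memory
recovered = rebuild_from_log(log, "az-b")
```

Because every acknowledged write is already in the log, replaying any surviving copy yields the latest acknowledged value for each key, which is why node loss in this model does not imply data loss.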

How the write path works

When a client sends a write request, the primary node forwards the operation to the transaction log, ...