Cross-regional Level of Memcache

Learn about the challenges of scaling Memcache across data centers and how to solve them.

We'll cover the following...

Introduction to the cross-regional level
Overview of design problems at the cross-regional level
Writes from a primary region
Writes from a secondary region
Summary

Introduction to the cross-regional level

At the regional level, latency was not a major concern because latencies inside a data center are around one millisecond. At the cross-regional level, however, latencies can rise to around 100 milliseconds. As a result, and unlike in the previous layers, the CAP theorem comes into full effect, and we have to choose between availability and consistency.

Cross-regional replication brings many benefits to our system:

Firstly, it reduces latency by allowing clients to communicate with local Memcached and database servers.
Secondly, it can mitigate the effects of natural disasters like earthquakes or hurricanes.
Thirdly, placing data centers in vastly different geographical locations can bring economic benefits, such as cheaper electricity or land.

Replication at this level makes it challenging to keep the primary and secondary regions consistent.

Overview of design problems at the cross-regional level

The storage layer is fully replicated across data centers. We use a primary-secondary setup to replicate data at the storage layer. One might think that once data is available in a data center, the Memcache layer can trivially work, but more care is needed to deal with a few subtle data consistency issues.

When data centers are spread across the world, we must manage the replication lag between them. Two problems can occur while replication is in progress:

Writes from a primary region: An invalidation for a write made in the primary region can arrive at a secondary region before the updated data has been replicated to that secondary region. Such a scenario can happen when the storage layer replication is lagging behind the invalidation traffic from the front-end clusters. A read that misses the cache after such an invalidation refills it from the lagging local replica, so the stale value is cached again.
Writes from a secondary region: The other problem arises when a user in a secondary region updates data. All writes and updates must go to the primary storage, which then relays them back to the secondaries. Until that replication completes, a cache refill in the user's region reads from the lagging local replica and might show stale data. Such a scenario can be confusing for end users (for example, clients might see an item that they just deleted!).
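The first problem, an invalidation outrunning replication, can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the `Region` class, the `run` function, and the key and values are hypothetical names made up for illustration, each region is reduced to one dictionary for the database replica and one for the cache, and replication is modeled as a direct assignment.

```python
class Region:
    """Hypothetical model of one region: a database replica plus a cache."""

    def __init__(self, name):
        self.name = name
        self.db = {}     # local database replica
        self.cache = {}  # local Memcache layer

    def read(self, key):
        # Look-aside read: on a miss, refill from the local database replica.
        if key not in self.cache:
            self.cache[key] = self.db.get(key)
        return self.cache[key]

    def invalidate(self, key):
        # Delete the cached entry so the next read goes to the database.
        self.cache.pop(key, None)


def run(invalidate_before_replication):
    primary, secondary = Region("primary"), Region("secondary")
    for region in (primary, secondary):
        region.db["k"] = "v1"
    secondary.read("k")  # warm the secondary cache with v1

    # The write lands in the primary region.
    primary.db["k"] = "v2"
    primary.invalidate("k")

    if invalidate_before_replication:
        secondary.invalidate("k")   # invalidation arrives first
        seen = secondary.read("k")  # miss, refilled from the lagging replica
        secondary.db["k"] = "v2"    # replication catches up too late
    else:
        secondary.db["k"] = "v2"    # replication is applied first
        secondary.invalidate("k")   # invalidation follows the data
        seen = secondary.read("k")

    # Return what the reader saw, and what the cache serves afterwards.
    return seen, secondary.read("k")


if __name__ == "__main__":
    print(run(invalidate_before_replication=True))
    print(run(invalidate_before_replication=False))
```

In this model, when the invalidation arrives first, the stale value is not only returned once but also stays in the secondary cache after replication has caught up, because nothing invalidates it again. When the invalidation follows the replicated data, the reader sees the new value.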

Both problems stem from replication lag in the storage layer.
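The effect of storage lag on a write issued from a secondary region can be sketched in a few lines. This is a minimal illustration, not an actual Memcache or database API: the function names, the `photos` key, and the dictionary-based stores are all assumptions made for the example, and replication is modeled as an explicit function call so that the lag window is visible.

```python
def secondary_read(key, secondary_db, secondary_cache):
    # Look-aside read in the secondary region: refill from the local replica.
    if key not in secondary_cache:
        secondary_cache[key] = secondary_db.get(key)
    return secondary_cache[key]


def secondary_write(key, value, primary_db, secondary_cache):
    # All writes go to the primary storage; the local cache entry is deleted.
    primary_db[key] = value
    secondary_cache.pop(key, None)


def replicate(primary_db, secondary_db):
    # Hypothetical stand-in for storage replication finally catching up.
    secondary_db.update(primary_db)


def demo():
    primary_db = {"photos": ("a", "b")}
    secondary_db = dict(primary_db)  # replica starts in sync
    secondary_cache = {}

    secondary_read("photos", secondary_db, secondary_cache)  # warm the cache

    # A user in the secondary region deletes photo "b".
    secondary_write("photos", ("a",), primary_db, secondary_cache)

    # The user reloads before replication completes: the refill comes from
    # the lagging local replica, so the deleted photo reappears.
    stale = secondary_read("photos", secondary_db, secondary_cache)

    # Replication catches up, and the stale cache entry is invalidated.
    replicate(primary_db, secondary_db)
    secondary_cache.pop("photos", None)
    fresh = secondary_read("photos", secondary_db, secondary_cache)
    return stale, fresh


if __name__ == "__main__":
    print(demo())
```

Here the first read after the write still returns both photos, which is the "item that they just deleted" case. Only after the replica has caught up and the cache entry has been invalidated does the user see the updated list.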