Cross-regional Level of Memcache

Learn about the challenges and their solutions while scaling Memcache across data centers.

Introduction to the cross-regional level

At the regional level, latency was not a major concern because latencies inside a data center are around one millisecond. At the cross-regional level, however, latencies can rise to around 100 milliseconds. As a result, and unlike in the previous layers, the CAP theorem comes into full effect, and we have to choose between availability and consistency.

Cross-regional replication brings many benefits to our system:

  • Firstly, it reduces latency by allowing clients to communicate with local Memcached and database servers.

  • Secondly, it can mitigate the effects of natural disasters like earthquakes or hurricanes.

  • Thirdly, placing data centers in vastly different geographical locations can bring economic benefits, such as cheaper electricity or land.

Replication at this level makes it challenging to maintain consistency between the primary and secondary regions.

Overview of design problems at the cross-regional level

The storage layer is fully replicated across data centers. We use a primary-secondary setup to replicate data at the storage layer. One might think that once data is available in a data center, the Memcache layer works trivially. In practice, more care is needed to deal with a few subtle data consistency issues.

When data centers are spread across the world, we must manage the replication lag between them. Two problems can occur while replication is in progress:

  • Writes from a primary region: The first problem is that an invalidation originating from a write in the primary region can arrive at a secondary region before the updated data has been replicated to that region. A read that follows the invalidation then refills the cache from the secondary's database, which still holds the old value. Such a scenario can happen when the storage layer replication lags behind the invalidation traffic from the front-end clusters.

  • Writes from a secondary region: The second problem occurs when a user updates data while being served by a secondary region. All writes and updates must go to the primary storage, which then relays them back to the secondaries. If the user reads the data again before the change has been replicated, the cache is refilled from the local replica and shows stale data. Such a scenario can be confusing for end users (for example, clients might see an item that they just deleted!).

Both problems stem from the replication lag at the storage layer.
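The two problems described above can be reproduced with a small in-memory simulation. The following is a minimal sketch, not the actual Memcache implementation: the `Region` class, its methods, and the two scenario functions are hypothetical names introduced for illustration, and replication is modeled by manually copying a value into the secondary's database after the read has already happened.

```python
class Region:
    """Hypothetical model of one region: a database replica plus a look-aside cache."""

    def __init__(self, name):
        self.name = name
        self.db = {}      # local storage replica
        self.cache = {}   # local Memcache layer

    def read(self, key):
        # Look-aside read: on a cache miss, refill from the local database.
        if key not in self.cache:
            self.cache[key] = self.db.get(key)
        return self.cache[key]

    def invalidate(self, key):
        # Drop the cached entry so that the next read refills it.
        self.cache.pop(key, None)


def primary_write_race():
    """The invalidation reaches the secondary before the replicated data does."""
    primary, secondary = Region("primary"), Region("secondary")
    primary.db["k"] = secondary.db["k"] = "v1"
    secondary.read("k")                # secondary cache now holds "v1"

    primary.db["k"] = "v2"             # write lands in the primary region
    secondary.invalidate("k")          # invalidation overtakes replication
    stale = secondary.read("k")        # refill from the lagging replica: "v1"
    secondary.db["k"] = "v2"           # replication finally arrives
    # The cache still holds "v1" even though the replica now holds "v2".
    return stale, secondary.read("k"), secondary.db["k"]


def secondary_write_race():
    """A user in the secondary region deletes an item and then reads it again."""
    primary, secondary = Region("primary"), Region("secondary")
    primary.db["k"] = secondary.db["k"] = "v1"
    secondary.read("k")

    primary.db.pop("k")                # the delete is forwarded to primary storage
    secondary.invalidate("k")          # the local cache entry is dropped
    seen = secondary.read("k")         # refill from the lagging replica: "v1"
    secondary.db.pop("k")              # the replicated delete arrives too late
    return seen


if __name__ == "__main__":
    print(primary_write_race())
    print(secondary_write_race())
```

In both scenarios the stale value enters the cache only because the local replica has not yet caught up when the refill happens, which is why any fix has to account for the replication lag rather than the cache alone.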