Facebook: Optimized Datacenter Resource Allowance System

Introduction

A cluster manager runs on a set of nodes and manages a cluster. It works with cluster agents to handle the complete cluster, including placing and managing containers or virtual machines on servers. Allocating data center resources efficiently is a challenging task for cluster managers and has been studied for decades.

Public clouds and other large infrastructures have adopted various cluster managers, including open-source systems such as Kubernetes and proprietary systems such as Google’s Borg, Facebook’s Twine, and Microsoft’s Protean.

Capacity reservation allows us to reserve compute instances in advance so that they are available during critical events such as unscheduled maintenance, disaster recovery, or unexpected surges in workload.

The problem with recent approaches is that little is known about how to provide guaranteed capacity despite large-scale failures in data centers.

In this lesson, we describe how Facebook solved this problem for its on-premises infrastructure.

Challenges in providing guaranteed capacity

There are numerous challenges involved in providing guaranteed capacity. First, the cluster manager needs to consider independent and correlated failures across various components, including clusters, servers, racks, network switches, power rows, and cooling systems. Simply increasing the buffer capacity to cover every potential failure is expensive.

Second, the cluster manager needs to sustain the capacity guarantee despite ongoing infrastructure management events such as OS kernel upgrades, software updates, and hardware refreshes. Each event can cause a different amount of server capacity loss, so the cluster manager needs to bring in replacement servers promptly.

Third, workloads differ in nature, and a cluster may contain various kinds of hardware, which leads to hardware heterogeneity. Therefore, a cluster manager should provide capacity that satisfies both the workloads’ constraints and the hardware heterogeneity.

Lastly, there is an inherent tradeoff between the quality and the speed of resource allocation. If we optimize for speed, we might not be able to provide guarantees against large-scale failures. For example, fast container allocation might produce an unbalanced spread of containers across main switch boards (MSBs), the failure domains within a data center. As a result, a single MSB failure could be catastrophic for the reliability of the services.
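To make this tradeoff concrete, the sketch below compares two placements of the same 12 containers across three MSBs. The function name worst_case_loss, the MSB names, and the container counts are illustrative assumptions, not values from Facebook's system.

```python
from collections import Counter


def worst_case_loss(placement):
    """Return the fraction of containers lost if the most loaded MSB fails.

    `placement` holds one MSB name per container (a hypothetical format).
    """
    if not placement:
        raise ValueError("placement must contain at least one container")
    per_msb = Counter(placement)
    return max(per_msb.values()) / len(placement)


# A fast allocation that happened to pile most containers onto one MSB.
unbalanced = ["msb-a"] * 8 + ["msb-b"] * 3 + ["msb-c"] * 1
# A more careful allocation that spreads containers evenly.
balanced = ["msb-a"] * 4 + ["msb-b"] * 4 + ["msb-c"] * 4

print(worst_case_loss(unbalanced))  # two thirds of the containers
print(worst_case_loss(balanced))    # one third of the containers
```

With the unbalanced spread, losing one MSB takes out two thirds of the service. With the even spread, the same failure costs one third.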

Prior solutions

The most common approach to assigning servers to clusters is based on static scopes. For example, all servers in a data center may belong to one cluster. Servers may be added to or removed from a cluster, but these changes are often manually initiated.

The advantage of this static approach is that it reduces the number of candidate servers to be evaluated on the critical path of container placement. Hence, new containers can be deployed within a few seconds on existing servers within a cluster.

However, this approach has some drawbacks. First, the static assignment of servers to clusters lets some clusters run out of capacity while others are underutilized. Second, the allocation of servers may be suboptimal because workloads vary in power and network consumption and have different hardware requirements. Finally, service owners have to handle data-center-scale failures individually.
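The first drawback, stranded capacity, can be illustrated with a small calculation. The sketch below uses two hypothetical clusters with made-up server counts and demand, and it compares the unserved demand under static clusters with the unserved demand if all servers formed one pool.

```python
def unmet_demand_static(clusters):
    """Demand left unserved when each cluster may use only its own servers."""
    return sum(max(0, c["demand"] - c["servers"]) for c in clusters)


def unmet_demand_pooled(clusters):
    """Demand left unserved if all servers formed a single shared pool."""
    total_servers = sum(c["servers"] for c in clusters)
    total_demand = sum(c["demand"] for c in clusters)
    return max(0, total_demand - total_servers)


# Hypothetical numbers: cluster-1 is oversubscribed, cluster-2 is underutilized.
clusters = [
    {"name": "cluster-1", "servers": 100, "demand": 130},
    {"name": "cluster-2", "servers": 100, "demand": 60},
]

print(unmet_demand_static(clusters))  # 30
print(unmet_demand_pooled(clusters))  # 0
```

Under static scopes, cluster-1 is short by 30 servers even though cluster-2 has 40 idle ones. For simplicity, this calculation treats servers as interchangeable and ignores hardware and power constraints.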

The previous approach used by Facebook was a shared mega server pool that consists of all servers from data centers in a geographical region connected via a low-latency network. Twine arranges servers into logical clusters called entitlements. When a new container needs to be placed but cannot fit on any existing server in an entitlement, a free server is taken greedily from a shared region-level free-server pool and added to the entitlement to host the new container. The server is returned to the shared free-server pool when its last container is decommissioned.

On the one hand, the advantage of this approach is that a single server pool removes the server capacity stranded in many smaller physical clusters. On the other hand, it puts server-to-entitlement assignment across a whole region on the critical path of container placement. As a result, Facebook had to adopt simple techniques to allow quick server-assignment decisions, which could lead to suboptimal server assignment and could not provide guaranteed capacity in the event of correlated failures.

Both approaches have their strengths and their limitations. Ideally, a cluster manager should combine their advantages while avoiding their limitations.
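The entitlement mechanism described above can be sketched as follows. This is a minimal model, assuming a fixed number of container slots per server. The class name Entitlement, its methods, and the slot model are hypothetical and are not Twine's actual API.

```python
class Entitlement:
    """Hypothetical logical cluster that borrows servers from a shared pool."""

    def __init__(self, free_pool, slots_per_server=2):
        self.free_pool = free_pool  # shared region-level list of free servers
        self.slots_per_server = slots_per_server  # assumed capacity model
        self.servers = {}  # server name -> set of container names

    def place(self, container):
        # First try the servers that already belong to the entitlement.
        for name, containers in self.servers.items():
            if len(containers) < self.slots_per_server:
                containers.add(container)
                return name
        # Nothing fits: greedily take the next free server. This decision
        # happens on the critical path of container placement.
        if not self.free_pool:
            raise RuntimeError("region free-server pool is empty")
        name = self.free_pool.pop(0)
        self.servers[name] = {container}
        return name

    def remove(self, container):
        name = next((n for n, cs in self.servers.items() if container in cs), None)
        if name is None:
            raise KeyError(container)
        self.servers[name].remove(container)
        if not self.servers[name]:
            # Last container decommissioned: return the server to the pool.
            del self.servers[name]
            self.free_pool.append(name)


pool = ["s1", "s2", "s3"]
ent = Entitlement(pool)
placed = [ent.place(c) for c in ("c1", "c2", "c3")]
ent.remove("c3")
print(placed)  # ['s1', 's1', 's2']
print(pool)    # ['s3', 's2']
```

The greedy pop(0) takes whichever server comes first in the pool. It does not look at which failure domain the server belongs to, which is why this style of assignment cannot guarantee capacity under correlated failures.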

RAS solution by Facebook

This lesson describes Twine’s new server-allocation component, called the Resource Allowance System (RAS). RAS dynamically assigns servers to logical clusters called reservations. A reservation provides its workloads with a certain amount of guaranteed capacity that accounts for random and correlated failures, maintenance events, heterogeneous hardware resources, and compound workload requirements and characteristics.

RAS breaks resource allocation into the following two levels.

Assignment of servers to reservations off the critical path.
Placement of containers to servers within each reservation.

Through this approach, server-assignment constraints are removed from the latency-sensitive container-placement process. Instead, they are evaluated at reservation-creation time and maintained continuously afterward.

Furthermore, the two-level approach treats each reservation as a separate cluster, which enables multiple container allocators to run independently for better scalability. Finally, each reservation incorporates the buffer capacity required for handling large-scale failures and maintenance, which removes server-to-reservation assignment from the critical path of these operations.
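The two levels can be sketched in a few lines. The text above does not say how RAS computes its assignments, so the round-robin spread below is only a simple stand-in. The function names, the single-MSB-failure buffer rule, and the server names are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_servers(free_by_msb, needed):
    """Level 1, off the critical path: build a reservation with a failure buffer.

    Servers are picked round-robin across MSBs until the reservation would
    still hold `needed` servers after losing its largest MSB share.
    """
    pools = {msb: list(servers) for msb, servers in free_by_msb.items()}
    chosen = defaultdict(list)
    while True:
        total = sum(len(s) for s in chosen.values())
        largest = max((len(s) for s in chosen.values()), default=0)
        if total - largest >= needed:
            return dict(chosen)
        progressed = False
        for msb, pool in pools.items():
            if pool:
                chosen[msb].append(pool.pop(0))
                progressed = True
        if not progressed:
            raise RuntimeError("not enough free servers for the guarantee")


def place_container(reservation, used):
    """Level 2, on the critical path: use a free server already reserved."""
    for servers in reservation.values():
        for server in servers:
            if server not in used:
                used.add(server)
                return server
    raise RuntimeError("reservation is full")


free = {
    "msb-a": ["a1", "a2", "a3", "a4"],
    "msb-b": ["b1", "b2", "b3", "b4"],
    "msb-c": ["c1", "c2", "c3", "c4"],
}
reservation = assign_servers(free, needed=4)
used = set()
first = place_container(reservation, used)
print(reservation)
print(first)
```

In this example, the reservation holds six servers for a guarantee of four, so any single MSB can fail and four servers remain. Container placement never touches the region-wide pool. It only scans the servers already in the reservation.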

RAS has several benefits over the previous solutions. First, it eliminates the drawbacks of statically scoped clusters: capacity stranded in individual clusters and the burden on service owners to prepare for large-scale failures individually. RAS does so by dynamically assigning servers to reservations based on workload characteristics and underlying infrastructure changes, and by embedding and optimizing failure and maintenance buffers as part of reservations. Second, RAS eliminates the limitation of Twine’s previous approach, which assigned servers on the critical path of container placement, by assigning a reservation’s full capacity ahead of time. Container placement can therefore immediately use a free server already in the reservation. Finally, RAS provides the simple abstraction of workloads running on a reservation that offers guaranteed capacity and supports stacking. In addition, RAS handles random and correlated failures, data center maintenance, heterogeneous hardware, and other data center constraints and realities.

Resource management realities

Providing guaranteed capacity within a region raises various resource-allocation challenges because of the scale of capacity, the complexity of data centers, and varying workload characteristics.

Region layout

Facebook operates in many regions around the globe. The following figure shows the organization of a region. Each region consists of several data center buildings, which are connected via a high-bandwidth, low-latency network. As shown in the figure, each data center building is composed of failure domains called main switch boards (MSBs), which are designed to fail independently. An MSB is composed of tens of thousands of servers.
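The hierarchy described above (region, data center building, MSB, server) maps naturally onto nested records. The sketch below is one possible model. The class names, the failure_domain_of helper, and the sample names are hypothetical and are not taken from Facebook's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class MSB:
    """A failure domain: all of its servers may fail together."""
    name: str
    servers: list = field(default_factory=list)


@dataclass
class Datacenter:
    name: str
    msbs: list = field(default_factory=list)


@dataclass
class Region:
    name: str
    datacenters: list = field(default_factory=list)

    def failure_domain_of(self, server):
        """Return (datacenter name, MSB name) for a server, or None if unknown."""
        for dc in self.datacenters:
            for msb in dc.msbs:
                if server in msb.servers:
                    return dc.name, msb.name
        return None


# A tiny region with made-up names; real MSBs hold far more servers.
region = Region(
    "region-1",
    [
        Datacenter("dc-1", [MSB("msb-a", ["a1", "a2"]), MSB("msb-b", ["b1"])]),
        Datacenter("dc-2", [MSB("msb-c", ["c1", "c2"])]),
    ],
)

print(region.failure_domain_of("b1"))  # ('dc-1', 'msb-b')
print(region.failure_domain_of("zz"))  # None
```

Knowing each server's MSB is what lets an allocator reason about correlated failures, because two servers in the same MSB cannot be counted as independent capacity.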
