Lesson-04: RAS Evaluation
RAS has been in production on Facebook’s fleet for almost two years, and its regions are now nearly fully allocated. In this lesson, we evaluate RAS from several aspects, including the following:
- Performance and scalability
- Effect on correlated failure buffers
- Effect on the cross-datacenter networks
RAS performance and scalability
In the following sections, the allocation time of RAS in one production region is shown to assess performance. For scalability, the allocation time in several production regions is presented as a function of the number of assignment variables.
Allocation time distribution
The following figure shows the distribution of resource allocation time over a period of three months in a region composed of several hundred thousand servers.
The average allocation time is around 30 minutes. Measured against an SLO of one hour, the 95th percentile is 36.67 minutes and the 99th percentile is 40.84 minutes, so both fall well within the SLO. The distribution is fairly tight because only moderate hardware pool changes are incorporated between solves.
The allocation time is divided into two phases, and each phase is divided into the following four steps:
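To make the percentile comparison concrete, here is a minimal sketch of how a set of allocation times could be summarized and checked against a one-hour SLO. The sample values, the `percentile` and `summarize` helpers, and the nearest-rank method are assumptions made for illustration; they are not Facebook's data or the method used in the RAS evaluation.

```python
import math

SLO_MINUTES = 60.0  # one-hour SLO, as stated in the lesson


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples, slo=SLO_MINUTES):
    """Return the mean, tail percentiles, and whether the p99 meets the SLO."""
    p99 = percentile(samples, 99)
    return {
        "mean": sum(samples) / len(samples),
        "p95": percentile(samples, 95),
        "p99": p99,
        "within_slo": p99 <= slo,
    }


# Hypothetical allocation times in minutes (illustrative only).
allocation_minutes = [27.5, 28.0, 29.0, 29.5, 30.0, 30.5, 31.0, 32.0, 35.0, 41.0]
summary = summarize(allocation_minutes)
print(summary)
```

With only ten samples, the 95th and 99th percentiles both land on the largest value. A real three-month distribution would contain far more solves, so the two percentiles would separate as they do in the reported numbers.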
- RAS build step: This step builds the objectives and constraints required by RAS.
- Solver build step: This step identifies and builds the objectives and constraints based on the requirements and performs the symmetric-server optimization.
- Initial state step ...