Introduction

A cluster manager runs on a set of nodes and manages the cluster as a whole. It works with cluster agents to handle the complete cluster, including placing and managing containers or virtual machines on servers. The main challenge for a cluster manager is allocating data center resources efficiently. Capacity reservation lets us reserve compute instances in advance so that they are available during critical events such as unscheduled maintenance, disaster recovery, or the arrival of unusual workloads.

Recent approaches cannot dynamically guarantee capacity during critical events, especially large-scale failures.

This series of lessons describes how Facebook solved this problem for its on-premises infrastructure by introducing a novel system. We will study the architecture of that system in detail in the upcoming lessons.
