Lesson-01: Resource Management

Introduction

A cluster manager runs on a set of nodes and manages a cluster. It works with cluster agents to handle the complete cluster, including placing and managing containers or virtual machines on servers. A challenging task for cluster managers is allocating data center resources efficiently. Capacity reservation allows us to reserve computing instances in advance for use during critical events such as unscheduled maintenance, disaster recovery, or taking on an unusual workload.

Recent approaches are unable to provide guaranteed capacity dynamically during critical events, especially large-scale failures.

This series of lessons describes how Facebook solved this problem for its on-premises infrastructure by introducing a novel system. We will study the architecture of the proposed system in detail in upcoming lessons.

Challenges in providing guaranteed capacity

Providing guaranteed capacity involves several challenges, including the following:

The system needs to account for both independent and correlated failures across the various components of the data center. Simply increasing the standby capacity to cover every potential shortfall is prohibitively expensive.
The server manager should acquire replacement servers during routine infrastructure lifecycle events, such as OS kernel upgrades, software updates, hardware refreshes, and other physical maintenance, to avoid server capacity loss ...
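
To make the cost argument in the first challenge concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes a simplified model in which servers are grouped into failure domains that fail as a unit. The domain sizes and the function names (standby_servers, standby_overhead) are hypothetical and chosen for illustration. They are not figures or APIs from Facebook's system.

```python
def standby_servers(domain_sizes, simultaneous_failures):
    """Servers to hold in reserve so that capacity survives the worst case in
    which `simultaneous_failures` failure domains are lost at the same time."""
    if not 0 <= simultaneous_failures <= len(domain_sizes):
        raise ValueError("simultaneous_failures must be between 0 and the number of domains")
    # The worst case is losing the largest domains, so reserve for those.
    largest_first = sorted(domain_sizes, reverse=True)
    return sum(largest_first[:simultaneous_failures])


def standby_overhead(domain_sizes, simultaneous_failures):
    """Standby capacity as a fraction of the servers that carry the workload."""
    return standby_servers(domain_sizes, simultaneous_failures) / sum(domain_sizes)


# Hypothetical data center: 10 failure domains of 1,000 servers each.
domains = [1000] * 10

for k in (1, 3, 10):
    print(
        f"tolerate {k} domain(s): {standby_servers(domains, k)} standby servers "
        f"({standby_overhead(domains, k):.0%} overhead)"
    )
```

Under these assumptions, the reserve grows linearly with the number of correlated domain failures we want to survive, and covering every domain at once requires as many idle servers as working ones. This is why a fixed standby pool sized for all potential failures is not practical.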

