GFS Deep Dive for System Design

Understand the requirements that led to the development of GFS and learn about its architecture.

Problems with traditional file systems

Before Google File System (GFS), there were single-node file systems, network-attached storage (NAS) systems, and storage area networks (SAN). These file systems served well when they were designed. However, as the data storage and processing demands of large, distributed, data-intensive applications grew, these systems ran into the limitations discussed below.

A single-node file system is a system that runs on a single computer and manages the storage attached to it. A single server has limited resources: storage space, processing power, and the number of I/O operations its disks can perform per second. We can attach substantial storage capacity to a single server, increase the RAM, and upgrade the processor, but there are limits to this kind of vertical scaling. A single server is also limited in how many reads and writes it can serve and in how quickly data is stored and accessed. These limitations restrict the system's ability to process large datasets and serve a large number of clients simultaneously. We also can't ignore the fact that a single-node system is a single point of failure: when it fails, the system becomes unavailable to its users. For applications that process large datasets, the focus should be on high throughput (how much data moves through the system per unit of time, often measured in bits per second) rather than low latency (the delay in moving a piece of data between two endpoints, usually measured in milliseconds).

The network-attached storage (NAS) system consists of a file-level storage server with multiple clients connected to it over a network running the network file system (NFS) protocol. Clients can store and access files on this remote storage server as if they were local files. A NAS system has the same limitations as a single-node file system. Setting up and managing a NAS system is easy, but scaling it is expensive. It can also suffer from throughput issues when large files are accessed from a single server.

The storage area network (SAN) system consists of a cluster of commodity storage devices connected to each other, providing block-level data storage to clients over the network. SAN systems are easy to scale by adding more storage devices. However, they are difficult to manage because of the complexity of the second network they require: the Fibre Channel (FC) fabric. Setting up Fibre Channel requires dedicated host bus adapters (HBAs) on each host server, along with switches and specialized cabling. It is difficult to pinpoint where a failure has occurred in such a complex system, data inconsistency issues can appear among replicas, and rebalancing load across the storage devices can also be hard to handle with this architecture.

Note: SAN deployments are special-purpose networks separate from the usual Ethernet networks. This second network, while good for segregating storage traffic, adds substantial dollar cost.

Given the issues above with traditional networked file systems, Google developed a file system called GFS that mitigates these limitations while retaining the benefits these traditional systems provide. It supports the increasing workload of Google's applications and their data processing needs on commodity hardware.

Google File System (GFS) is a distributed file system that stores and serves large amounts of data for distributed, data-intensive applications on a storage cluster of commodity servers. The functional and non-functional requirements of GFS are listed below.

Functional requirements

Functional requirements include the following:

  • Data storage: The system should allow users to store their data on GFS.
  • Data retrieval: The system should be able to give data back to users when they request it.

Non-functional requirements

Non-functional requirements include the following:

  • Scalability: The system should be able to store an increasing amount of data (hundreds of terabytes and beyond), and handle a large number of clients concurrently.
  • Availability: A file system is one of the main building blocks of many large systems used for data storage and retrieval. The unavailability of such a system disrupts the operation of the systems that rely on it. Therefore, the file system should be able to respond to the client’s requests all the time.
  • Fault tolerance: The system’s availability and data integrity shouldn’t be compromised by component failures that are common in large clusters consisting of commodity hardware.
  • Durability: Once the system has acknowledged to users that their data has been stored, the data shouldn't be lost unless the users delete it themselves.
  • Easy operational management: The system should easily be able to store multi-GB files and beyond. It should be easy for the system to handle data replication, re-replication, garbage collection, taking snapshots, and other system-wide management activities. If some data becomes stale, there should be an easy mechanism to detect and fix it. The system should allow multiple independent tenants to use GFS for safe data storage and retrieval.
  • Performance optimization: The focus should be on high throughput rather than low latency for applications that process large datasets. Additionally, Google's applications, for which the system was built, most often append data to files instead of overwriting existing data, so the system should be optimized for append operations (a small sketch of this usage pattern follows this list). For example, a logging application appends new log entries to log files. A web crawler appends newly crawled data to a crawl file instead of overwriting existing copies of the data within it. MapReduce jobs write their output files from beginning to end by appending key-value pairs.
  • Relaxed consistency model: GFS does not need to comply with POSIX (the Portable Operating System Interface, a set of IEEE standards that defines APIs, command-line shells, and utilities so that software stays portable across operating system variants) because of the unique characteristics of the use cases and applications it targets. To be POSIX compatible, a file system must implement a strong consistency model, and random writes are among POSIX's fundamental operations. GFS workloads involve many more append operations and very few random writes, so GFS doesn't comply with POSIX and instead provides a relaxed consistency model optimized for append operations. Data consistency in a distributed setting is hard, and GFS deliberately opts for a custom consistency model for better scalability and performance.
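Since these workloads are dominated by appends, it helps to picture the client-side usage pattern. Below is a minimal, hypothetical sketch in Python; AppendOnlyFile and record_append are illustrative names, not the real GFS client interface (the actual system does expose a record-append operation, but with richer semantics than shown here).

```python
# Hypothetical append-oriented usage pattern (illustrative names, not the GFS API).
# A logging application only ever adds records to the end of a file; it never
# overwrites earlier bytes.

class AppendOnlyFile:
    """In-memory stand-in for a file that is only ever appended to."""

    def __init__(self):
        self.records = []

    def record_append(self, data: bytes) -> int:
        """Append one record and return the offset at which it was written."""
        offset = sum(len(r) for r in self.records)
        self.records.append(data)
        return offset


if __name__ == "__main__":
    log_file = AppendOnlyFile()
    for event in (b"GET /index", b"GET /about", b"POST /login"):
        print("appended at offset", log_file.record_append(event + b"\n"))
```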

Let’s explore GFS’s architecture, which enables it to fulfill the requirements mentioned above.

Architecture

A GFS cluster consists of two major types of components: a manager node and a large number of chunkservers. It stores a file's data in the form of chunks (a chunk is the unit of data storage in GFS). The architecture is shown in the following illustration.

  • The client is the GFS application programming interface (API) through which end users perform directory and file operations.

  • Each file is split into fixed-size chunks. The manager assigns each chunk a globally unique 64-bit ID and assigns the chunkservers where the chunk is stored and replicated. The manager is like an administrator that manages the file system metadata, including namespaces, file-to-chunk mappings, and chunk locations. The metadata is kept in the manager's memory for good performance. For a persistent record of the metadata, the manager logs metadata changes to an operation log on its hard disk, so that it can recover its state after a restart by replaying the operation log (a minimal sketch of these structures follows this list). Besides managing metadata, the manager also handles the following tasks:
    • Data replication and rebalancing
    • Operational locks to ensure data consistency
    • Garbage collection of the deleted data

    Note: Though Google calls this manager a “master” in their research paper on GFS, we will use the term “GFS manager”, “manager node”, or simply the “manager” to refer to the same thing.

  • Chunkservers are commodity storage servers that store chunks as plain Linux files.
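To make the manager's responsibilities more concrete, here is a minimal sketch of in-memory metadata paired with an append-only operation log. It is written in Python purely for illustration: the class names, the JSON log format, and the helper methods are assumptions, not Google's implementation. One detail follows the GFS paper: chunk locations are not written to the log; the manager re-learns them from the chunkservers.

```python
# Illustrative sketch of the manager's in-memory metadata and operation log.
# Names and the log format are assumptions made for this example.

import json
from dataclasses import dataclass, field


@dataclass
class FileMetadata:
    chunk_ids: list = field(default_factory=list)  # ordered 64-bit chunk IDs


class Manager:
    def __init__(self, log_path):
        self.namespace = {}        # file path -> FileMetadata
        self.chunk_locations = {}  # chunk ID  -> list of chunkserver addresses
        self.log_path = log_path

    def _log(self, record):
        # Persist the metadata change before applying it in memory so the
        # state can be rebuilt after a restart by replaying the log.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def create_file(self, path):
        self._log({"op": "create", "path": path})
        self.namespace[path] = FileMetadata()

    def add_chunk(self, path, chunk_id, servers):
        self._log({"op": "add_chunk", "path": path, "chunk_id": chunk_id})
        self.namespace[path].chunk_ids.append(chunk_id)
        # Chunk locations are kept only in memory; after a restart they are
        # re-learned from the chunkservers rather than replayed from the log.
        self.chunk_locations[chunk_id] = list(servers)

    def recover(self):
        # Rebuild the namespace and file-to-chunk mapping from the operation log.
        self.namespace.clear()
        with open(self.log_path) as f:
            for line in f:
                record = json.loads(line)
                if record["op"] == "create":
                    self.namespace[record["path"]] = FileMetadata()
                elif record["op"] == "add_chunk":
                    self.namespace[record["path"]].chunk_ids.append(record["chunk_id"])
```

Logging each change before applying it is what lets the manager keep all metadata in RAM for fast lookups while still surviving a restart.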

The client requests metadata, such as the location of the requested chunks, from the manager node. The manager looks up its metadata and responds to the client with the chunks' locations. The client then contacts the chunkservers directly for data operations. It is important to note that data doesn't flow through the manager; it flows directly between the client and the chunkservers.
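The interaction just described can be sketched as follows. This is an illustrative Python mock-up, not a real client library: the stub classes and method names are assumptions, the read is assumed to stay within a single chunk, and the 64 MB chunk size is the value used in the GFS paper.

```python
# Sketch of the GFS read path: metadata from the manager, data from a chunkserver.
# All classes and method names here are illustrative assumptions.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size, as in the GFS paper


class StubManager:
    """Stand-in for the manager: answers metadata lookups only."""

    def __init__(self, file_to_chunks, chunk_locations):
        self.file_to_chunks = file_to_chunks    # file path -> [chunk IDs]
        self.chunk_locations = chunk_locations  # chunk ID  -> [chunkserver addresses]

    def lookup(self, path, chunk_index):
        chunk_id = self.file_to_chunks[path][chunk_index]
        return chunk_id, self.chunk_locations[chunk_id]


class StubChunkserver:
    """Stand-in for a chunkserver: serves byte ranges of locally stored chunks."""

    def __init__(self, chunks):
        self.chunks = chunks  # chunk ID -> bytes

    def read_chunk(self, chunk_id, offset_in_chunk, length):
        return self.chunks[chunk_id][offset_in_chunk:offset_in_chunk + length]


def read(manager, chunkservers, path, offset, length):
    chunk_index = offset // CHUNK_SIZE                      # which chunk holds this offset
    chunk_id, replicas = manager.lookup(path, chunk_index)  # step 1: metadata from the manager
    server = chunkservers[replicas[0]]                      # pick one replica to read from
    return server.read_chunk(chunk_id, offset % CHUNK_SIZE, length)  # step 2: data bypasses the manager


if __name__ == "__main__":
    manager = StubManager({"/logs/web.log": [42]}, {42: ["cs-1"]})
    servers = {"cs-1": StubChunkserver({42: b"hello gfs"})}
    print(read(manager, servers, "/logs/web.log", offset=6, length=3))  # b'gfs'
```

Note that the manager is contacted only for metadata; the file bytes travel directly between the client and the chunkserver.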

Note: The largest GFS cluster can store up to tens of petabytes of data and can be accessed by hundreds of clients concurrently.

Points to ponder

1. How does a single manager suffice to handle requests from hundreds of clients?


Bird’s eye view

In the coming lessons, we will design and evaluate GFS. The following concept map is a quick summary of the problem GFS solves and its novelties.