...

Design Considerations of a Blob Store

Discover the essential system design considerations for building a reliable blob store. Implement data chunking, metadata management, and partitioning based on user accounts. Learn to use multi-level replication and indexing to ensure high availability and fast querying performance.

Introduction

The previous lesson outlined the major components of the blob store. This lesson examines implementation challenges, including storing large blobs, managing replicas, and optimizing retrieval latency. The following table summarizes the lesson objectives.

Summary of the Lesson

| Section | Purpose |
| --- | --- |
| Blob metadata | The metadata that's maintained to ensure efficient storage and retrieval of blobs. |
| Partitioning | How blobs are partitioned among different data nodes. |
| Blob indexing | How to search for blobs efficiently. |
| Pagination | How to retrieve a limited number of blobs at a time to improve readability and loading time. |
| Replication | How to replicate blobs, and how many copies to maintain to improve availability. |
| Garbage collection | How to delete blobs without sacrificing performance. |
| Streaming | How to stream large files chunk by chunk to facilitate interactivity for users. |
| Caching | How to improve response time and throughput. |

Abstraction layers hide internal complexity from users and guide design decisions for routing and sharding. There are three primary layers:

  1. User account: Identifies users via an account_ID. It contains all user containers.

  2. Container: Identifies a set of blobs via a container_ID.

  3. Blob: Identifies specific files via a blob_ID. This layer maintains metadata vital for system availability and reliability.

The table below summarizes these layers.

Layered Information

| Level | Uniquely identified by | Information | Sharded by | Mapping |
| --- | --- | --- | --- | --- |
| User's blob store account | account_ID | List of container_ID values | account_ID | Account -> list of containers |
| Container | container_ID | List of blob_ID values | container_ID | Container -> list of blobs |
| Blob | blob_ID | {list of chunks, chunkInfo: data node IDs, ...} | blob_ID | Blob -> list of chunks |

Note: We generate unique IDs for user accounts, containers, and blobs using a unique ID generator.
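The three mappings can be sketched as plain in-memory dictionaries. This is only an illustration under stated assumptions: the class name `BlobStoreIndex`, its methods, and the use of UUIDs as the unique ID generator are hypothetical, and a real blob store would shard each mapping across nodes instead of holding it in one process.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


def new_id() -> str:
    """Stand-in for the unique ID generator (assumption: a random UUID)."""
    return uuid.uuid4().hex


@dataclass
class BlobStoreIndex:
    """Hypothetical in-memory view of the three abstraction layers."""
    # Account -> list of containers
    containers_by_account: Dict[str, List[str]] = field(default_factory=dict)
    # Container -> list of blobs
    blobs_by_container: Dict[str, List[str]] = field(default_factory=dict)
    # Blob -> list of chunks
    chunks_by_blob: Dict[str, List[str]] = field(default_factory=dict)

    def create_account(self) -> str:
        account_id = new_id()
        self.containers_by_account[account_id] = []
        return account_id

    def create_container(self, account_id: str) -> str:
        container_id = new_id()
        self.containers_by_account[account_id].append(container_id)
        self.blobs_by_container[container_id] = []
        return container_id

    def create_blob(self, container_id: str, chunk_ids: List[str]) -> str:
        blob_id = new_id()
        self.blobs_by_container[container_id].append(blob_id)
        self.chunks_by_blob[blob_id] = list(chunk_ids)
        return blob_id

    def chunks_for(self, account_id: str, container_id: str, blob_id: str) -> List[str]:
        # Walk the layers in order so a blob is reachable only through its owner.
        if container_id not in self.containers_by_account.get(account_id, []):
            raise KeyError(f"container {container_id} is not in account {account_id}")
        if blob_id not in self.blobs_by_container.get(container_id, []):
            raise KeyError(f"blob {blob_id} is not in container {container_id}")
        return self.chunks_by_blob[blob_id]
```

Resolving a blob then means following account -> container -> blob -> chunks, which mirrors the "Mapping" column of the table.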

The system maintains metadata to manage storage. Let’s examine this data.

Blob metadata

When a user uploads a blob, it is split into small chunks. A chunk is the minimum unit of data for writing and reading. Chunking supports large files that exceed the capacity of a single contiguous disk block or data node. The manager node tracks chunk locations and assigns IDs to facilitate retrieval.
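As a rough sketch of the splitting step, the function below computes the byte range of each chunk for a given blob size. The function name `chunk_ranges` and the fixed chunk size are assumptions made for illustration; the lesson does not prescribe a particular chunk size or splitting routine.

```python
from typing import Iterator, Tuple

MB = 1024 * 1024


def chunk_ranges(blob_size: int, chunk_size: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (chunk_number, start_offset, end_offset) for each chunk.

    Offsets are in bytes and end_offset is exclusive. The last chunk may be
    smaller than chunk_size when the blob size is not an exact multiple.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk_number = 1
    for start in range(0, blob_size, chunk_size):
        yield chunk_number, start, min(start + chunk_size, blob_size)
        chunk_number += 1


# Illustrative sizes only: a 128 MB blob with an assumed 64 MB chunk size.
for number, start, end in chunk_ranges(128 * MB, 64 * MB):
    print(f"chunk {number}: bytes [{start}, {end})")
```

Working with offsets instead of copying the data keeps the sketch independent of how the bytes are actually read, for example from a file or a network stream.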

Metadata includes chunk IDs, assigned data nodes, and replica IDs. Replicating chunks ensures reliability in case of node failure.

For example, a 128 MB blob split into two 64 MB chunks would have the following metadata:

Blob Metadata

| Chunk | Datanode ID | Replica 1 ID | Replica 2 ID | Replica 3 ID |
| --- | --- | --- | --- | --- |
| 1 | d1b1 | r1b1 | r2b1 | r3b1 |
| 2 | d1b2 | r1b2 | r2b2 | r3b2 |
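The metadata above could be held by the manager node as one record per chunk. The sketch below is a minimal illustration: the `ChunkMetadata` type and the `pick_read_location` helper are hypothetical names, and the read policy (try the data node first, then the replicas in order) is an assumption, not something the lesson specifies.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class ChunkMetadata:
    chunk: int                    # chunk number within the blob
    data_node_id: str             # "Datanode ID" column
    replica_ids: Tuple[str, ...]  # "Replica 1..3 ID" columns


# The two rows of the example table for the 128 MB blob.
blob_metadata: List[ChunkMetadata] = [
    ChunkMetadata(1, "d1b1", ("r1b1", "r2b1", "r3b1")),
    ChunkMetadata(2, "d1b2", ("r1b2", "r2b2", "r3b2")),
]


def pick_read_location(meta: ChunkMetadata, unavailable: Set[str]) -> str:
    """Return the first location that is not marked unavailable.

    Assumed policy: prefer the data node, then fall back to replicas in order.
    """
    for location in (meta.data_node_id, *meta.replica_ids):
        if location not in unavailable:
            return location
    raise LookupError(f"no available copy of chunk {meta.chunk}")


# If d1b1 fails, chunk 1 can still be served from its first replica.
print(pick_read_location(blob_metadata[0], unavailable={"d1b1"}))
```

Keeping the replica IDs next to the data node ID lets a single metadata lookup answer both "where is this chunk?" and "where else can I read it if that node is down?".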

Note: The system should ...