A Pragmatic Introduction to Hadoop and MapReduce

Distributed file systems

Traditional data-processing techniques might not suffice to extract meaningful information from datasets at big-data scale.

Google recognized this problem early on and devised a potentially robust solution.

That’s why they created a distributed file system, called the Google File System (GFS), and published a paper describing it.

In a nutshell, GFS is a scalable distributed file system for distributed data-intensive applications. Google’s ultimate goal was to process massive datasets, index billions of web pages, and extract knowledge from them efficiently.

Another important requirement was that GFS should run on a cluster of commodity servers, that is, machines built from the same everyday hardware found in most standard computers.

That’s because specialized hardware is expensive and produced on demand. Given the size of the clusters Google needed, this was not an option: the cost of all that specialized hardware would wipe out any profits.

It is worth noting that, unlike most file systems, GFS uses a block size of 64 MB to increase throughput. Most everyday file systems use a block size of a few kilobytes.
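To get a feel for why the larger block size matters at this scale, consider how many block entries a master node must track for a single large file. The numbers below are a back-of-the-envelope illustration (the 1 TB file size is an assumption for the example, not a figure from the GFS paper):

```python
# Back-of-the-envelope sketch: how block size affects the amount of
# per-block metadata a master node must track for one large file.
# The 1 TB file size is a hypothetical assumption for illustration.

FILE_SIZE = 1 * 1024**4          # 1 TB file (assumed for the example)
SMALL_BLOCK = 4 * 1024           # 4 KB, typical desktop file system
LARGE_BLOCK = 64 * 1024**2       # 64 MB, the GFS block size

small_blocks = FILE_SIZE // SMALL_BLOCK
large_blocks = FILE_SIZE // LARGE_BLOCK

print(f"4 KB blocks:  {small_blocks:>12,} block entries")
print(f"64 MB blocks: {large_blocks:>12,} block entries")
# 4 KB blocks:   268,435,456 block entries
# 64 MB blocks:       16,384 block entries
```

With 64 MB blocks, the metadata for the same file shrinks by a factor of about 16,000, which keeps the bookkeeping small enough to hold in memory and lets clients stream large sequential reads with far fewer lookups.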

Of course, a file system alone is not enough. The data is of no use if we cannot extract value from it.

MapReduce
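As a preview of the programming model, here is a minimal word-count example written in plain Python. It simulates the map and reduce phases in a single process; a real Hadoop job distributes these steps across a cluster, and the function names here are illustrative, not part of the Hadoop API:

```python
from collections import defaultdict

# Minimal single-process simulation of the MapReduce word-count pattern.
# Function names are illustrative; this is not the Hadoop API.

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct key (word).
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The key design idea is that both phases operate on independent key-value pairs, so the map work can be split across many machines and the reduce work can be partitioned by key, with no shared state between workers.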
