A Pragmatic Introduction to Hadoop and MapReduce

Distributed file systems

Traditional data-processing techniques might not suffice to extract meaningful information from datasets at big-data scale.

Google recognized this problem early on and devised a potentially robust solution.

That’s why they created a distributed file system, called the Google File System (GFS), and published a paper describing it.

In a nutshell, GFS is a scalable distributed file system for distributed data-intensive applications. Google’s ultimate goal was to process massive datasets, index billions of web pages, and extract knowledge from them efficiently.

Another important requirement was that GFS should run on a cluster of commodity servers, that is, machines built from the same everyday hardware found in most standard computers.

That’s because specialized hardware is expensive and produced on demand. Given the size of the clusters Google needed, this was not an option: the cost of all that specialized hardware would wipe out any profits.

It is worth noting that, unlike most file systems, GFS uses a block size of 64 MB to increase throughput. Most everyday file systems use a block size of a few kilobytes.
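To get a feel for why the larger block size matters at this scale, consider how many block entries a master node must track for a single large file. The numbers below are a back-of-the-envelope illustration (the 1 TB file size is an assumption for the example, not a figure from the GFS paper):

```python
# Back-of-the-envelope sketch: how block size affects the amount of
# per-block metadata a master node must track for one large file.
# The 1 TB file size is a hypothetical assumption for illustration.

FILE_SIZE = 1 * 1024**4          # 1 TB file (assumed for the example)
SMALL_BLOCK = 4 * 1024           # 4 KB, typical desktop file system
LARGE_BLOCK = 64 * 1024**2       # 64 MB, the GFS block size

small_blocks = FILE_SIZE // SMALL_BLOCK
large_blocks = FILE_SIZE // LARGE_BLOCK

print(f"4 KB blocks:  {small_blocks:>12,} block entries")
print(f"64 MB blocks: {large_blocks:>12,} block entries")
# 4 KB blocks:   268,435,456 block entries
# 64 MB blocks:       16,384 block entries
```

With 64 MB blocks, the metadata for the same file shrinks by a factor of about 16,000, which keeps the bookkeeping small enough to hold in memory and lets clients stream large sequential reads with far fewer lookups.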

Of course, a file system alone is not enough. The data is of no use if we cannot extract value from it.

MapReduce
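As a preview of the programming model, here is a minimal word-count example written in plain Python. It simulates the map and reduce phases in a single process; a real Hadoop job distributes these steps across a cluster, and the function names here are illustrative, not part of the Hadoop API:

```python
from collections import defaultdict

# Minimal single-process simulation of the MapReduce word-count pattern.
# Function names are illustrative; this is not the Hadoop API.

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct key (word).
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The key design idea is that both phases operate on independent key-value pairs, so the map work can be split across many machines and the reduce work can be partitioned by key, with no shared state between workers.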
