Introduction

This lesson introduces the reader to Big Data and its origins.

Some of the most significant and impactful innovations of the last decade have been in the field of Big Data. In 2004, this area of study generated widespread interest in technical circles with the publication of a white paper on MapReduce by Google researchers. After that, Big Data gradually gained popularity as a paradigm, driven by an explosion in the volume of data produced and by corporations seeking to exploit the Internet of Things (IoT). This Google Trends graph shows interest in “Big Data” as a search term from 2004 to the present.

Like all new technologies, Big Data went from obscurity to popularity and then fell back to a plateau as the space matured. Along the way, startups such as Cloudera, Hortonworks, and MapR harbored lofty ambitions to upend traditional data processing and storage techniques, along with their peddlers such as Oracle. But in the last few years, reality set in: Big Data may not be the ultimate panacea for every imagined problem. Nevertheless, it is extremely useful in several facets of data problems, ranging from discovering new medical cures to detecting money laundering. As Big Data startups consolidated over time (like the merger of Cloudera and Hortonworks), the technology stack also standardized and matured. Increasingly, cloud vendors (AWS, Azure, GCP) offer cloud-native solutions for Big Data problems. This makes it extremely easy for smaller companies to leverage the benefits of Big Data without expensive on-premise deployments.

What is Hadoop?

Hadoop is the software manifestation of Big Data.

Hadoop is a reliable, distributed, and scalable platform for storing and analyzing vast amounts of data.

The lure of Hadoop is its ability to run on cheap commodity hardware, while its competitors may need expensive hardware to do the same job. Rather than relying on hardware to deliver high availability, Hadoop is designed to detect and handle failures at the application layer. In essence, Hadoop delivers a highly available service on top of a cluster of computers, each of which may be prone to failure. More importantly, Hadoop is open source: any company with a skilled team and enough determination can run and manage its own deployment of the Hadoop stack without paying a dime in license fees. Note that the term Hadoop is sometimes used to refer to a larger ecosystem of Apache projects, such as Ambari, HBase, HDFS, ZooKeeper, and others.

The name Hadoop by itself doesn’t mean much. Doug Cutting (aka the father of Hadoop), inspired by Google’s MapReduce paper, set out to create an open-source, distributed data processing platform. He named it after his son’s toy elephant, “Hadoop.” Check out this video for more details about how Hadoop got its name. Eventually, the project was donated to the Apache Software Foundation and sparked a number of other related projects. Together, these projects now make up an active Big Data ecosystem.

Note that Hadoop isn’t novel in what it does. Before the advent of Hadoop, the high-performance computing (HPC) and grid computing communities had been doing large-scale data processing for years, primarily using APIs such as the Message Passing Interface (MPI). Broadly, the HPC approach distributes the work across a cluster of machines that access a shared filesystem hosted on a storage area network (SAN). This setup works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access large data volumes (hundreds of gigabytes), because network bandwidth becomes the bottleneck and compute nodes sit idle. This is the inflection point where Hadoop shines.
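To make the MapReduce model mentioned above concrete, here is a minimal word-count sketch in plain Python. It only mimics, on a single machine, the map and reduce phases that Hadoop runs across a cluster; the function names and structure are illustrative assumptions, not Hadoop’s actual API.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    data = ["big data is big", "hadoop processes big data"]
    print(reduce_phase(map_phase(data)))
    # → {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

In a real Hadoop job, the map tasks run on the nodes that already hold the data (avoiding the SAN bottleneck described above), and the framework handles the shuffle and any task failures between the two phases.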

What to expect?

In this course, we explore various Big Data concepts and tools from a beginner’s perspective. The course is an ideal jumpstart for a career in Big Data. Using the in-browser Docker container and hands-on practice, hard-to-grasp concepts become easy to learn!