Running MapReduce End to End
This lesson demonstrates running a MapReduce job in a cluster.
We know how to run a MapReduce job using the code widget. In this lesson, we'll learn how to submit the job to a Hadoop cluster. For this purpose, we use a pseudo-distributed Hadoop cluster running in a Docker environment.
The following snippet lists the commands that run a MapReduce job from start to finish. Each command is explained later in the lesson; you may read the explanations first and then execute the commands in the terminal. At the end of the lesson, a video shows an execution run of these commands.
Exercise
# Click on the terminal below and execute the commands in order
/DataJek/startHadoop.sh
jps
hdfs dfs -copyFromLocal /DataJek/cars.data /
hdfs dfs -ls /
hadoop jar JarDependencies/MapReduceJarDependencies/MapReduce-1.0-SNAPSHOT.jar io.datajek.mapreduce.Driver /cars.data /MyJobResult
hdfs dfs -ls /MyJobResult
hdfs dfs -text /MyJobResult/part-r-00000
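The hadoop jar command above hands the packaged job to the cluster by running the main class of the jar, io.datajek.mapreduce.Driver. The course's actual driver code isn't reproduced here, so the following is only a minimal sketch of what such a driver typically looks like; the CarMapper and CarReducer classes and their word-count-style logic are hypothetical placeholders.

package io.datajek.mapreduce;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {

  // Hypothetical mapper: emits (firstField, 1) for every input line.
  public static class CarMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      outKey.set(value.toString().split(",")[0]);
      context.write(outKey, ONE);
    }
  }

  // Hypothetical reducer: sums the counts emitted for each key.
  public static class CarReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] and args[1] correspond to /cars.data and /MyJobResult
    // in the hadoop jar command shown above.
    Job job = Job.getInstance(new Configuration(), "car count");
    job.setJarByClass(Driver.class);
    job.setMapperClass(CarMapper.class);
    job.setReducerClass(CarReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Block until the job finishes; a non-zero exit code signals failure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}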
Explanation
- Copy and execute the following command in the terminal:

  /DataJek/startHadoop.sh

  This script boots the single-node cluster; typically, a start-up script like this launches the HDFS and YARN daemons.
- Once the script finishes running, execute the following command:

  jps
The jps command lists all the running Java processes. If you see the following six processes running, then the pseudo-distributed Hadoop cluster is working correctly:
- NameNode
- DataNode
- NodeManager
- ResourceManager
- SecondaryNameNode
- Jps (the jps tool reports itself, making six in total)
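For reference, a healthy run of jps prints a listing along these lines; each line is a process ID followed by the process name, and the IDs shown here are illustrative and will differ on your machine:

2130 NameNode
2262 DataNode
2437 SecondaryNameNode
2598 ResourceManager
2722 NodeManager
3045 Jps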