Running MapReduce End to End
This lesson demonstrates running a MapReduce job in a cluster.
We know how to run a MapReduce job using the code widget. In this lesson, we'll learn how to submit the job to a Hadoop cluster. For this purpose, we use a pseudo-distributed Hadoop cluster running in a Docker environment.
The following snippet lists the commands that run a MapReduce job from start to finish. Each command is explained later in the lesson; you may read the explanations first and then execute the commands in the terminal. At the end of the lesson, a video shows an execution run of these commands.
Exercise
# Click on the terminal below and execute the commands in order
/DataJek/startHadoop.sh
jps
hdfs dfs -copyFromLocal /DataJek/cars.data /
hdfs dfs -ls /
hadoop jar JarDependencies/MapReduceJarDependencies/MapReduce-1.0-SNAPSHOT.jar io.datajek.mapreduce.Driver /cars.data /MyJobResult
hdfs dfs -ls /MyJobResult
hdfs dfs -text /MyJobResult/part-r-00000
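The hadoop jar command above hands the packaged job to the cluster by running the main class of the jar, io.datajek.mapreduce.Driver. The course's actual driver code isn't reproduced here, so the following is only a minimal sketch of what such a driver typically looks like; the CarMapper and CarReducer classes and their word-count-style logic are hypothetical placeholders.

package io.datajek.mapreduce;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {

  // Hypothetical mapper: emits (firstField, 1) for every input line.
  public static class CarMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      outKey.set(value.toString().split(",")[0]);
      context.write(outKey, ONE);
    }
  }

  // Hypothetical reducer: sums the counts emitted for each key.
  public static class CarReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] and args[1] correspond to /cars.data and /MyJobResult
    // in the hadoop jar command shown above.
    Job job = Job.getInstance(new Configuration(), "car count");
    job.setJarByClass(Driver.class);
    job.setMapperClass(CarMapper.class);
    job.setReducerClass(CarReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Block until the job finishes; a non-zero exit code signals failure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}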
Explanation
- Copy and execute the following command in the terminal:

  /DataJek/startHadoop.sh

  This script boots the single-node cluster; typically, a start-up script like this launches the HDFS and YARN daemons.
- Once the script finishes running, execute the following command:

  jps
The jps command lists all the running Java processes. If you see the following six processes running, then the pseudo-distributed Hadoop cluster is working correctly:
- NameNode
- DataNode
- NodeManager
- ResourceManager
- SecondaryNameNode
- Jps (the jps tool reports itself, making six in total)
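For reference, a healthy run of jps prints a listing along these lines; each line is a process ID followed by the process name, and the IDs shown here are illustrative and will differ on your machine:

2130 NameNode
2262 DataNode
2437 SecondaryNameNode
2598 ResourceManager
2722 NodeManager
3045 Jps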