Testing MapReduce Program

In this lesson, we'll demonstrate how to test a MapReduce program on a local machine.

Testing MapReduce

So far, we have learned how to write mapper and reducer classes and their corresponding unit tests. But ideally, we want to test our MapReduce job end to end. There are different ways to run a MapReduce job:

  • Using the ToolRunner class to run a MapReduce job on a local machine. The MapReduce job must implement the interface Tool. This doesn’t require any running Hadoop daemons.

  • Set up a Hadoop cluster on the local machine in pseudo-distributed mode and then submit the job to that cluster.

  • Submit the MapReduce job to an actual cluster consisting of many machines.
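For the second option, the submission flow looks roughly as follows. This is a hedged sketch, assuming a configured pseudo-distributed install on the `PATH`, and hypothetical file names (`cars.txt`, `car-counter.jar`):

```shell
# Start the HDFS and YARN daemons of the pseudo-distributed cluster
start-dfs.sh
start-yarn.sh

# Copy the input data into HDFS
hdfs dfs -mkdir -p /user/$USER/input
hdfs dfs -put cars.txt /user/$USER/input

# Submit the packaged MapReduce job to the cluster
hadoop jar car-counter.jar CarCounterMrProgram /user/$USER/input /user/$USER/output
```

The third option uses the same `hadoop jar` command; only the cluster the client's configuration points at changes.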

We’ll start with the first option by writing a program that implements the Tool interface. Any class implementing Tool must also implement the Configurable interface, which Tool extends. The easiest way is to derive our MapReduce job from Hadoop’s helper class Configured, which already implements Configurable. If we name our MapReduce job CarCounterMrProgram, the class signature looks as follows:

public class CarCounterMrProgram extends Configured implements Tool {
    // ... class body
}
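A fuller sketch of this driver is shown below. It assumes the mapper and reducer classes from the previous sections are named CarCountMapper and CarCountReducer (hypothetical names here), and that input and output paths arrive as command-line arguments:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CarCounterMrProgram extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Run the job in-process against the local filesystem;
        // no Hadoop daemons are required.
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "car-counter");
        job.setJarByClass(CarCounterMrProgram.class);
        job.setMapperClass(CarCountMapper.class);    // mapper from the previous sections
        job.setReducerClass(CarCountReducer.class);  // reducer from the previous sections
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips generic Hadoop options (e.g. -D key=value)
        // before passing the remaining arguments to run().
        System.exit(ToolRunner.run(new CarCounterMrProgram(), args));
    }
}
```

Because ToolRunner handles generic options, settings such as `mapreduce.framework.name` could also be supplied at launch time with `-D` instead of being hard-coded in `run()`.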

The input to the program lives on the local disk. We create an object of type Configuration, point the job at that local input, and set mapreduce.framework.name to local so the job runs in-process. These changes are shown on lines 27 and 28 of the class CarCounterMrProgram in the code widget below. The class CarCounterMrProgram also represents our MapReduce job. We carry over the mapper and reducer classes created in the previous sections without any changes. We also create a class CarMRInputGenerator to generate random data. Read the comments in the code widget below and examine the various classes. Unfortunately, the code is not runnable here because of hostname-resolution issues in the VM used for testing the MapReduce program.
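The lesson's CarMRInputGenerator class is not reproduced here, but a minimal stand-in that produces the same kind of input, one random car make per line, might look like the sketch below. The class name, method names, and the list of makes are all assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the lesson's CarMRInputGenerator:
// emits one random car make per line, the shape of input the
// mapper in this lesson consumes.
public class CarInputGenerator {

    private static final String[] MAKES = {"Honda", "Toyota", "Ford", "BMW", "Tesla"};

    // Produce the requested number of input lines; a fixed seed
    // keeps the output reproducible across test runs.
    public static List<String> generate(int lines, long seed) {
        Random random = new Random(seed);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines; i++) {
            out.add(MAKES[random.nextInt(MAKES.length)]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Print the generated lines; in the lesson they would be
        // written to a file on the local disk for the job to read.
        generate(10, 42L).forEach(System.out::println);
    }
}
```

Writing the generated lines to a local file and passing that path as the job's input is enough to exercise the whole pipeline end to end in local mode.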
