Distcp

This lesson talks about the distributed copy tool.

Distcp

The distributed copy tool, also known as Distcp, is one of Hadoop's most important tools. It is commonly used in the industry for moving data around, and it is an example of a problem that MapReduce can solve. Distcp copies files or directories in parallel, either within a single Hadoop cluster or between two Hadoop clusters. It is implemented as a MapReduce job with no reduce phase: the mappers run in parallel across the cluster to perform the copy, which takes less time than copying the same data sequentially. Each file is copied by exactly one map task, so the smallest unit of work for Distcp is a file.

If the number of mappers is set to one, the lone mapper writes the first replica of every block of every file to the node it is running on, while the second and third replicas are spread across the cluster. This creates an imbalance: the node running the map task receives a replica of all the copied data and keeps accumulating it until its disk fills up. Running the copy with more mappers spreads the first replicas over more nodes.
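As a rough illustration of how such a copy might be launched, the sketch below assembles a `hadoop distcp` command line in Python. The `-m` option caps the number of map tasks. The helper name `build_distcp_command`, the NameNode addresses, and the paths are hypothetical and exist only for this example; the command is built and printed, not executed.

```python
import shlex


def build_distcp_command(source, target, num_maps=None):
    """Return the argument list for a `hadoop distcp` invocation.

    Illustrative helper (not part of Hadoop): it only assembles the
    command and does not run it.
    """
    cmd = ["hadoop", "distcp"]
    if num_maps is not None:
        if num_maps < 1:
            raise ValueError("num_maps must be at least 1")
        # -m sets the maximum number of map tasks used for the copy.
        cmd += ["-m", str(num_maps)]
    cmd += [source, target]
    return cmd


# Hypothetical NameNode addresses and paths, for illustration only.
inter_cluster = build_distcp_command(
    "hdfs://nn1:8020/data/logs",
    "hdfs://nn2:8020/backup/logs",
    num_maps=20,
)
print(shlex.join(inter_cluster))
```

On a machine with a configured Hadoop client, the resulting list could be passed to `subprocess.run(inter_cluster, check=True)` to start the job.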

By default, Distcp assigns each map task a fixed set of files, chosen so that every map task copies roughly the same number of bytes. However, because each file is copied by a single map task, one file cannot be split up among several mappers, so the load on the map tasks may not be exactly even. As an alternative, you can use the dynamic strategy, in which the files are divided into buckets, with more buckets than mappers. Each bucket is processed by one map task, and a map task that finishes its bucket early picks up one of the remaining buckets. The strategy is specified as a command-line argument (-strategy dynamic).
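The difference between the two strategies can be sketched with a small simulation. This is a simplified model, not Distcp's actual implementation: file sizes are plain numbers, each mapper has an assumed copy speed in bytes per second, and the function names are invented for this example.

```python
import heapq


def uniform_size_splits(file_sizes, num_maps):
    """Static assignment: walk the listing in order and close a split
    once it reaches the per-map byte target (total bytes / num_maps)."""
    target = sum(file_sizes) / num_maps
    splits, current, current_bytes = [], [], 0
    for size in file_sizes:
        current.append(size)  # a file is never divided between splits
        current_bytes += size
        if current_bytes >= target and len(splits) < num_maps - 1:
            splits.append(current)
            current, current_bytes = [], 0
    if current:
        splits.append(current)
    return splits


def static_finish_time(splits, speeds):
    """Each mapper copies only its own split; the slowest one decides."""
    return max(sum(split) / speed for split, speed in zip(splits, speeds))


def dynamic_finish_time(file_sizes, files_per_bucket, speeds):
    """Dynamic assignment: whichever mapper is free takes the next bucket."""
    buckets = [file_sizes[i:i + files_per_bucket]
               for i in range(0, len(file_sizes), files_per_bucket)]
    free_at = [(0.0, m) for m in range(len(speeds))]  # (time free, mapper id)
    heapq.heapify(free_at)
    copied = [0] * len(speeds)
    for bucket in buckets:
        t, m = heapq.heappop(free_at)  # earliest available mapper
        nbytes = sum(bucket)
        copied[m] += nbytes
        heapq.heappush(free_at, (t + nbytes / speeds[m], m))
    return copied, max(t for t, _ in free_at)


# One large file cannot be split, so the static byte counts are uneven.
skewed = uniform_size_splits([900] + [100] * 7, num_maps=4)
skewed_bytes = [sum(split) for split in skewed]
print(skewed_bytes)

# Eight equal files, two mappers, the second one four times slower (assumed).
sizes, speeds = [100] * 8, [100, 25]
static_time = static_finish_time(uniform_size_splits(sizes, 2), speeds)
copied, dynamic_time = dynamic_finish_time(sizes, 1, speeds)
print(static_time, dynamic_time, copied)
```

In the second scenario the static split gives both mappers the same number of bytes, so the slow mapper determines the finish time. With buckets, the faster mapper ends up copying most of the files and the job finishes sooner under these assumed speeds.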
