Distcp

The Distributed Copy tool, also known as Distcp, is one of Hadoop's important tools. Commonly used in the industry for moving data around, it is an example of a problem that MapReduce can solve. Distcp copies files in parallel, either within a single Hadoop cluster or between two Hadoop clusters. It can copy individual files or whole directories. Distcp is implemented as a MapReduce job with no reduce phase: the mappers run in parallel across the cluster to perform the copy, which takes less time than copying the same data sequentially. Each file is copied by one map task, so the smallest unit of work for Distcp is a file. If the number of ...
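The file-per-task idea can be illustrated with a small local analogy. The sketch below is not Hadoop code and not how Distcp is implemented; it only assumes a local directory tree and a thread pool standing in for the mappers. The names `list_files`, `copy_one`, and `parallel_copy` are hypothetical and chosen for illustration.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def list_files(src_root: Path) -> List[Path]:
    """Return every regular file under src_root, as paths relative to it."""
    return sorted(p.relative_to(src_root) for p in src_root.rglob("*") if p.is_file())


def copy_one(src_root: Path, dst_root: Path, rel: Path) -> Path:
    """Copy a single file; this stands in for one unit of map work."""
    target = dst_root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_root / rel, target)
    return rel


def parallel_copy(src_root: Path, dst_root: Path, workers: int = 4) -> List[Path]:
    """Copy a directory tree file by file using a pool of workers.

    A file is never split between workers, and there is no combining
    step afterwards, loosely mirroring a job with no reduce phase.
    """
    files = list_files(src_root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda rel: copy_one(src_root, dst_root, rel), files))
```

Calling `parallel_copy(Path("src"), Path("dst"), workers=8)` would copy the tree with up to eight files in flight at once. Because the file is the unit of work, a single very large file still occupies one worker for its whole duration. On a real cluster the tool itself is run from the command line, in its basic form as `hadoop distcp <source> <destination>`.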
