Introduction

Pig is a language for parallel data processing. A useful analogy is the relationship between higher-level programming languages and assembly language. A program written in assembly language can be just as good as one written in a language such as Java or C++, but writing it takes far more effort. Pig serves the same purpose as a higher-level programming language: it provides an abstraction over MapReduce and other frameworks so that data analysis jobs are easy to express. The MapReduce paradigm involves writing a map function followed by a reduce function. This can be challenging for a programmer to implement in complex workflows such as joins. Pig makes a join easy to express by hiding the underlying MapReduce complexity from the user.
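To make the point about joins concrete, here is a minimal sketch of a reduce-side inner join written as map, shuffle, and reduce steps. It is an in-memory simulation in plain Python, not Hadoop code; the `users` and `orders` datasets and every function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical input records, for illustration only.
users = [(1, "ana"), (2, "bo")]                           # (user_id, name)
orders = [(100, 1, 9.5), (101, 1, 3.0), (102, 3, 7.25)]   # (order_id, user_id, amount)

def map_users(record):
    """Map step: emit each user keyed by user_id, tagged with its source."""
    user_id, name = record
    yield user_id, ("user", name)

def map_orders(record):
    """Map step: emit each order keyed by user_id, tagged with its source."""
    order_id, user_id, amount = record
    yield user_id, ("order", (order_id, amount))

def shuffle(pairs):
    """Group all mapped values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(key, values):
    """Reduce step: pair every user with every order that shares the key."""
    names = [v for tag, v in values if tag == "user"]
    user_orders = [v for tag, v in values if tag == "order"]
    for name in names:
        for order_id, amount in user_orders:
            yield key, name, order_id, amount

def run_join(users, orders):
    """Run map, shuffle, and reduce in sequence and collect the joined rows."""
    pairs = []
    for record in users:
        pairs.extend(map_users(record))
    for record in orders:
        pairs.extend(map_orders(record))
    groups = shuffle(pairs)
    result = []
    for key in sorted(groups):
        result.extend(reduce_join(key, groups[key]))
    return result

joined = run_join(users, orders)
```

Even in this simplified form, the programmer has to tag records by source, group them by key, and pair them by hand. In Pig Latin, assuming relations named `users` and `orders` have already been loaded, the same inner join is a single statement: `joined = JOIN users BY user_id, orders BY user_id;`.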

Pig’s language layer consists of a textual language called Pig Latin, which is used to express dataflows. Pig’s infrastructure layer is the environment in which Pig Latin programs are executed. It consists of a compiler that produces sequences of jobs for an execution engine such as MapReduce, Spark, or Tez. Pig is not tied to a particular parallel framework, but it was first implemented on Hadoop. Originally developed at Yahoo Research in 2006, it offered researchers an ad hoc way of creating and executing MapReduce jobs on large data ...
