Introduction

Pig is a language for parallel data processing. One analogy is the existence of higher-level programming languages built indirectly on top of assembly-language. Programs can be of equal quality when written in an assembly language as they can be in languages or Java, C++, etc. However, the former requires a great effort. Pig’s purpose is like the purpose of higher level programming languages; i.e. it provides an abstraction over MapReduce and other frameworks for easily expressing data analysis jobs. MapReduce paradigm involves writing a map function followed by a reduce. This can be challenging to implement as a programmer when working with complex workflows such as joins. Pig makes it easy to express a join for a user and by hiding underlying MapReduce complexity from the user.

Pig’s language layer consists of a textual language called Pig Latin used to express dataflows. Pig’s infrastructure layer refers to the environment where Pig Latin programs are executed. It consists of a compiler that produces sequences of Map-Reduce programs run on an execution engine like MapReduce, Spark, or Tez. Pig is not tied to a particular parallel framework, but was first implemented on Hadoop. Originally developed at Yahoo Research in 2006, it offered for researchers an ad-hoc way of creating and executing MapReduce jobs on large data sets. In 2007, it moved into the Apache Software Foundation. The The unconventional name came from the guiding principles of the project, which have a lot in common with pigs. Pig is fast, easy-to-use, compatible with different compute engines, and works with structured or unstructured data.

Execution modes

Pig has the following execution modes:

  • local
  • MapReduce
  • Spark
  • Tez

Get hands-on with 1200+ tech skills courses.